Binary arithmetic

Introduction

An important use of logic circuits is to compute various mathematical operations such as addition, multiplication, trigonometric functions, etc.

We must therefore have a way of representing numbers as binary data.

Nonnegative integers

The easiest numbers to represent are the nonnegative integers. To see how this can be done, recall how we represent numbers in the decimal system. A number such as 2034 is interpreted as:
 2*10^3 + 0*10^2 + 3*10^1 + 4*10^0
But there is nothing special with the base 10, so we can just as well use base 2. In base 2, each digit value is either 0 or 1, which we can represent for instance by false and true, respectively.

In fact, we have already hinted at this possibility, since we usually write 0 and 1 instead of false and true.
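As an illustration, here is a small Python sketch (the function names are my own) that converts a nonnegative integer to its list of binary digits by repeated division by 2, and converts such a list back using the same positional rule as in base ten.

    def to_binary(n):
        # Return the binary digits of a nonnegative integer n, most significant first.
        if n == 0:
            return [0]
        digits = []
        while n > 0:
            digits.append(n % 2)    # the remainder is the next binary digit
            n //= 2
        return digits[::-1]         # reverse so the most significant digit comes first

    def from_binary(digits):
        # Interpret a list of binary digits (most significant first) as an integer.
        value = 0
        for d in digits:
            value = value * 2 + d   # same positional rule as in base ten
        return value

    print(to_binary(11))              # [1, 0, 1, 1]
    print(from_binary([1, 0, 0, 1]))  # 9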

All the normal algorithms for decimal arithmetic have versions for binary arithmetic, and they are usually simpler.

For adding two numbers, it suffices to notice that a position produces a carry of 1 whenever at least two of the three digits involved (the two operand digits and the incoming carry) are 1:

     1   1 1
     -   - -
       1 0 1 1
   +   1 0 0 1
   -----------
     1 0 1 0 0
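The carry rule translates directly into code. Below is a minimal Python sketch (the function name is my own) that adds two bit lists digit by digit, rightmost bit first, exactly as in the worked example above.

    def add_bits(a, b):
        # Add two binary numbers given as lists of bits, most significant bit first.
        width = max(len(a), len(b))
        a = [0] * (width - len(a)) + a      # pad the shorter number with leading zeros
        b = [0] * (width - len(b)) + b
        result = []
        carry = 0
        for x, y in zip(reversed(a), reversed(b)):   # rightmost position first
            s = x + y + carry
            result.append(s % 2)            # the bit written in this position
            carry = s // 2                  # the carry into the next position
        if carry:
            result.append(carry)            # a final carry extends the result
        return result[::-1]

    print(add_bits([1, 0, 1, 1], [1, 0, 0, 1]))   # [1, 0, 1, 0, 0]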
Subtraction is no harder:

         10 10
         -- --
       1  0  0  1
   -      1  1  0
   --------------
       0  0  1  1
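A corresponding sketch for subtraction, again on bit lists, assuming the first operand is at least as large as the second (the function name is my own):

    def sub_bits(a, b):
        # Subtract b from a (a >= b); both are bit lists, most significant bit first.
        width = max(len(a), len(b))
        a = [0] * (width - len(a)) + a
        b = [0] * (width - len(b)) + b
        result = []
        borrow = 0
        for x, y in zip(reversed(a), reversed(b)):   # rightmost position first
            d = x - y - borrow
            if d < 0:
                d += 2            # borrow 10 (binary) from the next position
                borrow = 1
            else:
                borrow = 0
            result.append(d)
        return result[::-1]

    print(sub_bits([1, 0, 0, 1], [1, 1, 0]))   # [0, 0, 1, 1]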
For multiplying two numbers, the algorithm is simpler, since we only multiply by 0 (which gives all zeros) or by 1 (which gives the original number):

         1 1 0 1
     *     1 0 1
     -----------
         1 1 0 1
       0 0 0 0
     1 1 0 1
   --------------
   1 0 0 0 0 0 1
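In code, the same shift-and-add idea looks as follows; this sketch reuses the add_bits helper from the addition example above, and the partial products are exactly the shifted rows of the worked example.

    def mul_bits(a, b):
        # Multiply two bit lists (most significant bit first) by shift-and-add.
        total = [0]
        for i, bit in enumerate(reversed(b)):     # each digit of the multiplier
            if bit == 1:
                partial = a + [0] * i             # a shifted left by i positions
                total = add_bits(total, partial)  # accumulate the partial product
        return total

    print(mul_bits([1, 1, 0, 1], [1, 0, 1]))      # [1, 0, 0, 0, 0, 0, 1]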
Finally, division is just repeated subtraction as in decimal arithmetic:

          0 1 1 0
         --------
    1 0 | 1 1 0 1
    ----  1 0
          ---
            1 0
            1 0
            ---
              0 1
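The long division above can be sketched as follows: at each step we bring down one more bit of the dividend and subtract the divisor whenever it fits (the function name is my own).

    def div_bits(a, b):
        # Divide bit list a by bit list b (most significant bit first).
        # Returns the quotient as a bit list and the remainder as an integer.
        divisor = 0
        for d in b:
            divisor = divisor * 2 + d         # the value of the divisor
        quotient = []
        remainder = 0
        for bit in a:                         # bring down one bit at a time
            remainder = remainder * 2 + bit
            if remainder >= divisor:          # the repeated-subtraction step
                remainder -= divisor
                quotient.append(1)
            else:
                quotient.append(0)
        return quotient, remainder

    print(div_bits([1, 1, 0, 1], [1, 0]))     # ([0, 1, 1, 0], 1)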

Negative integers

Things are easy as long as we stick to nonnegative integers. They become more complicated when we want to represent negative integers as well.

It may seem that we can do just what we do in decimal representation, i.e., indicate with a special sign whether the number is negative or not. In binary arithmetic, we could simply reserve one bit to determine the sign. In the circuitry for addition, we would have one circuit for adding two numbers, and another for subtracting two numbers. The combination of the signs of the two inputs would determine which circuit to use on the absolute values, as well as the sign of the output.

While this method works, it turns out that there is another one that is much easier for electronic circuits to deal with. This method is called the "two's complement" method. With this method, we do not need a special circuit for subtracting two numbers.

In order to explain this method, we first show how it would work in decimal arithmetic with infinite precision. Then we show how it works with binary arithmetic, and finally how it works with finite precision.

Infinite-precision ten's complement

Imagine the odometer of an automobile. It has a certain number of wheels, each with the ten digits on it. When one wheel goes from 9 to 0, the wheel immediately to its left advances by one position. If that wheel already showed 9, it too goes to 0 and advances the wheel to its left, etc. Suppose we run the car backwards. Then the reverse happens, i.e., when a wheel goes from 0 to 9, the wheel to its left decreases by one position.

Now suppose we have an odometer with an infinite number of wheels. We are going to use this infinite odometer to represent all the integers.

When all the wheels are 0, we interpret the value as the integer 0.

A positive integer n is represented by an odometer position obtained by advancing the rightmost wheel n positions from 0. Notice that for each such positive number, there will be an infinite number of wheels with the value 0 to the left.

A negative integer n is represented by an odometer position obtained by turning the rightmost wheel backwards from 0 by the absolute value of n positions. Notice that for each such negative number, there will be an infinite number of wheels with the value 9 to the left.

In fact, we don't need an infinite number of wheels. For each number only a finite number of wheels is needed. We simply assume that the leftmost wheel (which will be either 0 or 9) is duplicated an infinite number of times to the left.

While for each number we only need a finite number of wheels, the number of wheels is unbounded, i.e., we cannot use a particular finite number of wheels to represent all the numbers. The difference is subtle but important (but perhaps not that important for this particular course). If we need an infinite number of wheels, then there is no hope of ever using this representation in a program, since that would require an infinite-size memory. If we only need an unbounded number of wheels, we may run out of memory, but we can represent a lot of numbers (each of finite size) in a useful way. Since any program that runs in finite time only uses a finite number of numbers, with a large enough memory, we might be able to run our program.

Now suppose we have an addition circuit that can handle nonnegative integers with an infinite number of digits. In other words, when given a number starting with an infinite number of 9s, it will interpret this as an infinitely large positive number, whereas our interpretation of it will be a negative number. Let us say we give this circuit the two numbers ...9998 (which we interpret as -2) and ...0005 (which we interpret as +5). It will add the two numbers. First it adds 8 and 5, which gives 3 and a carry of 1. Next, it adds 9 and the carry 1, giving 0 and a carry of 1. For all remaining (infinitely many) positions, the value will be 0 with a carry of 1, so the final result is ...0003. This result is the correct one, even with our interpretation of negative numbers. You may argue that the carry must end up somewhere, and it does, but at infinity. In some ways, we are doing arithmetic modulo infinity.

Some implementations of some programming languages with arbitrary precision integer arithmetic (Lisp for instance) use exactly this representation of negative integers.

Let us finish this section by giving a simple method for computing the absolute value of a negative integer in our representation. It suffices to take each individual digit, replace it by 9 minus its original value, and then, at the end, add 1 to the number obtained. So, for instance, the number ...9998 first becomes ...0001, and adding 1 gives ...0002. This method works both ways, i.e., you can also use it to negate a positive number.
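Here is a quick sketch of this negation rule in Python, applied to a fixed-width string of decimal digits (the infinite run of leading 0s or 9s is left implicit; the function name is my own).

    def tens_complement(digits):
        # Negate a number written as a string of decimal digits:
        # replace each digit d by 9 - d, then add 1.
        nines = ''.join(str(9 - int(d)) for d in digits)
        width = len(digits)
        return str((int(nines) + 1) % 10 ** width).zfill(width)

    print(tens_complement('998'))   # '002', the absolute value of -2
    print(tens_complement('002'))   # '998', so the method works both ways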

Finite-precision ten's complement

What we have said in the previous section works almost as well with a fixed bounded number of odometer wheels. The only problem is that we have to deal with overflow and underflow.

Suppose we have only a fixed number of wheels, say 3. In this case, we shall use the convention that if the leftmost wheel shows a digit between 0 and 4 inclusive, then we have a nonnegative number, equal to its representation. When instead the leftmost wheel shows a digit between 5 and 9 inclusive, we have a negative number, whose absolute value can be computed with the method that we described in the previous section.

We now assume that we have a circuit that can add positive three-digit numbers, and we shall see how we can use it to add negative numbers in our representation.

Suppose again we want to add -2 and +5. The representations for these numbers with three wheels are 998 and 005 respectively. Our addition circuit will attempt to add the two positive numbers 998 and 005, which gives 1003. But since the addition circuit only has three digits, it will truncate the result to 003, which is the right answer for our interpretation.
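With three wheels, the whole scheme is just arithmetic modulo 1000. Here is a small sketch of the encode, add, and interpret steps (the names are my own):

    WHEELS = 3
    MOD = 10 ** WHEELS              # 1000: one full turn of the odometer

    def to_wheels(n):
        # Represent an integer between -500 and 499 as an odometer position.
        return n % MOD

    def from_wheels(position):
        # Interpret an odometer position: a leading digit of 5-9 means negative.
        return position if position < MOD // 2 else position - MOD

    print(to_wheels(-2))                                       # 998
    print(from_wheels((to_wheels(-2) + to_wheels(5)) % MOD))   # 3, i.e. -2 + 5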

A valid question at this point is in which situations our finite addition circuit will not work. The answer is somewhat complicated. It is clear that it always gives the correct result when a positive and a negative number are added. It is incorrect in two situations. The first situation is when two positive numbers are added, and the result comes out looking like a negative number, i.e., with a first digit somewhere between 5 and 9. You should convince yourself that no addition of two positive numbers can yield an overflow and still look like a positive number. The second situation is when two negative numbers are added and the result comes out looking like a nonnegative number, i.e., with a first digit somewhere between 0 and 4. Again, you should convince yourself that no addition of two negative numbers can yield an underflow and still look like a negative number.

We now have a circuit for addition of integers (positive or negative) in our representation. We simply use a circuit for addition of only positive numbers, plus some circuits that check for overflow and underflow: that two operands looking positive do not give a sum that looks negative, and that two operands looking negative do not give a sum that looks positive.

Finite-precision two's complement

So far, we have studied the representation of negative numbers using ten's complement. In a computer, we prefer using base two rather than base ten. Luckily, the exact method described in the previous section works just as well for base two. For an n-bit adder (n is usually 32 or 64), we can represent nonnegative numbers with a leftmost digit of 0, which gives values between 0 and 2^(n-1) - 1, and negative numbers with a leftmost digit of 1, which gives values between -2^(n-1) and -1.

The exact same rule for overflow and underflow detection works. If, when adding two positive numbers, we get a result that looks negative (i.e. with its leftmost bit 1), then we have an overflow. Similarly, if, when adding two negative numbers, we get a result that looks positive (i.e. with its leftmost bit 0), then we have an underflow.
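The same idea in base two, for an n-bit word, can be sketched as follows, together with the overflow and underflow tests just described (all names are my own, and n is 8 here to keep the examples small).

    def to_word(n, bits=8):
        # The n-bit two's complement representation of n, as an integer 0 .. 2**bits - 1.
        return n % (1 << bits)

    def from_word(word, bits=8):
        # Interpret an n-bit word: a leftmost bit of 1 means negative.
        return word if word < (1 << (bits - 1)) else word - (1 << bits)

    def add_words(x, y, bits=8):
        # Add two integers through their n-bit representations, reporting overflow/underflow.
        result = from_word((to_word(x, bits) + to_word(y, bits)) % (1 << bits), bits)
        bad = (x >= 0 and y >= 0 and result < 0) or (x < 0 and y < 0 and result >= 0)
        return result, bad

    print(add_words(100, 27))   # (127, False)
    print(add_words(100, 28))   # (-128, True): two positive numbers overflow
    print(add_words(-2, 5))     # (3, False)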

Rational numbers

Integers are useful, but sometimes we need to compute with numbers that are not integers.

An obvious idea is to use rational numbers. Many algorithms, such as the simplex algorithm for linear optimization, use only rational arithmetic whenever the input is rational.

There is no particular difficulty in representing rational numbers in a computer. It suffices to have a pair of integers, one for the numerator and one for the denominator.

To implement arithmetic on rational numbers, we can use some additional restrictions on our representation. We may, for instance, decide that the denominator is always strictly positive (so that the sign of the number is carried by the numerator), and that the numerator and the denominator have no common factor.

Such a set of rules makes sure that our representation is canonical, i.e., that the representation for a value is unique, even though, a priori, many representations would work.

Circuits for implementing rational arithmetic would have to take such rules into account. In particular, the last rule would imply dividing the two integers resulting from every arithmetic operation by their greatest common divisor to obtain the canonical representation.
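A sketch of such a canonical representation in Python, assuming the rules above: the sign is kept in the numerator, and common factors are divided out after every operation (the function names are my own).

    from math import gcd

    def make_rational(num, den):
        # Return (numerator, denominator) in canonical form:
        # positive denominator, no common factor.
        if den == 0:
            raise ZeroDivisionError("the denominator must be nonzero")
        if den < 0:                     # keep the sign in the numerator
            num, den = -num, -den
        g = gcd(num, den)
        return num // g, den // g

    def add_rational(a, b):
        # (a0/a1) + (b0/b1), canonicalized after the operation.
        return make_rational(a[0] * b[1] + b[0] * a[1], a[1] * b[1])

    print(make_rational(4, -6))          # (-2, 3)
    print(add_rational((1, 6), (1, 3)))  # (1, 2)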

Rational numbers and rational arithmetic are not very common in the hardware of a computer. The reason is probably that rational numbers don't behave very well with respect to the size of the representation. For rational numbers to be truly useful, their components, i.e., the numerator and the denominator, both need to be arbitrary-precision integers. As we have mentioned before, arbitrary precision of any kind does not go very well with fixed-size circuits inside the CPU of a computer.

Programming languages, on the other hand, sometimes use arbitrary-precision rational numbers. This is the case, in particular, with the language Lisp.

Floating-point numbers

Instead of using the obvious representation of rational numbers presented in the previous section, most computers use a different representation of a subset of the rational numbers. We call these numbers floating-point numbers.

Floating-point numbers use inexact arithmetic, and in return require only a fixed-size representation. For many computations (so-called scientific computations, as if other computations weren't scientific) such a representation has the great advantage that it is fast, while at the same time usually giving adequate precision.

There are some (sometimes spectacular) exceptions to the "adequate precision" statement in the previous paragraph, though. As a result, an entire discipline of applied mathematics, called numerical analysis, has been created for the purpose of analyzing how algorithms behave with respect to maintaining adequate precision, and of inventing new algorithms with better properties in this respect.

The basic idea behind floating-point numbers is to represent a number as a mantissa and an exponent, each with a fixed number of bits of precision. If we denote the mantissa by m and the exponent by e, then the number thus represented is m * 2^e.

Again, we have a problem that a number can have several representations. To obtain a canonical form, we simply add a rule that m must be greater than or equal to 1/2 and strictly less than 1. If we write such a mantissa in binal (analogous to decimal) form, we always get a number that starts with 0.1. This initial information therefore does not have to be represented, and we represent only the remaining "binals".

The reason floating-point representations work well for so-called scientific applications is that we more often need to multiply or divide two numbers than to add or subtract them. Multiplication of two floating-point numbers is easy to obtain. It suffices to multiply the mantissas and add the exponents. The resulting mantissa might be smaller than 1/2; in fact, it can be as small as 1/4. In this case, the result needs to be canonicalized. We do this by shifting the mantissa left by one position and subtracting one from the exponent. Division is only slightly more complicated. Notice that the imprecision in the result of a multiplication or a division is only due to the imprecision in the original operands. No additional imprecision is introduced by the operation itself (except possibly one unit in the least significant digit). Floating-point addition and subtraction do not have this property.
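The multiplication rule can be sketched directly on (mantissa, exponent) pairs, using the normalization convention 1/2 <= m < 1 from above; ordinary Python floats stand in for the fixed-width mantissa, and the names are my own.

    def normalize(m, e):
        # Shift the mantissa so that 1/2 <= m < 1 (for m != 0).
        if m == 0:
            return 0.0, 0
        while m < 0.5:
            m, e = m * 2, e - 1     # shift left, decrease the exponent
        while m >= 1.0:
            m, e = m / 2, e + 1     # shift right, increase the exponent
        return m, e

    def fp_mul(a, b):
        # Multiply two numbers represented as (mantissa, exponent) pairs.
        (ma, ea), (mb, eb) = a, b
        return normalize(ma * mb, ea + eb)   # multiply mantissas, add exponents

    # 0.75 * 2**1 = 1.5 and 0.625 * 2**2 = 2.5; their product is 3.75 = 0.9375 * 2**2
    print(fp_mul((0.75, 1), (0.625, 2)))     # (0.9375, 2)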

To add two floating-point numbers, the one with the smaller exponent must first have its mantissa shifted right by n steps, where n is the difference between the exponents. If n is greater than the number of bits in the representation of the mantissa, the smaller number will be treated as 0 as far as the addition is concerned. The situation is even worse for subtraction (or addition of one positive and one negative number). If the numbers have roughly the same absolute value, the result of the operation is roughly zero, and the resulting representation may have no correct significant digits.
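A corresponding sketch for addition, simulating a mantissa of limited width so that the loss of the smaller operand becomes visible; it reuses normalize from the multiplication sketch above.

    def fp_add(a, b, mantissa_bits=8):
        # Add two (mantissa, exponent) pairs with a mantissa of limited width.
        (ma, ea), (mb, eb) = a, b
        if ea < eb:                         # make a the operand with the larger exponent
            (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
        shift = ea - eb
        if shift > mantissa_bits:           # the smaller operand disappears entirely
            mb = 0.0
        else:
            mb = mb / (1 << shift)          # shift the smaller mantissa right
            # drop the bits that fall off the end of the limited mantissa
            mb = round(mb * (1 << mantissa_bits)) / (1 << mantissa_bits)
        return normalize(ma + mb, ea)

    print(fp_add((0.5, 1), (0.5, 1)))       # (0.5, 2): 1 + 1 = 2
    print(fp_add((0.5, 10), (0.5, -10)))    # (0.5, 10): the small operand is lost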

The two's complement representation that we mentioned above is mostly useful for addition and subtraction. It only complicates things for multiplication and division. For multiplication and division, it is better to use a representation with a sign plus an absolute value. Since multiplication and division are more common with floating-point numbers, and since they translate into multiplication and division of the mantissas, it is more advantageous to have the mantissa represented as a sign plus an absolute value. The exponents, on the other hand, are added, so it is more common to use two's complement (or some related representation) for the exponent.

Usually, computers manipulate data in chunks of 8, 16, 32, 64, or 128 bits. It is therefore useful to fit a single floating-point number, with both mantissa and exponent, in such a chunk. In such a chunk, we need to have room for the sign (1 bit), the mantissa, and the exponent. While there are many different ways of dividing the remaining bits between the mantissa and the exponent, in practice most computers now use a standard called IEEE 754, which mandates (among others) a 32-bit single-precision format with 1 sign bit, 8 exponent bits, and 23 mantissa bits, and a 64-bit double-precision format with 1 sign bit, 11 exponent bits, and 52 mantissa bits.

Real numbers

One sometimes hears variations on the phrase "computers can't represent real numbers exactly". This, of course, is not true. Nothing prevents us from representing (say) the square root of two as the number two together with a bit indicating that the value is the square root of the representation. Some useful operations could be very fast this way. It is true, though, that we cannot represent all real numbers exactly. In fact, we have a problem similar to the one we have with rational numbers, in that it is hard to pick a useful subset that we can represent exactly, other than the floating-point numbers.

For this reason, no widespread hardware contains built-in real numbers other than the usual approximations in the form of floating-point.