Saturday, May 15, 2010

Optimising pointer subtraction with 2-adic integers.

Here is a simple C type and a function definition:


struct A
{
    char x[7];
};

int diff(struct A *a, struct A *b)
{
    return a-b;
}


It doesn't seem like there could be much to say about that. The A structure is 7 bytes long so the subtraction implicitly divides by 7. That's about it. But take a look at the assembly language generated when it's compiled with gcc:


movl 4(%esp), %eax
subl 8(%esp), %eax
imull $-1227133513, %eax, %eax
ret


Where is the division by 7? Instead we see multiplication by -1227133513. A good first guess is that this strange constant is an approximate fixed-point representation of 1/7. But this is a single multiplication with no shifting or bit-field selection tricks. So how does this work? And what is -1227133513? Answering that question will lead us on a trip through some surprising and abstract mathematics. Among other things, we'll see that not only can you represent negative numbers as positive numbers in binary using two's complement, but that you can also represent fractions in binary in much the same way.

But first, some history.

Some n-bit CPUs



[Image: an Intel 4004 microprocessor]

That's an Intel 4004 microprocessor, the first microprocessor completely contained on a single chip. It was a 4-bit processor equipped with 4-bit registers. With 4 bits we can represent unsigned integers from 0 to 15 in a single register. But what happens if we want to represent larger integers?

Let's restrict ourselves to arithmetic using only addition, subtraction and multiplication, with one register per number. Then a curious thing happens. Take some numbers outside the range 0 to 15 and store only the last 4 bits of each number in our registers. Now perform a sequence of additions, subtractions and multiplications. Obviously we usually expect to get the wrong result, because if the final result is outside our range we can't represent it in a single register. But the result we do get will have the last 4 bits of the correct result. This happens because in the three operations I listed, the value of a bit in the result doesn't depend on any higher bits in the inputs. Information only propagates from low bit to high bit. We can think of a 4004 as allowing us to correctly resolve the last 4 bits of a result. From the perspective of a 4004, 1 and 17 and 33 all look like the same number. It doesn't have the power to distinguish them. But if we had a more powerful 8-bit processor like the 6502 in this machine, we could distinguish them:

[Image: a 6502-based 8-bit machine]
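
Here's a minimal sketch of that low-bits property in C, simulating the 4-bit machine by masking every intermediate result to its last 4 bits (the MASK4 macro is my own, not anything standard):

#include <stdio.h>

/* Keep only the last 4 bits, as a 4-bit register would. */
#define MASK4(x) ((x) & 0xf)

int main()
{
    unsigned full = (17 + 33) * 5 - 9;            /* full-width arithmetic: 241 */

    unsigned tiny = MASK4(MASK4(17) + MASK4(33)); /* 1 + 1 = 2  */
    tiny = MASK4(tiny * MASK4(5));                /* 2 * 5 = 10 */
    tiny = MASK4(tiny - MASK4(9));                /* 10 - 9 = 1 */

    /* Both print 1: the 4-bit machine resolves the last 4 bits correctly. */
    printf("%u %u\n", full & 0xf, tiny);
    return 0;
}
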
This is analogous to the situation we have with distances in the physical world. With our eyes we can resolve details down to maybe 0.5mm. If we want to distinguish anything smaller we need more powerful equipment, like a magnifying glass. When that fails we can get a microscope, or an electron microscope, or these days even an atomic force microscope. The more we pay, the smaller the details we can resolve. So we can think of the cost of the equipment required to resolve two points as a kind of measure of how close they are.

We have the same situation with computers. To resolve 1 and 17 we need an 8-bit machine. To resolve 1 and 65537 we need a 32-bit machine. And so on. So if we adopt a cost-based measure like the one in the previous paragraph, there is a sense in which 1 is close to 17, but 1 is even closer to 257, and closer still to 65537. We have an inverted notion of closeness in which numbers whose difference is a multiple of a large (in the usual sense) power of two are close in this new sense.
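
We can compute this 'cost' directly: the more low zero bits in the difference of two numbers, the closer they are. Here's a minimal sketch in C (valuation is my own helper, and it assumes a nonzero argument):

#include <stdio.h>

/* The 2-adic valuation of n: the number of trailing zero bits (n nonzero). */
unsigned valuation(unsigned n)
{
    unsigned v = 0;
    while ((n & 1) == 0) { n >>= 1; ++v; }
    return v;
}

int main()
{
    /* A higher valuation of the difference means 2-adically closer. */
    printf("%u\n", valuation(17 - 1));    /* 4: need more than 4 bits to resolve */
    printf("%u\n", valuation(257 - 1));   /* 8: closer */
    printf("%u\n", valuation(65537 - 1)); /* 16: closer still */
    return 0;
}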

We have an interesting relationship between computing machines with different 'resolving' power. If we take an arithmetical computation on an N-bit machine, and then take the last M bits of the inputs and result, we get exactly what the M-bit machine would have computed. So an M-bit machine can be thought of as a kind of window onto the last M bits of an N-bit machine. Here's a sequence of machines:

[Diagram: ... → 32-bit → 16-bit → 8-bit → 4-bit, each machine a window onto the low bits of the previous one]

Each machine provides a window onto the low bits of the previous machine in the sequence. But what happens at the "..." on the left? That suggests the bizarre idea that maybe all of these finite machines could be thought of as windows onto some infinite-bit machine. Does that idea make any kind of sense?

I'll try to convince you that's a sensible idea by pointing out that it's something familiar to anyone who's taken a rigorous analysis course. (And I'll mention in passing that the above diagram illustrates a limit in an appropriate category!)

Mathematicians (often) build the real numbers from the rational numbers by a process known as completion. Consider a sequence like 1, 14/10, 141/100, 1414/1000, ... . The nth term is the largest fraction, with 10^n in the denominator, such that its square is less than 2. It's well known that there is no rational number whose square is 2. And yet it feels like this sequence ought to be converging to something. It feels this way because successive terms in the sequence get as close to each other as you like. If you pick any ε there will be a term in the sequence, say x, with the property that later terms never deviate from x by more than ε. Such a sequence is called a Cauchy sequence. But these sequences don't all converge to rational numbers. A number like √2 is a gap. What are we to do?

Mathematicians fill the gap by defining a new type of number, the real number. These are by definition Cauchy sequences. Now every Cauchy sequence converges to a real number because, by definition, the real number it converges to is the sequence. For this to be anything more than sleight of hand we need to prove that we can do arithmetic with these sequences. But that's just a technical detail that can be found in any analysis book. So, for example, we can think of the sequence I gave above as actually being the square root of two. In fact, the decimal notation we use to write √2, 1.414213..., can be thought of as shorthand for the sequence (1, 14/10, 141/100, ...).

The notion of completeness depends on an idea of closeness. I've described an alternative to the usual notion of closeness and so we can define an alternative notion of Cauchy sequence. We'll say that the sequence x_1, x_2, ... is a Cauchy sequence in the new sense if all the numbers from x_n onwards agree on their last n bits. (This isn't quite the usual definition but it'll do for here.) For example, 1, 3, 7, 15, 31, ... defines a Cauchy sequence. We consider a Cauchy sequence equal to zero if x_n always has zero for its n lowest bits. So 2, 4, 8, 16, 32, ... is a representation of zero. We can add, subtract and multiply Cauchy sequences pointwise, so, for example, the product of (x_n) and (y_n) has terms x_n*y_n, and the sum has terms x_n+y_n. Two Cauchy sequences are considered equal if their difference is zero. These numbers are called 2-adic integers.

Exercise: prove that if x is a 2-adic integer then x+0=x and that 0x=0.

There's another way of looking at 2-adic integers. They are infinite strings of binary digits, extending to the left. The last n digits are simply given by the last n digits of x_n. For example we can write 1, 3, 7, 15, 31, ... as ...1111111. Amazingly we can add, subtract and multiply these numbers using the obvious extensions of the usual algorithms. Let's add ...1111111 to 1:


...11111111
...00000001
-----------
...00000000


We get a carry of 1 that ripples off to infinity and gives us zeroes all the way.

We can try doing long multiplication of ...111111 with itself. We get:


...11111111
...1111111
...111111
...11111
...
-----------
...00000001


It's important to notice that even though there are an infinite number of rows and columns in that multiplication, you only need to multiply and add a finite number of digits to get any digit of the result. If you don't like that infinite arrangement you can instead compute the last n digits of the product by multiplying the n-digit number 11...111 by itself and taking the last n digits. The infinite long multiplication is really the same as doing this for all n and organising it in one big table.
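
In C, 32-bit unsigned arithmetic performs exactly this truncation for us, so we can check the n = 32 case of both calculations directly:

#include <stdio.h>

int main()
{
    unsigned ones = 0xffffffffu;  /* 32 ones: a truncation of ...1111111 */

    printf("%u\n", ones + 1);     /* prints 0: the carry ripples off the top */
    printf("%u\n", ones * ones);  /* prints 1, matching the long multiplication */
    return 0;
}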

So ...1111111 has many of the properties we expect of -1. Added to 1 we get zero, and squaring it gives 1. It is -1 in the 2-adic integers. This gives us a new insight into two's-complement arithmetic. The negative two's-complement numbers are the truncated last n digits of the 2-adic representations of the negative integers. We should properly be thinking of two's-complement numbers as extending out to infinity on the left.
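
A quick illustration in C, assuming the usual two's-complement reinterpretation of bit patterns via casts:

#include <stdio.h>
#include <stdint.h>

int main()
{
    /* The last 8 bits of ...1111111 reinterpreted as a signed byte give -1, */
    /* and the last 32 bits reinterpreted as a signed int also give -1.      */
    printf("%d\n", (int8_t) 0xff);          /* -1 */
    printf("%d\n", (int32_t) 0xffffffffu);  /* -1 */
    return 0;
}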

The field of analysis makes essential use of the notion of closeness with its δ and ε proofs. Many theorems from analysis carry over to the 2-adic integers. We find ourselves in a strange alternative number universe which is a sort of mix of analysis and number theory. In fact, people have even tried studying physics in p-adic universes. (p-adics are what you get when you repeat the above for base p numbers, but I don't want to talk about that now.) One consequence of analysis carrying over is that some of our intuitions about real numbers carry over to the 2-adics, even though some of our intuitive geometric pictures seem like they don't really apply. I'm going to concentrate on one example.

The Newton-Raphson Method
I hope everyone is familiar with the Newton-Raphson method for solving equations. If we wish to solve f(x)=0 we start with an estimate x_n. We find the tangent to y=f(x) at x=x_n. The tangent line is an approximation to the curve y=f(x), so we solve the easy problem of finding where the tangent line crosses the x-axis to get a new estimate x_(n+1). This gives the formula

x_(n+1) = x_n - f(x_n)/f'(x_n).

With luck the new estimate will be closer than the old one. We can do some analysis to get some sufficient conditions for convergence.

The surprise is this: the Newton-Raphson method often works very well for the 2-adic integers, even though the geometric picture of lines crossing axes doesn't quite make sense. In fact, it often works much better than with real numbers, allowing us to state very precise and easy-to-satisfy conditions for convergence.

Now let's consider the computation of reciprocals of real numbers. To find 1/a we wish to solve f(x)=0 where f(x)=1/x-a. Newton's method gives the iteration x_(n+1) = x_n(2-ax_n). This is a well-known iteration that is used internally by CPUs to compute reciprocals. But for it to work we need to start with a good estimate. The famous Pentium divide bug was a result of an incorrect lookup table providing the first estimate. So let's say we want to find 1/7. We might start with an estimate like 0.1 and quickly get the estimates 0.13, 0.1417, 0.142848, ... . It's converging to the familiar 0.142857...
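
Here's that real-valued iteration as a short C sketch, starting from the good estimate 0.1:

#include <stdio.h>

int main()
{
    double a = 7.0;
    double x = 0.1;  /* a good first estimate of 1/7 */
    int i;

    for (i = 0; i < 5; ++i)
    {
        x = x * (2 - a * x);  /* Newton step for f(x) = 1/x - a */
        printf("%.6f\n", x);  /* 0.130000, 0.141700, 0.142848, ... */
    }
    return 0;
}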

But what happens if we start with a bad estimate like 1? We get the sequence:


1
-5
-185
-239945
-403015701065
-1136951587135200126341705


It's diverging badly. But now let's look at the binary:


00000000000000000000000000000000000000000000000000000000000000000000000000000000001
11111111111111111111111111111111111111111111111111111111111111111111111111111111011
11111111111111111111111111111111111111111111111111111111111111111111111111101000111
11111111111111111111111111111111111111111111111111111111111111111000101011010110111
11111111111111111111111111111111111111111111010001000101010011001000110110110110111
11100001111001111011011101100100000010000000011011010110110110110110110110110110111


Our series may be diverging rapidly in the usual sense, but amazingly it's converging rapidly in our new 2-adic sense!

If it's really converging to a meaningful reciprocal we'd expect that if we multiplied the last n digits of these numbers by 7 then we'd get something that agreed with the number 1 in the last n digits. Let's take the last 32 digits:


10110110110110110110110110110111


and multiply by 7:


10100000000000000000000000000000001


The last 32 bits are


00000000000000000000000000000001.


So if we're using 32-bit registers, and we're performing multiplication, addition and subtraction, then this number is, to all intents and purposes, a representation of 1/7. If we interpret it as a two's-complement number, then in decimal it's -1227133513. And that's the mysterious number gcc generated.
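
It's easy to verify this directly in C (assuming 32-bit unsigned ints):

#include <stdio.h>

int main()
{
    unsigned x = 0xb6db6db7u;  /* the 32 bits above */

    printf("%u\n", 7u * x);    /* prints 1: x really is 1/7 modulo 2^32 */
    printf("%d\n", (int) x);   /* prints -1227133513: gcc's constant */
    return 0;
}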

Epilogue
There are many things to follow up with. I'll try to be brief.

Try compiling C code with a struct of size 14. You'll notice some extra bit shifting going on. So far I've only defined the 2-adic integers. But to get the reciprocal of every non-zero number we need numbers whose digits don't just extend leftwards to infinity but also extend a finite number of steps to the right of the "binary point". These are the full 2-adic numbers, as opposed to merely the 2-adic integers. That's how the extra shifts can be interpreted. Or more simply, if you need to divide by 14 you can divide by 2 first and then use the above method to divide by 7.
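
For example, here's a sketch of the size-14 case, assuming the byte difference is an exact multiple of 14 (I haven't checked gcc's exact instruction sequence, so the ordering here is illustrative):

#include <stdio.h>

int main()
{
    unsigned bytes = 14u * 12345u;                /* an exact multiple of 14 */

    /* Shift right to divide by 2 (exact, since the difference is even),
       then multiply by the 2-adic reciprocal of 7. */
    unsigned count = (bytes >> 1) * 0xb6db6db7u;

    printf("%u\n", count);                        /* prints 12345 */
    return 0;
}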

I don't know how gcc generates its approximate 2-adic reciprocals. Possibly it uses something based on the Euclidean GCD algorithm. I wasn't able to find the precise line of source in a reasonable time.

An example of a precise version of the Newton-Raphson method for the p-adics is Hensel's lemma.

The last thing I want to say is that all of the above is intended purely to whet your appetite and point out that a curious abstraction from number theory has an application to compiler writing. It's all non-rigorous and hand-wavey. I recommend reading further at Wikipedia. I learnt most of what I know on the subject from the first few chapters of Koblitz's book p-adic Numbers, p-adic Analysis, and Zeta-Functions. The proof of the von Staudt–Clausen theorem in that book is mind-blowing. It reveals that the real numbers and the p-adic numbers are equally valid ways to approximately get a handle on rational numbers, and that there are whole alternative p-adic universes out there inhabited by weird versions of familiar things like the Riemann zeta function.

(Oh, and please don't take the talk of CPUs too literally. I'm fully aware that you can represent big numbers even on a 4-bit CPU. But what I say about a model of computation restricted to multiplication, addition and subtraction in single n-bit registers holds true.)

Some Challenges
1. Prove from first principles that the iteration for 1/7 converges. Can you prove how many digits it generates at a time?
2. Can you find a 32-bit square root of 7? Using the Newton-Raphson method? Any other number? Any problems?

Acknowledgements
Update: I have replaced the images with images that are, to the best of my knowledge, public domain, or composited from public domain images. (Thanks jisakujien. My bad.)

Update
If you want to play with some p-adics yourself, there is some code to be found here. That also has code for transcendental functions applied to p-adics.

Here's some C code to compute inverses of odd numbers modulo 2^32 (assuming 32-bit ints). Like the real-valued Newton method, it doubles the number of correct digits at each step, so we only need 5 iterations to get 32 bits. (C code, as I think it's traditional to twiddle one's bits in C rather than Haskell.)

#include <stdio.h>
#include <assert.h>

typedef unsigned int uint;

/* Compute the inverse of an odd number x modulo 2^32 by Newton iteration.
   The initial estimate 2-x is correct to at least 2 bits; each step
   doubles the number of correct bits, so four more steps reach all 32. */
uint inverse(uint x)
{
    uint y = 2-x;
    y = y*(2-x*y);
    y = y*(2-x*y);
    y = y*(2-x*y);
    y = y*(2-x*y);

    return y;
}

int main()
{
    /* Check the claim for (nearly) every odd 32-bit number. */
    uint i;
    for (i = 1; i < 0xfffffffe; i += 2)
    {
        assert (i*inverse(i) == 1);
    }
    return 0;
}




24 comments:

  1. Great post Dan, and very interesting.

    First of all, it is important to note that the GCC code is not a general answer to the division by 7, the formula only works for numbers exactly divisible by 7.

    You can find this without the need of Newton. I actually was asked that question in my first interview in the U.S. and below is how I quickly derived it during the interview. Funnily enough, the interviewers did not expect that such a solution could exist :)

    x = 7y
    x = 8y - y

    y = 8y - x = 2^3*y - x (1)

    Multiplying (1) by 2^3 gives

    2^3*y = 2^6*y - 2^3*x (2)

    Using (2) in (1) gives

    y = 2^6*y - x - 2^3*x

    Repeating the same procedure 10 times gives

    y = 2^33*y - x - 2^3*x - 2^6*x - 2^9*x - 2^12*x - 2^15*x - 2^18*x - 2^21*x - 2^24*x - 2^27*x - 2^30*x

    Since we are working with 32-bit numbers we are working modulo 2^32, and therefore 2^33*y = 0 [2^32]

    Therefore y = - x - 2^3*x - 2^6*x - 2^9*x - 2^12*x - 2^15*x - 2^18*x - 2^21*x - 2^24*x - 2^27*x - 2^30*x

    And then y = x * -( 1 + 2^3 + 2^6 + 2^9 + 2^12 + 2^15 + 2^18 + 2^21 + 2^24 + 2^27 + 2^30 )

    y = x * -1227133513

    You can use the same method to derive one that works for all integers. And I leave to the readers the challenge of finding a method that works for all integers without the use of multiplication :)

    Cedrick

    ReplyDelete
  2. I thought you were going to say that -1227133513 is the inverse of 7 in the ring Z/(2^32 Z) of integers modulo 2^32, i.e., since -1227133513 * 7 == 1 mod 2^32 dividing by 7 is the same as multiplying by -1227133513. The number -1227133513 may be computed with Euclid's algorithm.

    P.S. I think that - in the source code should be /.

    ReplyDelete
    Andrej, yes you could use Euclid's algorithm but I think my calculation is easier and less error prone.

    Since Dan mentioned first principles, it led me to try to find the simplest and easiest method to find the positive number n so that n * 7 = 1 [2^32] and I found that the easiest way could be done by kids in 4th or 5th grade without knowledge of Euclid or Bezout.

    Looking for the number in hexadecimal is the easiest way due to the modulo. Write out the products of 1 to F with 7 and you get 7, e, 15, 1c, 23, 2a, 31, 38, 3f, 46, 4d, 54, 5b, 62 and 69.

    Since we need the last digit of the product with 7 to be 1, only 7 works, so you have the last digit.
    Now ?7 x 7 = 01; given that 7x7 was 31, you need the last digit of the product of ? and 7 to be 10 - 3 = D, which comes from Bx7.
    You repeat the same procedure with ?B7 x 7 = 001, and you finally reach B6DB6DB7.

    I am interested by any simpler method.

    Dan, not sure if I understood the challenge about the square root since with the method above I can prove that there is no square root for 7.

    ReplyDelete
  4. This comment has been removed by a blog administrator.

    ReplyDelete
  5. Cedrick,

    I did drop a big hint when I said "any problems?" :-) The situation roughly mimics the situation with the real numbers. Some of the real numbers have a square root, some have two, and zero has just one.

    Anyway, I like your method as it's very easy to understand. But the Newton method here is very easy to code up (one short line of code), and runs fast.

    ReplyDelete
  6. Just linking to a page about the GFDL (which isn't even the text of the GFDL--not that that would be sufficient either) is not enough. There are quite a lot of hoops that you have to jump through that not even Wikipedia itself gets right.

    You also have the problem that the images used in Wikipedia are not necessarily available under the GFDL in the first place.

    Thankfully, Wikipedia has transitioned to also using CC-BY-SA-3.0 which is much simpler to understand and meet the requirements of (with the caveat that the transition itself complicates things to some extent). Because of this, you really should take the small amount of effort to at least read the non-legal text of the license and attempt to follow it.

    Just saying 'oh, it is too hard and confusing' while making no real effort is quite dishonest. Saying that you took various images from Wikipedia and they might be available under some license isn't even a nominal attempt. Is it really that hard to have a link to the original image pages with their licensing info? That would at least be a good faith effort.

    ReplyDelete
  7. jisakujien

    > Is it really that hard to have a link to the original image pages with their licensing info?

    Yes it is. Not because linking is hard, but because I don't have a lawyer to tell me whether that is an acceptable solution either. I'll replace these images with public domain images within 12 hours and I appreciate that with your prompting I will then be 100% legal. Thanks for setting me straight.

    ReplyDelete
  8. Great post, thank you. I had never studied p-adic numbers and had only encountered them a few times; I thought they were just a curiosity and didn't realise they had applications. Hensel's lemma is really cool.

    [BTW, I'm not a lawyer, but as I understand it, in general, the only requirements of images under free licenses (like most images on Wikipedia, except a few that are used as "fair use") are that

    1. You credit the author. GFDL or Creative Commons aren't very clear on how to do this, but Wikipedia simply arranges so that clicking on the image will take you to a page that has author information, and if you link to the same page, it should be ok. (Or better, also add a line under the image crediting the author credited on Wikipedia.)

    2. Copyleft. "If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one." (The strictest and possibly incorrect interpretation would be that including the image makes your blog post freely licensed as well, but I don't think so.)]

    ReplyDelete
  9. I received a post from a user called Gery but it was attached to the wrong blog post so I'm posting it here:

    The colimit of your diagram is simply the Intel 4004, while the limit of the same diagram is a hypothetical infinite machine which projects onto every finite machine.

    So, I guess you meant to talk about limits, instead of colimits.

    ReplyDelete
    @Gery I'll fix that. When I first learnt about p-adics I was using direct/inverse limit language but these days I use limit/colimit language. It drives me crazy that *co*limit corresponds to *inverse* limit.

    ReplyDelete
  11. There is also a rather more 'hacky' introduction to this in the book and site
    'Hacker's Delight'

    http://www.hackersdelight.org/

    See the section on 'Magic Numbers'

    ReplyDelete
  12. @ardencaple You can see the Newton method if you view the source to this page.

    ReplyDelete
  13. I'll repeat the "Great post, Dan" comment. You continue to amaze me with the examples and situations you use to motivate abstract concepts.

    One problem, though: you say the rationals are constructed using Cauchy sequences. Do you, perhaps, mean the reals?

    ReplyDelete
  14. Cory,

    Which exact sentence do you think I wrote incorrectly? I'm a bit blind when it comes to spotting my own typos.

    ReplyDelete
  15. @sigfpe, in your reply to Cedrick, dated 17 May 2010, where you said, "Some of the real numbers have a square root, some have two, and zero has just one." should the "a" above be a "no", perhaps?

    ReplyDelete
  16. @BlackMeph Yes, well spotted. Substitute 'no' for 'a'.

    ReplyDelete
  17. A simpler way to find the number, based on Cedrick's method, is as follows:

    x = 7y
    y = (a+1)y - ay = (a+1)y - (a/7)x

    We want to choose a such that (1) a+1=0 [mod 2^32], and (2) a is divisible by 7. It turns out that a=2^33-1 is such a number, and a/7=1227133513. This means that mod 2^32,

    y = -(a/7)x = -1227133513x

    ReplyDelete
  18. @R,

    This

    "a+1=0 [mod 2^32], and (2) a is divisible by 7"

    is essentially a restatement of the original problem: we're trying to find a such that 7a=1 mod 2^32. 2^33-1 is a solution, so that explains why gcc chose the constant for this particular example. But what method are you proposing to find this solution?

    ReplyDelete
  19. You are correct, of course, I now realize that what I wrote is trivial. I used trial and error, which didn't take too long, obviously - 2^33-1 is the second choice, right after 2^32-1 which didn't work. Note that I'm trying to describe how a human being might reach this particular number, not how a computer could solve the general case.

    This did get me thinking, however, about long division: Divide 11111... by b in binary (in this case, b=7) and stop where the remainder is zero (after at least 32 bits). I don't know if this will always work when b is odd (it will never work when b is even, of course).

    Apologies if I'm being trivial again, or just plain wrong.

    ReplyDelete
  20. Quick comment: In this case, you can invert -7 pretty easily using the geometric series.

    -1/7 = 1/(1-8) = 1 + 8 + 8^2 + 8^3 + ...

    which converges 2-adically to ...1001001001001001. Then get 1/7 by taking the "2's complement" negative.

    ReplyDelete
  21. tlawson,

    Nice one! BTW It works for any odd integer, but it goes faster for one less than a power of two. I should have picked a harder example than 7 :-)

    ReplyDelete
    tlawson, this is the same method as mine, except that I did not feel comfortable using the second = sign in 1/(1-8) = 1 + 8 + 8^2 + 8^3 + ...
    hence my manual development.

    What makes you feel comfortable using that equality? You can justify it a posteriori, but it seems not obvious to me that 1/(1-x) = series(x^n) [2^32] when 1 < x.

    Dan, about R's method: 2^3 = 1 [7]
    and therefore 2^(3k) = 1 [7], so the result appears magically by itself.

    ReplyDelete
  23. Cedrick,

    (1)

    (1-x)(1+x+x^2+...+x^31)=1-x^32

    If x is even x^32=0 mod 2^32, and so 1/(1-x) = 1+x+x^2+...+x^31.

    Alternatively:

    (2)

    From the 2-adic perspective, if x is even, the terms of the series are getting 'smaller' and 'smaller', so the infinite series converges. This is really just another way of saying (1), but in disguise.

    ReplyDelete
  24. Dan,

    Sure for even numbers. I should have slept more, I did not even spot your original comment :)

    ReplyDelete