Let S be some finite set with a probability distribution on it. Here's a diagram showing some example probabilities for the set S={A, B, C, D, E}:
How can we generate lots of random samples from this distribution? A popular method involves first storing the cumulative distribution function in a table like so:
We then use any of a number of popular methods to generate uniform pseudorandom numbers in the range [0,1) and, for each one, walk through the table until we find the first entry greater than our pseudorandom number. The symbol above that entry is the one we generate. So if we generated 0.512, we'd pick symbol C. It's straightforward to prove that this gives the correct probability distribution.
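Here's a minimal sketch of that linear walk in C++. The probabilities are placeholders rather than the ones pictured above, chosen only to match the worked examples in this post (C is three times as likely as A, D has probability 0.25, and 0.512 lands on C):

#include <iostream>
#include <random>
#include <vector>

// Placeholder cumulative table for S = {A, B, C, D, E}; purely illustrative.
const std::vector<double> cdf = {0.10, 0.40, 0.70, 0.95, 1.00};
const char symbols[] = {'A', 'B', 'C', 'D', 'E'};

char sample_linear(std::mt19937 &gen) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double u = uniform(gen);
    // Walk until we hit the first cumulative entry greater than u.
    std::size_t i = 0;
    while (i + 1 < cdf.size() && cdf[i] <= u) ++i;
    return symbols[i];
}

int main() {
    std::mt19937 gen(42);
    for (int n = 0; n < 20; ++n) std::cout << sample_linear(gen);
    std::cout << '\n';
}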
As described, this algorithm can be slow: if the size of the table is N, we may have to walk through up to N entries to generate each sample.
One approach to accelerating this algorithm is to quantise our pseudorandom number and use it to look up, in a precomputed table, a jump into our cumulative distribution table. I've used this several times in my visual effects career. But today I'm going to take a different approach.
Another natural way to speed up sample generation is to use a binary search to find the appropriate point in the cumulative distribution; for example, the C++ standard library's std::upper_bound algorithm will do the job.
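For example, a sketch using std::upper_bound on the same placeholder table as before:

#include <algorithm>
#include <random>
#include <vector>

// Placeholder cumulative table again; purely illustrative.
const std::vector<double> cdf = {0.10, 0.40, 0.70, 0.95, 1.00};
const char symbols[] = {'A', 'B', 'C', 'D', 'E'};

char sample_binary_search(std::mt19937 &gen) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double u = uniform(gen);
    // upper_bound returns an iterator to the first entry greater than u,
    // i.e. an O(log N) binary search in place of the O(N) walk.
    auto it = std::upper_bound(cdf.begin(), cdf.end(), u);
    return symbols[it - cdf.begin()];
}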
But what is a binary search algorithm going to do? Typically it's going to start by comparing our pseudorandom number with the middle entry of the table. If our number is smaller it will recursively use binary search on the left half (looking next at the midpoint of the left half), otherwise on the right half, and so on. If we're generating many samples from the same distribution we're going to be repeatedly looking up the same midpoints in the table. At the end of the day, the process can be described by a decision tree. So we may as well throw away the table and build the decision tree up front. Here's what it might look like:
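In code, walking such a precomputed decision tree might look something like this (a sketch; the node layout is my own assumption rather than the tree pictured above):

#include <random>

// One node of the precomputed decision tree. Internal nodes store the
// cumulative threshold we compare against; leaves store a symbol.
struct DecisionNode {
    double threshold;
    char symbol;                                   // meaningful only at leaves
    DecisionNode *left = nullptr, *right = nullptr;
};

char sample_tree(const DecisionNode *node, std::mt19937 &gen) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double u = uniform(gen);
    while (node->left != nullptr)                  // descend until we hit a leaf
        node = (u < node->threshold) ? node->left : node->right;
    return node->symbol;
}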
But maybe we can do better. For example C is three times as likely as A but they both take the same amount of time to generate as they both require walking down to depth 2. Meanwhile D has a probability of 0.25 and requires walking to depth 3. We'd like to rebalance the tree. Note also that there's no reason to list our original PDF in the order I gave. Different orderings might give different trees.
It's straightforward to describe what we want to optimise. We want to place symbol i at depth d_i so that the expected depth is as small as possible. In other words we want to minimise Σ p_i d_i. But this is precisely the problem solved by Huffman coding.
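Here's a minimal sketch of the construction, which is just the usual Huffman algorithm: repeatedly merge the two least probable subtrees. (Raw pointers and leaks are ignored for the sake of the sketch.)

#include <queue>
#include <utility>
#include <vector>

struct HuffNode {
    double p;                                      // total weight of this subtree
    char symbol;                                   // meaningful only at leaves
    HuffNode *left = nullptr, *right = nullptr;
};

HuffNode *build_huffman(const std::vector<std::pair<char, double>> &dist) {
    // Min-heap ordered by subtree weight.
    auto heavier = [](const HuffNode *a, const HuffNode *b) { return a->p > b->p; };
    std::priority_queue<HuffNode *, std::vector<HuffNode *>, decltype(heavier)> q(heavier);
    for (const auto &sp : dist) q.push(new HuffNode{sp.second, sp.first});
    while (q.size() > 1) {
        HuffNode *a = q.top(); q.pop();
        HuffNode *b = q.top(); q.pop();
        q.push(new HuffNode{a->p + b->p, 0, a, b});    // merge the two lightest
    }
    return q.top();
}

To generate a sample you then walk this tree from the root; one subtlety, raised in the comments below, is that each branch has to be taken with probability proportional to the weight of the subtree under it, not 50/50.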
And that's the conclusion I wanted to reach: Huffman coding gives the optimal decision tree for generating random samples. It also has an interesting consequence: with any decision tree method the expected number of decisions per sample is bounded below by the entropy of the probability distribution, and the Huffman tree comes within one decision of that bound. I find this connection between entropy and algorithmic complexity pretty surprising.
I learnt the above during my interview at Google!
Why is there this connection between a compression algorithm and the generation of random samples? It took me a little longer to realise why but it's quite straightforward.
Huffman coding tries to compress text one letter at a time on the assumption that each letter comes from some fixed and known probability distribution. If the algorithm is successful then we'd expect the compressed text to look like a uniformly distributed sequence of bits. If it didn't then there'd be patterns that could be used for further compression.
So when we're decompressing Huffman encoded data we have a machine that takes as input uniformly distributed bits and which outputs letters sampled from some probability distribution. But that's exactly the same problem that I posed above. So, at least from some perspective, decompression is precisely the same thing as generating samples from a probability distribution.
Or to put it another way: there is a class of algorithms whose purpose is to convert samples from one probability distribution into samples from another. When we use one of these algorithms to convert samples from a distribution P that's easy to sample, into samples from Q, we call it "an algorithm to sample from distribution Q". When we use one to convert samples from some distribution into samples from a uniform distribution, and the function the algorithm computes is invertible, we call it "lossless compression".
That's a really great insight! It goes to show the value in asking the right question :) At last I know when a raven is like a writing desk.
Nice! This is similar to a Galois tech talk I gave in May:
http://corp.galois.com/blog/2011/5/16/tech-talk-video-empirical-sampling-with-haskell.html
This concept of a decoder as a random sample generator is explained in MacKay's book, my personal reference book on the subject.
You can download it for free at
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
It appears in section 6.3, page 118 of the book (page 130 of the PDF file).
@lvps1000vm
I really need to read that book!
(PS I was a student at the same time and place as MacKay. It was interesting suddenly seeing his name everywhere years later.)
There's a faster algorithm for selecting elements at random from a finite distribution, known as the Alias Method, that doesn't suffer from the O(log n) worst case. Keith Schwarz has a good description written up called Darts, Dice, and Coins: Sampling from a Discrete Distribution.
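Roughly, the Alias Method works like this (a sketch of Vose's variant from memory, not code from that article):

#include <random>
#include <vector>

// Alias method: O(n) setup, O(1) per sample.
struct AliasTable {
    std::vector<double> prob;   // chance of keeping column i
    std::vector<int> alias;     // symbol returned otherwise

    explicit AliasTable(const std::vector<double> &p) {
        const int n = static_cast<int>(p.size());
        prob.assign(n, 1.0);
        alias.assign(n, 0);
        std::vector<double> scaled(n);
        std::vector<int> small, large;
        for (int i = 0; i < n; ++i) {
            scaled[i] = p[i] * n;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        // Pair each under-full column with an over-full one.
        while (!small.empty() && !large.empty()) {
            int s = small.back(); small.pop_back();
            int l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] -= 1.0 - scaled[s];
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        // Any leftover columns keep prob = 1.0, which is already set.
    }

    int sample(std::mt19937 &gen) const {
        std::uniform_int_distribution<int> column(0, static_cast<int>(prob.size()) - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        int i = column(gen);
        return coin(gen) < prob[i] ? i : alias[i];
    }
};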
@Justin,
I've had links to that article on both G+ and Twitter now! Looks like I missed out on a nice article a few weeks ago. But does the alias method imply anything interesting about compression algorithms?
@sigfpe, I don't believe so, it's a rather clever way of doing random sampling. That said, I haven't looked at the literature beyond the article linked.
I see how this tells us about the optimal number of decisions you have to make while generating a random number, but we still cannot generate the numbers by just taking a random bit string and decompressing it with the Huffman tree over the desired distribution. E.g. consider a distribution {a:0.4, b:0.6} - it will have a Huffman tree with just 2 leaves, and decompression will give {a:0.5, b:0.5}. Am I missing something?
jkff,
You use the tree given by the Huffman algorithm but you don't have a 50/50 decision at each branch. You derive the probability of going each way from the sums of the probabilities in the two subtrees below.
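For example, with jkff's {a:0.4, b:0.6} there's a single internal node and you take the left branch with probability 0.4, so you really do get 40/60 out. A quick sketch (the node type here is hypothetical, mirroring the tree in the post):

#include <iostream>
#include <random>

struct WeightedNode {
    double p;                                        // total weight of this subtree
    char symbol;                                     // meaningful only at leaves
    WeightedNode *left = nullptr, *right = nullptr;
};

char sample_weighted(const WeightedNode *node, std::mt19937 &gen) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    while (node->left != nullptr) {
        // Branch in proportion to the subtree weights, not 50/50.
        double p_left = node->left->p / node->p;
        node = (coin(gen) < p_left) ? node->left : node->right;
    }
    return node->symbol;
}

int main() {
    WeightedNode a{0.4, 'a'}, b{0.6, 'b'}, root{1.0, 0, &a, &b};
    std::mt19937 gen(1);
    int count_a = 0;
    for (int i = 0; i < 100000; ++i)
        if (sample_weighted(&root, gen) == 'a') ++count_a;
    std::cout << "fraction of a: " << count_a / 100000.0 << '\n';
}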
Another elegant solution (Walker's method) is described by Mihai here-- http://infoweekly.blogspot.com/2011/09/follow-up-sampling-discrete.html
Neat! Had this problem before and wasn't sure how to solve it efficiently.
Neat! Had this problem before and wasn't sure how to solve it. Thanks for the explanation.
What if we are sampling/compressing observations from a latent variable model? The bits-back method makes a connection. See p353 of MacKay's textbook for a sketch and references.
Bear in mind that if this code is a bottleneck, branch instructions whose conditions depend on random numbers are effectively unpredictable, and hence expensive.
Here's a neat little trick you should know. If the number of symbols is a power of 2, there's also this solution:
double x = random_number();            // uniform in [0,1)
unsigned s = 0;
// Assumed layout: a[] stores the comparison thresholds of the complete binary
// decision tree level by level, so the node reached by decision bits s at
// depth i lives at index (1 << i) - 1 + s.
unsigned depth = log2(numberOfSymbols);
for (unsigned i = 0; i < depth; ++i)
{
    // Append one decision bit: go right when the node's threshold is below x.
    s = (s << 1) | (a[(1u << i) - 1 + s] < x);
}
return s;
I came to this same realization when I started studying data compression (almost 30 years ago -- yikes!) seriously. Build a decoder for the appropriate arithmetic code, and feed it random bits. You get your output distributed according to the probability model used to build the arithmetic code.
Hi, I hate to jump to a different subject, but a while back you had some wonderful notes on Target Enumeration and a Haskell implementation.
My question is: how could you extend what you did so that the boundary points lie on real number coordinates? To refer to the pixel examples, the area of coverage would contain the counted 'holes' plus the partially contained 'holes', or, say, pixel areas. How could we get the exact coverage based on the way you implemented it? I appreciate any information, as I know this was a while back, but I just found the blog. If you have any resources please feel free to contact me at:
jcarrola@swri.org. Thanks for any time you can spare. Regards
As I read the problem statement I immediately wanted to mention Optimal Binary Search Trees (http://en.wikipedia.org/wiki/Binary_search_tree#Optimal_binary_search_trees).
As it turns out, the publication "Darts, Dice, and Coins" already does it. Argh!
I don't think Huffman coding yields suitable trees for this problem, since the trees for random variate generation have the additional requirement of being ordered. Optimal alphabetic binary trees or optimal BSTs require a different algorithm that is not a variation of Huffman coding:
ReplyDelete"This is also known as the Hu-Tucker problem, after the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem, which has some similarities to Huffman algorithm, but is not a variation of [Huffman coding]. These optimal alphabetic binary trees are often used as binary search trees."
Very interesting insight. This is essentially an application of the source coding theorem, which says that one can find a code that is arbitrarily close to the entropy (which is also the lower bound).
Computing the entropy of p (=1.54 in this case) then tells us that, using this method, we have to make about 1.5 decisions in the tree, on average, to generate one sample.
How about the prime number distribution? What's the entropy, and how could the primes be stored with minimal bits?
(The question might seem simple but it's quite fundamental math.)