Friday, May 27, 2005

Derivatives of Regular Expressions

This paper from 1964 has an interesting algorithm for converting regular expressions into finite state automata. I've just figured out what it's all about.

The idea is that if you have a regular expression over some alphabet of symbols you can differentiate it with respect to a symbol. Basically, if R is a regular expression then dR/da is a regular expression that matches all of the strings that match R when prepended with an a. For example da/da={1} (where 1 = empty string), db/da = 0 (where 0 = the empty set and b!=a), d(a*)/da = a* and d(abc)/da = bc. You can write down a complete set of rules for differentiating regular expressions that are analogous to the Leibniz rules for differentiation.

Now here's the first crucial point: Given a regular epxression, R, you can compute repeated derivatives in an infinite number of ways by repeatedly differentiating with respect to different symbols. Many of these expressions turn out to match the same sets of strings, and when they do they are said to be equivalent. So we may form equivalences classes of derivatives. It turns out that the set of equivalent classes is finite.

Here's the next crucial point: differentiating an element of one of these classes with respect to a symbol gives you a representative of another class. So make a directed graph whose vertices are these classes and whose edges are labelled by symbols. We draw an edge from [S] to [T], with label a, if dS/da = T. This graph is none other than the graph of the finite state automaton to recognise matches with R.

The catch is recognising when [S]=[T], ie. when S and T match the same strings. But if we're building a machine we don't have to get this exactly right. We can use sloppy methods to try to identify [S]=[T]. It turns out that it's easy to produce an algorithm that doesn't guarantee to spot when they recognise the same strings, but that gives us a weaker equivalence which still results in a finite number of classes. This might give us a suboptimal automaton, but it still works.

This approach also leads to an elegant way to extend regular expressions with some other useful operations such as 'interleave' and 'complement'. ('interleave' should be familiar to anyone who's read Hoare's book on Communicating Sequential Processes.)

For further information see JH Conway's book on regular machines. Mark Hopkins posted an implementation in C on comp.compilers in 1994. Mark Hopkins also made this post to comp.theory which fills in the details I left out above. (Search USENET for postings by Mark Hopkins, they are frequently very insightful.)

No comments:

Blog Archive