Russ Cox re-discovered and popularized an old (originating in a 1968 paper by Thompson) technique for implementing regular expressions: http://swtch.com/~rsc/regexp/regexp1.html. The article offers a wonderfully simple to understand presentation, and I just could not resist trying it in F#.
First, the regular expressions themselves: the source language can be nicely described with a union type, encompassing the empty string, choice, concatenation, the Kleene star, and a token parser. The algorithm is general enough to accept the Filter case, which generalizes over character classes and the like.
The first step is to compile this definition to a non-deterministic finite automaton (NFA). As a first cut, the state of an NFA could look like this:
When writing the compiler function, however, I realized that I would want two more features for the NFA state. First, compiling the Kleene star makes it hard to tie the knot, making one wish the language was Haskell. As a workaround, I simulate mutable union fields with a helper record. Second, it is nice to be able to compare state values to use binary search trees later in the game. To do this, I label every state with an integer and use it for comparison:
The compiler looks like this:
The key insight is that the NFA corresponds to a deterministic (DFA) machine. Whenever our machine encounters choice (Branch), it follows both options simultaneously. So then, the state of this machine (DFA state) is a set of NFA states. You can easily write a transition function that takes an input token and produces a transformation on the DFA state. Moreover, this function can be cached by token and input state. Once caching is introduced, we obtain a simple yet fascinating, lazily-constructed DFA machine.
The definition of the DFA state is not surprising:
In addition to token-based transitions, I also cache IsMatch information which can be obtained by traversing the NFA states and testing if any of them contains a Match case.
Here is the sketch of the transition function:
What remains is to write an interpreter for the DFA. I will post it in the next article when it is more tested. Perhaps I will also simplify the above definitions a bit. It is typical for me to find that the first code that I write for any problem is much more complicated than necessary.