Wednesday, February 9, 2011

Home-made Regular Expressions in F#: Thompson NFA

Russ Cox re-discovered and popularized an old (originating in a 1968 paper by Thompson) technique for implementing regular expressions: http://swtch.com/~rsc/regexp/regexp1.html. The article offers a wonderfully simple to understand presentation, and I just could not resist trying it in F#.

First, the regular expressions themselves: the source language can be nicely described with a union type, encompassing the empty string, choice, concatenation, the Kleene star, and a token parser. The  algorithm is general enough to accept the Filter case, which generalizes over character classes and the like.

The first step is to compile this definition to a non-deterministic finite automaton (NFA). As a first cut, the state of an NFA could look like this:

When writing the compiler function, however, I realized that I would want two more features for the NFA state. First, compiling the Kleene star makes it hard to tie the knot, making one wish the language was Haskell. As a workaround, I simulate mutable union fields with a helper record. Second, it is nice to be able to compare state values to use binary search trees later in the game. To do this, I label every state with an integer and use it for comparison:

The compiler looks like this:

The key insight is that the NFA corresponds to a deterministic (DFA) machine. Whenever our machine encounters choice (Branch), it follows both options simultaneously. So then, the state of this machine (DFA state) is a set of NFA states. You can easily write a transition function that takes an input token and produces a transformation on the DFA state. Moreover, this function can be cached by token and input state. Once caching is introduced, we obtain a simple yet fascinating, lazily-constructed DFA machine.

The definition of the DFA state is not surprising:

In addition to token-based transitions, I also cache IsMatch information which can be obtained by traversing the NFA states and testing if any of them contains a Match case.

Here is the sketch of the transition function:

What remains is to write an interpreter for the DFA. I will post it in the next article when it is more tested. Perhaps I will also simplify the above definitions a bit. It is typical for me to find that the first code that I write for any problem is much more complicated than necessary.

4 comments:

  1. Nice. I'm fairly new to F# and functional programming (currently reading "Programming F#" from O'Reilly) and just finished a course on "Formal Languages and Automata" where I've implemented the Thompson algorithm and others in C# to construct something similar to a "parser" but without using a formal grammar to analyze some text, just checking if a given string is accepted by the DFA-minimized which was obtained from the original regex.

    Steps that I did:
    0) Convert the Regex to Infix notation (add explicit concatenation, i.e. dot operator)
    1) Convert the Regex to Postfix notation
    2) Regex to NFA-lambda/epsilon using Thompson’s Construction Algorithm
    3) NFA-lambda/eplsilon to DFA using Subset Construction
    4) DFA to DFA-Minimized by breaking the set of all states first into two groups, accepting states and non-accepting states and then partitioning the groups further more if any distinctive states are found.

    Looking forward to your implementation in F#, as I'm still learning how "to think functionally" ;-)

    ReplyDelete
  2. Your implementation has a chance of being much more efficient than this one. The concept is the same, with these differences:

    0) I do not use string notation but combinators to represent the expressions, as is common in FP.

    1) Subset construction is carried out lazily, upon request. This means that the DFA states that are never reached on your input will never be constructed.

    2) There is no minimization.

    ReplyDelete
  3. Nice post.
    for your information. the F# code from your blog is not displayed in my Firefox 3.6.8.
    I have also tried to implement the subset construction algorithm (http://ffogd.blogspot.com/2011/01/f-subset-construction-algorithm.html) in a bit of a different context.

    ReplyDelete
  4. oh sorry that was my fault firefox settings.

    >the F# code from your blog is not displayed in my Firefox 3.6.8.

    ReplyDelete