Wednesday, September 28, 2011

Transforming Large Unions in F#

Writing compilers is not really a niche activity. If you think about it, every programmer does precisely that: defining a specific language for a given domain and then explaining its semantics to the computer. F#, ML, and Haskell programmers know that union types are a life-saver for dealing with languages systematically. These types do, however, sometimes grow large, becoming less and less pleasant to work with.

If you have to deal with a 30-case recursive union type, or two or three mutually recursive union types of 20-30 cases each, consider writing a transform function: a shallow map over a node's immediate children. This simple addition lets your more complex transformations focus on the cases they really care about, leaving the boilerplate work to transform. The gist below provides an extended example.

(*
We start designing a representation for some functional language.
Keeping it simple, we use strings to represent identifiers (to be
changed later by something more fancy), define a few primitives,
constants, and finally expressions.
*)
type Id = string
type Constant =
    | Bool of bool
    | Double of double
    | Int of int
    | String of string
type Primitive =
    | Add
    | Divide
    | Multiply
    | Subtract
type Expression =
    | Apply of E * E
    | Call of Primitive * list<E>
    | Constant of Constant
    | If of E * E * E
    | Lambda of Id * E
    | Let of Id * E * E
    | Var of Id
and private E = Expression
(*
The compiler will have to do quite a few transformations on this
data structure. Let us start with something simple, like getting the
set of free identifiers in an expression. Trying to keep it as simple
as possible, we just write a plain recursive function:
*)
let rec getFreeIdentifiers (expr: Expression) =
    let (!) = getFreeIdentifiers
    let (!!) = List.map (!)
    match expr with
    | Apply (f, x) ->
        !f + !x
    | Constant _ ->
        Set.empty
    | Call (_, xs) ->
        Set.unionMany !!xs
    | If (c, t, e) ->
        !c + !t + !e
    | Let (id, value, body) ->
        Set.union !value (Set.remove id !body)
    | Lambda (id, body) ->
        Set.remove id !body
    | Var id ->
        Set.singleton id
(*
Let us now consider a transformation of type `E -> E`. For example,
note that in our untyped language `Apply (Lambda (x, y), z)` is
semantically equivalent to `Let (x, z, y)`.
As an aside, this kind of reasoning is quite dangerous! You better be
sure you are not fooling yourself by imagining a semantic equivalence
where there is none. Using proof assistants such as Coq is a great
help.
We can normalize our representation by writing a transformation that
finds all reducible expressions and converts them to `Let`
expressions:
*)
let rec removeRedexes (expr: Expression) =
    let (!) = removeRedexes
    let (!!) = List.map (!)
    match expr with
    | Apply (f, x) ->
        match !f, !x with
        | Lambda (id, body), value -> Let (id, value, body)
        | f, x -> Apply (f, x)
    | Constant _ ->
        expr
    | Call (p, xs) ->
        Call (p, !!xs)
    | If (c, t, e) ->
        If (!c, !t, !e)
    | Let (id, value, body) ->
        Let (id, !value, !body)
    | Lambda (id, body) ->
        Lambda (id, !body)
    | Var _ ->
        expr
(*
The above code is clear, but there is a maintainability problem. When
the `Expression` definition changes over time, when it grows to 30 or
so cases, or when we write a lot of transformations, this pattern
matching becomes very tedious. Note also that in both examples only a
few cases were interesting, that is, relevant to the algorithm, and
the rest were processed in a generic fashion.
The practical solution is to extract the common pattern into a single
function that maps over the expression's immediate children. It
greatly reduces the boilerplate code.
*)
let transform (!) expr =
    let (!!) = List.map (!)
    match expr with
    | Apply (f, x) -> Apply (!f, !x)
    | Call (p, xs) -> Call (p, !!xs)
    | Constant _ -> expr
    | If (c, t, e) -> If (!c, !t, !e)
    | Lambda (v, b) -> Lambda (v, !b)
    | Let (v, x, y) -> Let (v, !x, !y)
    | Var _ -> expr
(*
The `transform` function has to be touched every time the `Expression`
definition changes, and its length is proportional to that of `Expression`.
But then it is very straightforward, and other transformations become
much shorter, for example:
*)
let rec removeRedexesShort (expr: Expression) =
    let (!) = removeRedexesShort
    match expr with
    | Apply (f, x) ->
        match !f, !x with
        | Lambda (id, body), value -> Let (id, value, body)
        | f, x -> Apply (f, x)
    | _ ->
        transform (!) expr
(*
A truly lazy programmer can define a `fold` over immediate
sub-expressions in terms of `transform`, so that `fold` does not have
to be touched when `Expression` changes:
*)
let fold (f: 'S -> E -> 'S) (z: 'S) (expr: E) : 'S =
    let r = ref z
    // g threads the accumulator through a ref cell and returns each child
    // unchanged, so transform can be reused purely for its traversal
    let g expr =
        r := f !r expr
        expr
    transform g expr
    |> ignore
    !r
(*
The simple `fold` helps to collect free identifiers in
accumulator-fashion, again ignoring boring cases and focusing only on
what is relevant:
*)
let getFreeIdentifiersShort (expr: E) =
    let rec free (acc: Set<Id>) (expr: E) =
        match expr with
        | Let (id, value, body) ->
            // collect the body's own free set first, so removing the bound
            // id cannot disturb identifiers already in the accumulator
            let bodyFree = Set.remove id (free Set.empty body)
            free (Set.union acc bodyFree) value
        | Lambda (id, body) ->
            Set.union acc (Set.remove id (free Set.empty body))
        | Var id ->
            Set.add id acc
        | _ ->
            fold free acc expr
    free Set.empty expr
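(*
As a quick sanity check (illustrative, not part of the original gist),
the free identifiers of `let y = x in fun z -> add y w` should be
exactly `x` and `w`:
*)
let example =
    Let ("y", Var "x",
         Lambda ("z", Call (Add, [Var "y"; Var "w"])))

let exampleFree = getFreeIdentifiersShort example
// evaluates to: set ["w"; "x"]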


There is, of course, a downside to using such shortcuts. The compiler no longer forces you to re-read and mentally check all transformations when you change the Expression definition, for example by adding new cases. If you add LetRecursive, say, the code will still compile, but it will be broken because of the catch-all pattern in getFreeIdentifiersShort.
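To make that concrete, here is a sketch (not from the original post) of the case that would silently be missed, assuming a hypothetical constructor LetRecursive of Id * E * E in which the identifier is bound in both the value and the body. Without this explicit case, getFreeIdentifiersShort falls through to the catch-all and reports the bound identifier as free:

| LetRecursive (id, value, body) ->
    // id is bound in both sub-expressions, so remove it from both free sets
    Set.union acc
        (Set.remove id
            (Set.union (free Set.empty value) (free Set.empty body)))

The exhaustive getFreeIdentifiers, by contrast, would at least produce an incomplete-match warning pointing at the new constructor, as would transform itself.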

Some readers will note that these transformations can be thought of as cata-, ana-, para-, or some-such morphisms and treated directly as such. Others will perhaps suggest using reflection or meta-programming to derive transform, or even more complex transforms, automatically. As always, you should use what works for you.

As for me, I find that I am not yet comfortable with climbing the abstraction ladder much further. It seems to inevitably involve some loss of control. It helps a lot to have explicit control over doing top-down, bottom-up, or mixed traversals, as evident in recursive function definitions. The technique also scales easily to languages with expression-statement distinction. Most importantly, my brain does not explode when I look at such code the way it does when dealing with monstrous morphisms.
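To illustrate what that explicit control looks like, here is a small sketch (not from the original post): generic bottom-up and top-down traversals are one-liners on top of transform, and which one a given pass uses remains a local, visible decision.

// Illustrative sketch: generic traversal helpers built on transform.
// bottomUp rewrites children first, then the node; topDown does the reverse.
let rec bottomUp f expr = f (transform (bottomUp f) expr)
let rec topDown f expr = transform (topDown f) (f expr)

For instance, removeRedexesShort above is essentially a bottom-up pass with a single interesting case.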

To summarize, factoring out the mapping over immediate sub-expressions helps a lot to reduce boilerplate in compiler code while keeping this code relatively easy to read and understand. The technique, although presented in F#, applies equally well to any language with algebraic datatypes.

Is all this practical? We found it very useful in WebSharper, the F#->JavaScript platform my company is developing. There we use union types to represent JavaScript syntax, a simplified optimisation-friendly JavaScript-like language we call Core, and F# itself, since starting from version 2.4 we use our own quotation type in an attempt to avoid reflection.

Tuesday, September 27, 2011

A 30-40% Faster and Leaner WebSharper 2.4 is Coming

We are currently finalizing the 2.4 release of WebSharper. This release is mainly concerned with bug fixes and quality-of-implementation improvements. The exciting part is that we are seeing roughly a 30-35% reduction in generated code size and a 40% reduction in compilation time.

Getting there was rather tedious. One key refactoring was replacing System.Reflection with Mono.Cecil as the main workhorse for processing F# assemblies. To use Mono.Cecil, and also to avoid some nasty reflection bugs in F# itself, I had to rewrite quotation metadata parsing by peeking at the F# compiler sources. I also had to refactor most of the data structures to stop assuming System.Reflection availability.
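For flavour, here is a minimal sketch (not WebSharper's actual code) of reading assembly metadata with Mono.Cecil instead of System.Reflection; the assembly file is inspected without being loaded into the running AppDomain:

open Mono.Cecil

// Enumerate the full names of the top-level types in an assembly file.
let listTypeNames (path: string) =
    let assembly = AssemblyDefinition.ReadAssembly path
    [ for t in assembly.MainModule.Types -> t.FullName ]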

To get better code generation, I revised our Core language for simplicity. Both Core and the associated optimizer are now simpler, and the compiler optimizations are more focused, targeting the common use cases.

In particular, local tail-call elimination combined with uncurrying finally allows for automated unrolling of recursive functional code into JavaScript loops. The following code, for example, is now safe for stack use:

[<JavaScript>]
let Factorial (n: int) =
    let rec fac n acc =
        match n with
        | 0 | 1 -> acc
        | n -> fac (n - 1) (acc * n)
    fac n 1
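For contrast, here is an illustrative sketch (not from the post) of a version that is not tail-recursive: the multiplication happens after the recursive call returns, so local tail-call elimination does not apply and the recursion still consumes stack.

[<JavaScript>]
let FactorialNaive (n: int) =
    let rec fac n =
        match n with
        | 0 | 1 -> 1
        | n -> n * fac (n - 1)   // not a tail call: the result is multiplied afterwards
    fac n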


I am thankful to my colleagues for their effort in building the F# community site, FPish. This codebase provided an invaluable resource for testing and optimizing the compiler on a large project.

I feel that I have learned a lot on this project. It turns out that reading about compilers is not quite the same as writing compilers! You only appreciate this truth after trying. I hardly discovered anything you might call a research contribution, but there are some practical little techniques specialized to F#/.NET that made my life easier and might help you too. Stay tuned for discussions of an approach to maintaining and traversing large F# union types, considerations for designing data structures, the memoizing fixpoint pattern, ideas for organizing large F# codebases, and other things along those lines.

Monday, September 26, 2011

Firefox fails on nested closures at 16 layers

How deep do you think you can nest your closures in JavaScript? Yes, we know there is no tail recursion and probably no inlining either; we know we should not do this, but sometimes we still do. Before today I expected something like 64-128 levels of nesting to work out fine. In fact, Chrome chokes only at around level 512, due to stack overflow. Fine.

EDIT: There is Bug #517781 reported 2009-09-20, and STILL NOT ADDRESSED. Please upvote it if you can.

But Firefox... at level 16! ReferenceError: x16 is not defined

/// generated JavaScript for k=16
function bind(x,f) {return f(x)}
function test(x0) {
  var k0;
  k0 = function (x1) {
    var k1;
    k1 = function (x2) {
      var k2;
      k2 = function (x3) {
        var k3;
        k3 = function (x4) {
          var k4;
          k4 = function (x5) {
            var k5;
            k5 = function (x6) {
              var k6;
              k6 = function (x7) {
                var k7;
                k7 = function (x8) {
                  var k8;
                  k8 = function (x9) {
                    var k9;
                    k9 = function (x10) {
                      var k10;
                      k10 = function (x11) {
                        var k11;
                        k11 = function (x12) {
                          var k12;
                          k12 = function (x13) {
                            var k13;
                            k13 = function (x14) {
                              var k14;
                              k14 = function (x15) {
                                var k15;
                                k15 = function (x16) {
                                  var k16;
                                  k16 = function (x17) {
                                    return x16 + 1;
                                  };
                                  return bind(x16,k16)
                                };
                                return bind(x15,k15)
                              };
                              return bind(x14,k14)
                            };
                            return bind(x13,k13)
                          };
                          return bind(x12,k12)
                        };
                        return bind(x11,k11)
                      };
                      return bind(x10,k10)
                    };
                    return bind(x9,k9)
                  };
                  return bind(x8,k8)
                };
                return bind(x7,k7)
              };
              return bind(x6,k6)
            };
            return bind(x5,k5)
          };
          return bind(x4,k4)
        };
        return bind(x3,k3)
      };
      return bind(x2,k2)
    };
    return bind(x1,k1)
  };
  return bind(x0,k0)
}


This is on Firefox 6.0.2. Shame.

Sunday, September 25, 2011

Earley Parsing in Haskell

After a bit of a struggle I can run simple parsing examples using my Earley algorithm implementation [1] in Haskell. At this stage the implementation is likely flawed, with respect to both correctness and performance. Regardless, the point I wanted to prove to myself is that it is perfectly plausible to define context-free grammars and their interpretation in plain Haskell, without resorting to code generation or giving up full grammar analysis. I suspect this is a well-known method, but it was a pleasure to discover. See for yourself:

{-# OPTIONS -XDoRec -XDeriveDataTypeable #-}
module Example where

import Text.Earley
import Data.Dynamic
import Control.Applicative

data E = Nat Int
       | Add E E deriving (Show, Typeable)

grammar :: Grammar (Rule Char E)
grammar = do
  nat <- rule "NAT"
    [ fmap (\_ -> 0) (term '0')
    , fmap (\_ -> 1) (term '1')
    ]
  rec expr <- rule "EXPR"
        [ fmap Nat $ var nat
        , pure (\x _ y -> Add x y)
            <*> var expr
            <*> term '+'
            <*> var expr
        ]
  return expr


Aycock and Horspool's paper [2] was my source for the algorithm. I admit I did not get very far, pretty much ignoring (for now) the bulk of their paper and the automaton construction, and focusing on their review of the basics. I also had to adjust things a bit to fit Haskell's purity, and devised a slightly modified (and possibly faulty) interaction of completer and predictor.

Earley's algorithm is beautiful: very simple, fully general (all CFGs!), cubic in the worst case but close to linear on practical grammars, and, perhaps most importantly, incremental (think completion in an IDE). A highly recommended exercise.

[1] https://github.com/toyvo/haskell-earley
[2] http://webhome.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf