Friday, November 11, 2011

Optimizing the Heck Out of F#: HTTP Request Parsing

As part of the WebSharper web server effort, I have been writing an HTTP request parser. Tuning the parser for the common simple case (a small, correct HTTP request) has improved throughput roughly 8-fold, from 30K to 250K sample requests parsed per second on a single core of my Core i3. Let me review what I have learned from this.

Indexing


Accessing array elements goes through a bounds check. Unmanaged C++ code clearly wins here. C# has unsafe regions, but F# does not. So what can we do in F# to be competitive? The only option left is using bulk operations from the standard library. The BCL is not at all helpful here - it is not clear from the documentation which functions drop the check, and many operations one would want are simply missing.

For an example where it matters, I was not able to match the performance of System.Array.FindIndex with any F# code I wrote to do the same job.
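To illustrate the two shapes of code being compared - this is a sketch, not the original parser code - the hand-written loop pays a bounds check on every access, while the BCL call does the whole traversal internally:

let findManual (buffer: byte[]) (value: byte) =
    let mutable i = 0
    let mutable found = -1
    while found < 0 && i < buffer.Length do
        // every buffer.[i] access is subject to a bounds check
        if buffer.[i] = value then found <- i
        i <- i + 1
    found

let findBcl (buffer: byte[]) (value: byte) =
    // the traversal happens inside the BCL in a single call
    System.Array.FindIndex(buffer, fun b -> b = value)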

I imagine this is a killer problem for numerical computing. With unavoidable bounds checking, one really cannot hope to write numerical code in safe managed .NET that matches Fortran routines.

Specialization


Generic code takes a staggering performance hit when certain simple operations such as equality do not get specialized to the simple value type you are using. Polymorphism has a cost. Inline F# functions sometimes help here. But it is unfortunate that there is no flag to monomorphise some code. MLton users, I envy you here.
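A sketch of what this looks like in practice (illustrative, not the original code): the plain generic function goes through generic equality, while the inline version is specialized at each call site, so for bytes the comparison compiles down to a primitive one:

let countEqual (x: 'T) (xs: 'T[]) =
    let mutable n = 0
    for v in xs do
        // generic equality - not specialized to the element type
        if v = x then n <- n + 1
    n

let inline countEqualFast (x: 'T) (xs: 'T[]) =
    let mutable n = 0
    for v in xs do
        // inlined at the call site, so a byte comparison stays a byte comparison
        if v = x then n <- n + 1
    n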

Value Types


Using value types such as structs and enums reduces GC pressure. Note, however, that they still get boxed sometimes. For example, if a struct implements an interface, code that expects the interface will receive the struct boxed. To avoid this, the code has to be specialized to the struct type.
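A small illustration (the names are made up, not taken from the parser): passing the struct where the interface is expected boxes it, while code written against the struct type does not:

type IToken =
    abstract Length : int

[<Struct>]
type Token =
    val Start : int
    val Length : int
    new (start, length) = { Start = start; Length = length }
    interface IToken with
        member this.Length = this.Length

// the Token argument is boxed at every call to satisfy IToken
let lengthViaInterface (t: IToken) = t.Length

// specialized to the struct - no boxing
let lengthViaStruct (t: Token) = t.Length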

Mutation


If we care about every bit of performance, mutation matters. However, I found myself wasting lots of time trying to wrap my head around the problem by thinking about it in terms of mutation from the start. Clearly, that was premature optimization. What I found more helpful was writing a purely functional solution and then transforming it to eliminate allocations and introduce mutation.
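To illustrate the kind of transformation (a made-up example, not the parser itself): start with a pure version that allocates its result, then rewrite it to fill a caller-owned record and report success with a flag:

// pure version: allocates an option and a tuple on every call
let splitHeader (line: string) =
    let i = line.IndexOf ':'
    if i < 0 then None
    else Some (line.Substring(0, i), line.Substring(i + 1).Trim())

// transformed version: writes into a pre-allocated record
type Header = { mutable Name : string; mutable Value : string }

let readHeader (line: string) (header: Header) =
    let i = line.IndexOf ':'
    if i < 0 then false
    else
        header.Name <- line.Substring(0, i)
        header.Value <- line.Substring(i + 1).Trim()
        true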

Note also that the GC is good enough in most cases. One cannot afford a heap allocation for every byte, but allocating short-lived objects does not matter much until you need to do it 100K times a second.

Profiling


Profiling is a life-saver. I used the SlimTune profiler this time. My first discovery was that using System.Collections.Specialized.NameValueCollection for headers is really expensive. It spends a lot of time computing case-insensitive hash values for the header keys. What a bother, especially when the application does not even look into the headers. I settled for queuing the headers instead and exposing them as a sequence.

The profiler helps you spend your time effectively - optimizing what really matters.

Specifics of HTTP Request Parsing


The problem is rather simple: HTTP requests keep arriving and need to be parsed and forwarded to the responder thread. In the keep-alive scenario many requests arrive on the same socket. If there is pipelining, they all come at once.

What I wanted to solve here is parsing the requests incrementally, so that if half of a request arrives we say OK and suspend in mid-air until more data is available.

Iteratees are the general solution here. However, iteratees allocate on the heap, and F#, unlike Haskell, does not do any program-transformation magic to simplify them away. For this reason they do not seem to be the ideal solution, at least at the byte level.

What I ended up doing instead with incomplete requests is re-parsing. The parsing logic is expressed over a TextReader-like interface, and the parser return codes are Done, Error, or Waiting. If the parser says Waiting, I keep the data in the buffer; if it succeeds, the data is discarded. Errors cannot be recovered from.

To some extent micro-parsers can be combined without using the heap. The trick here is to use mutation to return the result on success. Since the return code is an enum, I can join parsers with `&&&`:

parseMethod r req
&&& skipChar r ' '
&&& parseUntil r ' ' &req.uri
&&& parseVersion r req
&&& parseHeaders r req

In case of an early error, parsing does not stop, but there is no reason to care since most requests are well-formed.
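For the `&&&` trick to work, the numeric values have to be chosen so that any Error wins, any Waiting beats Done, and only an all-Done chain stays Done. One such encoding (illustrative values and name, not the actual definition) would be:

type ParseCode =
    | Error   = 0   // 00 - absorbs anything it is combined with
    | Waiting = 1   // 01 - Done &&& Waiting = Waiting, so the chain asks for more data
    | Done    = 3   // 11 - only Done &&& Done remains Done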

To work with a TextReader-like interface and avoid allocation, I use a constant-space ring buffer that acts as a limited-size queue for bytes. Most servers limit the size of the request head to 8192 bytes, and this is what I do as well. The buffer provides its own TextReader that assumes ASCII encoding.
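A rough sketch of the shape of such a buffer (simplified, with invented names; the real one also exposes its TextReader):

type RingBuffer(capacity: int) =
    let data = Array.zeroCreate<byte> capacity
    let mutable start = 0      // index of the first unread byte
    let mutable count = 0      // number of unread bytes

    member __.Count = count

    // enqueue incoming bytes; the request head must fit in the fixed capacity
    member __.Write(bytes: byte[], offset: int, length: int) =
        if count + length > capacity then failwith "request head too large"
        for i in 0 .. length - 1 do
            data.[(start + count + i) % capacity] <- bytes.[offset + i]
        count <- count + length

    // dequeue a single byte, or -1 when no data is buffered
    member __.ReadByte() =
        if count = 0 then -1
        else
            let b = data.[start]
            start <- (start + 1) % capacity
            count <- count - 1
            int b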

The most rewarding optimization was adding custom methods to the buffer and its reader to support parseUntil and r.ReadLine. Instead of going byte-by-byte through several layers of indirection, I switched to System.Array.IndexOf. A ring buffer needs to do this at most twice per operation.
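The search over the wrapped data needs at most two bulk calls; a sketch of the idea, expressed as a function over the fields of the hypothetical buffer above:

// returns the offset of value relative to the read position, or -1 if absent
let indexOf (data: byte[]) (start: int) (count: int) (value: byte) =
    let capacity = data.Length
    let firstLen = min count (capacity - start)
    // first segment: from the read position to the end of the underlying array
    let i = System.Array.IndexOf(data, value, start, firstLen)
    if i >= 0 then i - start
    else
        // second segment: the wrapped-around part at the start of the array
        let j = System.Array.IndexOf(data, value, 0, count - firstLen)
        if j >= 0 then firstLen + j else -1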

6 comments:

  1. With no practical evidence to prove it (as in, I did not do the tests myself, yet), CLR JIT does eliminate bounds checking when it can. See http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx for reference. So in theory it should be possible to match System.Array.IndexOf performance with pure F# code.

  2. I so wish I could hook this into Fracture. :) I keep coming to the same conclusions as you about iteratee. It was a nice exercise, but I don't think it is really do-able in F#.

  3. @Arseny, thanks for the pointer, I did see this article. Traversing complete arrays without a bounds check sounds doable with a for loop. Unfortunately it does not reassure me very much that I can eliminate the bounds check when working with an ArraySegment structure.

    @Ryan, do not despair, what really would matter is running two versions of the code side-by-side. Naive iteratee parsing that allocates something on the heap for every byte will obviously not be able to compete. But vectorized iteratee that allocates once per say 1024 bytes.. Might not be so bad..

  4. It might be nice to build something like this underneath iteratee, making iteratee an api only, similar to your use of &&&.

  5. Is none of this being open sourced :(

  6. There's been some talk of open-sourcing WebSharper. This webserver stuff will eventually be part of WebSharper so it applies to it as well. License terms will likely be draconian though, like AGPL, to encourage users to buy the commercial license.
