
Saturday, March 23, 2013

Using Coq as a program optimization tool

Update: a slightly more compelling example of where program computation shines is at Another take on F# generics.

I tend to complain a lot about the programming tools I work with. In one sentence, the world of software is just not a sane place. If left unchecked, this complaining attitude can grow into despair - so I enjoy a little escape from time to time. I need to dream of a better world, where programming is meaningful, fun, beautiful, efficient, mostly automated, and where programs are verified. This is why my escape activity is playing with Coq.

One fascinating trick that Coq does really well is computing and manipulating programs. Have you ever found yourself writing beautiful, general functional code and then wondering how to make it run faster? Spending lots of time inlining definitions, special-casing, performing simplifications and hoping you do not introduce bugs in the process?

This activity can be automated in a way that verifies meaning preservation.

Here is a rather silly example (I will try to think of a better and more convincing one):
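The original embedded snippet has not survived in this copy, so here is a minimal reconstruction of the idea (the function sum3 and its specification are my invention, not the original code). We ask Coq for a program together with a proof that it agrees with a generic specification, and let the tactics compute the specialized program:

```coq
Require Import List.
Import ListNotations.

(* A generic specification: the sum of a list. *)
Definition sum (xs : list nat) : nat := fold_right plus 0 xs.

(* Ask for a three-argument function together with a proof that it
   agrees with [sum] on three-element lists; the tactics compute
   the specialized program for us. *)
Definition sum3 :
  { f : nat -> nat -> nat -> nat
  | forall a b c, f a b c = sum [a; b; c] }.
Proof.
  eexists. intros a b c.
  unfold sum.    (* expose the fold *)
  simpl.         (* run it, leaving a + (b + (c + 0)) *)
  reflexivity.   (* instantiate f with the simplified program *)
Defined.
```

Here proj1_sig sum3 is the specialized program, and the proof of equivalence travels with it.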



The unfold-and-simplify steps in the middle of the proof should in theory be expressible as a single-word tactic. This is where the program simplification happens.

Imagine a better world... You write functional code, focus on clarity of specification, then drop into interactive theorem proving to compute (!) the equivalent huge, ugly, optimized, specialized program that runs a lot faster but is guaranteed to produce the same result.

EDIT: a reader has asked how manipulating programs in Coq is different from using an optimizing compiler such as GHC. The short answer (and I again realize just how lame my example is since it does not demonstrate it) is control.

With GHC, you get a black-box optimizer that does some transformations and gives you a result. The resulting program is usually better (faster, uses less space), but once in a while worse. It also may or may not do the essential optimization for your domain. I am not using GHC on a daily basis - so feel free to correct me, but I believe the amount of control you exert is fairly limited, even with RULES pragmas. In Coq, on the other hand, you get "proof mode" - an interactive environment where you can inspect and explore the intermediate versions of your program as it is being optimized, and guide the process. There is a spectrum of possibilities from fully manual to fully automated transformations, programmable in the Ltac tactic language.

Another advantage (which makes this attractive to me) is that if you are targeting ML (OCaml, F#), you are working with a compiler that refuses to do too much optimization, so GHC-like magic is simply not available. Here, extracting from Coq may come in handy.

Yet another advantage is that you are working in a more expressive logic. This opens the door to writing more general programs than your target compiler would admit. I have not explored this much yet, but it seems to be an easy way to bypass the lack of, say, higher-kinded polymorphism in F#.

Finally, I should also mention that, as a proof assistant, Coq makes it practical to construct proofs that your programs (and transformations) are correct with respect to a given semantics. I find this appealing, even though it is a bit beside the point here. As a beginner, my experience has been that proving any non-trivial lemma takes unbelievably more time than constructing the program, so for now I am just exploring the possibility of using Coq as a programming language without doing much proof development.

Thursday, July 19, 2012

Speeding up F# Printf.*

There was recently an interesting SO question on the F# Printf.* family of functions:

http://stackoverflow.com/questions/11559440/how-to-manage-debug-printing-in-f

It is known that these functions are very slow. Slow enough for most people to avoid them entirely, despite the advantages they offer in verifying argument types.

What I did not know is that these functions are so slow that a few lines of simple user code can speed them up, without changing the interface:
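The snippet itself is missing from this archived copy; below is a sketch along the same lines (the name sprintf' and the exact caching scheme are my assumptions, not necessarily the original code). The idea is to parse each format string once and cache the continuation that sprintf builds from it:

```fsharp
open System
open System.Collections.Generic

// Cache keyed by the format string plus the function type it parses
// to (since e.g. "%A" can be used at several types), storing the
// boxed continuation that sprintf builds.
let cache = Dictionary<string * Type, obj>()

// A drop-in replacement for sprintf that parses each format only once.
// NOTE: Dictionary is not thread-safe - hence the exercise below.
let sprintf' (fmt : Printf.StringFormat<'T>) : 'T =
    let key = (fmt.Value, typeof<'T>)
    match cache.TryGetValue key with
    | true, f -> unbox<'T> f
    | _ ->
        let f = sprintf fmt
        cache.[key] <- box f
        f
```

Usage is unchanged: sprintf' "%d-%s" 12 "x" works exactly like sprintf.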



With this code I get 2x to 10x speedups. This is suspicious: I am afraid F# does not cache the parsing of format strings, although there is no reason why it should not.

Exercise for the reader: fix the code above to be thread-safe.

Monday, December 5, 2011

Stack-Allocated Lists

Don Syme and several others sent tweets to correct my claims about boxing in F#. I made the mistake of assuming that structs cannot be used as interface implementations without being boxed and allocated on the heap. As it turns out, .NET generic constraints allow interface methods to be called on structs without boxing them. Out of sheer curiosity, I attempted to (ab)use this facility to systematically remove heap allocation. I decided to write a simple functional program that would not use the heap at all, performing all computation on the stack.

I went for encoding lists and implementing list reversal:
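The embedded code has not survived here, so the following is my reconstruction of the technique rather than the original listing (all names are assumptions). Each cons cell is a struct whose tail type is a type parameter, so an entire list is one nested struct on the stack, and the struct constraint makes interface calls compile to constrained calls without boxing. Reversal has to be written in continuation-passing style, because the type of the reversed list depends on its length:

```fsharp
type IStackList<'T> =
    abstract Fold<'R> : ('R -> 'T -> 'R) -> 'R -> 'R
    // CPS reversal: the reversed list's type is unknown to the caller,
    // so it is handed to a polymorphic consumer instead of returned.
    abstract RevApp<'Acc, 'R when 'Acc : struct and 'Acc :> IStackList<'T>> :
        'Acc * IStackListConsumer<'T, 'R> -> 'R

and IStackListConsumer<'T, 'R> =
    abstract Consume<'L when 'L : struct and 'L :> IStackList<'T>> : 'L -> 'R

[<Struct>]
type Nil<'T> =
    interface IStackList<'T> with
        member __.Fold _ acc = acc
        member __.RevApp (acc, k) = k.Consume acc

[<Struct>]
type Cons<'T, 'L when 'L : struct and 'L :> IStackList<'T>> =
    val Head : 'T
    val Tail : 'L
    new (head, tail) = { Head = head; Tail = tail }
    interface IStackList<'T> with
        member this.Fold f acc = this.Tail.Fold f (f acc this.Head)
        member this.RevApp (acc, k) =
            this.Tail.RevApp (Cons(this.Head, acc), k)

// A consumer that sums the reversed list. As a class it is allocated
// (and the interface argument would box a struct anyway), but only
// once per reversal, not per cell.
type SumConsumer() =
    interface IStackListConsumer<int, int> with
        member __.Consume xs = xs.Fold (+) 0

// A generic driver keeps the list itself in constrained, unboxed calls.
let reverseThen<'T, 'R, 'L when 'L : struct and 'L :> IStackList<'T>>
        (xs : 'L) (k : IStackListConsumer<'T, 'R>) : 'R =
    xs.RevApp (Nil<'T>(), k)

let xs = Cons(1, Cons(2, Cons(3, Nil<int>())))
let total = reverseThen xs (SumConsumer())   // 6
```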



This method for getting rid of heap allocation would not translate to a language like C, as this particular use of generics implies runtime code generation. As far as I understand, the CLR generates new code paths for every list length that occurs in the program.

In my little benchmark, reversing stack-allocated lists of 5 elements is actually a bit slower than reversing heap-allocated lists. But it was fun to try, and FSI reports zero GC usage.

Friday, November 11, 2011

Optimizing the Heck Out of F#: HTTP Request Parsing

As part of the WebSharper web server effort, I have been writing an HTTP request parser. Tuning the parser for performance for the common simple case (small, correct HTTP request) has improved performance 8-fold, from 30K to 250K sample requests parsed every second on a single core of my Core i3. Let me review what I have learned from this.

Indexing


Accessing array elements goes through a bounds check. Unmanaged C++ code clearly wins here. C# has unsafe regions, but F# does not. So what can we do in F# to be competitive? The only option left is using bulk operations from the standard library. The BCL is not at all helpful here - it is not clear from the documentation which functions drop the check, and many operations one would want are simply missing.

As an example of where this matters: I was not able to match the performance of System.Array.FindIndex with any F# code I wrote to do the same job.
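To illustrate (my sketch, not code from the post): the hand-written scan below pays a bounds check on every access, while the equivalent bulk call is free to skip them internally:

```fsharp
// Hand-rolled scan: each data.[i] access is bounds-checked.
let findByte (b : byte) (data : byte[]) =
    let mutable i = 0
    let mutable res = -1
    while res < 0 && i < data.Length do
        if data.[i] = b then res <- i
        i <- i + 1
    res

// The bulk BCL operation doing the same job, typically faster:
let findByte' (b : byte) (data : byte[]) =
    System.Array.IndexOf(data, b)
```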

I imagine this is a killer problem for numerical computing. With unavoidable bounds checking, one cannot really hope to write numerical code in safe managed .NET that matches Fortran routines.

Specialization


Generic code gets a staggering performance hit when certain simple operations such as equality do not get specialized to the simple value type you are using. Polymorphism has a cost. Inline F# functions sometimes help here. But it is unfortunate there is no flag to monomorphise some code. MLton users, I envy you here.
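A small illustration of the effect (mine, not the post's code): in the first function below, (=) goes through F#'s generic equality machinery; marking the function inline expands it at each call site, specializing the comparison to the concrete type:

```fsharp
// Generic: equality on 'T is resolved through generic (structural)
// equality, which is slow for simple value types.
let count (x : 'T) (xs : 'T[]) =
    let mutable n = 0
    for y in xs do
        if y = x then n <- n + 1
    n

// Inline: the body is expanded at the call site, so for byte arrays
// (=) becomes a direct byte comparison.
let inline countInline (x : 'T) (xs : 'T[]) =
    let mutable n = 0
    for y in xs do
        if y = x then n <- n + 1
    n
```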

Value Types


Using value types such as structs and enums reduces GC pressure. Note, however, that they still get boxed sometimes. For example, if a struct implements an interface, code that expects an interface implementation will receive the struct boxed. To avoid the box, such code has to be made generic and specialized to structs.
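For example (a sketch of my own): the first function below receives the struct boxed, while the generic version avoids the box via a constrained call:

```fsharp
type IHasLength =
    abstract Length : int

[<Struct>]
type Segment =
    val Start : int
    val End : int
    new (s, e) = { Start = s; End = e }
    interface IHasLength with
        member this.Length = this.End - this.Start

// The struct is boxed at the call site to satisfy the interface type.
let lengthBoxed (x : IHasLength) = x.Length

// Generic and constrained to structs: dispatched without boxing.
let lengthGeneric<'T when 'T : struct and 'T :> IHasLength> (x : 'T) =
    x.Length
```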

Mutation


If we care about every bit of performance, mutation matters. However, I found myself wasting lots of time trying to wrap my head around a problem by thinking about it in terms of mutation from the start. Clearly, the diagnosis is premature optimization. What I found more helpful is writing a purely functional solution first and then transforming it to eliminate allocations and introduce mutation.
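As a toy example of that workflow (mine, not from the post): first the pure version, then a transformed one that allocates nothing:

```fsharp
// Pure first draft: clear, but allocates an enumerator and a closure.
let countSpaces (s : string) =
    s |> Seq.filter (fun c -> c = ' ') |> Seq.length

// Derived imperative version: same specification, no allocation.
let countSpaces' (s : string) =
    let mutable n = 0
    for i in 0 .. s.Length - 1 do
        if s.[i] = ' ' then n <- n + 1
    n
```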

Note also that the GC is good enough in most cases. One cannot afford to allocate on the heap for every byte, but allocating short-lived objects does not matter much until you need to do it 100K times a second.

Profiling


Profiling is a life-saver. I used the SlimTune profiler this time. My first discovery was that using System.Collections.Specialized.NameValueCollection for headers is really expensive: it spends a lot of time computing case-insensitive hash values for the header keys. What a bother, especially when the application does not even look into the headers. I settled for queuing the headers instead and exposing them as a sequence.

The profiler helps to spend your time effectively - optimizing what really matters.

Specifics of HTTP Request Parsing


The problem is rather simple: HTTP requests keep arriving and need to be parsed and forwarded to the responder thread. In the keep-alive scenario many requests arrive on the same socket. If there is pipelining, they all come at once.

What I wanted to solve here is parsing the requests incrementally, so that if half of a request arrives we say OK and suspend in mid-air until more data is available.

Iteratees are the general solution here. However, iteratees allocate on the heap, and F#, unlike Haskell, does not do any program transformation magic to simplify them away. For this reason they do not seem to be the ideal solution, at least at the byte level.

What I ended up doing instead with incomplete requests is re-parsing. The parsing logic is expressed over a TextReader-like interface. Parser return codes are Done, Error, or Waiting. If the parser says Waiting, I keep the data in the buffer. If it succeeds, the data is discarded. Errors cannot be recovered from.

To some extent micro-parsers can be combined without using the heap. The trick here is to use mutation to return the result on success. Since the return code is an enum, I can join parsers with `&&&`:

```fsharp
parseMethod r req
&&& skipChar r ' '
&&& parseUntil r ' ' &req.uri
&&& parseVersion r req
&&& parseHeaders r req
```

In case of an early error, parsing does not stop, but there is no reason to care since most requests are well-formed.
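The post does not show the enum itself, so here is one encoding (my assumption) under which `&&&` computes the "worst" of two results: Error wins over Waiting, which wins over Done:

```fsharp
// Bit patterns chosen so that bitwise AND acts as "worst result wins":
// Done &&& Waiting = Waiting, and anything &&& Error = Error.
type Code =
    | Error   = 0b00
    | Waiting = 0b01
    | Done    = 0b11

let combine (a : Code) (b : Code) = a &&& b

// combine Code.Done Code.Done     = Code.Done
// combine Code.Done Code.Waiting  = Code.Waiting
// combine Code.Waiting Code.Error = Code.Error
```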

To work with a TextReader-like interface and avoid allocation, I use a constant-space ring buffer that acts as a limited-size queue for bytes. Most servers limit the size of the request head to 8192 bytes, and this is what I do as well. The buffer provides its own TextReader that assumes ASCII encoding.
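A minimal sketch of such a buffer (my reconstruction; the real one also exposes the TextReader and the custom scanning methods mentioned below):

```fsharp
// Fixed 8192-byte ring buffer acting as a queue of bytes.
type RingBuffer() =
    let data = Array.zeroCreate<byte> 8192
    let mutable start = 0     // read position
    let mutable count = 0     // number of bytes stored
    member this.Count = count
    member this.Enqueue (b : byte) =
        if count = data.Length then failwith "request head too large"
        data.[(start + count) % data.Length] <- b
        count <- count + 1
    member this.Dequeue () : byte =
        if count = 0 then failwith "buffer is empty"
        let b = data.[start]
        start <- (start + 1) % data.Length
        count <- count - 1
        b
```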

The most rewarding optimization was adding custom methods to the buffer and its reader to make parseUntil and r.ReadLine possible. Instead of going byte-by-byte through several layers of indirection, I switched to System.Array.IndexOf. A ring buffer needs to do it at most twice per operation.
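For illustration (my code, not the original): the contents of a ring buffer span at most two contiguous segments of the backing array, so a scan is at most two System.Array.IndexOf calls:

```fsharp
// Find the logical offset of byte b in a ring buffer with the given
// backing array, read position and length; -1 if absent.
let indexOfByte (data : byte[]) (start : int) (count : int) (b : byte) =
    let firstLen = min count (data.Length - start)
    match System.Array.IndexOf(data, b, start, firstLen) with
    | -1 ->
        // Wrapped around: search the second segment at the array start.
        let rest = count - firstLen
        if rest = 0 then -1
        else
            match System.Array.IndexOf(data, b, 0, rest) with
            | -1 -> -1
            | j -> firstLen + j
    | i -> i - start
```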