Thursday, December 15, 2011

Making Async 5x Faster

In this article I discuss why F# Async is a good fit for writing concurrent software on .NET and show how to implement your own Async specialized for low-concurrency use. As a sample application, I look at a simple CML-style blocking channel. About 30-50 lines of custom async and thread-pool implementation increase throughput from 100-400 K to 1 M messages per second.

Concurrency? Again?

It is hard to believe that after so many man-years of computer science research anyone would still have questions about how to write concurrent software. But I do, if only out of ignorance.

My specific problem right now is writing a decently scalable web server. It is really nothing more than developing a sensible bridge API between a web application and the OS, with a bit of parsing thrown in. All the heavy lifting is already done by IO completion ports, available to .NET code through the System.Net.Sockets.Socket and SocketAsyncEventArgs types.

Blocking and Non-Blocking IO

In a web server, there are two flows of data: one flow comes from the client via the server to the application, and the other goes in the reverse direction. The imperative programming answer to this is the System.IO.Stream object, which your app uses to read and write. It is easy to use, but there are problems.

The first obvious problem is that writes cannot really block. If you attempt to write some data, the client socket may not be ready; in fact, it may not become ready for a while. Making the thread wait for it would be lovely, but .NET threads are expensive. Not a good idea if one is to cope with thousands of concurrent clients.

Fine, how about non-blocking IO? It seems to solve the problem, as application code resumes immediately. But then where does the data go? We need a buffer of unbounded size somewhere to hold it. The application receives no feedback when it produces data too quickly. A pathological case would be generating an 8 GB file.

Reading from a System.IO.Stream poses even more problems. What if the data is not there yet? Blocking wastes a thread; not blocking makes the application retry, but requires programming arbitrary timeouts or busy-waits. In light of this, it seems much more reasonable to invert control and have the server write data to the application as it arrives.

First-Class Continuations

From now on let us consider designing a writer that does not block but also does not require an unbounded buffer, throttling the data producer if the consumer does not keep up. This is an instance of the classic tradeoff between asynchronous and synchronous channels.

Ideally the language would support efficient first-class continuations. Then one would reify the continuation and shove it somewhere, to resume computation later when the client is ready:

(call-with-current-continuation save-for-later)

This, as far as I understand, is indeed how John Reppy's Concurrent ML was originally implemented. CML first-class channels are unbuffered: you can read and write on them, but both writers and readers block until they are matched. Sounds good, but is this not back to blocking threads again?

Enter Async

The answer greatly depends on what you define as a thread. CML used lightweight threads built on first-class continuations, benchmarking at millions of synchronizations per second. Without first-class continuations, what can .NET do? The answer is F# Async: we can model threads as chains of callbacks; then they become first-class and capturable.

As a first cut, let us forget about Async and write a simple channel that accepts continuation functions in its Read and Write operations. If a match is available, the continuation is invoked immediately; otherwise it is queued for later. Locking is designed to minimize the critical section:
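The original listing is not reproduced here, so below is a minimal sketch of such a channel. The names (Channel, Read, Write, spawn) are my own choices, and the placeholder spawn simply uses the default .NET thread pool; the note that follows explains why spawn is there at all.

```fsharp
open System.Collections.Generic

// Placeholder spawn: fork a callback onto the default .NET thread pool.
let spawn (work: unit -> unit) =
    System.Threading.ThreadPool.QueueUserWorkItem(fun _ -> work ()) |> ignore

// An unbuffered CML-style channel in continuation-passing style.
// Readers and writers queue up until a match arrives; the matched
// continuations are then run via spawn, outside the lock.
type Channel<'T>() =
    let readers = Queue<'T -> unit>()
    let writers = Queue<'T * (unit -> unit)>()
    let root = obj ()

    member this.Read(ok: 'T -> unit) =
        let task =
            lock root (fun () ->
                if writers.Count = 0 then
                    readers.Enqueue ok
                    None
                else
                    let (value, wk) = writers.Dequeue()
                    Some(fun () -> ok value; wk ()))
        match task with
        | Some t -> spawn t
        | None -> ()

    member this.Write(value: 'T, ok: unit -> unit) =
        let task =
            lock root (fun () ->
                if readers.Count = 0 then
                    writers.Enqueue(value, ok)
                    None
                else
                    let rk = readers.Dequeue()
                    Some(fun () -> rk value; ok ()))
        match task with
        | Some t -> spawn t
        | None -> ()
```

Note how only the queue manipulation happens under the lock; the continuations themselves always run outside the critical section.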

Note that the code assumes a spawn : (unit -> unit) -> unit function that forks off the computation on a thread pool. It is just a form of trampolining: one could call the continuation immediately, but that makes it a non-tail call and can lead to stack overflow. You could probably work around the need to do this.

Benchmarking Message-Passing

Now let us check out the implementation with F# Async. Note that it is very easy to write adapters for the Read and Write methods on the channel to give them an Async signature. F# async syntax also makes it very easy to define new "threads", which is exactly what they are conceptually. Although async code may jump between real .NET threads, it is logically a single thread, which makes it safe, for example, to use single-threaded data structures.
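The adapters can be as small as a pair of Async.FromContinuations wrappers. A sketch, assuming the continuation-passing channel has members Read : ('T -> unit) -> unit and Write : 'T * (unit -> unit) -> unit (the member names AsyncRead and AsyncWrite are mine):

```fsharp
type Channel<'T> with

    // The error and cancellation continuations are ignored: these
    // channel operations cannot fail and are never cancelled.
    member this.AsyncRead : Async<'T> =
        Async.FromContinuations(fun (ok, _, _) -> this.Read ok)

    member this.AsyncWrite(value: 'T) : Async<unit> =
        Async.FromContinuations(fun (ok, _, _) -> this.Write(value, ok))
```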

For the benchmark we define two async chains: one pushes a series of numbers on the channel and another one reads them and computes their sum. Then we run them in parallel.
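In code, the benchmark might look like this (a sketch; AsyncRead and AsyncWrite are assumed Async adapters over the channel's Read and Write, and each message passed counts as one hop):

```fsharp
let benchmark (chan: Channel<int>) (n: int) =
    // Producer: push a series of n numbers through the channel.
    let writer =
        async {
            for i in 1 .. n do
                do! chan.AsyncWrite i
        }
    // Consumer: read n numbers off the channel and sum them.
    let reader =
        async {
            let sum = ref 0
            for _ in 1 .. n do
                let! x = chan.AsyncRead
                sum := !sum + x
            return !sum
        }
    // Run both in parallel; block only for the final sum.
    Async.Start writer
    Async.RunSynchronously reader
```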

Great. I have been getting 100 K, 200 K, sometimes, if I am very lucky, up to 400 K hops per second on this benchmark. That sounds like a good enough result, and it might be for some applications. However, the benchmark is using all 4 of my cores. In fact, they seem to spend a jolly lot of time coordinating an essentially sequential computation. More on this later.

Async is The Answer

Performance issues aside for the moment, the take-home message I want you to get is this: Async is the answer to writing concurrent code on .NET. The writer interface I talked about earlier would essentially be a form of channel passing byte array segments. The key is coordination: Async makes it easy to write and coordinate code, and it does not block .NET-level threads.

But.. How About Performance?

Sceptics would say Async is a performance hog here. Just think of all the callbacks it allocates, the GC pressure, the thread hopping and work stealing, and what not! The SocketAsyncEventArgs examples out there seem to offer plenty of evidence that one can design socket servers that do not allocate at all, somehow chaining the components together with preallocated callbacks.

I thought a lot about this and even tried to write some of those optimized solutions. Upon reflection, the criticism is probably valid, but it is a comment about the implementation, not the concept. Async provides a general way to orchestrate concurrent code that is easy to read, understand, and proofread: I simply do not see a viable general alternative. Attempting to specialize code by hand seems to yield impossible-to-maintain spaghetti. If you know a competitive way to structure such code sensibly, I will be very happy to hear about it.

For Real Performance Junkies

For real performance junkies like me, the lesson is this: if you are unhappy with the performance of beautiful Async code, before rewriting it to spaghetti consider specializing the Async implementation. In fact, try to get the F# compiler to expand the code to spaghetti instead of doing it manually. F# is not as good as GHC at this, but its inline facility does seem to help.

What exactly do we want from Async here? Not much. Throw cancellation and exception handling out the window. Throw thread hopping and parallelism out the window. Assume that all our callbacks execute quickly and without exceptions. Inline everything in the hope that F# will reduce at least some of the intermediate lambdas at compile time and lower GC pressure. With this in mind, I got something like this:
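The listing is not shown here, but the core of such a specialization fits in a few lines: represent an async as nothing more than a function taking a success continuation, and mark every builder member inline. A sketch of the idea, not the article's exact code:

```fsharp
// A stripped-down async: no cancellation, no exception handling,
// no thread hopping. Just success-continuation passing.
type FastAsync<'T> = ('T -> unit) -> unit

type FastAsyncBuilder() =
    member inline this.Return(x: 'T) : FastAsync<'T> =
        fun k -> k x
    member inline this.ReturnFrom(a: FastAsync<'T>) : FastAsync<'T> = a
    member inline this.Bind(a: FastAsync<'T>, f: 'T -> FastAsync<'U>) : FastAsync<'U> =
        fun k -> a (fun x -> f x k)
    member inline this.Delay(f: unit -> FastAsync<'T>) : FastAsync<'T> =
        fun k -> f () k
    member inline this.Zero() : FastAsync<unit> =
        fun k -> k ()

let fastAsync = FastAsyncBuilder()
```

With inlining, a chain of binds has a fighting chance of being reduced to plain nested calls at compile time, instead of a heap-allocated callback per step.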

The idea is that since we are not using all the features of F# async, our specialized implementation will be a better fit. Incidentally, the same applies to the thread pool. Instead of scheduling many tasks, let us just run one thread (you can have N if you want) dedicated to running these short-lived callbacks in a tight loop. Forget work stealing, cancellation, and so on: our objective is to maximize the performance of mostly-sequential processes. Here is my humble attempt:
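Again the original listing is omitted, so here is a sketch of the shape such a pool might take: one dedicated thread draining a queue of callbacks in a tight loop. BlockingCollection is one possible queue choice (an assumption on my part; the article's version may well differ):

```fsharp
open System.Collections.Concurrent
open System.Threading

type Pool() =
    let work = new BlockingCollection<unit -> unit>(ConcurrentQueue<unit -> unit>())
    do
        // A single dedicated background thread running callbacks in a tight loop.
        let t =
            Thread(fun () ->
                for job in work.GetConsumingEnumerable() do
                    job ())
        t.IsBackground <- true
        t.Start()
    // Drop-in replacement for the spawn function used by the channel.
    member this.Spawn(job: unit -> unit) = work.Add job
```

Because one thread owns all the callbacks, the mostly-sequential message-passing process never migrates between cores, which is exactly what the benchmark numbers below reward.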

Drumroll..

With these changes, I managed to get up to 1 M hops. A lot better. The key observation is that having a single thread in the pool is critical: as more threads fight for the work, performance quickly decreases, since the process is essentially sequential and likes to run as such. Inlining in async also seems to help, but just a little. Having a simpler async helps too, but the vanilla one can be tolerated.

Is this result relevant to web servers? I hope so. It seems to me that web servers run mostly-sequential processes, albeit a lot of them in parallel. Dumping all of those on a shared pool would make the work jump around a lot and would probably be detrimental to throughput. Is specializing Async necessary to help here? Probably not. I will be glad to hear of ways to set thread or core affinity, or something like that, to get similar improvements.

In conclusion, Async is the right tool for concurrent code, but you might want to tailor it a bit to your application's need.

Full code of the article:

4 comments:

  1. Nice article, it reminds me of all the experimentation I did years ago with SocketAsyncEventArgs. This is going back to my C# days, pre-.NET 4.0 :-)

    One of the best performing solutions was to use a chain of pipelines to do a transformation of the data using... not the thread pool but dedicated threads, slim user-mode locking constructs, and compare-and-exchange. Looks like you are walking down a similar road.

  2. Indeed :) I have been reading your pipelets articles and re-reading them a few times. That's where I learned about SemaphoreSlim class too :)

    Is this code the one I should look at btw to get a good grasp of latest and greatest pipelets?

    https://github.com/7sharp9/pipelets/blob/master/src/Pipelets/pipelets.fs

    What bugs me about them is primarily that I do not quite understand them too well. I will play around more, and we'll see if I end up using pipelets after all :)

    One particular thing that bugs me a lot is this bounded buffer idea in the pipelet. Buffer overrun occurs, and then what? The .NET thread waits on the semaphore? Can it do other work or it just gets stuck?

    With channels implemented as above, you might say that the channel also has buffers. But the channel buffer is not explicitly bounded; it is worth noting that it can only grow up to the number of reader or writer async threads, not the number of messages. The semaphore above can be thrown out together with RunSynchronously (it is not really necessary). One can happily run everything on a single thread; after all, Async works in single-threaded OCaml.. The only other place where buffering and synchronization are needed is the scheduler (Pool).

  3. I think the pipelets articles were unfinished; I'm sure I have a part 4 to write for them. I'm in the process of reviewing pipelets, so the final post on the subject should surface sooner rather than later...

    In that branch of the code there was a problem with the closure retaining memory on each recursion. I quickly updated it so it doesn't now, but I haven't tested it. The master branch in fractureIO is probably the same; I have been doing a lot of work in my local branch and it's fixed in there, I just haven't checked it in. If you caught my last blog post, this is mentioned in there.

    Ok, the reason for blocking in the pipelets is that if left unbounded, and your processing rate is lower than the incoming rate, the queue grows and grows. You would want to put a finite limit on this, as once the data in there gets to a certain age it can become invalid, for example if the data is GPS coordinate updates.

    Another area that can become a problem is that if your IO completion port is left unserviced for a length of time, an IO thread gets spawned to try to deal with it. This leads to the allocation of another stack for each thread; you can quickly run out of memory on a 32-bit machine if you are getting lots of simultaneous requests over a prolonged period.

    One of the best ways I found to deal with these issues was to get the data out of the IO completion port as fast as possible and only store a certain amount of pending data. Any data surpassing these parameters is potentially causing a denial of service; once the send and receive buffers are blown in the listening socket, it would have to be closed.

    Everything in this area is a balancing act: managing memory buffers, managing socket buffers, time-outs, composability etc. You have to find the right balance for your application.

    p.s. One thing missing in pipelets is the overflow and exception pipelines; these are used to store the overflowed or exception data respectively. You can deal with this any way you choose: retry at a later date, log, etc.

  4. This is the latest committed one: https://github.com/7sharp9/fracture/blob/master/src/lib/fracture/Pipelets.fs
