I discovered a memory leak in Async.StartChild; here I discuss a workaround based on a Future abstraction.
I noticed what appears to be a memory leak in the F# standard library function Async.StartChild. This happened in the context of a socket server, where I attempted to perform socket reading and writing in parallel. Memory use slowly grows, and a memory profiler points to some CancellationTokenSource-related objects not being released.
As a non-leaking alternative, I used my own abstractions. The basic idea is to use synchronizable events. Unfortunately, Event is already used in F# to mean something different, so I will use the word Future instead. If you know F# events, the key problem is that subscribing to an event after it has fired is meaningless; for example, this code produces nothing:
let test () =
    let e = Event<_>()
    e.Trigger(1)
    e.Publish.Add(printf "%i")
In contrast, Future objects retain the value. For
simplicity, I allow subscribing and triggering only once. In
addition, the sample includes Executor code. I found by
experimentation that running short-lived coordination tasks on a
single thread instead of the ThreadPool is
beneficial. Enjoy:
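What follows is a minimal sketch of the Future idea under the constraints above (single trigger, single subscription); the locking details and member names are illustrative rather than the original code:

// A write-once cell that retains its value, so that subscribing
// after the trigger still fires. Illustrative sketch.
type Future<'T>() =
    let sync = obj ()
    let mutable value : option<'T> = None
    let mutable handler : option<'T -> unit> = None

    // Triggers at most once; runs the handler if one is already waiting.
    member this.Trigger(x: 'T) =
        let run =
            lock sync (fun () ->
                match value with
                | Some _ -> failwith "Future: already triggered"
                | None ->
                    value <- Some x
                    handler)
        match run with
        | Some k -> k x
        | None -> ()

    // Subscribes at most once; unlike an F# event, fires immediately
    // if the value is already available.
    member this.Subscribe(k: 'T -> unit) =
        let fire =
            lock sync (fun () ->
                match handler with
                | Some _ -> failwith "Future: already subscribed"
                | None ->
                    match value with
                    | Some x -> Some x
                    | None ->
                        handler <- Some k
                        None)
        match fire with
        | Some x -> k x
        | None -> ()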
When WebSharper compiles F# to JavaScript, it preserves namespace, module, and class nesting to make it easy to navigate the compiled code from a JavaScript shell. Roughly speaking, an identifier A.B.C.D.E in F# can be found by typing A.B.C.D.E in JavaScript.
This poses a challenge: as you can imagine, emitting long qualified identifiers everywhere is not a good idea for compact code generation. To save space, WebSharper 2.4 does class/module interning. The basic idea is to say L=Microsoft.FSharp.Core.ListModule once and then say L.ofArray at every use site.
An example of this in action can be seen below:
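Schematically (this is an illustration of the idea, not the actual WebSharper 2.4 output):

// F# source:
let xs = List.ofArray [| 1; 2; 3 |]

// Shape of the compiled JavaScript, with the module interned once:
//   var L = Microsoft.FSharp.Core.ListModule;
//   var xs = L.ofArray([1, 2, 3]);
// rather than repeating Microsoft.FSharp.Core.ListModule.ofArray at
// every use site.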
A recent FPish discussion focused on some hacks available in F# for writing code that resembles using Haskell type classes. I particularly enjoyed the comments by Gustavo Leon and Loic Denuziere.
To cut a long story short, before compiling to .NET, F# expands methods declared inline and does overload resolution. This was intended to support flexible operator overloading, but it opens the door to interesting hacks. Even code that generalizes over higher kinds, and therefore cannot exist at the .NET level, can exist at the inline F# level with these hacks. Although these remain hacks (ugly code, unreadable inferred types, fragility when type inference stops working, difficulty in scaling to more complicated examples), I cannot help but enjoy them. Here is a writeup that infers JSON serialization automatically over lists at the inline level:
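A condensed sketch of the technique (the Json type and the instances shown here are illustrative, not the writeup's exact code):

type Json = Json with
    static member ($) (Json, x: int) = string x
    static member ($) (Json, x: string) = "\"" + x + "\""
    // The list instance is inline, so the constraint that the element
    // type itself has an instance is resolved at each use site.
    static member inline ($) (Json, xs: list<_>) =
        "[" + String.concat "," [ for x in xs -> Json $ x ] + "]"

let inline toJson x = Json $ x

// toJson 1             evaluates to "1"
// toJson ["a"; "b"]    evaluates to "[\"a\",\"b\"]"
// toJson [[1]; [2; 3]] evaluates to "[[1],[2,3]]"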
In this article I discuss why F# Async is a good thing for
writing concurrent software on .NET and show how to implement your
own Async specialized for low-concurrency use. As a sample
application, I look at a simple CML-style blocking channel. 30-50
lines of custom async and threadpool implementation increase the
throughput from 100-400 K to 1M messages a second.
Concurrency? Again?
It is hard to believe that after so many man-years of computer science research anyone would still have questions about how to write concurrent software. But I do, if only out of ignorance.
My specific problem right now is writing a decently scalable web server. It is really nothing more than developing a sensible bridge API between a web application and the OS, with a bit of parsing thrown in. All the heavy lifting is already done by IO completion ports, available to .NET code through the System.Net.Sockets.Socket and SocketAsyncEventArgs types.
Blocking and Non-Blocking IO
In a web server, there are two flows of data: one comes from the client via the server to the application, and the other goes in the reverse direction. The imperative-programming answer to this is the System.IO.Stream object, which your app uses to read and write. It is easy to use, but there are problems.
The first obvious problem is that writes cannot really block. If you attempt to write some data, the client socket may not be ready. In fact, it may not become ready for a while. Making the thread wait for it would be lovely, but .NET threads are expensive: not a good idea if one is to cope with thousands of concurrent clients.
Fine, how about non-blocking IO? It seems to solve the problem, as application code resumes immediately. But what then, where does the data go? We need a buffer of unbounded size somewhere to hold it. What happens here is that the application receives no feedback when it produces data too quickly. A pathological case would be generating an 8 GB file.
Reading from a System.IO.Stream poses even more problems. What if the data is not there yet? Blocking wastes a thread; not blocking makes the application retry, which requires programming arbitrary timeouts or busy-waits. In light of this, it seems much more reasonable to invert control and have the server write data to the application as it arrives.
First-Class Continuations
From now on let us consider designing a writer that does not
block but also does not require an unbounded buffer, throttling the
data producer if the consumer does not keep up. This is an instance
of the classic tradeoff between asynchronous and synchronous
channels.
Ideally the language would support efficient first-class
continuations. Then one would reify the continuation and shove it
somewhere, to resume computation later when the client is ready:
(call-with-current-continuation save-for-later)
This, as far as I understand, is indeed how John Reppy's Concurrent
ML was originally implemented. CML first-class channels are
unbuffered: you can read and write on them, but both writers and
readers block until they are matched. Sounds good, but is this not
back to blocking threads again?
Enter Async
The answer greatly depends on what you define as a thread. CML used light-weight threads built on first-class continuations, benchmarking at millions of synchronizations per second. Without first-class continuations, what can .NET do? The answer is F# async: we can model threads as chains of callbacks; then they become first-class and capturable.
As a first cut, let us forget about Async. Consider just a simple channel that accepts continuation functions in its Read and Write operations. If a match is available, the continuation is invoked immediately; otherwise it is queued for later. Locking is designed to minimize the critical-section region:
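Here is a sketch (a reconstruction of the idea rather than the original code; spawn is the helper discussed in the note below):

open System.Collections.Generic
open System.Threading

// Forks a computation on the .NET thread pool; see the note below.
let spawn (work: unit -> unit) =
    ThreadPool.QueueUserWorkItem(fun _ -> work ()) |> ignore

// An unbuffered channel: readers and writers are queued until matched.
// The lock only guards the queues; matched continuations run outside it.
type Channel<'T>() =
    let sync = obj ()
    let readers = Queue<'T -> unit>()
    let writers = Queue<'T * (unit -> unit)>()

    member this.Read(ok: 'T -> unit) =
        let task =
            lock sync (fun () ->
                if writers.Count > 0 then
                    let (value, wk) = writers.Dequeue()
                    Some (fun () -> wk (); ok value)
                else
                    readers.Enqueue ok
                    None)
        match task with
        | Some t -> spawn t
        | None -> ()

    member this.Write(value: 'T, ok: unit -> unit) =
        let task =
            lock sync (fun () ->
                if readers.Count > 0 then
                    let rk = readers.Dequeue()
                    Some (fun () -> ok (); rk value)
                else
                    writers.Enqueue((value, ok))
                    None)
        match task with
        | Some t -> spawn t
        | None -> ()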
Note that the code assumes a spawn : (unit -> unit) -> unit function that forks off the computation on a thread pool. It is just a form of trampolining: one could call the continuation immediately, but that makes it a non-tail call and can lead to stack overflow. You could probably work around the need to do this.
Benchmarking Message-Passing
Now let us check out the implementation with F# Async. Note that it is very easy to write adapters for the Read and Write methods on the channel to give them an Async signature. F# async syntax also makes it very easy to define new "threads" - that is exactly what they are conceptually. Although async code may jump between real .NET threads, it is logically a single thread, which allows one, for example, to safely use single-threaded data structures.
For the benchmark we define two async chains: one pushes a series of numbers onto the channel, and the other reads them and computes their sum. Then we run them in parallel.
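In sketch form, building on the Channel above (a reconstruction; the adapters use Async.FromContinuations):

// Adapters giving the channel an Async signature.
type Channel<'T> with
    member this.AsyncRead() =
        Async.FromContinuations(fun (ok, _, _) -> this.Read ok)
    member this.AsyncWrite(value: 'T) =
        Async.FromContinuations(fun (ok, _, _) -> this.Write(value, ok))

// One async pushes n numbers; the other reads them and sums them up.
let benchmark (n: int) =
    let chan = Channel<int>()
    let producer =
        async {
            for i in 1 .. n do
                do! chan.AsyncWrite i
        }
    let consumer =
        let rec loop i acc =
            async {
                if i > n then
                    return acc
                else
                    let! x = chan.AsyncRead()
                    return! loop (i + 1) (acc + x)
            }
        loop 1 0
    Async.Start producer
    Async.RunSynchronously consumer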
Great. I have been getting 100, 200, sometimes if I am very lucky up to 400 K hops on this benchmark. That sounds like a good enough result, and it might be for some applications. However, the benchmark appears to use all 4 of my cores. In fact, they seem to spend a jolly lot of time coordinating an essentially sequential computation. More on this later.
Async is The Answer
Performance issues aside for the moment, the take-home message I want you to get is this: Async is the answer to how to write concurrent code on .NET. The writer interface I talked about earlier would essentially be a form of channel passing byte-array segments. The key is coordination: Async makes it easy to write and coordinate code, and it does not block .NET-level threads.
But... How About Performance?
Sceptics would say Async is a performance hog here. Just think of all the callbacks it allocates, the GC pressure, the thread hopping and work stealing, and whatnot! SocketAsyncEventArgs examples out there seem to offer plenty of evidence that one can design socket servers that do not allocate at all, somehow chaining the components together with preallocated callbacks.
I thought a lot about this and even tried to write some of those optimized solutions. Upon reflection, the criticism is probably valid, but it is more a comment about the implementation than the concept. Async provides a general way to orchestrate concurrent code that is easy to read, understand, and proofread: I simply do not see a viable general alternative. Attempting to specialize code by hand seems to yield spaghetti that is impossible to maintain and fix. If you know a competitive way to structure such code sensibly, I will be very happy to hear about it.
For Real Performance Junkies
For real performance junkies like me, the lesson is this: if you are unhappy with the performance of beautiful Async code, before rewriting it to spaghetti, consider specializing the Async implementation. In fact, try to get the F# compiler to expand the code to spaghetti instead of doing it manually. F# is not as good as GHC at this, but its inline facility does seem to help.
What exactly do we want from Async here? Not much. Throw cancellation and exception handling out the window. Throw thread hopping and parallelism out the window. Assume that all our callbacks execute quickly and without exceptions. Inline everything in the hope that F# will reduce at least some of the intermediate lambdas at compile time and lower GC pressure. With this in mind, I got something like this:
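Something along these lines (a sketch; the real code differs in details such as how aggressively it inlines):

// A specialized async: a computation is just a function taking a
// success continuation. No cancellation, no exception handling,
// no thread hopping.
type Micro<'T> = ('T -> unit) -> unit

type MicroBuilder() =
    member inline this.Return(x: 'T) : Micro<'T> = fun k -> k x
    member inline this.ReturnFrom(m: Micro<'T>) : Micro<'T> = m
    member inline this.Bind(m: Micro<'T>, f: 'T -> Micro<'U>) : Micro<'U> =
        fun k -> m (fun x -> f x k)

let micro = MicroBuilder()

// Channel adapters in the specialized async:
let read (chan: Channel<'T>) : Micro<'T> = fun k -> chan.Read k
let write (chan: Channel<'T>) (x: 'T) : Micro<unit> = fun k -> chan.Write(x, k)

// Example:
//   micro { let! x = read chan
//           return x + 1 }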
The idea is that since we are not using all the features of F# async, our specialized implementation will be a better fit. Incidentally, the same applies to the thread pool. Instead of scheduling many tasks, let us just run one thread (you can have N if you want) dedicated to running these short-lived callbacks in a tight loop. Forget work stealing, cancellation, and so on: our objective is to maximize the performance of mostly-sequential processes. Here is my humble attempt:
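In outline (a reconstruction of the approach, with illustrative names):

open System.Collections.Generic
open System.Threading

// A dedicated pool: threadCount threads loop over a queue of
// short-lived callbacks. No work stealing, no cancellation.
type Pool(threadCount: int) =
    let sync = obj ()
    let queue = Queue<unit -> unit>()
    let rec loop () =
        let work =
            lock sync (fun () ->
                while queue.Count = 0 do
                    Monitor.Wait sync |> ignore
                queue.Dequeue())
        work ()
        loop ()
    do
        for _ in 1 .. threadCount do
            Thread(ThreadStart(fun () -> loop ()), IsBackground = true).Start()

    // Replaces the ThreadPool-based spawn used by the channel code.
    member this.Spawn(work: unit -> unit) =
        lock sync (fun () ->
            queue.Enqueue work
            Monitor.Pulse sync)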
With these changes, I managed to get up to 1M hops. A lot better. The key observation is that having a single thread in the pool is critical. As more threads fight for the work, performance quickly decreases: the process is essentially sequential and likes to run as such. Inlining in the async also seems to help, but just a little. Having a simpler async helps too, but the vanilla one can be tolerated.
Is this result relevant to web servers? I hope so. It seems to me that web servers run mostly-sequential processes, albeit a lot of them in parallel. Dumping all of those on a shared pool would make the work jump around a lot and would probably be detrimental to throughput. Is specializing Async necessary here? Probably not. I will be glad to hear of ways to set thread or core affinity or something like that to get similar improvements.
In conclusion, Async is the right tool for concurrent code, but you might want to tailor it a bit to your application's needs.
The full code of the article is collected in the accompanying gist.
I found some more time to study Coq. One place where I stumble very frequently is case analysis of value-indexed inductive types. There are often cases that lead to contradiction, and other cases that intuitively imply some variables to be equal. The vanilla match construct gives you no help, leaving you stranded. In proof mode, the inversion tactic helps a lot; however, the generated proofs are huge and hard to navigate. Another option is to use dependent pattern matching directly (which is what inversion generates for you).
I was eager to figure this out because there are times when proof mode is too confusing for me as a beginner, or I seem to lack the control to define precisely what I want. In particular, if you do code extraction, you do not want to use the convoluted inversion-generated code. But up until today I had little success. Here is the first, and very modest, success story: retrieving the head of a non-empty length-indexed list, and discharging the Nil case with dependent pattern matching:
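The definition is along these lines (a sketch; my actual code may differ in detail):

Inductive vector (A : Type) : nat -> Type :=
| Vnil : vector A 0
| Vcons : A -> forall n, vector A n -> vector A (S n).

(* The return annotation depends on the length index m: in the
   impossible Vnil case it collapses to unit, which tt discharges. *)
Definition head (A : Type) (n : nat) (v : vector A (S n)) : A :=
  match v in vector _ m return match m with 0 => unit | S _ => A end with
  | Vnil _ => tt
  | Vcons _ x _ _ => x
  end.

(* Eval compute in head nat 1 (Vcons nat 7 1 (Vcons nat 3 0 (Vnil nat))). *)
(* = 7 : nat *)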
Don Syme and several others sent tweets to correct my claims about boxing in F#. I made the mistake of assuming that structs cannot be used as interface implementations without being boxed and allocated on the heap. As it turns out, .NET generic constraints allow structs to implement interfaces without being boxed. Out of sheer curiosity, I attempted to (ab)use this facility to systematically remove heap allocation. I decided to write a simple functional program that would not use the heap at all, performing all computation on the stack.
I went for encoding lists and implementing list reversal:
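Here is a sketch of the encoding (simplified and reconstructed, not the exact benchmark code): lists are generic structs, so a list of length k has its own .NET type, and reversal hands its result to a visitor because the result type depends on the input.

type IListUser<'T, 'R> =
    // Receives a stack list of some concrete struct type 'L.
    abstract Use<'L when 'L :> IList<'T> and 'L : struct> : 'L -> 'R

and IList<'T> =
    abstract Length : int
    // Reverses the list onto the accumulator, then hands the
    // result to the consumer.
    abstract RevOnto<'Acc, 'R when 'Acc :> IList<'T> and 'Acc : struct> :
        'Acc * IListUser<'T, 'R> -> 'R

[<Struct>]
type Nil<'T> =
    interface IList<'T> with
        member this.Length = 0
        member this.RevOnto(acc, user) = user.Use acc

[<Struct>]
type Cons<'T, 'L when 'L :> IList<'T> and 'L : struct> =
    val Head : 'T
    val Tail : 'L
    new (head, tail) = { Head = head; Tail = tail }
    interface IList<'T> with
        member this.Length = 1 + this.Tail.Length
        member this.RevOnto(acc, user) =
            // Polymorphic recursion: the accumulator type grows a Cons
            // layer at every step, so the CLR specializes this method
            // for every list length.
            this.Tail.RevOnto(Cons(this.Head, acc), user)

// Example consumer: measures the reversed list.
[<Struct>]
type GetLength<'T> =
    interface IListUser<'T, int> with
        member this.Use xs = xs.Length

let test () =
    // The type of xs is Cons<int, Cons<int, Cons<int, Nil<int>>>>.
    let xs = Cons(1, Cons(2, Cons(3, Nil())))
    // The cast boxes the top-level struct once; a generic helper with
    // a struct constraint avoids even that. The nodes stay on the stack.
    (xs :> IList<int>).RevOnto(Nil(), GetLength())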
This method of getting rid of heap allocation would not work in C, as this particular use of generics implies runtime code generation: as far as I understand, the CLR here generates new code paths for every possible list length.
In my little benchmark, stack-allocated lists of 5 elements are reversed a bit more slowly than heap-allocated lists. But it is fun to try, and FSI reports zero GC usage.