We Don't Know How To Program…

…the current (and future!) crop of machines.  “Yeah, yeah, yeah,” I hear you say, “tell me something new”.  Instead I’m going to tell you something olde.

10 Years AgoGC was widely understood and rarely used.  People were skeptical that is was ready for production use.  Now GC is well understood and widely used.  It’s less buggy than malloc/free (e.g. no dangling pointer bugs, but still leaks) and in general is faster to do allocation (hard to beat parallel bump-pointer allocation).  The ‘Collection’ part of GC has gotten substantially faster in the last decade: parallel collectors are available from all major Java Virtual Machines.  Rarely do I see GC consuming more than 10% of total CPU time; usually it’s closer to 5%.  Concurrent & Incremental GC’s are mostly available and mostly work (e.g. Azul can demo sustained 40+ Gigabyte/sec allocation & collection with no GC pauses); hard-real-time GC’s are starting to see real applications.

10 Years Ago – JIT’ing – Just-In-Time Compiling – was considered controversial.  People widely claimed “the code quality will never approach static compilation” and “its too slow”.  These days it’s fairly easy to find programs where Java beats “gcc -O2” (and vice versa; we’re clearly at some kind of performance plateau).  I personally mostly laid these claims to rest: HotSpot remains “The JVM to Beat”; with world-beating records in a variety of areas.  My name still appears in HotSpot source base, buried down in the src/share/vm/opto directory files.

1__0 Years Ago – JVMs and Managed Runtimes (no .Net 10 years ago; v1.0 appeared in 2002) were unstable on dual-CPU machines (that’s being charitable: the bug list from that era was shocking; e.g. at one point we (the HotSpot team) discovered that emitting all memory barriers had been accidentally disabled on multi-cpu X86’s).  These days JVMs are rock solid on 4-8 CPU machines; this count of CPUs has been Sun’s bread-and-butter for years.  Indeed, HotSpot routinely runs well on 64 – 128 CPU counts from a variety of hardware vendors; Azul’s version of HotSpot runs well on 768 cpus.

10 Years Ago – The JDK was in it’s infancy; the libraries generally were broken on multi-threaded programs.  Then the Abstract Data Types (widely called the Collection Classes for Java) got synchronized (slow, but correct), then in 1.4 unsynchronized versions appeared (fast, but incorrect for multi-threaded usage), then in 1.5 fast & correct concurrent library routines appeared.  These days you can drop in a hash table that will scale linearly to 768 concurrent threads doing inserts & lookups, or write parallel-for loops with a little bit of stylized boiler-plate.

In Summary: The JVM, Collection Classes & GC are Ready for large-core-count machines.  The OS has been ready for a long time (Cray, SGI, IBM et al showed you could beat down all the multi-CPU OS bugs years ago).  In short: every CPU cycle the hardware guys make comes up through the OS, JVM (& JDK & GC) and is usefully exposed to the application writer.

But…

We Still Can’t Write the Million Line Concurrent Program.

I don’t mean the data-parallel (scientific) programming, where large datasets and large CPU counts have been the norm for years.  These guys generally have way fewer lines of code than the big complex Java apps; there’s some correlation there between bytes of dataset vs lines of code.  Both the scientific apps and big Java apps look at gigabytes of data; the problem is that the Java guys use 100x more lines of code – and it’s because they are doing something 100x more difficult to express in code.

I don’t mean non-main-stream programing languages; I mean programs as written by the millions of programmers today, using tools (both software and mental) that are widely available.  As for the specialty languages, Erlang is the only language I know of where million-line concurrent programs are written, maintained and used in a business-critical way on a daily basis.  Near as I can tell, all other specialty languages are used by a handful of academics or in a small-scale as experiments by desperate businesses.

Concurrent Programming is widely considered very hard; try googling “notoriously difficult concurrency” sometime.  Lots of folks are busy trying to figure it out – and it seems pretty clear to me: We Haven’t Figured It Out yet.  Here are some obvious differences between concurrent and serial programming that bear repeating:

  • “Parallelism” for Serial Programming is generally automatic and very fine-grained: e.g. out-of-order hardware or hit-under-miss cache.  Your normal code “just got faster” and the hardware auto-parallelized under the hood for you.  We mined out this kind of parallelism years ago, and just expect it (remember the speed-demon vs brainiac wars?)  “Parallelism” for Concurrent Programs generally means: task-level parallelism.  You gotta hack your code to expose it.
  • Testing provides Confidence for Serial Programming; you can get decent code coverage.  In practice, testing does little for concurrent programming.  Bugs that never appear in testing routinely appear in production – where the load issues are subtly different, or QA doesn’t have the same count of CPUs as production.  Indeed, in theory the situation is gruesome: there’s basically no chance Testing can cover the exponential explosion of possible code interleavings.  There’s a glimmer of hope here: there’s some evidence that the ‘n’ in the exponent can be reduced to 3 or 4 in many cases without missing too many real bugs.
  • Debugging serial programs can be made human-thought slow.  Time is NOT of the essence.  In concurrent programming, everything needs to be full-speed or the bug never appears – leading to the HeisenBug syndome.
  • Program reasoning is local for serial programs.  In concurrent programs, you have to reason about all the code in the system at once – because some other thread can be in every other piece of code concurrently with this thread.
  • Control flow is the main concern in serial programming; you ask questions like “how did I get here?” and “where do I go next?”.  Data flow is the main concern in concurrent programming; you ask questions like “who-the-flock changed this variable out from under me?”.
  • Timing is nothing in serial programming.  Nearly all languages have no way to express time, and program meaning is defined mathematically independent of time.  The same program can be faster or slower – which preserves it’s meaning but might change it’s usefulness.  Timing is everything in concurrent programming: correctness flows from well timed interleavings.  Changing that timing will often break programs that have functioned well for years.  Note that most languages (outside of Java) start out with a disclaimer like “in race-free programs, the meaning is this…”.  Most large concurrent programs have races, so most large concurrent (non-Java) programs have no theoretical meaning.
  • Nothing changes between statements in serial programs; it’s “as if” the Universe Stops and waits for you.  All of memory (could) change between “ticks” of the program’s clock in concurrent programming.  Just because you looked at some variables a nano-second ago, doesn’t mean those variables remain the same.
  • Serial programming has fundamental name-space control for code & data & their cross product.  Classes, Object-Oriented Programming, Functional Programming, lexical scopes, etc are all techniques to control access to either code, or data or both.  There is no equivalent notion for threads – any thread can touch any piece of data or run any piece of code.
  • The Tools are Mature, pretty much by definition.  It’s obvious that support for concurrent programming sucks right now, so therefore the existing (serial) tools must be mature.  Bleah.  Some examples: I don’t know of any widely used (and usable) tools for auto-extracting parallelism (“widely usable” excludes tools requiring a PhD, or a Cray).  I know of no tool which can do data-race detection on HotSpot: it’s too big, it’s too concurrent, it JIT’s code which has subtle race-invariants with the static code, it uses GC, it uses hardware exceptions, it uses lock-free idioms, etc.

What can we do, without changing our fundamental programming model, to use these extra cores to speed up our programs?  Some, but not much.

  • GC can become concurrent by default.  Other than the slowdown in the mutators to preserve the GC invariants, you’ll never see a GC pause.   Azul’s made a lot of hay here; “we solve GC problems”.  Stock JVMs on stock hardware are still “getting there” for production use – but clearly are heading that way.  This technique probably uses about 20% more CPUs to do the GC than you have mutators usefully producing garbage.
  • All codes are JIT’d in the best (most expensive) way.  JIT’ing parallelizes the same way as “pmake” parallelizes builds – you use a thread pool of background compiler threads to JIT furiously.  During a Big Application startup, Azul might see 50 compiler threads JIT’ing in the background, with many megabytes/sec of high quality code being produced.  Of course, post-startup this doesn’t speed up your programs anymore: there’s nothing more to compile.
  • Continuous profiling and introspection need only a fraction of a CPU.
  • More auto-parallelizing hardware tricks like the Run-Ahead Scout haven’t (yet) panned out: the “scout” is more like a blind idiot without warmed-up caches & branch-predictors.
  • Larger-scale speculation seems plausible for a bit, with combinations of hardware & runtime support.  While plain Speculative Lock Elision is a bit stiff on the hardware requirements, Azul’s version shows you can do this with a fairly small amount of hardware, plus some simple runtime support.  Azul’s SLE experience shows there’s a real upper bound to the amount of parallelism you can usefully expose here: the code really does communicate with data inside most locked regions.
  • There are also a bunch of tricks to auto-parallelize & speculate loop iterations, although these techniques probably only pay off for larger loop bodies running longer iteration counts than are normal for business Java.  i.e., this probably pays nicely for scientific apps but not XML-parsing or Web-Servers.

Moving outside of the “auto-parallelize” world, we see a bunch of coding styles to enable parallelism.  Transaction-level parallelism is common, with J2EE a large complex example of it.  This is really “chunk-o-data+work” kind of parallelism, where the “chunk-o-data” is much larger and more complex than in the standard scientific app.  Under this same vague notion of “chunk-o-data+work” we also see things like stream programming (GPGPU is the hardware-accelerated version of this).  Large Java uses a lot of pipeline-of-thread-pools style programming.

In general, performance analysis tools for complex concurrent programs remain very weak.  Tools to help programmers write in the pipeline-of-thread-pools style are totally lacking.  Classic tools tell me things like “your time goes in these routines; here’s the code being executed (alot)”.  What I want is things like “your threads can’t progress because they are blocked on this lock” (fairly easy: Azul shows this now), or “this thread pool is starved for work because that pool isn’t feeding it” (because that pools’ thread-count is capped too low, or it’s threads are being starved in turn from another pool, etc, etc).

But how well do these techniques work as we move from dozens of cores to hundreds (Azul’s already there!)?  Right now, the Big 3 Application Servers can rarely use more than 50-100 cores before they become choked up on internal locks- and that includes the 20% for GC and using SLE.  Maybe we go to individual programs communicating via sockets (SOA?).  But isn’t this just a very complex version of CSP?  Might we be better off just switching to CSP in the first place (or some hopefully more modern version)?

In short, the easy pickin’s have long gone, and now we need complex tools and complex coding styles to get more performance from more cores using the existing languages.  While I think we can go farther with what we have, it’s time to get serious about exploring serious language changes.  Somewhere, Out There, There’s a Way to write large programs without me having to sweat out Every Bloody Little Detail about how my parallel program communicates internally.  Screw Java: I got a JVM with a super GC, fantastic JIT and decent concurrent libraries; it can do loads more stuff than just run Java.  I got reasonable OS’s and an ever-growing mountain of (parallel) CPU cycles.  Isn’t there Some Way to beat back the Concurrent-Programming Demon ?

Cliff