# Tiered Compilation

Gadzooks, time flies!  Been too long since I blogged…  no time for something deep so here’s another quicky brain dump.

Azul Systems has been running HotSpot with a Tiered Compilation system since 2007 in our production sites, and I thought I’d blog a little about how that works.

Jargon terms first:

•  C0 – a mythical bytecode-straight-to-asm dumb translator.  Typically used instead of an Interpreter.  HotSpot uses an Interpreter and so does Azul’s VM (and so we do not have a C0).  Our Interpreter is structured different from the HotSpot norm in a bunch of interesting ways (but that’s for another blog).

•  C1 – also known as the client compiler.  A fast & dumb compiler.  You get this compiler from a normal HotSpot with the “-client” flag.  Code quality suffers in exchange for faster compile times.  This makes for a quicker startup times in exchange for lower peak throughput.  The client compiler typically uses less memory to compile than the server compiler, and typically makes bulkier code per-line-of-Java compiled… BUT: it also typically inlines much less aggressively than server compiler and so does not replicate so much code.  Code frames are typically fairly large.  Performance is typically roughly 5x to 10x the interpreter speed.

•  C2 – also known as the server compiler.  A slow & smart compiler, with opposite tradeoffs from C1: slower compiles, slower startup, better code quality, higher peak throughput (after the slow compiles).  C2 typically uses more memory to compile, makes much denser code – but inlines much more aggressively (so ends up compiling a lot more lines of Java code).  Code frames are typically very dense (C2 uses a graph-coloring allocator even for stack frames).  Performance is roughly 30% (on X86) to 50% (on RISCs like Sparc & Azuls’ Vega chips) better than C1.  Mileage varies of course, according to which benchmark and which rev of which compiler is being used.

•  Tiered compilation.  BOTH compilers are used; C1 for fast startup and C2 for peak performance.  Obtained on Azul’s VMs with any of the “-tiered”, “-client” or “-server” flags, and also is the default.  (We made the “-client” and “-server” flags turn on Tiered compilation because so many customers have either flag baked into their launch scripts, which would immediately deny the benefit of tiered compilation.  We use other flags for turning off either compiler).

•  I2C adapter.  A tiny snippet of code that shuffles arguments from the Interpeter’s calling convention (which maps pretty directly to the Java Virtual Machine calling stack) to a compiled-code convention.  The compiled-code convention is machine specific, and generally passes lots of arguments in registers.  I2C adapters are generated lazily based on need, and you need a different one for each unique argument signature.  In a typical largish program you have several hundred unique argument signatures so you get a few hundred I2C adapters.  On Azul/TXU we pass the first 10 arguments in registers.  On X86 we match the standard X86-64 ABI and pass the first 6 arguments in registers.  Arguments that do not fit in registers are passed on the stack, and are shuffled to match how  the compiled convention would normally make a call from compiled-code-to-compiled-code.  i.e., when JIT’d code calls JIT’d code, we use the best possible calling convention for speed.  The Interpreter finds this convention hard to work with (it uses a classic stack convention for all arguments).  The I2C adapter bridges the gap.

•  C2I adapter.  In contrast to a I2C adapter, this one allows compiled code to call into the Interpreter by doing the reverse argument shuffle.  You might think that once code is JIT’d you’d never get back to the Interpreter, but actually code is lazily JIT’d as it gets hot and programs have large complex control flow paths.  It’s common enough for hot code to call cold (or luke-warm code) on various off-side paths.  One other refinement: both I2C and C2I adapters are true in-place argument shuffles.  No stack-space is consumed by invoking back-to-back I2C then C2I adapters.

Sun’s classic HotSpot uses the Interpreter to do profiling.  Sun’s Interpreter gathers stats call site frequency and branch targets, whether or not various Java safety checks have ever failed and etc.  Azul uses our C1 compiler to do the same job for three main reasons:

1. We get to gather stats about 5x to 10x faster than using the Interpreter, because C1 code is so much faster than interpreting.
2. C1 does some modest inlining; each inlined copy has it’s own private profiling data.  When C2 later comes along to JIT this code, if it also does at least this much inlining then it can use this private-per-inlining profiling data.
3. The actual engineering is much simpler: when we make the C1 code we pepper it with counter increments, incrementing counters in a large profile-data structure.  The offset into this profile structure is a compile-time constant so we can issue plain load/inc/store ops interleaved in the generated C1 code.  For the Interpreter implementation, the offset into the profile-data structure varies with the exact bytecode being profiled.  I.e., as the bytecode pointer in the Interpreter advances, so does the matching profile-data structure… except that it’s not a 1-to-1 mapping between bytecodes and counters.  Most bytecodes need no profiling (e.g. ‘swap’ or ‘aload0’) and some bytecodes need lots and lots of profiling (e.g. invokevirtual).  So if you want to scale the bytecode pointer by a constant to get the profile counter pointer – you need to make all bytecodes have a fixed size profile area, but that area needs to be large (invokevirtual) yet most bytecodes (aload0) will ignore it.  It becomes a tremendous waste of space.  So instead Sun engineered a complex variable-sized mapping from the Interpreter’s bytecode pointer to the profile data structure…and debugged it for years and years.   JIT’ing the variable-sized offset into C1 is way the heck simpler.

Ok, so now we know enough to be dangerous.  Lets apply our knowledge to a mythical 2-function-call (plus main) program, where_main_ calls A in a loop and A calls B in a loop.  Each invoke of A calls B 7 times.  B loops internally 10 times, but has no more calls (for simplicity). Futhermore we’ll assume no inlining (suppose A and B are each very large methods), and that eventually we want both A and B to be JIT’d with C2, using C1-generated profiling data.

 **static void main() { while(true) { A(); } }**
 **static void A() { for( int i=0; i<7; i++ ) { blah; B(); blahblah; } }**
 **static void B() { for( int j=0; j<10; j++ ) { if(blah) { blah; } else { blahblah; } } }**

Our program starts out interpreted, with main calling A.  Turns out our Interpreter DOES do some limited profiling – just enough to trigger a C1 compile when needed.  We gather function invocation counts and backwards-branch counts; both are gathered on a per-method basis without regard to bytecodes.  So when main calls A the first time, A calls B 7 times

per call to A, we have these counts:

Interpreter Counts: A (1 invokes, 7 backedges), B (7 invokes, 70 backedges).

The compile threshold for C1 is 100 counts, we’ll trigger it either on an invoke or a backedge any time the sum of the two counters crosses the 100 threshold.  So main calls A for the second time, about mid-way through the execution of A on some invoke of B we have:

Interpreter Counts: A (2 invokes, 10 backedges), B (10 invokes, 90 backedges).

The Interpreter now calls into the JVM runtime, the JVM decides that B has 100 total counts, does not have any C1 or C2 code handy, nor is a C1 (nor C2) compile in-progress, and that furthermore method A is too large to guarantee that it will inline B – and it’s method B that needs to be JIT’d.  So the JVM fires off a C1 JIT of method B.  This JIT’ing hits a work-queue of to-be-compiled tasks, and a spare C1 compiler thread picks it up and starts JIT’ing.  Meanwhile, the JVM returns back to the normal running Java thread and wants to continue execution… but it doesn’t want to keep triggering C1 compiles – so the existing counts are moved “off to the side”; not forgotten but not triggering more calls back into the JVM.  In our example, let’s assume that the C1 code isn’t ready until we return all the way back to main.  Our counts are now:

Interpreter Counts: A (2 invokes, 14 backedges), B (4 invokes, 50 backedges, old counts: 10+90).

You’ll notice that the counts for B are now split into “current counts” and “old counts”.  Generally we just sum them all together, but the Interpreter just looks at the “current counts” to avoid endlessly calling back into the JVM only to discover that a C1 compile is in-progress.  Meanwhile the program executes, and the JVM installs the C1 code, and sets a code-pointer up that the Interpreter inspects each time it makes a function call (e.g. invokevirtual).  So let’s continue our example (with main calling A again) until just prior to the next call to B (so one more invoke of A but no backedges nor invokes of B).

Interpreter Counts: A (3 invokes, 14 backedges), B (4 invokes, 50 backedges, old counts: 10+90).

C1 – B Counts:  0 invokes, 0 backedges, plus lots of other bytecode counters

Interpreter Counts: A (3 invokes, 21 backedges), B (4 invokes, 50 backedges, old counts: 10+90).

C1 – B Counts:  7 invokes, 70 backedges, plus lots of other bytecode counters

Notice that the Interpreter counts for B have stalled; we’re counting in C1 JIT’d code now.  We continue until about midway through the dozen’th invoke of A, at a backedge on the loop in A:

Interpreter Counts: A (13 invokes, 86 backedges), B (4 invokes, 50 backedges, old counts: 10+90).

C1 – B Counts:  72 invokes, 720 backedges, plus lots of other bytecode counters

As we take the backedge and bump the backedge counter to 87, the Interpreter notices A crosses the magic total-count threshold of 100.  So again we bounce in and out of the JVM, triggering a C1 JIT of A.  Again for simplicity, let’s assume it’s not ready until we return back to main and main again invokes A:

Interpreter Counts: A (13 invokes, 91 backedges), B (4 invokes, 50 backedges, old counts: 10+90).

C1 – B Counts:  77 invokes, 770 backedges, plus lots of other bytecode counters

Now we start running A in C1 code also, and basically the whole program (except for a tiny bit in main) is running at C1 speeds.  Before executing C1-for-A the first time we again shuffle our Interpreter counts around:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:   0 invokes,  0 backedges, plus lots of other bytecode counters

C1 – B Counts:  77 invokes, 770 backedges, plus lots of other bytecode counters

So now C1-for-A executes and on the first time through it hits a call-site that is not resolved.  To match Java semantics, all call-sites are resolved lazily, as late as possible (for C++ programmers: we essentially do dynamic linking always – but with zero overhead once the link is established).  This call-site is wired to trampoline into the JVM for resolution; and the JVM will figure out that it’s a call to B.  The trampoline will have saved the return address (back to method A) in a place where the JVM can find it, and the JVM will use this return address to find the unresolved call site – and the actual call instruction.  The call instruction is patched to now call B.  Yes, this is classic self-modifying code at work: the call instruction USED to target a trampoline and NOW targets the “best target” for method B.  Since B has been JIT’d with C1 code, the call is patched to the C1-for-B code.

OLD: …lots of C1-for-A-code… call _unresolved_ ….more C1-for-A-code…

NEW: …lots of C1-for-A-code… call C1-for-B ….more C1-for-A-code…

Of course, machine instructions are nothing more than a little bit of data in memory, so the patching isn’t much more than a write to memory.  It has to be atomic (because other threads can be running the same C1-for-A code and will see the call mid-patch if that’s possible), and so it has to not span an ICache-line boundary (only an issue for X86), and is updated with a CAS op.

Back to our example, the C1-for-A code finally ends up calling C1-for-B code.  After 1 invoke of the shiney new C1-for-A code the counts are:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:   1 invokes,  7 backedges, plus lots of other bytecode counters

C1 – B Counts:  84 invokes, 840 backedges, plus lots of other bytecode counters

Much later the counts are:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:  119 invokes,  832 backedges, plus lots of other bytecode counters

C1 – B Counts:  909 invokes, 9090 backedges, plus lots of other bytecode counters

and on the next invoke of C1-for-B, we trigger the next threshold: 10000 counts triggers a C2 compile.  Same story as before: the C1-for-B JIT’d code makes a trampoline call into the JVM, the JVM decides it’s time for a C2 compile of B (and not A).  Method B gets JIT’s by C2, using the profiling done on the last 10000 or so trips around the loop in the C1-for-B code.  This takes awhile, so we go back to running in the C1 code until the C2 code is ready.  Again the counts are shuffled to avoid calling into the JVM until we are ready:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:  119 invokes,  832 backedges, plus lots of other bytecode counters

C1 – B Counts:  0 invokes, 0 backedges, old: 909+9090, plus lots of other bytecode counters

Let’s assume we run on until here:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:  200 invokes,  1400 backedges, plus lots of other bytecode counters

C1 – B Counts:  414 invokes, 4140 backedges, old: 909+9090, plus lots of other bytecode counters

C2 – B Code

Now the C1 code doesn’t check for the presence of C2 code directly, it just looks at the counts.  As part of installing the C2 code we reset the C1 counts to trigger a call back into the JVM:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:  200 invokes,  1400 backedges, plus lots of other bytecode counters

C1 – B Counts:  1323 invokes, 13230 backedges, old: 0+9090, plus lots of other bytecode counters

C2 – B Code

The next call to C1-for-B from C1-for-A triggers the overflow check, which trampolines into the JVM.  The JVM finds the C2 code is ready and uses the trampoline to again find the call-site that used to call C1-for-B and patches it yet again:

OLD: …lots of C1-for-A-code… call C1-for-B ….more C1-for-A-code…

NEW: …lots of C1-for-A-code… call C2-for-B ….more C1-for-A-code…

Now as we run along, the Interpreter is executing main, which calls C1-for-A in a tight loop (via a I2C adapter), and C1-for-A both profiles, and calls C2-for-B directly (no adapter).

Eventually, C1-for-A will trigger a C2 compile of A, which will get installed.  This new C2-for-A code will have it’s own not resolved call site which will eventually resolve to the C2-for-B code, and the system will settle down to running all C2 code (except for a tiny bit of the interpeter).  The C2 methods have no profiling overheads, but are JIT’d with good profiling information.

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

C1 – A Counts:  1300 invokes,  13000 backedges, plus lots of other bytecode counters

C1 – B Counts:  1323 invokes, 13230 backedges, old: 0+9090, plus lots of other bytecode counters

C2 – A Code, calls C2-for-B internally

C2 – B Code

That’s the Grand 1st-order bit for Tiered Compilation.  Within this framework there are a lot of refinements.  I’ll add a little more detail, but then I’ll just have to run out of space lest this blog go on forever….

Suppose method B goes through a phase-shift, and starts executing code it’s never executed before.  This code will have zero counts in the C1-for-B generated profile.  In the C2 JIT’d code, code with zero counts gets re-directed back into the Interpreter, generally at the branch bytecode which has never gone down that path before.  Now that we are taking this new path, we’d like to profile a bit more, then re-JIT with C2.

This transition from C2 to the Interpreter is called an “uncommon trap” in HotSpot jargon (and is very similar to a deoptimization event).  I’ll skip the gory details on how the transition works, but basically the running C2 is replaced with the inlined-stack of Interpreter frames.  Also we bump the profile counters for this new path (since the Interpreter does not bump path-specific counts).  A call-stack at this point in time might look like:

Interpreter-frame (running main)

C2-for-A frame (running A)

Interpreter-frame (running B down some new code path).

The Interpreter’s counts for B get shuffled back, so that the next invoke of B will trigger the JVM:

Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (14 invokes, 140 backedges, old: 0+0).

The JVM will in turn, decide: B is hot enough to be profiled and C1-for-B already exists, but no suitable C2-for-B exists (the existing one has been declared stale) and so the call-site in C2-for-A gets repatched:

OLD: …lots of C2-for-A-code… call C2-for-B ….more C2-for-A-code…

NEW: …lots of C2-for-A-code… call C1-for-B ….more C2-for-A-code…

Now the C1-for-B code gets to profile the new paths, and eventually it will trigger a new C2-for-B compile which will use the new profile data.

I hope this has given you a good flavor of how tiered compilation works!

Cliff