JIT'd code calling conventions, or "answering the plea for geekyness"

A blog reader writes:

And here my heart palpitated a little when I saw there was a new Cliff Click blog entry.  Only to find it wasn’t full of obscure technical geekery, but a list of conferences.

With a plea like that, how can I refuse? 🙂

Here’s another random tidbit of Azul/HotSpot implementation details.

Performance of Java programs depends in some part on the JIT’d code (and some on GC and some on the JVM runtime, etc).  Performance of JIT’d code depends in part on calling conventions: how does JIT’d code call JIT’d code?  HotSpot’s philosophy has always been that JIT’d code calls other code in much the same way as C/C++ code calls other C/C++ code.  There is a calling convention (most arguments are passed in registers) and the actual ‘call’ instruction directly calls from JIT’d code to JIT’d code.

Let’s briefly contrast this with some other implementation options: some VMs arrange for JIT’d code to pass arguments in a canonical stack layout, one that matches what a purely interpreted system would do.  This allows JIT’d code to directly intercall with non-JIT’d (i.e. interpreted) code.  It also makes the implementation much easier because you don’t have to JIT ALL the code (very slow and bulky; there’s a lot of run-once code in Java where even lightweight JIT’ing is a waste of time).  However, passing all the arguments on the stack makes the hot/common case of JIT’d code calling JIT’d code pay a speed penalty.  Compiled C/C++ code doesn’t pay this price and neither do we.

How do we get the best of both worlds – hot-code calls hot-code with arguments in registers, but warm-code can call cold-code and have the cold-code run in the interpreter… and the interpreter is going to pass all arguments in a canonical stack layout (matching the Java Virtual Machine Spec in almost every detail, surprise, surprise)?  We do this with ‘frame adapters’ – short snippets of code which re-pack arguments to and from the stack and registers, then trampoline off to the correct handler (the JIT’d code or the interpreter).  Time for a hypothetical X86 example…

Suppose we have some hot Java code:
  static void foo(this,somePtr,4);

And the JIT happens to have the ‘this’ pointer in register RAX and the ‘somePtr’ value in register RBX.  Standard 64-bit X86 calling conventions require the first 3 arguments in registers RDI, RSI, and RDX.  The JIT produces this code:

  mov8  RDI,RAX  // move ‘this’ to RDI
  mov8  RSI,RBX  // move ‘somePtr’ to RSI
  mov8i RDX,#4   // move literal #4 to RDX
  call  foo.code // call the JIT’d code for ‘foo’

Alas, method ‘foo’ is fairly cold (we must have come here from some low-frequency code path) and ‘foo’ is not JIT’d.  Instead, the interpreter is going to handle this call.  So where does the interpreter expect to find call arguments?  The interpreter has to run all possible calls with all possible calling signatures and arguments – so it wants an extremely generic solution.  All arguments will be passed on the JVM’s “Java Execution Stack” – see the JVM bytecode spec – but basically it’s a plain stack kept in memory somewhere.  For standard Sun HotSpot this stack is usually interleaved with the normal C-style control stack; for Azul Systems we hold the interpreter stack off to one side.  For implementation geeks: it’s a split-stack layout; both stacks grow towards each other from opposite directions, but the interpreter-side stack only grows when a new interpreted frame is called.  ASCII-gram stack layout:

+---------+-------------------------------------------+
| Thread  | Interpreter                    Normal “C” |
| Local   | Stack                          Stack      |
| Storage |   Grows-->                 <--Grows       |
| 32K     |                                           |
+---------+-------------------------------------------+

Another tidbit: the interpreter’s state (e.g. its stack-pointer or top-of-stack value) is kept in the Thread Local Storage area when the interpreter isn’t actively running; i.e. we do not reserve a register for the interpreter’s stack, except when the interpreter is actively running.  Also, all our stacks are power-of-2 sized and aligned; we can get the base of Thread Local Storage by masking off the low bits of the normal “C/C++” stack pointer – on X86 we mask the RSP register.
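The masking trick is easy to try out in a few lines of Python.  This is only a sketch: the 1 MB region size is an assumption chosen to match the 20-bit mask in the adapter code in this post, and `STACK_SIZE`, `tls_base` and `region` are made-up names for illustration.

```python
STACK_SIZE = 1 << 20  # assumed: 1 MB region, power-of-2 sized and aligned


def tls_base(stack_pointer: int) -> int:
    """Clear the low bits of the stack pointer to find the region base."""
    return stack_pointer & ~(STACK_SIZE - 1)


# A made-up, suitably aligned region start address.
region = 0x7F3A45 * STACK_SIZE

# Any address inside the region maps back to the same base.
near_bottom = tls_base(region + 0x1234)
near_top = tls_base(region + STACK_SIZE - 8)
```

Because the region is aligned to its own size, no table or extra register is needed: the stack pointer alone is enough to find Thread Local Storage.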

The interpreter expects all its incoming arguments on the interpreter-side stack, and will push a small fixed-size control frame on the normal “C” side stack.  But right now, before we actually start running the interpreter, the arguments are in registers – NOT the interpreter’s stack.  How do we get them there?  We make a ‘frame adapter’ to shuffle the arguments and the ‘frame adapter’ will call into the interpreter.  And here’s the code:

  // frame adapter for signature (ptr,ptr,int)
  // First load up the interpreter’s top-of-stack
  //  from Thread Local Storage
  mov8  rax,rsp           // Copy RSP into RAX
  and8i rax,#~0xFFFFF     // Mask RAX to base of TLS (clear the low 20 bits)
  ld8   rbx,[rax+#jexstk] // load Java Execution Stack
  // Now move args from RDI,RSI & RDX into JEX stack
  st8   [rbx+ 0],rdi
  st8   [rbx+ 8],rsi
  st8   [rbx+16],rdx
  add8i rbx,24  // Bump Java Execution stack pointer
  // Jump to the common interpreter entry point
  // RAX – base of thread-local storage
  // RBX – Java Execution Stack base
  // All args passed on the JEX stack
  jmp   #interpreter
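For anyone who would rather step through the adapter than read assembly, here is a toy Python model of the same shuffle.  It is a sketch under stated assumptions: registers, memory and Thread Local Storage are plain dicts, and the addresses and the `frame_adapter_ptr_ptr_int` name are invented for the example.

```python
WORD = 8  # bytes per stack slot on a 64-bit machine


def frame_adapter_ptr_ptr_int(regs, memory, tls):
    """Toy model of the adapter for signature (ptr, ptr, int).

    regs   -- dict of register name -> value (args arrive in rdi, rsi, rdx)
    memory -- dict of address -> 8-byte value, standing in for RAM
    tls    -- dict standing in for Thread Local Storage; 'jexstk' holds
              the interpreter's saved top-of-stack address
    """
    top = tls["jexstk"]                            # ld8 rbx,[rax+#jexstk]
    for slot, reg in enumerate(("rdi", "rsi", "rdx")):
        memory[top + slot * WORD] = regs[reg]      # st8 [rbx+N],reg
    regs["rbx"] = top + 3 * WORD                   # add8i rbx,24
    return regs["rbx"]


# Made-up values: two pointers and the literal 4 from the example call.
regs = {"rdi": 0x1000, "rsi": 0x2000, "rdx": 4}
memory = {}
tls = {"jexstk": 0x8000}
new_top = frame_adapter_ptr_ptr_int(regs, memory, tls)
```

After the call, the three arguments sit in consecutive slots starting at the old top-of-stack, and `rbx` has been bumped past them, which is the state the interpreter entry point expects.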

Note that the structure of a ‘frame adapter’ depends only on the method’s calling signature.  We do indeed share ‘frame adapters’ based solely on signatures.  When running a very large Java app we typically see something on the order of 1000 unique signatures, and the adapter for each signature is generally a dozen instructions.  I.e., we’re talking maybe 50K of adapter code to run the largest Java programs; these programs will typically JIT 1000x more code (50 Megs of JIT’d code).

We need one more bit of cleverness: the interpreter needs to know which method is being called.  JIT’d code “knows” which method is currently executing – because the program counter is unique per JIT’d method.  If we have a PC we can map it back (via a simple table lookup) to the Java method that the code implements.  Not so for the interpreter; the interpreter runs all methods – and so the ‘method pointer’ is variable and kept in a register – and has to be passed to the interpreter when calling it.  Our ‘frame adapter’ above doesn’t include this information.  Where do we get it from?  We use the same trick that JIT’d code uses: a unique PC that ‘knows’ which method is being called.  We need 1 unique PC for each method that can be called from JIT’d code and will run interpreted (i.e. lots of them) so what we do per-PC is really small: we load the method pointer and jump to the right ‘frame adapter’:

  mov8i RCX,#method_pointer
  jmp   frame_adapter_for_(ptr,ptr,int)
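As an aside on the "simple table lookup" from a PC back to its method mentioned above: one plausible shape for it is a binary search over sorted code ranges.  The sketch below assumes that layout; the addresses, method names and the `method_for_pc` helper are all invented for illustration and say nothing about the real data structure.

```python
import bisect

# Hypothetical code-cache layout: (start_pc, end_pc, method), sorted by start.
CODE_RANGES = [
    (0x1000, 0x1400, "Foo.bar"),
    (0x1400, 0x1480, "Foo.baz"),
    (0x2000, 0x2A00, "Main.run"),
]
STARTS = [start for start, _, _ in CODE_RANGES]


def method_for_pc(pc):
    """Return the method whose JIT'd code contains pc, or None."""
    i = bisect.bisect_right(STARTS, pc) - 1  # last range starting at or before pc
    if i < 0:
        return None
    start, end, name = CODE_RANGES[i]
    return name if start <= pc < end else None
```

The per-method stub plays the same role for interpreted methods: since every stub has its own address, the method pointer can be baked into the stub as a constant.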

And now we put it all together.  What instructions run when warm-code calls the cold-code for method ‘foo’?  First we’re running inside the JIT’d code, but the call instruction is patched to call our tiny stub above:

// running inside JIT’d code about to call foo()
  mov8  RDI,RAX  // move ‘this’ to RDI
  mov8  RSI,RBX  // move ‘somePtr’ to RSI
  mov8i RDX,#4   // move literal #4 to RDX
  call  method_stub_for_foo
// now we run the tiny stub:
  mov8i RCX,#method_pointer
  jmp   frame_adapter_for_(ptr,ptr,int)
// now we run the frame adapter
  mov8  rax,rsp
  and8i rax,#~0xFFFFF     // clear the low 20 bits to get the base of TLS
  ld8   rbx,[rax+#jexstk]
  st8   [rbx+ 0],rdi
  st8   [rbx+ 8],rsi
  st8   [rbx+16],rdx
  add8i rbx,24          
  // Jump to the common interpreter entry point
  // RAX – base of thread-local storage
  // RBX – Java Execution Stack base
  // RCX – method pointer
  // All args passed on the JEX stack
  jmp   #interpreter

Voilà!  In fewer than a dozen instructions any JIT’d call site can call into the interpreter with arguments where the interpreter expects them… OR, crucially, call hot JIT’d code with arguments in registers where the JIT’d code expects them.
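The whole path can be modelled as a toy in Python: a call site whose target is either a per-method stub (which supplies the method pointer and hands off to a shared adapter) or the JIT’d code itself.  Everything named here (`CallSite`, `make_stub`, `adapter_3args`, `jitted_foo`) is made up for the sketch, and re-pointing the call site once ‘foo’ gets compiled is my assumption about when the patching happens, shown only to illustrate that the caller’s side never changes.

```python
def interpreter(method, jex_stack):
    """Stand-in interpreter: reads its arguments from the stack."""
    return ("interpreted", method, list(jex_stack))


def adapter_3args(method, reg_args):
    """Shared adapter: move register args to the stack, enter the interpreter."""
    jex_stack = []
    jex_stack.extend(reg_args)
    return interpreter(method, jex_stack)


def make_stub(method, adapter):
    """Per-method stub: supply the method pointer, then defer to the adapter."""
    def stub(*reg_args):
        return adapter(method, reg_args)
    return stub


def jitted_foo(this, some_ptr, n):
    """Stand-in for JIT'd code: takes its arguments straight from 'registers'."""
    return ("jitted", "foo", [this, some_ptr, n])


class CallSite:
    """A call instruction whose target can be patched."""

    def __init__(self, target):
        self.target = target

    def call(self, *reg_args):
        return self.target(*reg_args)


site = CallSite(make_stub("foo", adapter_3args))
cold = site.call("this", "somePtr", 4)   # goes stub -> adapter -> interpreter
site.target = jitted_foo                 # assumed: patched once foo is compiled
hot = site.call("this", "somePtr", 4)    # direct call, no adapter involved
```

In both cases the caller does exactly the same thing (arguments in "registers", then a call); only the target of the call differs.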

And this is how Java’s actually implemented calling convention matches compiled C code in speed, but allows for the flexibility of calling (cold, slow) non-JIT’d code.

Cliff