Inline Caches and Call Site Optimization

Inline Caches solve a problem particular to Java: extremely frequent virtual calls.  C++ has virtual calls as well, but you have to ask for them (using the virtual keyword); by default C++ calls are static.  The situation is reversed in Java: virtual calls are the default, and you have to ask for static calls (with the final keyword).  However, even though most Java calls are declared virtual, in practice very few use the full virtual-call mechanism.  Instead, nearly all Java calls are as fast as C and C++ static calls (including being inlined when appropriate).  Here's how it works:

At JIT'ing time, the JIT first attempts some sort of analysis to determine the call target.  This works surprisingly well: a great many call targets can be determined with a straightforward inspection of the class hierarchy.  Note that this inspection happens at JIT-time (runtime), so it is only concerned with the classes loaded at that moment (contrast this with the normal C++ situation, where all possible classes linked into the program have to be inspected).  Here's a common situation:


  abstract class Picture {
    abstract String foo();
  }
  class JPEG extends Picture {
    String foo() { return "foo-ish stuff"; }
  }
  abstract class PictureOnDisk extends Picture {
    abstract String foo();
  }

  // ... elsewhere:
  void somecall( Picture pic ) {
    pic.foo();
  }

--------------------------------

When JIT’ing somecall(), what can we say about the call to foo()?  foo() is not declared final, so by default it’s a virtual call.  The type of pic is statically declared as the abstract class ‘Picture’ – but there are never any direct Picture objects (because the class is abstract!) so every ‘Picture’ is really some concrete subclass of ‘Picture’.  Loaded in the system right now there is only one such subclass: JPEG.  The abstract class ‘PictureOnDisk’ doesn’t count – there are no instances of PictureOnDisk and there are no declared concrete subclasses of PictureOnDisk.

So with a little Class Hierarchy Analysis (CHA), the JIT can determine that the only possible values for ‘pic’  are instances of class JPEG or null.  After a null-check, the virtual call to foo() can be optimized into a static call to JPEG.foo() and even inlined.  Note that this optimization works until another concrete subclass of Picture is loaded, e.g. when subclass JPEGOnDisk is loaded into the system, the JIT’d code for somecall() will have to be thrown away and regenerated.
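Here's a minimal sketch of that CHA query in Java, with loadedSubclassesOf as a hypothetical stand-in for the JIT's internal class-hierarchy metadata (the real analysis walks the VM's own class records, not reflection):

  import java.util.List;

  // Toy model of the CHA question: among the classes loaded *right now*,
  // is there exactly one concrete class that could be behind 'declared'?
  final class CHA {
    // Hypothetical stand-in for the VM's record of loaded subclasses
    // (including 'declared' itself); the real JIT walks its own metadata.
    static List<Class<?>> loadedSubclassesOf(Class<?> declared) { return List.of(); }

    static Class<?> singleConcreteTarget(Class<?> declared) {
      Class<?> found = null;
      for (Class<?> c : loadedSubclassesOf(declared)) {
        if (java.lang.reflect.Modifier.isAbstract(c.getModifiers()))
          continue;                      // abstract: no instances, doesn't count
        if (found != null) return null;  // a second concrete class: CHA "fails"
        found = c;
      }
      return found;  // e.g. JPEG.class above; null means CHA can't devirtualize
    }
  }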

CHA is often successful, is cheap to run, and has a high payoff (allowing static calls or even inlining).  HotSpot (and Azul) have extended the basic algorithm several times to cover even more cases (large trees of abstract classes with but a single concrete class in the middle will be optimized; so will many interface calls).


What happens when CHA “fails” (i.e. reports multiple targets)?

The most common answer is to use an “inline cache”.  An inline-cache is a classic 1-entry cache compiled directly into the code.  The Key is the expected class of the ‘this’ pointer and the Value is the method to call.  The Key test is typically done by loading the actual class out of the ‘this’ pointer (a header-word load), and then comparing it to some expected class.  For a 32-bit X86 system the test looks something like this:
  cmp4i [RDI+4],#expected_klass
  jne   fail
For Azul's TXU ops we can skip the load, as the klass is encoded directly in the pointer, and we have an instruction to do the test-and-branch in one go:
  cmpclass R0,#expected_klass

The Value is encoded directly into the call instruction:
  call foos_JITed_code
The failure case needs the address of the Inline Cache, because the code needs to be modified (e.g. to fill in an empty cache).  The easy way to get the address is to do the test after the X86 call has pushed the return PC on the stack… but the test needs a per-call-site expected class, so the expected class is loaded into a register before the X86 call.

The X86 total sequence is:
  mov4i RAX,#expected_klass
  call  foos_JITed_code
foos_JITed_code:
  cmp4i [RDI+4],RAX
  jne   fail
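Modeled in Java (hypothetical names; the real Key is an immediate in the instruction stream and the real Value is the address baked into the call instruction), the whole sequence behaves like this:

  import java.util.function.Function;

  // A 1-entry inline cache modeled as a heap object (the real one lives
  // in the JIT'd instruction stream and is modified by patching code).
  final class InlineCacheModel {
    Class<?> expectedKlass;            // the Key:   mov4i RAX,#expected_klass
    Function<Picture,String> target;   // the Value: the patched call target

    String call(Picture pic) {
      if (pic.getClass() == expectedKlass)  // the cmp4i/jne guard
        return target.apply(pic);           // hit: effectively a static call
      return miss(pic);                     // miss: patch the IC
    }

    private String miss(Picture pic) {
      if (expectedKlass == null) {     // empty cache: fill it in
        expectedKlass = pic.getClass();
        target = Picture::foo;         // real code binds the resolved target
        return target.apply(pic);
      }
      return pic.foo();  // repeated misses: patch to the full virtual-call stub
    }
  }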

Performance: The entire sequence is only 2 to 4 instructions long (HotSpot/X86 uses 4 ops plus alignment NO-OPs, 5 on Sparc, more for 64-bit pointers; Azul TXU uses 2 ops).  95% of (static) call sites that use an inline cache never fail the cache check for the entire duration of the program.  The remaining 5% typically fail it repeatedly, i.e. the call site goes megamorphic.  For the 5% case we further patch the Inline Cache to call a stub that does a full virtual-call sequence.

Back to the 95% case: the IC costs a load/compare/branch.  The branch is entirely predictable.  The load has an unfortunate miss rate (it is often the first use of an object's header word), but an out-of-order X86 processor can issue past the miss and the predicted branch and start executing the called method.  This handful of extra ops represents the entire cost of 95% of the not-analyzable virtual call sites.  Dynamically, nearly all calls (>99%) fall into the statically-analyzable or monomorphic-at-runtime camps.  Only a tiny handful at runtime actually take the full virtual-call path.

ICs also work for interface calls, for essentially the same reason: interface call sites are also almost always single-klass at runtime, and once you've correctly guessed the klass of the 'this' pointer you can compute the correct target of an interface call and cache it as easily as a virtual call.
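For instance (hypothetical interface and implementor, just to sketch the claim):

  interface Decoder { String decode(); }

  final class JpegDecoder implements Decoder {
    public String decode() { return "jpeg bits"; }
  }

  class InterfaceCallSite {
    // At runtime 'd' is nearly always a JpegDecoder, so after one
    // successful klass guess the cached target is a direct call to
    // JpegDecoder.decode() -- no interface lookup on the hot path.
    static String run(Decoder d) { return d.decode(); }
  }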


What about using larger caches?  In the 5% case, does it help to have more Key/Value pairs in the cache?  Typically no: once a call site fails to be monomorphic it's almost always "megamorphic", with many, many targets being called.  Instead of 1 common target, there are frequently 20 or more.

Can we do something with profiling and the JIT?  Indeed we can: if we have profiled the call site and at JIT-time discovered that one or two targets dominate, we can have the JIT inline the inline-cache test.  Inlining the test gives us more options: we can now inline on the expected path and either slow-call on the unexpected path or bail out to the interpreter.  Here I show the control-flow diamond inlined:

  cmp4i [RDX+4],#expected_klass
  jne   go_slow
  …foo-ish stuff…
post_call:

go_slow:
  call  full-on-virtual-call
  jmp   post_call

Bailing out to the interpreter has the further advantage that there’s no unknown-path being folded back into the main stream.  Once we know that RDX holds an instance of JPEG we know it for the rest of the compilation.

HotSpot also implements the 2-hot-targets case in the JIT; the JIT inlines 2 guard tests and then does its normal optimizations for the 2 static calls (including separate inlining decisions for each call).
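In Java terms (PNG is a hypothetical second concrete subclass; the real output is machine code with both foo() bodies inlined), the 2-hot-targets shape is roughly:

  // Hypothetical second hot receiver class for the bimorphic case:
  final class PNG extends Picture {
    String foo() { return "png-ish stuff"; }
  }

  class Bimorphic {
    // Morally what the JIT'd somecall() becomes once profiling reports
    // JPEG and PNG as the two dominant receivers:
    static String somecall(Picture pic) {
      Class<?> k = pic.getClass();
      if (k == JPEG.class)      return "foo-ish stuff";  // inlined JPEG.foo()
      else if (k == PNG.class)  return "png-ish stuff";  // inlined PNG.foo()
      else                      return pic.foo();        // rare: full virtual call
    }
  }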


Inline Caches have a 'lifecycle'.  At Azul, we control them with a simple 5-state machine.  ICs start out life in the clean state.  Any thread executing a clean IC trampolines into the VM, and the IC is patched to either the caching state or the static state.  The caching state is described above; the static state is reserved for times when the JIT did not determine a single target using CHA but the runtime can prove it.  This can happen, e.g., when a class unloads during the JIT process (so there are multiple possible targets when the JIT begins, but only a single possible target by the time the JIT finishes).  If a caching-state IC fails even once, HotSpot flips the IC into the v-call (or i-call for interfaces) state.

Many call sites are megamorphic during startup, but the interpreter (or C1 on Azul's tiered VM) handles startup.  Also, aggressive inlining by C2 means that many call sites are replicated, and each site has a chance to cache a different expected klass.  The end result is that when a cached IC misses in the cache, that call site is nearly always going to go megamorphic, and no reasonably sized cache is going to hold all the targets; i.e., the Right Thing To Do is to patch the IC into the v-call state.

As a guard against program phase-changes (where a v-call is being made although a cache would now work), a background thread periodically flips all v-call-state ICs back to the clean state, giving them a chance to start caching again.

Due to the requirement that ICs are always executable by racing other threads, patching ICs is tricky.  It turns out to be Not Too Hard to patch from the clean state to either static or caching, and from caching to either v-call or i-call.  But it is never safe to patch from caching, v-call or i-call back to the clean state, except when no mutator can observe a half-patched IC (which in practice means a full safepoint).  So the patch-to-clean step is done by GC threads as part of a normal GC safepoint.
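Here's a minimal sketch of that lifecycle and its legal patches (state names are descriptive, not Azul's internal ones):

  // The 5 IC states and the patching rules described above: running
  // threads may patch clean->caching/static and caching->v-call/i-call;
  // only GC threads at a safepoint may patch anything back to clean.
  enum ICState {
    CLEAN,     // fresh from the JIT; first execution trampolines into the VM
    CACHING,   // 1-entry cache: expected klass + direct call target
    STATIC,    // runtime proved a single target after JIT-time CHA could not
    V_CALL,    // megamorphic virtual call stub
    I_CALL;    // megamorphic interface call stub

    boolean canPatchTo(ICState next, boolean atGcSafepoint) {
      if (next == CLEAN) return atGcSafepoint;  // never while mutators race
      switch (this) {
        case CLEAN:   return next == CACHING || next == STATIC;
        case CACHING: return next == V_CALL  || next == I_CALL;
        default:      return false;
      }
    }
  }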

And this blog exceeds my Daily Dose of geekiness, so I better code for awhile…

Cliff
