# What the heck is OSR and why is it Bad (or Good)?

What puzzles me is why this isn't already done.  It's essentially equivalent to just pulling the inner block out into a separate tail recursive method.   If the compiler has decent support for inlining JIT methods and associated opportunity for intraprocedural optimizations this approach comes basically for free.  Yet the OSR compilation you describe seems to generate substantially less efficient code.  Sure entering/exiting the interpreter is expensive but either it's quite rare as only the inner loop is hot or it gets eliminated when the outer method gets JITed.

I guess the trouble is that java bytecode makes the business of converting a block to a procedure very inconvenient.  In SSA form it's trivial but if the inner loop is accessing parts of the stack, parameters and instance variables then the method call code would be unnecessarily ugly. - id: 184   author: nksingh   author_email: nksingh85@gmail.com   author_url: ''   date: '2011-11-25 10:07:31 -0800'   date_gmt: '2011-11-25 18:07:31 -0800'   content: |-
Hi Cliff,
From my understanding of HotSpot, it seems that you have sufficient information to map the execution context from a normally JIT'ed function to an interpreted version of the same code for deoptimization purposes. What makes it difficult to apply the mapping the other way to transition from the interpreted code directly to the 'normal' JIT compilation at a safepoint? - id: 185   author: cliff   author_email: cliffc@acm.org   author_url: ''   date: '2011-11-25 11:04:14 -0800'   date_gmt: '2011-11-25 19:04:14 -0800'   content: |-
For "simple" loops with fixed known termination conditions (e.g. most scientific loops; String.hashCode, etc), there's NO mapping kept in the loop body.  Removing the maintenance of the mapping speeds up these kinds of loops alot, typically 2x or so (for small loops) or maybe 20-30% (for larger loops).  Since there's no mapping, you can't transit into the middle of the loop.  Also, the compiler will track other values than what the interpreter directly needs or produces; jumping into the middle of such code means computing those values from the interpreter's state.  Such computations get arbitrarily complex.  Example from String.hashCode: "hash = hash*31 + ary[i];"  The interpreter has tracked 'hash' and 'ary' and 'i', but the JIT will often track some 'p = A+12+i*2' and do a 'p += 2' in the loop... except the loop  is unrolled, and any entry point will come around only every 4th or 8th value for 'i'.  So we'd have to JIT the code; then tell the interpreter to keep executing until 'i mod 8 == 0', then compute 'p=A+12+i*2' and place it in some such register (and 'this', hash, ary and 'i' probably in other registers) before we could start up the optimized loop.  In short, the mapping in this direction is complicated and isn't something I'd like to record as a 'mapping' per-se.  Instead I'd just have the optimizer JIT code for the mapping... which is how the problem is solved now.  Except you better have only 1 such loop-entry or else your loop is now 'irreducible' and all the classic loop optimizations no longer apply.

Cliff ---


What the heck is OSR?  It's HotSpot jargon for On-Stack-Replacement – and it's used to convert a running function's interpreter frame into a JIT'd frame – in the middle of that method.  It crops up in micro-benchmarks all the friggin' time, but only rarely in other apps (or at least only rarely in a performance critical way).  What happens is this:

1. The JVM starts executing some method for the first time ever, in the interpreter, e.g. main().
2. That method has a long running loop, that is now being executed in the interpreter
3. The interpreter figures out that the method is hot and triggers a normal compilation
4. That compilation will be used the NEXT time this method is called, but e.g. it's main() and there is no next time
5. Eventually the interpreter triggers an OSR compilation.  The OSR is specialized by some bytecode it will be called at, which is typically the loop back-edge branch.
6. Eventually the OSR compilation completes.
7. At this point the interpreter jumps to the OSR code when it crosses the specialized entry bytecode – in the middle of the method.

Martin Thompson is asking me questions about writing micro-benchmarks.   I'm replaying some of that conversation here (with permission!).

On Fri, Nov 18, 2011 at 11:20 AM, Cliff Click wrote:

Do more & shorter warmup runs.  100K iterations is fine, but loop around the 'runTest' 10 times during warmup.  Otherwise you risk ending up testing an 'OSR' compile instead of a normal one.

On Mon, Nov 21, 2011 at 12:29 PM, Martin Thompson wrote:

Standard Jit'ing and On Stack Replacement are not equal.  I have been operating under the assumption that they were.  Just to ensure my understanding is correct.  The JVM implements a counter for each method that gets incremented when the method returns and also on branch back within a loop.  If this counter exceeds 10K on the server VM then it will trigger a compilation of the method.  If the method is called again the compiled code is used.  If the method was continuing to loop more than the 10K increments then at approximately 14K increments the method will be swapped via OSR.  I do not understand why OSR can be a less efficient result but will take your word for it.  So I should re-organise the code to avoid OSR and do a number of shorter warm up runs.  This is very interesting feedback for main loop design.  Can you point me about more detail on how the main loops should be best structured?

OK, Martin, here goes…

A number of shorter warm-up loops will do better, or calling the warm-up loop itself in a loop.  Do not nest the warm-up loops in the same function, that defeats the purpose.  It probably helps to understand what's going on in an OSR compile.

In an OSR the code has entered some Java method IN THE INTERPRETER.  It's now spinning in some hot loop IN THE INTERPRETER.  The interpreter is busy gathering backedge counts (which will soon hit 14000) and function-entry counts (will only ever be 1).

When the OSR compilation starts, we start parsing bytecodes IN THE MIDDLE OF THE HOT LOOP.  For singly-nested loops the effect is to partially peel one loop iteration, then enter the loop proper.  To make it clear what the problem(s) are, my examples are written closer to the bytecodes actually experienced by the JVM, as opposed to clean Java.  Single-nested-loop example:

  public static void main(String args[]) {
S1; i=0;
loop:
if(P) goto done
S3; A[i++];
goto loop; // <<--- OSR HERE
done:
S2;
}

Suppose the interpreter has already triggered a normal compile (after 10,000 back-edge branches) and decides that since this loop as run 14,000 back-edges already and we might never exit this function (and thus re-enter it via the normal compilation), and now we need an OSR compilation.  This OSR compilation will only be validly entered at the backwards goto bytecode (and also it needs to do some horrible stack-mangling to remove the interpreter's frame and insert it's own before entering the main function proper).  What program semantics does the OSR compilation have to deal with?  The OSR compile “sees” a function that looks like we started executing bytecodes at that backwards goto; this function might look something like this:

  void OSR_main() {
A=value on entry from interpreter;
i=value on entry from interpreter;
goto loop;
loop:
if(P) goto done
S3; A[i++];
goto loop;
done:
...never reached...
}

This typical optimizes fairly well.  Variably 'i' is a well-formed array loop index (whose starting value is unknown, but it is a stride-1 increasing value capped at the smaller of 'A.length' and '!P').   However for a doubly nested loop things get ugly.

  public static void main(String args[]) {
S0;
loop1:
if(P1) goto done1
S1; i=0;
loop2:
if(P2) goto done2
S3; A[i++];
goto loop2; // <<--- OSR HERE
done2:
S2;
goto loop1;
done1:
S4;
}

Here we (again) decide to OSR at the backedge – although there are two backedges to choose from.  In practice, the inner one is typically much more frequent than the outer one, so that's the typical OSR choice.  Not that the interpreter can tell; it only tracks backward branches and has no clue about any possible looping structure.  In any case, our example OSR compilation starts parsing bytecodes at the 'goto loop2' … leading to this structure:

  void OSR_main() {
A=...value on entry from interpreter...
i=...value on entry from interpreter...
goto loop2;
loop2:
if(P2) goto done2
S3; A[i++];
goto loop2
done2:
S2;
goto loop1;
loop1:
if(P1) goto done1
S1; i=0;
goto loop2
done1:
...never reached...
}

Blah!  What happened?  We lost the nested-loop structure!  After the optimizer cleans up the control-flow a little we end up with something that looks kinda like this:

  void OSR_main() {
A=...value on entry from interpreter...
i=...value on entry from interpreter...
loop2:
if(P2) {
S2;
if(P1) goto done1
S1; i=0;
} else {
S3; A[i++];
}
goto loop2
done1:
...never reached...
}

Predicate 'P2' probably reports a very low but non-zero frequency (since it's really the original inner-loop exit test).  Since P2 is non-zero, variable i gets reset during the loop.  Now variable i is no longer a proper array loop index.  There's no simple affine function describing i, and we no longer can range-check-eliminate A[i++] from the original inner loop.  Performance of the OSR version can be as bad as 1/2 that of a normal compilation (although it's not usually that bad… and 1/2 normal-speed is still 5x interpreted speed!).

Recovering nicely nested loops from here is tough.  And what happens above is the GOOD case.  There's another bytecode parsing pattern/technique HS used to use here, and that one leads to having IRREDUCIBLE loops… which pretty much blows all chance of any loop optimizations in any case.

I hope this little chat has given you some idea of what goes on in an OSR compile… and when OSR's kick in.  If you want to test some little function or other in a tight loop… then the Right Thing To Do is to wrap it in a loop in a function in a loop in a function.  The outermost loop might OSR, but the inner-most loop(s) will be normal function entries so the inner-loop code-gen will Do The Right Thing.

Cliff