How does Java Both Optimize Hot Loops and Allow Debugging

This blog came about because an old friend is trying to figure out how Java can do aggressive loop and inlining optimizations, while allowing the loading of new code and setting breakpoints… in that optimized code.

On 2/21/2015 11:04 AM, IG wrote:

Safepoint. I’m still confused because I don’t understand what is the required state at a safepoint that can’t be ensured at any random point of execution. What is in-flight that must be permitted to complete? Given a cycle-by-cycle map you can find all the pointers at any cycle – is safe pointing just a way to reduce the volume of pointer-locating info?

Reducing OOP-map volume happens, is useful – but is not the main thing being
achieved.

It’s flipping backwards from the machine state to the language architected
state, unwinding from super-aggressive optimization to reach the theoretical
Java (or C/C++) language state at some points. Toooo expensive to preserve
that mapping line-by-line, call-by-call, execution-point-by-execution-point.
So you keep the mapping at a few places, and optimize the bejeebers out of
things in-between. If somebody wants to:

  • set a breakpoint in optimized code (yes, works for Java all the time)
  • inject new classes, new targets for some virtual-call

then you need to stop the physical machine (actual CPU/Mill) at some place
where you can find all the language architected pieces. This mapping is far
more complex than the oop/no-oop mapping for GC… but isn’t why you make it
rare. You make it rare because you have to hang onto all the language state at
these safepoints, but don’t need it between safepoints.

An example might help:

for( int i=0; i<N; i++ )
    sum += A[i]  // set a breakpoint here, for “i>1e6” or maybe “i>1e7”

So intent is to run awhile… then somebody sets a breakpoint… and wants to
inspect “sum”, or change “i” in the debugger, then continue execution. What’s
the machine code look like, with no safepoints? Something like this (bogus
X86-like assembly code, pardon non-perfection):

… some setup, zero-trip test, range-check on “N” vs “A.length”, etc
  ld  RA=&A        // Classic strength-reduced pointer into array A
  add RC,RA,#N*8   // Stopping value of A
  ld  RB=0         // sum
loop:
  ld  RD=[RA+off]  // Load from A
  add RB=RB,RD     // Sum into RB
  add RA=RA,8      // Advance our pointer into A
  cmp RA,RC        // Stop when at the end
  blt loop
  // RB contains sum

But actually, with some loop unrolling:

… some setup, zero-trip test, range-check on “N” vs “A.length”, etc
  ld  RA=&A        // Classic strength-reduced pointer into array A
  add RC,RA,#N*8 & (-1-3)  // mask of low 2 bits of iteration - unroll by 4
  ld  RB=0 // sum
loop:
  prefetch [R1+off+32]  // Fetch next cache line… overlap memory & compute
  ld  R0=[RA+off+ 0]    // Get 4 values array A
  ld  R1=[RA+off+ 8]
  ld  R2=[RA+off+16]
  ld  R3=[RA+off+24]
  add R4=R0,R1  // add-tree to allow more parallel adders
  add R5=R2,R3  // add-tree to allow more parallel adders
  add RB=RB,R4  // partial sum into RB
  add RB=RB,R5  // full sum into RB
  add RA=RA,8*4 // Skip over 4 elements of array A
  cmp RA,RC     // Stop when at the end
  blt loop
  … cleanup code for 0-3 more iterations
  // RB contains sum

Now I want to break on the “sum += A[i]” line… where is that, exactly, in
that unrolled loop?

And which register contains “i”? (Hint: none of them).

Do I even want to keep a shadow copy of “i” in parallel with my “A” pointer I’m
rolling through the loop? It’s pure overhead, unless I breakpoint and want to
have some place to read “i”…. but even then I cannot change this shadow “i”
because it won’t change my “A” pointer.

What about changing N? Do I have enough info in the register mapping to
describe all possible kinds of updates I might want to do to register RC when I
tweak N? This example is simple, but optimizing compilers can do hella complex
code changes.

How does a Safepoint help here? Well, for starters it points out that I do
indeed need some way to find “i”, and if I change it then propogate those
changes back into the loop. Lets look at this code with a Safepoint in it.

… some setup, zero-trip test, range-check on “N” vs “A.length”, etc
  ld  RA=&A        // Classic strength-reduced pointer into array A
  add RC,RA,#N*8 & (-1-3)  // mask of low 2 bits of iteration - unroll by 4
  ld  RB=0         // sum
  ld  RI=0         // Concession to Safepoint: keep “i” around
loop:
  prefetch [R1+off+32]  // Fetch next cache line… overlap memory & compute
  ld  R0=[RA+off+ 0]    // Get 4 values array A
  ld  R1=[RA+off+ 8]
  ld  R2=[RA+off+16]
  ld  R3=[RA+off+24]
  add R4=R0,R1  // add-tree to allow more parallel adders
  add R5=R2,R3  // add-tree to allow more parallel adders
  add RB=RB,R4  // partial sum into RB
  add RB=RB,R5  // full sum into RB
  add RA=RA,8*4 // Skip over 4 elements of array A
  add RI=RI,4   // Consession to Safepoint: keep “i” around
  SAFEPOINT: RI contains “i”, RB contains “sum”, “some place in the stack” contains “N”
  cmp RA,RC     // Stop when at the end
  blt loop
  … cleanup code for 0-3 more iterations
  // RB contains sum

So I added 1 op in my hot loop to compute “i” on the fly. Not too expensive; 1
clock per cache-line of data. In general we end up keeping alive what is
otherwise programatically dead “just in case” the user stops the running code
and wants to inspect (dead) variables – but this is cheap. Just spill ‘em off
to stack somewhere and flag ‘em in the Safepoints variable map.

How do we trigger the Safepoint? Easiest way is to have the safepoint code
test some thread-local bit for a “stop” bit, and if found… stop. If we have
to stop we might have a lot of cleanup to do, but stopping is rare so the cost
(of mostly not stopping) is low. There’s lots of ways to make the safepoint
check cheap. Here’s one that depends on keeping a 2Meg (TLB-sized) window
above the stack pointer mapped in or out – perfect for an X86 TLB:

ld RX,[RSP+2M] // will trap if we map this page out, then catch the SEGV & handle

How do we use the Safepoint? Once we’ve stopped the running thread mid-loop,
we expect the trap handler to have saved off all the registers. Later we can
inspect the register map to find “i” or “sum”. Note that a Java-level query
cannot ask for our “RA” register – as it’s the sum of “&RA” and “i*8” and isn’t
visible at the language level – so “RA” does not appear in the Safepoint’s map.

How do we change “i” or “N” and have it All Just Work? Basically, we re-enter
a clone of the loop from above…. a clone specially made to be re-entered at
any iteration. The loop setup code in the clone will rebuild “RA” and “RB” and
any other intermediate state as needed, before falling into essentially the
same code as the normal loop (and indeed they might be shared here!). Or in
other words – since the optimizer did something complex to build “RA” from “&A”
and “i” – have the optimized do it again for any arbitrary new “i”.

The general plan is simple (from the optimizing compiler’s point-of-view):
Safepoints are treated like a function-call only taken on a very very rare
execution path. Safepoints take as inputs all visible language-level state,
including the state of all inlined function calls, and generally are assumed to
update all memory state but not any local vars (if you are also hacking local
variable state via the debugger interface, then we need to go a step further
and deopt/reopt – leading to the loop clone mentioned above).

With this magical “low frequency function call” in hand, the optimizer attempts
the best optimizations it can, subject to the normal constraints that the call
arguments are available somehow.

Hope this helps, Cliff

Published by

wpengine

This is the "wpengine" admin user that our staff uses to gain access to your admin area to provide support and troubleshooting. It can only be accessed by a button in our secure log that auto generates a password and dumps that password after the staff member has logged in. We have taken extreme measures to ensure that our own user is not going to be misused to harm any of our clients sites.