# Fixing The Inlining "Problem"

Function inlining in JVMs is a solved problem, right?  It’s a key performance optimization routinely done by JIT’s everywhere (some might say: THE key optimization).  Inlining has more than a decade of fine tuning in the Java context and over 40 years of production experience in the compilers and systems before that.  JIT’s routinely inline large functions (thousands of bytecodes), inline nested functions many layers deep, and even inline functions which are only known dynamically – based on type profiling information.  JIT’s may use Class Hierarchy Analysis or aggressive type analysis to decide the possible targets of a function call, they might inline multiple targets and use a runtime test to decide the correct flavor, and/or have multiple fall-back positions if the actual runtime types don’t match the profiled types.  Non-final functions are routinely inlined, and the JVM Does The Right Thing if the function is later overridden.  With all this going on, what’s “The Problem” with inlining?

“The Problem” is simply this: new languages on the JVM (e.g. JRuby) and new programming paradigms (e.g. Fork Join) have exposed a weakness in the current crop of inlining heuristics.  Inlining is not happening in a crucial point in hot code exposed by these languages, and the lack of inlining is hurting performance in a major way.  AND the inlining isn’t happening because The Problem is a hard one to solve; (i.e. it’s not the case that we’ll wave our wands and do a quick fix & rebuild HotSpot and the problem will go away).   John Rose, Doug Lea and I all agree that it’s a Major Problem facing JVMs with long and far reaching implications.

The Problem is getting the right amount of context in hot inner loops – which also contain a mega-morphic virtual call in the loop and not much else.  Let me give you an example.  Here’s a simple loop doing “bitblit”: mashing two rectangles of bits (an image) together using some function chosen before the loop, in this case OR’ing two rectangles together essentially merging the image.  This is a function done constantly by all graphics libraries.  Even as you read this in your browser, the GUI is busy merging rectangles of bits as you scroll the blog around.

  // The function in the inner loop
long **or_word**( long a, long b ) { return a|b; }
// The iterator function
void blit( long[] dst, long[] src1, long[] src2 ) {
for( int i=0; i<dst.len; i++ )        // simple loop
dst[i] = **or_word**(src1[i],src2[i]);  //   around a simple loop body
}

Inlining the function “or_word” is crucial to performance here.  Without inlining the compiler doesn’t know what the loop body does (because function calls can in general do anything), and with inlining it can understand the entire function completely – and then see it’s a simple loop around a stream of array references.  At this point the JIT can do range-check elimination, loop unrolling, and prefetching – leading to an easy 10x speedup over the not-inlined loop.  We are so used to the performance level, we’re not realizing that lots of optimizations have to happen “just right” to make this go fast.

At this point it’s a no-brainer: inline, optimize and performance is good!  Baseball, apple pie, Mom, what’s the problem?  The Problem is, is that there are multiple variations on “or_word” AND the wrapping iterator gets complex.  It’s the product of these two parts getting complicated that makes The Problem.

Suppose the wrapping blit iterator gets complex – it’s walking all the bits in a canvas of partially obscured rectangular images, and each image is a collection of 24 bits worth of pixels and is possibly boxed or bounded by some other rectangles:

  // A sample complex iterator function
void blit( Rectangle bounding_box, Image dst, Image src1, Image src2 ) {
for( int i=bounding_box.low; i<bounding_box.high; i++ )
for( int j=bounding_box.left; j<bounding_box.right; j++ ) {
idx = i*bounding_box.width+j;
if( dst.is_rgb_pixels ) {
dst.red  [idx] = **or_word**(src1.red  [idx],src2.red  [idx]);
dst.blue [idx] = **or_word**(src1.blue [idx],src2.blue [idx]);
dst.green[idx] = **or_word**(src1.green[idx],src2.green[idx]);
} else if( dst.is_another_kind_of_pixels ) {
....other stuff....
}
}
}
}

Or imagine this function is really the iterator over a Fork-Join style parallel operator, doing a recursive decent breakdown of the Images – with decisions being made about how many CPUs, whether to split the current work pile, or do it all on one Thread, or join finished work bits together.   Such iterators might get really complex.

Now on top of the complex iterator we want to do something other than “or_word”.   We might also want to “and_word” (good for masking out images), “xor_word” (merging images), “scale_word”,  “filter” and so on.  What we really want is a way to pass in the function to apply on the basic data bits in the innermost loop of our complicated iterator.  In Java we often do this with either a Callable or a Runnable:

  // A sample complex iterator function
void blit( CallableTwoArg fcn2arg, Rectangle bounding_box, Image dst, Image src1, Image src2 ) {
...
dst[idx] = fcn2arg.call(src1[idx],src2[idx]);
...
}

Great use of Abstraction!  We need only 1 copy of our large complicated iterator, and we can do all kinds of things with Images.  Alas, that inner loop now contains a function call that needs inlining… and there are dozens of different functions for “fcn2arg.call”.  The JIT does not know which one to inline here – because all dozen different functions are called at various times.  Typically then the JIT does not inline any of them, and instead opts for a virtual call.  Alas, while the virtual call itself isn’t too bad the lack of knowledge of what goes on inside the virtual call prevents all kinds of crucial optimizations: loop unrolling, range-check elimination, all kinds of prefetching and alias analyses.  In short, we lose our 10x speedup.

How do we get it back?  One way might be to make the inner function call a static (final) call – then the JIT can inline and voila!  apple pie again!  Of course, if we do that we need an iterator for the “or_word” version, and one for the “and_word” version and one for the “xor_word” version and… we need a lot of them.  Worse: we need a new one for each new function we can think of; we can’t just name them all up front.  So we’ll end up with a ton of these iterators each with a custom call in the inner loop, plus some way to generically make more of them.   Too many iterators and we start blowing out the i-cache on our CPU (and that will cost us 10x by itself), and besides it’s a pain to maintain dozens of cloned complex iterators.  Blanket inlining is not the answer.

We don’t really need to clone all those complex iterators – we really only need to clone the hottest inner loop parts, not all the complex wrappers around the lowest levels of looping.  And we don’t really need to clone for all possible functions, just the ones that are getting called by the current program.  After all, that’s one of the reasons to JIT: you should only end up compiling the parts of your program that are really using the CPU cycles.

What we’d really like is some sort of more controlled code cloning – one that allows inlining of megamorphic function calls in hot inner loops, and does so under control of the JIT and JVM proper – which can profile and pick the loops that need cloning.  John Rose, Doug Lea and I have been going ‘round and ‘round on the right way to do this.   I have my favorite solution (coming up).  I’m not going to try and explain the alternatives here – see http://groups.google.com/group/jvm-languages/browse_thread/thread/72a5f4b16eba3505 and  JSR335 for some details.  I’m blogging here to raise awareness and to educate people on what’s going on – because there are serious long term implications for the Java community at work here!

My take on the right way to Solve This: ask programmers to write their programs in a “megamorphic inlining friendly” coding style, and the JITs can then optimize the code.

  // A sample complex iterator function
void blit( CallableTwoArg fcn2arg, Rectangle bounding_box, Image dst, Image src1, Image src2 ) {
... complex stuff...
// This code gets cloned and specialized according to "fcn2arg"
inner_blit( fcn2arg, dst, long src1, long src2, lo, hi );
... more complex stuff..
}
// This code holds a hot inner loop, with simple loop bounds and a megamorphic function body.
// This code can be highly optimized only if the call in the inner loop gets inlined.
// The JIT is strongly encouraged to recognize this code pattern and Do The Right Thing.
void inner_blit( CallableTwoArg fcn2arg, long dst[], long src1[], long src2[], int lo, int hi ) {
for( int i=lo; i<hi; i++ )
dst[i] = fcn2arg(src1[i],src2[i]);
}

Basically the JIT can see a hot loop around a megamorphic call, the entire method (not counting what is getting called) is small and mostly consists of array references and a loop.  The megamorphic call is loop-invariant: the target is unchanged by any loop iteration.  Furthermore it’s passed in as an argument to the function.  Now all the JIT has to do is compile versions of inner_blit specialized by the first argument, plus some sort of 1st-argument dispatching logic.  Like all such compilations it’s driven by frequency and profiling.  Since the inner_blit code is small it can be cloned without risking i-cache blow-out.

(Update: 4/8/2011 – Remi Forax plays with this idea)

In short, we’d get a controlled amount of code cloning plus all the right levels of inlining to get our 10x performance boost back.  And we can do it without changing either the JVM spec nor the Java language.  JVMs that do not support such specialized cloning will run the code with the same performance they always have.  JVMs WITH such support will run this code much much faster… encouraging JVMs everywhere to be improved to speed up this coding style  (or pass by the historical wayside).

Cliff