Final Fields, Part 2

I’ve been having waaaaaay too much fun this month, dealing with “final” fields – final is in quotes because I’ve been finding waaaaay too many Generic Popular Frameworks (TM) that in fact write to final fields long long after the constructor has flowed under the bridge.  Optimizing final fields is in theory possible, but in practice it’s busting a Lot of Popular Code.

From Doug Lea:

It might be worse than that. No one ever tried to reconcile JSR133 JMM JLS specs with the JVM specs. So I think that all the JVM spec says is:
Once a final field has been initialized, it always contains the same value.

Which is obviously false ( etc).

De-serialization plays nasty with final fields any time it has to re-create a serialized object with final fields.  It does so via Reflection (for a small count of objects), and eventually via generated bytecodes for popular de-serializations.  The verifier was tweaked to allow de-serialization generated bytecodes to write to final fields… so de-serialization has been playing nasty with final fields and getting away with it.  What’s different about de-serialization vs these other Generic Popular Frameworks?  I think it’s this:

De-serialization does an initial Write to the final field, after construction but before ANY Read of the field.

These other frameworks are doing a Read (and if it is null), a Write, then further Reads.  It’s that initial Read that returns a NULL that’s tripping them up, because once the code is JIT’d, that NULL is the value used for some of the later Reads.
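To make the Read / Write / Read shape concrete, here is a minimal sketch. The names `Holder`, `payload` and `LateFinalWrite` are hypothetical and not taken from any particular framework, and the sketch assumes a JVM that still permits reflective writes to final instance fields after `setAccessible(true)`.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a framework-managed object.
class Holder {
    final Object payload;               // declared final, but the constructor leaves it null
    Holder() { this.payload = null; }
}

class LateFinalWrite {
    // Read the final field, write it reflectively if it is null, then read it again.
    static Object readWriteRead(Holder h, Object value) {
        try {
            Field f = Holder.class.getDeclaredField("payload");
            f.setAccessible(true);      // needed before a reflective write to a final instance field
            if (h.payload == null) {    // initial Read: sees null on a fresh Holder
                f.set(h, value);        // late Write, long after the constructor finished
            }
            return f.get(h);            // reflective Read: sees the written value
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A JIT that trusted `final` would be entitled to reuse the null from the initial direct read of `h.payload` for later direct reads, which is the failure described above. The sketch returns the value through a reflective read so that it does not depend on that choice.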

Why bother?  What’s the potential upside to using final fields?

  • Expressing user intent – but final fields can be set via Reflection, JNI calls, & generated bytecodes (besides the “normal” constructor route), hence they are not really final.  It’s more like C’s “const”, just taking a little more syntax to “cast away const” and update the thing.
  • Static final field optimizations (à la Java asserts).  Java asserts crucially rely on the JVM & JIT to load these values at JIT-time and constant-fold away the turned-off assert logic.
  • Non-static final field optimizations.  This is basically limited to Common Subexpression Elimination (CSE) of repeated loads, and then the chance to CSE any following chained expressions.
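The static final case can be sketched with a hand-rolled guard in the style of an assert. The class name and the `demo.checks` property are made up for illustration. Whether a given JIT actually folds the branch away depends on the JVM.

```java
class StaticFinalGuard {
    // Read once, at class initialization. "demo.checks" is a hypothetical property name.
    static final boolean CHECKS = Boolean.getBoolean("demo.checks");

    static int half(int x) {
        if (CHECKS) {   // a JIT may treat CHECKS as a constant and drop this whole branch
            if ((x & 1) != 0) {
                throw new IllegalArgumentException("odd input: " + x);
            }
        }
        return x >> 1;
    }
}
```

With the property unset, `CHECKS` is false and the check is dead code that the JIT can remove, much as it does for disabled asserts.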

I claim the gain from this last one is almost nil in normal Java code.  Why are non-static final field optimizations almost nil?  Because not all instances of the same class hold the same value in the field, hence there is no compile-time constant and no constant-folding.  Hence the field has to be loaded at least once.  Having loaded a field once, the cost to load it a 2nd time is really really low, because it surely hits in cache.  Your upside is mostly limited to removing a 1-cycle cache-hitting load.  For the non-static final field to represent a significant gain you’d need these properties:

  • Hot code. By definition, if the code is cold, there’s no gain in optimizing it.
  • Repeated loads of the field.  The only real gain for final-fields is CSE of repeated loads.
  • The first load must hit in cache.  The 2nd & later loads will surely hit in cache.   If the first load (which is unavoidable) misses in cache, then the cache miss will cost 100x the cost of the 2nd and later loads… limiting any gain in removing the 2nd load to 1% or so.
  • An intervening opaque operation between the loads, like a lock or a call.  Other operations, such as an inlined call, can be “seen through” by the compiler and normal non-final CSE will remove repeated loads without any special final semantics.
  • The call has to be really cheap, or else it dominates the gain of removing the 2nd load.
  • Cheap-but-not-inlined calls are hard to come by, requiring something like a mega-morphic v-call returning a trivial constant which will still cost maybe “only” 30 cycles… limiting the gain of removing a cache-hitting 1-cycle repeated final-field load to under 5%.
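The shape these properties describe looks roughly like the sketch below. `Op` and `Scaled` are hypothetical names. The interface call is only opaque if the JIT declines to inline it, for example because the call site sees many implementations.

```java
// Hypothetical interface; with many implementations the call site becomes megamorphic.
interface Op {
    int apply(int x);
}

class Scaled {
    final int scale;

    Scaled(int scale) { this.scale = scale; }

    int twice(Op op, int x) {
        int a = x * scale;      // first load of the final field (unavoidable)
        int b = op.apply(a);    // if not inlined, the compiler cannot see what this call writes
        return b * scale;       // second load: removable only if the JIT trusts 'final'
    }
}
```

Using the rough numbers from the list, skipping the second 1-cycle load next to a call of about 30 cycles saves on the order of 1/30 of the time, which is where the "under 5%" figure comes from.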

So I’ve been claiming the gains for final fields in normal Java code are limited to expressing user intent.  This we can do with something as weak as a C++ “const”.  I floated this notion past Doug Lea and got this back:

Doug Lea:

{example of repeated final loads spanning a lock}

… And I say to them: I once (2005?) measured the performance of adding these locals and, in the aggregate, it was too big of a hit to ignore, so I just always do it. (I suppose enough other things could have changed for this not to hold, but I’m not curious enough to waste hours finding out.)

My offhand guess is that the cases where it matters are those in which the 2nd null check on reload causes more branch complexity that hurts further optimizations.
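Doug's elided example is not reproduced here. As a hypothetical illustration of the "adding these locals" idiom he mentions, the sketch below copies final fields into locals before taking a lock, so the code after the lock never has to reload them. `LockedCounter` and its fields are invented for this sketch.

```java
import java.util.concurrent.locks.ReentrantLock;

class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private final int[] cell = new int[1];      // final reference to mutable state

    int increment() {
        final ReentrantLock lock = this.lock;   // copy the finals into locals once
        final int[] cell = this.cell;
        lock.lock();                            // opaque operation between the uses
        try {
            return ++cell[0];                   // uses the local; no reload of this.cell
        } finally {
            lock.unlock();                      // uses the local; no reload of this.lock
        }
    }
}
```

The locals do by hand what a JIT that trusted `final` could do on its own: carry the loaded values across the lock.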

Charles Nutter added:

I'll describe the case in JRuby…
In order to maintain per-thread Ruby state without constantly hitting thread locals, we pass a ThreadContext object along the stack for almost all calls.  ThreadContext has **final** references to the JRuby runtime object it is associated with, as well as commonly used literal values like “nil”, “true”, and “false”.  The JRuby runtime object itself in turn has **final** references to other common literal values, JRuby subsystems, and so on.
Now, let's assume I'm not a very good compiler writer, and as a result JRuby has a very naive compiler that's doing repeated loads of those fields on ThreadContext to support other operations, and potentially repeatedly loading the JRuby runtime in order to load its **finals** too.  Because Hotspot does not consider those repeat **final** accesses that are *provably* constant (ignoring post-construction final modification), they enter into inlining budget calculations.  As you know, many of those budgets are pretty small…so essentially useless repeat accesses of **final** fields can end up killing optimizations that would fire if they weren't eating up the budget.
If we're in a situation where everything inlines no matter what, I'm sure you're right… the difference between eliding and not eliding is probably negligible, even with a couple layers of dereferencing.  But we constantly butt up against inlining budgets, so anything I can possibly do to reduce code complexity can pay big dividends.  I'd just like Hotspot in this case to be smarter about those repeat accesses and not penalize me for what could essentially be folded away.

To summarize: JRuby makes lots of final fields that really ARE final, and they span not-inlined calls (so require the final moniker to be CSE’d), AND such things are heavily chained together so there’s lots of follow-on CSE to be had.  Charles adds:

JRuby is littered with this pattern more than I’d like to admit.  It definitely has an impact, especially in larger methods that might load that field many times.  Do the right thing for me, JVM!

So JRuby at least precisely hits the case where final field optimizations can pay off nicely, and Doug Lea locks are right behind him.
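A rough sketch of the chained-finals pattern follows. Every name in it is a hypothetical stand-in, not JRuby's real classes or API. The naive version re-walks `context.runtime.nil` after each call, while the hoisted version does by hand what a JIT that trusted `final` could do automatically.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-ins; these are not JRuby's actual types.
class DemoRuntime {
    final Object nil = new Object();    // stands in for a commonly used literal
}

class DemoContext {
    final DemoRuntime runtime;
    DemoContext(DemoRuntime runtime) { this.runtime = runtime; }
}

class NaiveEmitter {
    // Naive shape: two chained final loads after every (possibly not-inlined) call.
    static int countNilsNaive(DemoContext context, List<Supplier<Object>> sources) {
        int n = 0;
        for (Supplier<Object> s : sources) {
            Object v = s.get();                     // opaque call between the loads
            if (v == context.runtime.nil) n++;      // reloads runtime, then nil
        }
        return n;
    }

    // Hand-hoisted shape: the chain is walked once, before the loop.
    static int countNilsHoisted(DemoContext context, List<Supplier<Object>> sources) {
        final Object nil = context.runtime.nil;
        int n = 0;
        for (Supplier<Object> s : sources) {
            if (s.get() == nil) n++;
        }
        return n;
    }
}
```

Both methods return the same count as long as nobody writes to the final fields after construction. The difference is in how many loads the compiler has to account for.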

Yuck.  Now I really AM stuck with doing something tricky.  If I turn off final field optimizations to save the Generic Popular Frameworks, I burn JRuby & probably other non-Java languages that emit non-traditional (but legal) bytecodes.  If I don’t turn them off, these frameworks take weird NULL exceptions under load (as the JIT kicks in).  So I need to implement some middle ground… of optimizing final fields for people who “play by the rules”, but Doing The Expected Thing for those that don’t.