Just fixed a 20-year-old bug…

I just fixed a 20-year old bug… that I wrote, and is in daily use by a few million people…

Some background:

[A HotSpot jargon section is included down below]

I’ve been chasing down a low frequency crash when running large Java apps in Azul’s HotSpot; periodically QA would get a crash in C2 (the “-server” compiler in HotSpot – and our HotSpot runs in “-tiered” mode by default, so by default C2 is always-on). The crash would appear in debug builds as a missing reaching-definition for some register during OopMap construction. In product builds the crash appears as a stale OOP, which can be caused by a broken OopMap. The crash was infrequent (once a week or so across all QA runs), and only appeared on very heavily loaded QA boxes, and was very rare in debug builds. After chasing this for nearly a year I finally got a core file on a debug build. I sat down with the core file and debugger and prepared to spend some time spelunking.

I first checked the C2 compiler basics: the compiler had stopped during the BuildOopMap phase with a missing reaching-definition assertion. The C2 IR (Sea of Nodes) was well-formed. C2 was building an OopMap for a Safepoint on the slow-path of a standard allocation. The live-value set included X86 register RAX (the BuildOopMap phase happens after register allocation, so the compiler is now referring to machine registers), and C2 was trying to find the reaching-definition to determine if an OOP had been defined (and live in register RAX and thus needing including in the OopMap). Slow-path allocation typically calls into the JVM and triggers a GC cycle (we get to the slow-path when the fast-path has failed to allocate an object).

After some more poking about in the core file, I suddenly snapped: slow-path calls use the standard X86 calling convention – and register RAX gets clobbered by the standard calling convention… hence RAX should NOT be alive now. C2 should have spilled around the slow-path call. (Oh great, I mentally groaned, a really rare register allocation bug). But before I started looking the register allocation pass, I decided to dig a little more – since instead RAX might be alive because I had broken live-ness information and live-ness is a lot easier to debug.

In the BuildOopMap phase, live-ness is computed using a standard bit-vector style dataflow analysis… another piece of very-old and reliably-working technology. I decided that it was unlikely liveness itself was broken. The live-ness bitmaps themselves are stored in a hash table mapping from Safepoints to the liveness bitmaps.

The hash table is implemented in a C++ ‘Dict’ (short for dictionary) – a piece of code I wrote while I was a grad student in Rice University more than 20 years ago. It’s a plain olde hash table written in very vanilla C++ code (almost a C dialect) – 20 years ago C++ compilers were more unstable than now and many new features were busted on different compilers. A portable hash table had to not hit very many language features if it wanted to be actually portable. I grabbed it as a reliable handy toolkit when I started working on the HotSpot server compiler more than 15 years ago; a copy is in hotspot/src/share/vm/libadt/dict.cpp.

Like any good hash table you have to provide the key-compare function, and if you screw that up you can end up pulling out wrong values – in this case I might be pulling out the wrong live-ness data if somehow I was using a broken key-compare. I certainly wasn’t ready to blame the Dict code (which had been reliably working for 20 years), but maybe I wrote a subtly wrong key-compare function for this usage in BuildOopMap… but no, I was using the default ‘cmpkey’ function which I also wrote 20 years ago. It’s a plain dumb compare-keys-for-equality function using straight-up pointer-compare.

The Bug

Realizing I hadn’t seen that code in 20 years I dug a little deeper: sure enough there’s the speed-hack I vaguely recall writing 20 years ago (and I defend that it was the Right Decision 20 years ago) – the pointers are compared by doing a plain ‘subtract’. A zero result means equals, and non-zero means not-equals… and then the results are cast to 32 bits.

 // Slimey cheap key comparator.
 int32 cmpkey(const void *key1, const void *key2) {
 return (int32)((intptr_t)key1 - (intptr_t)key2);
 }

20 years ago casting to 32 bits means *growing* the result size on 16-bit pointer models. But Azul Systems only does 64-bit JVMs… so now my results are getting truncated down to 32 bits. Keys that are equal in the low 32-bits but unequal in the high 32-bits would report back as 'equal' in the hash table. I sat and stared at it in stunned amazement… I'd been bit by a performance hack in 20-year old code that I wrote, and was in daily use by millions of people.

The Aftermath

All that was left was to confirm that this was indeed the result of my broken OopMaps (a painful GDB macro later and I had dumped all 10000 Nodes in this very large compile, and confirmed 6 pairs of collisions on the low 32-bits of Nodes’ addresses with different high 32-bits, and one of those pairs referred to a pair of unrelated Safepoints, and I was indeed getting the liveness bitmap for the wrong one).

I also google searched for copies of dict.cpp, and found several versions – both my broken version and a fixed one in OpenJDK. The OpenJDK fix is both more complicated than needed (don’t need a trinary result for a hash table) and slower and they removed my 20-year old comment word “slimey”. I decided to do a “proper” performance-hack version – one that passes modern portability standards and keeps the entirely appropriate word “slimey”. I used the new-fangled notion of ‘intptr_t’ for my results cmpkey results and keep my plain subtract for the compare function.

I made a new product build and QA is busily testing. Given the low failure rate here I’ll need to wait a few weeks until the bug is confirmed-fixed… but I feel confident that I nailed this one.

Cliff

Some HotSpot Jargon

GC – Garbage Collection. Hopefully if you are reading my blog you already know what this means! But I’ll add here anyways that many GCs are moving GCs – they move objects… and thus change object pointers (OOPs). To change a pointer you have to be able to find it… hence OopMaps.

OopMap – An OOP map is a list of machine registers holding OOPs at a particular point in the code. If a GC cycle needs to happen while the program is stopped at a Safepoint, GC will consult the OopMap to find all the pointers needing updating.

Safepoints – Points in the program where it is “safe” to stop the program and do a GC cycle, or deoptimize. In the HotSpot JVM, Safepoints have OopMaps and full-debug information (to support deoptimization).

stale OOP – GC moved the OOP but didn’t update this pointer. This pointer was left pointing to some random old place, and the OOP is was supposed to point to is somewhere else. Typically this is figured out a few trillion instructions after the object was moved and GC was supposed to have updated the pointer. These kinds of bugs are very hard to chase down.

debug vs product builds – A HotSpot debug build is a compilation of HotSpot with the C++ compiler flag “-g”, i.e. full debug information, and no optimizations on the VM itself (meaning internal VM operations such as GC or JIT’ing are very slow although JIT generated code quality is not changed. A product build is when HotSpot is compiled with the “-O2” flag – full optimization of the JVM itself with very poor internal debugability.

Reaching-definitions – ancient compiler terminology for some ancient compiler technology invented in the 60’s (so, you know, ancient). From the point of using some Value, the reaching-definition is the definition that reaches the use point. Java example: If you use a value X (e.g. “…f(x)…”, then the reaching definition might look like “x = blah();” – if this is the last time X was updated before reaching the usage point. Life gets complicated with control flow, hence reaching-definitions isn’t a trivial problem. In my case, I am using reaching-definitions to determine the definition for every live-value at Safepoints.

Live-values – Another ancient compiler technology telling you what values are alive (in-use or going-to-be-used) at any point in the program. It’s often used in conjunction with reaching-definitions to help with register allocation.

QA – Quality Assurance. The group at Azul that runs way too many JVMs on way to many cores with way to much stress / load / weird configurations. Hopefully these guys find the JVM bugs before you do.

C1, C2, Tiered – C1 is the HotSpot -client compiler. C2 is the HotSpot -server compiler. Tiered is Azul’s mode of running both JITs in the same JVM, using C1 generated code to do profiling for a following C2 JIT’ing. We typically see C1-startup speeds with better-than-plain C2 peak performance.