Odds-n-Ends 2

A collection of recent questions & answers…

These first two comments/questions come from this TSS thread:

Java bytecode is NOT suited to be run on real hardware. It’s stack-based, so pipelining goes out of the window. In theory, one can do on-the-fly translation from stack-based to register-based machine, but it’ll require A LOT of transistors.  So in reality, it’s ALWAYS more effective to JIT-compile Java bytecode and then run it on a common CPU.

Azul’s CPUs are classic 3-address RISCs, with very few special-for-Java features.  Our addressing math matches 64-bit Java math exactly (and pretty much nobody does; saves 1 op per array address math).  We allow meta-data in the pointers which the LD/ST unit strips off.  We have read & write barriers in hardware (these last are really for GC and not Java).  We have a fast-inline-cache instruction.  We have a fast-not-contended-CAS for all those not-contended Java locks.

We do NOT have special hardware to parse or execute Java bytecodes in any way.  Like the speaker pointed out, it’s plain faster and simpler to JIT.

Of course, hardware can still provide a few features to speed up JVMs, like hardware-assisted forwarding pointers which allow creating a fast real-time compacting pauseless GC (I assume Azul hardware has this support).
No, our hardware assist is a read-barrier op.  The hardware detects when a GC invariant is being blown on a freshly loaded pointer – and traps to a software routine.  1 clk if nothing is wrong (by far, vastly far, the common case) and 4 clks to enter the trap routine if there’s an issue.  Software does all the fixup, including relocating objects or adjusting pointers or marking or whatever.  Generally fixup is done in <100 clks (but not always, rarely an object-copy is required).
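In software terms, the check the hardware performs might look something like the following. This is a hypothetical Java sketch with a made-up `NMT_BIT` metadata bit; Azul's real barrier is a single hardware instruction, and the actual invariant bits and fixup logic are not shown here.

```java
// Hypothetical software analogue of a hardware read-barrier:
// every freshly loaded reference is checked against a GC invariant;
// the rare failing case "traps" into a software fixup routine.
public class ReadBarrier {
    static final long NMT_BIT = 1L << 62;  // made-up "not-marked-through" metadata bit
    static long expectedNMT = 0;           // flips each GC cycle

    // Fast path: one compare, the overwhelmingly common case.
    static long loadBarrier(long ref) {
        if ((ref & NMT_BIT) == expectedNMT)
            return ref;                    // invariant holds: nothing to do
        return fixup(ref);                 // rare: software does all the fixup
    }

    // Slow path: relocate the object, adjust the pointer, mark, etc.,
    // then hand back a corrected reference.
    static long fixup(long ref) {
        return (ref & ~NMT_BIT) | expectedNMT;
    }
}
```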

“What is Azul’s support for Stack Allocation?”

Azul’s support for stack allocation is really support for Escape Detection – fast detection of when an object is escaping a stack lifetime.  Sun has been trying to add a classic Escape Analysis (EA) to HotSpot for quite some time; it’s not on by default now.  Research from IBM some years ago showed that it’s really, really hard to make EA effective on large programs, although it works really well on lots of microbenchmarks.
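A toy illustration of what EA has to prove (hypothetical `EscapeDemo` class, not HotSpot code): an allocation whose reference never leaves the frame can be stack-allocated or scalar-replaced; one store to a field that outlives the frame and the object escapes.

```java
public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Point escaped;   // any store here makes the object escape

    // The Point never leaves this frame: a classic EA can prove it
    // non-escaping and scalar-replace it (no heap allocation at all).
    static int distSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    // This Point escapes via a static field, so it must be heap-allocated.
    static void publish(int x, int y) {
        escaped = new Point(x, y);
    }
}
```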

“Do you know how Sun is implementing Tiered Compilation?”

I think Sun is heading for a cut-down C2 as their Tier 1 and dropping client compiler support; Tier 0 is still the interpreter, and Tier 2 would be the full-blown C2.  Azul is using the C1/client JIT as our Tier 1.  We insert counters in the JIT’d C1 code and profile at “C1 speed”.

“Do you do any self-modifying code?”

Self-modifying code has been the HotSpot norm for years, and appears to be common in managed runtimes in general.  There are a number of cases where we can “shape” the code but want to delay final decisions on e.g. call-targets until some thread really makes that call.  Having the call initially target the VM gives the VM a chance to intervene (do class loading & init, etc).  Then the machine call instruction is patched to point to the “full speed” target.  C1 does this kind of delayed patching more aggressively, and is willing to generate code for classes that are not yet loaded – so field offsets are not even known, filling in that detail after the class finally loads.  For all HotSpot JITs, call-targets come and go; classes unload; code gets re-profiled and re-JIT’d etc.  Any time a call target changes, the calls leading to it get patched.

“I always thought that a cas operation would be implemented by the memory controller. “

Every CAS that I’m aware of is implemented in the cache-coherency protocol and not the memory controller.

Azul’s CAS is in general quite cheap: ours can ‘hit in L1 cache’ if it’s not contended.  If it hits-in-cache then it’s 3 clocks (just a read/modify/write cycle).  If it misses in cache then of course it costs whatever a cache-miss costs.
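In Java that primitive is surfaced through `java.util.concurrent.atomic`; a minimal sketch of the classic CAS retry loop (hypothetical `CasCounter` class), which pays only the cheap hit-in-cache cost when uncontended:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Classic CAS retry loop: read, compute, attempt the swap, retry on contention.
public class CasCounter {
    private final AtomicInteger count = new AtomicInteger();

    public int increment() {
        while (true) {
            int old = count.get();
            // The CAS: cheap when uncontended (hits in cache),
            // pays a cache-miss when another core owns the line.
            if (count.compareAndSet(old, old + 1))
                return old + 1;
            // CAS failed: another thread won the race; reload and retry.
        }
    }

    public int get() { return count.get(); }
}
```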

“Why is ref-counting hard to use in managing concurrent structures?”

The usual mistake is to put the count in the structure being managed.  The failing scenario is one thread getting a pointer to the ref-counted structure at about the same time as the last owner is lowering the count and thus preparing to delete.  Timeline:

- T1 gets a ptr to the ref-counted structure.
- T2 has the last read-lock on the structure and is done with it.
- T2 lowers the count to zero (T1 still holds a ptr to the structure but is stalled in the OS).
- T2 frees the structure (T1’s pointer is now stale).
- Some other thread calls malloc and gets the just-freed memory holding the structure.
- T1 wakes up and increments where the ref-count used to be (but it is now in some other thread’s memory).
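The broken pattern can be sketched in Java-flavored code (hypothetical `RefCounted` class; Java’s GC hides the use-after-free, so comments mark where a C/C++ version dies):

```java
import java.util.concurrent.atomic.AtomicInteger;

// The anti-pattern described above: the count lives INSIDE the object
// it is supposed to guard, so a racing reader can touch freed memory.
final class RefCounted {
    final AtomicInteger refs = new AtomicInteger(1);
    // ... payload fields ...

    void acquire() {
        // T1 runs this increment... but in C/C++, if T2 has already
        // dropped the count to zero and freed the structure, this
        // write lands in recycled memory belonging to someone else.
        refs.incrementAndGet();
    }

    void release() {
        if (refs.decrementAndGet() == 0) {
            // ...because T2 reaches here first: in C/C++ the structure
            // is free()d now, and T1's pointer is stale.
        }
    }
}
```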

The other bug-a-boo is that you need to either lock the counter or use atomic CAS instructions to modify it.

“Hi!  I just implemented (cool concurrent Java thing X) in C!  What do you think?”

A word of caution: C has no memory model.  Instead, it has ‘implementations’.  Implementations can differ and they ESPECIALLY differ around any kind of concurrent programming.  Different C compilers and different CPUs will vary wildly on what they do with racing reads & writes.  e.g. what works using gcc 4.2 on an x86 will probably fail miserably on an ARM or even on an x86 using Intel’s reference compiler.  I personally wrote C/C++ optimizations for IBM’s Power series that would break the Java Memory Model, and any straightforward port of a concurrent Java program to C/C++ using those compilers would break subtly (only under high optimization levels and high work loads).

In short, you probably have an implementation of (cool concurrent thing X) in C that works for the compiler & hardware you are using – but recompiling with a different rev of the compiler, or running on different hardware, will require re-testing and re-verification.  You cannot rely on the language to give you reliable semantics.

Yes, I am well aware of the ongoing effort to give C & C++ a useful memory model like Java’s.  However, to support backwards compatibility the proposed spec is broken in a bunch of ways – e.g. for any program with data races “all bets are off”.  Many, many Java programs work with benign races, and ports of these algorithms to C will fall into the “all bets are off” black hole.
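The contrast can be made concrete: even in the presence of races, Java’s memory model gives defined semantics and ordering guarantees. A minimal sketch (hypothetical `SafePublication` class) of a guarantee Java makes that pre-C11 C does not:

```java
// In Java, a volatile store/load pair establishes happens-before:
// once a reader observes ready == true, it is GUARANTEED to see data == 42.
// A pre-C11 C compiler is free to reorder the two stores and break this.
public class SafePublication {
    static int data;
    static volatile boolean ready;

    static void writer() {
        data = 42;
        ready = true;        // volatile store: publishes data to any reader
    }

    static Integer reader() {
        // If true is observed, the write of 42 is visible; else not ready yet.
        return ready ? data : null;
    }
}
```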

“Re: Non-Blocking Hash Map – Why not regard a null value as deleted (as opposed to a special TOMBSTONE)? “

I’ve gone back and forth on this notion.  I think a ‘null’ for deleted works.  You’d need to support a wrapped/Primed null.  I’d need to run it through the model-checker to be sure but I think it all works.  Looking at the 2008 JavaOne slides, it would drop the number of states from 6 to 5.

“Re: Non-Blocking Hash Map – I don’t fully get why during resize you allow some threads to produce a new array, while you do make other threads sleep.  Why not limit to 1 producer?”

It’s one of those tricks you don’t ‘get’ until you try an NBHM under heavy load on large systems.  If you don’t allow multiple threads to produce a new array – then you aren’t NON-blocking, because the not-allowed threads are …, well, blocked.  If you allow 1000 cpus (yes, Azul Systems makes systems with nearly that many) to resize a 1Gig array (yes, we have customers with NBHMs this size) and they all do it at once – you get 1000 cpus each making a request for a 2Gig array – or 2 Terabytes of simultaneous allocation requests.  Bringing the sizes down to something commonly available: if 16 cpus (think dual-socket Nehalem EX) request to resize a 100Meg array, you just asked for 3.2Gigs of RAM – which probably fails in a 32-bit process – despite the fact that in the end you only need 200Megs of array.

So instead I let a few threads try it – more than one, in case the 1st thread to get the ‘honors’ is pokey about it; generally 2 threads cut the odds of a bad context switch down into the noise – but if those few threads don’t Get The Job Done (e.g. allocate a new larger array) then every thread can try the resize itself – and thus no thread is ever really blocked.
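The scheme can be sketched as follows (hypothetical `BoundedResize` class, not the real NBHM code): early arrivals allocate immediately, late arrivals briefly yield hoping an early thread finishes, but everybody is ultimately allowed to allocate, so no thread is ever blocked.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// "A few producers first, then everybody": bounds the number of
// simultaneous speculative allocations without ever blocking a thread.
public class BoundedResize {
    final AtomicReference<Object[]> next = new AtomicReference<>();
    final AtomicInteger resizers = new AtomicInteger();

    Object[] resize(int newSize) {
        Object[] arr = next.get();
        if (arr != null) return arr;          // someone already succeeded

        if (resizers.incrementAndGet() > 2) {
            // Late arrival: yield briefly, hoping an early thread finishes...
            for (int i = 0; i < 100 && next.get() == null; i++)
                Thread.yield();
            // ...but if none has, fall through and allocate anyway:
            // every thread can always make progress, so none is blocked.
        }
        Object[] mine = new Object[newSize];  // speculative allocation
        if (next.compareAndSet(null, mine))   // first installer wins
            return mine;
        return next.get();                    // lost the race: use the winner's
    }
}
```

The losing threads’ speculative arrays simply become garbage, which is far cheaper than 2 Terabytes of simultaneous live allocation requests.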

Cliff