# Biased Locking

Recently I re-did HotSpot’s internal locking mechanism for Azul’s JVM. The old locking mechanism is approaching 15 years old and features a number of design decisions that are now out-dated:

1. Recursion counts are kept as a NULL word on the stack for every recursion depth (i.e., counting in Base 1 math) in order to save a few instructions and a few bits of memory. Both are now in vast plentiful supply. On the 1st lock of an object, it’s header is moved into the stack word instead of a NULL and this means that GC or other locking threads (or threads installing a hash code) all need to find and update the header word – which can now be “displaced”. This mechanism is complex, racey and error prone.
2. The existing mechanism requires a strong memory fence after a Compare-And-Swap (CAS) op, but on most machines the CAS also includes a memory fence. I.e., HotSpot ends up fencing twice for each lock acquire, once to CAS the header and again moving the displaced header to the stack. Each memory fence costs about a cache-miss on most X86 CPUs.
3. The existing mechanism uses “Thin Locks” to optimize for the very common case of a locked object never being contended. New in Java7, +UseBiasedLocking is on by default. This optimizes the common case even more by not using any fencing for locks which have never (yet) changed threads. (See this nice IBM paper on how to do it). The downside in the OpenJDK implementation is that when an object DOES have to change thread-ownership, the cost is so high that Sun has choosen to disable biased locking for whole classes of locks to avoid future thread-ownership-change costs.
4. When a lock does see contention it “inflates” and then the “inflated” lock is much more expensive than a fast-path “thin lock”. So even the smallest bit of contention will cause a lock to be much more expensive than the good case.
5. JVM internal locks and locked Java objects use 2 utterly different code bases. This adds a lot of complexity to an already complex system. The two classes of locks are used in slightly different ways and do have different requirements, BUT they both fundamentally implement a fast-path locking protocol over the OS provided locking abstraction. At Azul Systems, we found that these two locking systems have a lot more in common than they do in difference.

My new locking implementation has met a number of goals:

2. Only one strong memory fence to lock contended locks instead of 2.
3. The normal fast-path is as good as the +BiasedLocking case. Revoking a biased-lock is cheap enough to do it on a case-by-case basis.
4. Inflation happens due to a one-time contention, but then low-contention or no-contention behavior quickly reverts back to a cost which is nearly as good as the normal fast-path. i.e., uncontended inflated locks pay only an extra cache-hitting indirection load.
5. One code base for all locks

As with the OpenJDK, Azul’s new locks are implemented as some bits in every Object’s header word, plus an ObjectMonitor structure when the lock “inflates”. Internal VM locks use a plain Monitor which ObjectMonitor subclasses (actually most VM locks are a subclass of Monitor called Mutex which supports only plain lock actions; Monitors support wait & notify as well).

Within the 64-bit object header word (Azul uses only 1 header word instead of 2), we reserve 32 bits for locking & hashcodes. If all 32 bits are zero, the object is unlocked & un-hashed; new objects appear this way. If the low bit is set, the remaining 31 bits are a hashCode. If the low 2 bits are ‘00’ the object is BiasedLocked to a thread, and the remaining 30 bits are the thread ID of owning thread (we can compute the thread ID of a thread in 1 or 2 clock cycles). If the low 2 bits are ‘10’ the remaining 30 bits are the address of an ObjectMonitor structure; e.g. the low 32 bits of the header contains a C++ pointer to an ObjectMonitor, plus 2.

### The low 32-bit patterns of the object header are:

00000000 – The most common case is that on object is never locked and never hashed; it’s low 32bits of header remains zero throughout it’s lifetime.

xxTIDx00 – The next most common case is that the object is locked but not hashed, and never changes owners. Many many Strings fall into this category. The first lock acquire will CAS the owning thread into the header, and the object is now Biased-Locked. Future lock attempts will simply do a load/compare/branch to confirm that the object remains biased-locked to the owning thread. Literally the code to do this is:

  // R01 - holds object
ldu4 R03,[R01+4] // load low 32-bits
bne R03,R02,slow-path-locking // thread-id bits do not match?
// bits match, so we own the lock!

xxHASHx01 – The next case is that the object is hashed but not locked, and we are trying to get the hashCode:

  // R01 - holds object
ldu4 R02,[R01+4]    // load low 32-bits
beq  R02,0,not_hash // zero header so no hash
shr4 R02,1,   // low-bit to carry flag
bcc  not_hash // "branch carry clear", if low bit was zero it was not_hash
// R02 - holds 31 bits of hash

xxMONx10 – The next case is that the object requires an ObjectMonitor, either because it is both locked AND hashed, or because it was biased-locked once and saw contention. The monitor is a 32-bit pointer (so limited to the low 4Gig), but can be directly used. This snippet of code assumes we already failed the fastest path lock-acquire given above (the snippet is actually a called subroutine so the code size does not really matter):

  // R01 - holds object
// R03 - holds loaded header word which might be a monitor: slow-path-locking:
extract R04,R03,2 // test bit#1 for being set
bne R04,1,not_monitor // branch if we need to inflate the lock
// Here we have code to check for self-biased in the monitor:
beq R04,R02,lock_is_owned // thread-id bits match?
// and here we carry on with slow-path locking
// including testing for recursive locks and
// attempting CAS acquire, short spin-waiting, etc

And so on. If all the various fast-path attempts fail we eventually end up calling into the VM and blocking on the OS-provided mutex. Along the way we’ll attempt a bunch of other tests (e.g. recursion & spin-waiting) and if all things fail we need to block, we’ll also do a fast/short crawl of the call stack for profiling. There are a lot of other interesting issues here, including heuristics on when to make a lock not-biased by default (producer/consumer design patterns make objects which always change hands at least once), or when to deflate an inflated lock that has “settled” into a new thread (or unlocked state), or how to achieve a modicum of fairness under extreme contention. Since this blog is already getting long (and as raw Assembly code ta-boot!) I’ll stick with a short discussion of the bias-lock revoke mechanism.

## Revoking a Biased Lock

Suppose thread T1 holds the lock on Object O – via the very fast-path mechanism: O’s header word contains the bits “xxT1xx00”, and that thread T2 needs to acquire the lock on O. What kind of interactions are possible? For starters, assume T2 has failed the fast-path attempts (it’s got the wrong thread ID!) and already entered the slower-paths in the VM. A simple inspection shows T2 that O is biased-locked to T1. What does T1 know about this situation? Turns out that thread T1 knows almost nothing:

• T1 may have acquired O’s lock in the misty past and may not now have access to a pointer to O.
• T1 may be rapidly acquiring and releasing the lock on O, all without fencing or atomic operations. Moment by moment, cycle by cycle, the actual Java state for T1 may have it holding & releasing O’s lock. T1 isn’t really tracking this, he is just periodically observing that he holds the bias lock on O.
• T1 may be blocked in long latency I/O operations or on other locks.