To Clone or Not To Clone…

I like the notion of using arrays to hold collections (efficiency (really cool paper), ease of access), but how do you incrementally migrate a collection from one array to another without blocking?  By its nature, you can’t atomically update the entire array in one go.  By the nature of concurrency, there can always be a ‘late-arriving writer’ who wants to update the old array and isn’t aware that a migration is in progress.  If you let the writer’s write proceed, you risk losing his update (the classic hard problem in non-blocking algorithms: the dropped update during a ‘resize’).  How do you get the late writer’s attention that something is going on, when (pretty much by definition) checking an unrelated volatile flag can’t do the job?

This deserves a little more discussion; checking the volatile flag narrows the race but never closes it: the writer checks the flag (and sees it clear), then the resizer sets the flag, then copies the old data to the new structure, then the writer (still acting on the clear flag he saw) updates the old data, and the resizer then announces the new structure as “the Real Data” – and the writer’s update has been silently dropped.  In a non-GC’d world it gets even worse: the resizer frees the old structure and the late writer updates into the dangling memory.

The idea here is to update something that the writer must also atomically update: the resizer CAS’s down a flag into the very memory that the late writer must also CAS.  The late writer either witnesses the flag (his update fails, and the flag tells him why), or his update “beat” the resizer’s work – in which case his update is now part of the state being copied/resized into the new array.

BUT (here’s the Non-Blocking catch-22), you can’t replace the existing data with just a flag without copying it first – because doing so means you’ve “hidden” the old data.  i.e., if you don’t copy the old data, it exists only in your internal thread state between the time you CAS down the flag and the time you publish the old data.  If the OS decides to swap out the copy-thread between those writes, the old data can remain hidden indefinitely.  Other threads can see the flag but not the data – so they get “blocked” waiting on you to publish the old data; e.g., a late reader has to wait/block until the OS allows the copy-thread to publish the data in the new structure.

My first cut of a NonBlockingHashMap solved this by copying the data to the new structure first, then going back and setting the flag in the old structure.  The problem with this approach is that a late writer can update the old structure – and the copying thread has to restart the copy.  This led me to a 2-phase-commit copy, with as many repeated copy attempts as there are late-arriving writers (the upper bound is the number of threads in the program).  A major downside to this approach is that if multiple racing writers are writing and multiple racing copy-threads are copying, then a reader racing against them all can see the writers’ values flip-flop more times than you might expect.  Technically this is legal by the JMM (since all the threads are racing against each other) but it’s not desirable.

So here’s the New Plan: CAS a flag into the old structure that does not hide the old data.  This means I need to add a bit to every word – easy in C (or most non-GC’d languages where pointers tend to be “plain”) but harder in a GC’d language.  I add the bit in Java by boxing the data – the presence of the box is the ‘flag’, but the data is still available inside the box.  (There’s this funny AtomicMarkableReference class that looks real close to what I want, but I see in the HotSpot source that it’s not intrinsified.)
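
To make this concrete, the box can be nothing more than a tiny final wrapper plus a CAS that swaps V for box{V}.  Here’s a minimal sketch; the names (Box, vals, boxValue) are mine for illustration, not necessarily what the real NonBlockingHashMap internals use:

  import java.util.concurrent.atomic.AtomicReferenceArray;

  class BoxSketch {
    // The box: its presence IS the flag, but the old data stays readable inside.
    static final class Box {
      final Object value;                      // a real Value or a Tombstone
      Box(Object value) { this.value = value; }
    }

    final AtomicReferenceArray<Object> vals;
    BoxSketch(int size) { vals = new AtomicReferenceArray<>(size); }

    // Resizer/copier: CAS the current value V into box{V}.  A late writer's CAS
    // of a plain new value now fails, because the word he expected has changed.
    Object boxValue(int idx) {
      while (true) {
        Object v = vals.get(idx);
        if (v instanceof Box) return ((Box) v).value;          // someone already boxed it
        if (vals.compareAndSet(idx, v, new Box(v))) return v;  // we boxed it
        // CAS failed: a late writer beat us; loop and box his newer value instead
      }
    }
  }

Note that the failed-CAS path is exactly the “his update beat the resizer’s work” case above: the loop simply picks up the newer value and boxes that instead, so nothing is ever dropped.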

The flag forces late writers to check for a new structure and put their updates in the new structure.  Since the old data remains available, no thread is ever ‘blocked’ reading the old data.  This also means that the old data can be copied into the new structure by a different thread than the one that set the flag – allowing for concurrent parallel copying, a requirement for the timely copying of very large arrays!


Here’s the big picture.  Key slots are unchanged from before: either null (slot empty), some Key (which never changes again), or the big X (meaning: go to the new table).  Value slots are a little different from before: either null (no value ever written), some Value or Tombstone (which may flip wildly), or finally a boxed{Value/Tombstone} (which never changes again).

New Key insertion is as before (reprobe to find an empty key-slot then CAS down the key; finding an X means retry operation in the new table).

If a thread finds the proper Key, it then checks the Value slot.  Normal Values are returned as a hit.  If a reader sees null or a Tombstone, it reports a miss.  If it sees a box, it retries in the new structure.  Same basic rules for a writing thread: you’re not allowed to update a box.
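
In code, those read/write rules come out roughly like this.  It’s only a sketch – Box, TOMBSTONE and the retry-in-the-new-table hook are stand-in names of my own choosing:

  import java.util.concurrent.atomic.AtomicReferenceArray;

  class SlotRules {
    static final Object TOMBSTONE = new Object();    // value-slot "deleted" sentinel
    static final class Box { final Object value; Box(Object v) { value = v; } }

    // Reader: a plain Value is a hit; null/Tombstone is a miss; a Box means the
    // slot has been claimed by the copy, so retry against the new table.
    Object getValue(AtomicReferenceArray<Object> vals, int idx) {
      Object v = vals.get(idx);
      if (v == null || v == TOMBSTONE) return null;
      if (v instanceof Box) return getFromNewTable(idx);
      return v;
    }

    // Writer: same rule - you are never allowed to update a Box in place.
    boolean putValue(AtomicReferenceArray<Object> vals, int idx, Object expect, Object newV) {
      if (expect instanceof Box) return false;       // go retry in the new table
      return vals.compareAndSet(idx, expect, newV);
    }

    Object getFromNewTable(int idx) { return null; } // stand-in for the real retry
  }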

Copying threads (a code sketch follows the list):

  1. Transition unused Key slots to X (CAS null->X)
  2. For used Key slots:
     • Set the flag (box the existing value) if not already done (CAS V->box{V})
     • Read the boxed value
     • If the boxed value is NOT a Tombstone, then:
       • Claim/find a Key slot in the new structure (CAS null->K)
       • Write the boxed value into the new structure, if nothing else is already there (CAS null->V)
       • If something is already in the new table, do nothing: a more recent update overwrote anything from the old table
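
Roughly, one slot-copy looks like the sketch below.  Again this is illustrative only: the real code has to reprobe by hash to find the right slot in the new table (here the target slot j is simply handed in), and X, TOMBSTONE and Box are my own stand-in names.

  import java.util.concurrent.atomic.AtomicReferenceArray;

  class CopySketch {
    static final Object X = new Object();          // key sentinel: "go to the new table"
    static final Object TOMBSTONE = new Object();  // deleted-value sentinel
    static final class Box { final Object value; Box(Object v) { value = v; } }

    // Copy old slot 'i' into slot 'j' of the new arrays.  Any number of threads
    // can race on the same slot; every CAS here tolerates losing the race.
    static void copySlot(AtomicReferenceArray<Object> oldKeys,
                         AtomicReferenceArray<Object> oldVals,
                         AtomicReferenceArray<Object> newKeys,
                         AtomicReferenceArray<Object> newVals,
                         int i, int j) {
      // Step 1: permanently close an unused key slot with X
      if (oldKeys.compareAndSet(i, null, X)) return;  // nothing to copy
      if (oldKeys.get(i) == X) return;                // someone else closed it

      // Step 2: box the existing value, whatever it currently is
      Object v = oldVals.get(i);
      while (!(v instanceof Box)) {
        oldVals.compareAndSet(i, v, new Box(v));      // losing this CAS is fine...
        v = oldVals.get(i);                           // ...just re-read and retry
      }

      // Read through the box; dead slots (null or Tombstone) need no copy
      Object boxed = ((Box) v).value;
      if (boxed == null || boxed == TOMBSTONE) return;

      // Claim the key slot in the new table, then write the value only if nothing
      // newer is already there - a more recent update always wins over the copy.
      newKeys.compareAndSet(j, null, oldKeys.get(i));
      newVals.compareAndSet(j, null, boxed);
    }
  }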

Then the resize operation as a whole proceeds like this (a rough sketch in code follows this list):

  • Decide to resize (array full, too many reprobes, etc)
  • Create & CAS-install a new target array.  At this point future users get slower by some small amount, so we want the copy to complete in a reasonable period of time.
  • Copy all live slots from the old to the new
    • Copy slots with any combination of readers, writers and background threads
    • Copy “some” slots with each touch (incrementally slowing down users to get the copy done)
  • Promote the new array as “the array”
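
Stripped of the decide-to-resize heuristics and the incremental per-touch copying, the install/copy/promote skeleton is roughly this – field and method names are mine, and the single copy loop stands in for what is really many threads each copying “some” slots:

  import java.util.concurrent.atomic.AtomicReference;
  import java.util.concurrent.atomic.AtomicReferenceArray;

  class ResizeSketch {
    // "The array": readers and writers always start from here.
    final AtomicReference<AtomicReferenceArray<Object>> topArray =
        new AtomicReference<>(new AtomicReferenceArray<>(8));
    // The in-progress target; null means no resize is underway.
    final AtomicReference<AtomicReferenceArray<Object>> newArray =
        new AtomicReference<>();

    void resize() {
      AtomicReferenceArray<Object> oldArr = topArray.get();

      // Create & CAS-install the new target array; losing this CAS just means
      // some other thread installed a target first, so use that one.
      newArray.compareAndSet(null, new AtomicReferenceArray<>(oldArr.length() * 2));
      AtomicReferenceArray<Object> target = newArray.get();

      // Copy all live slots; any thread (reader, writer, background) can help.
      for (int i = 0; i < oldArr.length(); i++)
        copySlot(oldArr, target, i);

      // Promote: the new array becomes "the array".
      topArray.compareAndSet(oldArr, target);
    }

    void copySlot(AtomicReferenceArray<Object> oldArr,
                  AtomicReferenceArray<Object> newArr, int i) {
      // box-and-copy exactly as sketched earlier
    }
  }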

No more 2-phase copy!  No more excessive value flip-flops (when racing with multiple writers & copy-threads)!


Now here’s the Big Picture, the Crucial Observations:

  • Only units of atomic update count – unrelated words (like the volatile flag mentioned at the start) can only narrow the race, not close it
  • I must touch each datum that can be atomically updated, because that’s the only way to inform other updaters that something special is going on
  • I can do it with nothing more than 1 bit per word (really: per unit of atomic update).  Actually I can do it with much less, since stealing 1 address per Value for the Box, out of 2^32 (or 2^64) possible, is really tiny… but I must steal something out of the address space.
  • I no longer need to check a “resize in progress” volatile flag before proceeding with any operation, as that doesn’t close the race in any case – and adds a constant amount of work per operation.

Chris Purcell noted something similar to this thesis: you must be able to atomically update a word, but you’d really like to atomically read many locations along with that single-word update.  He proposes a primitive that’s stronger than CAS, but is both weaker than DCAS (it updates only a single word) and stronger than DCAS (it can atomically read more than 2 words).  I really like his primitive, and I hope it sees the light of day (and Yes, I’m ideally placed to make that happen!).

Really, I’m approaching a design-pattern for non-blocking coding of abstract data types: use arrays, use state machines per-array-word, and box/flag every word when it’s time to switch arrays.  All the rest is straightforward engineering.


My NBHM work has been moving along, albeit slowly – duty at Azul calls first.  In the SourceForge CVS (but not released) are ideas for lightweight iteration & ConcurrentHashSet… and at some point I’ll redo the core algorithm to remove the 2-phase commit.

Thanks for all the comments (and excitement) we’ve seen so far!  This whole Non-Blocking thing has really got people way more excited than I ever dreamed… and it seems to me that some really hard problems are starting to give way.

Cliff Click