in Personal

Some Source Forge Threads on NBHM

A couple of interesting threads running on high-scale-lib’s forums I thought I’d post to a wider audience.

Josh Dybnis (jdybnis) – 2008-11-10 11:46
I wrote an implementation of the non-blocking hash map in C. I put it up at http://code.google.com/p/nbds/

Cliff Click (cliffclickProject Admin) – 2008-11-10 17:59
I never know what to make of such statements –

– if the table has “almost no synchronization” then it’s not non-blocking (but it might be highly concurrent). Non-blocking has a specific meaning to the academic community; if you say “non-blocking” you should also be able to say “no synch”. If you are based on NBHM then you need to provide e.g. a non-blocking malloc and a way to reclaim storage. Both are things NBHM uses GC for, and are not-trivial to get non-blocking. NBHM does indeed do very little allocation – so the blocking embedded in “malloc” will probably never hurt you.

– No “C” program can ever implement any reasonable concurrent program, because “C” has no memory model (and the proposed model is a major punt and not much better than nothing). Instead you can implement NBHM using “C using gcc_ver_XXX on X86_model_YYY”. If you change the “XXX” or the “YYY”, then all bets are off, and the program can fail. In short, the “volatile” keyword in C does not have nearly the semantics as it does in Java, and C compilers routinely do things with “volatile” variables that Java does not allow. Changing the C compiler version or the underlying hardware can break things.  

Despite my complaints a C version of NBHM on X86 is surely a very useful and non-trivial thing.
Congrats!
Cliff

Josh Dybnis (jdybnis) – 2008-11-11 01:12
Thanks for the feedback. I really appreciate you sharing the algorithm. It’s one of the most elegant pieces of work I’ve come across in a few years.

  • In theory, I agree 100% with you about writing concurrent programs in C. And people need to be clear about the limitations before attempting such a thing. In practice I think it is reasonable to do “porting” work for each platform/compiler. My implementation is targeted to gcc 4.x on 64-bit x86 (Intel does specify a memory model in their Software Developer’s Manual). I think it could be ported to most other architectures without major modifications.
  • My implementation includes a non-blocking malloc() and a method to do safe memory reclamation.
  • My statement about “almost no synchronization” was badly stated. I was referring to my malloc implementation and not the NBHM itself. Also I was also using “synchronization” to refer to x86 operations having the “lock” prefix, not any non-blocking property of the program.
  • A couple of more caveats about my malloc and the safe memory reclamation. They aren’t the most robust implementations. They need a bit of work. But they are non-blocking, and should be very low overhead on systems without too many cores (say less than 32). The malloc does call mmap() from multiple threads, it could block in there. So I guess technically it is not “non-blocking”. But that is an artifact of the implementation. In theory one could use an asynchronous method of invoking the kernel to make it truly non-blocking.

Josh

By: Cliff Click (cliffclickProject Admin) – 2008-11-11 07:41
How do you know when e.g. the NBHM Big Array is free?
Cliff

By: Josh Dybnis (jdybnis) – 2008-11-12 17:06
>How do you know when e.g. the NBHM Big Array is free?
Missed this earlier.
After I promote a new array the old one gets queued up for a deferred free. (The same with keys in the old array that are tombstoned and were not copied to the new array.) The frees actually happens at a later point when all the threads are guaranteed not to be holding references to the memory. I accomplish this using a technique from RCU. Each thread that might be holding a reference periodically announces that it is in a quiescent state. I have a way of detecting when every thread has made such an announcement. At that point I can safely free the array. My implementation of all that is not particularly robust. If a thread blocks, and never announces it is in a quiescent state nothing more will ever get freed. This is straightforward to fix with a more complex implementation. I’m having an on-going discussion about it on comp.programming.threads. http://groups.google.com/group/comp.programming.threads/browse_thread/thread/702aedc141b34fa1#

Also, one could also use GC in a C program too. Or one of the safe memory reclamation techniques documented in the academic literature (e.g. hazard pointers, smr, etc.).

By: Cliff Click (cliffclickProject Admin) – 2008-11-12 17:21
Sounds good. Just was wondering.
Lack-of-GC generally brings about various broken attempts to free memory in concurrent programs.

Cliff


By: Cliff Click (cliffclickProject Admin) – 2008-11-11 13:52
> In theory, I agree 100% with you about writing concurrent programs in C….. I think it could be ported to most other architectures without major modifications. 

Itanium. 🙂

Cliff

By: Josh Dybnis (jdybnis) – 2008-11-11 19:47
Yeah Itanium, the Alpha would be even worse. It is your NBHM algorithm though; I don’t think there is anything dependent on having a strong memory model. You would just have to put memory barriers in the right places. I’m not saying that it would be easy to figure out the optimal solution, but it wouldn’t require any major rework to the code. The lazy way to do it would be to change all the CAS calls to include a full memory fence. Doing it that way would constrict performance of course, but it would be correct, and probably still more scalable than a traditional lock-based hash table.


By: Cliff Click (cliffclickProject Admin) – 2008-11-11 21:25
The NBHM State Machine is “correct” independent of any memory model. I still need fences & a Memory Model around the promotion-logic & table-copy areas. It’s here with an IA64 would cause you grief. If you didn’t need table-copy (i.e., pre-made fixed-sized table), then it would be correct on any shared-memory machine. 

The NBHM as presented as stronger semantics than a minimalistic non-blocking hash table – Keys are treated with Java volatile semantics, so e.g. you can make & install a fresh Key on 1 thread, and have a 2nd thread see the Key from the table and also see the Key’s contents so it can correctly run a Key compare. A naive C IA64 implementation can screw up there, although the X86 strong ordering saves it. Same issue happens with the returned Value (made in Thr1, returned in Thr2, but does Thr2 see the initialized contents of the Value?).

Here’s a sample IA64 table-copy screwup:

  • – T1 copies last K/V from old table to new table.
  • – T1 incrs the last count of copied values. Bug: No ordering between the stores.
  • – T2 sees the last count (read of T1’s 2nd store), 
  • – T2 promotes (stores new table as default)
  • – T3 sees the new table (read of T2’s store)
  • – T3 probes for K, but does not see T1’s store.
  • – T3 reports K as not-in-table, when in fact it is.

Cliff


By: Josh Dybnis (jdybnis) – 2008-11-12 16:36
What happens if a thread copies a slot and then dies before updating _copyDone? Does the copy never get promoted?

By: Cliff Click (cliffclickProject Admin) – 2008-11-12 16:39
The fall-back case is that some thread (all threads?) scan the entire array and discover that all things are copied, despite any failed counts. At this point the “discovering” thread can promote. 

Cliff

By: Josh Dybnis (jdybnis) – 2008-11-12 17:15
I’m missing where that is in the implementation. I see where a thread enters panic-mode and does the scan, but it looks like it always compares its workdone with _copyDone and the size of the array. So if every slot is copied, but a thread didn’t get around to updating _copyDone, none of the panicking threads will ever terminate their scans.

By: Cliff Click (cliffclickProject Admin) – 2008-11-12 17:23
Welcome to the difference between algorithm and implementation. 🙂

Feel free to post a fix. It should be a 1-liner.
Cliff

By: Josh Dybnis (jdybnis) – 2008-11-12 17:44
Heh. I have a similar bug in my implementation.

=============================

compareAndSwap memory fence (New)

Andrew Trick (andrewtrick) – 2008-11-12 11:16
My interpretation of your slide-set/blog comments is that the NBHM state machine–ignore table copying for now–does not require any memory fences or store ordering. CAS_key, CAS_val gets the job done as long as your machine has atomic compare-exchange. Correct?

In fact, in NBHM source you bypass util.concurrent.atomic, calling Unsafe.compareAndSwapXXX presumably to avoid the memory fence. However, the Sun JDK and HotSpot assumes that Unsafe.compareAndSwapXXX enforces a full memory fence. So would it be fair to say that you’ve made Azul-specific changes to the JDK/JVM to optimize NBHM? Should Sun get rid of the full-fence semantics of Unsafe.compareAndSwap before more Java/JDK code begins to rely on it?

-Andy

By: Cliff Click (cliffclickProject Admin) – 2008-11-12 11:20
Yes, I need no fencing (ignoring table copy) for some kind of hashtable semantics. You DO need very carefully placed fences to get CHM semantics – which amount to ‘volatile’ semantics on values used in put & get calls.
 
I bypassed the Atomic classes to get the no-fencing CAS. The Unsafe calls specifically have weak no-fencing versions. On an X86 & Sparc you are stuck with the fencing anyways, but on Azul the weak no-fence version actually does not fence.
 
So I did not do any Azul-specific changes to the JDK here (Azul’s JDK IS different from Sun’s – but not here).
 
Cliff

Andrew Trick (andrewtrick) – 2008-11-12 19:05
Thanks for the reply and clarification.

The JDK does indeed have AtomicReference.weakCompareAndSet. Sorry to be pedantic about API details but… What I meant by “JDK assumes a fence” is that AtomicReference.compareAndSet for Sun is identical to weakCompareAndSet, so the underlying Unsafe.compareAndSwapObject must have a fence. What I meant by “HotSpot assumes a fence” is that the Unsafe.compareAndSwap intrinsic generates abstract memory barriers, which you would have to materialize as a real barrier on any machine that doesn’t have implicit CAS fences.

But really I was just trying to understand the intention behind your NBHM source, which is pretty clear now. It’s not important whether some careful JVM porting is needed to make it fast on certain h/w.

By: Cliff Click (cliffclickProject Admin) – 2008-11-12 21:35
Letsee…
 
– sun.misc.Unsafe does not specify any memory ordering properties for compareAndSet.
– AtomicReference.compareAndSet is doc’d to have volatile semantics, and operates on a volatile variable.
– AtomicReference.weakCompareAndSet is doc’d to NOT have volatile semantics.
 
A legit JDK implementation could then
– not fence for sun.misc.Unsafe
– but yes fence around volatile ops, including Unsafe ops on volatiles.
 
This is exactly the intent of the docs & implementation, and it specifically allows Azul to not fence on sun.misc.Unsafe & still meet the spec – which we do (on both fronts: not-fence and meet-the-spec). We also typically do not suffer from other peoples’ bad assumptions here – where they expect Unsafe to fence and are surprised when it doesn’t. I suspect Unsafe.CAS users are a rare breed.
 
Removing the fence for Unsafe ops is a tidy win for Azul on some highly parallel contention-heavy codes.
 
Cliff 

That’s all,

Cliff