
A Brief Conversation with David Moon

I had an email conversation with David Moon & Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago attempted more-or-less the same sorts of things Azul is doing now, i.e. language-specific hardware support. I lightly edited the emails for clarity and ordering (otherwise the nested interleavings get horrendous to follow).

[Ed: 11/20/2008 – Some small follow-on conversation has been appended to the end.]

**Cliff Click wrote:** Mostly I felt inspired by your Symbolics work. Azul Systems makes gear for running Java (but actually it runs a C/C++ program – we just happen to run a large C++ program that implements Java). One of the biggest-impact changes we made was a hardware read-barrier for GC – a simple instruction that tests invariants of freshly loaded pointers and takes a fast-trap if the test fails. Fixup is entirely in software. It's a RISC-y approach to the GC hardware that Symbolics had 20 years ago.
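
[Ed: for readers who want a concrete picture, here is a minimal Java-flavored sketch of what such a read barrier does. The bit positions, field names and fixup policy are invented for illustration; they are not Azul's actual encoding.]

```java
// Conceptual GC read barrier.  A "reference" is modeled as a 64-bit value whose
// upper bits carry GC metadata; the barrier checks those bits against the value
// the collector currently expects and calls a software fixup routine on failure.
public final class LoadBarrier {
    static final int  GC_SHIFT = 60;                  // assumed position of the metadata bits
    static final long GC_MASK  = 0xFL << GC_SHIFT;
    static long expectedGCBits = 0x1L << GC_SHIFT;    // flipped by the collector each GC cycle

    // Applied to every freshly loaded reference.  In hardware this is one instruction;
    // the fixup path below stands in for the software trap handler.
    static long loadRef(long ref) {
        if (ref != 0 && (ref & GC_MASK) != expectedGCBits) {
            ref = fixup(ref);                         // fast-trap, handled entirely in software
        }
        return ref;
    }

    static long fixup(long ref) {
        // A real handler would forward/remap the reference and repair the memory slot
        // it was loaded from; here we just re-stamp the expected bits.
        return (ref & ~GC_MASK) | expectedGCBits;
    }
}
```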

**David A Moon wrote:** I remember this now; this was one of the things that impressed me when I read about your machine a couple years ago. As you may or may not know, the Symbolics 3600 had fairly complex GC hardware and the Symbolics Ivory went 90% of the way towards riscifying it, eliminating all the large hardware tables. I think there was still a 16×1 bit (NOT 16Kx1 !) table to indicate which portions of address space are oldspace; with 64 address bits instead of 32 you were able to reduce that to a single bit, which is nice. 

**Cliff Click wrote:** We use very simple 3-address 32-register 64-bit RISC cores, with 54 cores per die. All chips are cache-coherent with uniform (mediocre) memory access time for all; the upper bound is 16 chips or 864 cpus. The “big” box is about the size of a dorm-room fridge, a 14U form-factor. The big box also has top-20 super-computer bandwidth (barely). Modest 16K I- & 16K D-caches for all cpus, plus groups of 9 share a 2-meg L2. Chips are fully interconnected.

**David A Moon wrote:** 3-address 32-register 64-bit address, 32-bit instruction RISC does seem to be the “sweet spot” for instruction set architectures. [**Cliff**: Yes!] I hope you avoid needless complexity like condition codes, link registers, maybe even separate integer and floating point registers. [**Cliff**: Yes, yes & yes] I don't think Java needs floating point rounding and trapping modes either. [**Cliff**: Yes, it does not] For your application you probably don't even need separate user and supervisor modes since all executable code is generated by your just-in-time compiler from a safe language. 

**Cliff Click wrote:** That's the theory but the practice is a little different: JVMs crash, and one crashing JVM should not bring down the whole box. So we in fact have a user/kernel split and it's saved our butts any number of times.

**David A Moon wrote:** Maybe you don't need any kind of mode bits register at all? Being completely modeless would be cool. 

**Cliff Click wrote:** There's also a speculative-memory mode: writes are buffered in the L1 but don't become visible until a 'COMMIT' instruction. 

**David A Moon wrote:** That's a pissload o' cores. Are those 54 real cores per die? Or is it perhaps 6 real cores, each executing 9 interleaved instruction streams, in the style of Denelcor HEP? 

**Cliff Click wrote:** Nope, real cores all the way. 

**David A Moon wrote:** Does the precisely defined Java memory model give you any help in making a more simple implementation of cache coherency than is usually done e.g. for Intel x86 architecture? Or did that memory model come out after your hardware was designed? 

**Cliff Click wrote:** The Java Memory Model came first, and we designed the hardware around it. Actually the hardware implements something somewhat weaker than the JMM by default, and the JIT inserts memory FENCE ops as needed. More like Sparc RMO or Power's memory model. 
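
[Ed: a small illustration of what “the JIT inserts memory FENCE ops as needed” means. The fence placement shown in the comments is the standard JSR-133-cookbook recipe for volatiles on a weak memory model, not Azul's exact code generation.]

```java
// A tiny publish/consume example.  On hardware weaker than the JMM (RMO- or
// Power-like), the JIT must bracket the volatile accesses with explicit fences,
// roughly where the comments indicate, to preserve Java's ordering guarantees.
class Handoff {
    int data;                    // plain field
    volatile boolean ready;      // volatile flag

    void publish(int v) {
        data = v;
        // [StoreStore fence]    -- make 'data' visible before the flag
        ready = true;            // volatile store
        // [StoreLoad fence]     -- order the volatile store before later loads
    }

    int consume() {
        while (!ready) { }       // volatile load
        // [LoadLoad | LoadStore fences] -- nothing below may float above the flag read
        return data;             // guaranteed to see the published value
    }
}
```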

**David A Moon wrote:** I assume uniform memory access doesn't mean cached accesses aren't any faster than cache misses, but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network. [**Cliff**: Yup] 

It must be a real challenge to get enough memory bandwidth to keep 864 instruction streams going at full speed, even more so with such tiny caches.

**Daniel Weinreb wrote:** Moon says: “… but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network.” Actually, I heard a talk about a year ago, at the first New England Database Day, suggesting that the way information moves between the L1 caches of the cores is by a sequence of moves, each being up, down, right, or left, in the direction of the destination core.

**Cliff Click wrote:** No, we're a full direct interconnect on the L2's. 9 L1's share an L2. One step to the L2, then one step to the correct remote L2 – across the box in any direction (including another L2 on the same chip; it was easier to run outside the chip and back in than to arrange a special faster path for within-chip communication). 15/16ths of your L2 misses are remote, and the gain for speeding up the remaining 1/16th of the misses isn't worth it.

**Daniel Weinreb wrote:** If so, the times would not actually be uniform. Taking advantage of such knowledge in the software would be challenging except for special-purpose domains, I would think. I don't know how the 54 L1 caches (or maybe fewer, if you're doing the pipelined business that Moon described) are interconnected, but there is a limit to how many can participate in a full crossbar NxN connection. [**Cliff**: Key word: challenging]

**Cliff Click wrote:** We have a **lot** of memory bandwidth – one of our bigger boxes is on [the top-20 supercomputer bandwidth list](http://www.cs.virginia.edu/stream/top20/Bandwidth.html). The chips are fully interconnected. 

**David A Moon wrote:** Meaning every chip has 15 ports directly connected to another chip? [**Cliff**: Yup] That's a lot of pins. You might want to look at [http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect](http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect) for an interesting slimmer approach.

**Cliff Click wrote:** I'm sure we looked at sicortex at one point. 

**Daniel Weinreb wrote:** I agree about the challenge of the memory bandwidth for all those cores. Do you have very wide paths between the chips and the main memory? 

**Cliff Click wrote:** There are 4 memory controllers per chip (and up to 16 chips, so 64 controllers). All memory is striped across all controllers at a fine-grained level, so each core's memory requests are spread across the box – mostly eliminating “hotspots”.

**David A Moon wrote:** Also they are a message passing architecture and you have to be a shared memory architecture, but I think that's a separate issue from the interconnect topology. I am no longer sure which I believe in. On the one hand, I am a shared memory guy from way, way back. On the other hand, it is becoming increasingly expensive to pretend that memory is uniform and random-access, and the illusion is becoming increasingly unconvincing. The von Neumann architecture may be reaching the end of its life. I like the way the Cell treats main memory as essentially an I/O device, but I have never attempted to program it. 

**Cliff Click wrote:** By all accounts, programming the Cell is a nightmare. Since we can do shared-memory with so many cores, other people can too. The more difficult issue is parallel programming that uses all those cores. Shared memory is one way to make programming easier. 

**Daniel Weinreb wrote:** What about cache coherency over multiple chips? 

These days, as Moon has pointed out to me, the importance of caches means that you gain a lot by not looking at your address space as a sequence of co-equal words, but rather as a sort of block-oriented device (a little like the way one looks at a disk), in which the block size is the size of the cache line in some relevant cache (probably L2 but it depends a lot on the hardware if the hardware is novel). (See the word “illusion” in his mail.)

Does your Java implementation specifically take this into account, e.g. allocating objects aligned on cache line boundaries, implementing collections as B-trees with nodes being an integral number of cache lines, and that sort of thing? If not, you might consider benchmarking it; I'd love to know the results of such a test.

**Cliff Click wrote:** Our hardware has short cache lines (32 bytes) so there's less to be gained by aligning objects to cache lines. For certain benchmarks (not to mention SpecJBB or anything) such alignment is probably worth a lot – but not for most use cases. Perhaps it makes sense to expose cache-line alignment to the Java/JDK programmer?
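
[Ed: lacking such a hook, about the closest a Java programmer can get is manual padding against false sharing, as in this rough sketch; the field names and padding counts are illustrative.]

```java
// In the absence of JDK-level control over alignment, the usual (crude) workaround
// is manual padding so that two hot fields written by different threads don't land
// on the same cache line.  Field layout is ultimately up to the JVM, so this is a
// heuristic, not a guarantee -- and with 32-byte lines far less padding is needed
// than on 64-byte-line hardware.
class PaddedCounters {
    volatile long hits;                  // written by one thread
    long p0, p1, p2, p3, p4, p5, p6;     // padding: pushes the next hot field off this line
    volatile long misses;                // written by another thread
}
```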

**Cliff Click wrote:** Things we do specifically for Java:

* GC read-barrier enables a fully parallel & concurrent GC; we can sustain 40G/sec allocation on a 400G heap indefinitely, with max-pause times on the order of 10-20msec. This uber-GC is partially made possible because of the read barrier (and partially possible because we 'own' the OS and can play major page-mapping tricks).

**David A Moon wrote:** That is indeed impressive! Even if I divide 40 GB/sec by 864 it's still fast. Of course you need it, much Java code really likes to generate garbage all over the place. My pet peeve is no way to return multiple values from a method without generating garbage. 

**Daniel Weinreb wrote:** Dave, you say: “My pet peeve is no way to return multiple values from a method without generating garbage.” Rich Hickey addressed this issue in his talk about [Clojure](http://clojure.org/) (Cliff: a new Lisp dialect implemented on the JVM, mainly having been run on the Sun JVM, but he ran it on an Azul at least once). [**Cliff**: I'm familiar with Clojure] He left out multiple value returns exactly on the grounds that the consing cost for a small returned value is so cheap these days, ephemeral GC's being so damned good. I do not know to what extent he has done actual metering, though.

**Cliff Click wrote:** Yes, annoying. And it's embedded into the JVM bytecodes that way, so you can't even rescue the problem by changing languages while keeping the JVM. The only hope here is for a combo of JIT'ing & smart memory allocation: either inline & optimize away the junk object or do some sort of stack-based allocation for the very-short-lived object. 
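
[Ed: the multiple-value-return peeve in one small example, with the JIT escape hatch Cliff mentions noted in the comments.]

```java
// Java's only way to return two values from one call is to allocate a holder object.
final class MinMax {
    final int min, max;
    MinMax(int min, int max) { this.min = min; this.max = max; }
}

final class Stats {
    // Every call conses a MinMax -- garbage, unless the JIT inlines this method and
    // scalar-replaces (or stack-allocates) the short-lived holder, as discussed above.
    static MinMax minMax(int[] a) {
        int lo = a[0], hi = a[0];
        for (int x : a) { lo = Math.min(lo, x); hi = Math.max(hi, x); }
        return new MinMax(lo, hi);
    }
}
```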

**David A Moon wrote:** Can you say anything about what page-mapping tricks are good for? 

Actually I am a little surprised you bother with virtual memory at all. If memory gets full surely it is faster to do a garbage collection than to swap pages out to disk (I think you don't even have a disk if I remember correctly). Maybe page mapping is just to avoid core shuffling if there are multiple heaps? Is the page size large to cut down on mapping overhead?

**Cliff Click wrote:** Ok, a bunch of things here:

* Page sizes are 1Meg.
* No swap; swapping is death for GC. If you are out of memory, then you are OutOfMemory. To put it another way: you are going to blow your performance guarantees so badly, the JVM might as well have crashed. Better to make the user allocate enough resources up front to keep the performance guarantees.
* The page mapping is integral to the GC; we free physical memory earlier in the GC cycle than virtual memory. This lets us recycle physical memory much sooner, and makes it much harder for an application to “surprise” the concurrent GC – i.e. an app whose allocation rate suddenly accelerates might run a different concurrent GC out of physical memory – and force an application stall.
* The mappings are also used to deny user-mode threads access to a page where the objects are being concurrently relocated, but still allow GC-mode access to that page. GC-mode is a subset of user-mode (so we sorta are like a 3-privilege mode machine), except that user-mode can promote itself to GC-mode anytime it wants.
* The read-barrier will take a fast-trap on failure, and by default get promoted to GC-mode in the trap handler. This lets the faulting CPU fixup the object reference and continue, without needing to wait for GC to catch up.
* We also have a write barrier, but it's merely a slight efficiency hack. The read-barrier enables the good GC.

**David A Moon wrote:** I would expect the write barrier to be more than a slight improvement, unless your memory scanning is incredibly fast. The main thing the write barrier accomplishes is to greatly decrease the number of pages (or cards or other chunks) that have to be scanned for pointers to the youngest generation. 

**Cliff Click wrote:** Ahh…. here's the difference: a 'card-mark write barrier' can be done fairly cheaply with plain ol' instructions, so the write-barrier instruction is only a delta above that. The biggest change is that our write-barrier op 'filters' card-marks that don't need to happen, which in turn cuts down on useless card-mark writes. Writes are generally really cheap – unless there's a ton of contention caused by rapidly updating an old-gen object holding a pointer into young-gen. In that case the filtering is worth a lot.
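
[Ed: what the hardware write-barrier op folds into a single instruction, sketched as the software barrier a JIT would otherwise emit. The card size, table size and generation layout are made up for the example.]

```java
// Software rendition of a filtering card-mark write barrier.  Addresses are modeled
// as longs and the generation/card layout is invented; the point is the two filters.
public final class CardMarkBarrier {
    static final int    CARD_SHIFT    = 9;                  // 512-byte cards, a typical choice
    static final byte[] CARD_TABLE    = new byte[1 << 20];  // one dirty byte per card
    static final long   YOUNG_GEN_END = 1L << 32;           // assume young-gen sits below this address

    // Called after storing reference 'value' into a field of object 'holder'.
    static void postWrite(long holder, long value) {
        // Filter 1: only an old-gen object now pointing at young-gen needs a card mark.
        if (holder >= YOUNG_GEN_END && value != 0 && value < YOUNG_GEN_END) {
            int card = (int) (holder >>> CARD_SHIFT) & (CARD_TABLE.length - 1);
            // Filter 2: skip the store entirely if the card is already dirty,
            // which avoids the contended writes Cliff describes above.
            if (CARD_TABLE[card] == 0) {
                CARD_TABLE[card] = 1;
            }
        }
    }
}
```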

**David A Moon wrote:** Since the GC is concurrent the time required to scan for pointers of interest to the GC shouldn't slow down the application unless the application uses all the cores, except surely you run out of memory bandwidth eventually and then the GC's memory scanning interferes with the application's accesses to memory. So even with all that concurrency and massive processing power and memory bandwidth I would think you would still get a lot of benefit from doing less scanning because of the write barrier. But I haven't measured it and you probably have. 

**Cliff Click wrote:** We're definitely doing a generational collector, so we'd be using a software write-barrier if we didn't have the hardware one handy. Or maybe: “We have a write-barrier. The fast-path is done in hardware. There is a slow-path done in software.” 

**David A Moon wrote:** I don't know if this was published, but the write barrier on the Symbolics Ivory was really simple. Each page table word contains an encoding (in 4 bits?) of the age of the youngest generation pointed to by this page; the write barrier just causes a trap when those bits need to be updated. I don't remember if we kept another index or just had the GC scan the page table. Since it's an IBM-style inverted page table the page table is dense even when a lot of pages are swapped out so scanning it is not too slow. It would have worked better if the Ivory page size hadn't been way too small. 

Did you ever look at Eliot Moss's “train” garbage collection ideas to do ephemeral-GC-type optimization for long-lived data? If I remember correctly there is a bug: memory consumption can become arbitrarily large in the worst case, but the idea might still be good. I think Microsoft is using a form of it in .NET.

**Cliff Click wrote:** HotSpot tried out the train collector years ago, and never got it to perform well in practice. There are a bunch of odd-ball issues (remembered sets with popular objects?) that made it not very performant.

**Cliff Click wrote:** Things we do specifically for Java (continued):

* Simple 3-adr RISC is easy to JIT for; it's HotSpot's JIT and makes quite good code – easily comparable to “gcc -O2”. [**David**: Yeah.]

* Just-In-Time zero'ing of L1 cache for allocation – and we don't issue a memory read for the cache-line getting zap'd to zeros. This uses about 1/3 less bandwidth than the usual “store a bunch of zeros in a row” implementations.

**David A Moon wrote:** [dcbz](http://www.ibm.com/developerworks/power/library/pa-memory/index.html) instruction on POWER. Clearly a good idea. 

**Cliff Click wrote:** dcbz doesn't make the grade:

* it's not a prefetch so you can't issue it far enough in advance without causing the stall you are trying to cover, and
* any Java lock that happens will invoke a fence, which will also stall
* it **does** avoid the memory read bandwidth.

Our version gets all of the above right. 

**David A Moon wrote:** Dedicated register for the allocation pointer?

**Cliff Click wrote:** No. The main issue with allocation is avoiding the mandatory cache-miss as you stream through memory. You can hide tons of dumb int-ops and cache-hitting loads underneath that cache-miss. So we pay a little bit of integer cycles to set up our J-I-T zeroing, and getting the allocation pointer is one of those cycles. In short: the pickings are too slim to be worth dedicating a register to.
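
[Ed: roughly what the allocation fast path looks like, for readers who haven't seen a bump-pointer allocator. A sketch only; the structure follows HotSpot's TLAB scheme, not Azul's exact code.]

```java
// Bump-pointer allocation out of a thread-local allocation buffer (TLAB), in
// Java-flavored pseudocode.  The real fast path is JIT-emitted machine code, and
// the cache-line-zeroing op has no Java equivalent, so it appears only as a comment.
public final class Tlab {
    long top;    // current allocation pointer -- a couple of int-ops to load, no dedicated register
    long end;    // end of this thread's buffer

    // Returns the address of 'size' fresh bytes, or 0 if the TLAB needs a refill.
    long allocate(long size) {
        long obj    = top;
        long newTop = obj + size;
        if (newTop > end) {
            return 0;            // slow path: grab a new TLAB from the shared heap
        }
        top = newTop;
        // Here the JIT-ed code zeroes the object's cache lines directly in the L1,
        // without issuing the memory read a plain "store zeros in a loop" would cause.
        return obj;
    }
}
```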

**Cliff Click wrote:** Things we do specifically for Java (continued):

* GC bits in each pointer (old-gen, young-gen, stack-allocated, normal-C-ptr) [**David**: I remember that now. Good.]
* Java type-hierarchy bits in each pointer. This saves a trip to memory to fetch the Java type when making virtual calls.

**David A Moon wrote:** So you're saying that since 64 bits is way too many, you can use a bunch of those bits to put a rather wide tag in each pointer? Maybe 20 bits wide? Seems like a great idea. No problem with having too many classes defined and running out of bits? 

**Cliff Click wrote:** Yes, wide tag per pointer. No problem (yet) with running out of classes. Big Java apps these days seem to have about 2^15 classes. 
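
[Ed: a made-up encoding showing how a wide class-ID tag in the pointer lets a virtual call check the receiver's type without touching the object header; the bit layout here is purely illustrative.]

```java
// Illustrative encoding of a class-ID ("kid") tag in the upper bits of a 64-bit
// reference.  Bit positions and widths are invented; 20 bits comfortably covers
// the ~2^15 classes of a big Java app.
public final class TaggedRef {
    static final int  KID_SHIFT = 42;
    static final int  KID_BITS  = 20;
    static final long KID_MASK  = ((1L << KID_BITS) - 1) << KID_SHIFT;

    static long tag(long addr, int kid) { return (addr & ~KID_MASK) | ((long) kid << KID_SHIFT); }

    // A virtual call can compare this against the expected class-ID without the
    // cache-missing load of the object header that other JVMs must do.
    static int kid(long ref) { return (int) ((ref & KID_MASK) >>> KID_SHIFT); }
}
```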

**Daniel Weinreb wrote:** I am curious about how you do threads. On the Lisp machine, the stack had to be contiguous in virtual memory. So whenever we had to allocate one for a new thread, we had to trade off between wasting VM versus worrying about running out of stack. I don't know if we ever considered making the stack such that it could be a linked list of (big) blocks, but I suppose anything that slows down the function call path is very dicey. Dave, did you ever think about this? It would have been nice to have cheap full coroutines (“stack groups”). 

**Cliff Click wrote:** Nope – we're using the fairly classic stack layout for HotSpot. And yes this means you have to decide up front the stack-vs-heap memory split. 

**Daniel Weinreb wrote:** Cliff, have you heard anything about the possibility that the JVM will be extended to be able to do tail-calling? Rich Hickey says that there was recently a [“JVM summit” (something like that)](http://openjdk.java.net/projects/mlvm/jvmlangsummit/) of people writing non-Java languages for the JVM, and there was a lot of interest in tail-calling. It would be great for the Lisp world, since Scheme depends on it so much; it would be nice to re-unite the Scheme and Lisp communities as much as possible. 

**Cliff Click wrote:** I'm well aware of the interest in tail-calling optimizations – but it has to compete for scarce engineering resources like everything else. 

**Cliff Click wrote:** Things we do specifically for Java (continued):

* Really big interconnect & memory bandwidth. Java has lousy locality, so things miss in cache – a lot.

**David A Moon wrote:** Would that be true even with 4 times larger caches? 

**Cliff Click wrote:** I think so. We did a bunch of studies on cache-sizes, looking at fewer-cores-more-cache vs more-cores-less-cache. 

**Cliff Click wrote:** Things we do specifically for Java (continued):

* Support for 2 outstanding memory misses per cpu, up to 24 per L2 cache counting prefetches – so 2304 outstanding memory ops across the whole box. [**David:** “Non-blocking caches are good.”]
* Minor tweaks to help do Java-specific math: array addressing is sign-extend, THEN shift-and-add – which we do in one ALU op (see the sketch below). Also we have a Java range-check op, faster virtual-call support, the IEEE754 subset needed for Java FP, etc. [**David**: Good.]
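
[Ed: the array-addressing math spelled out as a sketch; the object-header offset is a placeholder, not Azul's actual layout.]

```java
// What "sign-extend, THEN shift-and-add" computes for a Java array access a[i]:
// the 32-bit index (sign-extended, since Java indices are signed ints) is widened
// to 64 bits, scaled by the element size, and added to the array base plus header.
// Azul fuses this into one ALU op; done separately it is several instructions.
public final class ArrayAddress {
    static long elementAddress(long arrayBase, int index, int log2ElemSize, int headerBytes) {
        long widened = (long) index;                                  // sign-extend 32 -> 64 bits
        return arrayBase + headerBytes + (widened << log2ElemSize);   // shift-and-add
    }
}
```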

**David A Moon wrote:** Do you have any kind of specialized cache to avoid memory references to find the address of a virtual method to call? I haven't measured it but I get the sense that all those not very predictable indirections through vtbls really hurt C++ speed. 

**Cliff Click wrote:** Yes… specialized hardware, not specialized cache. 

We do the standard HotSpot [“inline cache”](http://research.sun.com/self/papers/pics.html) trick that's done on all CPUs. 

The cache is a 1-entry cache, inlined in the code. The Key is the expected class of the object; the Value is the target of the function call, encoded as a plain “CALL” instruction. So you do a Key-compare inline, then a plain static call. If the Key-compare hits, the hardware will Do The Right Thing with the static call. In practice, 95% of all call-sites (that can't be optimized by the JIT) never fail the 1-entry cache. The other 5% tend to fail for having lots and lots of targets.

So there IS a “cache”, but it's not specialized. It's the normal HotSpot cache. Instead, we have an instruction to do the Key-compare in 1 cycle (and the call is another cycle). So we do cache-hitting v-calls in 2 clks & 2 ops. On a wide-issue O-O-O X86 that same sequence is maybe 4 dependent ops but also issues in 2 or 3 clks – except that loading the object header is typically a cache-miss; but the branch predicts right and the O-O-O hardware lets the X86 proceed despite the miss.
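
[Ed: the 1-entry inline cache written out as plain Java, to make the Key/Value description above concrete. The class names and the miss handling are simplified for the sketch.]

```java
// Java-level rendering of a 1-entry inline cache at a virtual call site.  In JIT-ed
// code the Key (expected class) and the call target are constants patched into the
// instruction stream; here they are an ordinary field and a cast, for readability.
interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class InlineCacheSketch {
    static Class<?> expectedClass = Circle.class;   // the Key, patched at the call site

    static double callArea(Shape s) {
        if (s.getClass() == expectedClass) {
            // Hit (~95% of un-inlinable call sites): one compare + one plain direct call.
            return ((Circle) s).area();             // stands in for the patched static CALL
        }
        // Miss (~5%): fall back to a full virtual dispatch.  A real inline cache would
        // also re-patch the Key and the call target here, which plain Java can't express.
        return s.area();
    }
}
```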

**Cliff Click wrote:** Things we do specifically for Java (continued):

* Simple hardware transactional memory, used to speculate through Java locks.

**David A Moon wrote:** I'd be interested in reading more about that. 

**Cliff Click wrote:** We have a white-paper here: [http://www.azulsystems.com/products/whitepaper/wp_otc.pdf](http://www.azulsystems.com/products/whitepaper/wp_otc.pdf)

**Cliff Click wrote:** Ok, I'm done boasting. 🙂 

**David A Moon wrote:** It's plenty worth boasting about. 

–Dave Moon

**Cliff Click wrote:** Thanks, Cliff 

===========================================
11/20/2008 – A little more:

**David A Moon wrote:** Then I guess your thread scheduler is not aware of the partitioning of the hardware and doesn't try to put related threads onto cores in the same chip.

**Cliff Click wrote:** The thread scheduler is well aware of the hardware layout. It's no mean feat to schedule 800+ cpus with 1000's of runnable threads. 

Instead, the problem is that nobody has a clue what counts as “related threads”. Certainly there's no Java-level hint. You might imagine doing L2-to-L2 miss-rate profiling to try to decide which sets of threads are communicating through memory instead of through an L2. It's a non-trivial task, for which there are some non-trivial academic papers out there showing modest gains for lots of hard work.

**David A Moon wrote:** But you said your memory access time is mediocre. That's the first symptom of the uniform memory illusion breaking down. Then you find that no matter how big you make your caches they aren't big enough. Then you find that each new generation has slower memory, as a ratio of memory access time to processor speed. Then you find that programs are too slow unless they are written with detailed knowledge of the hardware memory structure. You probably know way better than me how far we are down that slope today.

**Cliff Click wrote:** Split out the notion of 'uniform' from 'fast'. Everybody pays attention to caches (or should): the mindset is “memory is slow”. But only some people pay attention to memory layout: the mindset is “memory is uniform”. 

For some parallel scientific programs with well-understood access patterns, people can build both the machine and the program such that most accesses are “near” the CPU. In such a case it makes sense to build a NUMA machine: “near” accesses are faster than “far” accesses.

Azul builds its gear as 'UMA' because our programs do not have well-understood access patterns. Instead, the patterns are mostly random (after cache filtering), and so it makes sense to have uniformly mediocre speeds instead of 1/16th of memory fast and 15/16ths slow.