Un-Bear-able

More news from the Internet connected-tube-thingy:

Cool –

From: cmck….

I’m going to reference the blog on the landing page tomorrow. I know the readership will be more than pleased that we successfully poked the bear and got some insight from Cliff.

Also –

TheServerSide.com managed to raise the ire of Azul Systems’ Cliff Click, Jr., …

I’m not just a bear, I’m an irate bear!

Just so’s you know – it takes a lot more than a casual question about “where’s my Java Optimized Hardware” to make me irate.  In fact, that’s a very good question – because the answer is not obvious (ok, it was obvious to me 15 years ago, but I was already both a serious compiler geek and an embedded systems guy then).  But it’s not-obvious enough that quit a few million \$ have been spent trying to make a go of it.  Let me see if I can make the answer a little more bear-able:

Let’s compare a directly-executes-bytecodes CPU vs a classic RISC chip with a JIT.

The hardware guys like stuff simple – after all they deal with really hard problems like real physics (which is really analog except where it’s quantum) and electron-migration and power-vs-heat curves and etc… so the simpler the better.  Their plates are full already.  And if it’s simple, they can make it low power or fast (or gradually both) by adding complexity and ingenuity over time (at the hardware level).  If you compare the spec for a JVM, including all the bytecode behaviors, threading behaviors, GC, etc vs the spec for a classic RISC – you’ll see that the RISC is hugely simpler.  The bytecode spec is complex; hundreds of pages long.  So complex that we know that the hardware guys are going to have to bail out in lots of corner cases (what happens on a ‘new’ when the heap is exhausted?  does the hardware do a GC?).  The RISC chip spec has been made simple in a way which is known to allow it to be implemented fast (although that requires complexity), and we know we can JIT good code for it fairly easily.

When you compare the speed & power of a CPU executing bytecodes, you’ll see lots of hardware complexity around the basic execution issues (I’m skipping on lots of obvious examples, but here’s one: the stack layout sucks for wide-issue because of direct stack dependencies).  When you try to get the same job done using classic JIT’d RISC instructions the CPU is so much simpler – that it can be made better in lots of ways (faster, deep pipes, wide issue, lower power, etc).  Of course, you have to JIT first – but that’s obviously do-able with a compiler that itself runs on a RISC.

Now which is better (for the same silicon budget): JIT’ing+classic-RISC-executing or just plain execute-the-bytecodes?  Well… it all depends on the numbers.  For really short & small things, the JIT’ing loses so much that you’re better off just doing the bytecodes in hardware (but you can probably change source languages to something even more suited to lower power or smaller form).  But for anything cell-phone sized and up, JIT’ing bytecodes is both a power and speed win.  Yes you pay in time & power to JIT – but the resulting code runs so much faster that you get the job done sooner and can throttle the CPU down sooner – burning less overall power AND time.

Hence the best Java-optimized hardware is something that makes an easy JIT target.  After that Big Decision is made, you can further tweak the hardware to be closer to the language spec (which is what Azul did) or your intended target audience (large heap large thread Java apps, hence lots of 64-bit cores).  We also targeted another Java feature – GC – with read & write barrier hardware.  But we always start with an easy JIT target…

Cliff