JavaOne Slides

Slides for my 2009 JavaOne talks:

Alternative Languages on the JVM

This Is Not Your Father’s Von Neumann Machine; How Modern Architecture Impacts Your Java Apps

The Art of (Java) Benchmarking

I’d like to say more about JavaOne, but I’m (re)discovering that if you’re a speaker at JavaOne it’s really hard to get into the other talks & technology being presented. I had my hands full with my 3 talks, making sure they looked good, were relevant, etc. To be sure, there were plenty of talks I wanted to attend, but I never seemed to find the time. 🙁

And of course shortly after JavaOne, I went to ECOOP 2009 in Italy for a week, and now I’m on a long-deserved vacation.

More blogging soon, on ECOOP at least,

Cliff


Comments

Hi Cliff

I saw the slides for your benchmarking talk a few weeks back (I think from the JavaOne presentations site). One thing I missed was what you, as a longtime JVM engineer, use to benchmark: e.g. which benchmarks you trust, and which lead you to believe you have improved the code, are running as fast as or faster than another language, etc. Mostly you covered problems with existing, public benchmarks, but I’m sure you must have some way to verify your own expectations. Maybe you could blog about this at some point.

Thanks for posting the presentations.

Regards
Patrick

Posted by: Patrick Jul 21, 2009 9:36:56 AM

STM is not 15 years old — the first STM systems appeared in 2005 — off by a factor of 3 :-).

Posted by: Maurice Herlihy Jul 21, 2009 6:03:30 PM

What about this one from 1995?

Nir Shavit and Dan Touitou. Software Transactional Memory. Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pp.204–213. August 1995.

Cliff

Posted by: Cliff Click Jul 21, 2009 6:33:30 PM

Ah, but the 1995 Shavit & Touitou STM was a design only; there was no implementation. It was a PODC (theory) paper, intended to show it was possible to implement TM entirely in software (which seems obvious now but wasn’t then). The actual algorithm required predeclaring write sets, and had other issues that did not make it a promising candidate for implementation. AFAIK, the first actual STM implementation was DSTM in 2005.

Even HTM received little or no attention until after 2000. I recently prepared a graph of citation counts for a talk, and it turns out 2000 was the first year the 1993 TM paper broke through the 10-citations-per-year barrier. (It didn’t break 100 per year until 2006.)

So TM, both hardware and software, is much less mature than you would think from the standard citations. If we had written the 1993 paper in 2001, not much would have changed.

Anyway, I am enjoying your slides and reports: carry on!

Maurice

Posted by: Maurice Herlihy Jul 21, 2009 9:34:54 PM

Good read. Could you elaborate on the following quotes?

  • Fixnums: “Setting final fields cost a memory fence”
  • Lessons: “Language/bytecode mismatch – can’t assume, e.g. Uber GC or Uber Escape Analysis, or subtle final-field semantics”

Thanks.

Posted by: Nils Jul 23, 2009 6:36:02 AM

  • Fixnums: “Setting final fields cost a memory fence” –
    Fun corner case of the JMM. During construction, final fields are not really final, as they move from zero to some other value. If you publish a half-constructed object globally, other threads are allowed to see either the zero or the final value (actually, I think all bets are off). Immediately after construction, remote threads are only allowed to see the final value. This requires st/st ordering on the writer side and ld/ld ordering on the reader side. On the uber-strong X86 memory model the st/st ordering can be a nop (I think – but Caveat Emptor, I’ve not looked at the X86 JMM issues in a while; it might need an SFENCE). On the reader side the loads are pointer-dependent, so perhaps only on the Alpha would you need a fence there. There’s a sketch of this race after the list below.
  • Lessons: “Language/bytecode mismatch – can’t assume, e.g. Uber GC or Uber Escape Analysis, or subtle final-field semantics” –
    Mostly pointing out that the endless use of Fixnums instead of primitives (e.g. “ints”) has a serious cost that only goes away with some kind of Escape Analysis. This becomes a language design issue: what’s the value of the illusion of infinite-precision integers if all programs slow down by 2x (or more!)? Since math beyond 64 bits is rarely needed, does it make more sense to require programmers to specify when they need more than 64 bits (and speed up all programs by 2x)?
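
To make that final-field race concrete, here is a minimal sketch of the publication scenario described above; all class and field names here are hypothetical, purely for illustration:

```java
// A minimal sketch of the final-field publication scenario described above.
// All names are hypothetical, purely for illustration.
public class FinalFieldDemo {
    static class Box {
        final int value;                  // final fields get special JMM publication rules
        Box(int v) {
            value = v;                    // during construction, 'value' transitions 0 -> v
            // If 'this' escaped right here (e.g. shared = this;), a reader
            // could legally observe value == 0, 'final' notwithstanding.
        }
    }

    static Box shared;                    // unsynchronized global: a data race by design

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread() {
            public void run() {
                shared = new Box(42);     // safe publication of the *final* field only:
            }                             // the constructor's implicit st/st fence runs
        };                                // before the store to 'shared'
        Thread reader = new Thread() {
            public void run() {
                Box b = shared;           // racy read: may see null...
                if (b != null)            // ...but once non-null, the JMM guarantees
                    System.out.println(b.value);   // this prints 42, never 0
            }
        };
        writer.start(); reader.start();
        writer.join();  reader.join();
    }
}
```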

It’s a chicken-and-egg type problem. No important programs churn Integers, so nobody tries to optimize them. Do you dive in and try to make an ‘egg’, and hope a JVM ‘chicken’ will appear that Escape-Analyzes away the 90% cost? A toy illustration of the cost follows below.
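
As a rough, hand-wavy illustration of that boxing cost, here’s a toy loop of my own (not a serious benchmark; it commits several of the sins from the benchmarking talk, so treat the numbers as suggestive only):

```java
// A rough sketch of the Fixnum-vs-primitive cost gap discussed above.
// This is an illustration, NOT a rigorous benchmark.
public class BoxingCost {
    public static void main(String[] args) {
        final int N = 100000000;

        long t0 = System.nanoTime();
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum += i;                 // pure register arithmetic
        long t1 = System.nanoTime();

        Integer boxedSum = 0;
        for (int i = 0; i < N; i++)
            boxedSum += i;            // unbox, add, re-box: allocates an Integer per
                                      // iteration unless Escape Analysis removes it
        long t2 = System.nanoTime();

        System.out.println("primitive: " + (t1 - t0) / 1000000 + " ms, sum=" + sum);
        System.out.println("boxed:     " + (t2 - t1) / 1000000 + " ms, sum=" + boxedSum);
    }
}
```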

Cliff

Posted by: Cliff Click Jul 23, 2009 7:21:15 AM

Just a few cents... I am less aware of specific pitfalls that relate to Java, and tend to be more concerned with the generic ones. Maybe some of this may help:

  • First, how many tasks are you running, and is that number greater than or close to the number of cores in the system?
  • How many threads execute periodically, and do they re-run on the same core?
  • Are you dirtying a large number of pages at once (bunched up), delaying near-term allocations due to needed cleaning of the pages?
  • Do your sub-page allocations grow the slab (or other allocator) to the point that you are stealing from the back end?
  • If you have a large number of threads doing sys calls, are they hitting a reader/writer lock and not grouping the readers together to get rid of the default FIFO nature of this lock?
  • If you are doing a large number of I/O ops, are they async in nature?
  • Are there bottleneck threads that the main threads are waiting on?
  • Do your periodic threads change execution times if the same workload is presented to them? Why?
  • If you find that one set of structs needs periodic freeing, what happens if you don’t free? Would your thread be faster? Then maybe a local memory pool would shave time off.
  • What is taking the longest?

There was once a saying that if a process was split EQUALLY into 2 threads and one thread went infinitely faster, you would only halve your execution time.

Posted by: Mitchell Erblich Aug 4, 2009 2:28:43 AM

I’m not sure which slides you are referring to! 🙂

  • First, how many tasks are you running, and is that number greater than or close to the number of cores in the system?
    – We’re finding that bottlenecks other than CPU tend to dominate on most jobs; for Azul this would require running >800 threads.
  • How many threads execute periodically, and do they re-run on the same core?
    – This kind of optimization is very sensitive to the cache size, thread/job size, and re-run duty cycle. For smaller caches & larger jobs, there’s nothing to be gained here. For well-understood and tightly controlled jobs there’s a fair bit to be gained.
  • Are you dirtying a large number of pages at once (bunched up), delaying near-term allocations due to needed cleaning of the pages?
  • Do your sub-page allocations grow the slab (or other allocator) to the point that you are stealing from the back end?
  • If you find that one set of structs needs periodic freeing, what happens if you don’t free? Would your thread be faster? Then maybe a local memory pool would shave time off.
    – GC removes all these concerns.
  • If you have a large number of threads doing sys calls, are they hitting a reader/writer lock and not grouping the readers together to get rid of the default FIFO nature of this lock?
    – No: Java’s default r/w lock bunches readers just fine.
  • If you are doing a large number of I/O ops, are they async in nature?
    – You can get nearly the same effect as async I/O by running lots of I/O ops on lots of different threads. This is a common programming idiom in Java; there’s a sketch after this list.
  • Are there bottleneck threads that the main threads are waiting on?
  • Do your periodic threads change execution times if the same workload is presented to them? Why?
  • What is taking the longest?
  • There was once a saying that if a process was split EQUALLY into 2 threads and one thread went infinitely faster, you would only halve your execution time.
    – I think you are referring to Amdahl’s Law; see the worked example below.
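
Here’s the sketch promised above: a minimal, hypothetical example of the blocking-I/O-on-many-threads idiom, written Java 5/6 style for the era (the file names are made up):

```java
// A minimal sketch of the "lots of blocking I/O on lots of threads" idiom
// mentioned above. File names here are hypothetical.
import java.io.FileInputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadedIO {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(32); // many more threads than cores
        for (int i = 0; i < 1000; i++) {
            final String name = "data-" + i + ".bin";            // hypothetical files
            pool.execute(new Runnable() {                        // pre-lambda Java style
                public void run() {
                    try {
                        FileInputStream in = new FileInputStream(name);
                        try {
                            byte[] buf = new byte[8192];
                            while (in.read(buf) != -1) { }       // blocking read; this thread
                        } finally {                              // sleeps while the kernel works,
                            in.close();                          // others keep making progress
                        }
                    } catch (IOException e) { /* ignored in this sketch */ }
                }
            });
        }
        pool.shutdown();
    }
}
```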

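And the promised worked example: Amdahl’s Law in its usual form, with Mitchell’s 50/50 split plugged in:

```latex
% Amdahl's Law: overall speedup S when a fraction p of the work
% is sped up by a factor s.
\[
  S \;=\; \frac{1}{(1 - p) + \dfrac{p}{s}}
\]
% Mitchell's saying: split the work equally (p = 1/2) and make one
% half infinitely fast (s -> infinity). The p/s term vanishes:
\[
  S \;=\; \frac{1}{1 - \tfrac{1}{2}} \;=\; 2
\]
% i.e. at best you halve the execution time, exactly as the saying goes.
```
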
Cliff

Posted by: Cliff Click Aug 4, 2009 8:48:09 AM

Cliff-

I just watched an online video of the “This Is Not Your Father’s Von Neumann Machine” talk you gave at the Sun 2009 JVM Language Summit (http://www.infoq.com/presentations/click-crash-course-modern-hardware). Excellent talk. Thanks for sharing it. The slides appear to be the same as the JavaOne talk of the same name linked above, which is why I’m commenting here, even though I didn’t see the JavaOne talk.

I have a question about the “real chips reorder stuff” example you gave. The slides don’t list which architecture(s) the example applies to, but in the video I think you referred to x86. I’m trying to reconcile this with my understanding of the x86 memory model, which I thought would prevent the kind of reordering you describe, at least under normal circumstances (i.e. using write-back memory and no fancy instructions).

I went back to the “Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A” (http://www.intel.com/Assets/PDF/manual/253668.pdf) and found section 8.2.3.2, titled “Neither Loads Nor Stores Are Reordered with Like Operations” (page 323). This example, at least as I read it, is a simplified version of the one presented in your talk, but it reaches the opposite conclusion: that you are guaranteed never to get into a state where the second write done by CPU #0 is seen by CPU #1 but not the first.

Am I misreading the Intel document, or does your example only apply to other chips with weaker consistency guarantees, like the ia64?

Thanks!

Posted by: Dave Clausen Jan 15, 2010 4:34:06 PM

X86 has a very strong hardware memory model, but it still allows younger loads to bypass older stores. So you get other racy behaviors than the one I demo’d; I stuck with an easy-to-understand example. That particular re-ordering in the slides is possible on IA64, Alpha, & Azul gear, plus a bunch of higher-end DSP-like things.

Keeping the strong X86 memory model costs Intel something; if they dropped the strong ordering they could totally make a faster X86 – but people would then have to deal with the O-O-O’ness of it.
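
For the curious, the store-load case looks like the classic litmus test below. This is a toy harness of my own (not from the slides), and in practice you’d have to loop it many times to catch the effect:

```java
// Classic store-load litmus test: on x86 (and on weaker models), both
// threads can read 0, because each CPU's younger load may bypass its
// own older store still sitting in the store buffer.
public class StoreLoad {
    static int x = 0, y = 0;           // deliberately NOT volatile
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread() {
            public void run() { x = 1; r1 = y; }   // store x, then load y
        };
        Thread t2 = new Thread() {
            public void run() { y = 1; r2 = x; }   // store y, then load x
        };
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Intuition says at least one of r1, r2 must be 1.
        // The hardware (and the JMM) also permit r1 == 0 && r2 == 0.
        System.out.println("r1=" + r1 + " r2=" + r2);
    }
}
```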

Cliff