in Personal


(yet another trip report to yet another conference)

‘Tis the Season to be Conferencing…

Blah, I’d rather be hacking, but here goes…

I went to EclipseCon, to CGO and back to EclipseCon.  Here’s how it can down: I got invited out of the blue to talk at EclipseCon by Darin Wright, who had seen me speak earlier at the JVM Language Summit.  Like a dummy I accepted it instantly instead of checking my calander.  Naturally it conflicts with CGO – and I was on the CGO PC this year, so I felt obligated to attend CGO as well.  Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) – so I scheduled a 1-day only trip to Seattle.  Monday I wandered EclispeCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths.  The flights on Alaska Air were smooth and easy.  No WiFi at CGO – the Hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?).  (Remember my grief over Hotel WiFi in the last blog?)  So over lunch break (which was Chinese food and I don’t like Asian food of any kind) I headed over to Starbucks.  Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT – provided you register $5 on a Starbucks card with them.  Yes!!!  Suddenly I have a $5/month nation-wide WiFi plan!  I snacked on some salad’y thing, drank my triple-latte, and happily Surfed the Internet until show-time at CGO came ‘round again.

1st up: EclispeCon, mostly cause I got nothing to say.  These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer).  Makes me wonder if the cure is really better than the disease. I gave my talk, but it was too much talk in too little a time window.  Both me and the attendees would have been better off if I could have spent more time on the details.  C’est la vie.  Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend.  Also the staff was exceptional friendly and easy to work with.

Next up: CGO.  Despite my 1 day fly-in-fly-out I got a lot more out of CGO.

Keynote: Vikram Adve

Missed!  Arrived barely in time for the last 10 minutes.  That’s the penalty for attempting a 1-day fly-by.  I hear he said something good about Azul systems (he did ping me the day before for some details).  Basically it’s a Call To Arms for improving C compilers and C runtimes.  He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!).

Revisiting Out-Of-SSA Translation

**Cliff Notes:** HotSpot "comes out of" SSA form during register allocation, where it really preserves SSA form throughout the entire compile process, but adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... **but there's a key new piece**.

Linear-time colaescing congruence classes; speed & memory footprints.
Why is coming out of SSA hard? (**Cliff Notes:** It's not!)

And now they show what HS has been using for years: insert copies as needed for a simple out-of-SSA translation, then aggressively coalesce away the extra copies.

Nice diagrams for IFGs + copy-affinities.

Notices the value-equivalence property for defining IFGs (also done by HS).

**Ok: here's what's new compared to HotSpot -server:**

Fast interference test w/out building an IFG!
Only need to test against closest dominating var?
So scan the dom tree, with a stack-based DFS traverse of dom tree.
Uses 2 sorted lists (?); more complexity for value-based IFG tests.
But no building (and rebuilding?) the quadratic IFG.
Claim 2x faster than Sreedhar for coming out of SSA.
Big memory footprint reduction, because no need for live-ness sets & IFG!!!

40% of HS's total JIT budget is in reg-alloc! A faster reg-alloc is definitely interesting. A large part of that 40% is building the liveness & IFG sets

**Wave Propagation and Deep Propagation for Pointer Analysis**
Wave Prop is very fast and uses huge memory (working on parallel version)
Deep prop is faster for small & precise memory sets – works better on desktop (and JIT?)There's a zoo of ptr-analysis names:
– context-sensitive or not
– flow-sensitive or not
– field-sensitive or not
– inclusion-based (Andersen-based) or unification-based (Steesgaard-based)

**Cliff Notes**: HotSpot -server uses context-INsenstive, flow-INsensitive, field-Sensitive, unification-based

Build a graph of constraints, and do contraint-propagation.
Code: “a = &b;” Constraint: “a superset {b}” – means b element-of P(a)
Code “a = b” constaint: “a superset b”- means P(b) contained in P(a)
Code “a = *b” or “*a = b” – complex constraint, must add edges to the graph as constraint propagation adds new sets to the nodes.

Now build a graph: nodes are P(a) – sets of things pted-at by 'a', and edges represent subsets.

Cycles are common; collapsing them is crucial for performance. All things in the cycle must be equal. Uses Nuutila to find strongly connected components in a single pass (so does HotSpot – Vick's algorithm).  After collapsing cycles, you get a DAG – and Nuutila also produces a topo-sort as output.

(Still looks like a complex diff/delta propagation algorithm – but makes me wonder if standard bit-vector stuff is faster).

Not a bad-looking ptr-analysis. Might work in a JIT.
Manesh suggests using BDD's.

**Cliff Notes: **Real question of course, is if all the extra alias precision actually buys you any performance. PLDI paper of a few years ago suggestions otherwise: that HS's style of alias-analysis gets you 95% of all the possible gain. 


**Fast Track: A Software System for Speculative Program Optimization**Chen Ding: another process-level fork/join speculative program optimization (contrast this to [Emery Burger's “Grace”]( system). Gives sequential program semantics. Add annotations to “suggest” fork/join
points. Actually fork a process, use COW & [mmap]( to guarantee sequential semantics in the face of races (end speculation, or force commit-ordering on speculative threads, etc).

_New Idea_: User can write an alternative implementation of some module; run both versions in parallel; keep the faster version & kill the slower version: “return do_whichever_is_faster( methodX(), fast_methodX() )”

_New Idea_: “fast_methodX” might be speculative and ***wrong***. So run slow-version tocompletion and do error checking after the fact. But let the fast version run ahead (and if it's faster), it can spawn off new speculative fast versions. When slow version completes, use page-level protections to know what's been changed, and verify thecorrect semantics of the fast version. Run all the correctness checks in parallel with all the normal fast-case stuff. Needs some careful throttling.

Example using gcc's “[mudflap](”. Array under/overflow, leaks, un-init'd memory, etc. Run 2 versions of all code: the existing code is “fast track” and the mudflap version is the “slow correct track”. Goal: same speed as the “fast track” code with all the safety checks of the “slow track using mudflap”. In some cases can get real nice speedups.

**Cliff Notes: **can we turn on mudflag for Azul?

**Scenario Based Optimization**Profile the hot methods. Tweak various parameters (e.g. loop unrolling, or cache-tiling parameters). Profile both old and new versions. Pick the best winner, and stick with it for awhile. Keep profiling continously, and switching code as needed.  Some JVM JITs do this.  HotSpot only sorta does.

The programmer has to pick what the new & old versions are, and the runtime picks the fastest one (no support for correctness – what if one of the versions is busted? Or tickles a timing bug which breaks things?)

Seeing some nice gains, some of the time. Avoids losing when turning on some aggressive optimizations. But generally unrealistic.

**Software Pipelined Execution of Stream Programs on GPUs**

Basically [StreamIt]( on a GPU

StreamIt is much easier to use than e.g. CUDA or OpenCL.

StreamIt model is a collection of connected serial parts; connected via pipes. Due to splits & joins, might need to exec some parts more often than than others (e.g. part Y “eats” 2 ouputs from part “X” before outputing a result itself). Get complex patterns of execution; difficult to map down to GPGPUs. Have bounded-buffer problems: need to map to a steady-state schedule using a fixed count of buffers. Then can arrange to do no allocation during execution.

– Profile each filter standalone, to figure out it's work load.
– Configure selection
– Add constraints 
– Use CPLEX to solve
– Output CUDA code.

Looks like a classic software-pipeline, just done on a GPGPU. Compute minimum initialization interval, compile resource constraints, etc. Have issues with data-layout, so can do parallel data access.

Impressive compile times (30sec to 178sec).
Can get pretty good utilization on some kernels, so seeing 100x speedups.

Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt's model – then StreamIt is the fastest language for you!  If what you got doesn't look like StreamIt, then Too Bad.


**Stream Compilation for Real-Time Embedded Systems**

StreamIt Again, but for Cell.
New constraint: out of memory per cell (distributed memory).
Different from GPU – GPU has shared memory, but very slow if doing bank-conflicting access – but can pile on more memory (more buffers) for some GPUs and less for others. Not so for Cell – must have a 
uniform split of memory.


PS – Thanks Doug for inspiring this blog


I suspect you mean C’est la vie …

Posted by: Alex Blewitt  Apr 5, 2009 1:23:10 PM

Fixed, thanks. My French isn’t very good…. 🙂

Posted by: Cliff Click Apr 5, 2009 1:25:42 PM

Cliff, thanks for speaking at EclipseCon. I admit that while I was able to follow along for a significant part of your talk, by the end I got lost. I agree that it was probably not a lot of time. Either way, I appreciated that you exposed us to a facet of the VM that we don’t often talk about at EclipseCon, so I hope you don’t think it was a waste.

Posted by: Robert Konigsberg Apr 5, 2009 6:06:46 PM

Somewhere on the JVM Lang Summit web pages is a full video of my talk there – where I go slower & have more time for meaningful Q&A. Part of the problem was me feeling rushed – so I ran ahead, instead of going back over slides where I could feel the audience slipping – and maybe just dropping a few slides off the back end.


Posted by: Cliff Click Apr 5, 2009 6:42:42 PM

Thank Cliff for the many conference write-ups you have published. I always look forward to hearing what is going on in the research community summarized so concisely.


Posted by: William Louth  Apr 6, 2009 12:32:11 PM

authored by you and your colleagues from Sun cites a paper by the same authors called “Interference Graph Trimming” that is noted as having been submitted to the 2001 PLDI. That paper does not appear in the PLDI record, is it available anywhere?

Posted by: Carl Apr 18, 2009 11:13:04 PM

It was never accepted to PLDI – definitely a bummer. The paper is all about how to write a fast graph-coloring allocator, and the reviewers all wanted to see a comparison to a linear-scan allocator – but there isn’t enough space in the 10pg PLDI format to do both. I’ll see if I can post a link here. I’ll send you a copy in any case.


Posted by: Cliff Click Apr 19, 2009 5:04:43 AM

Appreciate if you can share the paper “Interference Graph Trimming” to me too. Thanks in advance.


Posted by: Leela Mohan Venati Oct 25, 2009 10:50:50 AM