
Part 2: Lisbon to San Francisco, via Toronto – plus ISMM and PLDI

2010 ISMM & PLDI in Toronto, Canada

Skip to the techy parts…


As mentioned in my last blog, I am flying into Toronto from Lisbon, Portugal.  There is no direct flight, so I’m taking US Airways and connecting through Philadelphia.  The flight over is fairly easy and in daylight the whole way so no chance I get any sleep – no problem, the quickest way for me to kick jet-lag is to stay up late anyways (not that I get over jet-lag very rapidly in any case).

We arrive on time, but getting through customs is much slower than I expected.  I mean, every step moved along fairly well – but there were a lot of steps.  I probably showed my passport to more than 10 different people as I moved through the various queues.

Long story short, I made my gate with maybe 5 minutes to spare – before boarding should even begin.  So really with plenty of time… except as I arrive at the gate they are announcing that my flight from Philly to Toronto has been canceled.  Ugh… déjà vu – I head over to the service desk, again.  I wait in line for a new flight, again.  This time it’s fairly quick, a half-hour at most, and my turn comes up.  The 5:30 flight is already booked, so I’m on the 8:30 flight.  Much better than last time – now I’m only being delayed 5 hours instead of 24.

I settle down to do some email… and discover that my laptop charger is fried.  Apparently taking it to Europe was a one-way trip.  It worked fine with 120V before I left, it worked fine with 240V overseas, but it’s not working now!  Of course the battery is dead, and I’d like to check email, finish blogging, skype people at work, etc., etc… but no-go.  So now I’ve really got some time to kill.  I walk the airport to confirm they do not have a ’tronics shop that carries a laptop charger (phone chargers are in every little tiny newspaper store but nothing for a dead laptop).  I end up catching up on some reading I’ve been meaning to do.*

*footnote: I end up reading “Conversations with God (An Uncommon Dialogue)”.  I agree with most of what the author says, but certainly not all.

Around 8pm, I notice the airport’s gate screen reporting that my 8:30 flight is delayed to 8:50pm.  Then it’s canceled.  Then it’s on at 11:00pm.  Then 9:30pm.  Then 9:50, then 10:10 and finally 10:15.  I am not making this up!  Over the course of a couple of hours US Airways changed the time on my flight a half-dozen times!

But finally the plane arrives, and we get boarded.  The flight to Toronto is easy (if 8 hours delayed).  I finally get to my hotel around midnight.  The room is plain but serviceable and I can finally hit the sack.  The next day I shop around and finally pay $130 for a laptop charger – for a laptop whose value cannot exceed $50 anymore – but I’m stuck!  I need the laptop for the conference!  Indeed this blog is being written right now on the laptop with my shiny new charger.  I found two food courts closed on a Friday evening at 6pm in the downtown area; rather shocking to my American sensibilities.  Fortunately the street vendors make a really nice hotdog.  Upon returning from the shopping trip, my hotel door failed (the battery in the card reader died) so the hotel upgraded me to a nicer room higher up.  There was a large & loud party outside my old room when I went back to it (to gather my shirts hanging forgotten in the closet) – I had to ask the dozen ladies in high heels to make room in the hallway.  Later, the same party made it up to my new floor and around 1am I had to ask them to lower the volume.

[postlude: Saturday night: the party continues.  The first party-er arrived at 1am and banged on their door across the hall.  Of course nobody was in the room yet, so they called their friends & held a loud phone conversation about getting a room key.  10 minutes later the main party arrived and ran on until 2 or 3am, with about 20 people in 2 rooms, mostly spilled out into the hallway.  Sunday night they went home and I finally got some sleep.  Monday I’m dragging all day; I crash at 9:30pm and Tuesday I’m all chipper again.]

Somewhere along the way I figure out how to test C2 inlining heuristics in a production setting: I can write microbenchmarks with well-controlled call patterns.  If properly inlined, C2 will be able to discover a key constant which will make some outer loop infinitely fast; if not, then not.  So a slow loop means the expected inlining didn’t happen – which I can detect with basic timing.  Needs more work before I can put it into a production testing situation, but I got a prototype working.
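Roughly the shape of the probe I have in mind – a minimal sketch, not the actual prototype, and all names are made up:

public class InlineProbe {
  static int getStep() { return 1; }     // trivially inlinable callee

  static long probe(int iters) {
    long sum = 0;
    for (int i = 0; i < iters; i++) {
      // If getStep() inlines, (getStep() - 1) folds to 0 and C2 can delete the
      // whole loop body; if it does not inline, the loop does real work.
      sum += i * (getStep() - 1);
    }
    return sum;
  }

  public static void main(String[] args) {
    probe(10_000_000);                   // warm-up so C2 compiles probe()
    long t0 = System.nanoTime();
    long r  = probe(100_000_000);
    long ms = (System.nanoTime() - t0) / 1_000_000;
    System.out.println("result=" + r + " ms=" + ms);
    // Near-zero time means the expected inlining happened; a slow loop means it didn't.
  }
}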

Tuesday my flight leaves at 5pm, and I’m getting conflicting info (between the front desk and the concierge) on when I should leave for an international flight from Toronto.  We settle on leaving the hotel at 2pm; time enough for me to catch the morning sessions and Cormac’s talk at 1:30.  The flight home is easy and no mistakes – that it’s Air Canada instead of US Airways might have something to do with it.  There’s also a power outlet near my seat, so I’m busy blogging again.

Ok, On to the Good Stuff!

As usual for my conference reports I brutally summarize a year of somebody else’s hard work in a short paragraph; I skip some sessions and talks; I blend ISMM and PLDI sessions (and I gate-crash some LFX sessions).  Caveat Emptor.

My winners:

Evaluating the accuracy of Java Profilers – all your profilers are belong to us.  I’ve fought this fight for years and years.  I “got it right” at Azul.  I hope to bring better (heck, plain not-busted) profilers to the masses someday.
Schism: (Fil Pizlo) Real Time GC – Best real-time GC ever… as of 2010…
Memory, an Elusive Abstraction – hysterical talk moving us from before the “Era of Vagueness” till past the “Era of Incorrect Specifications” (as in: no, the hardware doesn’t really do that)

Adversarial Memory for Detecting Destructive Races – really?  I can see the value in that.

GC

Heap Shape & Scalability – Yes!  I can in fact parallel-mark most heaps with 100 cpus.   😉
Concurrent, Parallel Real Time GC – not a bad RT/GC, just not as good as Schism
Optimizations in a Private Nursery-based GC – All your cache-locality are belong to us.
Economics of Garbage Collection – Fascinating attempt to apply cross-discipline concepts
The Locality of Concurrent Write-Barriers – Yeah!!!  Tony Hosking renames all the write barriers to sane names!!!   Boo!!!!  He fails to tell anybody that’s what he’s doing.   🙂
Detecting Inefficiently-Used Containers to Avoid Bloat – how to find HashMaps with 1 element
Detecting Low-Utility Data Structures – how to find HashMaps with 1 ‘get’ call

Safety

CETs: Compiler Enforced Temporal Safety for C – capability pointers
Parallel Checking of Expressive Heap Assertions – run this assert as you walk the heap
Chess – model checking for the masses.  Alas, the masses need some education about model checking…
Breadcrumbs: Efficient Context Construction for Dynamic Bug Detection – seen it before; the latest incarnation still looks good, but I still say don’t decode, just cache
Detecting JNI bugs – it’s too easy, should be on by default
Verve – a proven OS; a very light-weight way to write totally type-checkable x86

Experience

JavaScript: The Bad Parts – scary JS, but nothing I haven’t experienced myself.  Alas, I see Giant programs slowly running on JS.
Keynote: Susan Eggers, “Saving the Bacon” – classic old-school compiler tech applied to FPGAs

misc
Speculative Parallelization using State Separation and Multiple Value Prediction


CETs: Compiler Enforced Temporal Safety for C

Fix buffer overflows & dangling pointers.

Buffer overruns have been done a bunch before.

Add meta-data to the side, no change to the object data layout so works with legacy code.  Check each ptr-deref.  After some (aggressive) static compilation, get a 50% slowdown.

Compare to valgrind/purify.  They use external bit-map metadata to track the free/not-free status of every byte.  But if memory is freed and then re-alloc’d, a dangling pointer might operate on the wrong kind of memory.

Instead use an identifier-based approach (a capability).  Every malloc produces a new ‘key’; must have proper key to access memory.  So ptrs have a ‘key’ with them.  Must copy the key/lock pair with each ptr-copy.  This requires source-code modifications to help with the new ‘fat’ ptrs.

CETS: keep the extra key/lock data off to the side in a separate meta-data space.  Memory-based ptrs use their address to key into the meta-data.  Register-based ptrs copy the key/lock into registers (i.e., the compiler ‘inflates’ the ptr into a fat-ptr when loading a ptr from memory).

So memory layout is unchanged, and handles re-allocs.
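To make the key/lock idea concrete, here is a conceptual sketch in Java (illustration only – the real CETS instruments C code via LLVM, and all the names here are mine):

import java.util.HashMap;
import java.util.Map;

class CetsSketch {
  // lock-location -> current key; freeing an allocation invalidates the key
  static final Map<Long, Long> lockTable = new HashMap<>();
  static long nextKey = 1;

  static class FatPtr {                  // address + key + lock location
    final long addr, key, lockId;
    FatPtr(long addr, long key, long lockId) { this.addr = addr; this.key = key; this.lockId = lockId; }
  }

  static FatPtr malloc(long addr) {
    long key = nextKey++;                // every allocation gets a fresh key
    lockTable.put(addr, key);
    return new FatPtr(addr, key, addr);
  }

  static void free(FatPtr p) {
    lockTable.remove(p.lockId);          // all outstanding copies of the key go stale
  }

  static void checkDeref(FatPtr p) {     // run before every pointer dereference
    Long current = lockTable.get(p.lockId);
    if (current == null || current != p.key)
      throw new IllegalStateException("temporal safety violation: dangling pointer");
  }
}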

Extend for stack objects in the obvious way.

Rest of talk devoted to the (fairly obvious) careful engineering to make it run fast.  Hooking into LLVM compilation system.  ‘Hook’ into the optimizer before register allocation, so spills do not get annotated.  Also no need to check most stack refs (compiler temps & locals).  Also do redundant-check elimination.  Remove perhaps 70% of static checks.

Runtime overhead is about 50%.  Overhead is mostly due to shadow-memory accesses & register pressure.

Cliff Notes: Might work on HotSpot.  That’s a very strong statement despite the very strong disclaimer of “might”.  Almost no memory safety tools work on HotSpot.  We’ve tried a lot of them over the years.

Cliff Notes: Not well defined in multi-threaded apps; data races can’t really be handled properly in any case?

Cliff Notes: He’s keeping keys unique.  Could also just never free virtual memory and hence have all ptrs unique (you want to free physical memory though).  Compared to EFence – might work – but for high allocation rates (of small objects) you need a better OS mechanism for alloc’ing and free’ing memory.  mmap is too slow.


Parallel Checking of Expressive Heap Assertions

Checking whole-heap assertions in parallel.

“Unrestricted use of aliasing is Evil” and

“Unrestricted use of aliasing in the presence of concurrency is the Ultimate Evil”

Presents classic data-race-style bugs.  My classic example is one thread setting a ptr to null, and another trying to load through it (after testing that it’s not-null).  His example is from Swing with one thread calling “dispose” and another thread trying to render the object (after checking that it has not been disposed yet).  He’s got a few more real-life examples, but they all have that flavor: two reads with a (rarely) intervening write.
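My own minimal version of that check-then-act shape, in Java (hypothetical names):

class DisposeRace {
  private Widget widget = new Widget();

  void render() {                     // thread A
    if (widget != null) {             // read #1: looks safe...
      // ...thread B can run dispose() right here...
      widget.draw();                  // read #2: NullPointerException if it did
    }
  }

  void dispose() { widget = null; }   // thread B: the (rarely) intervening write

  static class Widget { void draw() { } }
}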

So want to check invariant: “Object foo is only reachable by one thread (except for e.g. the DB manager)”.

Expressiveness: is object shared?  is object reachable?  by avoiding some paths?  by only 1 object (after ignoring some certain other objects)?  etc.

Use Java Modeling Language (JML) to describe various pre-/post-conditions.  These are checked every whole-heap GC cycle.  The JML Compiler is fragile right now; sometimes it exits silently or does not handle some queries.

This work: make JMLC robust for some common queries, modify the JVM/JML combo so the common queries are also efficient.  Also hard to use: must modify source code sometimes.  User must add probes “where it makes sense”.

How to piggy-back on GC?

Implemented in IBM’s QVM (IBM J9 production JVM).

Can do a portable version using JVMTI but slowly.

Performance of the added probes appeared to be binary: either very little slow-down, or huge slow-downs.

Found a number of broken idioms in a bunch of real apps.  Showed a dozen apps; 3 have no bugs, and the remaining average about 3 bugs each.  Running the queries in parallel helped a few apps but mostly didn’t help.

Cliff Notes:

He’s assuming a snap-shot-at-the-beginning.  So the answers are “atomically correct” from the heap-shape at the start of GC, but the heap will have changed so you do not get the witness.  Azul of course uses an eager-update GC.


Heap Shape & Scalability

Can we use many-core to trace large heaps?

Really: does load-balancing on 1000 cores work?

Cliff Notes: Azul sees 100 cores usefully marking 300+ Gig heaps.

How deep are object graphs in real programs?  You could make a singly-linked list with 4 billion objects, but how deep in real life?  Limited to DaCapo, SpecJVM98, SpecJBB.  SpecJBB is boring: always depth 24.  But other programs see depths from 1K to nearly 20K (pmd in DaCapo).  The other benchmarks have a max depth less than 128.

Deep & Narrow object graphs are Evil.  So it’s not just deep, but if it’s narrow you are capped at the number of parallel CPUs.  Want wide bushy graphs.

Wow… object-shape-depth graph.  At shallow depth for jython or xalan, graph is “wide” – 10000 at depth 1.  Somewhere between depth 10 and 50 suddenly graph goes from very wide to very skinny.  Some linked-list makes it out to 10000 but most objects are in the very wide/fat/shallow part.

Did idealized traversals on different heaps.  Assume perfect load-balancing and perfect memory latency.  BFS traversal.  Count objects available to be scanned, and also busy/free processor slots.  Report utilization for different numbers of processors.

See that up to 8 processors, utilization is typically very good.  Most benchmarks are still good at 100 cpus.  But 5 benchmarks have a very long tail and utilization drops way off.  Also, many GC cycles do not see the very bad graph shape.

Can we help the 5 bad benchmark/heap-shapes?  (1) – Add shortcuts.  (2) – Speculative roots.

(1) Add shortcut links.  Invisible to program, used by the tracer.  Compute the ITU both before & after adding links.  Devise a strategy for adding links: look for long-skinny chains.  Ratio of depth-vs-size(# nodes).  Need no more than 1/10th of 1% of new edges.

(2) Speculative.  As discussed at Azul before (patent?).  Pick some random starting points and mark.  Later see if this whole trace turns out to be live; if not then the work is wasted.  If it is live, then mark the whole trace live.  Some random dead stuff will accidentally get marked live.

Amount of floating garbage turns out to be fairly large.  Hard to find good roots as tracing gets close to finishing: as the heap gets more marked, it’s hard to find good un-marked roots.  Very inefficient use of the extra CPUs.  Only a modest improvement to the ITU score.  I get cited a few times for this idea in the slides; too bad it turns out to suck.


Concurrent, Parallel Real Time GC

Multi-core scheduler hacks

CPU-local GC data structures

Lock-free algorithms for GC

Non-moving collector.

Cool hack: require at GC times that all roots are in the heap & reachable from some common place.  So no stack scanning.  But more spill code around call sites: must squirrel away roots in a scannable place.  Imagine if every other stack was only allowed to hold oops.  Then stack scanning is as fast as scanning a large array.

Obvious hack for large object-ref arrays: array-lets.  Per-object scan time is now bounded.

Scheduler: need to do what Azul did for co-op scheduling for GC.  Need to halt all threads at safepoints.  Actually doing totally co-op scheduling!!!  NOT pre-emptive!!!  And these points are GC points as well.  Points auto-inserted by the compiler.  Hmmm… I like this: the co-op pre-emption point is also a GC safepoint.

Need another CAS/step in the marking of each object to help with the load-balancing algorithm.  Also impacted by heap-shape.  Pure linked-list heaps cannot be marked in parallel.

Cliff Notes:  For his real-time apps, depth is usually fairly small.

Then pretty standard parallel lock-free sweep phase.  Similar to NBHM copy ops: CAS to ‘own’ a chunk to copy.  Allow extra ‘owns’.  Do stuff like striped local counters for free-memory, CAS to the global counter when the local counters/memory/lists run out.
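A sketch of the striped-counter idea as I read it (my reconstruction in Java, not the paper’s code): threads hit a local stripe on the fast path and only CAS the shared counter when their stripe runs dry.

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

class StripedFreeCount {
  private final AtomicLongArray stripes;    // per-thread-ish local free counts
  private final AtomicLong global;          // shared pool, touched only on refill
  private final long refill;

  StripedFreeCount(int nStripes, long totalFree, long refill) {
    this.stripes = new AtomicLongArray(nStripes);
    this.global  = new AtomicLong(totalFree);
    this.refill  = refill;
  }

  boolean reserve(long n) {                 // claim n bytes of free memory
    int s = (int) (Thread.currentThread().getId() % stripes.length());
    while (true) {
      long have = stripes.get(s);
      if (have >= n) {
        if (stripes.compareAndSet(s, have, have - n)) return true;   // contention-free fast path
      } else {
        long g = global.get();
        if (g < refill) return false;       // global pool exhausted
        if (global.compareAndSet(g, g - refill))
          stripes.addAndGet(s, refill);     // move a chunk of budget to our stripe
        // else another thread won the CAS; loop and retry
      }
    }
  }
}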


Optimizations in a Private Nursery-based GC

New functional scalable language.  Need a better GC.

Seeing scores for high-allocation-rate stuff peak at 4 to 8 cores; the limiter is memory-bus bandwidth for new allocation.

Using a thread-local private nursery.  Shockingly close to Azul’s SBA.  When nursery gets full do a thread-local GC.  Typically see 3% survival rate.  Move the objects into the public heap.  Otherwise keeping the same physical address for the private nursery, so that the PN stays hot in your cache.

Cliff Notes: Really it’s like Azul’s stack-based allocation (SBA), done again for bandwidth reasons.  Our caches are not big enough to make it pay off, and our default GC is really good, and we have lots of bandwidth so we never bothered to make our stuff production-ready.  Naturally there’s no good reference for this (I’ve presented a handful of times on this, so the slides are on the web somewhere).

How does he handle ‘escapes’?  Maybe classic store-barrier card-mark?   Yes – it’s a classic 3-gen system: classic stack-vars, the private-nursery (SBA area) and the shared public heap.

Write-barrier for escapes.  If the object has no-refs and is immutable, then we dup the object (FAIL!?!? for “==”).  Otherwise do a PN collection (Azul: thread-local stack-walk but wasn’t doing a GC).

PN size is important.  Too small, and endless GC cycles.  Too large, and the PN gets kicked out of the cache by the mutator.  Dynamically tweak the size.  Can estimate cache behavior from the byte-allocation rate (size of PN, time since last PN GC).

A lot of time is spent stack-walking.  Expecting 10000x more stack-walks than normal GCs.  Can do trimming of previously-walked frames.  Using stubs to watermark stacks.

Paying a 20% throughput loss, max pause time <5ms for a 24-cpu raytracer.

Cliff Notes:  Why the throughput lossage?  Expensive write-barrier?  High rate of thread-local GCs?


Speculative Parallelization using State Separation and Multiple Value Prediction

  • Execute loops iterations in parallel
  • Ignore cross-loop dependencies
  • Software thread-level spec
  • Complexities: determine frequency of speculation vs fail; user hints; recovery; user-defined rules; user-defined roll-back code, etc.

Main thread runs in “non-speculative” space or D-space (very helpful name!  Not!).  At loops, launch speculative threads; each gets a copy of state in its own “speculative space” or P-space.  Also add a control space (C-space) for each thread.  When a thread finishes an iteration, use the C-space to confirm the speculation is still good.

Cross-iter dependencies have to be rare, also failures have to be rare.

Does not quantify “rare”.

Later spec threads can rely on earlier spec threads – but such data is not available (yet).  So predict it.  Each iteration is predicting the results of the prior (speculative) iteration.  Attempt to “slice” prior iterations to feed the prediction logic.  Later iterations do more work (must execute the slice) but still less work than a full loop iteration.

Also willing to speculate lots of threads for each “reasonable” set of predictable inputs to the next iteration, based on a prior profiling run.

On an 8-core machine speedup varies from nothing to 1.5x – i.e., not a whole lot of speedup for an 8-core machine.  And the speedup is log-scale in the core count (the wrong direction of log-scale), so lots of cores won’t buy much more.


The Locality of Concurrent Write-Barriers

Write barriers cost locality.

Ugh – Tony Hosking changed the jargon names for GC write-barriers ON PURPOSE.  The names are better than “Dijkstra barrier” or “Brooks barrier”, but he needs to give us more warning!  I had no idea what the names meant until half-way through the talk.

Folklore: “Deletion” barriers have bad cache behavior

Gives examples of 3 different write-barriers.

Two are Insertion barriers for incremental-update.

One is the Deletion barrier for snap-shot-at-beginning.
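In pseudocode (mine, using the names from the talk), the two families differ only in which end of the overwritten edge gets reported to the collector:

class WriteBarriers {
  static class Obj { Obj field; boolean marked; }
  static final java.util.ArrayDeque<Obj> markStack = new java.util.ArrayDeque<>();

  // obj.field = newVal with an *insertion* barrier (incremental update):
  // tell the collector about the edge being inserted, i.e. the new value.
  static void insertionStore(Obj obj, Obj newVal) {
    if (newVal != null && !newVal.marked) markStack.push(newVal);
    obj.field = newVal;
  }

  // obj.field = newVal with a *deletion* barrier (snapshot-at-the-beginning):
  // tell the collector about the edge being deleted, i.e. the old value.
  static void deletionStore(Obj obj, Obj newVal) {
    Obj old = obj.field;
    if (old != null && !old.marked) markStack.push(old);
    obj.field = newVal;
  }
}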

Shows the classic scan-once, then scan-again (after 1st marking) and maybe scan-again.  It’s the classic race with mutators preventing a GC cycle from ending: Azul’s GC wins again (deterministic 1-shot marking, no mutator race for us).

Try to measure the cost of various barriers.  Removed as many JVM-specific costs as they can; remove JIT cost, other GC cost, etc.  As the heap size grows (so GC happens less and less), the difference in write-barriers seems reduced.

Objects “noticed” by a mutator are going to be in the cache.  Objects that have to be touched by the write-barrier won’t have a locality problem if the mutator is going to “notice” them anyways.  Turns out the distance between needing the object for the GC write-barrier and needing it for the mutator tends to be very small.  So the objects are not being evicted, and hence have good locality.  Furthermore the “deletion”-style barrier had the best cache behavior.


Memory, an Elusive Abstraction

Keynote: Peter Sewell

BTW, a hysterical talk.  And since it’s a keynote there’s no paper.  You’re stuck with my writeup.

Golden Age of memory: 1945-1972.

Memory was an Array of Values

Finite function from address/index/X to Values/Y.

Multi-processors appeared, and Leslie Lamport added this: memory appears “as if” everything happened in some serialized order.  This was true in the 60’s.  But not true as of 1972, IBM 370/158MP.  But recently it’s mass-market: core-duo.

But modern hardware & compilers provide only “relaxed” or “weakly consistent” memory models.  Gives the simplest data-race possible.  Given sequential semantics you expect to read only 0/1 or 1/1 or 1/0, but never 0/0.  But 0/0 shows up on a Core 2 Duo in 630 out of 100,000 runs.
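That simple data-race is the classic “store buffering” litmus test; here is a plain-Java rendition of it (mine).  With ordinary non-volatile fields, both the Java memory model and real x86 hardware allow the 0/0 outcome that “as if serialized” intuition forbids (a real litmus harness pins threads and loops much more tightly than this to see it often):

class StoreBuffering {
  static int x, y;

  public static void main(String[] args) throws InterruptedException {
    for (int run = 0; run < 100_000; run++) {
      x = 0; y = 0;
      final int[] r = new int[2];
      Thread t1 = new Thread(() -> { x = 1; r[0] = y; });
      Thread t2 = new Thread(() -> { y = 1; r[1] = x; });
      t1.start(); t2.start();
      t1.join();  t2.join();
      if (r[0] == 0 && r[1] == 0)
        System.out.println("run " + run + ": saw 0/0 - not sequentially consistent");
    }
  }
}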

Real processors & languages are complex & subtle.

Most research is done on idealized models or simplified hardware.

Industrial processors & specs: we’ve looked at the x86, Power, ARM, Java and C++ specs.  They are ALL flawed.  Good thing he didn’t look at ours.   🙂

“The Era of Vagueness”.  Pre-History: before Aug 2007.  None of the memory manuals available from the hardware vendors could be made sense of.

“The Era of Causality”.  White papers with “principles” and 10 litmus tests.  “P1.  Loads shall not reorder with older loads”, etc.  And then reading between the lines you “invent” a hypothetical micro-architecture that meets all the principles & litmus tests.  But then, exploring this you can find more vagueness, it’s just a little further along.  And then they discover all kinds of bugs & ambiguities in the spec.  And it’s unsound: real hardware does things explicitly forbidden by the spec – although failures might be on the order of 1 per every million runs of the litmus test.

Nov 2008-now: closed holes in the spec by allowing more behavior.

Really: it’s smart people labouring in an impossible situation.  Architects need to reveal enough for effective programming, but not reveal any IP, and not constrain future products.

So finally come up with the “x86-TSO” Abstract Machine.  Bits & parts that taken together give a clean description of what a real X86 allows – as a spec.  Wow: he gives a pretty clear description.

Zounds – no program is data-race-free at the spin-lock level, because the spin-lock itself is a big data-race.  So all those DRF program theorems are utterly useless.  New result: a particular Linux kernel spin-lock is OK against this x86-TSO model.  But no other locking paradigm has been proven.  All your Data-Race-Free-Memory-Models are Belong To Us.


Chess

  • Tom Ball

I skipped out of the ISMM Sunday morning session to visit the LFX session.  LFX is attempting to bring practitioners “out of the woodwork” and get people who have built large practical systems to talk about their experiences.

Chess – it’s a testing tool, a testing methodology, a platform for research and teaching.

Chess is a user-mode scheduler.  It “hijacks” scheduling from the OS.  It can take a different thread schedule each time, it can replay schedules.

Common usage: fork/join unit test.  You start 2 threads, launch them, join them, then assert for something.  Chess then allows all possible interesting interleavings to be checked.
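The shape of such a test, in plain Java (this is not the CHESS API, which is .NET-based – just the kind of two-thread scenario you hand it; the bank-balance bug here is the same one used in the example further down):

class BankAccountTest {
  static int bal;

  // @Test
  public void depositAndWithdrawRace() throws InterruptedException {
    bal = 100;
    Thread deposit = new Thread(() -> {
      synchronized (BankAccountTest.class) { bal += 10; }
    });
    Thread withdraw = new Thread(() -> {
      int t;
      synchronized (BankAccountTest.class) { t = bal; }       // read in one critical section...
      synchronized (BankAccountTest.class) { bal = t - 10; }  // ...write in another: atomicity bug
    });
    deposit.start(); withdraw.start();
    deposit.join();  withdraw.join();
    if (bal != 100) throw new AssertionError("lost update: bal=" + bal);
    // A tool like CHESS drives all interesting interleavings of this scenario,
    // so the unlucky schedule that loses the deposit is found deterministically.
  }
}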

Interesting Observation: For most bugs, there exists a short-running scenario with only a few threads that can find it.

Unit tests: better coverage of schedules, easier debugging, regression, etc.

Stateless state-space exploration.

No change to OS scheduler

Ability to enumerate all/only feasible OS schedules

Uses sync-points – schedule flips where threads communicate + race-detection

Serialize concurrent behavior

Has a suite of search/reduction strategies

Don’t capture program states.  Not needed for termination, and it’s very expensive because states can be huge.  At the cost of perhaps revisiting states.  Use some partial-order reduction to avoid revisiting too many similar states.

Example: insert schedule points before each lock/unlock.  Then the Chess scheduler might use random-sleeps there:

T1:                     T2:
  lock                    lock
  bal += x                t = bal
  unlock                  unlock
                          lock
                          bal = t - x
                          unlock

Improvement 1: understand happens-before relations.  If T1 is stalled even a little, then T2 will hold the lock for a while, and T1 cannot make progress no matter the schedule or random sleep.

Improvement 2: understand lock/unlock and run/wait queues

Chess converts partial schedule orders into total orders.  As if you have a uniprocessor.  Chess makes scheduling deterministic, but NOT e.g. input non-determinism.  So network packet-arrival order is not captured.  Nor is variation in user actions.

Fast-Mode: just this: introduce scheduling before sync, volatiles and interlocked operations.  Finds many bugs in practice.

Data-race mode: Find data-races.  Introduce schedule points before racing accesses.  Eventually you’ll capture ALL SC executions.  A bit more expensive than fast-mode.

Learning from Experience: “User forum”, Champions

Feedback: “CHESS doesn’t Scale” – humm, we managed to boot/shutdown the Singularity OS with Chess and found bugs, so what does “Scale” mean?  What they usually mean: “It needs to run my tests lots and lots of times and each of my tests is slow” or “gosh, the state-space is huge”.  So there was this kind of learning curve for most CHESS users: time spent is “time-to-run-one-test times #-of-schedules”.

Feedback: “CHESS Isn’t Push Button”.  Push-button techniques are very valuable but Chess isn’t push-button.

Cliff Notes:  Check out ‘CUZZ’ – ASPLOS 2010 – push-button: runs giant programs with weird interleavings.

Feedback: “CHESS Doesn’t Find this Bug”.  Added: “Warning: running CHESS without race detection turned on can miss bugs”.  And also: turn race-detection on for a few executions.

Feedback: “CHESS Can’t Avoid Finding Bugs”.  After finding a bug, would like CHESS to quit finding the same bug and go on to looking for more bugs.  Add in controls for CHESS to avoid doing pre-emption in particular places.  So can do “preemption sealing” of method M and now M runs atomically inside of Chess.

Feedback: Confusing error messages.

Feedback: Chess isn’t real-time.  Chess treats “wait()” calls with timeouts as always doing the timeout.  “It’s a rainy day and…” you’re paging to death or Outlook is hanging or whatever… so the wait times out.  Now what?

Lots of great success quotes… but boring to read.

“Learning by F(l)ailing with PFX”.

High level tips:

  • Proper expectation setting
  • good methodology & good default behavior
  • good warnings & messages
  • cultivate champions
  • listen to them and learn!

Economics of Garbage Collection

Jeremy Singer & Richard Jones

allocation elasticity – apply to control heap growth

The “allocation curve” – with a large heap size you do few collections, with a smaller heap you do more collections.  There are 2 interesting points: heap so big you do ZERO collections, and the other end: the minimum required heap to let the program run at all.

“Law of Demand” – when something is more expensive, less is demanded.  Demand curve: a curve showing that as something gets cheaper, people will buy more.

Price Elasticity of Demand – for a particular good, this is a measure of the responsiveness of the quantity to a change in price.  Independent of units: % change in quantity for a 1% change in price.

Tends to be negative.

Inelastic goods, e.g. a life-saving drug or petrol: people will pay any price.  Maximally elastic goods: fixed price or nobody bothers – goods with lots of choice, e.g. toilet paper.

Elasticity = (dQ/dP) * (P/Q)

Price ==> heap size

Quantity ==> number of GCs

Excise tax or “duty” – similar to a sales tax; causes a vertical shift in the demand curve.  In GC, the object header increases the heap size – similar to increasing the “tax” on the heap.

The difference in curves – the vertical shift – varies by benchmark.  It varies because of the different mean object sizes.

Allocation Elasticity = (d(#GCs) / d(heapsize)) * (heapsize/#GCs)
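Putting the mapping in one place (price becomes heap size, quantity becomes the number of GCs), the allocation elasticity is just the demand formula with the GC terms substituted:

\[
E_{demand} \;=\; \frac{dQ}{dP}\cdot\frac{P}{Q}
\qquad\leadsto\qquad
E_{alloc} \;=\; \frac{d(\#\text{GCs})}{d(\text{heap size})}\cdot\frac{\text{heap size}}{\#\text{GCs}}
\]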

Java heap growth is managed by various heuristics.

Then looks at Jikes’ heuristics for heap size.

At program start user specifies a “target elasticity value”.

At each GC, JVM computes ‘current_elasticity’.

If this current elasticity is smaller than the targetE then grow (shrink?) the heap until we match the elasticity.


JavaScript: The Bad Parts

JavaScript, very dynamic, widely used, very expressive

Cost: performance, assurance, maintainability

Optimizations & analysis depends on assumptions

How dynamism is actually used & needed?

Instrument the webkit/JS interpreter.  Record traces.  Asynchronously filter the traces.  Save them offline & execute the traces abstractly.  Looked at the top 100 sites (an interestingly large fraction are porn sites).  Got 8GB of JS from the sites, distilled to a 500MB database, down to 33 charts.

Results – JavaScript usage varies widely.  Some sites use nearly 100% Date objects, some only ‘new’, some only arrays, some only mess with the DOM, etc.

Assumptions:

  • Program size is modest
  • Exec time is dominated by hot loops, so tracing & JIT matters
  • Call-site dynamism is low.  All use inline-caching.
  • Declared FCN signatures are meaningful
  • Properties are added at object init time.  Object ‘shape’ is defined all-at-once.
  • Properties are rarely deleted.  ‘Shape caching’ works.
  • Prototype hierarchy is invariant
  • eval is infrequent and harmless
  • industry benchmarks are representative

Reality

  • Program size is NOT modest: 81K, 1.6Megs, 2Megs, etc.
  • Exec time is dominated by hot loops: 90% of exec time in 0.6% of the code (so yes, a hot loop), or in 7%, or 15%, or more.  i.e., some sites have no hot loops.
  • Call-site dynamism is low: megamorphic call sites are common; in some places more than 1000 targets.  Also, the count of megamorphic sites is very large.  I think he missed the point though: the interesting number isn’t the number of targets for the most megamorphic site, because anything beyond 1 has the same cost.  The interesting number is the percentage of dynamic sites where an inline-cache “works”.
  • Object lifetimes: mostly add fields at the beginning, OR add fields throughout the lifetime.  Constructors are mostly monomorphic.  But a few return 100’s of unique types.  And some sites add fields to objects forever, or delete fields shortly after construction.
  • Eval: used a lot.  Runs arbitrary code.  Could be serialization, or JSON, or trivial.  Or making new objects or recursively calling eval or hot looping.
  • Industry benchmarks: NOT representative.

New plan: write better benchmarks


Breadcrumbs: Efficient Context Construction for Dynamic Bug Detection

Sam Guyer

Gives sample data-race bug.  Could detect w/vector clocks or other info.  How do we record something?  What do we record?

Cliff Notes: So it’s the olde compact/cheap calling context.  I’ve written on how to do this before: pass in to each call a hash of the current context.  Hash in the PC inside this fcn whenever you want a new context.  And then return back, as an extra return value, the current hash again.  See the prior Probabilistic Calling Context work.  Very cheap to compute: 5%.  Claims there’s no way to decode a PCC???  But it’s obvious… don’t decode: cache.  But I want to record probably 100x fewer samples than Sam does.

How do you invert the hash back to the call site?  Key: the hash function f is invertible.  They use ((3*hash + pc) & 0xffffffff).  Problem: you cannot search all the possible call paths going backwards.  So each step is invertible, but it’s unclear what the next step is (going up the call stack).
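A sketch of the invertible-hash scheme as I understand it from the talk (Java, names mine): the context hash is updated at each call with h' = 3*h + pc (mod 2^32), and since 3 is odd it has a modular inverse, so each step can be undone going back up the stack – the hard part is knowing which caller pc to subtract, which is what the per-call-site hash sets below are for.

class Breadcrumbs {
  static final int INV3 = 0xAAAAAAAB;        // 3 * 0xAAAAAAAB == 1  (mod 2^32)

  static int push(int h, int pc) {           // entering a call site
    return 3 * h + pc;                       // Java int arithmetic is already mod 2^32
  }

  static int pop(int h, int pc) {            // undo one step, given a candidate caller pc
    return (h - pc) * INV3;
  }

  public static void main(String[] args) {
    int h = 0;
    h = push(h, 0x1234);                     // main -> foo
    h = push(h, 0x5678);                     // foo  -> bar
    System.out.println(pop(pop(h, 0x5678), 0x1234) == 0);   // prints true: fully inverted
  }
}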

So next idea: keep a per-call-site hashtable of hashes.  But you must dynamically maintain this set of hashes.  Issue: must insert into the hash table on the fly, and this typically means millions of hash-table operations.  But some call sites are very very hot and most are cold.

So recompile the hot fcns without this kind of hash table.  The performance hit starts at >2x.  If you throw out info above 100K executions then it’s only a 1.5x slowdown; at 10K executions only a 10% slowdown.  Now the search is a complex hybrid.  Sometimes you have a hash-set and it’s easy.  Sometimes you’ve not recorded the info, and must do a global search.

Issue: the ‘fan-in’ to a method is key to deciding if you need a hash-table or not.  Single-caller methods don’t need any help.  Same for a single-digit count of callers.  But popular methods with 1000 callers do need it.

So this technique lets you make any race detector context-sensitive.

Cliff Notes: I just assume you’d do a quick global hash-lookup and that would not be too bad.  Making the hash is cheap.  Doing a probe costs a little more, but all you need is to test whether or not you’ve seen this stack before, and only if you are about to record the hash.


Detecting JNI bugs

(also works with Python/C)

Give a state-machine description of allowed Java/JNI behaviors, and then modify the state machine at each JNI call.

void Bug_producer(JNIEnv *env, jobject lref) { C_global = lref; }

When returning to Java, the local reference is now invalid, but dangling in the C code.  Sounds like a job for EFence!!!  Later Java calls again into C:

void Bug_consumer(JNIEnv *env) { jclass c = (*env)->GetObjectClass(env, C_global); /* crash: dangling local ref */ }

State machine: tracks e.g. JNI local reference.  Notice that some local-ref is ‘acquired’ by C code at the JNI call.  On return, mark the local-ref as ‘released’ state.  Then later the JNI call attempts to de_ref, but the state is ‘released’ and not ‘acquired’, so trigger an assert.
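A toy version of that state machine (Java, purely illustrative – refs stand in for JNI local references, keyed by some id):

class LocalRefChecker {
  enum State { ACQUIRED, RELEASED }

  static final java.util.Map<Long, State> refs = new java.util.HashMap<>();

  static void onJniEntry(long ref)  { refs.put(ref, State.ACQUIRED); }   // local ref handed to C
  static void onJniReturn(long ref) { refs.put(ref, State.RELEASED); }   // call returned; ref now stale

  static void onDeref(long ref) {                                        // any later native use
    if (refs.get(ref) != State.ACQUIRED)
      throw new IllegalStateException("JNI bug: use of stale local reference " + ref);
  }
}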

Cliff Notes: Really just mapping state-machine states to language transitions???  This is too trivial… Has a list of 10 things they check.

Cliff Notes: Really pointing out we should run with -Xcheck:jni always on (unless we can show a performance hit), and further should do more JNI sanity checking – because it’s cheap.

Overhead: about 15% on Dacapo.  Actually, HotSpot is fairly robust here compared to J9, but he still finds a number of cases where HS crashes or has some other error, while he reliably throws a java-level exception.

Found a bunch of real-world bugs in e.g. eclipse, gnome, etc.


Keynote: Susan Eggers, “Saving the Bacon”

Dang, I was hoping for a celebrity David Bacon roast, but no go.  Instead it’s classic old-school compiler opts – but applied to FPGAs.  Like a domain-specific language: “if the shoe fits” then you should wear it.  Auto-compile “interesting” C kernels down to an FPGA.

Superscalar – both “vertical” waste and “horizontal” waste: cannot issue all 4 slots, or take a long-latency cache-miss op.

FGMT – fixes vertical waste

CMP – fixes horizontal waste

SMT – seeing huge speedups?  2x to 4x on a way-way.  Dang missed context…

Alpha chips?  Off-the-shelf execution, no compiler tricks.

SMT – hides latency in 1 thread by executing ops from another thread.

SMT Compiler Strategy: seek to issue more ops instead of hide latency.  Seek to share data in caches instead of distribute data across unrelated caches.

Cyclic data layout vs tiling.  For SMT, cyclic is better.

SMT is also less sensitive to tile-size.

Azul: We went to full cores instead of SMT; it costs just as much silicon area but has no shared-resource constraints.

Heading for auto-compilation to FPGAs.  “CHiMPS”.  Auto-compile C to FPGA.

Emphasis on ease-of-use over max performance.  CHiMPS auto-builds caches to help with the C memory model; but the caches are distributed across the FPGA and tend to make unique caches for each C structure needing caching.

CHiMPS is built from standard parts: C front-end, hardware generator back-end, feeding into a standard VHDL place&route for the FPGA.  Nice results for reasonable scientific kernels.


Verve

A fully verified (simple) OS kernel.  Looks like the same style of writing x86 assembly as the HotSpot assembler, but then verifying the result is type-safe throughout.  It’s a quite simple OS, but it’s also done with very light-weight annotations – looks much easier as a writing style.


Schism: (Fil Pizlo) Real Time GC

Real-time GC needs hard bounds on space AND time.

Previous work either no space bound or major slowdown.

Heap access is wait-free, no locks/spinning.

Proven space bounds.

Fastest RT/GC.

Compared to HS parallel collector.  It’s fast but not concurrent.  Java RTS: hard space, concurrent, wait-free – but 60% slow-down; log-time heap access.  Metronome: only a 30% slowdown, concurrent, wait-free BUT fragmentation kills it.  Schism: same speed as Metronome but de-frags.

Defrag: STW is easy but causes pauses.

Concurrent defrag: custom hardware (azul) – $$$.  Normal hardware – then take a throughput penalty during defrag.

Replication-based GC; but writes are not atomic; loss of coherence; works for immutable objects.  Doesn’t work in Java: most objects mutate.

Siebert: alloc all objects in small chunks.  No frag issue.  Access cost is known statically.  Most objects need only 2 chunks.  But for arrays you make a tree – but then log-time access.

Can we combine: replication-copying & fragmented-allocation?

Arraylet Spine: 1 level of indirection to arraylets.  But since the fragments are not moving, the SPINE is immutable!  Hence it can be moved with a replication-based GC.  No coherence issue, all accesses are statically known (and only 2 deep for arrays).  Other things do not need to move because of the fragmented allocation.
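How I read the arraylet-spine layout (fragment size is illustrative): the array body lives in fixed-size fragments that never move, the spine is just a table of pointers to them, and every element access is exactly two hops of statically known cost.

class Arraylet {
  static final int LOG_FRAG  = 7;                 // 128 elements per fragment
  static final int FRAG_MASK = (1 << LOG_FRAG) - 1;

  final int length;
  final int[][] spine;                            // spine -> fragments; spine entries never change

  Arraylet(int length) {
    this.length = length;
    int nFrags = (length + FRAG_MASK) >>> LOG_FRAG;
    this.spine = new int[nFrags][];
    for (int i = 0; i < nFrags; i++) spine[i] = new int[1 << LOG_FRAG];
  }

  int  get(int i)        { return spine[i >>> LOG_FRAG][i & FRAG_MASK]; }
  void set(int i, int v) { spine[i >>> LOG_FRAG][i & FRAG_MASK] = v; }
}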

Schism – plus other work to make a complete RTGC.

Also use slack-based scheduling to let the GC run in the background.

Use prior work to prove the GC is keeping up.

Use Immix for better locality?

Use (?) to get cheap/free stack scanning & concurrent on other CPU.

Schism/A – completely deterministic by always using fragmentation allocation.

Schism/B – use contiguous allocation when possible, so better throughput in practice & on average, but hard bounds of Schism/A.

Schism/C –

Compared Schism vs HotSpot/ParGC.  Slightly faster than Metronome.  Much faster than Java RTS.

Compared fragmentation: HotSpot/ParGC compacts, Metronome dies.

Now look at this on a real RTS – 40Mhz LEON rad-hard sparc & a RT OS.  Java/Jikes is about 40% slower than C on the CDx real-time benchmark, but completely deterministic.

Root-scanning is independent of #of threads… because the RT threads all stop (and allow the GC to run) only at shallow stacks.

Cliff Notes:  Best RT GC ever… at least of 2010.


Detecting Inefficiently-Used Containers to Avoid Bloat

Java containers are routinely misused: either the container’s min size is much too large, or else the container has many more things Add’d than ever Get’d.

Many containers are required for general-purpose library calls, but most library call uses only need 1 or 2 elements – they would do better with an optimized form that does not need a general container.

Dynamic Analysis – used by a lot of tools.  Has a lot of false positives; does a symptom-to-cause diagnosis with lots of noisy heuristics.  Depends on specific runs.

Static – have to discern programmer intent.  Hard to determine frequency of operation or size of structures.

Step1 – statically find all the Add/Get operations on all ADT

Step2 – dynamically find Add/Get freq’s

Step3 – report: Add only one or two (overkill structure), or Add»Get (under-utilization)

Step1: ADD is really a “store op” that writes an “element” into a field inside of another object.  GET is really a “load op” that reads an “element”.

Cliff Notes: now follows a fancy analysis to find any generic container kind of thingy.  Question: does he find more interesting containers than doing the obvious thing and hooking the JDK containers???

Step2: do both static & dynamic analysis for # of Add/Gets.  For static, use loop nesting depth

Example of overpopulated container:

List L = new List();
while (*) {
  while (*) { L.put(x); }   // many puts per outer iteration...
  L.get();                  // ...but only one get
}

Example of underpop’d container:

while (*) {
  List L = new List();      // a fresh container every iteration
  L.add(x);
  L.add(x);
  …
  L.get();
}

Tested on 21 large apps; randomly checked 20 hits for false-positives, found none.


Detecting Low-Utility Data Structures

Bloat – lots of temp objects, or excessive data copies, or highly-stale objects, or under/over-used containers.  In other words: high cost and low benefit.

Example: boolean isResourceDir(String s) { List dirs = computeAllDirs(); return dirs.contains(s); }

No need to compute all DIRS to determine if 1 is in it.  So “Blame the List”.

But better: compute cost/benefit ratio.

Cost: compute total ops executed to produce the result.  Actually compute a DAG to avoid double-counting values.  Use dynamic slicing, but that  is very expensive.

Benefit: ?  how to compute.

So start with: if a heap-value is used to produce an output, then its value is high.

If a heap-value is NOT used, then its value is 0.

If a heap-value is used to help make another high-value, then its value is high.

Solved with taint-like tracking.

Able to find a bunch of cases in DaCapo; some small hacking on each benchmark gets speedups from 5% to 30% by using better data structure choices.


Evaluating the accuracy of Java Profilers

Test 4 profilers.  But they all disagree on the hot methods.

At least 3 profilers must be wrong.  Which one is right?

How do we figure which one is right?  Timing data – so no “ground truth”.  So we have this idea of “actionable”; if acting on a profile yields results.

So if you slow down a hot method, then the whole program should get slower.  Inject a Fibonacci function into hot method M.  Tune this parameter of Fib counts and see if the whole program slows down, and ask the profiler: is M taking more time?
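The “actionable” probe, roughly as described (my sketch, hypothetical names; keep fibWork small):

class ProfilerProbe {
  static int fibWork = 0;                       // the knob: 0 = unmodified program

  static long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

  static long hotMethod(long x) {               // stand-in for the profiler-reported hot method M
    if (fibWork > 0) fib(fibWork);              // injected useless work, tuned from outside
    return x * 31 + 17;                         // the method's real (cheap) work
  }

  public static void main(String[] args) {
    fibWork = args.length > 0 ? Integer.parseInt(args[0]) : 0;
    long t0 = System.nanoTime(), acc = 0;
    for (long i = 0; i < 10_000_000L; i++) acc = hotMethod(acc);
    System.out.println("acc=" + acc + " ms=" + (System.nanoTime() - t0) / 1_000_000);
    // An actionable profiler's share for hotMethod should grow as fibWork grows,
    // tracking the wall-clock slowdown.  The safepoint-biased profilers don't budge.
  }
}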

Most profiles are NOT ACTIONABLE: slowing down the hot method M causes NO change in the profile, but it DOES slow down the program.

Issue: these profilers are using safepoints.  Safepoints appear on loop back-edges and in method epilogs.  Compilers optimize them heavily.  In particular, HS removes all safepoints in counted loops.

So these profilers are not sampling randomly.

Make ‘tprof’ which really samples PCs, and has to do the mapping from PCs to methods or bytecode addresses.

Cliff Notes:  I reviewed this, and immediately compared his results with Azul’s hi-res sampling profiling and got the same answers as ‘tprof’ (plus we give you call stacks).


Adversarial Memory for Detecting Destructive Races

Cormac at it again…

Back in the Stone Age: a large sequential program runs deterministically.  Program crashes, just re-run it.  But now we have non-determinism: both scheduling non-determinism and also memory-model non-determinism.

So run race-detector and come up with all the fields which are involved in data races.  But many data races are benign.  Is this data-race a real bug?  Or is it an intentional & benign bug there for performance?

So insert a ‘software write buffer’ for these fields.  Collect old values written to this field.  Any racey read is then allowed to return any old value.  In fact, return the eldest possible value, except also never return the same value twice in a row – i.e., alternate between the last two values written.  In particular, return NULL and not-NULL (if both have been written in the past).
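A little model of that write buffer (mine, ignoring the vector-clock bookkeeping that decides which reads are actually racey and which old values are legal):

class AdversarialField<T> {
  private T older, newer;            // the last two values written
  private boolean flip;              // never hand back the same stale value twice in a row

  synchronized void write(T v) { older = newer; newer = v; }

  synchronized T racyRead() {        // what a *racey* read is allowed to observe
    flip = !flip;
    return flip ? older : newer;     // alternate the two most recent values, e.g. null / not-null
  }
}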

In practice, programs crash rapidly & reliably – if they have a destructive race.  Can annotate and give stack traces for the past racey writes.  Uses vector clocks to determine the set of writes that can be returned for any given racey read.

For specjbb, specjvm/mtrt, lufact, sor, moldyn, tsp – basically 100% crashes.

For Eclipse, “FastTrack” finds 27 race fields.  Then run Jumble manually once for each field; found 4 destructive races.



A talk on static scheduling for shared caches for multi-core systems.

He’s getting decent speedups: 10% to 30% across a bunch of SpecINT.

Looking for constructive cache usage opportunities.

Cliff Notes: this really has to be dynamic for biz-logic Java.  But the speedups are pretty decent, I’m just not into static pre-compilation.


I attended a handful more talks not worth mentioning…. and that’s all folks!

Cliff