in Personal

I Went to Boston And All You Got Was This Lousy Trip Report

CGO was in Boston this year.  I wanted to attend the workshops, so I needed to arrive by 1pm Sunday – there’s no way to do that from the West Coast without an extra overnight stay or taking a red-eye.  I was fighting off a cold, so I took a generic night-time-no-cough-do-all pill just before takeoff.  JetBlue has the most legroom in coach and I got the exit aisle to myself: within 10 minutes I was laid out flat and snoozing.  This had to be the comfiest 5 hour flight I’ve ever taken. The winds blew in our favor, and we arrived a full 1/2 hour early: 4:30am local time in Boston.  I yawned my way down to the ground-transport level.I lived in Boston briefly maybe 12 years ago; I remembered that it’s a nightmare to drive around in but the local subway (“the T”) is fast & easy.  I bought a week-pass on the T.  There’s a bus from the airport to the nearest T stop; I hopped on the bus figuring I’d have some Real Boston Experiences while trying to re-figure out the T. First snag: the bus driver informs us that the T doesn’t start running till 6am on Sunday!  Oops – looks like I got a hour wait in the airport.  While hanging out, Manish Vachharajani (figured out how to compress trace files using BDD’s, both massive compression and you can do many trace-file analyses directly on the compressed format) jogs up: he too was stymied trying to take the T.  We end up splitting a cab.

The Omni Parker House claims to be the oldest continuously operating hotel in America.  Certainly the building is very old, but with that old-hotel charm (gilded ceilings, massive old chandliers, dark wood paneling everywhere, helpful-not-intrusive staff).  It’s smack downtown Boston; right between the old North Church cemetery (buried: John Winthrop, first governor, and other loyalists) and Granary Burying Ground (buried: Sam Adams, John Hancock, Paul Revere & revolutionaries).

I paid for maybe one dinner, AMD paid for another, Google paid for another.  I eventually got to use my T pass to get to and from the Google reception (during rush hour: so the subway was a crush but we beat the bus) and also to the airport, so the pass wasn’t a waste.

As usual for these trip reports, I skip some sessions and brutally summarize what I do attend.  Apologies in advance if I gave your paper short-shrift, feel free to correct me here.  I actually attended 2 workshops and the CGO conference proper:

  • EPHAM homepage -Workshop on Exploiting Parallelism with Transactional Memory and other Hardware Assisted Methods
  • STMCS homepage – Third Workshop on Software Tools for MultiCore Systems
  • CGO homepage – 2008 International Symposium on Code Generation and Optimization

Here’s the talks I attended, sorted roughly by how important I feel they are (and not in any time order):

I attended at least 8 more talks, who’s notes are not worth repeating here.  I got to see snow, slush and drizzle; hiked around in 30 degree weather with cars splashing salt/sand/mud over the cracking stone curbs, dodged rusty metal poles, wandered through mildew-covered crumbling buildings and graffiti covered subways and generally remembered all the reasons why I left Boston so long ago.  It’s a fun place to visit, but I’ll never live there again.


Long talk with Maurice Herlihy about his current work

He’s building STM solutions at a higher level of abstraction, e.g. adding transaction support to a hashtable; the underlying structure (hash table in this case) can be anything with reversable operations. He basically builds an ‘abstract lock’ lazily as the transaction progresses, by taking locks covering all the ‘keys’ used in the hash table. A collision on the abstract lock means a (statistical likely) collision on the actual transactional data, and he can roll-back patially executed XTN’s by using the reversable operation.

The actual covering locks are just a pile of striped locks in some large hash table, so contention can be made arbitrarily low.  Also, the underlying datastructure, e.g. a hashtable, is treated as a black-box. We could, for instance, add transactions to my NonBlockingHashMap fairly easily with this idea.


Private conversation w/Manish

How to speed up worklist/task-queues.

His approach (low latency focus): keep producer & consumer seperated from each other by more than a cache-latency. i.e., producer stuffs into a ring buffer, consumer pulls from ring buffer. Ring buffer needs to be sized such that producer & consumer are not ping-ponging lines during worklist push/pop ops. Clearly the cache lines ping-pong – but you want them to ping-pong once per cycle around the whole ring buffer – not once per each time you insert or remove. Arrange to slow down consumer if it “catches up” to producer, because if the consumer gets to close, he slows down (the already slow) producer by making the producer take more cache-misses while inserting. Same in reverse for slow consumers (but if you have slow consumers then work accumulates ad-infinitem and really you need to throttle the producers).

My approach (scalable focus): stripe queues ab-absurdem until all queues are of length 1, hence really just an array word holding either null or task. Push/pop requires scanning the array for empty/full slots. Real trick for me: proper sizing of the array. Huge scanning is expensive, but some scanning is OK.

Combo: big array, with parallel insert/remove threads. Want an auto-resize heuristic, and an auto-“back-off” heuristic. Auto-resize: need a bigger array if getting too full (producers can’t find empty slot).  In the make-bigger logic, can throttle producers (sleep until some reasonable # of empty slots appear). Need to make smaller array if consumers can’t find producer’s work: if typically end up scanning a larger fraction of array before finding work (it’s OK to scan thru a large empty array IF we’re following just behind a producer, so really scanning very little). Auto-“back off”: if see CAS failures for insert/remove then we’ve run up against another threads’ (could be consumer or producer) cache. ie., we’re making the other guy slow (and not being fast ourselves) – so “back off”. Either pick a different spot in the array to work, or sleep or grow the array.

Keynote: Vivek Sarker (now at Rice)Need more emphasis one auto-parallization.
Some recent successes have improved common use cases a bunch.
Lots of cores now.  Lots of libraries for parallel languages; java concurrency, Fortress, X10, intel tbb, Cilk, MPI, .NET parallel extensions

So… compilers are “under seige” – the “old model” of: parse-to-IR, then optimize, then reg alloc, then code emit… isn’t working. No room in the IR for all the new parallel paradygms.

Vivek suspects a coming paradygm shift (me too as well).

Shows what breaks if we include parallism in the low-level IR: e.g. parallel CFG; the usual “dominator” relationship doesn’t hold; no dominator tree. But this analogy is flawed (in my mind) because the DATAflow still happens “as-if” the parallel blocks executed sequentially, only the control-flow is parallel.

But probably want some form of DATA-parallel use; admit that the usage is racey in the IR, and compiler is allowed to take any value that could reach – or even emit code that makes that choice at runtime.

Goes on to (attempt to) derive a parallel IR onstage in a keynote.

Long discussion followed; he promises to look at using the open-source HotSpot along with Jikes. More-or-less invited me to come visit Rice sometime and check out the grad students.

Keynote, ATI, Norm Rubin
Pixar: 100,000 min of compute per min of image

Blinn’s law: time-per-frame is constant. It’s 1000 machines (all the movie company can manage) and 100 min/frame (all the schedule allows). Every new movie, toss out all old machines & buy new: they are faster.  Faster machines lets the graphics become more realistic.

8 PCs; 8 groups of 64 lock-step SIMD threads = 512 threads.
Each individual instruction is a 5-way VLIW.
(VLIW packing rate varies from 3.4 ops/VLIW to 4.2 ops/VLIW).
Pipeline is fairly long, but it’s 64×5 = 320 fp-ops / cycle.
Loads block a thread till they resolve.

Each SIMD pipe has 256 registers; each register is 128 bits.  If each thread needs 5 registers, then 256/5 = 51 wavefronts can get into run-queue. 51 wave-fronts = 3264 threads per simd or 13056 running or waiting threads.  Roughly 262,144 registers or 1meg of register space.

GPU programming model: producer/consumer relation. Hundreds of pending GPU instructions.  GPU runs a loop 30-60 frames/sec; build vertex buffer, set uniform inputs, draw.

User writes: vertex shader, pixel shader. Nothing about parallelism, or number of cores, or sync. That’s all added by the GPU JIT.  No user program has a data-race; the language doesn’t allow it.

Open interesting problems in register allocation.

Sawzall Robert GreismerGoogle server farm, map-reduce.
read/write using ‘protocal buffers’
WorkQueue – scheduling system for cluster of processors. Tries to co-locate tasks near data (no network i/o, just disk i/o).
MapReduce – abstracts the distributed system

Filter code is auto-parallelized, and fed into aggegrators.  Not sure what’s new here compared to StreamIt on a grand scale.
Sawzall – language of MapReduce. Domain specific. Like graphics, very parallel- but no concurrent programming.

Aggregators can be clever: cannot keep all the data coming in; must reduce the data. Compute sums, averages, std devs, top N.
Domain-specific language to write aggregators; strongly typed and concise. Mostly simple, regular syntax, close to C. Specific basic
types (strings, time, fingerprints, table types); value semantics, even for structs (no pointers). Lots of string regex support.
Explicit “undefined values”; want to abort sooner rather than later; willing to test for explicit “undefined value”; tests are not
auto-inserted everywhere, but must have explicit test. At runtime, can switch a flag to make undefined values silently ignored (better
to ignore e.g. network corruption rather than crash).

Quantifiers in the language: “when (i: some int; P(a[i])) {…}” will iterate over array ‘a’ for all indices ‘i’ and run some filter ‘P’
first. 3 forms: each, some, all. Nest nicely. Seems similar to Fortress. Generates bytecodes, which are then JIT’d. Ref-counting GC
(value semantics, no cycles possible); uniform object model (all things are ptr’s – even int’s).

Cliff Notes: Hey! We could do some kind of decent sample-based profiling in the HotSpot Tiered World. Periodically have a sample-thread wake up and hack the inline-caches of the top hot methods to call their C1 equivalents. Have the C1 code patch back to the C2 code, but then run on collecting info (for 1 method invocation).  IBM has already shown that if the frequency of profiling is >0.1% then you get good-enough profiling info; we’d only need to hack the inline-caches rarely to get decent continuous C1 profiling info.

Hardware Acceleration for Lock-free Data Structures and Software Transactional Memory – AMDAMD – Advanced Sync Facilitiy – not in silicon yet.
ASF limited to 8 cache lines.  Issue a set of ‘locked’ loads. ‘acquire’ starts critical section. Checks for consistency. Stores rIP & rSP for rollback. No other registers.  Then allow mods to the locked locations.  Finally ‘commit’ (or abort).  Use ASF for the 1st 8 lines of a TinySTM transaction, then switch to software afterwards.

Cliff Notes: More or less, this is AMD announcing HTM support in their lab: free (good) simulation available, but no silicon.  The hardwired ‘8’ is a lower bound: you are guaranteed you can atomically update at least 8 lines.  Several (most?) lock-free algorithms require atomic update of 2-4 unrelated addresses.  This is an architectural guarantee that such algorithms would work on all future AMD processors with this support.  No actual hardware has been announced.

Energy Concerns using HTM
Maurice HerlihyNo space in a embedded to do a full STM or even a hybrid TM; so expect multiple cores in an embedded system with a simple HTM.

Basically, abort & retry costs power.

MP-ARM – has a 512b transaction cache (fully associative), plus an 8K L1. 128B scratchpad. On an abort, the CPU rolls-back the registers from the scratchpad & enters lower-power mode. It wakes up after a random interval & retries.

T-Cache costs them 20% more power – but make it up with shorter running time. Generally a win for apps with contention (less total power than locking & serializing).

Cliff Notes: Basically it’s ARM’s version of HTM

Synchronization Aware Conflict Resolution for Runtime Monitoring Using Transactional MemoryUse TM to detect dataraces. AUTO-insert TM begin/end around regions doing conflict detection. Tried it per basic-block in SpecInt – too expensive. Tried to make a TM per 8-12 basic blocks; then overhead is down to 10% or so. (Cliff Notes: What TM are they using? Since the TMs are small, they assume hardware TM which explains the low 10% overhead).

Turns out auto-inserted TM causes lots of live-lock issues.  He demos several live-lock scenarios with reasonable looking TM hardware & spin-locks.

So they spot spin-lock patterns and do not auto-insert TMs around spin-locks… no they make the hardware spot spin-locks and break the TM in a way that makes progress.

Automatic Array Inlining
Christian WimmerUber object-inlining, for arrays. Great for short fixed-length arrays, ala what comes out of Reflection.

Really it’s object co-location, which the GC preserves OR will tell the JITs to recompile.

  1. phase 1 – decide what to co-locate (profile in C1); tell GC
  2. phase 2 – GC co-locates all such things; tells JIT
  3. phase 3 – JIT recompiles assuming co-location.

Also JIT must detect that objects are co-allocated, and not modified later (implicit final fields: detect changes with field guards).

This technique is easy for known fixed-length arrays: it’s just like inlining an object. Also works for unknown-length arrays that are not changing; the “tail” of the inlined-object-set isn’t known, but isn’t needed.

Can handle changing the inlined array field itself.  Can fold the array length check into the array field guard: set the old array length to 0 (the old array is ONLY pointed to by the original object, this is a precondition to inlining, so no other object/thread can see the length change).  If the range-check vs 0-length fails, 1st go to some slow code which re-checks after loading the proper field.  Eventually the GC resorts/realigns, and then the fast code works again.

Also can handle having remote /shared pointers to the inlined array, even with hacking the array length: have 2 array lengths. The “normal” length used by non-inlined code, and “inlined” length used only by inline-smart code.

Also can do inlining objects into arrays: first lay out the array, then have the GC place the objects following the array in order.  Alas, harder to detect changes & harder to detect setup conditions.  They gave up here – but seems to me you could do it if GC tells you the objects are unshared (only pointed-at by the array), and you catch all array-writes to that array (by tagging the array as read-only after GC does the layout, perhaps placing it in a read-only page).

Get 20% speedup on some SpecJVM98’s.
Alas no speedup on DaCapo. 🙁
Very surprising to me that no speedup on DaCapo; I would expect the String co-location to be worth alot – aha he’s stuck with the old JDK strings which share char arrays, so it doesn’t work for him.  I predict a ~10% DaCapo speedup on Azul based on the Strings optimization.

Azul looked at this earlier and declared it a good-idea for our next-gen. Basically, branch-on-random powers-of-2, i.e. either 50% or 25% or 12.5% or …, etc.  Use it to lower the costs of sampling.  Typical sampling cost uses a counter and every N (e.g. 1024) executions we take a heavy-weight sampling pass. But there is a cost to test&dec the counter, uses an extra register, extra control flow, extra cache, extra instructions.

‘brr’ is just a conditional branch, but the condition is a frequency. Only need 4 bits to cover all reasonable use cases. Low hardware requirements (a LSFR, total maybe 100 gates including mux’s); fits well into existing pipelines. Can be made tolerant of branch mis-predict (i.e., can get the info very early in the pipe).

Multi-core Thread-level Speculative ParallelismDiscovering lots of speculative parallism opportunities in sequential code. Lots more if you are willing to do large code transforms.
Ex: while-ptr-chaining-loop finding the min.
Ex: nest the above loop in a worklist which pulls the min element, then tranforms the list (ala a standard worklist algorithm).

Use value prediction to allow thread-level speculation. Basically, start a parallel loop at a random value-predicted pointer location.
After a few iterations either crash or line up with some linked-list body. Actually, capture some value on the 1st time through the inner
loop; on later iterations thru inner loop have some good choices for prediction. There’s some decent auto-load-balancing going on; each iteration of the inner loop records enuf info to allow a good (re)partition for the next iteration of the inner-loop.

Technique does not only do linked-lists; needs any double-nested loop. Inner loop gets parallized, outer loop allows the inner loop to be continously ‘profiled’ and ‘load balanced’. Really can parallize other ‘hard-to-parallize’ inner loops. Would work well in Java; I probably could do it in C2 with enuf effort.

Analyzing perf diffs in JVMsUsing “calling context tree” – similar to our 8-deep call-stack capture.

Doing ‘diffs’ on these trees. Diff’ing based both on structure, and on edge weight. Doing things like “adjusting” the weight of top-nodes in a tree so 2 trees match in weight, then looking at diff’s farther down the trees. Also looking at weight of normalized subtrees, etc.

Comes up with a fairly ‘diff’ like interface (cmd-line) to compare 2 forests of trees and highlight the diff’s.

Able to track down differences in 2 websphere/trade6 setups.
Discovered differences between timezone settings between AIX & ZLinux, and also DB setups.

PiPa – Pipeline Profiling (best student paper)
Faster tool for doing e.g. cachegrind. Parallize instrumented application. In SuperPIN – idea: main app just forks off another thread periodically to do the instrumentation work. (how is this different from jamming ops/time into a buffer, and having background
threads pick up new buffers & gzip ‘em).Idea: process the background buffer in another thread; do full analysis on the buffer(s) in parallel – in real time with the app running.

Demo running a cache-simulator, in “real time”.  Using double-buffering, large shared buffer.  Pretty obviously equivalent to Azul’s tracing mechanism, except she’s doing a cache-sim instead of a gzip.  Can run the original program with 3x slowdown, but get a full cache sim.

Prediction & Trace Compression vs Nested Loop RecognitionLooks cool: find simple loops from traces; loops are simple linear algebra. Imperfect loops allowed. From loops, find nested loops.  Get massive compression ratios vs bzip & other trace compression techniques. Very simple to decompress; just “execute” the loops.

Fast Liveness Checking in SSA formAnswer based on demand: is var live here?
Faster precomp, slower queries.  Info depends on CFG & def-use chains
Invariant to adding & removing instrs; sounds useful for reg alloc adding & removing spill ops.

For HotSpot, liveness computation is probably 10% of total compile time.

Compute ‘highest point’ in dom tree where query point is dominated by def point. Compute all-paths dom-vs-each CFG node

  • Compute transitive closure over reduced CFG (simple DAG). Simple post-order traversal
  • For each block q, compute highest block (highest backedge target). Simple pre-order & post-order traversal.
  • Arrange for highest node to be first set bit in trans closure

Query: goto highest block; see if def dominates highest block.
Looks fast & simple. Might speed up reg-alloc.

Nearly Optimal Instruction Selection on DAGs
Stupidly slick: tile the DAG once, just the way HotSpot does right now – except allow sharing always (right now HotSpot does “CSE” on sharing, forcing shared values in registers – except for small expressions that are common in address-arithmetic).Then look at each place where a shared value was subsumed into larger instructions (hence replicated in the instructions). For just those places, see what the local cost would be if you did not share. If the cost is better to not-share, flag the node as getting it’s own register result. Then re-run the DAG tiling, but living with the shared results from the 1st pass.

Ex: ((a+8)(b+8)). Should we share the ‘8’ or not?
Pass 1:
t1 = a+8   // 5 bytes
t2 = b+8   // 5 bytes
t3 = t1
t2 // 1 byte
But if ‘8’ goes into a register:
t0 = movlo 8 // 5 bytes
t1 = a+t1    // 1 bytes
t2 = b+t1    // 1 bytes
t3 = t1*t2   // 1 bytes

Reduces code size by 2-4% nearly always, and very close to optimal. What hotspot does is essentially what he calls ‘cse-leaf’ in paper. His NOTLIS does somewhat better.
Cliff Note: I did a bunch of Azul-specific hackery in our DAG share-vs-clone heuristic to try and get what he gets naturally. – grad’ing soon

Exception Check Motion for Java

  • Run aggressive PRE, ignoring exceptions checks
  • See if can move checks to allow the PRE
  • Undo any PRE motion where the checks can’t be moved

Got ~2% on production compiler – which confirms my suspicion that there’s not much more to be had there. Mostly moving null-checks around – and HotSpot is already very aggressive about moving & removing null checks.

His examples HS will all get another way; loop-peeling and Split-If transform will wipe out all his gains.

Another unsafe code-motion talk for Java.He’s got better examples: loop-peeling won’t wipe out his first example (but versioning will).

Safety checks turned into data-dependence, just like HS

heh, another talk that’ almost getting as good as HS has been for the last 10 years.

Then convert data-dependency safety checks into IA64 predication bits and happily predicate speculative loads. Seeing 4% for just the PRE based motion (on SpecJVM98), another 2% for doing hardware predication & speculation bits.

Towards Fair, Scalable Locking
Barcelona Supercomputing & MicrosoftScalable reader-writer locks (something I’ve thought about alot in the last year). Want fair (no starvation of either readers or writers). Want fine-grained striped locks (? specific to different data types)

Demo’s starvation in a HTM system similar to Azul – eager updates can mutually kill each other.  Demo’s cache-directory-reservation can lead to deadlock.

So far seems like a more complex version in hardware of something that’s easy enough in software. It’s only partially fair (not FIFO). Preserves proportions of readers & writers.

Using STM. Overhead looks very bad.

Runtime Tuning of SM Validation Techniques
Rice UniversitySTM validation is expensive; no one technique wins all cases. Use runtime to dynamically change validation scheme. Periodically (every 5% of TM attempts) run the same TM with 1 of 3 different validation techniques; if the new technique is working better than the old one, then keep it – otherwise immediately revert back to what was working well.

Transactional Support for Dynamic LanguagesLooking at Jython – get to use the fast JVM.  To make Python bytecodes need some huge bad wrappers around every little variable access. This makes Jython as slow as regular interpreted python.

What we’d like to do is skip the fancy python lookup and use plain Java variables – except that we want to keep the good python dynamic semantics. Would “lookup x in y” to be as fast as a local variable reference.

So use transactions to test the pre-conditions & use fast specialized code otherwise. i.e., execute inside a transaction & test preconditions.

Use best-effort HTM. Add a few specific speculation primitives to JVM (speculate, abort, commit). Jython generates both specialized-case fast-code and also full semantics. Use HTM to run the fast-case. If that fails, fall back to the full semantics bytecodes.

Cliff Notes: Sun Labs guy noticed that this technique requires abort info from the hardware (was it capacity vs remote thread) – does Rock TM lack the failure-mode data?

User-defined Priority Based Transaction Contention ManagementReally, this is poster case for why TM’s will not save the world.  He’s proposing users can put some kind of priority on transactions – and higher priority transactions can “barge” over lower priority ones. What a mess!!!

Map-Reduce for GPU’sMore general term than Google’s map/reduce.  Really its stages of independent computation, followed by communication.

GPUs are great for doing ‘map’ – lots of GPUs, lots of bandwidth. But only local communication is cheap; global communication is very expensive. Also, reduction on GPUs is well known to be very difficult. Lots of PHDs going into deciding how much unrolling is best, or how best to split up computations. Sometimes see 60x slowdowns if the best-reduction for one dataset size is used for a different dataset size.

User writes a MAP function in CUDA (NVidia’s C+GPU operaters).  User’s map function outputs results in a per-thread local-store. User writes a binary associative reduction operator – takes 2 inputs and returns 1; and his framework will restructure the reduce step for best parallelism. (also needs an identity value, in case some MAP results are filtered out)

Analysis of CUDA Programsauto-detect race conditions (detects potential dataraces!), detect bank conflicts (extreme poor memory access patterns)

CUDA abtractions: kernel functions, threads, scratchpad ram

Cores are SIMD; execute program in lock-step; program is in scratchpad ram.

2 global bookkeeping arrays, reads & writes to global memory.  two per-thread arrays: reads& writes of a single thread. After each access to a global, check for a race.

Scratchpad is 16-way banked per 16 threads; fast as a register if no bank collisions. Bank conflicts can reduce performance by a factor of 4; slowdown equal to not using scratchpad at all.

Streaming Applications on Multi-Core SystemsBoxes using different hardware for streaming: gpgpu, fpga, webcam, disk, network, etc. Very hard to simulate & debug when lots of different parts are moving.

Auto-Pipe is a set of tools to create, test, build, deploy, optimize distributed streaming apps. Lots of pipeline parallelism, and regular parallelism within a pipe stage.

Auto-Pipe – define a set of blocks and edges or computation & communication. Also describe the hardware layout: cpus, cores, fpus, etc.  Then provide a mapping from blocks to hardware chunks.  No need to worry about communication between the blocks.

Phase-Based Adaptive Recompilation in a VM
Using hardware perf counters to detect phase-changes, and trigger recompiles. There’s been a lot of work in this area already, I’m not sure what’s new here.

Finding Critical Instructions Without Hardware“critical” – instructions on the hot path.  Create OFF-line synthetic instruction traces based on existing hardware profile info, including branch mis-predicts & aliasing info.  Then analyze those traces, and compile (schedule) to those.

Take a random-walk thru binary, throwing a weighted-die at each branch (starting at a random spot using weighted-die). Emulate a call stack for call/return sanity. Can “get stuck” in various oddball paths, so reseed the trace after N instructions.

More: seed abstract register state with random values, then start tracing. Loads & stores can use random addresses, or load/store addresses: aliases will appear “obvious” because the random addresses will collide.  Also simulate memory using the random addresses. Turns
out most aliasing is captured, but not all (structured linked lists will behave as random instead).

Finally: stuff the random traces thru the hardware simulator & come up with critical paths & schedules. Then use this info to compile better scheduling (he’s doing clustered VLIW’s). Seeing maybe 10% performance gain on his funny clustered VLIW’s.

Program Optimization Space Pruning for a Multithreaded GPUNVIDIA CUDA talk. Need loads of threads (12000 hardware supported). Have 45G/sec bandwidth (Gbyte? GBit?) but still run out; must use local caches. Encouraging loop unrolling, hardware prefetch, traditional optimizations. Scheduling loops for thread-level and memory-level parallism. Have about 8000+ total registers.

Increasing regs-per-thread might increase per-thread performance, but run out of the shared register space and thus be able to launch fewer threads in parallel. In short, need to search large space of configs; often hand-tuned code is about 15% below max trapped in some local maxima.  A non-intuitive mix of other optimizations can move you to a better minimum.