Too Much Theory, Part 2

As I write this I am flying back on SwissAir having just spent a week in [beautifully freezing downtown Stockholm](http://www.weather.com/weather/tenday/59.328930,18.064910) (highs are below freezing all this week) speaking at [JFokus](http://www.jfokus.se/jfokus/?lang=en).  My talks went well; the audience was engaged and I had lots of questions and comments afterwards.  The speakers were treated to a marvelous dinner (thanks Mattias!) which included singing by some of the JFokus crew.  I attended a few talks and wish I had time for more; JFokus definitely was a fun conference to attend.

When not at JFokus, Shelley & I managed to see [The Vasa](http://www.vasamuseet.se/en/), a 1600's era warship that sank on her maiden voyage, barely out of dock.  She lay forgotten for 300 years and was finally raised in the 1960's and preserved & restored in the 1970's and 1980's.  All the iron bolts had rusted out (the iron cannons were salvaged a mere 50 yrs after her sinking) but all else remained.  Apparently the salinity in the harbor prevented the usual shipworms from eating everything wooden; the ship is 95% original.  She remains by far the largest salvaged wooden structure in the world.  She's an amazing sight to behold and well worth the museum trip.  We also managed to visit the Swedish historical museum and ride the local trams some.  Only the horrible hacking cough Shelley picked up kept us from doing more exploration (nothing like being sick in a foreign land).

Assuming the rest of our trip back remains easy (first flight out was 15min delayed, giving us a whopping 30min to change from Terminal A to Terminal E in Zurich, including 2 passport checks – but we made it with only a little running), I'm going to dive into Part Two of my Deep Dive into Too Much Theory.

[Last week, if you recall](http://www.azulsystems.com/blog/cliff/2012-02-12-too-much-theory) (I've wanted to use that phrase in a sentence for a long time now!  🙂) we were looking at the link between [**lattices**](http://en.wikipedia.org/wiki/Lattice_%28order%29) and [**constant propagation**](http://en.wikipedia.org/wiki/Constant_folding).  **Constant propagation** is a means to discover constants in programs, which in turn can enable more optimizations (e.g. replacing the obvious math producing the constant with just the constant directly).  **Lattices** are how we describe the kinds of “constants” we can discover.  As already discussed, we're not just finding simple integer constants like “3”, but possibly things like the **null** pointer, or that a pointer is **not_null** (some unknown value… just never **null**), or ranges of integers (useful to remove some kinds of array range checks).  So when I say “Constant” what I really mean is “Some interesting values” – or properties about values – not necessarily limited to true constant values.

Before we dive into extending our **lattice** into even more exciting forms, I want to talk about properties of the **lattice** directly.  We **need** some properties or we're doomed – but there are some properties that we want that are less obvious.

## A Commutative, Associative, and Symmetric Lattice

* A lattice is a collection of elements (e.g. _**top**_, _**bottom**_, 0, 1, 2, 3, …)
* With an operator for mixing the elements called **meet**, which defines a partial order over the elements

We used **meet** when mixing different values for the same variable because of control flow (for you [SSA wonks](http://en.wikipedia.org/wiki/Static_single_assignment), it is the operation given to Phi functions).  **meet** produces another element in the lattice; thus lattices are **closed** under the **meet** operation.  Lattices also need:

* Distinguished top-most and bottom-most values.  We call these values _**top**_ and _**bottom**_ respectively.  They represent the start and end points of Constant Propagation.
* Our lattices need to be of **bounded** height, meaning we cannot apply the **meet** operator an infinite number of times before we hit _**bottom**_.  The height of the lattice dictates our running time.  Especially for a JIT, a short running time is key; in practice lattices used in compilers tend to have a very small height.  For HotSpot it is something like 5 or 6.
* We want our **meet** operator to be _commutative_, _associative_ and _symmetric_.  The HotSpot C2 lattice has all these properties, and is defined using the above **meet** operator and a **dual** operator: reflection across a centerline.  [The Wikipedia article uses meet and join to define a lattice](http://en.wikipedia.org/wiki/Lattice_%28order%29), but HotSpot uses **meet** and **dual** because it's easier to implement in C++.

The reason for these properties is a little subtle: if the **meet** operator lacks these then the lattice isn't such a simple structure – and the _order of application_ of **meet** will matter.  I.e., the order we visit our program and apply our constant propagation will matter.  Some visitation orders will give better answers than others.  Example: Suppose **meet** is not commutative and 'b=3' on entry to a loop and 'b=**TOP**' on the backedge.  If **(3 meet TOP)** is **(3)** but **(TOP meet 3)** is **(BOTTOM)** then we lose a constant – and it matters how (& when) we flow information around.  By requiring a commutative, associative and symmetric lattice we get a [“Church-Rosser” style property](http://en.wikipedia.org/wiki/Confluence_%28abstract_rewriting%29), and this in turn means we can use an un-ordered worklist style technique and get the same best answer no matter how things come off the worklist – which is exactly how HotSpot implements Constant Propagation (a small sketch of such a worklist appears just after this list).

* We want our lattice to be reflective around some centerline (which we choose as the line of simple constants); this is implied from the symmetry on the **meet** operator above.
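
To make the worklist idea concrete, here is a minimal sketch (in Java, with invented names – this is _not_ HotSpot's code) of un-ordered worklist propagation over a flat lattice of _**top**_, integer constants, and _**bottom**_.  Because **meet** is commutative, associative and idempotent, the order in which nodes come off the worklist cannot change the final answer:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Sketch only: a flat lattice {top, integer constants, bottom} plus an
    // un-ordered worklist.  TOP and BOTTOM are encoded as sentinel longs.
    final class TinyCP {
      static final long TOP    = Long.MIN_VALUE;   // "any constant we like"
      static final long BOTTOM = Long.MAX_VALUE;   // "no useful information"

      // Commutative, associative and idempotent - so worklist order cannot
      // change the final answer.
      static long meet( long a, long b ) {
        if( a == TOP ) return b;
        if( b == TOP ) return a;
        if( a == BOTTOM || b == BOTTOM ) return BOTTOM;
        return (a == b) ? a : BOTTOM;              // two different constants
      }

      // A merge point (think: Phi).  Its value is the meet of its inputs.
      static final class Node {
        long val = TOP;                            // optimistic starting value
        final List<Node> inputs = new ArrayList<>();
        final List<Node> users  = new ArrayList<>();
      }

      static void propagate( List<Node> nodes ) {
        Deque<Node> work = new ArrayDeque<>(nodes);
        while( !work.isEmpty() ) {
          Node n = work.poll();
          if( n.inputs.isEmpty() ) continue;       // leaves keep their seeded value
          long v = TOP;
          for( Node in : n.inputs ) v = meet(v, in.val);
          if( v != n.val ) {                       // value fell in the lattice
            n.val = v;
            work.addAll(n.users);                  // revisit everything reading n
          }
        }
      }
    }

Seed the leaf Nodes with constants (or BOTTOM for unknowns), push everything on the worklist once, and propagate() runs downhill to a fixed point – the bounded lattice height is what guarantees it stops.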

The prior integer lattice was commutative, associative & symmetric:

                        _**top**_
    ...   -2      -1     0     1     2     3  ...
                       _**bottom**_
But the lattice with ranges from the prior blog (in addition to having a ridiculous bound) was not symmetric.  Let's try the lattice of ranges again – but limited to 2 values – and then we'll add symmetry.  As before, if we try to merge (**meet**) a lattice element of X with X+1 we'll make a short range, but if we meet a range of 2 values with a third – instead of expanding the range we will fall to _**bottom**_.  Here's our lattice:
                        _**top**_
    ...   -2      -1     0     1     2     3  ...
    ...  (-2,-1) (-1,0) (0,1) (1,2) (2,3) (3,4) ....
                       _**bottom**_

Is this lattice commutative?  I.e., is (a meet b) == (b meet a)?  Let's look at some examples.  What happens when we meet top and -1?  (top meet -1) == -1 == (-1 meet top), so this example works.  So does: (1 meet (2,3)) == bottom == ((2,3) meet 1).  So does: (0 meet (0,1)) == (0,1) == ((0,1) meet 0).  I claim the “obvious interpretation” of this lattice is commutative.

Is this lattice associative?  i.e., is ((a meet b) meet c) == (a meet (b meet c))?  Again, running a few examples can convince us the lattice is indeed associative.
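
If you prefer code to hand-waving, here is a throwaway sketch of that **meet** operator; the (lo,hi) pair encoding and the class name are invented for illustration.  Its main() just replays the three examples above:

    // Throwaway sketch of meet() for the "constants plus 2-wide ranges" lattice.
    // Encoding: null is top, a shared sentinel array is bottom, otherwise {lo,hi}.
    final class RangeMeet {
      static final int[] TOP    = null;
      static final int[] BOTTOM = { Integer.MIN_VALUE, Integer.MAX_VALUE };

      static int[] constant( int c )        { return new int[]{ c, c }; }
      static int[] range( int lo, int hi )  { return new int[]{ lo, hi }; }

      static int[] meet( int[] a, int[] b ) {
        if( a == TOP ) return b;
        if( b == TOP ) return a;
        if( a == BOTTOM || b == BOTTOM ) return BOTTOM;
        int lo = Math.min(a[0], b[0]);
        int hi = Math.max(a[1], b[1]);
        // Adjacent values make a short range; anything wider falls to bottom.
        return (hi - lo <= 1) ? range(lo, hi) : BOTTOM;
      }

      public static void main( String[] args ) {
        // (top meet -1) == -1 == (-1 meet top)
        System.out.println(java.util.Arrays.toString(meet(TOP, constant(-1))));
        // (1 meet (2,3)) == bottom == ((2,3) meet 1)
        System.out.println(meet(constant(1), range(2,3)) == BOTTOM &&
                           meet(range(2,3), constant(1)) == BOTTOM);
        // (0 meet (0,1)) == (0,1) == ((0,1) meet 0)
        System.out.println(java.util.Arrays.toString(meet(constant(0), range(0,1))));
      }
    }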

Is this lattice symmetric?  Well, in a word: No.  The opposite of top is bottom.  Things on the centerline are symmetric with themselves (e.g. in the lattice world, 3 is the opposite of 3).  How about a range like (1,2)?  It represents “1 or 2 but we don’t know which, and no other number”.  What is opposite the centerline from the range (1,2)?  Well, it’s a little weird: it is both the constant 1 AND the constant 2 at the same time.  NOT the more obvious “1 OR 2 and we don’t know which one it is”.  Whoa… let's digest that for a second.  Like top, it is an unnatural state of affairs (except in mathematics) whose concept is a little slippery.  Like top, we want to embrace a concept of allowing as much flexibility in number choices as we can… while still being a little bit more constrained than top.  Remember: top is ALL constants ALL the time; we’ll settle here for just 2 constants, but BOTH AT ONCE.

Having discussed the concept to death, let's settle on some nomenclature and then run some examples.  Let's write the opposite of the range (1,2) as the inverted range (2,1) (the high and low values are inverted).  Another syntax might be prepending a ‘~’ as in: ~(1,2).

                         _**top**_
     ..(-1,-2)  (0,-1) (1,0) (2,1) (3,2) (4,3) ....
     ...   -2      -1     0     1     2     3  ...
     ...  (-2,-1) (-1,0) (0,1) (1,2) (2,3) (3,4) ....
                        _**bottom**_
Notice how our lattice looks symmetric across the centerline now?  Here's the same example from the last blog using _**top**_:
  b = 1;       // b is the constant '1' here.
  while( P ) { // b is _**top**_ around the backedge -
               //   which matches the constant '1'
    S1;        // b is a '1' here
    b = 2-b;   // b is STILL a '1' here!
  }            // we bring a '1' value around for 'b' here
Again using the inverted range (2,1):
  b = 1;       // b is the constant '1' here.
  while( P ) { // b is (2,1) around the backedge -
               //   which matches the constant '1'
    S1;        // b is a '1' here
    b = 2-b;   // b is STILL a '1' here!
  }            // we bring a '1' value around for 'b' here
Ok, why bother with inverted ranges when _**top**_ covers everything?  Because we use them whenever we can **assert** a property is true, such as just after the property is tested with an IF:
  b = ...; // b is BOTTOM
  if( b >= 1 && b <= 2 ) {
    // here we can **assert** that b is the **join**
    //      of whatever it was before, and (1,2)
    b;     // b is range (1,2)
  }
New jargon word: **join**, an operation which produces the intersection of properties (both the state of 'b' before the IF must be true, but also inside the if() the test must be true – so both properties are true at once).  Contrast this to the **meet** operator, which yields the union of unknown properties.  The **join** of A and B can also be defined as: **~(~A meet ~B)**, and indeed this is exactly how HotSpot's C2 compiler computes it, as seen in src/share/vm/opto/type.cpp:

    // JOIN operation; higher in lattice.
    // Done by finding the dual of the
    // meet of the dual of the 2 inputs.
    const Type *join( const Type *t ) const {
      return dual()->meet(t->dual())->dual(); }
Back to our example of using **join** in a sentence:
  b = ...; // b is _**bottom**_
  if( b >= 1 && b <= 2 ) {     // b is the join of _**bottom**_ and (1,2)
    // b is ~(~_**bottom**_ **meet** ~(1,2))
    // b is ~(_**top**_ **meet** (2,1))
    // b is ~(2,1)
    // b is (1,2)
  }
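
Here is the same trick as a tiny sketch: an invented (lo,hi) encoding where a constant is its own dual, an inverted pair marks an element above the centerline, and join() is computed exactly the dual-meet-dual way the C2 snippet above does it.  Only the meet cases these examples need are filled in:

    // Sketch of dual() and join() for the symmetric range lattice.  A constant
    // is its own dual; an inverted (hi,lo) pair is an element above the
    // centerline.  Only the meet cases these examples exercise are implemented.
    final class SymRange {
      static final SymRange TOP    = new SymRange(0, 0, "top");
      static final SymRange BOTTOM = new SymRange(0, 0, "bottom");
      final int lo, hi;        // lo <= hi: constant or range; lo > hi: inverted range
      final String tag;        // non-null only for top/bottom
      private SymRange( int lo, int hi, String tag ) { this.lo=lo; this.hi=hi; this.tag=tag; }
      static SymRange of( int lo, int hi ) { return new SymRange(lo, hi, null); }

      SymRange dual() {
        if( this == TOP    ) return BOTTOM;
        if( this == BOTTOM ) return TOP;
        return of(hi, lo);   // swap endpoints; a constant comes back with the same value
      }

      SymRange meet( SymRange that ) {
        if( this == TOP ) return that;
        if( that == TOP ) return this;
        if( this == BOTTOM || that == BOTTOM ) return BOTTOM;
        if( lo == that.lo && hi == that.hi ) return this;   // meet is idempotent
        return BOTTOM;       // the remaining cases are omitted in this sketch
      }

      // Exactly the dual-meet-dual formulation from the C2 snippet above.
      SymRange join( SymRange that ) { return dual().meet(that.dual()).dual(); }

      @Override public String toString() {
        if( tag != null ) return tag;
        return lo == hi ? Integer.toString(lo) : "("+lo+","+hi+")";
      }

      public static void main( String[] args ) {
        System.out.println(BOTTOM.join(of(1,2)));   // the worked example: prints (1,2)
        System.out.println(TOP.meet(of(2,1)));      // prints (2,1)
      }
    }
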
The symmetry property implies that if **M == (A meet B)**, then:

    ~A == (~A meet ~M)     AND     ~B == (~B meet ~M)

Our range lattice is now symmetric, so let's test that on some A,B pairs.

Example 1:  A == _**top**_, B == (1,2)

    M == (top meet (1,2)) == (1,2),   so  ~M == (2,1)

    Check ~A:   ~top   =?  (~top meet ~M)
                bottom =?  (bottom meet (2,1))
                bottom ==  bottom                    -- OK

    Check ~B:   ~(1,2) =?  (~(1,2) meet ~M)
                (2,1)  =?  ((2,1) meet (2,1))
                (2,1)  ==  (2,1)                     -- OK

Example 2:  A == 1, B == (0,1)

    M == (1 meet (0,1)) == (0,1),   so  ~M == (1,0)

    Check ~A:   ~1  =?  (~1 meet ~M)
                1   =?  (1 meet (1,0))
                1   ==  1                            -- OK

    Check ~B:   ~(0,1) =?  (~(0,1) meet ~M)
                (1,0)  =?  ((1,0) meet (1,0))
                (1,0)  ==  (1,0)                     -- OK

Summary:

We have a **lattice**, with distinguished _**top**_ and _**bottom**_ elements.  It has a **meet** operator which defines the lattice structure; **meet** is _commutative_ and _associative_.  The **lattice** is also _symmetric_ about the centerline, and we use the _dual_ operator “~” to refer to elements across the centerline.  Constants exist on the centerline.  The elements above the centerline are used after IF statements to assert properties, or for unknown loop backedge values – and represent 'impossible' situations like some expression producing multiple constants _**at the same time**_.

We (almost) have enough knowledge now to be dangerous!  Let's build a more realistic Java lattice now.

## A Class Hierarchy

Our pointers have more structure than just **null** or **not_null**; they have types!  We have a Class Hierarchy.  Can we do something with it?  Somewhat similar to adding ranges to plain integer constants, the base case is simple but we need to be careful as we get clever.  The payoff is also very high: if we can deduce a **constant class** for a pointer, and the pointer is used to make a virtual call then we can simplify the virtual call to a static call … and maybe even inline.  That's a big payoff, so it's worth some effort here.

Let's go with a simple starter lattice; similar to the integers, we'll look JUST for things that are known to be a specific constant class… and NOT any subclass.  Weird, I know, but bear with me:

             _**top**_ - **any** single class the compiler desires
    String   Object(no subclass!)   XMLNode   Fred
             _**bottom**_ - multiple choices,
             e.g. String or XMLNode or a subclass

A common source of known-class-thingies is the '**this**' pointer in calls (usually the only argument with a known class), so my examples start that way:

    boolean String.weird_equal( Object that ) {
      // 'this' has the constant Class String, no subclasses!
      // 'that' has the Class _**bottom**_
      //        since it might be a subclass of Object
      int hash1 = this.hashCode();
      int hash2 = that.hashCode();
      if( hash1 != hash2 ) return false;
      if( !(that instanceof String) ) return false;
      return equals((String)that);
    }

What does CP tell us here?  Let's roll it forward one line at a time:

        int hash1 = this.hashCode();  // '**this**' has Class String
                                      //  and is **not_null**

**this** has the class String, so this is really a call to String.hashCode() and we can (now) happily inline it.

        if( that == null ) throw NPE; // hidden built-in NPE check...
        int hash2 = that.hashCode();  // 'that' has Class **_bottom_**
                                      // and is **not_null** 

No constant Class here, so this becomes a not-inlinable virtual-call, i.e. expensive.

        if( hash1 != hash2 ) return false;
        if( !(that instanceof String) ) return false;

Hey!  After the 'if' test we know that '**that**' has Class String!  We already know that '**that**' is **not_null** – this gives us 2 orthogonal lattices we can track: pointers might be **null** or not, and they might be a constant Class or not.

        return equals((String)that); // the cast can be optimized away!

And since '**that**' is known to be the class String, the JIT can remove the cast.

A quick summary of what we have so far:
  pointers can be **null** or **not_null** (or _**top**_ or _**bottom**_)
    and this info is used to remove _null checks_
  pointers can be constant Classes (or _**top**_ or _**bottom**_)
    and this info is used to _de-virtualize_ calls
    and remove extra type checks
  We know both kinds of lattice info about pointers, at the same time.
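
One way to picture knowing 'both kinds of lattice info at the same time' is a product lattice: a pointer's type is a pair (null-ness, class), and **meet** simply works component-wise on the pair.  A rough sketch, with an encoding invented for illustration (this is not C2's representation):

    // Sketch: a pointer "type" as a pair of orthogonal facts, with meet done
    // component-wise.  The encoding and names are invented, not C2's.
    final class PtrType {
      enum Nullness { TOP, NULL, NOT_NULL, BOTTOM }

      final Nullness nullness;
      final String klass;      // null = top (any one class), "?" = bottom, else an exact class

      PtrType( Nullness n, String k ) { nullness = n; klass = k; }

      static Nullness meetNull( Nullness a, Nullness b ) {
        if( a == Nullness.TOP ) return b;
        if( b == Nullness.TOP ) return a;
        return a == b ? a : Nullness.BOTTOM;
      }

      static String meetKlass( String a, String b ) {
        if( a == null ) return b;        // top meets x == x
        if( b == null ) return a;
        return a.equals(b) ? a : "?";    // two different exact classes fall to bottom
      }

      PtrType meet( PtrType that ) {     // the product meet: each component independently
        return new PtrType(meetNull(nullness, that.nullness),
                           meetKlass(klass, that.klass));
      }

      @Override public String toString() {
        return nullness + ":" + (klass == null ? "top" : klass);
      }

      public static void main( String[] args ) {
        PtrType a = new PtrType(Nullness.NOT_NULL, "String");  // after the instanceof test
        PtrType b = new PtrType(Nullness.NOT_NULL, "?");       // another path: class unknown
        System.out.println(a.meet(b));   // NOT_NULL:? - null-ness survives, exact class does not
      }
    }

In the example, merging the two paths keeps the **not_null** fact even though the exact-class fact is lost.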

Similar to integer constants, our pointer Constant Propagation can be optimistic or pessimistic.  Here's some code where being optimistic might help:

  **final class** LinkedListThing {
    LinkedListThing _next; // instance field
    int hashCode() {
      int hash = 0;
      Object x = **this**;
      while( PRED ) {
        hash += x.hashCode();
        if( x **instanceof** LinkedListThing )
           x = ((LinkedListThing)x)._next;
        else
           x = "fail";
      }
      return hash;
    }
  }

Deep breath – okay, this example is getting large for a blog.  Let's break it down.  What happens if we are pessimistic?

  int hashCode() {    // this is Class LinkedListThing
    int hash = 0;
    Object x = **this**;  // x also is Class LinkedListThing
    while( PRED ) {
      // x is _**bottom**_ around the backedge

      hash += x.hashCode(); // unknown v-call

      if( x **instanceof** LinkedListThing )
      // _**bottom**_ so needs the full type-check

           x = ((LinkedListThing)x)._next;
           // x is Class LinkedListThing

      else x = "fail";
      // Else x is Class String if the full type-check fails

    } // Here x is either String or LLT... so _**bottom**_
    return hash;
  }

In one pass of CP we “discover” that 'x' is Class _**bottom**_ – not much help here!  Let's go again with the optimistic approach:

  int hashCode() {    // this is Class LinkedListThing
    int hash = 0;
    Object x = **this**;  // x also is Class LinkedListThing
    while( PRED ) {   // x is _**top**_ around the backedge;
                      //   assume it is LLT
      hash += x.hashCode(); // known call to LLT.hashCode()
      if( x **instanceof** LinkedListThing )
           // type-check not needed!

           x = ((LinkedListThing)x)._next;
           // x is Class LinkedListThing

      else ... // Not reachable; x is the correct Class already
    } // Here x is LinkedListThing!
    return hash;
  }

Hey presto!  We figure out what the programmer shoulda wrote already: that 'x' is of type LinkedListThing, and the **instanceof** test is useless.  Of course, we might have gotten here from some prior rounds of inlining, where 'x' could only be typed as an Object so lets not blame our strawman programmer (who's really just me trying to make some not-too-contrived examples).

All right, time to wrap up this blog.  **Next** time we'll add subclasses (and maybe arrays and/or interfaces) and see what happens.  But for now this blog is big enough.  Besides, me & my daughter hiked up to Upper Yosemite Falls and then hiked down at night, strictly by starlight… but that's an exciting story for the next blog.

Cliff

What I Did For My Christmas Vacation

Warning – contains no technical content …

My GF and I both love to drive (I own an Evo which I love, I drive really hard, and it’s been a rock-solid car, now with over 120K miles).  The kids were with their mom for a week-and-a-half, so GF & I took off for Bryce Canyon and the Grand Canyon.  She did most of the driving on the way out – both because she loves to drive my toy! – and because I was stuck doing PLDI reviews (sign up now!).  I did half my reviews from the road, uploading reviews at passing Starbucks.  Along the way we passed through Casa de Fruita, a 100-yr-old, oversized farm stand, classic California at its best – complete with toy riding trains, carousel, restaurant, hotel, and of course plenty of fruits & nuts (not referring to myself at all here).  I basically run on chocolate and coffee, so this was a fuel-stop for my brain.  GF stopped to feed the pet peacocks.   Since this vacation is being done “on the cheap” we stayed at a smaller hotel outside Bryce.  It was fine (if a bit plain) and the price was right.  No WiFi in theory – but it was available in the lobby, and we finagled a room just behind the lobby area which DID have WiFi… so I finished off my PLDI review uploading from our hotel room after all.

Typical Bryce Canyon weather prevailed for several days: snow flurries with lows down to 9!  After a day recovering from driving (we had put a LOT of miles under our belts!), we checked out Bryce Canyon.  Fantastical views (and GF found a raven to feed)… but obscured by snow flurries and low slung clouds.  Next day dawned with crystal clear-blue skies and winter wonderland style snow everywhere; red red rock, “hoodoos” and dark evergreens.  The colors still sparkle in my mind’s eye… pictures here.  We took a 4 mile hike along the rim, then down in and around the “hoodoos” and back out the other side.  Being the brilliant person I am, I forgot the camera battery charger (and none were for sale in any of the local shops)… so the digital camera died at the bottom of the hike (we bought a throw-away camera for the next hike, but the photos aren’t back yet).

Next we headed for the Grand Canyon.  We’ve been planning on hiking it for a couple of months now, and we lucked into a cancellation in the Phantom Ranch barracks (typically you need to reserve space a year in advance!).  Along the way there we made good time… except for the one dreaded moment when that SUV on your bumper suddenly spouts red & blue lights, and the sheriff wants to talk about safe driving speeds in the mountains of Utah (warning only: maybe 2nd time in my entire life I’ve been pulled over and only received a warning).  We got in very late (got stuck behind an SUV doing between 15 and 25mph on clear roads labeled at 45mph, but it was night, on national park roads w/double yellow stripe; very frustrating that the person would not pull over!) and so collapsed on the hotel bed without seeing the canyon that night.  Next day we walked over and ogled the canyon.  It’s still beyond my words to describe, even if I’ve seen it (hiked it!) before.  It’s one of those things you just have to experience in person.  GF’s never hiked it before, so we took our time…

Day One.  After checking out the trail heads and talking with the rangers we decided to buy crampons.  BEST DECISION EVER.  Hiking in Bryce on the snow taught us that there’s ice under all that pretty snow… dang slick ice.  We hiked down a full mile of ice from the South Kaibab trail, then the trail dried up and opened up into the amazing views.  Only place I know of you can stand with a 1000+ foot dropoff to one side, and yet be looking at a 1000+ cliff towering above you on the other hand…  and both be yet another 1000+ further down from the rim and yet ANOTHER 1000+ feet above the river!  It’s a real-life 3-d eye candy mind bender.  We hiked it down in 5 hrs.  Not bad for this 50yr old pile-o-bones.  Sore and triumphant we walked into the Phantom Ranch main room.  Our (segregated) bunk beds awaited… but they had had another cancellation.  For a mere $30 more, we got a spacious 10-sleeper cabin to ourselves!  We rushed over to the cabin, and sat down on a bed to “rest a bit”.  Two hours later we awoke from a deep nap and realized that dinner was fast approaching…  Dinner was delicious (as only a hearty steak dinner can be after a day of hiking).  After dinner we took a star-light stroll in the pleasantly cool evening, lay down beside the rushing creek and star-gazed awhile.

Day Two.  Day One, of course, is the day you act like an idiot and climb down this great big hole.  It’s on Day Two that you have to fix the situation by climbing out.  The first hour of waking we spent discovering all those muscles (and joints!) that got overworked yesterday… and being ravenously hungry despite the giant meal of the day before.  So after a big breakfast we started back up, the Bright Angel trail this time.  Other than the amazingly scary river crossing on the swinging suspension bridge (the Colorado is wide and fast there, and the bridge very clearly was swinging more than 5ft side to side), the upward hike was … upward.  Hard work undoing all those steps yesterday, although the uphill is MUCH easier on the joints.  The crampons came out again for the three miles of snow and ice near the rim… and we made it up in 7.5 hours.

Next day we drove back home, 750 miles in one day (again).  Next day we cleaned up, picked up, packed up… and the kids arrived here for Christmas!  We had a wonderful family Christmas; GF made a fantastic dinner and there was much regaling of tales and playing of video games for all…

Cliff

What the heck is OSR and why is it Bad (or Good)?

From the comments:

**A commenter:** What puzzles me is why this isn't already done.  It's essentially equivalent to just pulling the inner block out into a separate tail-recursive method.  If the compiler has decent support for inlining JIT methods and associated opportunity for intraprocedural optimizations, this approach comes basically for free.  Yet the OSR compilation you describe seems to generate substantially less efficient code.  Sure, entering/exiting the interpreter is expensive, but either it's quite rare as only the inner loop is hot, or it gets eliminated when the outer method gets JITed.

I guess the trouble is that Java bytecode makes the business of converting a block to a procedure very inconvenient.  In SSA form it's trivial, but if the inner loop is accessing parts of the stack, parameters and instance variables then the method call code would be unnecessarily ugly.

**nksingh** (2011-11-25): Hi Cliff,
From my understanding of HotSpot, it seems that you have sufficient information to map the execution context from a normally JIT'ed function to an interpreted version of the same code for deoptimization purposes.  What makes it difficult to apply the mapping the other way, to transition from the interpreted code directly to the 'normal' JIT compilation at a safepoint?

**cliff** (2011-11-25): For "simple" loops with fixed known termination conditions (e.g. most scientific loops; String.hashCode, etc), there's NO mapping kept in the loop body.  Removing the maintenance of the mapping speeds up these kinds of loops a lot, typically 2x or so (for small loops) or maybe 20-30% (for larger loops).  Since there's no mapping, you can't transit into the middle of the loop.  Also, the compiler will track other values than what the interpreter directly needs or produces; jumping into the middle of such code means computing those values from the interpreter's state.  Such computations get arbitrarily complex.  Example from String.hashCode: "hash = hash*31 + ary[i];"  The interpreter has tracked 'hash' and 'ary' and 'i', but the JIT will often track some 'p = A+12+i*2' and do a 'p += 2' in the loop… except the loop is unrolled, and any entry point will come around only every 4th or 8th value for 'i'.  So we'd have to JIT the code; then tell the interpreter to keep executing until 'i mod 8 == 0', then compute 'p=A+12+i*2' and place it in some such register (and 'this', hash, ary and 'i' probably in other registers) before we could start up the optimized loop.  In short, the mapping in this direction is complicated and isn't something I'd like to record as a 'mapping' per se.  Instead I'd just have the optimizer JIT code for the mapping… which is how the problem is solved now.  Except you better have only 1 such loop-entry or else your loop is now 'irreducible' and all the classic loop optimizations no longer apply.

Cliff

What the heck is OSR?  It's HotSpot jargon for On-Stack-Replacement – and it's used to convert a running function's interpreter frame into a JIT'd frame – in the middle of that method.  It crops up in micro-benchmarks all the friggin' time, but only rarely in other apps (or at least only rarely in a performance critical way).  What happens is this:

  1. The JVM starts executing some method for the first time ever, in the interpreter, e.g. main().
  2. That method has a long running loop, that is now being executed in the interpreter
  3. The interpreter figures out that the method is hot and triggers a normal compilation
  4. That compilation will be used the NEXT time this method is called, but e.g. it's main() and there is no next time
  5. Eventually the interpreter triggers an OSR compilation.  The OSR compilation is specialized to the bytecode at which it will be entered, which is typically the loop back-edge branch.
  6. Eventually the OSR compilation completes.
  7. At this point the interpreter jumps to the OSR code when it crosses the specialized entry bytecode – in the middle of the method.

 

Martin Thompson is asking me questions about writing micro-benchmarks.   I'm replaying some of that conversation here (with permission!).

On Fri, Nov 18, 2011 at 11:20 AM, Cliff Click wrote:

Do more & shorter warmup runs.  100K iterations is fine, but loop around the 'runTest' 10 times during warmup.  Otherwise you risk ending up testing an 'OSR' compile instead of a normal one.

On Mon, Nov 21, 2011 at 12:29 PM, Martin Thompson wrote:

Standard Jit'ing and On Stack Replacement are not equal.  I have been operating under the assumption that they were.  Just to ensure my understanding is correct: the JVM implements a counter for each method that gets incremented when the method returns and also on branch-back within a loop.  If this counter exceeds 10K on the server VM then it will trigger a compilation of the method.  If the method is called again the compiled code is used.  If the method was continuing to loop more than the 10K increments then at approximately 14K increments the method will be swapped via OSR.  I do not understand why OSR can be a less efficient result but will take your word for it.  So I should re-organise the code to avoid OSR and do a number of shorter warm-up runs.  This is very interesting feedback for main loop design.  Can you point me at more detail on how the main loops should best be structured?

OK, Martin, here goes…

A number of shorter warm-up loops will do better, or calling the warm-up loop itself in a loop.  Do not nest the warm-up loops in the same function, that defeats the purpose.  It probably helps to understand what's going on in an OSR compile.

In an OSR the code has entered some Java method IN THE INTERPRETER.  It's now spinning in some hot loop IN THE INTERPRETER.  The interpreter is busy gathering backedge counts (which will soon hit 14000) and function-entry counts (will only ever be 1).

When the OSR compilation starts, we start parsing bytecodes IN THE MIDDLE OF THE HOT LOOP.  For singly-nested loops the effect is to partially peel one loop iteration, then enter the loop proper.  To make it clear what the problem(s) are, my examples are written closer to the bytecodes actually experienced by the JVM, as opposed to clean Java.  Single-nested-loop example:

  public static void main(String args[]) {
    S1; i=0;
    loop:
    if(P) goto done
      S3; A[i++];
    goto loop; // <<--- OSR HERE
    done:
    S2;
  }

Suppose the interpreter has already triggered a normal compile (after 10,000 back-edge branches) and decides that, since this loop has run 14,000 back-edges already and we might never exit this function (and thus re-enter it via the normal compilation), we now need an OSR compilation.  This OSR compilation will only be validly entered at the backwards goto bytecode (and also it needs to do some horrible stack-mangling to remove the interpreter's frame and insert its own before entering the main function proper).  What program semantics does the OSR compilation have to deal with?  The OSR compile “sees” a function that looks like we started executing bytecodes at that backwards goto; this function might look something like this:

  void OSR_main() {
    A=value on entry from interpreter;
    i=value on entry from interpreter;
    goto loop;
    loop:
    if(P) goto done
      S3; A[i++];
    goto loop;
    done:
    ...never reached...
  }

This typically optimizes fairly well.  Variable 'i' is a well-formed array loop index (whose starting value is unknown, but it is a stride-1 increasing value capped at the smaller of 'A.length' and '!P').   However, for a doubly nested loop things get ugly.

  public static void main(String args[]) {
    S0;
    loop1:
    if(P1) goto done1
      S1; i=0;
      loop2:
      if(P2) goto done2
        S3; A[i++];
      goto loop2; // <<--- OSR HERE
      done2:
      S2;
    goto loop1;
    done1:
    S4;
  }

Here we (again) decide to OSR at the backedge – although there are two backedges to choose from.  In practice, the inner one is typically much more frequent than the outer one, so that's the typical OSR choice.  Not that the interpreter can tell; it only tracks backward branches and has no clue about any possible looping structure.  In any case, our example OSR compilation starts parsing bytecodes at the 'goto loop2' … leading to this structure:

  void OSR_main() {
    A=...value on entry from interpreter...
    i=...value on entry from interpreter...
    goto loop2;
    loop2:
    if(P2) goto done2
      S3; A[i++];
    goto loop2
      done2:
      S2;
      goto loop1;
      loop1:
      if(P1) goto done1
      S1; i=0;
    goto loop2
    done1:
    ...never reached...
  }

Blah!  What happened?  We lost the nested-loop structure!  After the optimizer cleans up the control-flow a little we end up with something that looks kinda like this:

  void OSR_main() {
    A=...value on entry from interpreter...
    i=...value on entry from interpreter...
    loop2:
      if(P2) {
        S2;
        if(P1) goto done1
        S1; i=0;
      } else {
        S3; A[i++];
      }
    goto loop2
    done1:
    ...never reached...
  }

Predicate 'P2' probably reports a very low but non-zero frequency (since it's really the original inner-loop exit test).  Since P2 is non-zero, variable i gets reset during the loop.  Now variable i is no longer a proper array loop index.  There's no simple affine function describing i, and we no longer can range-check-eliminate A[i++] from the original inner loop.  Performance of the OSR version can be as bad as 1/2 that of a normal compilation (although it's not usually that bad… and 1/2 normal-speed is still 5x interpreted speed!).

Recovering nicely nested loops from here is tough.  And what happens above is the GOOD case.  There's another bytecode parsing pattern/technique HS used to use here, and that one leads to having IRREDUCIBLE loops… which pretty much blows all chance of any loop optimizations in any case.

 

I hope this little chat has given you some idea of what goes on in an OSR compile… and when OSR's kick in.  If you want to test some little function or other in a tight loop… then the Right Thing To Do is to wrap it in a loop in a function in a loop in a function.  The outermost loop might OSR, but the inner-most loop(s) will be normal function entries so the inner-loop code-gen will Do The Right Thing.
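
Here is a hedged sketch of that structure; the method names (runTestOnce, work) are invented.  The hot loop lives in its own method, so after a few warm-up calls the timed run enters a normally-compiled method instead of relying on an OSR compile of main()'s loop:

    // Sketch of the recommended structure; names (runTestOnce, work) are
    // invented.  The hot loop is in its own method, so after the warm-up
    // calls the timed run enters a normally-compiled method rather than
    // an OSR'd one.
    public final class BenchHarness {
      static long work( int i ) { return i * 31L; }   // stand-in for the code under test

      static long runTestOnce( int iters ) {          // normal method entry = normal compile
        long sum = 0;
        for( int i = 0; i < iters; i++ )
          sum += work(i);
        return sum;
      }

      public static void main( String[] args ) {
        long keep = 0;
        for( int warm = 0; warm < 10; warm++ )        // several short warm-up runs
          keep += runTestOnce(100_000);
        long t0 = System.nanoTime();
        keep += runTestOnce(100_000);                 // the timed run
        long t1 = System.nanoTime();
        System.out.println("ns=" + (t1 - t0) + "  (ignore: " + keep + ")");
      }
    }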

Cliff

 

A short conversation on Biased Locking

A short conversation on Biased Locking, between myself, Gil Tene (also of Azul Systems) and Martin Thompson.

On 11/13/2011 12:57 PM, Martin Thompson wrote:

I’m looking at a problem where some third party code has more locks than necessary.  I thought I could get around the issue if I made sure only one thread ever enters the lock, and thus with Biased Locking it would be very cheap.  I noticed an interesting problem whereby with HotSpot, Biased Locking is slightly more expensive than normal locking when uncontended, which really threw me.  Does this make any sense to you?  I plan to blog about this.

On Mon, Nov 14, 2011 at 3:11 AM, Cliff Click <cliffc@acm.org> wrote:


I’d have to do a deep drill into what Sun is doing, and what the hardware is doing, and what you’re doing to nit-pick this apart.  Our BiasedLocking is different, so results on Azul will vary.  CAS costs vary from platform to platform, so the relative gain (or loss) of BiasedLocking varies by platform.   Older X86’s or Sparc’s – CAS is expensive; BiasedLocking is a Win.  New X86’s or Azul – CAS is cheap; BiasedLocking is closer to a wash.

On 11/13/2011 11:43 PM, Martin Thompson wrote:

For me the important point is the single threaded case.  This is uncontended.  If I was to write a biased locking algorithm I could do this without using a CAS for subsequent calls.  Only another thread needs to CAS when beginning the revoke cycle if my understanding is correct?  The point of biased locking is to avoid the atomic instructions, and therefore their cost.  This benchmark is suggesting that one still bears the CAS cost even uncontended which seems like a bug to me based on the evidence.

On Mon, Nov 14, 2011 at 8:46 AM, Cliff Click <cliffc@acm.org> wrote:

Ahh… but the trick for Nehalem & Azul Vega chips is that uncontended CAS ops are cheap; 3 clks on Azul, 10-12 on Nehalem.  Whereas biased locking can require bookkeeping, and the bookkeeping might exceed the cost of a CAS.  All depends on how Sun implements it.  We went a different route, so without some serious spelunking I can’t tell you what Sun is doing.

You can run Sun’s JVM in a debugger & in a hot loop – and hit ^C.  You’ll almost surely be spinning in your hot-loop, and you can look at what Sun is doing from the debugger.  Or at least that’s what I’d end up doing if you manage to convince me I need to know what Sun is up to.

If I get the chance tomorrow, I’ll see how we fare vs the same rev of the Sun JVM.  We don’t have an exact hardware match, so maybe we’ll get some more data points.

On Nov 14, 2011, at 1:23 AM, Martin Thompson wrote:

I’ll have a play with this later.  My tests have been on Penryn, Nehalem, Westmere and Sandybridge (tock-tick-tock-tick), so I cover everything that matters on x86 🙂

I do not understand the 3 clocks quote for CAS.  In my experience a CAS, aka lock cmpxchg, is ~100 cycles plus the cost of draining the store buffers.  BTW I was talking to Intel last week and they have released a new version of VTune with Java support again.

I’m sure you have done a much better job on the implementation.  If this is the case it could be a good selling point in the concurrent world.  The more I dig the less I am impressed with Sun/Oracle implementation as you get seriously concurrent or parallel.  When I write control experiments in C/C++ for work I’m doing I keep being surprised by the results by comparison.

Martin…

p.s. I take it you are on European time given the responses?

On Mon, Nov 14, 2011 at 4:16 PM, Cliff Click <cliffc@acm.org> wrote:

Sorry, should have explained more – 3 clks for Azul Vega hardware…. for a cache-hitting CAS.  Our CAS does not order memory or fence, just provides atomicity guarantees.  Our Vega FENCE op can “hit in cache” – if there are no outstanding cache misses, then again it’s a 1-clk op.  So we can do a cache-hitting uncontended lock acquire in something like 5 or 6 clks – with full CAS semantics, no BiasedLocking.  So we ripped out Sun’s BiasedLocking implementation (it had other weaknesses and design choices for which we have better underlying software already, e.g. we can stop individual threads fairly cheaply).  When we moved to X86 we re-inspected the situation and decided we needed a BiasedLocking impl of some kind, we just rolled our own.  Then it turned out that the recent crop of X86’s have a cheap “cache hitting” CAS – if it hits in L2 I think it’s like a dozen clocks – not a 100.  This might Yet Again spell the Doom of BiasedLocking… a software solution to a hardware problem (expensive CAS) to a software problem (frequent uncontended locks).

(re European time: ) Nope – just complicated life.  Some days I got the kids and I’m busy – and this kind of email, while fun, it’s not really time critical to answer.

On Nov 14, 2011, at 8:42 AM, Martin Thompson wrote:

Thanks Cliff,

Seems you and Gil had a race condition 🙂

On 11/14/2011 8:09 AM, Gil Tene wrote:

On CAS costs:

While there is a common misconception that CAS (or ordering) requires store buffers to be drained, and inherently involves the latency of socket-to-socket cache-line-invalidate operations, that's just not the case.  A processor with nothing outstanding in the load/store pipelines has practically no cost for ordering, and ordering its own operations “globally” is a processor-local concern: it only has to get it right with respect to its own outstanding load and store operations.  This doesn't require any flushing – just correct interpretation of state and correct application of state changes.  It must order the operation correctly with respect to other ops before making them globally visible – something an out-of-order execution core like modern x86 is extremely good at.

The 100+ cycle CAS cost was historically due to the fact that processors implemented it with global (processor-to-processor) communication steps, which inherently involve memory-latency-level (and/or socket-to-socket) operations.  However, there is nothing in a CAS (or in ordering operations, for that matter) that inherently makes it a super-expensive (100+ cycle) operation. It was only that way because processors did not bother to optimize the implementation for multi-processor, multi-core situations. The only 100+ cycle cost that is required is one where there is contention for the cache line in question, and in that sense a CAS should be not much different from a load/store to a contended cache line.

With Vega, we went the extreme route (having 800+ cores makes you assume the common program will actually need frequent uncontended CASs).  A Vega CAS is simply an atomic read-modify-write operation with no global ordering requirements.  If you want ordering you use a separate fence instruction to force the ordering you want (e.g. “fence stst | ldst”).  Ordering would be practically free (1 cycle) if there are no outstanding loads or stores in the pipe, and would have the same cost regardless of whether there is atomic or non-atomic load/store traffic sitting in the pipe. The way we did the atomic read-modify-write was nearly trivial: the cache line is brought into L1 in exclusive mode, and then the processor's L1 cache's communication to the outside world is frozen for the duration of the operation.  This makes the operation inherently atomic, and has no effect on any other processors unless they need to invalidate one of the L1 cache lines in our processor's cache during our atomic op (and if they do, the delay they see is trivial compared to the communications delay of getting to us and back).
With x86, it gets more complicated because a CAS has very specific ordering semantics (as do all LOCK instructions).  However, all this means is that the processor is responsible for maintaining the appearance of that order.  Once it gains exclusive access to the cache line, it can perform similar processor-local games to make sure the operation is both atomic and appears in the right order to other processors.  It has to treat it right with respect to its own outstanding loads and stores, and with respect to its store buffer, but those are all core-local concerns, and none of them require the long latency of communicating with any other processor's L1/L2/L3 levels.


As a result, an uncontended CAS on modern (Nehalem and later) cores is much cheaper than before.  It is still sensitive to being mixed with other cache-missing loads and stores, but no more so than a regular load and store would be when combined with an ordering barrier.
On Nov 14, 2011, at 8:42 AM, Martin Thompson wrote:

Nice clear explanation 🙂

This fits with what I’m measuring.  I’m seeing lock xadd, for example, take 15 cycles on Westmere and 20 cycles on Sandybridge if uncontended and with an empty store buffer.  This can get much worse if the store buffer is significant, and then we are back in the >100 cycles space again.

I’ve had official confirmation back from Intel on my findings with lock instructions on Sandybridge.  The reasons for the issues are twofold.  Firstly, there is a ~33% latency cost increase with the uncontended instructions (they cannot explain this!).  Secondly, and more importantly, the invalidate messages are now faster.  So under contention a ping-pong effect kicks in, resulting in a thread losing the lock very quickly as a new claim is issued.  So micro-benchmarks really hurt, but real-world apps look better.

For your last statement, is it not the case that the ordering instructions do not matter, because the stores cannot be re-ordered anyway under the x86 memory model?

BTW I’ve been talking to the ######## guys and they are saying nice things about the engagement with Azul.  Seems you have a number of folk pushing you in the low-latency space 🙂

On Nov 14, 2011, at 8:51 AM, Cliff Click wrote:

store/load can be reordered – and it’s very useful to do so, so X86 is good at it.

e.g.
st [MISS_in_cache_adr],0  // miss in cache – will take 1000 clks
locked op[HIT_in_cache_adr1] // some random locked op that hits in cache – without the lock it would take ~1 clk
ld rax,[HIT_in_cache_adr2] // takes 1 clk
ld rbx,[rax]  // Would LOVE to start this early… if the above load into RAX could complete early

These would otherwise reorder, except for the lock prefix.

On Mon, Nov 14, 2011 at 5:18 PM, Gil Tene <gil@azulsystems.com> wrote:

> Furthermore, even for stores “cannot be re-ordered” doesn't mean “have to wait for each other and execute serially”.  It just means that their effects cannot be made visible to the outside world in an order other than stated.
>
> Imagine for examples the following sequence:
>
> st [miss_in_L3_cache_addr_A], 1
> st [hit_in_cache_addr_B], 2
> ld regX, [hit_in_cache_addr_C]
> test regX, mask
> branch somewhere if we don't like the test result
> st [hit_in_cache_addr_C], 3
> do some work that takes less than an L3 miss…
>
> This entire sequence can be executed in the time it takes the first instruction to complete its L3 miss. It just can't be made visible until the first miss is completed and the store is made visible “first”. So the store buffer, while continuing to be stored to, has to be gated at the first store until it completes.
>
> Obviously, if the values of subsequent stores were somehow dependent on the earlier ones, or on earlier not-yet-evaluated operations (such as a predicted branch that depends on a missing load – imagine that addr_C above was cache-missing), they may have to wait before being evaluated and committed.  But even there, speculative execution could be used to pre-evaluate and pre-store them, as long as the processor is able to yank the stores back out before they become visible if the speculation turns out to be wrong (a branch prediction followed by a store would be a common case for this sort of “I want to store but am not absolutely sure about the value yet” situation).

On Nov 15, 2011, at 12:06 AM, Martin Thompson wrote:

> This is exactly what I was getting at.  If a lock or sfence instruction was executed next in sequence, the processor would need to stall until those instructions had retired.  If it was just another straightforward store, the processor could continue until the store buffer is full or another cache miss occurs that stops branch prediction.
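
A rough, single-threaded Java sketch of the kind of uncontended, cache-hitting CAS measurement discussed in this thread – the class name and loop counts are only illustrative, and the per-op numbers will vary widely with CPU generation and JIT warm-up:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical micro-benchmark: uncontended, cache-hitting CAS cost.
    // Single thread, so every compareAndSet succeeds; this measures only the
    // best case discussed above, not contended or store-buffer-heavy cases.
    public class CasCost {
      public static void main(String[] args) {
        final int N = 100_000_000;
        AtomicLong x = new AtomicLong();
        for (int i = 0; i < N; i++) x.compareAndSet(i, i + 1);   // JIT warm-up
        x.set(0);
        long start = System.nanoTime();
        for (int i = 0; i < N; i++) x.compareAndSet(i, i + 1);   // always succeeds, uncontended
        long ns = System.nanoTime() - start;
        System.out.printf("~%.2f ns per uncontended CAS%n", (double) ns / N);
      }
    }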

Cliff

Final Fields, Part 2

I’ve been having waaaaaay too much fun this month, dealing with “final” fields – final is in quotes because I’ve been finding waaaaay too many Generic Popular Frameworks (TM) that in fact write to final fields long long after the constructor has flowed under the bridge.  Optimizing final fields is in theory possible, but in practice it’s busting a Lot of Popular Code.

From Doug Lea:

It might be worse than that. No one ever tried to reconcile JSR133 JMM JLS specs with the JVM specs. So I think that all the JVM spec says is:
http://java.sun.com/docs/books/jvms/second_edition/html/Concepts.doc.html#29882
Once a final field has been initialized, it always contains the same value.

Which is obviously false (System.in etc).

De-serialization plays nasty with final fields any time it has to re-create a serialized object with final fields.  It does so via Reflection (for a small count of objects), and eventually via generated bytecodes for popular de-serializations.  The verifier was tweaked to allow de-serialization-generated bytecodes to write to final fields… so de-serialization has been playing nasty with final fields and getting away with it.  What’s different about de-serialization vs these other Generic Popular Frameworks?  I think it’s this:

De-serialization does an initial Write to the final field – after the constructor, but before ANY Read of the field.

These other frameworks are doing a Read (and, if it is null, a Write), then further Reads.  It’s that initial Read returning NULL that’s tripping them up, because once the code is JIT’d that’s the value used for some of the later Reads.
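
As an aside, here is a minimal sketch of the Reflection route (class and field names invented for illustration) – roughly the trick de-serialization leans on, and exactly the sort of post-constructor write that voids the final-field guarantees:

    import java.lang.reflect.Field;

    class Box {
      final Object value;               // "final"... until Reflection gets involved
      Box(Object v) { value = v; }
    }

    class Poke {
      public static void main(String[] args) throws Exception {
        Box b = new Box(null);
        Field f = Box.class.getDeclaredField("value");
        f.setAccessible(true);          // bypass the access check
        f.set(b, "not so final");       // post-constructor write to a final instance field
        System.out.println(b.value);    // prints "not so final"... in the interpreter, anyway
      }
    }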

Why bother?  What’s the potential upside to using final fields?

  • Expressing user intent – but final fields can be set via Reflection, JNI calls, & generated bytecodes (besides the “normal” constructor route), hence they are not really final.  It’s more like C’s “const”, just taking a little more syntax to “cast away const” and update the thing.
  • Static final field optimizations (a la Java asserts).  These crucially rely on the JVM & JIT loading the value at JIT-time and constant-folding away the turned-off assert logic.
  • Non-static final field optimizations.  This is basically limited to Common Subexpression Elimination (CSE) of repeated load, and then the chance to CSE any following chained expressions.

I claim this last one is almost nil in normal Java code.  Why are non-static final-field optimizations almost nil?  Because not all instances of the same class hold the same value for the field, hence there is no compile-time constant and no constant-folding; the field has to be loaded at least once.  Having loaded a field once, the cost to load it a 2nd time is really low, because it surely hits in cache.  Your upside is mostly limited to removing a 1-cycle cache-hitting load.  For the non-static final field to represent a significant gain you’d need all of these properties (a sketch follows the list below):

  • Hot code. By definition, if the code is cold, there’s no gain in optimizing it.
  • Repeated loads of the field.  The only real gain for final-fields is CSE of repeated loads.
  • The first load must hit in cache.  The 2nd & later loads will surely hit in cache.   If the first load (which is unavoidable) misses in cache, then the cache miss will cost 100x the cost of the 2nd and later loads… limiting any gain in removing the 2nd load to 1% or so.
  • An intervening opaque operation between the loads, like a lock or a call.  Other operations, such as an inlined call, can be “seen through” by the compiler and normal non-final CSE will remove repeated loads without any special final semantics.
  • The call has to be really cheap, or else it dominates the gain of removing the 2nd load.
  • Cheap-but-not-inlined calls are hard to come by, requiring something like a mega-morphic v-call returning a trivial constant which will still cost maybe “only” 30 cycles… limiting the gain of removing a cache-hitting 1-cycle repeated final-field load to under 5%.
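
Put together, the shape of code that could actually benefit looks something like this hypothetical sketch (names invented for illustration): a truly-final field loaded on both sides of a cheap, not-inlined call, plus a chained load hanging off it:

    class Config {
      final int limit;                    // truly final: written once, in the constructor
      final int[] table;                  // chained loads off a final can be CSE'd too
      Config(int limit) {                 // assumes limit >= 1
        this.limit = limit;
        this.table = new int[limit];
      }

      int clamp(int x) {
        if (x > limit) x = limit;         // first load of 'limit' – unavoidable
        logMaybe(x);                      // cheap call the JIT happens NOT to inline
        return x + limit + table[0];      // with final semantics the JIT may reuse the
                                          // earlier loads of 'limit' and 'table' here;
                                          // without them it must re-load across the call
      }

      void logMaybe(int x) { /* cheap, but opaque to this compilation unit */ }
    }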

So I’ve been claiming the gains for final fields in normal Java code are limited to expressing user intent.  That we can do with something as weak as a C++ “const”.  I floated this notion past Doug Lea and got this back:

Doug Lea:

{example of repeated final loads spanning a lock}

… And I say to them: I once (2005?) measured the performance of adding these locals and, in the aggregate, it was too big of a hit to ignore, so I just always do it. (I suppose enough other things could have changed for this not to hold, but I’m not curious enough to waste hours finding out.)

My offhand guess is that the cases where it matters are those in which the 2nd null check on reload causes more branch complexity that hurts further optimizations.

Charles Nutter added:

I'll describe the case in JRuby…
In order to maintain per-thread Ruby state without constantly hitting thread locals, we pass a ThreadContext object along the stack for almost all calls.  ThreadContext has **final** references to the JRuby runtime object it is associated with, as well as commonly used literal values like “nil”, “true”, and “false”.  The JRuby runtime object itself in turn has **final** references to other common literal values, JRuby subsystems, and so on.
Now, let's assume I'm not a very good compiler writer, and as a result JRuby has a very naive compiler that's doing repeated loads of those fields on ThreadContext to support other operations, and potentially repeatedly loading the JRuby runtime in order to load its **finals** too.  Because Hotspot does not treat those repeat **final** accesses as *provably* constant (ignoring post-construction final modification), they enter into inlining budget calculations.  As you know, many of those budgets are pretty small…so essentially useless repeat accesses of **final** fields can end up killing optimizations that would fire if they weren't eating up the budget.
If we're in a situation where everything inlines no matter what, I'm sure you're right… the difference between eliding and not eliding is probably negligible, even with a couple layers of dereferencing.  But we constantly butt up against inlining budgets, so anything I can possibly do to reduce code complexity can pay big dividends.  I'd just like Hotspot in this case to be smarter about those repeat accesses and not penalize me for what could essentially be folded away.

To summarize: JRuby makes lots of final fields that really ARE final, and they span not-inlined calls (so require the final moniker to be CSE’d), AND such things are heavily chained together so there’s lots of follow-on CSE to be had.  Charles adds:

JRuby is littered with this pattern more than I’d like to admit.  It definitely has an impact, especially in larger methods that might load that field many times.  Do the right thing for me, JVM!
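
A much-simplified, hypothetical rendering of the JRuby pattern Charles describes (the real ThreadContext and runtime classes are far bigger; only the shape of the chained final fields matters here):

    final class RubyRuntime {
      final Object nil = new Object();    // commonly used literal, truly final
      // ...plus final references to true, false, subsystems, and so on
    }

    final class ThreadContext {
      final RubyRuntime runtime;          // truly final, set once in the constructor
      final Object nil;
      ThreadContext(RubyRuntime r) { runtime = r; nil = r.nil; }
    }

    class GeneratedCode {
      // Naively generated code re-loads the chained final fields over and over.
      // Without final-field CSE every ctx.runtime / ctx.nil load stays in the
      // method body and counts against the JIT's inlining budget.
      static boolean isNil(ThreadContext ctx, Object value) {
        return value == ctx.nil || value == ctx.runtime.nil;
      }
    }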

So JRuby at least precisely hits the case where final field optimizations can pay off nicely, and Doug Lea locks are right behind him.

Yuch.  Now I really AM stuck with doing something tricky.  If I turn off final-field optimizations to save the Generic Popular Frameworks, I burn JRuby & probably other non-Java languages that emit non-traditional (but legal) bytecodes.  If I don’t turn them off, these frameworks take weird NULL exceptions under load (as the JIT kicks in).  So I need to implement some middle ground… optimizing final fields for people who “play by the rules”, but Doing The Expected Thing for those who don’t.

Cliff

Writing to Final Fields After Construction

[ed. Updated Oct 19, see below, heck it’s almost like THREE posts in the same month…]

Surprise!  TWO blog posts in the same month!!!

Surprise #2!  A popular open-source framework writes to final fields after object construction via generated bytecodes!  Final fields are supposed to be… well, final. I was quite shocked to realize the default Verification settings let such things go by (to support some common use-case in Reflection; de-serialization I think is the use-case).  Final fields are allowed to be written to exactly once in the constructor.  If you write to them either outside the constructor, or more than once in the constructor, or publish the “this” pointer of a partially constructed object with a final field, or write to the final via Reflection then…. All Bets Are Off.  See, for example, section 9.1.1. of this paper by Bill Pugh.  To quote:

“Another problem is that the semantics is designed to allow aggressive optimization of final fields.  Within a thread, it is permissible to reorder reads of a final field with calls to methods that may change final fields via reflection.”

In particular, the following code “behaves differently” (but legally) after the JIT kicks in:

  final Object _crunk;
  void someFunction() {
    if( _crunk == null )   // a First Read of _crunk
      init_crunk();        // Does this write to a final field?
    ..._crunk...           // Perhaps a Second Read of _crunk, or perhaps not
  }

In all cases there is a first-read of _crunk.  Before the JIT kicks in, there is a 2nd read of _crunk after the initialization call… because the interpreter executes each ‘getfield’ bytecode in isolation.  The JIT, however, notices that there are 2 reads of _crunk outside the constructor, and therefore it is profitable to do common-subexpression-elimination of the loads of _crunk – and legal to do so across the not-inlined-call “init_crunk()” because _crunk is a final field.  The setting of _crunk inside the “init_crunk()” call is ignored until the CPU leaves the JIT’d version of someFunction().

This optimization of the final field is the main optimization performed on final fields.  It is a crucially important optimization for Java asserts – with Java asserts, all the asserts are guarded by a test of a static final field.  Since that field is set once and known statically, the JITs can optimize on its current value.  In particular, the guarding field is typically set to false (asserts are turned off), and the JITs read the value early (at JIT time) and constant-fold the guard test away.  I.e., there’s no performance penalty for having turned-off asserts… because the final field is read very early, and then optimized based on its value.
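
For the curious, here is a hand-written approximation of what javac emits for a plain assert statement – the synthetic guard field really is a static final boolean, which is what lets the JIT fold the whole thing away when asserts are off:

    class AssertDemo {
      // javac's synthetic guard, initialized once at class-initialization time.
      static final boolean $assertionsDisabled =
          !AssertDemo.class.desiredAssertionStatus();

      void check(int x) {
        // Source was:  assert x > 0 : "x must be positive";
        if (!$assertionsDisabled) {       // static final – the JIT constant-folds this test
          if (!(x > 0))
            throw new AssertionError("x must be positive");
        }
      }
    }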

In the case of our open-source framework, the framework is lazily de-serializing an object with final fields.  They test for a null value in a final field, which is an indication of a lazily-not-yet-de-serialized object.  If the final field is null, they call out to a complex de-serialization routine (which is sometimes not inlined into the local compilation unit by the JIT) which sets the final field.  After the test & de-serialization call, they then begin using the final field assuming it is not null.  Those later uses are optimized by the JIT to use the original value loaded from the final field… i.e. null.  The observation of the initializing write is delayed until after the CPU exits the compilation unit.  The NEXT call to the same routine then loads the final field again and now observes the not-null value.

So What Do We Do?

So what do we do?  Do we ship a JVM which does a legal (and typically very useful) optimization on final fields… that happens to kill this one open source framework occasionally (after sufficient JITing under heavy load, and the right random pot-luck of inlining to trigger the behavior)?  (btw, I have been in communication with the people in the project, and they are now aware of the issue, and are working on it)    Do I make a magic flag “-XX:-OptimizeFinalFields” that you need to use with this framework  (it will be needed at least to run our JVM on existing jar files)?  How long do we need this flag?  I hate user-visible flags… they stick around forever, in ancient scripts of antiquity never to be changed again…

Right now I am leaning towards a solution which simply observes whether a field is ever set outside a constructor, and disables final-field optimizations on a field-by-field basis if it ever sees such an update.  No flag needed, and it will “Do The Right Thing” in this case, although the JIT’d code will be somewhat pessimistic (no final-field optimizations will happen for any instance of that field once a single out-of-bounds update is observed).  How clever do I need to get, to keep the performance there for Everybody Else, while allowing old jar files for this framework to keep on doing what they are doing?

Cliff

Update Oct 19-

Doug Lea writes:

I just saw your blog post …

Every time I see things like this I feel sad/guilty that because we don’t give people a good way to do what they want (mainly, lazily-initialize or deserialize finals), they do [unfortunate] things instead.  If you have any good thoughts on offering better ways to do this, please let me know.  In the absence of any, maybe I ought to try another push for Fences API.

I was struck while reading this that the main problem in your intro example is that Java makes it too easy for programmers not to notice that fields and local reads of those fields are different  things.  In my concurrency course, I’ve tried to force students to notice by telling them that for one assignment, they are required to always explicitly declare/use locals. I wish I could say that this experience generalizes, but I still see too many cases of, for some volatile field:

if (field != null) field.f();

and similar brokennesses. When a language makes it hard to think clearly about something, it’s not surprising that people don’t.
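
The local-variable discipline Doug teaches makes the double read explicit and then removes it – a minimal sketch, using a hypothetical volatile Runnable field:

    class Holder {
      volatile Runnable task;             // hypothetical volatile field

      void broken() {
        if (task != null) task.run();     // TWO volatile reads: task may go null in between
      }

      void fixed() {
        Runnable t = task;                // ONE explicit volatile read into a local
        if (t != null) t.run();           // no second racy read
      }
    }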

Cliff writes:

So disallow mentioning a volatile variable “bare”.  It needs to be decorated, to make it clearer that you are performing an action on access.  Require that for all volatile fields you are limited to:

    "...field.**vol_read**() ..." and "field.**vol_set**(x)"?

Too bulky syntax… how about:

    "...field.**get_acq**() ..." and "field.**set_rel**(x)"?

No better.  So we switch to styles, how about Itanic load-acquire and store-release style:

    "... x.**ldacq** ..."  and "x.**strel** = ..."

Or using language more suitable to Java programmers:

    "...x.**get** ..." and "x.**put** = ..."

So your example below becomes:

    "if( field.**get** != null ) field.**get**.f();"

Obviously 2 “get” actions in a row.
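
For what it’s worth, the library route already gives the decorated-access style, at the cost of an extra object and uglier field syntax – a small illustration with java.util.concurrent’s AtomicReference, not a claim that it solves the language problem:

    import java.util.concurrent.atomic.AtomicReference;

    class Holder2 {
      final AtomicReference<Runnable> field = new AtomicReference<Runnable>();

      void call() {
        Runnable r = field.get();         // every read is an explicit, visible action
        if (r != null) r.run();           // ...so two get()s in a row would be obvious
      }
    }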

But proposing new syntax for Java is water-under-the-bridge.  I’m with you on the “sigh”.  This can only be fixed in a new language.

The Greatest Trip Report

Divorce.

So ends the Greatest Trip of my life.  22 years ago I fell in love and married the woman of my dreams.  I was happy and content in love, full of energy and hope for the future.  I had a needed talent; I sallied forth to Save The World or at least make it a better place – while building a safe, secure and rich life for us.  We had a child, then another, and still more until we had 4 – each child just as precious as the last.  They are great kids.  I found my talents for programming, for complex computer language implementation skills,  for singing and for public speaking.  Life was rich, busy and fulfilling – to me.

To the other person in my life, however, things were different.  Twenty years later I discovered she felt that her life was stifling and constraining – her dreams too long had gone unfulfilled, nay, unnoticed.  Her pain, held inside too long, turned to anger and resentment.  Too late we tried marriage counseling, honeymoon-style vacations, long heart-to-heart talks – but too much water had flowed under that bridge.  About two years ago we gave up the hope of reconciliation and started the process of dividing two lives that had been lived together for 20 years.

That process is finally ending.  We have a final legal resolution now and all that remains is the raising of our 4 kids independently.  The situation for the last two years has taken most of my time and all of my emotional energy.  It has been a place of personal growth and introspection, of deep thinking, of tears and sadness.  It has also been (eventually, after a long time of sadness) a journey of joy and discovery; of doing things I have long denied myself (just returned from my 2nd year at Burning Man!!!); and of new celebrations of life.

I’m enjoying my half-time with the kids and becoming more of a Dad and more a part of their lives.  I’m getting my energies back and am feeling more ready to slay more dragons than I have in years.

Look Out World, Here I Come!
Cliff

No JavaOne This Year…

UPDATE Oct 4 – I had a blast talking with the Disruptor dude, Martin Thompson, so I’m back in SF on Oct 5.  I’ll be hanging out in the Starbucks at 262 O’Farrell from 11 on.  Stop by if you want to chat.  (Blog material that came from the barroom discussion on Oct 2: spinning on Thread.yield probably is in fact “fixed” with some kind of exponential back-off strategy; tight spin loops need an X86 “PAUSE” instruction; locked-add and locked-CAS on Intel’s new SandyBridge do NOT scale – I should use locked-xadd for a fast X86 fence op; the state of CPU affinity on Linux…).  Stop by Starbucks if you want to partake of more of these conversations!

No JavaOne this year, at least for me. I am boycotting it because of the recent shabby treatment of both the conference AND Java.  Please let me know how you feel the conference is going, because right now I am strongly thinking of putting my weight behind some other conferences and not doing JavaOne ever again.  Got any suggestions?  Must be a Bay Area conference – how about a JavaTwo?  JavaPlusPlus, anybody?    (specifically excluding the totally fabulous JVM Language Summit which is small-by-design.  If Brian could figure out how to scale that sucker up, but keep the great interactions & talks going…..)

Despite no JavaOne, I am still in the Bay Area and would enjoy visiting with a lot of people I normally only see when they visit JavaOne… so I will be in the Hotel W bar starting around 4pm TODAY (Oct 2, 2011), until dinner.  Drop by & share a drink and a war story!  (Too short a notice?  I am thinking of doing another geek’s drink/dinner on Weds… where’s Ron when you need him?  In any case, email or blog-comment and if the head count exceeds some modest threshold I’ll throw one together.)  Here’s a war-story to get started: contrary to rumors, Azul Systems is doing quite well… and our next-gen X86 JVM is happily running 100Gig+ heaps with max-pause times in the handful-of-milliseconds range.  Not that I would be guilty of gloating over some Big 6-Letter Company’s difficulties in delivering a large-heap low-pause GC or anything… OK, so I am.  Come buy me a drink and I’ll tell you another story, or better yet you can tell me yours!

Cliff

PS: To David Dice: I think you’re being too modest with that little toy you sent me slides on… and in any case it makes for great blog material even IF it’s half-baked.  That’s the whole point of blogs! I think you should open up on it.  The Disruptor guys are not the first people to go there, nor will they be the last…   🙂

Me: I got my own take on where the Disruptor guys should go next for NUMA-tolerant performance hacking.