Why WeakHashMap Sucks

Came across this surprise while working on a customer app.

Symptom: with Concurrent GC, the app performance slowly degrades over time, with more and more time spent in WeakHashMap lookups – until the load is removed (typically because all transactions start timing out).  Then suddenly the app becomes performant again, the WeakHashMap lookups are fast, etc … until slowly, the app performance starts to degrade again.  With the standard Stop-The-World fast parallel collector, the app performance is always fast (except during GC pauses, which is why the customer wants to use a Concurrent GC).

Root cause #1: The customer’s HashMap hashcode did NOT include value hashes, but its equals call did!
Root cause #2: A failed equals call in a WeakHashMap takes time, and will typically cause a Concurrent GC to believe the Key should be kept alive for another cycle.

Let me explain some more:

The App: ships objects over the wire between 2 JVMs, via RMI.  Each RMI structure carries a Java object essentially encoded as a HashMap with Keys for field names and Values for field values.  I.e., since String has 4 fields, an object like the String “abc” would be encoded as a HashMap with 4 key/value pairs: (“value”, char[3]={‘a’,’b’,’c’}), (“offset”, int=0), (“count”, int=3), and (“hash”, int=0xab123ab).  The same type of objects are getting shipped around, so all these HashMaps have the same 4 Keys… and the same hashcode (’cause the customer’s HashMap hashcode ignores values).  The equals call DID compare values, so 2 maps with the same keys and different values will have the same hashcode but will not be equal.
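To make the bug concrete, here’s a minimal sketch of the shape of the problem – the class and field names are my invention, not the customer’s actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the customer's wrapper around the
// field-name -> field-value encoding of one shipped object.
final class EncodedObject {
    final Map<String, Object> fields = new HashMap<>();

    @Override
    public int hashCode() {
        // BUG: hashes only the key set.  Every object of the same type
        // ships with the same field names, hence the same hashcode.
        return fields.keySet().hashCode();
    }

    @Override
    public boolean equals(Object o) {
        // ...but equality compares keys AND values, so equal hashcodes
        // do not imply equal objects.
        return o instanceof EncodedObject
            && fields.equals(((EncodedObject) o).fields);
    }
}
```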

These HashMap structures (suitably wrapped in a few more layers) are then stuffed into a WeakHashMap.  Alas, since all of ’em have the same hashcode, the WeakHashMap degenerates into a linked list.  This is a major bummer.  Lookups are forced to scan the linked list, doing a full HashMap compare at each step.
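Continuing the hypothetical sketch from above, here’s what that degeneration looks like – every put collides, and a missed lookup pays one full-map equals per entry:

```java
import java.util.WeakHashMap;

public class DegenerateCacheDemo {
    public static void main(String[] args) {
        WeakHashMap<EncodedObject, String> cache = new WeakHashMap<>();
        EncodedObject[] keep = new EncodedObject[100_000]; // hold keys alive
        for (int i = 0; i < keep.length; i++) {
            EncodedObject key = new EncodedObject();
            key.fields.put("value",  "obj-" + i); // same 4 field names...
            key.fields.put("offset", 0);
            key.fields.put("count",  i);
            key.fields.put("hash",   0);
            keep[i] = key;
            cache.put(key, "payload-" + i);       // ...so every put collides
        }
        // All entries share one bucket chain; this miss walks the whole
        // chain, doing a full-map equals() against every entry it passes.
        EncodedObject probe = new EncodedObject();
        probe.fields.put("value",  "not-there");
        probe.fields.put("offset", 0);
        probe.fields.put("count",  -1);
        probe.fields.put("hash",   0);
        System.out.println(cache.get(probe));     // null, eventually
    }
}
```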

Now comes the ConcurrentGC trick: CGC takes time.  During this time the application is probing the table.  As the app sweeps down the linked list it forces every Key it touches to be alive during this CGC cycle.  Assuming a single lookup can finish inside of a CGC cycle, all Keys end up being forced alive every time.  Nothing ever gets removed.  CGC starts running slower cycles because the live set is growing each time a new HashMap is inserted into the WeakHashMap.
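Schematically, the probe loop looks something like this (a paraphrase of a WeakHashMap bucket walk, not the real JDK code):

```java
import java.lang.ref.WeakReference;
import java.util.List;

final class ChainWalk {
    // Each step loads the weak referent into a local variable.  While
    // that local is live the key is strongly reachable, so a concurrent
    // marker running during the walk must treat it as live for the rest
    // of the cycle -- hit or miss, every key touched stays alive.
    static Object probeChain(List<WeakReference<Object>> chain, Object probe) {
        for (WeakReference<Object> ref : chain) {
            Object key = ref.get();               // key strengthened here
            if (key != null && key.equals(probe))
                return key;                       // hit
            // miss: keep walking, but this key was still touched
        }
        return null;
    }
}
```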

If we stop the app even for a moment, a CGC cycle completes with no probes keeping Keys alive, and all the dead keys get stripped out at once.  Immediately the table is mostly empty and fast again (and our live set has hugely shrunk, making CGC cycles fast again) – and we can start probing it with little problem, since the linked list is short and the odds of CGC catching us in the middle of the linked-list walk are low.  But slowly over time we fail to remove some dead keys, the list grows, and the odds of getting caught in the middle of a walk grow until it’s a surety.

………………

Suppose we look at a Stop-The-World collector.  When a GC cycle halts things, likely we’re in the middle of 1 key probe – but no more.  All the other dead keys get collected and our live set does not grow.  App performance remains good (except for the obvious GC pauses).

………………

Suppose we look at a generational collector; young-gen is STW but old-gen is concurrent.  The same problem arises if we’ve got any keys in the old-gen.  The time to sweep the old-gen is so long that we surely touch every old-gen key in the table, keeping all of them alive.  As the young-gen periodically promotes a few keys, they “get stuck” in the old-gen and can never be collected until we stop probing during a full old-gen cycle.  Old-gen cycles get slower and slower as the live set grows, unless we stop the app even for a moment.

………………

Suppose we keep going with our ConcurrentGC (and crap hashcodes) and the linked list grows without bound.  Eventually the time to walk the linked list (and call equals()) could exceed a CGC cycle (I say “could” because as the list grows CGC has more work to do, so it depends on the relative cost of walking what’s being kept alive vs the cost of an equals() call).  In this case, those portions of the linked list not walked inside the CGC cycle will be found dead and removed – i.e., there’s an upper bound to the list length, but it depends on the ratio of CGC-walk-time vs time-to-equals for objects on the list.  Paradoxically, a slower equals call will keep less stuff alive!
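To put made-up numbers on it (purely for illustration): if an equals() call costs about 1 microsecond and a CGC cycle takes about 1 second, a probe can touch roughly a million entries per cycle – so the list would stabilize somewhere around a million entries, the point where the tail starts going unvisited (and hence collectable) within a cycle.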
………………

Suppose we correct the hashcode problem; the WeakHashMap turns into a real hashtable instead of a linked list.  Suppose also I work for a company with a ConcurrentGC which can cycle in a matter of seconds for almost any size heap… and thus the Weak keys in the WeakHashMap get stripped out every few seconds.  Then the WeakHashMap fails its intended purpose of being a cache (with lazy old-key cleanup semantics).  Really, the customer wants a cache with an explicit timeout on the entries.
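A minimal sketch of that kind of cache – the name, the TTL policy, and the lazy sweep-on-write are my assumptions about what the customer needs, not their actual code:

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical TTL cache: eviction is driven by the clock, not by
// whatever the garbage collector happens to be doing.
final class TimeoutCache<K, V> {
    private static final class Stamped<V> {
        final V value;
        final long insertedAtNanos;
        Stamped(V value, long t) { this.value = value; this.insertedAtNanos = t; }
    }

    private final ConcurrentHashMap<K, Stamped<V>> map = new ConcurrentHashMap<>();
    private final long ttlNanos;

    TimeoutCache(long ttlNanos) { this.ttlNanos = ttlNanos; }

    V get(K key) {
        Stamped<V> e = map.get(key);
        if (e == null) return null;
        if (System.nanoTime() - e.insertedAtNanos > ttlNanos) {
            map.remove(key, e);     // expired: drop it, report a miss
            return null;
        }
        return e.value;
    }

    void put(K key, V value) {
        map.put(key, new Stamped<>(value, System.nanoTime()));
        expire();                   // amortize cleanup over writers
    }

    // Lazy old-entry cleanup in the spirit of WeakHashMap -- but driven
    // by a timeout the application controls, not by the collector.
    private void expire() {
        long now = System.nanoTime();
        for (Iterator<Stamped<V>> it = map.values().iterator(); it.hasNext(); )
            if (now - it.next().insertedAtNanos > ttlNanos)
                it.remove();
    }
}
```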

I read the javadocs for WeakHashMap and they pretty clearly explain the fallacies… but maybe WHM was made when Concurrent collectors were a far-off dream, or could only cycle very slowly.  I think technology has caught up with WeakHashMap.

Cliff
