in Personal

Spot the DataRace

Looks like I need to take some more of my own medicine!  Here’s code I worked on over a year ago, and it took a year for the data race to finally come home to roost.  The following API is fundamentally flawed; no implementation of it can dodge the data race.  Here’s the deal: I’ve got a field which from time-to-time is reset to NULL, generally as part of some overall timeout-based cache-flushing policy.  It’s also in a time-critical function so I’m using lock-free accessing code with lots of comments on proper usage, Big Flashy Warnings, etc.  (The code to set the field uses locks, and isn’t shown here)

private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time

And the gratis accessor function:

public Code code() { return _code; }

And a some convenience functions to test various conditions.  Here’s one:

public boolean is_c2_code() { return code() != null && code().is_c2_method(); }

And some code using this API:

int nx = next_m.is_c2_code() ? ...next_m.code()... : ...blah...;

Ok, everybody can spot the implementation race-condition – especially after the fact!  It’s easy in isolation although somewhat harder in a 0.5MLOC program.  The issue is in the implementation of “is_c2_code()” – I’ve called “code()” twice.  And very very rarely it changes between the loads of the underlying _code field flipping from not-null for the “code() != null” test and then to null for the “code().is_c2_method()” call.  Boom, instant NullPointerException.  Grumble, mumble, I go back and fix “is_c2_code()” to only read _code once:

public boolean is_c2_code() { Code code = code(); // read _code ONCE
                                  return code != null && code.is_c2_method(); }

Crank up my test case, wait a few hours and … Lo!  The bug is NOT fixed!  Sure enough, under the right combination of pressure and temperature I blow up at the “…next_m.code()…” bit with another NPE.  And then the flaw with the API hits me:  if I call the “is_c2_code()” test, it tests one version of _code – and then when I call “code()” again in “…next_m.code()…” I get another load of _code which might hold a completely different value!  In fact, I need to do an entire conceptual operation from a single load of _code.  I can’t use any convenience functions unless they take the already-loaded _code field as an argument.  And with that epiphany I finally got it right (well, I hope anyways):

private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time
    public Code code() { return _code; }
    // Convenience function: takes a pre-loaded Code value
    static boolean is_c2_code( Code code ) { return code != null && code.is_c2_method(); }

And some code using this new API:

Code code = next_m.code(); // Read it ONCE for the whole operation
    int nx = Code.is_c2_code(code) ? ...code... : ...blah...;

And that’s my data-race lesson of the week: API’s can have data races as well as plain old code.

And now my question for you, the gentle reader: how do I enforce this over time?  In the actual code base there are about a dozen convenience functions, all now carefully labeled.  All the use-points now do a single clearly labeled read, then pass the loaded value around in all the convenience functions.  But in 6 months, when Junior Engineer comes along and looks at this class and says “I need a new function that tests _code and returns a blah”…. he likely makes another data-race API mistake.  Is there something better I can do than just comment, comment, comment?

Thanks,
Cliff