Spot the DataRace

Looks like I need to take some more of my own medicine!  Here’s code I worked on over a year ago, and it took a year for the data race to finally come home to roost.  The following API is fundamentally flawed; no implementation of it can dodge the data race.  Here’s the deal: I’ve got a field which from time-to-time is reset to NULL, generally as part of some overall timeout-based cache-flushing policy.  It’s also in a time-critical function so I’m using lock-free accessing code with lots of comments on proper usage, Big Flashy Warnings, etc.  (The code to set the field uses locks, and isn’t shown here)

private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time

And the gratis accessor function:

public Code code() { return _code; }

And a some convenience functions to test various conditions.  Here’s one:

public boolean is_c2_code() { return code() != null && code().is_c2_method(); }

And some code using this API:

int nx = next_m.is_c2_code() ? ...next_m.code()... : ...blah...;

Ok, everybody can spot the implementation race-condition – especially after the fact!  It’s easy in isolation although somewhat harder in a 0.5MLOC program.  The issue is in the implementation of “is_c2_code()” – I’ve called “code()” twice.  And very very rarely it changes between the loads of the underlying _code field flipping from not-null for the “code() != null” test and then to null for the “code().is_c2_method()” call.  Boom, instant NullPointerException.  Grumble, mumble, I go back and fix “is_c2_code()” to only read _code once:

public boolean is_c2_code() { Code code = code(); // read _code ONCE
                                  return code != null && code.is_c2_method(); }

Crank up my test case, wait a few hours and … Lo!  The bug is NOT fixed!  Sure enough, under the right combination of pressure and temperature I blow up at the “…next_m.code()…” bit with another NPE.  And then the flaw with the API hits me:  if I call the “is_c2_code()” test, it tests one version of _code – and then when I call “code()” again in “…next_m.code()…” I get another load of _code which might hold a completely different value!  In fact, I need to do an entire conceptual operation from a single load of _code.  I can’t use any convenience functions unless they take the already-loaded _code field as an argument.  And with that epiphany I finally got it right (well, I hope anyways):

private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time
    public Code code() { return _code; }
    // Convenience function: takes a pre-loaded Code value
    static boolean is_c2_code( Code code ) { return code != null && code.is_c2_method(); }

And some code using this new API:

Code code = next_m.code(); // Read it ONCE for the whole operation
    int nx = Code.is_c2_code(code) ? ...code... : ...blah...;

And that’s my data-race lesson of the week: API’s can have data races as well as plain old code.

And now my question for you, the gentle reader: how do I enforce this over time?  In the actual code base there are about a dozen convenience functions, all now carefully labeled.  All the use-points now do a single clearly labeled read, then pass the loaded value around in all the convenience functions.  But in 6 months, when Junior Engineer comes along and looks at this class and says “I need a new function that tests _code and returns a blah”…. he likely makes another data-race API mistake.  Is there something better I can do than just comment, comment, comment?

Thanks,
Cliff

Published by

wpengine

This is the "wpengine" admin user that our staff uses to gain access to your admin area to provide support and troubleshooting. It can only be accessed by a button in our secure log that auto generates a password and dumps that password after the staff member has logged in. We have taken extreme measures to ensure that our own user is not going to be misused to harm any of our clients sites.