jezhou

An interesting race condition at work

July 22, 2020

I encountered a pretty interesting race condition at work last week. I couldn’t find a ton of literature about it online, so I decided to write about it here on the off-chance this knowledge could possibly be helpful to someone. Here it goes!

Race conditions, quickly defined, are bugs that are a result of poor timing. If two processes read from the same record at the same time, but the second process is dependent on the first process finishing but doesn’t wait for it, then a race condition happens.

The best, most classic example of this is usually the bank withdrawal example:

Person A withdraws $20 from a $100 bank account
Person B withdraws $20 from the same bank account at the same time
Because the reads don’t block each other, there is a race condition. So what ends up happening is both withdrawals read $100, subtract $20, and persist $80 (the difference). But really, there should be $60 left in the bank account.

I’m mentioning this much used and somewhat contrived example because an interesting version of it happened last week when my pod was working on payroll reversals within the Gusto platform.

What happened?

I helped my pod recently write a feature that allows more customers to initiate payroll reversals in our application. (Right now, a lot of reversals are initiated over the phone; not a good experience!) Payroll reversals are exactly what they sound like: it is reversing a soon-to-be or potentially paid out payroll that an employer (ER) might not want to pay out anymore, for whatever reason. The reasons range from accidentally over/underpaying an employee (EE), to the EE putting in the wrong bank account number when they sign up for Gusto, etc.

The race condition, we found after a round of QA testing, was when two separate admins of a company try to both initiate a payroll reversal at once. We normally have a Rails validation that would block the creation of the second payroll reversal, using an on_create Rails validation callback; but that unfortunately didn’t kick in in time. So we encountered a super rare bug where two payroll reversals could be potentially created at once.

We luckily have other safeguards in place that prevent a ER getting their money back twice, but what if we didn’t? It would be an accounting error in favor of the ER, due to an unfixed race condition on Gusto’s platform. The golden rule when dealing with other people’s money, I’ve found, is to get it right. There is no room for error when it comes to managing paychecks. So a more senior engineer and I went on to fix it.

How was it fixed?

This is where it gets interesting. The more experienced folks reading this will point out that the most surefire wire to fix this is to do it at the database layer, either with a lock or a uniqueness constraint on a table. The uniqueness constraint is the “more correct” fix, but it involved a complicated migration where we would have to make a join table, tied to a payroll-specific table that is quite hot (aka it gets used a lot). Simply said: it would’ve taken more time than we could’ve afforded at the moment. So that was out of the question for the short-term.

Locking the row using a pessimistic row-lock seemed like the next best option, since Rails has built-in pessimistic locking and it involves no migrations whatsoever. Pessimistic locking has its own drawbacks, but it was the best way to band-aid the race condition without causing too much trouble for anyone else. The problem was interesting because typically locks are used on existing rows in a table, and not new ones being created.

In this case, we were creating payroll reversals, not just updating two separate ones. One might quickly think that locking the entire table might be a solution, but that it seemed super heavy-handed to lock the entire table over the creation of a single row.

So instead, we utilized what the senior eng I was working with called a “hack” and locked the parent record before creating the new row, and wrapped the entire thing in a transaction. This had the effect of locking the entire transaction based on its most common relationship, which in this case, was our payroll. We were effectively locking the creation of a payroll reversal using a row-lock on a different record!

This is what it looked like in Rails:

payroll.with_lock do
  payroll.touch # just updates the updated_at column
  PayrollReversal.create(permitted_params[:payroll_reversal])
end

We’ve had varying abilities to reproduce this next part, so take it with a grain of salt. But we needed to touch the record here because otherwise, Rails will throw an error about not modifying the row we’re actually locking (we were getting this error). In Rails 6, if you use with_lock (which is a conveninence method that both locks the record you’re working with and starts a transaction), you must actually update the record you’re working with. Otherwise, the lock won’t be claimed and nothing in the block will happen. (Again, take this with a grain of salt! Please email me if this is not correct.)

So what are the results of doing this? It’s hard to test on a single process, but if add an artificial block to force the transaction to wait, and you start two browser windows and call the controller action that starts the transaction, you can actually see the first transaction block the second one, which gives the desired outcome we want.

payroll.with_lock do
  payroll.touch
  PayrollReversal.create(permitted_params[:payroll_reversal])
  sleep(30) # this is the artificial block that hangs the transaction and blocks the second one
end

Two browsers, with the code above. Yay! We blocked one transaction!

The first transaction blocks the second.
The first transaction goes through and persists the record; lock is released.
The second transaction now claims the row-lock on the payroll, but now fails the Rails validation mentioned earlier since the first record is now fully persisted to the DB.

The race condition is now fixed! (Albeit in a hacky way.)

But this doesn’t seem… right?

Yeah, I agree. Again, this is not the “most correct” way of fixing something like this. Ideally, the Rails validation wouldn’t matter and everything would be enforced at the DB layer. But there are other considerations to make before deploying a fix like this: how much time can you, as a busy software engineer, commit to making a complex migration on a fairly hot table? How pressing is the race condition (as in, is it affecting customers right now)? Are you about to release a feature in which this bug is a blocking issue?

These were things we personally considered before going forward with this solution. Make sure you talk to your team and consider how much time you have to really fix something like this. As one of my coworkers likes to say: given an abundance of time, you can figure out anything.