jezhou

We should've (safely) tested in production!

March 31, 2021

For the last 2 months, I worked on updating one of Gusto’s “legacy” integrations to OAuth 2, which required us to swap out one of the gems we were using to make requests for another one. It was a pretty interesting experience, albeit one that was filled with some stress due to a super hard deadline for the cutover (set by the partner in our SLA).

We were burning the midnight oil for this upgrade, and we weren’t actually able to flip the feature flag for the OAuth 2 code path until about a week before the hard deadline. Talk about cutting it close!

Once real users started using this new codepath, we noticed a few errors pop up in the version of the gem we were using, and we had to monkey patch some fixes into the gem since upgrading to the latest was too much work in the time we had left (this is another story for another time 🙂).

There are a lot of questions here.

1. Why didn’t we catch these errors sooner?
2. Was there a way we could’ve caught these errors sooner than we did?

To answer (1), it’s important to first confirm that we did a lot of manual testing on our part using development accounts provided by the partner, and there was no signal prior to this that we would get these kinds of errors. But what we didn’t consider was that real data had some legacy values that were impossible to emulate in our developer accounts.

Example: our development accounts did not allow us to enter a string longer than 10 characters for a certain attribute, but some very old accounts we were trying to migrate had values way over 10 characters. This data was “grandfathered” into the partner’s new schema, which was probably enforced by a front-end-only validation and not a back-end one (yet).

So what was happening was that the new gem codified the partner’s newer schema and did not consider the possibility of legacy values. The gem raised an error when it tried to read these values in. This goes to show that testing on real data is preferable to testing things with happy-state developer accounts.
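To make the failure mode concrete, here is a minimal sketch in plain Ruby. Everything in it is hypothetical: `parse_attribute`, `SchemaError`, and the sample values are stand-ins I made up for illustration, not the actual gem’s code. The only detail taken from the story above is the 10-character limit.

```ruby
# Hypothetical stand-in for a client gem that codifies the partner's newer schema.
# None of these names come from the real gem.
MAX_LENGTH = 10

class SchemaError < StandardError; end

# Strict parsing: anything longer than the schema allows raises.
def parse_attribute(value)
  if value.length > MAX_LENGTH
    raise SchemaError, "value exceeds #{MAX_LENGTH} characters"
  end
  value
end

# A developer account can never hold a value this long, so manual testing passes.
dev_account_value = "short"
parse_attribute(dev_account_value)

# A grandfathered production value only shows up once real users hit the code path.
legacy_value = "a grandfathered value"
begin
  parse_attribute(legacy_value)
rescue SchemaError => e
  puts "legacy value rejected: #{e.message}"
end
```

The point of the sketch is that no amount of testing with `dev_account_value`-style data would ever reach the `raise` branch.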

But how do you even test on real data when it’s not yours? It’s a pretty big ask to go to the partner and ask, “Hey, can you give us a copy of your production data so we can test some stuff on our side?”

So this leads us to (2): if we can’t pull data we don’t own to test on, is there any way we could’ve caught the errors sooner?

My thinking is yes. Using something like github/scientist, I think we could’ve done a rudimentary A/B test of the OAuth 1 vs. OAuth 2 implementations, defaulting to OAuth 1 as our control if OAuth 2 errored out (this is totally possible with GitHub’s Scientist!). Going with the Scientist approach would have required migrating a subset of our users much earlier on, making both OAuth 1 and OAuth 2 requests on their behalf, diffing the results, and reporting any failures.
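Here is a rough sketch of that control/candidate idea. To be clear, this is a hand-rolled illustration in plain Ruby and not the Scientist gem’s API; `run_experiment`, `Mismatch`, and the two fetch lambdas are names I invented, and the lambdas just fake the two code paths instead of making real requests.

```ruby
# Hand-rolled sketch of the control/candidate pattern.
# This is NOT the Scientist gem's API; all names here are hypothetical.
Mismatch = Struct.new(:name, :control, :candidate, :error)

def run_experiment(name, control:, candidate:, reporter:)
  # The control (OAuth 1) result is always what the caller gets back.
  control_value = control.call

  begin
    candidate_value = candidate.call
    # Diff the two results and report when they disagree.
    unless candidate_value == control_value
      reporter.call(Mismatch.new(name, control_value, candidate_value, nil))
    end
  rescue StandardError => e
    # A candidate (OAuth 2) failure is reported but never reaches the user.
    reporter.call(Mismatch.new(name, control_value, nil, e))
  end

  control_value
end

reports = []

# Fake code paths standing in for the real OAuth 1 and OAuth 2 requests.
fetch_with_oauth1 = -> { { "name" => "a grandfathered value" } }
fetch_with_oauth2 = -> { raise ArgumentError, "value exceeds 10 characters" }

result = run_experiment(
  "partner-account-fetch",
  control: fetch_with_oauth1,
  candidate: fetch_with_oauth2,
  reporter: ->(mismatch) { reports << mismatch }
)

puts "returned to user: #{result.inspect}"
puts "reported failures: #{reports.length}"
```

In this sketch the user still gets the OAuth 1 response, while the OAuth 2 error lands in `reports`, where it could be turned into a regular test case. One trade-off worth noting is that every request in the experiment group is made twice, so this is best limited to a subset of users and to read-only calls.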

Isn’t this just “testing in production”? Yes! It totally is. But I think it’s a safe way to test in production, especially on data you don’t own, and codify any wild edge case in a traditional test as soon as it’s reported in the system – all while real users don’t notice!

GitHub wrote a great post on the value prop of using the “scientist” approach here.

If you are thinking about making a similar upgrade, I encourage you to try the “scientific” approach and see if it works for you. The gem is marketed as a way to safely refactor critical paths in a data-driven manner.

Feel free to email me any thoughts you might have!