AUGUST 31, 2018

Why I Prefer Good Testing Over Excellent Testing – Part 6

This is the sixth article in a series that expands on a talk I've given called "Why I Prefer Good Testing Over Excellent Testing".

Times I Have Been Saved By Solid Monitoring and Roll-Back Strategies

Last time, I shared a story about how cost monitoring and telemetry about messages sent in production helped us recognize and troubleshoot an issue that had no functional impact on our service, but would have caused us to incur significant costs if we hadn't discovered it. In this article, I'll describe how feature flags allowed us to quickly revert new functionality while we investigated a potential issue (without having to deploy a new build).

Story 5: The calmest production failure ever

Feature flags are a great way to manage risk in software projects. The basic idea is that they allow you to separate the concepts of deploying code and actually having it take effect. So you can deploy a new feature, and its code is in the build that goes to production, but you can use a feature flag to keep it turned off until you are ready to put it into action.

Just as importantly, it also gives you the power to turn features back OFF once they have been turned on. All without having to deploy a new build. At D2L we’re using a tool called Launch Darkly to manage our feature flags.

Launch Darkly logo
Fig. 9: Launch Darkly; a tool for managing feature flags
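
To make the deploy-versus-release split concrete, here is a minimal Python sketch. It is not how Launch Darkly works internally and it does not use its SDK; the in-memory FLAGS dictionary, the flag name "new-greeting", and the greeting function are all hypothetical stand-ins for a real flag service and a real feature.

```python
# Hypothetical in-memory flag store. A real system would ask a flag
# service for the current value instead of reading a local dict.
FLAGS = {"new-greeting": False}


def is_enabled(flag_name: str) -> bool:
    """Return the flag's current value; unknown flags default to off."""
    return FLAGS.get(flag_name, False)


def greeting(name: str) -> str:
    # The new code path ships in the build, but stays dormant
    # until the flag is turned on.
    if is_enabled("new-greeting"):
        return f"Welcome back, {name}!"
    return f"Hello, {name}."


print(greeting("Ada"))         # flag off: old behaviour
FLAGS["new-greeting"] = True   # turn the feature on at runtime, no redeploy
print(greeting("Ada"))         # flag on: new behaviour
FLAGS["new-greeting"] = False  # and turn it back off just as easily
print(greeting("Ada"))
```

The important part is that both code paths are present in the same build. Changing behaviour is a data change (the flag value), not a code change.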

For the event messaging service I mentioned in the last lesson, we set up a multi-stage feature flag so that we could slowly roll out more of its functionality over time, as we gained confidence in its ability to fully replicate the legacy functionality it was intended to replace.

A feature flag with several stages
Fig. 10: Settings for a multi-stage feature flag
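
A multi-stage flag like this could be modelled roughly as follows. This is only a sketch under assumptions: the stage names (LEGACY_ONLY, PARALLEL, NEW_ONLY), the publish function, and the sender callbacks are hypothetical and are not taken from our actual service or from any Launch Darkly API.

```python
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    """Hypothetical stages of the rollout flag."""
    LEGACY_ONLY = "legacy-only"  # new service inactive
    PARALLEL = "parallel"        # old and new both run
    NEW_ONLY = "new-only"        # fully rolled out


Sender = Callable[[str], None]


def publish(event: str, stage: Stage, send_legacy: Sender, send_new: Sender) -> None:
    """Route an event according to the current flag stage."""
    if stage in (Stage.LEGACY_ONLY, Stage.PARALLEL):
        send_legacy(event)
    if stage is Stage.PARALLEL:
        try:
            send_new(event)
        except Exception as exc:
            # The legacy path already delivered the event, so a failure
            # in the new service is noise rather than data loss.
            print(f"new service failed in parallel mode: {exc}")
    elif stage is Stage.NEW_ONLY:
        # No safety net here: failures must surface to the caller.
        send_new(event)


delivered: List[str] = []


def legacy_sender(event: str) -> None:
    delivered.append(f"legacy:{event}")


def broken_new_sender(event: str) -> None:
    raise RuntimeError("health check failing")


# Parallel mode with a broken new service: the event still goes out.
publish("example-event", Stage.PARALLEL, legacy_sender, broken_new_sender)
# Rolling back is just a different flag value, not a new build.
publish("example-event", Stage.LEGACY_ONLY, legacy_sender, broken_new_sender)
print(delivered)
```

In this sketch the stage is passed in as an argument to keep the example self-contained; in practice it would be read from the flag service on each call, which is what lets a change in the web UI take effect without a deployment.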

One day, all of the health checks for our production service started failing. Because we were operating in a mode where both the old and new messaging methods were running in parallel, we knew that failures in the new service did not mean data loss or any other negative impact to our customers. The failing health check notifications were getting annoying, though, so we temporarily reverted to a state where the new service was inactive while we investigated further.

After the issue was resolved (a configuration error during a recent update, unrelated to our service), we switched back into the parallel mode where both the old and new versions of the service were active. All was well. Later that week, we decided we were ready to take the next step, and moved into a state where only the new service was active. Our feature was now fully rolled out.

Lesson Learned: Safe roll-out and roll-back strategies allow you the luxury of thinking clearly during "emergencies".

ALL OF THIS HAPPENED WITHOUT DEPLOYING A NEW BUILD. Think about that for a second. At every moment during the time when the new service was down, we were one click in a web UI away from reverting all customers to the old version.

Not only does this mean we can resolve issues very quickly, but because we’re not freaking out, it also means we have the luxury of thinking calmly and clearly about what our next steps should be. This is probably the most valuable thing you could have during an emergency – and in fact, it kind of removes the concept of an “emergency” from your team altogether.

- TF