This is the fourth article in a series that expands on a short talk I gave called "Why I Prefer Good Testing Over Excellent Testing".
Last time, I shared a story about how an accidental port closure impacted the deployment of a new service. Of course, all of our tests had passed the day before, when the port was still open as it was supposed to be.
Here's one final story about a failure that we consciously decided NOT to test for in advance.
In this story, my team was building a product that relied heavily on Amazon's Simple Storage Service (S3). S3 has a long track record of extremely high availability, and its ease of use and low cost have made it a popular choice among web developers of all kinds. As a result, a significant proportion of the web sites and web apps running today use S3 in at least some capacity.
S3's proven record of high availability, along with its ubiquity, led us to feel comfortable NOT prioritizing any work on building and testing contingency plans for how our app would behave if the service ever became unavailable. Instead, we invested that time in testing and hardening the things we felt posed larger risks to our product, such as security and performance.
On February 28, 2017, Amazon S3 went down for multiple hours. During that time, our users were simply unable to use most of the functionality in our application. There was nothing we could do about it, except hope that Amazon was able to get S3 back up and running soon.
No matter how thoroughly you've tested your application or service, if one of your dependencies is down and you don't have a backup plan, your users might still be out of luck. In this case, we knew we depended on S3 and had consciously decided not to prioritize a failover mechanism, but it's also very possible to be impacted by a dependency you were not even aware of. In these cases, you could unintentionally end up without sufficient failure handling.
The big takeaway here is a reminder that you can't always fully control the uptime (or performance, security, or other dimensions) of your product. While it may not always make sense to invest in detailed contingency plans for every possible scenario, it's probably worth at least understanding the state your product will be in if any of your known dependencies are unavailable or malfunctioning. Knowing this will help you to make more informed prioritization decisions.
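To make that idea concrete, here is a minimal sketch (not from our actual product) of one way to keep an app in a known, degraded-but-functional state when a dependency call fails. The function names `fetch_from_storage` and `fetch_cached_placeholder` are hypothetical stand-ins for a real storage call and a degraded-mode fallback.

```python
# Minimal sketch: wrap calls to an external dependency so the rest of
# the app degrades gracefully instead of failing outright when that
# dependency is down. All names here are illustrative.

def with_fallback(primary, fallback):
    """Call primary(); if it raises, serve fallback() instead."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # The dependency (e.g. object storage) is unreachable;
            # return a degraded but usable response.
            return fallback(*args, **kwargs)
    return call

def fetch_from_storage(key):
    # Simulate an outage of the storage backend.
    raise ConnectionError("storage backend unreachable")

def fetch_cached_placeholder(key):
    # Degraded-mode response, e.g. a cached copy or placeholder asset.
    return f"placeholder for {key}"

fetch = with_fallback(fetch_from_storage, fetch_cached_placeholder)
print(fetch("user-avatar.png"))  # -> placeholder for user-avatar.png
```

Even a fallback this simple forces you to decide, ahead of time, what your product should do when the dependency is gone, which is exactly the kind of informed prioritization decision described above.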
And of course, nothing will "never go down". Not even Amazon S3.