This is the third article in a series that expands on a short talk I gave called "Why I Prefer Good Testing Over Excellent Testing".
Last time, I shared a story about how a type of automated monitoring activity that my team was not aware of completely derailed the data in our Tool Usage report once we released it to production.
Here's another story about a problem that would not have been avoided even if we had spent more time testing in advance.
My team was developing a new service as a replacement for an older, legacy version. It needed a dedicated port in each of the data centers where we host Brightspace. We asked our data center administrators for a port we could use, and they assigned us one, which was open in all data centers and not being used for anything.
Similar to the Tool Usage report story, we had done extensive testing of the functionality, performance, security, and error handling of our new service. We also spent a lot of time building very detailed monitoring and alerting capabilities, so that we could closely observe how our service was behaving in production and proactively identify problems before they began impacting clients. Furthermore, we configured a system of feature flags that would allow us to quickly toggle between different states without deploying a new build (e.g. run just the old service, run both old and new in parallel, run just the new service).
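To make the three flag states concrete, here is a minimal sketch in Python. It is an illustration only, not our actual implementation: the names `RolloutMode`, `parse_mode`, and `services_to_run` are hypothetical, and it assumes the flag arrives as a plain string from whatever flag store is in use.

```python
from enum import Enum


class RolloutMode(Enum):
    """Hypothetical flag values mirroring the three states described above."""
    OLD_ONLY = "old_only"
    PARALLEL = "parallel"
    NEW_ONLY = "new_only"


def parse_mode(raw):
    """Turn a raw flag string into a RolloutMode; raises ValueError if unknown."""
    return RolloutMode(raw.strip().lower())


def services_to_run(mode):
    """Return which services should be active for the given mode."""
    if mode is RolloutMode.OLD_ONLY:
        return ["old"]
    if mode is RolloutMode.PARALLEL:
        return ["old", "new"]
    if mode is RolloutMode.NEW_ONLY:
        return ["new"]
    raise ValueError(f"unhandled rollout mode: {mode!r}")
```

Because the mode is read at run time rather than compiled in, changing the flag value is enough to fall back from the new service to the old one without deploying a new build.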
We were confident that the service would work correctly and that, if something unexpected did happen, we would find out quickly and be able to resolve it right away.
On the day that we decided to activate our new service, things seemed to be going smoothly. We saw successful connections in several of the data centers, and data began to appear in our monitoring tools. However, as the rollout continued, we started to receive alerts that a large number of client instances were unable to connect to our service. This seemed odd, since so many other instances were working just fine.
After some investigation, we realized that all of the instances having problems connecting were hosted in the same data center. After following up with the data center administrators, we learned that the port we'd been assigned was actually closed in that particular data center.
However, the port had been open everywhere the last time we'd checked, which was only a few days earlier. As far as anyone could tell, the change had not been intentional. So, we re-opened the port in that data center, and all the client instances hosted there began to connect to our service successfully.
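A check like the one we had done a few days earlier can be scripted so that it is cheap to repeat right before and during a rollout. The sketch below is one possible way to do that in Python, using only the standard library's `socket.create_connection`. The function names and the mapping of data center names to hosts are assumptions made for illustration, not a description of our tooling.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def closed_data_centers(hosts_by_data_center, port, timeout=3.0):
    """Return the sorted names of data centers where the port is not reachable.

    hosts_by_data_center is a hypothetical mapping such as
    {"dc-east": "service.dc-east.example", ...}.
    """
    return sorted(
        name
        for name, host in hosts_by_data_center.items()
        if not is_port_open(host, port, timeout)
    )
```

A check like this only reports the state of the port at the moment it runs, and only from wherever the script is running, so it complements monitoring and alerting rather than replacing them.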
This series of events was a great reminder that no matter how well you test, research, and prepare today, something could change that makes your results or conclusions invalid tomorrow.