This is the fifth article in a series that expands on a short talk I gave called "Why I Prefer Good Testing Over Excellent Testing".
So far, I've shared 3 stories about problems I feel I never would have found, even if I had all the time in the world (i.e., they would not have been prevented by doing additional testing). Next, I will discuss 2 cases where investing in monitoring and roll-back strategies (instead of exhaustively testing everything we could think of before releasing) really paid off.
My team was developing an event messaging service that required every instance of our product to maintain a connection to a message broker in AWS. To keep the connections alive, we configured each instance to ping the broker once every 10 minutes. Every message sent to or from the broker, including ping messages, costs money. It's only a fraction of a cent, but still, it's something.
Because we had a budget to stick to, we put some monitoring in place using Amazon's CloudWatch service. We configured alerts that would notify us if we spent more than a certain amount of money within one month.
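For readers who haven't set one up, here is a minimal sketch of what such a cost alert might look like. It is not our actual configuration: the function name, alarm name, threshold, and SNS topic are all hypothetical, and it assumes the alarm is built on CloudWatch's `EstimatedCharges` billing metric, which reports month-to-date charges. The function only builds the parameters for CloudWatch's `PutMetricAlarm` call, so it can be inspected without touching an AWS account.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn, name="monthly-spend-alarm"):
    """Build keyword arguments for a CloudWatch PutMetricAlarm request.

    Everything passed in here is illustrative (hypothetical alarm name,
    threshold, and SNS topic); this is a sketch, not our real setup.
    """
    return {
        "AlarmName": name,
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",  # month-to-date estimated charges
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # evaluate over 6-hour windows
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # where the notification goes
    }


# Hypothetical values, for illustration only.
params = billing_alarm_params(
    threshold_usd=50,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-cost-alerts",
)
```

With boto3 installed and credentials configured, these parameters could be passed along as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. Billing metrics are stored in the US East (N. Virginia) region, which is why the client is pointed there.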
One day, the cost alarm went off. This surprised us, because we were just starting to roll out the new service in production, and at that point only a very small number of instances were using it. We looked at our daily cost data, and it was clear that something was running up the bill. But what could it be?
Next, we looked at the data we were collecting about our event messages. The relative proportions of the different message types seemed strange; there's no way that 99% of all traffic should consist of ping messages. Publish and subscribe messages typically make up the bulk of the traffic. This ultimately led us to discover a typo in our code that meant we were sending ping messages once every second, rather than the intended once every 10 minutes. Oops.
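To put the typo in perspective, here is a rough sketch of the arithmetic, plus the kind of message-mix check that would flag it. The function names, the 50% threshold, and the publish/subscribe counts are made up for illustration; only the two ping intervals come from the story above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pings_per_day(interval_seconds):
    """Number of pings one instance sends in a day at the given interval."""
    return SECONDS_PER_DAY // interval_seconds


def ping_share_is_suspicious(counts, max_share=0.5):
    """Return True if pings make up more than max_share of all messages.

    counts maps message type to message count. The 0.5 default is an
    arbitrary, hypothetical threshold.
    """
    total = sum(counts.values())
    if total == 0:
        return False
    return counts.get("ping", 0) / total > max_share


intended = pings_per_day(10 * 60)  # once every 10 minutes
actual = pings_per_day(1)          # once every second
multiplier = actual // intended    # how many times more pings than planned

# Hypothetical daily traffic for one instance; the non-ping counts are invented.
healthy_day = {"ping": intended, "publish": 300, "subscribe": 200}
buggy_day = {"ping": actual, "publish": 300, "subscribe": 200}

print(intended, actual, multiplier)
print(ping_share_is_suspicious(healthy_day), ping_share_is_suspicious(buggy_day))
```

One ping every 10 minutes works out to 144 pings per instance per day, while one ping per second is 86,400, which is 600 times the intended volume. A fraction of a cent per message looks rather different at that rate.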
There was no functional impact on the event messaging service when the ping frequency was incorrect. Without cost monitoring and message type telemetry, it might have taken us a really long time to even notice this issue, much less troubleshoot it. In the meantime, we would have been wasting a lot of money and generating unnecessary traffic on our network, both of which we'd like to avoid if we can help it.
So, "working fine" doesn't necessarily mean "no problems".