JULY 4, 2018

Why I Prefer Good Testing Over Excellent Testing – Part 5

This is the fifth article in a series that expands on a short talk I gave called "Why I Prefer Good Testing Over Excellent Testing".

Times I Have Been Saved By Solid Monitoring and Roll-Back Strategies

So far, I've shared three stories about problems I feel I never would have found even if I had all the time in the world (i.e., problems that additional testing would not have prevented). Next, I'll discuss two cases where investing in monitoring and roll-back strategies, instead of exhaustively testing everything we could think of before releasing, really paid off.

Story 4: The off-by-10,000 error

My team was developing an event messaging service that required every instance of our product to maintain a connection to a message broker in AWS. To keep those connections alive, we configured each instance to ping the broker once every 10 minutes. Every message sent to or from the broker, including a ping, costs money. It's only a fraction of a cent per message, but still, it adds up.

All instances are connected to a central message broker
Fig. 6: Each instance of our product needs to maintain a connection to a central message broker in AWS.
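In code, the keepalive behaviour amounts to something like the sketch below. This is a hypothetical Python reconstruction, not our actual implementation; the broker client object, its methods, and the constant name are all invented for illustration.

    import time

    # Hypothetical sketch; the real client library and names are not shown here.
    PING_INTERVAL_SECONDS = 10 * 60  # intended: one ping every 10 minutes


    def keep_alive(broker_client):
        """Ping the broker on a fixed interval so the connection stays open."""
        while broker_client.is_connected():
            broker_client.ping()                 # every ping is a billable message
            time.sleep(PING_INTERVAL_SECONDS)    # a wrong value here multiplies the bill

A constant like this is exactly the kind of place where a small typo can quietly change the economics of the whole service.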

Because we had a budget to stick to, we put some monitoring in place using Amazon's CloudWatch service. We configured alerts to notify us if we spent more than a certain amount of money within a month.
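As a rough idea of what that looks like, here is a minimal sketch using boto3 against the AWS/Billing EstimatedCharges metric. The alarm name, threshold, and SNS topic are placeholders, and our real setup differed in its details.

    import boto3

    # Minimal sketch of a monthly-spend alarm; names, threshold, and SNS topic
    # ARN are placeholders. Billing metrics live in us-east-1 and only appear
    # once "Receive Billing Alerts" is enabled for the account.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-spend-over-budget",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=6 * 60 * 60,              # billing data updates a few times a day
        EvaluationPeriods=1,
        Threshold=100.0,                 # placeholder monthly budget in USD
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],
    )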

One day, the cost alarm went off. This surprised us, because we were just starting to roll out the new service in production, and at that point only a very small number of instances were using it. We looked at our daily cost data, and it was clear that something was running up the bill. But what could it be?

Bar graph showing daily cost data
Fig. 7: Clearly, the cost alarm had gone off for a reason; our daily costs had started climbing steeply.

Next, we looked at the data we were collecting about our event messages. The relative proportions of the different message types looked strange: there was no way that 99% of all traffic should be ping messages, since publish and subscribe messages normally make up the bulk of the traffic. This ultimately led us to a typo in our code: we were actually sending ping messages once per second, rather than once every 10 minutes as intended. Oops.

Pie chart showing relative proportions of message types
Fig. 8: Normally, ping messages would not make up such a large proportion of the overall message traffic.
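The telemetry behind that pie chart can be as simple as counting messages per type. As a hedged sketch (the namespace, metric name, and dimension here are invented, and our real pipeline was more involved), each instance could publish a custom CloudWatch metric tagged with the message type:

    import boto3

    cloudwatch = boto3.client("cloudwatch")


    def record_message(message_type: str) -> None:
        """Count one message of the given type ("ping", "publish", "subscribe", ...)."""
        # Hypothetical namespace and metric name, for illustration only.
        cloudwatch.put_metric_data(
            Namespace="EventMessagingService",
            MetricData=[{
                "MetricName": "MessagesSent",
                "Dimensions": [{"Name": "MessageType", "Value": message_type}],
                "Value": 1.0,
                "Unit": "Count",
            }],
        )

With per-type counts in place, a breakdown like the one above (or a simple alarm on the ping share of traffic) makes an anomaly like ours obvious at a glance.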

Lesson Learned: If your thing is working fine, it doesn't mean there are no problems.

The incorrect ping frequency had no functional impact on the event messaging service. Without cost monitoring and message-type telemetry, it might have taken us a very long time to even notice the issue, much less troubleshoot it. In the meantime, we would have been wasting a lot of money and generating unnecessary traffic on our network, both of which we'd rather avoid.

So, "working fine" doesn't necessarily mean "no problems".
