SEPTEMBER 20, 2018
Why I Prefer Good Testing Over Excellent Testing – Part 7
This is the seventh (and final) article in a series that expands on a talk I've given called "Why I Prefer Good Testing Over Excellent Testing".
Summary of Story Time
In the past five articles, I've shared stories about defects I missed and wouldn't have found even with all the time in the world, and cases where I was saved by good monitoring and roll-back strategies. To summarize the key lessons I learned:
- You don't know what you don't know
- What's true today could change tomorrow
- Things out of your control can impact you
- "Working fine" does not necessarily mean there are no problems
- Having failure handling strategies in place allows clear thinking during emergencies
Ultimately, doing really excellent testing before releasing our work would not have helped in any of the situations I described.
How do I make "good testing" work for me?
So hopefully by now you’re thinking, “this all sounds great, but how would I get started doing this in my own team?” Well, the first step (not surprisingly) is to do good testing.
1. Do Good Testing
- Do: test until you feel pretty confident
- Do: focus on things that would be hard to detect using automated monitoring (e.g. UI testing? business logic calculations? etc)
- Skip: tests that feel so artificial that they're probably not telling you much about what might happen in the real world
- Skip: tests that are so expensive or time consuming to set up that they're hard to justify doing
- Skip: tests where failures would likely only uncover very low priority defects
Then, instead of continuing to test further, invest the rest of your time in planning for failure. Here are some questions you could ask yourself to get started.
2. Plan For Failure
- Roll out and roll back
  - Is there a way to roll out your changes in stages? (e.g. to subsets of users, or while keeping older functionality around until you’re sure the newer version works properly, like I described with our Broadcast Event Service)
  - Do you know how you’d quickly roll back to an earlier version if you needed to? (e.g. using a feature flag, or maybe temporarily redirecting traffic back to some older servers)
- Detecting issues
  - What information can you collect that will signal something is going wrong? (e.g. could be cost monitoring, like with our off by 10,000 error, or maybe it’s traffic volume, CPU or memory usage, specific words appearing in a log file, etc)
  - How do you want to be notified when a problem is detected? (e.g. use a service like Amazon CloudWatch to send you or your team messages, maybe you have a dashboard showing key metrics that you display on a big screen in your team’s area, etc)
  - Who needs to take action when there is an issue? (e.g. maybe you have an on-call rotation, or maybe the issues you’re monitoring for are non-critical and you can review alerts the next day as part of standup, etc)
- Responding to issues
  - What are some good first steps to take when an issue comes in? It can be useful to have these recorded somewhere, since this can help to quickly rule out false alarms or identify recurring issues that already have a defined resolution plan
  - Who else needs to be kept in the loop while issues are being investigated, and what kind of information do they need?
  - Can you implement any extra resiliency or self-healing for any aspects of your product? This is an important one, since it can potentially save you a lot of time, or sleep if you’re doing a 24hr on-call shift! (e.g. for the Broadcast Event Service, before alerting on a failed outgoing message, we implemented a single retry, so we only get alerted if multiple attempts in a row fail)
These are just a few suggestions for brainstorming your failure planning - there are lots more things you could consider and ask about, but this list should at least be a good place to get started.
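To make the "roll out in stages" and "quick roll back" questions a bit more concrete, here is a minimal sketch of a percentage-based feature flag. It is only an illustration, not how any of the services in this series were built: the function names and the idea of passing `rollout_percent` in directly are assumptions, and in a real system that number would probably come from a config service or flag tool.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket from 0 to 99 (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_version(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return bucket_for(user_id) < rollout_percent


def handle_request(user_id: str, rollout_percent: int) -> str:
    # Both code paths stay deployed, so the old one remains available
    # as a fallback while the new one proves itself.
    if use_new_version(user_id, rollout_percent):
        return "new"
    return "old"
```

Because each user always lands in the same bucket, raising the percentage from 10 to 50 only adds users to the new path, and setting it to 0 sends everyone back to the old path without a redeploy.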
What does all this mean for me and my team?
So, let’s say you’ve decided to try “good testing” in combination with detailed failure planning. How will this change the way you and your team operate?
Changes in Focus
- Both devs and testers think about observability up front
- Use time not spent on last mile of testing to prep for failure handling
- Don't forget to test your monitoring!
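As one way to picture "testing your monitoring", here is a small sketch of a log-scanning rule together with tests that feed it a known-bad line and a healthy line. The keywords, function names, and log format are all made up for illustration; your own alerting is likely configured in a monitoring tool rather than in code like this, but the idea of deliberately triggering an alert to prove it fires is the same.

```python
from typing import Iterable, List, Sequence

# Hypothetical keywords; pick ones that match your own log format.
ALERT_KEYWORDS = ("ERROR", "Traceback", "timed out")


def find_alert_lines(
    log_lines: Iterable[str], keywords: Sequence[str] = ALERT_KEYWORDS
) -> List[str]:
    """Return the log lines that contain any of the alert keywords."""
    return [line for line in log_lines if any(k in line for k in keywords)]


def test_alert_fires_on_known_bad_line() -> None:
    lines = ["INFO request served in 12ms", "ERROR payment gateway timed out"]
    assert find_alert_lines(lines) == ["ERROR payment gateway timed out"]


def test_alert_stays_quiet_on_healthy_logs() -> None:
    assert find_alert_lines(["INFO request served in 12ms"]) == []
```

The second test matters as much as the first: an alert that fires on healthy traffic will quickly get ignored.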
Changes in Expectations
- Maybe you'll release more bugs (since you will no longer be testing all corner cases you can think of)
- You'll probably NOTICE more bugs (possibly just because you now have more visibility into what's actually going on)
- You might be on call (good motivation to invest in reliability and self healing!)
Changes in Success Criteria
- Instead of aiming for fewer escaped defects, aim for ability to detect and resolve issues more quickly
- Instead of "why did we miss this defect?", consider "could this defect be automatically resolved next time?"
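The single retry mentioned earlier for the Broadcast Event Service is a simple example of this kind of automatic resolution. The sketch below shows the general shape of "retry first, alert only if every attempt fails". It is an assumption-laden illustration rather than the actual service code: `send`, `alert`, and `max_attempts` are hypothetical names, and a real version would likely catch narrower exceptions and wait between attempts.

```python
from typing import Callable, Optional


def send_with_retry(
    send: Callable[[], None],
    alert: Callable[[str], None],
    max_attempts: int = 2,
) -> bool:
    """Call send(); alert only if every attempt fails. Returns True on success."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            send()
            return True
        except Exception as exc:  # broad catch keeps the sketch short
            last_error = exc
    # Only reached when all attempts failed.
    alert(f"send failed after {max_attempts} attempts: {last_error}")
    return False
```

With `max_attempts=2`, a one-off network blip resolves itself quietly, and a human only hears about it when two attempts in a row fail.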
In closing, if we boil this entire series of articles down to just one sentence, the key piece of advice I am offering is: it's a good idea to be ready to handle failure, because in all likelihood you're not going to find all the bugs anyway :)