Testing and monitoring in production is a great way to learn how your system is really performing with real users, real requests, and real data. Gathering information about production systems is nothing new, but as more teams adopt practices like continuous delivery, the information we collect can be expanded to provide a more complete view of the application.
Faster release cycles create a shift in testing approaches, particularly with more time spent making production environments test-friendly and less time spent in pre-production. This shift does not mean ignoring pre-release testing, but it acknowledges that time-to-market and iteration are more important than perfect software.
The funny thing is, whether it is a game, a desktop application or a web service, if you are a tester who has shipped a product, you have missed a bug.
This post covers a few major reasons why testing and monitoring in production is important for your QA pipeline, and how it can give you a better understanding of your APIs and websites as a whole.
- Differences between testing and production environments
- Monitoring in production complements continuous delivery
- Plan to recover from failures, not prevent them
Differences between environments
One of the biggest arguments in favor of testing and monitoring in production is the fact that testing and production environments will always be different.
Teams spend a lot of time building test environments that perfectly replicate production with things like hardware provisioning and automated deployment scripts. But let's face it, there will always be differences, and even the smallest differences can have a huge impact on the data you collect in your tests.
Comparing a test and production environment to each other is like comparing a zoo to nature [..]. They may have similarities but the differences are plentiful.
This alone is enough of a reason to make production web service monitoring a standard part of your QA test plan, but it's just the TiP of the iceberg (pun intended). Consider some of the following reasons your testing and production environments may differ, and how these differences affect your testing:
Traffic load with real user requests
Monitoring in production helps you measure how your application is performing during peak and low traffic hours. With web applications, slow response times can frustrate new users and customers. This is a concern for QA, and it's generally nearly impossible to exactly replicate traffic load in test and staging environments.
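One way to make the peak versus low-traffic comparison concrete is to group production response times by hour and look at a high percentile for each hour. The sketch below is a minimal illustration, not a prescribed method: the sample format (hour of day, response time in milliseconds), the function names, and the 500 ms threshold are all assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slow_hours(samples, threshold_ms, pct=95):
    """Return {hour: percentile} for hours whose percentile exceeds threshold_ms.

    samples is an iterable of (hour_of_day, response_time_ms) pairs, a
    hypothetical shape for data exported from production logs.
    """
    by_hour = defaultdict(list)
    for hour, elapsed_ms in samples:
        by_hour[hour].append(elapsed_ms)

    flagged = {}
    for hour in sorted(by_hour):
        value = percentile(by_hour[hour], pct)
        if value > threshold_ms:
            flagged[hour] = value
    return flagged


# Made-up samples: quiet at 03:00, busy at 18:00.
samples = [(3, 100), (3, 110), (3, 120), (18, 200), (18, 900), (18, 950)]
print(slow_hours(samples, threshold_ms=500))
```

A percentile is used rather than an average because a handful of very slow requests during peak hours can hide behind a healthy-looking mean.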
Latency with production databases
Databases in production often contain more (or different) data than testing environments. Latency in a database can cause weird errors that may not have been caught pre-production. Seeding data in testing environments is always a good idea, but it's not the same as production data.
Stub components in test environments
Many teams have stub components built in to test certain parts of an application in isolation. This is especially true for large web applications that rely on external APIs and services. When these components are moved from test to production, there are a million little pieces between them that may behave differently than expected.
Ken Johnston and Seth Eliot from Microsoft have a great series of blog posts, starting with TIP-ing services testing blog #1: the executive summary, that take a deep dive into the pitfalls of trying to perfectly replicate production systems in the lab. I recommend reading through those (after you've finished this one, of course!) as there are some really great stories. But for context, this is a nice summation:
Let’s test our services in the real world, in the data center and possibly with real users, because our system is too complex to test in the lab (think layering, dependencies, and integration) and the environment it runs in is impossible to model in the lab (think scale, or diversity of users and scenarios).
Monitoring in production complements continuous delivery
You probably haven't missed all the talk about Continuous Delivery. More and more teams are adopting the practice of rapid software release cycles to improve time to market and spend less time in each development phase.
This isn't a bad thing for QA! It's an opportunity to provide even more business value. By transitioning more testing and monitoring to production environments, a better understanding of the entire system is gained, and more time can be spent on tasks like exploring how a new feature affects user behavior and finding performance bottlenecks.
Alan Page, on Twitter, commented on how production monitoring is sometimes seen as a 'last-ditch' effort to catch bugs. As Alan correctly pointed out, this isn't necessarily true. A thorough test plan will prioritize production monitoring as a complement to pre-production testing and traditional automated testing.
And it's critical to CD. Do you have an experience that makes you think it's used commonly as a last-ditch effort for quality?— Alan Page (@alanpage) July 17, 2017
Continuous delivery also helps test teams provide more value through quicker iteration times. The faster a new fix or feature can be put into production, the faster the effect of that change can be measured.
Testers may be concerned about releasing a feature that hasn't been fully tested into production so fast, but release methods like Canary Release Testing and Exposure Control can be used to slowly roll out the change. This is a great point for QA teams to gather data on the system and how the new change affects quality.
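To give a feel for how a slow rollout can work, here is a minimal sketch of percentage-based exposure. It is one common approach rather than a description of any particular tool: each user is hashed into a stable bucket from 0 to 99, and the feature is shown only to buckets below the current exposure percentage. The function name, the feature name "new-checkout", and the user IDs are hypothetical.

```python
import hashlib


def in_canary(user_id, feature, exposure_percent):
    """Return True if this user falls inside the exposed slice for the feature."""
    key = f"{feature}:{user_id}".encode("utf-8")
    # Hash to a stable bucket in [0, 100); the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) % 100
    return bucket < exposure_percent


# Made-up user IDs, used only to show how exposure grows with the percentage.
users = [f"user-{n}" for n in range(10_000)]
for percent in (1, 10, 50):
    exposed = sum(in_canary(u, "new-checkout", percent) for u in users)
    print(f"{percent}% target -> {exposed} of {len(users)} users exposed")
```

Because the bucket depends only on the feature and user ID, raising the percentage keeps everyone who was already exposed and adds new users on top, which makes before-and-after comparisons in monitoring data easier to reason about.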
As features, fixes, and changes get deployed more frequently, it's important to monitor the behavior of users and how they respond to ever-changing production systems. Remember: testing isn't about bug reports, it's about quality.
Plan to recover from failures, not prevent them
We all agree that bugs in production are inevitable. But that doesn't mean fixing them is free; bugs that make it all the way to end-users are generally the most expensive to fix. This is where building to recover comes into play.
With faster release cycles, testing goals move away from trying to completely eradicate failures up-front and towards building recoverable systems that will identify and alert us to issues as quickly as possible. This means sturdy monitoring and automated tests for production APIs.
John Allspaw has a great blog post on developing your ability to respond. The take is that reaction is more important than prevention. I completely agree with this, and particularly his acronymic description of the technique:
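As a rough idea of what an automated test for a production API might look like, the sketch below requests an endpoint, checks the status code and response time, and raises an alert when either is off. It is a simplified illustration using only the Python standard library: the URL is a placeholder, the one-second limit is an arbitrary assumption, and `alert` defaults to `print` where a real setup would notify a person or an on-call system.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    ok: bool
    status: Optional[int]
    elapsed_ms: float
    reason: str


def http_get(url, timeout=5.0):
    """Return the HTTP status code of a GET request, including 4xx and 5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_check(url, fetch=http_get, max_ms=1000.0, alert=print):
    """Run one check against url and call alert(message) if it fails."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as err:  # connection refused, DNS failure, timeout, ...
        status, reason = None, f"request failed: {err}"
    else:
        reason = "" if status == 200 else f"unexpected status {status}"
    elapsed_ms = (time.monotonic() - start) * 1000
    if not reason and elapsed_ms > max_ms:
        reason = f"slow response: {elapsed_ms:.0f} ms"

    result = CheckResult(ok=not reason, status=status, elapsed_ms=elapsed_ms, reason=reason)
    if not result.ok:
        alert(f"ALERT {url}: {result.reason}")
    return result


# Placeholder URL with a stubbed fetch, so the example runs without network access.
run_check("https://api.example.com/health", fetch=lambda url: 503)
```

Run on a schedule against the real endpoint (drop the `fetch` argument to use `http_get`), a check like this shortens the time between a failure happening and someone knowing about it.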
MTTR > MTBF, for most types of F: http://bit.ly/9teuCF— John Allspaw (@allspaw) November 7, 2010
(Mean time to recovery > mean time between failures, for most types of failures.)
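A quick way to put numbers on this is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The snippet below is only an illustration with made-up figures (100 hours between failures, 2 hours to recover); it is not taken from John's post.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up baseline: a failure every 100 hours, 2 hours to recover.
baseline = availability(mtbf_hours=100, mttr_hours=2)
fewer_failures = availability(mtbf_hours=200, mttr_hours=2)   # failures half as often
faster_recovery = availability(mtbf_hours=100, mttr_hours=1)  # recovery twice as fast

print(f"baseline:        {baseline:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

In this model, halving recovery time buys exactly the same availability as doubling the time between failures, so the question becomes which of the two is cheaper for your team to achieve.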
If the focus is on completely preventing bugs with big-up-front-tests (BUFT), then you'll be burned when a bug does happen in production because you haven't developed your ability to quickly identify and react to the issue.
A blog post by Tim Hinds at Neotys outlines 7 ways to build a robust testing in production practice. Implementing these techniques prepares you to handle errors in production systems and gives you a pipeline to more thoroughly test and learn about your APIs and services.
The mindset accepts that bugs will always make it to production, and if experience tells us anything, it's that this is true. With this approach, it becomes extremely important to make your production systems testable and continuously monitor performance and user behavior.
I've outlined a few reasons why testing and monitoring your APIs and services in production helps you gain a better understanding of your applications, improves quality throughout the entire development process, and gives test and QA teams an opportunity to greatly increase value to the business.
I hope to write more posts on this topic, and want to talk about different ways you can integrate TiP and MiP (testing and monitoring in production) in your test plan. Do you or your team have an experience with this process you want to share? Tweet at me or reach out any time and let's chat. I'd love to hear what you thought about the post!
Lastly, I gathered some awesome resources from industry experts to help you learn more about this approach:
- Synthetic Monitoring (Martin Fowler)
- Testing in production: How we combined tests with monitoring (The Guardian)
- There's no place like production (Microsoft)
- QA in production (ThoughtWorks Radar)
- Techniques to reduce API test failures and improve QA automation (Assertible)
- VIDEO I don't test often...but when I do it's in production (Gareth Bowles @ Netflix)
:: Cody Reichert