The key for solving Rumsfeld problems in modern software
For a long time, ‘testing in production’ was a fancy way of saying that someone is irresponsible. It was a slur for organisations whose testing practices are so bad that critical bugs constantly bite actual users. As our industry is rapidly transforming into an API supermarket glued together by cloud platforms, testing in production is no longer derogatory; it’s becoming necessary. Teams that don’t have an effective way to test in production are now the irresponsible ones. And a key piece in making it all work is a hugely under-utilised gold mine of information that most people already have: production errors.
According to Forbes, more than 80% of enterprise workloads will run on some form of cloud by 2020. Of course, we can dispute the number, but the trend is indisputable. More and more of our work is running on someone else’s platform, integrating with someone else’s APIs. For software quality, this means that there is a much higher risk of someone else changing something completely outside of our control, even after our software goes live.
A few years ago, Netflix’s Chaos Monkey became famous for actively causing and exploring infrastructure risks in production environments. But the moving parts affect more than just infrastructure. The real fun starts when we consider impacts on key business workflows from components outside our control. For example, a recent Chrome release changed how the browser handled cookies from Google’s own authentication service for third parties, and blocked lots of people trying to use MindMup with Google Drive. We had not deployed any changes to production for a few days, so this sudden surge of problems caught us a bit by surprise. Luckily, we had a tool in place to log and investigate client-side production errors, which at least gave us an early warning that something horrible had happened outside our control, and gave us enough information to quickly provide a workaround.
To contain problems outside of our control, it is critical to spot them quickly. For a major outage, people will start screaming at you via social media and sending angry emails. But smaller-scale problems are even trickier. If a similar problem affects just a small percentage of users, it can fly under the radar for a long time, and lead to a lot of user frustration and lost customers. The smaller the edge case, the more difficult it may be to reproduce. Because such problems happen in components outside of our zone of control, without any influence from us, they fall into the tricky category that Donald Rumsfeld famously called unknown-unknowns. Those are problems that we don’t know about, and we’re not even aware of that knowledge gap. The key to dealing with Rumsfeld problems is observability, so we can at least eliminate one category of not knowing. That’s where effective production tracing and error logging come in.
A recent survey of AWS customers published by Cloudability claims that AWS container usage grew almost 250% in a year, and serverless adoption grew almost 700% in the same period. The more our software ends up depending on other people’s platforms and components, the more important it becomes to shine light on unknown-unknown problems. It’s no wonder that there is a surge in monitoring, tracing and logging tools for cloud deployments today. AWS X-Ray became available in April last year, followed by IOPipe in August. Something new pops up in that space almost weekly. For example, Thundra was announced to address a similar problem just last week. This whole space is rapidly evolving.
As an industry, we need to get better at understanding production errors, and we need to do that quickly. Teams need to learn how to mine production errors for insights. This is the modern equivalent of tailing the log file while running an exploratory testing session. The context is different because the events happen in production, and they are generated by users instead of a tester, but the purpose is the same. The tool helps us watch out for strange and unexpected events, and investigate them to discover unknown-unknowns. The big difference between a tester’s log file and production errors is the ratio of signal to noise.
Just consider the last mile of most applications today, a client web browser. This is a component far beyond your control, and it can cause weird and unexpected edge cases. If errors happen there, your servers won’t even know anything about that. You may still get occasional IE6 users, people coming from some fringe mobile browser, or as one angry customer complained to us, the application might be broken when observed on their refrigerator panel. People connecting from an unreliable airplane connection might drop out halfway through an important operation. Automated crawlers might hit the API endpoints in a way that no legitimate user ever would. With MindMup, we occasionally get geniuses who over-configure their privacy blockers and then complain that the app cannot load files from Google Drive, which they explicitly blocked themselves. All those events generate errors in production, but to a large extent, this might just be business as usual.
Modern applications have so many moving parts that stuff goes wrong all the time, and individual glitches are probably not that important. It’s the trends that really matter. If people occasionally have problems connecting to Google Drive from our app, this might be due to network glitches or unsupported browsers, or even an administrator blocking access for certain users. But if the number of people having trouble with Google Drive surges after a new release, it’s pretty clear that something went wrong. Finding that signal in all the noise is the key observability problem for future software quality.
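To make the difference between a glitch and a trend a bit more concrete, here is a minimal sketch of one possible way to flag a surge: compare the error rate in a recent window against a baseline window using a two-proportion z-test. This is an illustration, not a description of what MindMup or any particular monitoring tool does. The names `Window` and `is_surge`, and the threshold of 3.0, are assumptions made up for this example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Window:
    """Counts for one observation period (hypothetical structure)."""
    sessions: int  # total user sessions seen in the period
    errors: int    # sessions that reported the error we care about

    @property
    def rate(self) -> float:
        return self.errors / self.sessions if self.sessions else 0.0


def is_surge(baseline: Window, current: Window, z_threshold: float = 3.0) -> bool:
    """Return True when the current error rate is higher than the baseline
    by more than random noise would plausibly explain.

    Uses a one-sided two-proportion z-test; the threshold is an assumption
    and would need tuning against real traffic.
    """
    if baseline.sessions == 0 or current.sessions == 0:
        return False  # not enough data to say anything
    total = baseline.sessions + current.sessions
    pooled = (baseline.errors + current.errors) / total
    std_err = sqrt(pooled * (1 - pooled)
                   * (1 / baseline.sessions + 1 / current.sessions))
    if std_err == 0:
        return False  # no errors at all, or every session failed in both windows
    z = (current.rate - baseline.rate) / std_err
    return z > z_threshold


# Business as usual: 0.5% of sessions fail before, 0.6% after.
usual = is_surge(Window(sessions=10_000, errors=50),
                 Window(sessions=2_000, errors=12))

# After a bad release: the same error jumps to 3% of sessions.
bad_release = is_surge(Window(sessions=10_000, errors=50),
                       Window(sessions=2_000, errors=60))
```

In this sketch the small wobble from 0.5% to 0.6% is treated as noise, while the jump to 3% is flagged. A real setup would probably track each error type separately and compare like-for-like periods, such as the same hour on the previous day.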
Now that collecting this kind of data is pretty much a solved problem, the next generation of tooling is emerging to take advantage of it. For example, AWS recently announced support for gradual deployment using CodeDeploy. This allows teams to set up pre- and post-deployment conditions, and then let the infrastructure gradually shift traffic between two versions of a system while monitoring for unexpected problems. For example, run a canary deployment for 10% of the users over a short period of time, before moving everyone on to the new version. Another option would be to linearly bring another 10% of the users to the new version every few minutes.
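The two traffic-shifting strategies described above can be sketched as a small simulation. This is not the CodeDeploy API; it only illustrates the idea, and the names `canary_schedule`, `linear_schedule`, `roll_out` and the `is_healthy` callback are hypothetical. The waiting time between steps is left out to keep the example short.

```python
from typing import Callable, List


def canary_schedule(canary_percent: int = 10) -> List[int]:
    """Send a small share of traffic first, then everyone."""
    return [canary_percent, 100]


def linear_schedule(step_percent: int = 10) -> List[int]:
    """Increase the share of traffic by a fixed step until it reaches 100%."""
    return list(range(step_percent, 100, step_percent)) + [100]


def roll_out(schedule: List[int],
             is_healthy: Callable[[int], bool]) -> List[int]:
    """Walk through the schedule, checking health after each shift.

    `is_healthy` stands in for whatever production error monitoring is
    available; it receives the current traffic percentage and returns
    False when errors look abnormal. On failure, traffic goes back to 0%
    on the new version. Returns the sequence of percentages applied.
    """
    applied: List[int] = []
    for percent in schedule:
        applied.append(percent)
        if not is_healthy(percent):
            applied.append(0)  # roll back to the old version
            break
    return applied


# A healthy canary: 10% first, then everyone.
healthy_canary = roll_out(canary_schedule(), lambda percent: True)

# A linear rollout where problems only show up once 30% of users are on it.
failed_linear = roll_out(linear_schedule(), lambda percent: percent < 30)
```

In the second case the rollout stops at 30% and returns to 0%, so most users never see the faulty version. The health check is where the production error data comes in: a surge of errors during a step is the signal to stop.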
In early 2018, when I wrote this, having that kind of a production pipeline puts teams squarely into an early adopters subgroup of another early adopter segment. I’m not even sure what the right name for that is, but two things are clear. The first is that this takes testing in production to a whole new level. The second is that tools like that are becoming more widely available and accessible. Five years ago, IT giants such as Facebook or Google were boasting about being able to do these kinds of deployments. Now, almost anyone can do it, even on a shoestring budget, using open-source tools or (almost) free platforms.
With software running on third-party platforms, such as the modern cloud environments, it’s very difficult to simulate a production-grade system locally, but it’s very easy to spin up multiple copies of the target environment in the cloud. So the whole distinction between the production and testing environments tends to blur a bit. At the same time, some risks that are traditionally very difficult to simulate before deployment are quite easy to identify in production. One good example is peak-time network performance when you do not actually own the network, as it is today with most cloud deployments. Another common example is actual user behaviour. Automated canary deployments protected by automated production error monitoring can make those kinds of risks very easy to contain.
You don’t need a crystal ball to see that software products are getting more interconnected and interdependent. When so much of the risk is outside our zone of control, in a whole host of third-party platforms and components, techniques such as error mining become crucial for observability and operability. So if you’re not already tracking client-side errors, wire up one of the readily available solutions and I guarantee that you’ll be surprised by the results. If you already log this data, consider bringing it into the development pipeline, as it might help you significantly reduce the risk of bad deployments. I feel that we’re just starting to scratch the surface there, and that new tools and techniques will keep emerging to explore the opportunities opened up by having all this data easily available.