Automated testing: back to the future

During the last few years, the world of computing evolved far beyond what even the writers of Back to the Future imagined. Fair enough, we still don’t have flying cars, but Marty McFly of 2015 got his termination letter by fax. In the real 2015, he would more likely get a smartwatch notification – and for an added insult, also a small electric shock. All this new technology is creating some incredible opportunities for software testing. Prohibitively expensive testing strategies are becoming relatively cheap, and things that we didn’t even consider automating will become quite cheap and easy, kind of like McFly’s self-tying shoelaces. If your organisation is suffering from the high cost of testing, here are a few tools and trends that may help you outrun the competition.

Three trends currently reshaping the software landscape are important to consider before we jump into the new opportunities. Much like Doc Brown’s DeLorean, they create problems and open up new possibilities at the same time.

The first factor is the surge in platform fragmentation. The variety of devices running software today is just amazing, but it’s also a massive cause of pain for testing. Mobile is the new desktop, but the number of screen resolutions, devices, OS combinations and outside influences makes it horribly difficult to actually test even for the major mobile options. And the situation is about to get a lot more difficult. Analysts predicting the growth of the Internet of Things are throwing around different figures, but they are all in the billions. Gartner last year estimated that there will be 4.9 billion connected ‘things’ on the Internet by the end of 2015, and that the number will jump to 20 billion by 2020. ABI Research thinks that the 2020 number will actually be closer to 40 billion. IDC forecasts that the worldwide market for IoT will be worth around $7.1 trillion in 2020. Regardless of how many billions you subscribe to, one thing is clear. It may be difficult to keep up with platform fragmentation today, but by 2020 it’s going to be insane. If you think testing on various Android phone versions and resolutions is painful, wait until people start trying to use your software on smart toilet paper, just because there is nothing else around to read.

The second factor is cloud hosting crossing the chasm from early adopters to the late majority. I remember reading a theoretical article about utility computing in 2001, announcing that HP will offer data-processing on tap, similar to municipal water or electric grids. Fifteen years ago that sounded more bonkers than being able to travel through time at 88 MPH. Apart from the fact that HP is not really a key player, that article was closer to predicting the future than Robert Zemeckis. Even if the majority of companies won’t shut down their data centres and fire all the sysadmins in the next five years, IDC claims that in 2017 roughly 65% of enterprise IT organisations will have some kind of hybrid half-cloud half-on-premise monster. For testing, of course, this creates a challenge. Companies no longer control all the hardware or the data. There are many more assumptions in play. The industry is moving away from expensive kits that rarely break, to virtualised improvised magicked-up systems running on commodity hardware and likely to blow up at any time. This completely changes the risk profile for software architectures and tests.

The third big ongoing trend is the push towards the front end. Ten years ago, most companies I worked with were content with making the back-end bullet-proof, and doing just a bit of exploratory testing on the front-end to cover the minuscule risk of something not being connected correctly. With mobile apps, single-page web apps, cross-platform app generators using HTML5 and the general commercial push towards consumer applications, the front-end is no longer a tiny risk. The top part runs most of the logic for many applications, which means it carries the majority of risk. The testing pyramid, the cornerstone of test automation strategies in the last 10 years, is getting flipped on its head.

Luckily, these new trends do not just bring challenges, they also create amazing new opportunities.

Changing the balance of expected and unexpected risks

News travels fast on the Internet, especially news about seemingly stupid mistakes. For example, in June this year, it was possible to completely crash a Skype client by sending a message containing the string ‘http://’. In 2013, some clever music lovers found a way to hijack Spotify accounts using Unicode variants of Latin letters. Once such problems are widely reported in the news, claiming that they are unexpected in our software is just irresponsible. Yet almost nobody checks actively for such problems using automated tools. Even worse, four out of the top 25 security errors are caused by input formats and problematic values, all well documented and widely published. Max Wolf compiled a list of 600 known problematic strings and published it on GitHub. I wrote the BugMagnet Chrome extension that makes typical problems with names, strings and numbers available on right click for any input box. It’s 2015, so let’s please stop calling an apostrophe or an umlaut in someone’s name unexpected.

Many teams still hunt for these problems only with manual exploratory tests. And there are plenty of good resources out there that help to speed up manual testing, but why not just automate the whole thing and run it frequently, on all the input fields? It’s not that the tools to check for such problems don’t exist – in fact, they are all too easy to find. Security testing groups, both white and black hat, have long had automated tools that grind through thousands of known problems for hours, trying to find an exploit. Many recent high-profile security hacks were actually caused by easily predictable mistakes, which just require time to detect. The real issue is that executing all those checks takes too long to be viable for every single software change, or even every single release in a frequent deployment cycle. Combine that with the increasing fragmentation of tools and platforms, and the push towards the front-end, and the future doesn’t look very bright.

Yet, there are services emerging that have great potential to change the balance in our favour. With the abundance of cheap processing resources in the cloud, the time required to run a large-scale mutation test using known problematic values, such as Max Wolf’s list, is dropping significantly. For example, Amazon’s AWS Device Farm can run tests in parallel over lots of real devices, reporting aggregated results in minutes. Services such as BrowserStack can allow us to quickly see how a web site looks in multiple browsers, on multiple operating systems or devices. Sauce Labs can run a Selenium test across 500 browser/OS combinations. With the increasing fragmentation of devices, I expect many more such services to start appearing, offering to execute an automated test across the entire landscape of platforms and devices in a flash.

My first prediction for 2020 is this: Combining cloud device farms and browser farms with existing exploratory testing heuristics will lead to tools for quick, economical and efficient input mutation testing across user interfaces. Evaluating all input fields against a list of several thousand known problematic formats or strings won’t be prohibitively expensive any more. This will shift the balance of what we regard as expected or unexpected in software testing. As a result, human testers will get more time to focus on discovering genuinely new issues and hunting for really unexpected things, not just umlauts and misplaced quotes.
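As a rough illustration of input mutation testing against known problematic values, here is a minimal Python sketch. Everything in it is an assumption made for the example: `save_display_name` is a hypothetical stand-in for whatever code sits behind an input field, and `PROBLEM_VALUES` is a tiny hand-picked sample rather than a real published list.

```python
# Hypothetical sketch: feed known problematic values to one input handler
# and report which ones make it fail. All names here are illustrative.

PROBLEM_VALUES = [
    "",                            # empty input
    "O'Brien",                     # apostrophe in a name
    "Müller",                      # umlaut
    "http://",                     # URL scheme with nothing after it
    "a" * 10_000,                  # very long input
    "<script>alert(1)</script>",   # markup where text is expected
]


def save_display_name(value: str) -> str:
    """Stand-in for the code under test; deliberately naive."""
    # Wrongly assumes names are plain ASCII, a typical 'unexpected' bug.
    return value.encode("ascii").decode("ascii").strip()


def run_mutation_checks(handler, values):
    """Call handler with each value and collect (value, error name) pairs."""
    failures = []
    for value in values:
        try:
            handler(value)
        except Exception as error:  # broad on purpose: any crash is a finding
            failures.append((value, type(error).__name__))
    return failures


if __name__ == "__main__":
    for value, error in run_mutation_checks(save_display_name, PROBLEM_VALUES):
        print(f"{value[:30]!r} -> {error}")
```

In a real setup the handler would drive a browser or device session instead of calling a function directly, and the loop is what a cloud farm could spread across many machines.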

The other interesting trend emerging in this space is automated layout testing. Component layouts, especially across different resolutions and responsive design needs, are now almost impossible to test automatically. But there is a new set of tools rising that aims to change that. James Shore’s TDD JavaScript screencast led to the development of Quixote, a unit-testing tool for CSS. Quixote makes it relatively easy to check for actual alignment of UI components, similar to xUnit-style assertions. The Galen Framework helps to specify layout expectations and run the tests using a browser farm, such as BrowserStack or Sauce Labs. The technical capability of executing layout tests is here today, but the tools are mostly developer oriented. It’s still not easy for people who care about the layouts the most – the designers – to describe and run such tests themselves. On the other hand, there is a whole host of easy prototyping tools emerging for designers. PopApp allows people to draw sketches on paper and post-its and then create an interactive app prototype. InVision aims to make web and mobile wireframing easy, integrating design with team workflows, collaboration and project management. Now imagine the future and all those tools combined.

My second prediction for 2020 is this: We’ll see a new set of visual languages for specifying automated tests for layouts and application workflows. Instead of clunky textual descriptions, these tools will use digital wireframes, or even hand-drawn pictures, to specify expected page formats, component orientation and alignment, and progression through an application. Teams will be able to move quickly from a whiteboard discussion to automated tests, enabling a true test-driven-development flow for front-end layouts and workflows.
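For a sense of what an automated layout check boils down to, here is a minimal Python sketch of alignment assertions over component bounding boxes. It does not use the Quixote or Galen APIs; the `Box` type and the helper functions are hypothetical, and the coordinates are assumed to come from a browser automation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rendered position of one UI component, in pixels (hypothetical type)."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def left_aligned(a: Box, b: Box, tolerance: float = 1.0) -> bool:
    """True when both left edges are within `tolerance` pixels of each other."""
    return abs(a.left - b.left) <= tolerance


def is_below(upper: Box, lower: Box, min_gap: float = 0.0) -> bool:
    """True when `lower` starts at least `min_gap` pixels under `upper`."""
    return lower.top - upper.bottom >= min_gap


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share any area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


if __name__ == "__main__":
    logo = Box(left=20, top=10, width=100, height=40)
    menu = Box(left=20, top=60, width=300, height=30)
    assert left_aligned(logo, menu)
    assert is_below(logo, menu, min_gap=10)
    assert not overlaps(logo, menu)
```

Running the same assertions at several viewport sizes is what turns this from a one-off check into a responsive layout test.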

Assisting humans in making testing decisions

The combination of cloud and microservice deployments, together with putting more logic into the front-end and the fragmentation of platforms, makes it increasingly difficult to describe all expectations for large-scale processes. Because of that, completely different testing approaches have started to gain popularity. For example, instead of being able to decide what’s right upfront, a new generation of tools helps humans to quickly approve or reject the results of a test execution. Such tools will have a profound effect on making exploratory testing faster and easier.

One particularly extreme case of this phenomenon is the upcoming generative space game No Man’s Sky. The game developers are creating a procedural universe of 18 quintillion worlds, and instead of being boring and repetitive, each world will be unique. Players will be able to, given several millennia of time, land on each one of those worlds and explore it. Each generated world will be grounded in reality. For example, planets at a specific distance from their suns will have moisture and water. On those planets, the buildings will have doors and windows. Animals will have a bone structure inspired by Earth’s animals. These rules are fed into what the art director Grant Duncan calls the ‘big box of maths’, which then creates variety. Each world is different, but grounded in the same rules. The big box of maths stretches the legs and arms of animals, paints them in different patterns, and so on. With 18 quintillion worlds, how does anyone test this model properly? You can see one world, or two, but each of those single worlds is supposed to be interesting enough for players to explore for years. The solution the developers came up with is pretty much the same as today’s space exploration. They built probes that fly around and take pictures and short videos, and the designers then look at the results to see if things are OK. It’s not perfect, but it significantly speeds up what humans would have to do anyway.

Rudimentary tools that help with this approval-style testing have been around for a while. TextTest allows teams to quickly run a process and then compare log files, text outputs or console dumps to old baseline values. BBC News created Wraith, a tool that efficiently creates screenshots of web sites in different environments or over time, compares them and quickly highlights the differences for humans to approve. Xebia VisualReview highlights visual differences between screenshots and even provides some basic workflow for accepting or rejecting differences. There are already new cloud-based services emerging in this space. DomReactor is a cloud-based service that compares layouts across different browsers. Applitools provides screenshot comparison and visual playback, and integrates with Selenium, Appium and Protractor, and even more enterprise-friendly technologies such as QTP and MS Coded UI. And that’s just the start. Over the next few years, we’ll see a lot more of that.

My third prediction for 2020 is this: there will be a new set of automated cloud services to run probes through user interfaces, and provide a selection of screenshot or behaviour differences as videos for approvals. The current generation of tools might be a bit clunky to use or configure, lacking nice workflows and requiring scripting, but a new generation of tools will be able to move around apps and sites more smartly and easily, and make better decisions on what to offer for approvals.
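The approval workflow behind these tools can be sketched in a few lines of Python. This is only an illustration under assumptions, not how TextTest or any of the other tools is implemented: the `.approved.txt` and `.received.txt` file names and both function names are made up for the example, and a missing baseline is treated as an empty one.

```python
import difflib
from pathlib import Path


def check_against_baseline(name: str, actual: str, baseline_dir) -> list:
    """Compare output with the approved baseline; return diff lines to review.

    An empty list means the output matches and there is nothing to approve.
    """
    baseline_dir = Path(baseline_dir)
    approved = baseline_dir / f"{name}.approved.txt"
    received = baseline_dir / f"{name}.received.txt"
    expected = approved.read_text(encoding="utf-8") if approved.exists() else ""

    if actual == expected:
        received.unlink(missing_ok=True)  # clean up any stale result
        return []

    # Keep the new output so that a human can inspect and approve it.
    received.write_text(actual, encoding="utf-8")
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile=approved.name, tofile=received.name, lineterm=""))


def approve(name: str, baseline_dir) -> None:
    """Promote the last received output to be the new baseline."""
    baseline_dir = Path(baseline_dir)
    (baseline_dir / f"{name}.received.txt").replace(
        baseline_dir / f"{name}.approved.txt")
```

The human decision is the `approve` call: a reviewer looks at the diff and either promotes the new output or treats it as a bug. Screenshot tools follow the same shape with an image comparison in place of the text diff.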

Dealing with things that are impossible to predict

Highly complex systems often suffer from the butterfly effect of small changes. I still remember a panic day of troubleshooting about five years ago, when a seemingly simple change to a database view caused a chain reaction through a set of network services. A user with more than 20000 Facebook friends tried to log in, the system attempted to capture the network of relationships, but the new view didn’t work well for that particular case. The database server decided to run a full table scan instead of using an index, clogged the database pipe and caused the page to seem unresponsive. The user refreshed the page a few times, taking out all the database connections from the login service connection pool. That caused other things to start failing in random ways. It’s theoretically possible to test for such things with automated tools upfront, but it’s just not economically viable in most cases. And not just in the software industry.

In 2011, Mary Poppendieck wrote the fantastic Tale of Two Terminals, comparing the launch of two airport terminals – Terminal 5 at Heathrow and Terminal 3 in Beijing. Despite months of preparations, the UK terminal ended up in chaos on the first day, having to cancel dozens of flights. During the first week, a backlog of 15,000 bags piled up and ended up being shipped to a completely different airport for sorting. The Beijing terminal, however, opened without a glitch. This is because the Chinese authorities organised several drills before the launch, the final one including 8,000 fake passengers trying to check in to 146 flights, requiring 7,000 pieces of luggage to be processed during a three-hour exercise. Now, of course, the cynics were quick to say that the Chinese government can do this because it doesn’t cost them anything to use their army for such experiments, and that running an equivalent drill would be prohibitively expensive in the UK. Yet the UK parliamentary enquiry revealed that the owners of T5 engaged 15,000 volunteers in 66 trials prior to the opening of the terminal. But they weren’t monitoring the right things. Similarly, most software stress tests and load tests today involve predictable, deterministic and repeatable scripts. Although such tests don’t necessarily reflect real-world usage, and may not trigger the same bottlenecks as thousands of people who are trying to achieve different things at the same time, designing and coordinating more realistic automated tests just costs too much.

The most common solution today is to gradually release to production. For example, Facebook first exposes features to a small number of random users to evaluate if everything is going smoothly, then gradually extends the availability and monitors the performance. Lots of smaller organisations rely on services such as Google Analytics with A/B deployments to evaluate trends and figure out whether something unexpectedly bad happened.
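One common way to implement a gradual release is to place each user in a stable bucket and enable the feature for buckets below the current rollout percentage. The Python sketch below shows that idea only; it is not a description of how Facebook or any particular service does it, and the function names and the feature name in the usage note are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for the given feature."""
    # Hashing feature and user together gives each feature its own split,
    # so the same users are not always the first to get every change.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, percentage: int) -> bool:
    """True when the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < percentage
```

Raising the percentage from, say, 1 to 10 to 100 only ever adds users, so anyone who already saw the feature keeps it while monitoring continues. A call might look like `is_enabled("user-42", "new-login", 10)`.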

Theoretically, crowdsourcing should enable us to reduce the cost of such tests before production, and engage real humans to behave in unexpected ways. The reach of the Internet is far and wide, and there are lots of idle people out there who can trade a bit of their time for peanuts. But coordinating those people is a challenge. Amazon started offering the Mechanical Turk computer interface to humans performing micro-tasks almost a decade ago. And there are some niche testing crowd-sourcing services already emerging. For example, UserTesting enables scheduling easy hallway-type usability testing, recording videos and comments during testing sessions. But crowd-sourced testing hasn’t really taken off for the same reason as the T5 launch failed. It’s difficult to look for the right things. Or more precisely, it’s difficult to process thousands of test reports and conclude anything useful. There is just too much data, and the signal-to-noise ratio isn’t that good. Application state often depends on things that are difficult to replicate, and that means that confusing test reports would just take too much of our time to consume.

Another emerging trend might turn the situation in our favour. Crash reports, made ubiquitous with mobile apps, have now become the norm for desktop applications and web sites as well. Things will go wrong, so when they do, it’s important that people in charge quickly know about it. Even more, it’s crucial to be able to sort out relevant information from accidental flukes. Problems can happen in end-users’ systems for a variety of reasons, from network glitches, through unrelated third-party software, to malice and stupidity. And the more popular an application, the riskier it is to assume everything will be OK, but the more difficult it is to actually separate signal from noise with crash analytics. As an industry, we’ve learned to collect historical user interaction, network events, and a lot of other inputs to help with crash analytics over the last few years. And that propagated back into testing. In the book How Google Tests Software, Whittaker and colleagues talk about BITE – Browser Integrated Test Environment – a browser extension that collects a ton of telemetry and records all user interactions to make it easy for developers to act on a bug report. This tool was originally developed for Google Maps, where the application state depends on an ever-changing dataset and a ton of user actions such as pinching, sliding and zooming. Too many variables to control easily. BITE was actually open sourced and out there for a while, but the public version is now deprecated and no longer maintained. But combining something like that with the Mechanical Turk could make crowd-sourcing testing easy to consume. A whole new set of tools and services is emerging to provide operational awareness and help with trend reporting. Two nice examples are HotJar, an analytics service that combines trend analysis, heatmaps and user feedback, and Track.JS, a cloud-based error aggregation and reporting system for JavaScript that collects a ton of data to help with root-cause analysis.
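A very simple way to separate signal from noise in crash reports is to group them by a signature and count how many distinct users hit each group. The Python sketch below assumes a hypothetical report format with `user`, `error` and `top_frame` keys; real services collect far richer data and use more elaborate grouping.

```python
from collections import defaultdict


def group_crashes(reports, min_users: int = 2):
    """Group crash reports by signature and keep those seen by enough users.

    Returns (signature, distinct user count) pairs, most widespread first.
    """
    users_by_signature = defaultdict(set)
    for report in reports:
        # Assumed signature: error type plus the top stack frame.
        signature = (report["error"], report["top_frame"])
        users_by_signature[signature].add(report["user"])

    groups = [(signature, len(users))
              for signature, users in users_by_signature.items()
              if len(users) >= min_users]
    # Sort by user count (descending), then by signature for stable output.
    return sorted(groups, key=lambda group: (-group[1], group[0]))
```

Counting distinct users rather than raw reports keeps one person refreshing a broken page twenty times from looking like a widespread problem.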

My fourth prediction for 2020 is this: a new class of services will combine crowd-sourced coordination with powerful telemetry, analytics and visual session comparisons, to enable testing for behaviour changes and detecting unexpected problems. Such services will enable us to request an army of users to poke around, then provide a good noise-to-signal filter, to support quick session review and decision making. The new services will also record a ton of useful information about individual crowd-sourced sessions to help with analysis and reproducing state. These new services will make it cheap to schedule sessions with real humans, real devices, at statistically significant volumes, that are easy to control and coordinate. Imagine the combination of Mechanical Turk, Applitools and HotJar, recording user interactions and network traffic, and everything else you need to reproduce any particular testing session quickly.

Some crowd-sourcing services will no doubt claim that they have real testers on stand-by somewhere half-way around the world, for a fraction of the price, but commoditising testers is not the real value of my premise. Bleak results with offshore testing have hopefully already shown that this is a false economy to most companies by now. I’d really love to see value-added services, that will allow a small number of expert testers to coordinate and direct large crowds and conduct experiments. Think about instant focus groups, or smoke testing as a service. A tool will schedule and coordinate this for you, and you just get the results back in 30 minutes. And it will be cheap enough so you can run it multiple times per day.

Test automation with artificial intelligence

At the moment, most test automation is relatively unintelligent. Automation makes things faster, but humans need to decide what to test. Yet, machines have gotten a lot smarter. Using statistics to predict trends and optimise workflows has been around for at least a hundred years, but it’s really taken off in the last few years. In 2012, big data made big news when the US retail chain Target apparently guessed that a teenage girl in Minnesota was pregnant even before her family knew. Models, tools and skills are rapidly evolving in this area.

Even if you’re not a government-funded nuclear research institute, cheap cloud processing power and open-source tools will make machine learning and big data analytics accessible. Google engineers recently open-sourced TensorFlow, a library developed to conduct machine learning and deep neural network research for the Google Brain. Microsoft recently made its Distributed Machine Learning Toolkit open source. Such systems are re-shaping how YouTube offers the next video to play, how Netflix recommends titles on the homepage, and how Amazon offers related items to buy. Combined with the ton of analytic data collected by application telemetry today, this could be a powerful source of insight for testing. Combine production analytics, version control changes and bug reports and let a machine learning system loose on it. It may not be able to explain why a problem is likely to happen somewhere, but it should be pretty good at guessing where to look for issues.
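Even without machine learning, a crude version of ‘guess where to look’ can be sketched by combining change history with bug history. The Python example below is a plain weighted count, named as such, and not a learning system; the input shapes, the `bug_weight` value and the file names in the usage note are all assumptions for illustration.

```python
from collections import Counter


def rank_risky_files(changed_files_per_commit, files_per_bug_report,
                     bug_weight: float = 3.0, top: int = 5):
    """Rank files by how often they changed and how often bugs touched them.

    Both inputs are lists of lists of file names. Returns (file, score) pairs.
    """
    churn = Counter(name for commit in changed_files_per_commit
                    for name in commit)
    bugs = Counter(name for report in files_per_bug_report
                   for name in report)

    # Counter returns 0 for missing keys, so files seen in only one
    # source still get a score.
    scores = {name: churn[name] + bug_weight * bugs[name]
              for name in set(churn) | set(bugs)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]
```

Given commits touching `login.py` twice and `db_view.sql` once, and two bug reports that both mention `db_view.sql`, the view ends up at the top of the list. A trained model would replace the fixed weight with something learned from the same data.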

My fifth prediction for 2020 is this: machine learning and AI tools will emerge to direct exploratory testing. Imagine adding a button on a page, and a helpful AI proposing that you should check how it impacts an obscure back-office report implemented five years ago. Or even better, add crowd-sourcing and coordination services to the equation, and the AI offering to automatically schedule a usability test for a particular flow through the site. Alternatively, combine machine learning conclusions with cloud-based mutation tests, to narrow down the area for automated mutation testing. Even better, machine learning could be used to predict new problematic values to test and add for mutation experiments. Imagine changing a piece of middleware, and pushing the source code up to the version control system. As part of the CI build, an AI model could come up with a hypothesis that ‘http:’ without the rest of the URL could crash an app, run a data-grid mutation test to prove it, and report back two minutes later, much as unit test results are reported today. Wouldn’t that be powerful? And please, if you ever decide to create an AI proposing exploratory tests, just for the sake of good old times, make it look like Clippy.
