Mark Striebeck from Google opened XPDay 2009 today with a talk titled “Developer testing, from the dark ages to the age of enlightenment”. Suggesting that software testing is today in a renaissance, Striebeck said that the community is now rediscovering “ancient” practices. Most things we use in testing today were invented a long time ago and then forgotten, said Striebeck. Over the last fifteen years the community started rediscovering these practices, but people focused on advancing the art rather than on teaching it. As a result, there are many good testing practices out there, but writing testable code is still more of an art than a science, according to Striebeck.

Google had a team of Test Mercenaries, who joined different teams for a short period of time to help them with testing. In most cases they could see what was wrong after a few days and started helping, but the effort wasn’t a success: once the mercenaries left, the teams did not improve significantly. Striebeck said that testing “wasn’t effective to teach”; knowing what makes a good test often relied on personal opinion and gut feel. Doing what Google often does in similar situations, Striebeck said, they decided to collect data. They wanted to figure out the characteristics of good tests and testable code, and how to tell in the first place whether a test is effective. They decided to use a return-on-investment criterion: low investment (easy to write, easy to maintain) and high return (alerts to real problems when it fails). According to Striebeck, Google spends $100M per year on test automation and wanted to know whether it is actually getting a good return on that investment. They estimated that a bug found during TDD costs $5 to fix, which surges to $50 for tests during a full build and $500 during an integration test, and reaches $5000 during a system test. Fixing bugs earlier would save them an estimated $160M per year.
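To put those numbers in context, here is a back-of-the-envelope sketch in Python of the cost model Striebeck described. Only the per-bug costs come from the talk; the bug counts and the two scenarios are purely illustrative assumptions.

    # Per-bug fix costs quoted in the talk, by the stage where the bug is found.
    COST_PER_BUG = {
        "tdd": 5,           # caught while writing the code, test-first
        "full_build": 50,   # caught by tests in a full build
        "integration": 500, # caught by an integration test
        "system": 5000,     # caught by a system test
    }

    def total_cost(bugs_per_stage):
        """Sum the cost of fixing each bug at the stage where it was found."""
        return sum(COST_PER_BUG[stage] * count
                   for stage, count in bugs_per_stage.items())

    # Hypothetical distribution of where bugs are caught today ...
    late = {"tdd": 1000, "full_build": 2000, "integration": 1500, "system": 500}
    # ... versus the same bugs being caught earlier in the pipeline.
    early = {"tdd": 3000, "full_build": 1500, "integration": 500, "system": 0}

    print(total_cost(late) - total_cost(early))  # estimated saving from earlier detection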

To collect data, they set up a code-metrics store to put all test execution analytics in a single place. Striebeck pointed out that Google has a single code repository, which is completely open to all of its 10,000 developers. Although all systems are released independently (with release cycles ranging from a week to a month), everything is built from HEAD without any binary releases, and the repository receives several thousand changes per day, with spikes of 20+ changes per minute. This results in more than 40 million test executions per day from the continuous integration service plus IDE and command-line runs, from which they collected test results, coverage, build times, binary sizes, static analysis and complexity analysis. Instead of anyone deciding whether a test is good or not, the system observed what people do with tests in order to rank them. They looked at what a developer does after a test fails. If production code was changed or added, the test was marked as good. If people changed the test code when it failed, it was marked as a bad test (especially if everyone was changing it): the test was brittle and had a high maintenance cost. They also measured which tests were ignored in releases and which tests often failed in the continuous build but weren’t executed during development.
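How such a ranking could work is easiest to see in code. The following is a minimal sketch of the heuristic as Striebeck described it; the event format, field values and the 0.5 threshold are assumptions made for illustration, not details from the talk.

    from collections import Counter

    def rank_tests(failure_events):
        """failure_events: iterable of (test_name, changed_kind) pairs, where
        changed_kind records what was edited after the failure:
        'production' (the code was fixed) or 'test' (the test itself was edited)."""
        good, bad = Counter(), Counter()
        for test, changed in failure_events:
            if changed == "production":
                good[test] += 1   # the failure led to a real fix: a useful test
            elif changed == "test":
                bad[test] += 1    # the failure led to editing the test: brittle

        verdicts = {}
        for test in set(good) | set(bad):
            total = good[test] + bad[test]
            verdicts[test] = "good" if good[test] / total >= 0.5 else "brittle"
        return verdicts

    events = [("LoginTest", "production"), ("LoginTest", "production"),
              ("DateFormatTest", "test"), ("DateFormatTest", "test")]
    print(rank_tests(events))  # {'LoginTest': 'good', 'DateFormatTest': 'brittle'}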

The first step was to provide developers with reactive feedback on their tests. For example, the system suggested deleting tests that teams spent a lot of time maintaining, and then collected metrics on whether people actually acted on those suggestions. The system also provided metrics to tech leads and managers to show how their teams were doing with tests.
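As a rough illustration of this kind of reactive feedback, the sketch below flags tests whose maintenance edits dominate the real fixes they have prompted, and measures whether the suggestion was followed. The statistics, field names and threshold are invented for the example.

    def suggest_deletions(test_stats, ratio=3.0):
        """test_stats maps a test name to counts of 'test_edits' (times the test
        itself was changed after failing) and 'production_fixes'."""
        return [name for name, s in test_stats.items()
                if s["test_edits"] > ratio * max(s["production_fixes"], 1)]

    def acceptance_rate(suggested, deleted):
        """Fraction of suggested deletions the team actually carried out."""
        suggested = set(suggested)
        return len(suggested & set(deleted)) / len(suggested) if suggested else 0.0

    stats = {"FlakyUiTest": {"test_edits": 12, "production_fixes": 1},
             "ParserTest":  {"test_edits": 1,  "production_fixes": 9}}
    to_delete = suggest_deletions(stats)
    print(to_delete, acceptance_rate(to_delete, deleted=["FlakyUiTest"]))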

The second step, which is in progress at the moment, is to find patterns and indicators. As they have now identified lots of good and bad tests, the system is looking for common characteristics among them. Once these patterns are collected, algorithms will be designed to identify good and bad tests, and these will be manually calibrated by experts.
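A very simple way to start looking for such indicators is to compare feature averages between the tests already labelled good and bad, as in the sketch below. The features shown (assertion count, number of mocks, test length) are illustrative assumptions, not the actual signals Google uses.

    from statistics import mean

    def indicator_summary(tests):
        """tests: list of dicts with a 'label' ('good' or 'bad') plus numeric features."""
        by_label = {"good": [], "bad": []}
        for t in tests:
            by_label[t["label"]].append(t)
        features = [k for k in tests[0] if k != "label"]
        return {f: {label: mean(t[f] for t in group)
                    for label, group in by_label.items() if group}
                for f in features}

    sample = [
        {"label": "good", "assertions": 4, "mocks": 1, "lines": 20},
        {"label": "good", "assertions": 3, "mocks": 0, "lines": 15},
        {"label": "bad",  "assertions": 1, "mocks": 6, "lines": 90},
    ]
    print(indicator_summary(sample))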

The third step will be to provide constructive feedback, telling developers how to improve their tests, what tests to write and how to make their code more testable.

The fourth step in this effort will be to provide prognostic feedback, analysing code evolution patterns and warning developers that their change might result in a particular problem later on.