Improving testing practices at Google

Mark Striebeck from Google opened XPDay 2009 today with a talk titled “Developer testing, from the dark ages to the age of enlightenment”. Suggesting that software testing is in a renaissance stage today, Striebeck said that the community is now rediscovering “ancient” practices. Most things we use in testing today were invented a long time ago and then forgotten, said Striebeck. In the last fifteen years the community started rediscovering these practices, and people focused on advancing the art rather than on teaching it. As a result, there are many good testing practices out there, but writing testable code is still more an art than a science, according to Striebeck.

Google had a team of Test Mercenaries, who joined different teams for a short period of time to help them with testing. In most cases they could see what was wrong after a few days and started helping the teams, but the effort wasn’t a success: after the mercenaries left, the teams did not improve significantly. Striebeck said that testing “wasn’t effective to teach”. Knowing what makes a good test often relied on personal opinion and gut-feel. Doing what they often do in similar situations, Striebeck said, they decided to collect data. They wanted to work out the characteristics of good tests and testable code, and how to tell in the first place whether a test is effective. They decided to use a return-on-investment criterion: low investment (easy to write, easy to maintain) and high return (alerts them to real problems when it fails). According to Striebeck, Google spends $100M per year on test automation and wanted to know whether it is actually getting a good return on that investment. They estimated that a bug found during TDD costs $5 to fix, which surges to $50 for tests during a full build and $500 during an integration test, and reaches $5000 during a system test. Fixing bugs earlier would save them an estimated $160M per year.
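
To make the arithmetic behind that claim concrete, here is a rough back-of-the-envelope sketch in Python. The per-stage costs are the figures quoted in the talk, but the bug counts are invented purely for illustration; nothing below comes from Google’s actual data:

```python
# Hypothetical sketch of the cost-of-fixing model quoted above.
# Per-stage costs are from the talk; the bug distributions are invented.
COST_PER_BUG = {
    "tdd": 5,            # caught while writing the code, test-first
    "full_build": 50,    # caught by tests during a full build
    "integration": 500,  # caught by an integration test
    "system": 5000,      # caught by a system test
}

def total_cost(bugs_per_stage):
    """Total fixing cost for a given distribution of bugs across stages."""
    return sum(COST_PER_BUG[stage] * count for stage, count in bugs_per_stage.items())

# The same (invented) 10,000 bugs, caught late versus caught early:
caught_late = {"tdd": 1000, "full_build": 2000, "integration": 4000, "system": 3000}
caught_early = {"tdd": 7000, "full_build": 2000, "integration": 900, "system": 100}

print(total_cost(caught_late) - total_cost(caught_early))  # savings from catching bugs earlier
```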

To collect data, they set up a code-metrics store to put all test execution analytics in a single place. Striebeck pointed out that Google has a single code repository, which is completely open to all of its 10,000 developers. Although all systems are released independently (with release cycles ranging from a week to a month), everything is built from HEAD without any binary releases, and the repository receives several thousand changes per day, with spikes of 20+ changes per minute. This results in 40+ million test executions per day from the continuous integration service plus IDE and command-line runs. From these they collected test results, coverage, build time, binary size, static analysis and complexity analysis. Instead of anyone deciding whether a test is good or not, the system observed what people do with tests in order to rank them. They looked at what a developer does after a test fails. If production code was changed or added, the test was marked as good. If people changed the test code when it failed, it was marked as a bad test (especially if everyone was changing it): the test was brittle and had a high maintenance cost. They also measured which tests were ignored in releases and which tests often failed in the continuous build but weren’t executed during development.
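
To make the ranking heuristic concrete, here is a minimal sketch of the kind of rule described above. This is my own illustration rather than Google’s implementation; the event representation and thresholds are invented:

```python
# Minimal sketch of the "what happened after the test failed?" heuristic.
# An illustration of the idea, not Google's actual system; the event
# names and the ratio thresholds are invented.
from collections import Counter

def classify_test(events_after_failures):
    """Classify a test from what developers did after each of its failures.

    events_after_failures: one string per failure, either
    "changed_production_code" (the test caught a real problem) or
    "changed_test_code" (the test itself had to be fixed).
    """
    counts = Counter(events_after_failures)
    total = sum(counts.values())
    if total == 0:
        return "unknown"  # the test never failed, so there is no signal yet
    # Failures that mostly led to production-code fixes mean the test earned its keep.
    if counts["changed_production_code"] / total >= 0.8:
        return "good"
    # Failures that mostly led to edits of the test itself mean it is brittle.
    if counts["changed_test_code"] / total >= 0.5:
        return "bad"
    return "inconclusive"

print(classify_test(["changed_production_code"] * 9 + ["changed_test_code"]))  # good
print(classify_test(["changed_test_code"] * 4 + ["changed_production_code"]))  # bad
```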

The first step was to provide developers with reactive feedback on tests. For example, the system suggested deleting tests that teams spent a lot of time maintaining. They then collected metrics on whether people actually acted on the suggestions. The system also provided metrics to tech leads and managers to show how their teams were doing with tests.
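
Such a reactive suggestion might look something like the following sketch (again my own illustration; the function name, threshold and minimum-failure count are all made up):

```python
# Hypothetical sketch of a reactive suggestion: flag tests whose failures
# mostly led to fixes of the test itself rather than of production code.
# The 80% ratio and the minimum of five failures are invented numbers.
def suggest_action(test_name, fixed_test_itself, caught_real_bug):
    failures = fixed_test_itself + caught_real_bug
    if failures >= 5 and fixed_test_itself / failures > 0.8:
        return f"Consider deleting or rewriting {test_name}: it mostly fails for its own sake."
    return None  # not enough signal, or the test is pulling its weight

print(suggest_action("BigFlakyUiTest", fixed_test_itself=9, caught_real_bug=1))
```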

The second step, which is in progress at the moment, is to find patterns and indicators. As the system has now identified lots of good and bad tests, it is looking for common characteristics among them. Once these patterns are collected, algorithms will be designed to identify good and bad tests, manually calibrated by experts.
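
The talk did not name any particular algorithm or features, but as a purely hypothetical illustration, pattern-finding like this could start from a simple classifier trained on structural features of the tests that the previous step labelled good or bad. The feature set and the choice of scikit-learn’s decision tree here are my assumptions:

```python
# Purely hypothetical sketch of step two: learn which structural features
# separate tests already labelled "good" from those labelled "bad".
# Feature names and the decision-tree choice are assumptions, not anything
# described in the talk.
from sklearn.tree import DecisionTreeClassifier

# Each row: [number of assertions, number of mocks, lines of code, shared-state flag]
features = [
    [1, 0, 12, 0],    # small, focused test
    [2, 1, 20, 0],
    [14, 6, 180, 1],  # sprawling test touching shared state
    [9, 8, 150, 1],
]
labels = ["good", "good", "bad", "bad"]  # labels would come from the observed-behaviour data

model = DecisionTreeClassifier().fit(features, labels)
print(model.predict([[3, 1, 25, 0]]))  # classify a previously unseen test
```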

The third step will be to provide constructive feedback to developers: telling them how to improve tests, what tests to write and how to make the code more testable.

The fourth step in this effort will be to provide prognostic feedback, analysing code evolution patterns and warning developers that their change might result in a particular problem later on.

I will be covering XPDay 2009 on this blog in detail. Subscribe to my RSS feed to get notified when I post new articles.


20 thoughts on “Improving testing practices at Google”

  1. But all of what Google appears to be doing is in the area of test automation and TDD/agile … what about testing?

    Are you equating testing to automation, TDD and agile? Testing is quite different from the latter ones…

    In Google’s terms, testing has traditionally been synonymous with automation and TDD.

    Shrini Kulkarni

  2. Striebeck said that testing “wasn’t effective to teach”. Knowing what makes a good test often relied on personal opinion and gut-feel.

    Ah. So what Striebeck meant was that testing wasn’t effective to teach poorly.

    —Michael B.

  3. Same as Shrini, I wonder where QA fits in this Google presentation.
    By “testing”, do they just mean “unit testing”?
    What about higher-level (functional) testing?

  4. So, what if “QA” doesn’t fit into Google’s definition of testing? So what? Would that be bad? Who cares if Google’s definition of testing is automation and TDD? The right question is: is it working?

    Nobody sets up a QA team for giggles. If Google can be successful without our traditional QA baggage then that’s the right thing for them to do.

    Anyway, I think you’re drawing too many conclusions from this article. It’s an article about what Google is doing to improve their test automation, not about whether or not Google has a traditional QA team. Don’t panic.

  5. Striebeck said that testing “wasn’t effective to teach”. Knowing what makes a good test often relied on personal opinion and gut-feel.

    Is it possible Striebeck never saw a good teacher of testing?

    Is it possible Striebeck hasn’t studied software testing as a life’s career work, and doesn’t understand it?

    Is it possible Striebeck really needs to read “Outliers” by Malcolm Gladwell?

  6. Really enjoyed your talk today at XpDay! A very open and frank discussion flowed from your informative presentation. Is the presentation available online?

  7. First, some of the research confirms Boehm’s work on the cost-to-fix curve. But one point worth finding out is whether they determined when a “defect” was introduced (root cause) and when it was discovered and resolved. That ratio of time/phase/iteration would make it interesting to see how the real cost of “rework” (which is what Boehm’s work is really about) was determined. Does it still hold to the “curve”?

    Second, in my opinion the research points out the “fact” that 1) developers have “lost” the practice of “testing” their code and some of the “old school” methods really do still apply today, 2) testing is not an easy thing to learn and it takes time and experience (OJT) to become good at it, and 3) teaching a “non-tester” is not an easy thing to do and takes experience in teaching itself.

    People learn in different ways, and the methods of teaching and communication vary from person to person. Some people may be good at book study, others need pure hands-on, and others still may need a combination of both. Our “craft” is hard to pass on and does have subjective elements to it. But it can be done with the proper techniques, if they are appropriately determined and applied. And the person being taught needs to be receptive to them as well.

    Third and finally, this research is slanted towards the development groups, and it would be interesting to see how the “formal” QA/Test group and practices fit in this context. I think it is great that the development teams are taking this seriously and working towards reaping the benefits of testing at this level. It will all help to make a better and more stable product, which is what the end user really wants: a system/tool that works reliably so they can get their work done.

    In conclusion, if Google pursues this fully and can then spread it across all parts of its software production, they will do something only a handful of companies have tried and achieved. More power to them, and hopefully this will have a positive ripple effect on the rest of the industry.

  8. Hi Gojko,

    Interesting article. It is good to hear how the power of computer automation is being used to improve the development and reliability of softsystems.

    I understand that Mark Striebeck said that “Google spends $100M per year on test automation and wanted to know whether it is actually getting a good return on that investment. They estimated that a bug found during TDD costs $5 to fix, which surges to $50 for tests during a full build and $500 during an integration test, and reaches $5000 during a system test. Fixing bugs earlier would save them an estimated $160M per year.”

    This seems to me remarkably consistent with the ‘power of ten markup’ between the successive phases of the software development life-cycle as originally noted by Barry Boehm in the TRW study… and subsequently by a variety of other studies at e.g. IBM. I wonder whether Google would be prepared to make some more detailed data available to support this ‘rule of thumb’ for estimating the cost of finding defects ‘late’?

    Also, as someone with 37+ years’ experience in softsystems, it is somewhat amusing to read how an advocate of agile methods has to resort to describing their activities as a series of process steps (i.e. a ‘waterfall’ life-cycle) in order to explain the worth of automating testing for TDD. Irony.

    Best regards,
    Grant (PG) Rule

  9. @JimHazen – It’s my understanding that Google has a very strict root-cause analysis process that they use any time they have production bugs. Your point brings up a great question — do they relate the root causes back to missing tests that are then introduced into the system after the fact?

    @Gojko – Great article, I’m looking forward to the follow up.

  10. Thanks for writing this up. The fact that Google is able to have one code base and release from that with so much code and so many developers ought to eliminate all the whining and excuses I hear that “we are too big to do this” or “we have too many different products” etc.

    I’m confused, though: if he says teaching doesn’t work, then how is imparting the information gained from the analysis of test ROI going to help? What will be different about that? Did he say?

  11. My understanding was that they had problems teaching good testing practices quickly, and gave up on the idea of Test Mercenaries because it didn’t scale to their needs – they couldn’t train a single team effectively in three months, so there was no way they could effectively train thousands of developers.

  12. Pingback: XPDay so far « John McFadyen

  13. Hi everyone,

    Thanks for the various comments (at the conference, but also here).

    First, a clarification: the talk was about test automation (not manual testing) – unit testing, integration testing, system testing…

    About the “Test Mercenaries” program:
    I was actually lucky to have an amazing team of people who are all great engineers, testers and coaches – internal (like Misko Hevery – http://googletesting.blogspot.com/search/label/Misko) or external folks (like Paul Hammant, Russ Rufer, Tracy Bialik, Peter Epstein, Sam Newman just to name a few).
    Here is how most engagements progressed: our Mercenaries joined a team, got deep into their code, and had (usually after a week or two) a good idea of how to refactor the code and make it more testable. And although they did a lot of the refactoring themselves while coaching the team on test practices and writing testable code, in most cases it was not possible to convince the engineering teams to change their code and their design significantly.
    And if I know 2 things about Google engineers, then it’s the following:
    a) they are good engineers who create amazing systems, and
    b) they care – i.e. if they see something “better” they gravitate towards it.
    So, when we could not get the change that we wanted with the people that we had, we had to think about a different approach.
    I then talked to various other people outside of Google and got similar feedback: most engineers today agree that automated testing is a good thing, but not many teams get it right. That’s what led to the project that we started and the talk at XPDay.

    Lisa: yes, it shows that it can be done. But it also comes with a price. Providing development and testing systems that scale to this level is not easy and costs some $$$. But if a company says that “IT is critical to our business” then they should be willing to make that investment. For me (and my teams) it’s a great experience to develop testing tools and infrastructure for a company that gives us the same priority as search, maps, ads…

    Finally: I will start blogging on our testing blog (http://googletesting.blogspot.com/) about the infrastructure that we built and the results that we see (and this public statement hopefully stops me from procrastinating further! :-)

  14. >>> It’s an article about what Google is doing to improve their test automation, not about whether or not Google has a traditional QA team. Don’t panic.

    OK … as long as this post is about “automation” and not (all of) testing … it is fine. I got confused by seeing “Improving testing practices” — probably the post should have been titled “Improving automation practices” …

    I panic when I see people confusing automation with testing.

    Shrini

  15. Pingback: Blogs From The Geeks » Blog Archive » Test Smells and Test Code Quality Metrics - Intermittent insightful nuggets

  16. Pingback: The Computer Scientist as Toolsmith – Fred Brooks at Mark Needham

  17. Hi Gojko,

    In the latest installment of our webcast, This Week in Testing, we talked about your post (actually about your description of Mark’s presentation). I invite you to check it out, comment, and if you really like what you see and want to see more, spread the word.

    Gil Zilberfeld, Typemock

  18. “The third step will be to provide constructive feedback to developers: telling them how to improve tests, what tests to write and how to make the code more testable.”

    For Java code at least, there is no actual need to “make the code more testable”.
    Any piece of Java code can be tested in isolation, given the proper tooling.

    You can (and should) make code simpler and therefore easier to understand, change, and test, but production code shouldn’t be made more complex just to satisfy some mocking tool. In other words, don’t confuse intrinsic testability (i.e., maintainability) with extrinsic testability, which is a function of the capabilities of the mocking/isolation API used.

  19. Pingback: Karine Sabatier » Blog Archive » Links w#52

  20. Pingback: Episode 7 – Survivors | Marcelo Costa
