A lot of ideas in modern software development come from Zero Quality Control, Toyota’s approach to achieving product quality. Some things, it seems, have been a bit lost in translation. Here’s what ZQC can teach us about how to write better software.
Zero Quality Control takes its name from the idea that quality does not come from controlling and sorting out defects at the end, but from building it in up front. In the words of Philip Crosby, ‘Quality has to be caused, not controlled’. Toyota’s solution consists of a design approach that aims to create mistake-proof products, early warnings and inexpensive successive tests at the source.
Poka-Yoke (mistake-proofing) is a design approach which seeks to prevent problems by making products less prone to misuse. ‘A Brief Tutorial on Mistake-proofing, Poka-Yoke, and ZQC’ by John R. Grout and Brian T. Downs is a good introduction to ZQC, and contains some very interesting examples from everyday life.
Computer cables are often offered as a common example of Poka-Yoke design. Video cable plugs are asymmetric, so it’s obvious how to plug them in, and virtually impossible to turn them the wrong way. Of course, with the right application of brute force, cables can still be plugged in wrongly, but the point is that it’s much easier to use them correctly. Grout and Downs also mention an interesting example of fire alarms which cannot be fitted to the ceiling unless a battery is installed. In ‘Documentation for Telepathic Developers’, I offered a translation of this concept into the software world: ‘It should take less code to use the API properly than to unintentionally misuse it’. Assembling components the Poka-Yoke way must be straightforward, and it should be genuinely hard to do it wrongly.
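The ‘battery before mounting’ idea translates directly into API design: make it impossible to construct a component in an unusable state. Here is a minimal sketch in Python; the SmtpMailer class and its parameters are purely illustrative, not a real library.

```python
# A hypothetical mailer that cannot be created without the settings it
# needs -- the API equivalent of a smoke alarm that cannot be mounted
# without a battery. All names here are illustrative.

class SmtpMailer:
    def __init__(self, host: str, port: int):
        if not host:
            # Refuse to construct a half-configured, unusable object.
            raise ValueError("SMTP host is required")
        self._host = host
        self._port = port

    def send(self, recipient: str, body: str) -> str:
        # For this sketch, describe the action instead of really sending.
        return f"sent to {recipient} via {self._host}:{self._port}"


# Correct use is the short, obvious path; misuse fails immediately.
mailer = SmtpMailer("mail.example.com", 25)
print(mailer.send("ops@example.com", "hello"))
```

Because the required settings are constructor arguments, the compiler or runtime stops misuse at the earliest possible moment, instead of letting a half-configured object fail mysteriously later.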
Another example of Poka-Yoke design is a typical elevator: when overcrowded, the elevator will refuse to work, and a light signal will start blinking, notifying us that something is wrong. Our software classes should follow this example. When something is wrong, they should refuse to proceed, but they should also not die silently. An elevator’s light signal clearly tells us that something is wrong, and so should our components. Without such a signal, we might think that there is a bug in the component, and start working around it. Exceptions can be an effective way of giving more information, but the signal should be clear and unambiguous, in order not to mislead users or client-developers.
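The elevator rule can be sketched in a few lines of Python. The Elevator class, its capacity and the exception name are illustrative, but the shape is the point: a specific, loudly-named exception is the blinking light.

```python
# A sketch of the 'elevator' rule: refuse to proceed and signal clearly
# when a limit is exceeded, instead of failing silently later.
# The class, capacity and exception name are illustrative.

class OverloadError(Exception):
    """Unambiguous signal: the component refuses to operate."""

class Elevator:
    def __init__(self, capacity_kg: int = 600):
        self.capacity_kg = capacity_kg
        self.load_kg = 0

    def board(self, weight_kg: int) -> None:
        if self.load_kg + weight_kg > self.capacity_kg:
            # Loud, specific failure -- the blinking light.
            raise OverloadError(
                f"load {self.load_kg + weight_kg} kg exceeds "
                f"capacity {self.capacity_kg} kg")
        self.load_kg += weight_kg
```

A generic `return None` or a logged warning would let the caller carry on in a broken state; the dedicated exception makes the problem impossible to mistake for a bug in the component itself.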
Another very interesting example of ingenious Poka-Yoke design given by Grout and Downs is the Bathyscaphe Trieste, a deep-sea submersible used to explore the ocean bed. In case of an electrical failure, it would practically be doomed, along with anyone inside the vehicle. So, the ballast silos are held by electromagnets, and in case of an electrical failure the craft immediately starts rising to the surface. Likewise, software must be designed to prevent a complete crash, even in the face of system failure. Auto-save features are a good example. It’s not often that the power gets cut, but when it does, our users will surely appreciate that we saved most of their work.
Poka-Yoke test devices were designed to inexpensively check whether a product was defective. These tests were inexpensive, repeatable and reliable, and they could be given to workers on the manufacturing line to check a product straight after they made it.
Checking at the source, rather than at the end, was one of the most important ideas described by Shigeo Shingo in his book on Zero Quality Control. Mary Poppendieck often sums up these ideas as ‘inspection to find defects is waste, inspection to prevent defects is essential’. That is basically what test-driven development is all about. Yet, some of the original ideas of ZQC often don’t make it into automated tests. Understanding the basic principles of ZQC and applying them while writing tests can significantly improve the effect of TDD.
Poka-Yoke tests worked because they were inexpensive. With software, expense translates directly into developer time. We have to run unit tests for every code change, and they must not start getting in the way. So unit tests have to run on developer machines, and they have to be as fast as lightning. From my experience, any unit test suite that runs longer than a minute is more an obstacle than a helpful tool. People will start skipping tests, which pretty much defeats the whole point of having them. This does not mean that we should not write tests that run longer – just that people should not be made to run them every time.
What works quite well for us is to put fast and slow tests into different suites (or even different tools), and run only the fast suites on every change. A build automation server, like Cruise Control, should run the slower tests every couple of hours and let us know when something is broken. The server should also run the quicker tests, just to make sure that unchecked code did not find its way into the source code repository (in an ideal world, this will surely never happen, but reality often surprises idealists). I suggest running the quicker tests after any code change in the repository, to get the fastest feedback.
Good tests must be deterministic – a failure should point to a defect or to work not being done. Tests which pass or fail depending on external factors, unrelated to the code under test, are a waste. Spending five minutes to check whether a test failed because there really was a bug defeats the whole point of test automation. Even worse, people might start overlooking genuine problems, because they will come to expect the suite to fail with a few errors. A red light should mean ‘stop!’, not ‘stop, maybe!’.
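A common source of non-determinism is code that reads the system clock directly. A minimal sketch of the fix, assuming an illustrative ‘happy hour’ business rule, is to inject the time instead of calling `datetime.now()` inside the rule:

```python
# A sketch of making a time-dependent rule deterministic: the current
# time is passed in, so tests control it completely. The 'happy hour'
# rule itself is illustrative.

from datetime import datetime

def is_happy_hour(now: datetime) -> bool:
    # The rule never consults the system clock itself.
    return 17 <= now.hour < 19
```

A test can now pin the time to any value and will pass or fail for exactly one reason: the rule being right or wrong.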
Good tests must allow successive checking, to make sure that the item is correct several times during production, and to allow people to re-test it when a problem is fixed. Tests which are not easily repeatable typically depend on external systems, or on some longer setup. For example, a test may depend on a payment provider to verify the transaction. Or, it may involve a database, which has its own rules about integrity constraints or duplicated data.
Unit and component tests which depend on setting up external systems typically signal that the code under test should be split into a part that encapsulates a particular business rule, and a part that interacts with an external system. Doing so will not just make the code easier to test, but will lead to a cleaner design and easier code evolution and maintenance in the future. The tests can then focus on the business rule, and stay repeatable and reliable. Communication with the external system should be moved to integration tests, which do not have to be executed on every code change.
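That split might look like the sketch below. The discount rule, the PaymentGateway interface and the stub are all hypothetical names chosen for illustration; the point is that the business rule is pure and the external call sits behind a small seam.

```python
# A sketch of separating a business rule from external-system access.
# PaymentGateway, StubGateway and the discount rule are illustrative.

class PaymentGateway:
    """Wraps the external payment provider; unit tests replace it."""
    def charge(self, amount: float) -> bool:
        raise NotImplementedError("the real implementation calls the provider")

def discounted_total(amount: float, is_loyal_customer: bool) -> float:
    """Pure business rule: no external dependencies, fully repeatable."""
    return round(amount * 0.95, 2) if is_loyal_customer else amount

def process_order(amount: float, is_loyal_customer: bool,
                  gateway: PaymentGateway) -> bool:
    # The only interaction with the outside world goes through the seam.
    return gateway.charge(discounted_total(amount, is_loyal_customer))

class StubGateway(PaymentGateway):
    """In-memory stand-in used by fast unit tests."""
    def __init__(self):
        self.charged = None
    def charge(self, amount: float) -> bool:
        self.charged = amount
        return True
```

Unit tests exercise `discounted_total` and `process_order` with the stub on every change; a separate integration test, run less often, checks the real gateway implementation against the provider.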
With data-driven tests, if duplication can cause problems for successive checks, clean up the data after the test when possible. The database testing library DbFit, for example, automatically rolls back the active database transaction at the end of the test. If similar cleanup is not possible, the next best thing is to restore the database from a clean backup before the tests run. Tests involving external storage, like a database, typically fail the ‘fast as lightning’ criterion anyway, so they should not be executed on every code change in any case. Restoring a base version before test runs every couple of hours should therefore be fine.
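The rollback approach can be demonstrated with an in-memory SQLite database; the table and test body are illustrative, and this is a sketch of the idea rather than how DbFit itself is implemented:

```python
# A sketch of rollback-based cleanup: each test inserts its data inside
# a transaction and rolls it back at the end, so successive runs always
# start from a clean state. The table and data are illustrative.

import sqlite3

def run_test_in_transaction(conn: sqlite3.Connection) -> int:
    cur = conn.cursor()
    # The INSERT implicitly opens a transaction in Python's sqlite3.
    cur.execute("INSERT INTO accounts(name) VALUES ('test-user')")
    cur.execute("SELECT COUNT(*) FROM accounts")
    count = cur.fetchone()[0]
    conn.rollback()  # undo the test data for the next run
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts(name TEXT)")
conn.commit()

assert run_test_in_transaction(conn) == 1
assert run_test_in_transaction(conn) == 1  # repeatable: rollback cleaned up
```

Because the rollback removes exactly what the test inserted, the test can run any number of times without tripping over duplicate data or integrity constraints.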
Zero Quality Control and the Toyota Production System in general have been taken up by a lot of other industries, in the form of ‘lean initiatives’. Mary and Tom Poppendieck are the most active promoters of lean initiatives in the world of software, and their books on Lean Software Development offer some very interesting advice on organising and managing a software shop the lean way. If you are interested in finding out more about how TPS and ZQC can help with software development, here are some links to get you started on your journey: