Lambeth council in south London has a historic reputation for controversial policies. At one point, it banned the police from using council facilities. In 1985, it refused to set a budget in protest against government policies. After an audit, the leader and 30 other councillors had to repay the losses personally and were banned from holding political office for five years. For the conservative media of the 1980s, Lambeth was the prime example of the ‘loony left’, and it looks as if it is well on track to regain that reputation. Lambeth recently closed two libraries to save money, but ended up paying three times more to secure the buildings than it would have cost to keep them open.

The councillors have learned a lesson well known to anyone working in software delivery. On the chart mapping unexpected and lethal dangers, hidden maintenance costs are way up there next to the Spanish Inquisition. Many of the good design and coding techniques we now take for granted were invented to keep maintenance costs in check. Test automation is a good example. However, teams with great development practices often meet their Lambeth Library moment when massive test suites spiral out of control. Instead of rocket boots that speed up delivery, large suites of tests often end up being more like cement shoes.

There are many reasons why tests fly under the radar until the maintenance starts costing too much, but chief among them is that anything related to testing is often treated as second-class. Far too often, teams treat automation code as ‘just test code’, so people don’t even think about applying good development and design practices to it. Because tests are not client facing, teams don’t bother keeping them clean and well organised.

There are plenty of good resources on writing maintainable code, but good information on organising large test sets from a documentation perspective is scarce. So here are my top five tips on how to avoid the Lambeth moment.

Organise by function, not by sequence of implementation

Many teams end up organising their tests along the same lines as their work items. All the tests resulting from a user story are grouped and attached to that story, while tests for the next story end up in a different bucket. A common way to record such tests is with subtasks in a task management system, proving once again that JIRA is the right solution for the wrong problem. This is typical for large organisations with separate groups of test automation specialists, who will at some point, cross their hearts and hope to die, magically automate all of it. Never mind queueing theory or the ever-increasing backlog. It’s also common for manual test scenarios, exploratory charter sessions, and anything else that JIRA will take (and it can, of course, take more crap than a supermassive black hole).

This system works reasonably well for the first few months, and then falls apart. The problem with this approach is that iterative delivery causes teams to frequently revisit and rewrite old functionality. When people extend, reduce or change features, the tests for those features have to follow. Coming up with new tests for slightly changed functions all the time is not really productive, so testers copy old artefacts and merge them with new work items.

However, stories and features aren’t exactly well aligned. A single user story can require changes to several features, and a single feature might be delivered through dozens of stories. That’s why organising tests by stories leads to shotgun surgery. When tests are organised by sequence of implementation, it becomes increasingly difficult to know which tests are still relevant, and which need to be amended or thrown away. Small iterative changes often affect only fractions of previous features, and it’s not easy to decide which parts still apply. This friction does not show up at first, but bites really badly a few months in. Early on, copying and merging all the relevant items is easy. With more moving parts, discovering all the relevant information becomes less like ticking off a checklist and more like trying to assemble a puzzle in the dark, with the ominous sound of a vacuum cleaner approaching. Similarly, protecting against unforeseen impacts and functional regression isn’t such a big deal early on. When there are only a few features, people can test them easily. Once the features build up, preventing regression becomes like trying to vacuum that same dark room, with someone constantly hiding pieces of trash and putting them back.

A good way to look at this is to think of a bank statement. A story creates a change to the system, and the test artefacts describe that particular transaction. Imagine that each month your bank statement comes with all the purchases you’ve ever made, but without the current balance. The only way to tell how much you can spend would be to add everything up from the start of time. Not really the ideal thing to do every time you want to buy something. Plus, it’s horribly error-prone. That’s exactly what people make their teams do by organising tests by sequence of implementation.

To avoid this problem, start organising test artefacts by function, not by sequence of implementation. Some typical examples of that kind of organisation are breaking test suites down by business areas and user work-flows, by feature sets, or even by technical components, such as user interface screens. Make sure that the catalogue is easy to navigate, so you’ll be able to find everything about a particular feature quickly, regardless of how many user stories touched it.

Similar to card transactions on a statement, user stories are only relevant for a short period before and after implementation. So don’t worry about keeping the link between tests and stories too tight. For stories currently in play, it’s often enough to just link from the story tracking system to the relevant tests in the feature catalogue. For an even easier approach, tag or mark the test artefacts with the story number or ticket ID, and then link from the ticket tracking tool to the search results page for that particular tag. This will ensure that you can quickly discover all the artefacts relevant to a story.

As a bonus tip, please keep your test artefacts in a version control system, even if they are intended for manual execution. Avoid sticking stuff into wikis and shared file folders, and especially avoid ticket management systems. Sure, that’s easy early on, but think about the future. How will you branch, merge, and roll forwards and backwards later when you need to support multiple versions? How will you audit the changes once your team grows? Version control systems have only been around for forty-six years; isn’t it high time to start using them instead of abusing ticketing tools?

Use cross-links for documentation, not for execution

In a large set of tests, it’s unavoidable to have some overlap of responsibility. In order to test the back-office payment reports, we’ll need to log a few user purchases first. So it’s naturally tempting to run the purchasing tests first and just reuse the results. Plus, linking the two sets of tests allows us to avoid repeating documentation related to the purchases. Later on, when testing fraud alerts, we can reuse the same purchase tests; we just need to add a few more edge cases.

Execution chains are common for both manual and automated tests, and they look like a win-win situation in theory, but in reality they are a ticking time bomb. The chains become a precise sequence of things that need to happen every single time. Even if you don’t care about the initial tests. Even if the chain becomes twenty items long. Even if exercising a particular edge case later in the chain requires stupid unnecessary duplication early on, just so that the right data is available. And the fireworks really start when something in the middle of the chain has to change.

It’s perfectly valid not to duplicate documentation. There is no need to insist that people read through a ton of material about all the related concepts just to check whether a report works well. But that doesn’t mean that the execution chain has to follow the documentation chain. Use cross-links to explain things (eg ‘Credit card purchases are documented in this test…’), but let each test set up its own context and stay relatively independent. A common way to avoid duplication is to group related tests into a suite with a common set-up, which you can then run before each individual test if required, or potentially cache and execute only once for the entire suite. Organising tests by a functional breakdown, as suggested in the first tip, supports this perfectly, as set-ups can be nested through the hierarchy.
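
For example, here is a minimal pytest-style sketch of that kind of shared, cached set-up. The helper names (create_purchase, fetch_payment_report) and the module they come from are hypothetical placeholders for your own automation layer:

```python
# Hypothetical sketch: a back-office report suite that creates its own
# purchase data, instead of depending on purchase tests running earlier
# in an execution chain.
import pytest

from shop_automation import create_purchase, fetch_payment_report  # hypothetical helpers


@pytest.fixture(scope="module")
def recorded_purchases():
    # Common set-up for the whole suite: executed once and cached,
    # not repeated before every single test.
    return [create_purchase(amount=amount) for amount in (10, 250, 4999)]


def test_report_total_matches_recorded_purchases(recorded_purchases):
    report = fetch_payment_report()
    assert report.total == sum(p.amount for p in recorded_purchases)


def test_report_lists_each_purchase(recorded_purchases):
    report = fetch_payment_report()
    assert len(report.entries) == len(recorded_purchases)
```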

Using cross-links only for documentation and forcing fully isolated execution is definitely not the fastest way to knock up lots of test cases. But it brings long-term resilience. A change to some reference data required for card purchasing won’t have any impact on back-office report tests. People will be able to execute each test separately, and get feedback quickly when fixing bugs or introducing new features. Finally, each test can set up the data the best possible way to exercise all the tricky boundary conditions, without having to worry about messing up set-ups for any other tests in the logical chain of events.

As a bonus tip, try saving anything that humans need to read, such as specifications of automated tests and manual test plans, in a format that is easy to convert to a web site, for example Markdown or plain text. You’ll then be able to publish a documentation web site as the last step of an automated build process, and documentation cross-links can become actual hyperlinks that help people navigate it easily.
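
A publishing step like that can be as small as a short script run at the end of the build. The sketch below assumes the specs are Markdown files under a docs/ folder and uses the Python-Markdown package; the paths are just examples:

```python
# Minimal sketch: convert Markdown test specs into a static documentation site
# as the last step of an automated build (pip install markdown).
from pathlib import Path

import markdown

SOURCE = Path("docs")          # feature catalogue, written in Markdown
TARGET = Path("build/site")    # published output

for spec in SOURCE.rglob("*.md"):
    html = markdown.markdown(spec.read_text(encoding="utf-8"))
    out = TARGET / spec.relative_to(SOURCE).with_suffix(".html")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(html, encoding="utf-8")
```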

Reuse automation blocks, not specification blocks

Large systems require large test suites, and such suites often have tests that replicate or repeat parts of other tests. This is especially true if you follow the previous tip and make each test relatively independent – many set-up tasks will be similar, or completely the same. To make some sense of it all, and to prevent maintenance hell, teams often create reusable building blocks for such tests.

The problem is that the reuse often happens at the level of test specifications. People agree on a single way to set up a shopping cart and execute a purchase, and the same syntax gets reused every single time a purchase is required. Even when checking something that doesn’t need all the information about the purchased item, only the total amount. Even when testing end-of-day inventory reports, where individual customer details aren’t important. Even when testing how quickly fifty thousand purchases can go through the system, where the actual purchases are irrelevant as long as they all get accepted.

Far from solving maintenance problems, reusing specification blocks often just sweeps the real issues under the rug, where they collect more dirt and provide fertile ground for bugs to grow. It temporarily hides the problem, allowing people to copy-paste irrelevant and overcomplicated details, a bit like letting people copy-paste code from Stack Overflow. For a short time after creating such a test, the author will know exactly what’s going on. Six months later, someone will only be able to go through the mechanics of test execution, without any idea of how to fix a problem or change the test correctly.

Instead of creating reusable specification blocks, let people describe the key information as narrowly as possible, focusing only on the details relevant for a particular test. If you want to check how quickly fifty thousand transactions go through, just say that. Don’t list individual transaction details. Automate that test by spinning through a loop and invoking a test function that creates a purchase, possibly passing some valid dummy data. You can then reuse that same test function for preparing inventory reports, asking for a particular item but using generic customer details. You can also use it from the shopping cart test, which will focus on both the customer information and the purchased items.
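
A minimal sketch of what that reusable automation block could look like, with all names and defaults made up for illustration:

```python
# Hypothetical building block: one test function that creates a purchase,
# with valid dummy defaults, so each test passes only the details it cares about.
import itertools

_purchase_ids = itertools.count(1)


def create_purchase(customer="Generic Customer", item="Any item", amount=10):
    # In a real suite this would drive the application (API call, UI workflow...);
    # here it just returns a record so the sketch stays self-contained.
    return {"id": next(_purchase_ids), "customer": customer, "item": item, "amount": amount}


def test_fifty_thousand_purchases_are_all_accepted():
    # Throughput-style check: individual purchase details are irrelevant.
    purchases = [create_purchase() for _ in range(50_000)]
    assert len(purchases) == 50_000


def test_inventory_report_tracks_purchased_items():
    # Inventory check: the item matters, the customer details do not.
    blue_widgets = [create_purchase(item="Blue widget") for _ in range(2)]
    assert all(p["item"] == "Blue widget" for p in blue_widgets)
```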

The benefit of this approach is that the specifications will be shorter, more to the point, and easier to understand and maintain. Want to add another test that checks how quickly 100,000 transactions can go through? No problem, it’s just one line of text. Want to introduce a new mandatory step in the shopping cart purchase? No problem, just change it in one place – all the tests keep working immediately. Plus, pushing all the reuse into the automation blocks means you can get a lot of help from your IDE when refactoring.

For Given-When-Then tools, this means you should stop obsessing about using the same step definitions over and over again, and focus instead on reuse in the step implementation code. For technical test tools, this often translates to creating a layer of utility functions that execute workflows.
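
For example, with a behave-style Given-When-Then tool, differently worded steps can all delegate to the same hypothetical create_purchase building block from the previous sketch:

```python
# Sketch of step implementations sharing one automation block; the step
# wording stays specific to each scenario, the reuse lives underneath.
from behave import given, when

from shop_automation import create_purchase  # hypothetical shared automation layer


@given("fifty thousand accepted purchases")
def step_bulk_purchases(context):
    context.purchases = [create_purchase() for _ in range(50_000)]


@when('the customer buys a "{item}"')
def step_buy_single_item(context, item):
    context.purchase = create_purchase(item=item)
```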

As a bonus tip, consider reusing those automation blocks across different sets of tests. It’s likely that such components can be useful to run smoke tests, set up performance tests, and even help humans prepare the context for exploratory testing.
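
For instance, the same hypothetical building block could seed data for an exploratory testing session, outside any test framework:

```python
# Sketch: reuse the automation block to prepare context for exploratory testing.
from shop_automation import create_purchase  # hypothetical shared automation layer

if __name__ == "__main__":
    # Seed a few boundary-value purchases before the session starts.
    for amount in (0.01, 99.99, 100.00, 100.01):
        create_purchase(amount=amount)
    print("Exploratory testing data ready.")
```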

Document ‘Why’, specify ‘What’, automate ‘How’

Herculean tales are filled with dangerous quests, laborious efforts against great odds and glorious victories. The inciting incident, the reason why the hero sets out on the journey, is never given much space. As long as there’s a plausible excuse for skulls to get bashed and bones to be crushed, nobody cares too much why, nor whether the same thing could have been achieved more easily. Likewise, the final victory is a very small part of the whole story. The perilous journey takes centre stage in all great adventures. The Lord of the Rings would be a lot less interesting if Gandalf had magically teleported Frodo directly to Mount Doom. Or if, heaven forbid, they had dealt with Sauron in good time, before the whole thing went south.

Unfortunately, too many people seem to look for inspiration in epic poetry when writing tests. They often focus completely on how to execute a test, in painstaking detail, with all the laborious tasks. First open this page. Then click on that button. Then add an item to the shopping cart. Then rinse and repeat. The reason why the thing needs to be tested in the first place is scarcely mentioned, and the actual checks at the end are just incidental detail. Because the entire workflow is so epic, people often add a summary of what’s being tested as a comment or a scenario description at the top. But nobody bothers to write down the expected benefits and the risks of the feature under test. As long as some keys need bashing and some input fields need crushing, don’t ask why. Sure, X-wing fighters can communicate fine with rebel headquarters during an attack in a completely different galaxy, but a simple vector map needs to be carried by hand on a USB stick.

The big problem with such tests is that they are only meaningful in the short term. A few weeks after composing such an epic saga, everyone will still remember what it is about. But a few months later, people will only be able to follow it through and check whether everything still works the same way. Integrating changes will be almost impossible, because the key contextual information about risks and benefits isn’t there. Minor changes to reference data formats break such tests beyond repair. Something as simple as a link changing into a button can cause hundreds of such tests to be thrown away. The only relevant piece of information in an epic test is the email address of the author, so you can find them and ask them to do it all over again.

In most such cases, pushing the entire stack of information one level down helps to keep it under control. Instead of Tom from the fifth floor being the only person who knows why a test exists, document the key risks and expected value so that anyone can understand and update it. Instead of capturing what’s actually being tested in a comment above the executable part, make the test specification itself focus on the key rules and conditions that address those risks. The mechanics of executing the test should sit in the automation layers below.

For tools using Given-When-Then, this tip effectively translates to having only a single When entry. Push all the mechanics of executing that test one level below, to the implementation of the When step. Then reuse lower-level automation blocks to make it easy to maintain. Use the scenario names and descriptions to explain the reason behind the test, not what’s being tested.

As a bonus tip, try making the Given and the Then parts passive – they should only describe data. Make the When part active, and ensure it contains a verb. This will prevent test execution mechanics from creeping into preconditions or postconditions, and make it absolutely clear what the feature under test is.
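
Putting the last two tips together, a behave-style sketch might look like the one below. The scenario text (shown as a comment) keeps passive Given and Then parts and a single, active When; the execution mechanics and the reusable workflow code (here a hypothetical checkout helper) sit one level down in the step implementations:

```python
# Scenario: Repeat customers get a loyalty discount   (name explains why the test exists)
#   Given a customer with 12 previous purchases        (passive: only data)
#   When the customer checks out a full basket         (active: one step, one verb)
#   Then the order total includes a 10% discount       (passive: only the outcome)
from behave import given, when, then

from shop_automation import create_customer, checkout  # hypothetical workflow layer


@given("a customer with {count:d} previous purchases")
def step_returning_customer(context, count):
    context.customer = create_customer(previous_purchases=count)


@when("the customer checks out a full basket")
def step_checkout(context):
    # All the mechanics (open the page, add items, click the buttons)
    # are hidden behind this single workflow call.
    context.order = checkout(context.customer, basket="full")


@then("the order total includes a {percent:d}% discount")
def step_discount_applied(context, percent):
    assert context.order.discount_percent == percent
```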

Optimise titles for discovery

In the early Roman republic, women were so unimportant that their families didn’t even bother to give them proper names. Girls were known by their family name, and sisters would be differentiated with numerical adjectives. The third female child in a Julian family would simply be Julia Tertia. Showing the same level of respect for test cases, many teams simply use names such as ‘Scenario 1’, ‘Scenario 2’ and so on. More artistically inclined people give their tests names such as ‘Simple cases’ and ‘Additional scenarios’. No wonder nobody can prioritise any of that stuff later.

Test names are another thing that doesn’t seem important at all until a few hundred tests build up, and then all of a sudden they become critical. A good test name will tell someone whether they are reading the right spec at all, or whether they should skip a document altogether and look somewhere else. A good test name provides the context for changing the tests in the future, and helps people decide whether a test should be extended, rewritten or thrown away when new functionality comes along. A good test name is critical for discovering who in the company should clarify whether a broken test detected a bug or just an unforeseen consequence of an intended change. A good test name helps developers pinpoint problems after code changes without even looking at the details of the execution. A good test name is critical for finding all the relevant things that need to be discussed before a major system change.

Optimise scenario and test titles so you can easily discover all the relevant details of a feature. Spend five minutes thinking about a reasonable test name and you’ll save hours later. My heuristic for a good test name is to imagine that I’m looking online for information about the feature under test. What would you type into Google to find it? Probably not something overly generic, because too many results would come back. Probably not something like ‘Additional scenarios’, because you’d be swamped with irrelevant results. Probably not something too long, either, because it would contain too many keywords and match too many things. Search keywords are nice because they are key words – relatively few, to the point, and specific enough to describe a single concept.

As a bonus tip, avoid repeating the data from the test in the name. Don’t bother with ‘Approving transactions over 100 USD’, because that limit may change in the future. If you follow the previous tip, the limit should already be clearly documented in the specification. Call the test something like ‘Approving transactions based on amount’, so your titles stay valid even when the data changes.
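
As a rough pytest-flavoured illustration, the rule goes in the name and the data stays in the specification. The is_approved function and the example values below are entirely made up:

```python
# Sketch: the title describes the rule; the concrete limit lives in the data.
import pytest

from shop_automation import is_approved  # hypothetical function under test


@pytest.mark.parametrize("amount, expected", [(99, True), (100, True), (101, False)])
def test_approving_transactions_based_on_amount(amount, expected):
    assert is_approved(amount) is expected
```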

For some more tips on how to make large sets of tests easier to manage, check out Fifty Quick Ideas To Improve Your Tests.