A more polished version of this article is in my book Fifty Quick Ideas To Improve Your Tests

It can be tempting to add a ton of scenarios and test cases to acceptance criteria for a story, and look at all possible variations for the sake of completeness. Teams who automate most of their acceptance testing frequently end up doing that. Although it might seem counter-intuitive, this is a sure way to destroy a good user story.

The core of the problem is that a large number of complex scenarios is OK for a machine to process, but not for humans to understand easily. Fast iterative work does not allow time for unnecessary documentation, so acceptance criteria in user stories often ends up being the only detailed specification most teams produce. If that specification is complex and difficult to understand, it is unlikely to lead to good results. Complex specifications don’t invite discussion - people tend to read them in isolation and selectively ignore parts they feel are less important. This can lead to an illusion of precision and completeness. The overwhelming amount of data makes everyone feel secure that someone has the entire picture, but in fact people can easily understand things differently.

Here is a typical example:

Feature: Payment routing
   In order to execute payments efficiently
   As a shop owner
   I want the payments to be routed using the best gateway
   
     Scenario: Visa Electron, Austria
      Given the card is 4568 7197 3938 2020
      When the payment is made
      The selected gateway is Enterpayments-V2
     Scenario: Visa Electron, Germany
      Given the card is 4468 7197 3939 2928
      When the payment is made
      The selected gateway is Enterpayments-V1
     Scenario: Visa Electron, UK
      Given the card is 4218 9303 0309 3990
      When the payment is made
      The selected gateway is Enterpayments-V1
     Scenario: Visa Electron, UK, over 50
      Given the card is 4218 9303 0309 3990
      And the amount is 100
      When the payment is made
      The selected gateway is RBS
     Scenario: Visa, Austria
      Given the card is 4991 7197 3938 2020
      When the payment is made
      The selected gateway is Enterpayments-V1
    ....

…. followed by ten more pages of similar stuff.

The team that implemented the related story suffered from a ton of bugs and difficult maintenance, in my opinion largely caused by the way they captured their examples. This long, monolithic list of examples was difficult to break up, so it was only possible for one pair of developers to work on it instead of splitting work. Because of that, it took several weeks for the implementation to be initially available for business users to try out. The large chunk of work needed to be properly tested, so another week or so passed until it was ready to go live, and then the fireworks started. The false sense of completeness prevented the delivery team from discussing important boundary conditions with business stakeholders, and several important cases were obvious to different people in different ways. This surfaced only after a few weeks of running in production, when someone spotted increased transaction costs.

Although each individual scenario might seem understandable, pages and pages of this make it hard to see the forest for the trees. Of course, this wasn’t implemented as one huge If-Else statement, but some critical variations came in only on page six or seven. Complemented by hotfixing under pressure to resolve production issues, the result was pretty much a patchwork of special cases.

These examples tried to show that, in cases when several processors could potentially execute a card transaction, high risk transactions should be sent to a more expensive processor with better fraud controls, and low risk transactions need to go to a cheaper processor even if it has worse fraud controls. Important business concepts such as transaction risk score, processor cost or fraud capabilities were not captured in the examples - the technical model differed from the business model. Changing small thresholds for risk scoring required huge changes to a complex network of special cases in software, and a ton of unexpected consequences. When a cheaper payment gateway suddenly had better fraud control than a more expensive one, several weeks of testing were needed to adjust the system to benefit from it.

An overly complex specification is often a sign that the technical model is misaligned with the business model, or that the specifications are described at the wrong level of abstraction. Even when correctly understood, such specifications lead to software that is hard to maintain, because small changes in business can lead to disproportionately huge changes in software.

It is far better to focus on illustrating user stories with key examples - a small number of relatively simple scenarios that will be easy to understand, evaluate for completeness and criticise. This doesn’t mean throwing away precision - quite the opposite - it means finding the right level of abstraction and the right mental model that would allow a complex situation to be described better.

The payment routing case could be broken down into several groups of smaller examples. One group would describe transaction risk based on country of residence - something that business users intuitively had in their minds without ever expressing it directly. Another group of examples would describe how to score a transaction based on payment amount and country of purchase. Several more groups of examples would describe individual transaction scoring rules, focused only on the relevant characteristics of a purchase. Then one overall group of examples would describe how to combine different scores - regardless of how they were calculated. A final group of examples would describe how to match the transaction risk score with the compatible payment processors, based on processing cost and fraud capabilities. Each of these groups might have five to 10 important examples, and would be much easier to understand. Taken together, these key examples would allow the team to describe the same set of rules, much more precisely but with far fewer examples than before.

Key benefits

Describing stories with several simple groups of key, focused examples leads to specifications which are easier to understand and implement. Smaller numbers of examples make it easier to evaluate completeness and argue about boundary conditions, so they make it easier to discover and resolve inconsistencies and differences in understanding. Stakeholders can have a better, more focused discussion with several smaller groups of focused examples than a with huge spreadsheet of endless possibilities. As a result, the outcome will be much more likely to satisfy the original business needs faster, and it will be less likely that unexpected consequences will be discovered only in production.

Breaking down complex examples into several smaller and focused groups leads to more modularised software, which reduces future maintenance costs. Continuing with the transaction processing example, modelling examples that explain individual scoring rules would give enough hints to the delivery team to capture those rules as separate functions, so that changes to an individual scoring threshold do not impact all the other rules. This would avoid unexpected consequences when rules change. Adding another processor, or changing the preferred processor after a cheaper one gets better fraud control, would require small localised changes instead of causing weeks of confusion.

Describing different aspects of a story with smaller and focused groups of key examples allows the team to divide work better. Two people can take the country-based scoring rules, two other people could take the routing based on final score. When everything is described as a combinatorial explosion of examples, it’s difficult to divide work. Smaller groups also become a natural way of slicing the story - for example some more complex rules could be postponed for a future iteration, but a basic set of rules could go live in a week and provide some nice business value.

Finally, focusing on key examples significantly reduces the sheer volume of scenarios which need to be checked. Assuming that there are six or seven different scoring rules in combination, and that each has five key examples, the entire problem space can be described with roughly eighty thousand examples which combine all variations (five to the power of seven). Breaking it down into groups would allow us to describe the same problem space with forty or so examples (five times seven, plus a few overall examples to show that everything holds together). This significantly reduces the time required for describing and discussing the examples, but also for checking the implementation, and provides a much better starting point for any further exploratory testing.

How to make this work

The most important thing to remember is that if the examples are too complex, your work on refining a story isn’t done. There are many good strategies for dealing with complexity. Here are four that I often use as a good starting point:

  • Look for missing concepts
  • Group by commonality and focus only on variations
  • Split validation and processing
  • Summarise and explore important boundaries

Overly complex examples, or too many examples, are often a sign that some important business concepts are not explicitly described. Risk score is a good example of that in the payment routing case. Discovering these concepts will allow you to offer alternative models and break down both the specification and the overall story into more manageable chunks. You can use one set of examples to describe how to calculate the risk score, and another to describe how to use the score once it is calculated.

A very common case of hidden business concepts is mixing validation with usage - for example, if the same set of examples describes how to process a transaction and all the ways to reject a transaction without processing (card number in incorrect format, invalid card type based on first set of digits, incomplete user information etc). The hidden business concept in that case is ‘valid transaction’, and it would be much better to split a single large set of complex examples into two groups - determining if a transaction is valid, and working with a valid transaction. Those groups can then be broken down further based on structure.

Long lists often contain groups of examples similar in structure or values. For example, in the payment routing story there were several pages of scenarios with card numbers and country of purchase, a group of examples involving two countries (registration and purchase), and some examples where the actual transaction value was important. Identifying commonalities in structure is often a good first step to discovering meaningful groups. Each group can then be restructured to show only the important differences between examples, reducing the cognitive load.

The fourth good strategy is to identify important boundary conditions and focus on them - ignoring examples that do not increase our understanding. For example, if the risk threshold for low-risk countries is 50 USD, and 25 USD for high risk countries, then the important boundaries are:

  • 24.99 USD from a high risk country
  • 25 USD from a high risk country
  • 25 USD from a low risk country
  • 49.99 USD from a low risk country
  • 50 USD from a low risk country

Discussing and documenting these five examples will probably give the team 90% of the value they could get from twenty pages of similar examples. Don’t aim to replace testing fully with examples in user stories - aim to create a good shared understanding, and give people the context to do a good job. Five examples that are easy to understand and at the right level of abstraction will do a much better job at that than hundreds of very complex test cases.