At the Agile testing user group meeting on 4th May 2010, Douglas Squirrel presented ideas on running effective root cause analysis, which he uses at YouDevise to facilitate continuous improvement. Saying that reflective practices are more important than pure technical practices for software development teams today, Squirrel suggested that an effective root cause analysis workshop can be instrumental in reflecting on organisational processes and improving them.
Target a specific event
Instead of generic and abstract problems, such as “why do support people never find …”, Squirrel suggested targeting a specific event, a particular incident. This puts things into a more concrete perspective and also shifts the focus from a single group of people (in this case support) to the wider organisation, so that the entire organisation can benefit from the analysis.
Everyone affected attends
It is crucial for everyone affected by that particular incident or event to attend the root cause analysis, to avoid the danger of missing information. It also helps to avoid a blame culture, as it is very easy to blame people who are not present. Squirrel especially pointed out that executives should attend root cause analysis meetings. First, they are capable of fixing very deep resource problems. Second, having senior executives at a meeting sends a message of importance.
Establish a No-Blame culture
To get to the right information, Squirrel suggested that there has to be a “no blame” culture during the meeting. Otherwise people won’t be direct and honest. In one case, by allowing people to speak openly about what went wrong, they discovered that a group of people had stayed working after midnight to fix a problem – and started addressing the cause of that instead of the original issue.
Poll to identify problems
Squirrel also suggested asking everyone “What do you think is the problem?”. This provides different contexts and uncovers additional information. Talking about a server configuration issue, he said that the traders at the meeting complained that they weren’t notified about the fix and lost twelve hours of trading. Polling to identify problems exposed a communication issue between the operational staff and the traders.
Write a lot
After the initial set of problems is identified, start popping the why stack and get to the “fifth why”. At every moment you should be either writing down why something happened or asking ‘Why?’. This keeps the discussions short and focused.
Move down, then across
Polling and the discussion will identify many problems. To manage this effectively, Squirrel advised tackling them one at a time: follow a single problem down through its whys before moving across to the next one, otherwise the group will get distracted with reasons and not identify solutions. “You know you are at the fifth why because it hurts, and there is usually a pause” when the real problem is identified, said Squirrel, adding that “if it doesn’t hurt, you are not doing it right”. One of the key take-aways for me from the practical exercise that followed is that people sometimes use humour to disguise things that hurt, or to divert the discussion away from them, so cynical or humorous remarks are also a sign that we might be going in the right direction.
Set proportionate tasks
When the real root cause of the problem is identified, don’t get carried away and “retrain your development team because of five minutes of downtime”, said Squirrel, “but define tasks proportionate to the problem”. “It’s not necessary to solve problems, but make progress”, said Squirrel. Instead of gold-plating solutions, he suggested acting quickly. “If you do it wrong, it will come back again”. Solutions that take too long will never get done, so Squirrel suggested thinking about what you can do in a week or even in an hour, and building up the solution the next time a problem happens. If a solution that takes less than a week to implement cannot be identified, escalate or bring in consultants, suggested Squirrel, adding that the root cause analysis process helps to build a business case for the solution.
Look for patterns across sessions
Running root cause analysis sessions frequently and acting quickly to build up incremental solutions helps to avoid complacency and ensure that the organisation is continuously improving. Identifying patterns in problems that are analysed might point to deeper organisational issues, or solutions that never get implemented. If tasks don’t get finished, this might mean that there is no buy-in to implement change – in which case Squirrel advised getting the buy-in from senior management upfront. It can also expose cross-cutting cultural problems with the organisation, which can be addressed separately.
If you found this summary interesting, you can watch the video of this presentation including the interactive exercise that Squirrel organised after the presentation.