Our first instinct after an incident is usually to look for what we think should have been in place, or what should have happened, to prevent it.
For example, in one incident a vessel containing hydrocarbon product overflowed, even though a high-level alarm had been generated at the panel in the control room.
Investigators immediately settled on the cause: “The cause of the overflow incident can be attributed to the failure of the Panel Operator to react to the high level alarm”.
This is a common way of looking at causes of an incident.
We can even take it a step further and immediately land on the solution: “Discipline the Panel Operator” or “Re-train the Panel Operator”.
This type of reasoning allows for a quick investigation, but it does not answer the question of why the incident happened. We have merely identified the barrier that we think failed. The organization comes out of the investigation having learned nothing that would prevent the incident from recurring.
However, if we seek to understand “Why the level went over the top”, we may get a very different understanding. Firstly, from a process point of view, we need to understand why the vessel inlet flow had exceeded the vessel outlet flow (this is a requirement for the vessel to go over the top). Secondly, instead of blaming the Panel Operator, we can try to understand what the Operator was doing at the time.
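The process point above is a simple volume balance: the level can only rise while inflow exceeds outflow, and the size of that imbalance determines how long anyone has to react. A minimal sketch of that reasoning follows. The function name `time_to_overflow` and all flow and volume figures are hypothetical, chosen for illustration only and not taken from the incident.

```python
from typing import Optional


def time_to_overflow(free_volume_m3: float,
                     inflow_m3_per_h: float,
                     outflow_m3_per_h: float) -> Optional[float]:
    """Return hours until the vessel is full, or None if the level is not rising.

    Assumes constant flows and a known free volume above the current level.
    """
    net_m3_per_h = inflow_m3_per_h - outflow_m3_per_h
    if net_m3_per_h <= 0:
        # Inflow does not exceed outflow, so the vessel cannot go over the top.
        return None
    return free_volume_m3 / net_m3_per_h


# Hypothetical figures: 10 m3 of free space in the vessel.
balanced = time_to_overflow(10.0, inflow_m3_per_h=50.0, outflow_m3_per_h=50.0)
upset = time_to_overflow(10.0, inflow_m3_per_h=80.0, outflow_m3_per_h=50.0)
print(balanced)  # None: level is steady
print(upset)     # about 0.33 hours, i.e. roughly 20 minutes
```

Framing the question this way shifts attention from who missed the alarm to what caused the imbalance and how little time it left.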
In this particular example, we learnt that an instrument further upstream in the process had failed, triggering a sudden increase in product flow to the vessel. The automatic vessel level control system was unable to respond in time, and the vessel overflowed. The failed instrument had also generated multiple process upsets, which were occupying the Panel Operator's attention when the high-level alarm triggered.
The more thorough investigation resulted in a completely different set of solutions, which addressed the potential for a high rate of inflow. Structured solution development then led to an automatic trip system that shuts off the entire process before the level can go out of control.
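To illustrate why a trip addresses the problem where the level controller alone could not, here is a rough simulation sketch. It assumes a simple proportional level controller whose outlet flow saturates at a maximum, a step increase in inflow, and a latching trip that stops the inflow. Every number and name below (`simulate`, the vessel size, flows, gain, and trip setting) is a hypothetical value for illustration, not data from the actual plant.

```python
def simulate(trip_level_pct=None, duration_s=600, dt_s=1.0):
    """Euler simulation of vessel level (percent) during an inflow upset.

    All parameters are illustrative assumptions, not plant data.
    """
    volume_m3 = 10.0          # total vessel volume
    setpoint_pct = 50.0       # level controller setpoint
    normal_inflow = 0.010     # m3/s before the upset
    upset_inflow = 0.030      # m3/s after the upstream instrument fails
    max_outflow = 0.012       # m3/s, outlet valve fully open
    gain = 0.001              # m3/s of extra outflow per percent of level error

    level_pct = setpoint_pct
    peak_pct = level_pct
    tripped = False
    t = 0.0
    while t < duration_s:
        inflow = upset_inflow if t >= 60.0 else normal_inflow
        if tripped:
            inflow = 0.0      # the trip latches and shuts off the feed
        # Proportional level control, limited by what the outlet can pass.
        outflow = normal_inflow + gain * (level_pct - setpoint_pct)
        outflow = min(max(outflow, 0.0), max_outflow)
        level_pct += (inflow - outflow) * dt_s / volume_m3 * 100.0
        peak_pct = max(peak_pct, level_pct)
        if level_pct >= 100.0:
            return {"overflowed": True, "peak_pct": 100.0, "tripped": tripped}
        if trip_level_pct is not None and level_pct >= trip_level_pct:
            tripped = True
        t += dt_s
    return {"overflowed": False, "peak_pct": peak_pct, "tripped": tripped}


print(simulate())                     # controller alone: the vessel overflows
print(simulate(trip_level_pct=85.0))  # with a trip: the level peaks just above 85%
```

In this sketch the controller saturates within seconds of the upset, so the outcome without a trip depends entirely on how quickly a person intervenes. With the trip, the outcome no longer depends on the Operator noticing one alarm among many.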
Had we stopped at the initial investigative approach, we would have repaired the instrument and applied consequence management to the Operator. A few years (or even months) down the line, we might well have had a repeat incident.
Conventional root cause analysis techniques certainly have their place if you need a quick reaction to an incident, but it is important to understand their limitations. Once the dust from the initial incident has settled, circle back to a more thorough investigation to ensure that we prevent a recurrence.