A Simplified, Yet Rigorous Approach to Root Cause Analysis

 

There is no such thing as a root cause. There is no such thing as human error.

Accidents happen when we lose control.

We have many options for conducting root cause analyses (RCA), including TapRooT™ and CAST/STPA. The method described here was developed by GATE to simplify the analysis while maintaining adequate rigor.

Trevor Kletz famously noted:

“There is no such thing as a root cause, just a point at which we stop asking why.”

We agree. There are usually many causes and no identifiable ‘root’. But the idea of a root cause is a useful fiction. We can identify causes that we can do something about.

What do we mean by “no such thing as human error”?

Humans are not perfect. We make mistakes. It is almost always possible to claim that a particular human error is the root cause, but this is useless information. To prevent reoccurrence, we need to understand why the human did something that, in retrospect, seems erroneous.

The GATE Root Cause Analysis (RCA) process is based on the idea that accidents occur when we lose control. The control action in question may be performed by an instrumented system, by a human, by an organization, etc. In any case, a controlled system can be modeled as shown in Figure 1. This model is well understood when used to describe an industrial control loop such as liquid level control in a tank.

Figure 1: Control System Model

It is also applicable to a human controller, for example, a human driving a car. The controlled system is the car. The controller is the human.

The controller can be thought of as having two components: a process model and a control algorithm. The human’s process model is everything he/she knows about the car, about driving, about the road being used, current weather conditions, current traffic load, etc.

The process model reacts to and makes sense of the feedback. Feedback includes measured speed, road feel, weather, and responses of other drivers such as brake lights and horns. The human takes several control actions, including braking, accelerating, and steering. These actions are determined by his/her control algorithm in concert with the current state of his/her process model.

Errors can and do occur in every part of the control loop:

  • Malfunction of the controlled system (failure of any component of the system)

  • Wrong or missing feedback

  • A flawed process model, or one that misinterprets the feedback

  • A control algorithm that specifies an incorrect control action

  • A desired control action that is not accurately implemented
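The control loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the GATE method: the tank, the proportional control rule, and every number and name here (Tank, Controller, run, sensor_stuck_at) are hypothetical. It shows how one of the listed errors, wrong feedback, leads to loss of control even though the controller itself is working as designed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tank:
    """Controlled system: constant outflow, adjustable inflow (illustrative units)."""
    level: float = 4.0
    outflow: float = 1.0

    def step(self, inflow: float) -> None:
        # The level rises with inflow, falls with outflow, and cannot go below empty.
        self.level = max(0.0, self.level + inflow - self.outflow)


@dataclass
class Controller:
    """Controller = process model (believed level) + control algorithm."""
    setpoint: float = 5.0
    gain: float = 0.5
    believed_level: float = 4.0  # the process model: what the controller thinks is true

    def update_model(self, feedback: float) -> None:
        # The process model is only as good as the feedback it receives.
        self.believed_level = feedback

    def control_action(self) -> float:
        # Assumed control algorithm: nominal inflow of 1.0 plus a proportional correction.
        return max(0.0, 1.0 + self.gain * (self.setpoint - self.believed_level))


def run(steps: int, sensor_stuck_at: Optional[float] = None) -> float:
    """Simulate the loop; optionally freeze the level sensor at a fixed reading."""
    tank, controller = Tank(), Controller()
    for _ in range(steps):
        feedback = tank.level if sensor_stuck_at is None else sensor_stuck_at
        controller.update_model(feedback)
        tank.step(controller.control_action())
    return tank.level


print(f"healthy feedback: level = {run(20):.2f}")
print(f"stuck sensor:     level = {run(20, sensor_stuck_at=4.0):.2f}")
```

With accurate feedback the level settles at the setpoint. With the sensor stuck at a low reading, the controller keeps adding liquid and the level climbs steadily. Asking why the controller overfilled the tank points at the feedback, not at the controller.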

The GATE RCA Process

The GATE RCA Process is based on Decision Theory, especially Naturalistic Decision Theory, Cognitive Task Analysis, STPA/CAST, and expertise in writing and evaluating procedures.

In the next sections, we will examine the following process more closely:

  1. Describe the Event and Loss to be Analyzed

  2. Identify the Participants, Stakeholders, and SMEs

  3. Develop a Timeline

  4. Identify and Review Relevant Documents, such as operating procedures and work permits, that are directly related to the event

  5. Deepening (key actions)

  6. Identify Key Causes

  7. Identify Ways to Prevent Reoccurrence

Step 1: Describe the Event & Loss

This is a high-level description of the event and loss that will be analyzed. The description will include:

  • Location

  • Date and Time

  • Participants / victims

  • High-level description of the event

The event will likely fall into one or more of these categories:

  • Injury

    • Blunt forces

    • Heat or Cold

    • Electricity

    • Toxic Substance

  • Loss of Containment

  • Fire/Explosion

  • Weather Event

  • Dropped object

  • Transportation Incident

  • Equipment Damage/Failure

  • Structural Damage/Failure

Care must be taken in this description to avoid jumping to conclusions on causes. It should include only observable, verifiable facts.
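One way to keep the Step 1 description disciplined is to record it in a fixed structure. The sketch below is a hypothetical illustration, not a GATE tool: the field names, the CAUSAL_PHRASES list, and the example incident are all invented here. The categories mirror the list above, and the phrase check is a deliberately crude reminder that the description should hold observations rather than conclusions about causes.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List


class Category(Enum):
    INJURY = "Injury"
    LOSS_OF_CONTAINMENT = "Loss of Containment"
    FIRE_EXPLOSION = "Fire/Explosion"
    WEATHER_EVENT = "Weather Event"
    DROPPED_OBJECT = "Dropped object"
    TRANSPORTATION_INCIDENT = "Transportation Incident"
    EQUIPMENT_DAMAGE_FAILURE = "Equipment Damage/Failure"
    STRUCTURAL_DAMAGE_FAILURE = "Structural Damage/Failure"


# Assumed, non-exhaustive phrases that suggest a causal conclusion has crept in.
CAUSAL_PHRASES = ("because", "due to", "caused by", "failed to", "should have")


@dataclass
class EventDescription:
    location: str
    when: datetime
    participants: List[str]
    summary: str
    categories: List[Category]

    def causal_language(self) -> List[str]:
        """Return any flagged phrases found in the summary, for the facilitator to review."""
        text = self.summary.lower()
        return [phrase for phrase in CAUSAL_PHRASES if phrase in text]


# Fictional example, for illustration only.
event = EventDescription(
    location="Unit 3 loading bay",
    when=datetime(2024, 1, 15, 8, 20),
    participants=["Operator A", "Operator B"],
    summary="A pallet fell from the forklift because the operator was rushing.",
    categories=[Category.DROPPED_OBJECT],
)
print(event.causal_language())
```

A flagged phrase does not mean the description is wrong, only that it deserves a second look before the analysis starts.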

Step 2: Identify the Participants, Stakeholders & SMEs

The participants are the people directly involved in the incident. They will be interviewed during the analysis to develop the timeline (Step 3) and for deepening (Step 5).

Stakeholders may include:

  • Managers Responsible for the Facility

  • Owner(s) of the Facility including Business Partners

  • Community

  • Regulators, etc.

Subject Matter Experts (SMEs) include:

  • Design engineers with relevant expertise

  • Operators not involved in the incident who have worked in this or similar facilities

  • Safety Professionals

Step 3: Develop the Timeline

The timeline is critical to everything that follows. We develop the timeline to identify the relevant actions:

  • Who did what and when?

  • What relevant equipment actions/malfunctions occurred and when?

  • Did control systems behave in unanticipated ways? Describe.

If multiple individuals were involved in the accident, a strategy should be developed and applied for generating the timeline. Participants may be interviewed separately or in a group. The suggested approach is:

  1. Interview the participants individually and develop multiple versions of the timeline.

  2. Identify discrepancies.

  3. Conduct further interviews with the individual participants to try to address those discrepancies.

  4. And/or conduct a joint interview with all participants to address the discrepancies as a group.
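The first two steps of the suggested approach, building separate timelines and then identifying discrepancies, can be sketched as a simple comparison. This is a minimal illustration under assumptions: each account is reduced to a mapping from an action to a reported time, and the function name find_discrepancies, the participants, and the equipment tags are all invented for the example.

```python
from collections import defaultdict
from typing import Dict


def find_discrepancies(accounts: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Compare individual timelines.

    accounts maps participant -> {action: reported time as "HH:MM"}.
    Returns the actions where reported times differ or where at least
    one participant did not mention the action at all.
    """
    reported = defaultdict(dict)
    for person, timeline in accounts.items():
        for action, time in timeline.items():
            reported[action][person] = time

    discrepancies = {}
    for action, by_person in reported.items():
        missing = sorted(set(accounts) - set(by_person))
        times_differ = len(set(by_person.values())) > 1
        if times_differ or missing:
            entry = dict(by_person)
            entry.update({person: "not mentioned" for person in missing})
            discrepancies[action] = entry
    return discrepancies


# Fictional accounts, for illustration only.
accounts = {
    "Operator A": {"pump P-1 started": "08:05", "valve V-2 opened": "08:10"},
    "Operator B": {
        "pump P-1 started": "08:05",
        "valve V-2 opened": "08:20",
        "alarm sounded": "08:22",
    },
}

for action, versions in find_discrepancies(accounts).items():
    print(action, versions)
```

An action that one participant did not mention is not necessarily an error, since not everyone witnesses everything. It is simply a topic for the follow-up interviews in steps 3 and 4.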

Step 4: Identify & Review Relevant Documents

If the incident involves a task, then the relevant information will be documents that describe the task, such as operating procedures, JSAs, PTWs, etc.

At this point it can be useful for the facilitator to consult with uninvolved subject matter experts to develop a deeper understanding of the system and the task prior to the deepening interviews.

However, care must be taken not to jump to conclusions on the root cause(s) at this phase.

Step 5: Deepening

Once the timeline has been established and the facilitator is well versed in the system, we focus attention on each of the key actions individually. These are identified from the timeline.

The key control actions may have been taken by humans or by equipment. The participants involved are interviewed to develop a deeper understanding of the actions.

During deepening we seek to understand, for each relevant action:

  • Who knew what and when?

  • Who was consulted? When? About what? What did they advise?

  • How and why did equipment items malfunction?

The fundamental consideration at each point is to understand:

Why did we lose control?

Figure 2: High-Level Cause Map

Figure 2 is used to identify the important topics of discussion for each interview.

Wrong Human Action will often be attributed to one or more of the following:

  • Lack of Knowledge or Skill

  • Procedures or Procedure Following

  • Inadequate/Inaccurate Situation Assessment

  • Decision-making

If the failure is attributable, at least in part, to a mechanical failure, Figure 2 suggests many possible causes. Of these possible causes, control failure is likely the most common and should be investigated carefully.

In accidents where transportation played a role, use Figure 2 as a preliminary guide.

Safeguards are provided to prevent an accident and/or to mitigate the effects of an accident. Safeguards can also cause an accident (for instance, through the unintended consequences of a complex cause-and-effect chart). Where safeguards failed to prevent or mitigate the accident, or actually caused it, use Figure 2 as a discussion guide.

Step 6: Identify Key Causes

There is no such thing as a root cause. For most accidents there are multiple causes. For each of those causes the backward chain to a ‘root’ cause extends indefinitely as long as we keep asking another ‘Why?’, but there is a point of diminishing returns.

The right place to stop a root cause analysis is when we have identified causes that can be addressed in a way that meaningfully decreases the risk of reoccurrence.

Step 7: Identify Ways to Prevent Reoccurrence

While keeping in mind that there is no such thing as a root cause, there will be causes which the organization can address in order to decrease the risk of reoccurrence. These causes may be hierarchically organized:

  • Things that operators can do

  • Things that direct line managers can do

  • Things that corporate leadership can do

  • Things that community, regulators, industry can do
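The hierarchy above can be used to sort the key causes from Step 6 by who is in a position to act on them. The sketch below is a hypothetical illustration: the Level and Cause names and the example causes are invented, and a real analysis would attach far more context to each cause.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Level(Enum):
    OPERATORS = "Operators"
    LINE_MANAGERS = "Direct line managers"
    CORPORATE_LEADERSHIP = "Corporate leadership"
    EXTERNAL = "Community, regulators, industry"


@dataclass
class Cause:
    description: str
    level: Level          # who can act on this cause
    proposed_action: str  # what they could do about it


def group_by_level(causes: List[Cause]) -> Dict[Level, List[Cause]]:
    """Group causes by the level of the organization that can address them."""
    grouped: Dict[Level, List[Cause]] = {level: [] for level in Level}
    for cause in causes:
        grouped[cause.level].append(cause)
    return grouped


# Fictional causes, for illustration only.
causes = [
    Cause("Level sensor reading not cross-checked", Level.OPERATORS,
          "Compare the sensor reading with the sight glass at shift start"),
    Cause("Procedure did not cover sensor failure", Level.LINE_MANAGERS,
          "Revise the operating procedure with operator input"),
    Cause("No budget for redundant level sensors", Level.CORPORATE_LEADERSHIP,
          "Fund a second, independent level measurement"),
]

for level, items in group_by_level(causes).items():
    print(f"{level.value}: {len(items)} cause(s)")
    for cause in items:
        print(f"  - {cause.description} -> {cause.proposed_action}")
```

Levels with no entries are still listed, which makes it easy to see where the analysis has not yet looked.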

 

Viking Can Help

With our legacy of experience in process design, materials selection, risk assessment, HAZOPS, and systems analysis along with our specific knowledge of human error research, cognitive task analysis, naturalistic decision making and procedure writing, we can provide effective and efficient root cause analyses for accidents and near misses large and small.
