Sherlock Holmes, Thomas Magnum and you. Yep, maintenance professionals are right up there with the great Magnum P.I. (and his stache) when it comes to solving the trickiest cases.
But what if there was an even easier way to solve those complex problems? It's a little technique we call Root Cause Analysis (RCA).
In this post, we will show you how to get to the bottom of why things break, and all the different ways you can fix them. Get out your detective notebook because we are going to take a deep dive into all things Root Cause Analysis.
What is root cause analysis?
By definition, root cause analysis is the process of finding the underlying cause for an effect we observe or experience. In the context of failure analysis, RCA is used to find the root cause of frequent machine malfunctions or a significant machine breakdown.
Like those meddling kids in Mystery Inc., you'll use your detective skills to determine:
RCA is a reactive process, meaning it's performed after the event occurs. But once a root cause analysis is done, it takes the shape of a proactive mechanism since it can help you predict problems before they occur.
If you fix a symptom of the problem, but you don't fix the actual cause of the problem, there's a high chance the failure will happen again.
For example, suppose you replace the broken belt but don't change the misaligned part causing the belt to overheat and break. In that case, you could bet your paycheck that the belt is going to fail again. RCA tries to follow the chain of cause and effect to pinpoint the problem that will make all the other faults disappear when finally eliminated.
The RCA process does not guarantee an outcome
Conducting root cause analysis can be very complicated. It involves a vast amount of data collection and review. The result of a root cause analysis isn't always black and white. It can't always tell you if the problem you identified is the root cause.
RCA is a craft that requires specialized knowledge and in-the-field experience, which means you're likely the best person for the job. Without that knowledge, any fixes implemented will likely be just a cosmetic solution to the problem. In the worst-case scenario, the changes made could actually make the situation worse.
Despite these limitations, RCA is still a powerful tool for understanding and improving the fundamental nature of systems and procedures.
Over the years, RCA has evolved to work within various fields, each with its own unique needs and approach. The most apparent use of RCA is in the medical field. The TV show House is an excellent example of RCA in action.
In the show, a complex and bizarre medical case usually shows up at the hospital. The doctors are stumped! That is until the unconventional wildcard Dr. House jumps in and saves the day with his crazy theories and methods.
The good doctor uses root cause analysis to dig into an issue and keeps digging until the real cause of the patient's symptoms is finally revealed. A happy ending for all!
Aside from the healthcare field, many other industries use root cause analysis regularly. Some of them are:
These industries will generally use one specific type of root cause analysis that fits their situation best. Below are some examples of different types of RCA methodologies used by various fields and industries.
Different types of RCA
RCA comes in different forms depending on the problem you're trying to solve. Here's what they look like:
When to perform a root cause analysis?
When you're doing an RCA to determine the source of a fault, you'll usually find 3 basic types of problems:
Keep in mind that RCA requires a significant investment of time, manpower, and money. And it will likely cause further disruption in the specific production line or the system you're working on. So bearing that in mind, you don't need to (and you shouldn't) do RCA for every single fault.
Unfortunately, there is no cut-and-dried rule for when to run an RCA and when not to. As the expert and the experienced professional, you're generally the best person to determine whether or not to run a root cause analysis.
If the same fault occurs over and over, it's worth investigating. If the same defect is repeatedly happening, you can assume that it won't be cleared simply by fixing the visible problem. There is an underlying reason for the recurring faults. These types of incidents need to be investigated with RCA.
To determine if a failure is critical, you can look at the cost to the plant or the total downtime due to the particular failure. When a critical failure occurs, it needs to be investigated to identify the root cause to help avoid this situation in the future. Explosions at an oil rig and airplane crashes are examples of critical failures that need to be investigated.
There are critical machines and critical subprocesses in any system. A failure of these types of machines will halt the entire operation because there may not be a backup or mitigation plan for that particular machine. In this case, how critical the machine is will determine whether or not to do RCA.
The 3 Rs of Root Cause Analysis
No doubt you've heard these 3 Rs: reduce, reuse, recycle, or maybe even reading, writing, arithmetic. But RCA also has its own system of 3 Rs: Recognize, Rectify, Replicate.
The actual cause of a problem is not always apparent, and simple cosmetic fixes usually don't do much to correct the underlying fault. Even though RCA can be an elaborate, time-consuming exercise, we do it to pinpoint the actual cause so we can take corrective actions that will eliminate future issues. As mentioned earlier, RCA can also be done to identify the reason for an unexpected positive outcome.
The first step, Recognize, is when you notice something's not working quite right. The machine is leaking fluid, making a weird sound, or not running as productively as it usually does. This is when it's time to put on your detective cap and find out what's going on.
Once you've recognized the root cause, it's time to rectify it with a corrective course of action. If the root cause is addressed, the same problem should not crop up again. If the same problem reappears, it's likely because the cause you identified was not actually the root cause.
In this case, you might have to go through the RCA process again to make sure that you get to the actual root cause.
For example, you notice the machine is leaking fluid, so you patch the hole in the metal. If you stop seeing fluid on the ground under the machine, you've solved the problem, and you've taken care of the root issue. But if a leak crops up again in a week, it's time to run another RCA to find out if there are other holes in the metal or if gaskets are failing.
Once you've identified and rectified the root cause, your next step is to ensure it will not happen again anywhere else in the process or system. Sometimes you'll want to do an RCA to get to the bottom of an unexpectedly good outcome. In that case, you will test whether the same factors can be replicated in other scenarios and environments.
Suppose there were issues with faulty parts coming off the line, but you've since fixed the issue. The next step would be to replicate the conditions that produced the faulty parts and test whether the fix actually holds.
In that case, you'd need to recreate what happened during that period to confirm that you got to the bottom of the issue.
How to do a root cause analysis
RCA can be accomplished using many different tools and techniques. And even though those processes may look different, they all arrive at the same end goal: fixing the root cause of the issue.
To do a root cause analysis the right way, you should follow four basic steps.
Step 1: Define the problem
Start with the obvious: What is the problem? By defining the problem, the symptoms, and what you can see happening, you set the scope and direction of the analysis.
Without a specific problem statement, it's hard to create a path to a solution. A well-defined problem statement also helps determine the scale and scope of the potential solution to be implemented. When you're writing your problem statement, keep these three pieces in mind:
Step 2: Collect the data
Collect all available data related to the incident. Ask yourself, What proof is there? How long has this problem existed? What is the impact of the problem? Be sure to record any other data you think might help you determine the issue.
Take, for example, machine failure in a manufacturing plant. These are examples of types of information you'll want to document.
Inspecting the machine in person also provides information that could be beneficial for root cause analysis. Facilities that run predictive maintenance will find it easier to collate this data quickly.
Step 3: Map out the events
Establish a timeline of events. This will help you determine which factors among the data collected are worth investigating. RCA needs data points that potentially lead to the root cause. Putting events and data in chronological order helps to differentiate causal events from non-causal events.
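To make the timeline idea concrete, here is a minimal Python sketch. The event log, its timestamps, and the function names `build_timeline` and `candidates_before` are all made up for illustration. It sorts the events chronologically and keeps only those that happened before the failure, since anything logged afterward cannot have caused it.

```python
from datetime import datetime

# Hypothetical event log for one machine (illustrative data only).
events = [
    ("2024-03-04 14:10", "Fuse blew, machine stopped"),
    ("2024-03-01 09:30", "Operator reported unusual bearing noise"),
    ("2024-02-20 08:00", "Lubrication pump pressure dropped"),
    ("2024-03-05 10:00", "Fuse replaced"),
]


def build_timeline(events):
    """Parse the timestamps and return the events in chronological order."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), note) for ts, note in events]
    return sorted(parsed, key=lambda item: item[0])


def candidates_before(timeline, failure_note):
    """Keep only events that happened before the failure.

    Events logged after the failure cannot be its cause, so they are dropped.
    """
    failure_time = next(ts for ts, note in timeline if note == failure_note)
    return [(ts, note) for ts, note in timeline if ts < failure_time]


timeline = build_timeline(events)
candidates = candidates_before(timeline, "Fuse blew, machine stopped")
for ts, note in candidates:
    print(ts.isoformat(sep=" "), "-", note)
```

Being earlier than the failure only makes an event a candidate. You still have to show that it actually contributed to the fault.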
From the data collected, you can identify correlations between various events, their timing, and other data collected. Remember that correlation does not mean causation.
Questions to ask yourself when looking for correlations:
The next step is to map out a causal graph. These graphs are used to represent the relationship between events that happened and the data collected.
But it's important to not stop investigating when you find a correlation between events. Correlation means there is a link between two events, but it doesn't automatically mean that one event caused the other. That's why it's essential to continue your sleuthing until you find a causal relationship. Find out what event caused another event. This will help you find the actual root cause.
From the data collected, chronological sequencing, and clustering, we should be able to create a causal graph (or use one of the root cause analysis tools we discuss later). You can use this graph to represent the relationship between various events that occurred and the data collected. The different paths are given different probability weights. They can serve as a visual tool to track down the root cause.
Example of a causal graph. Source: Adam Kelleher on Medium
Step 4: Solve the root of the problem
Once you've identified the root cause, you can quickly determine the best solution to fix it. You can then map it against the scope defined in your initial problem statement. If the solution works with your available resources, it can be implemented.
Fixing the root cause should eliminate the issues. If the symptoms occur again, it's time to return to the drawing board and conduct RCA again.
Once the problem is solved, you will need to take proactive steps to ensure it doesn't happen again. There can be multiple solutions applied to solve a single issue.
For example, the root cause could be the wear of a bearing, which happened much earlier than expected. In this case, the procedure has to be adjusted to change the bearing at an earlier time. Similar steps to avoid recurrence of fault can be changes in the maintenance schedule, different modes of maintenance, changes in design, different OEM vendors, etc.
The implemented solution will have to be in line with the available resources. So, if the root cause is pushing the machine too hard, the obvious answer is to shorten the machine run time. However, if the production schedule doesn't allow for shortened runtimes, another solution might be scheduling more preventive maintenance.
Tried-and-true RCA tools and techniques
There are many tried and trusted frameworks available to execute RCA. None of these methods are foolproof, but they provide a solid base for how to go about root problem investigation. Each method has its own list of benefits and shortfalls. Some methods are more suitable for different industries and types of problems.
You and your company should have your own unique protocol when conducting RCA. In some instances, external consultants might be brought in to conduct RCA. In such cases, the consultants will generally have their own preferred technique or a combination of techniques they use. This is one of the reasons why it is hard to create a universal template for RCA that everyone can follow.
Let's look at the different forms of root cause analysis.
5 Why analysis
5 Whys is the original technique developed by Sakichi Toyoda for root cause analysis at Toyota factories. The idea is to question everything with a why, just like a curious child. Keep asking why until you reach a stage where there is no need to ask again. At that point, you should have reached the root cause of the problem.
As a rule of thumb, asking and finding answers to 5 successive whys should be more than enough to reveal the root cause of most problems. Hence the name 5 why analysis.
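As a rough illustration, a 5 Whys session can be recorded as a simple chain of answers. The Python sketch below uses the blown-fuse example from later in this post. The `five_whys` function and its output format are assumptions made for this example, not part of any standard tool.

```python
# The problem statement, followed by the answer to each successive "why?".
# The chain follows the blown-fuse example later in this post.
problem = "The machine stopped."
answers = [
    "It overloaded and the fuse blew.",
    "A bearing was not being sufficiently lubricated.",
    "The automatic lubrication pump was not pumping sufficiently.",
    "The pump shaft was worn.",
    "Nothing prevented metal scrap from getting into the pump.",
]


def five_whys(answers):
    """Number each answer and return (steps, candidate root cause).

    The last answer is only a candidate: if another "why?" still makes
    sense, the chain should keep going.
    """
    steps = [f"Why {i}? {answer}" for i, answer in enumerate(answers, start=1)]
    return steps, answers[-1]


steps, root_cause = five_whys(answers)
print("Problem:", problem)
print("\n".join(steps))
print("Candidate root cause:", root_cause)
```

Writing the chain down this way makes it easy to spot a step where the answer does not really explain the one before it.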
Benefits of the 5 Whys:
When to use the 5 Whys:
Fishbone diagram (a.k.a. Ishikawa diagram)
The Ishikawa method for root cause analysis emerged from quality control techniques employed in the Japanese shipbuilding industry by Kaoru Ishikawa. The shape of the resulting diagram looks like a fishbone, which is why it is called a fishbone diagram. This diagram is built on the idea that multiple factors can lead to a failure/event/effect.
The 5 M framework (shown above) from the Toyota Production System uses RCA with the Ishikawa method. The 5 Ms are:
The problem or fault is written down at the far right end, where the fish head would be. The cause of the problem is represented along the horizontal line. Further effects and their respective causes are written down along the fish bones representing each of the 5 Ms. This process continues until the team is convinced that the root cause is identified.
Benefits of the fishbone diagram:
When to use a fishbone diagram:
Failure mode and effects analysis (FMEA)
FMEA is a proactive approach to root cause analysis, preventing potential failures of a machine or system. It is a combination of reliability engineering, safety engineering, and quality control efforts. It tries to predict future failures and defects by analyzing past data.
A diverse cross-functional team is essential when using FMEA. You will need to clearly define and communicate the scope of the analysis to your team members. Each subsystem, design, and process is closely reviewed. The purpose, need, and function of each system are questioned. Potential failure modes are brainstormed. Failure of similar processes and products in the past can also be analyzed.
The potential effects and disruptions that could be caused by each of the identified failure modes are assessed and used to calculate its RPN.
If the failure mode has a higher RPN than a company is comfortable with, you can address this by changing one or more factors outlined in the image above.
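RPN (risk priority number) is conventionally the product of three ratings: severity, occurrence, and detection. The sketch below assumes the common 1-to-10 scale for each rating. The failure modes, their scores, and the threshold of 100 are made-up values for illustration, so substitute the figures your own team agrees on.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (almost always caught) to 10 (almost never caught)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes and scores (illustrative only).
modes = [
    FailureMode("Bearing seizes", severity=8, occurrence=4, detection=6),  # RPN 192
    FailureMode("Belt slips", severity=4, occurrence=6, detection=3),      # RPN 72
    FailureMode("Coolant leak", severity=6, occurrence=3, detection=7),    # RPN 126
]

RPN_THRESHOLD = 100  # assumed comfort level; every company sets its own


def needs_action(modes, threshold):
    """Return the failure modes above the threshold, highest RPN first."""
    ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
    return [m for m in ranked if m.rpn > threshold]


for mode in needs_action(modes, RPN_THRESHOLD):
    print(f"{mode.name}: RPN {mode.rpn}")
```

Because the three ratings are multiplied, lowering any one of them (making the failure less severe, less frequent, or easier to detect) brings the RPN down.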
Benefits of FMEA:
When to use the FMEA methodologies:
Fault tree analysis (FTA)
Fault tree analysis is a method for root cause analysis that uses Boolean logic (using AND, OR, and NOT) to figure out the cause of failure. It was developed at Bell Laboratories to evaluate an intercontinental ballistic missile (ICBM) launch control system for the U.S. Air Force.
Fault tree analysis example. Source: Six Sigma Study Guide
Fault tree analysis tries to map the logical relationships between faults and the subsystems of a machine. The fault you are analyzing is placed at the top of the chart. If either of two causes can produce the effect on its own, they are combined with a logical OR operator. For example, if a machine can fail while in operation or while under maintenance, it is a logical OR relationship.
If two causes need to occur simultaneously for the fault to happen, it is represented with logical AND. For example, if a machine only fails when the operator pushes the wrong button AND the relay fails to activate, it is a logical AND relationship. It is represented using the Boolean AND symbol. In the image above, AND is the blue symbol, and OR is the purple symbol.
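The AND and OR gates map directly onto Boolean functions. The Python sketch below builds a tiny, hypothetical fault tree from the two examples above: the machine fails in operation only if the operator pushes the wrong button AND the relay fails, and the top-level fault occurs if it fails in operation OR during maintenance. The `maintenance_error` event and the function names are assumptions added for illustration.

```python
def and_gate(*inputs):
    """AND gate: the output fault occurs only if every input event occurs."""
    return all(inputs)


def or_gate(*inputs):
    """OR gate: the output fault occurs if at least one input event occurs."""
    return any(inputs)


def machine_fails(wrong_button_pressed, relay_failed, maintenance_error):
    """Evaluate the top event of a small, hypothetical fault tree."""
    # Failure in operation needs both basic events at the same time.
    fails_in_operation = and_gate(wrong_button_pressed, relay_failed)
    # The top event occurs if either branch occurs.
    return or_gate(fails_in_operation, maintenance_error)


# The wrong button alone is not enough to trip the AND gate.
print(machine_fails(True, False, False))
# Wrong button plus a failed relay triggers the top event.
print(machine_fails(True, True, False))
# A maintenance error triggers the top event on its own.
print(machine_fails(False, False, True))
```

Walking the tree from the top event down to the basic events shows which combinations of faults are worth investigating first.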
Benefits of using a fault tree analysis:
When to use a fault tree analysis:
Pareto chart
A Pareto chart indicates the frequency of defects and their cumulative effects. Italian economist Vilfredo Pareto recognized a common theme in almost all the frequency distributions he observed: there is a vast imbalance between causes and the effects they produce.
He proposed that in any system, roughly 80% of the results (or failures) are caused by roughly 20% of all potential reasons.
The principle is dubbed the Pareto principle (some know it as the 80-20 rule). This skew between cause and effect is evident in many different distributions, from wealth distribution among people to failures in a machine.
Pareto chart for shirt defects. Source: Tulip.co
With the 80-20 principle in mind, you can use Pareto analysis to dig into failures and possible causes. To start, draw a bar graph that includes the frequency of faults and causes. With this graph, it's easier to see the skew between causes and failures. Usually, you'll see how a small percentage of factors cause the majority of faults.
Next, you'll analyze the causes that contribute to the largest number of faults and take corrective action to eliminate the most common defects.
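The numbers behind a Pareto chart are easy to compute by hand or in a few lines of code. The Python sketch below ranks causes by frequency, adds a cumulative percentage, and picks out the "vital few" causes that account for about 80% of defects. The shirt-defect counts are invented for illustration, and the 80% cutoff is a rule of thumb rather than a fixed requirement.

```python
def pareto_table(counts):
    """Rank causes by frequency and attach the cumulative percentage."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    rows, running = [], 0
    for cause, n in ranked:
        running += n
        rows.append((cause, n, round(100 * running / total, 1)))
    return rows


def vital_few(rows, cutoff=80.0):
    """Return the top causes needed to reach the cumulative cutoff."""
    selected = []
    for cause, _count, cumulative in rows:
        selected.append(cause)
        if cumulative >= cutoff:
            break
    return selected


# Hypothetical defect counts (illustrative only).
defects = {
    "Loose threads": 45,
    "Stains": 30,
    "Missing buttons": 10,
    "Uneven hems": 8,
    "Wrong size label": 4,
    "Crooked collar": 3,
}

rows = pareto_table(defects)
for cause, count, cumulative in rows:
    print(f"{cause:<18}{count:>4}{cumulative:>7}%")
print("Focus on:", vital_few(rows))
```

In this made-up data set, three of the six causes cover 85% of defects, which is where corrective action would pay off first.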
Benefits of using Pareto charts:
When to use a Pareto chart:
Root cause analysis is very open-ended and has a lot of widely used tools in various industries. We covered the major ones in the sections above, but these systems also deserve some recognition. A few honorary mentions:
CMMS to save the day
If you're feeling overwhelmed by all the different methods, metrics, and charts, not to worry, we've got your back. A computerized maintenance management system, or CMMS, can help you easily create, record, and track data used in root cause analysis.
With Limble, you can also create your own 5 Whys template and save it for use in the future. This makes it easy for anyone to quickly start a 5 Why RCA, repeating the same steps for consistent results.
To create your own 5-Whys template in our CMMS, you create a work order template, your space to record what happened. Below that, you can add child instructions asking the why. Your first Why can be to run a test to determine if the fault was a fluke or something is actually broken. You can also use custom tags to pull reports on just those specific work order templates. This gives you a clean, well-documented approach to RCA. You can easily show management, look like the star detective, get a promotion, and a big fat raise.
OK, so maybe it won't all happen in that exact order. But at the very least, it will make your life a lot easier when it comes to fixing issues.
Root cause analysis examples
RCA example #1: The case of the faulty parts
Injection molding machines are widely used around the world to create plastic in almost any shape or form. The part the machine produces should match specifications within the allowable tolerance.
Let's say there is a high incidence rate of faulty products, and we need to get to the bottom of it.
First, the problem needs to be well defined. This includes explaining the exact defect the plastic output is having. By observing the output, we can determine if it is one of the four primary defects within injection molding. They are:
Let's presume that the defect is part distortion. First, write down the problem, including the number of defects occurring as a percentage. Once that is completed, collect all the available data. Pull any maintenance logs from your CMMS, review manuals from the injection molding machine manufacturer, and so on.
Collect information on each defective product. From this, measure the deviation from specifications. Take the heat signature of the product once it comes out of the mold, then measure the temperature of molten plastic in the barrel.
We know that part distortion almost always occurs due to temperature problems. But we cannot be sure where the temperature problem is. Is it in the barrel while heating or in the mold while cooling?
By analyzing the data you collected, you would be able to identify that. For this example, we'll assume the heat signature of the finished product is different from the expected one.
This indicates that the problem is in the cooling process. Further investigation concludes that the root problem is the wrong spatial arrangement of the cooling liquid conduits.
Changing the conduit arrangement to best fit the mold currently in use will solve the problem of part distortion.
RCA example #2: The mystery of the blown fuse
Next, let's say a machine stopped because it overloaded and the fuse blew.
Investigation shows that the machine overloaded because it had a bearing that wasn't being sufficiently lubricated.
Your investigation continues, and you find that the automatic lubrication mechanism had a pump that was not pumping sufficiently. A review of the pump shows that it has a worn shaft. Investigation of why the shaft was worn reveals that there isn't an adequate mechanism in place to prevent metal scraps from getting into the pump. This enabled scraps to get into the pump and damage it.
The apparent root cause of the problem is metal scrap contaminating the lubrication system. Fixing this problem should prevent the whole sequence of events from happening again. The real root cause could be a design issue if no filter prevents the metal scrap from getting into the system. Or if it has a filter that was blocked due to a lack of routine maintenance, then the actual root cause is a maintenance issue.
Compare this with an investigation that does not find the causal factor: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply reoccur until the root cause is dealt with. (This example originally appeared here).
Nice work, detective.
Additional RCA Resources
Now is not the time to cut corners
Root cause analysis is complex and should not be done on a whim. Your team might decide to cut corners to save on time and speed up the process. But if you want to get to the bottom of any complex event, rushing the process can be detrimental to the whole project. When you have a good reason to conduct RCA, it is in your best interest to create an environment where the process can be executed successfully.
If you want to know how a CMMS could make your job less stressful, get started with Limble on a free trial, or set up a demo with our team.