The Anatomy of an Isolated Incident:

I read about the death of Bob Ebeling today. He was an engineer at Morton Thiokol, a NASA contractor, who tried to stop the launch of the space shuttle Challenger in 1986. On January 28, 1986, soon after launch, the Challenger was engulfed in flames. All seven crew members lost their lives in this terrible accident. Nobel laureate Richard Feynman was part of the Rogers Commission, which investigated the Challenger accident. Feynman wrote about this investigation in depth in his 1988 book “What Do You Care What Other People Think?”

In today’s post, I will be looking at Isolated Incidents. There have been times in my career when I was taken aback by isolated events. These events happen very rarely, and thus it is not easy to understand their root causes. I will use the Challenger accident as the primary example. There were 135 NASA space shuttle missions between 1981 and 2011. Of those, 133 flights went as planned, and two ended in disaster.

The O-Ring Fiasco:

The Rogers Commission identified that the Challenger accident was caused by a failure in the O-rings that were used to seal a joint on the right solid rocket booster. Bob Ebeling was among the group of engineers who had warned NASA against the launch because of their concerns about the seals. The O-rings were not proven to work under cold conditions, and the temperature was below freezing on the day of the launch. Feynman famously demonstrated the problem by immersing a piece of O-ring in a glass of ice water and showing that the chilled material was less resilient and was slow to spring back to its original shape. This lack of resilience caused the failure of the seals, leading to the Challenger catastrophe.

The Rogers Commission indicated that the following issues led to the Challenger accident:

  • Improper material used for the O-ring.
  • Lack of robust testing – NASA had not established that the O-ring material would function as intended. Even though the O-ring manufacturer provided data showing that the O-rings performed poorly at low temperatures, NASA management did not heed it.
  • Lack of understanding of risk from NASA management.
  • Potential push from management to launch the space shuttle to meet a rush deadline.

Feynman also wrote about the great disparity between how NASA management and the engineers viewed risk. NASA management assigned a probability of 1 in 100,000 to a failure with loss of vehicle. However, when Feynman asked the engineers, he got estimates as high as 1 in 100. Feynman reviewed the NASA document that discussed the risk analysis of the space shuttle and was surprised to see extremely low probability values for failures. In his words:

The whole paper was quantifying everything. Just about every nut and bolt was in there. “The chance that a HPHTP pipe will burst is 10^-7”. You can’t estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate. It was clear that the numbers for each part of the engine were chosen so that when you put everything together you get 1 in 100,000.

Feynman also wrote about an engineer who was candid with him about his own estimate: he had calculated the risk as 1 in 300. However, he did not want to tell Feynman how he got his number!
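
To get a feel for how far apart these estimates are, here is a minimal Python sketch. It assumes every flight carries the same, independent chance of loss (a simplification of mine, not something Feynman or NASA stated) and applies each quoted estimate to the 135 missions that were eventually flown. The function name is only illustrative.

```python
def prob_at_least_one_loss(p_per_flight: float, flights: int) -> float:
    """Chance of one or more losses in `flights` independent flights."""
    return 1.0 - (1.0 - p_per_flight) ** flights

# Estimates quoted in the post; independence between flights is an assumption.
estimates = {
    "NASA management (1 in 100,000)": 1 / 100_000,
    "Candid engineer (1 in 300)": 1 / 300,
    "Most pessimistic engineers (1 in 100)": 1 / 100,
}

FLIGHTS = 135  # shuttle missions flown between 1981 and 2011

for label, p in estimates.items():
    chance = prob_at_least_one_loss(p, FLIGHTS)
    print(f"{label}: {chance:.1%} chance of at least one loss")
```

Under these assumptions, the management figure implies well under a 1% chance of losing any vehicle over the whole program, while the engineers' figures imply somewhere between roughly a third and three quarters.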

The Anatomy of an Isolated Incident:

I have come to view the Isolated Incident cause-effect relationship as an equation. This is shown below.

Isolated Incident = Cause(s) + System weak points + Enabling Conditions

The Challenger Accident can be summarized:

Challenger Accident = Material limitation of the O-ring + NASA Management Policies + Cold conditions

The System Weak Point(s) are internal in nature. The enabling conditions, on the other hand, are external in nature. When all three factors come together in a perfect storm, you get an isolated incident. None of the factors by itself may be enough to cause the problem, so if we do not know all three, we cannot solve the isolated incident.
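
One way to read the equation is as a logical AND: the incident needs all three ingredients at once. The Python sketch below is only an illustration under assumptions of my own. The three factors are treated as independent, and the probabilities are made-up numbers, not data from the Challenger investigation.

```python
import random

def incident_occurs(cause: bool, weak_point: bool, enabling_condition: bool) -> bool:
    """An isolated incident needs all three factors present together."""
    return cause and weak_point and enabling_condition

def simulate(trials: int, p_cause: float, p_weak: float, p_enable: float,
             seed: int = 0) -> float:
    """Fraction of trials in which all three (assumed independent) factors line up."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if incident_occurs(rng.random() < p_cause,
                           rng.random() < p_weak,
                           rng.random() < p_enable):
            hits += 1
    return hits / trials

# Hypothetical probabilities, chosen only for illustration.
P_CAUSE, P_WEAK, P_ENABLE = 0.2, 0.1, 0.05

expected = P_CAUSE * P_WEAK * P_ENABLE   # product rule for independent events
observed = simulate(200_000, P_CAUSE, P_WEAK, P_ENABLE)
print(f"expected rate: {expected:.4f}, simulated rate: {observed:.4f}")
```

Each factor on its own is fairly common in this toy model, yet the combination shows up in only about one trial in a thousand, which is why the incident looks isolated.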

Another example is a spike in demand that doubles production. If the process is not robust enough to handle the spike, isolated events can happen.

Pontiac’s Allergy to Vanilla Ice Cream:

I will finish this post with a fantastic story I read from Snopes:

The Pontiac division of General Motors received a complaint in the form of a letter. The letter was from a frustrated customer. He had been trying to contact the company for a while.

He wrote in the letter that he and his family were in the habit of buying ice cream after dinner. The flavor they purchased depended upon the mood of the family. He had recently purchased a new Pontiac, and he had been having issues on his ice cream trips. He had figured out that the new car was allergic to vanilla ice cream.

If he purchased any other flavor, his car would start with no problem. However, if he purchased vanilla ice cream, his car would not start.

“What is there about a Pontiac that makes it not start when I get vanilla ice cream, and easy to start whenever I get any other kind?”, he asked in the letter.

The letter was delivered to the Pontiac President, who was very amused by it. He sent an engineer to investigate the fantastic problem. The engineer went with the family for three nights to get ice cream in the new car. The first night, the family got chocolate ice cream, and the car started with no problem. The second night, they got strawberry. The car again was fine. On the third night, the family got vanilla ice cream; lo and behold, the car would not start.

This was repeated on multiple days, and the results were always the same. The engineer was a logical man, and this stumped him. He took notes on everything. The only variable he could find was time. The family always took the shortest amount of time when they purchased vanilla ice cream. This was because of the store layout: vanilla was quite popular and was kept at the front of the store. Suddenly, the engineer identified why the isolated incident happened. “Vapor lock”, he exclaimed. For all the other flavors, the longer stop allowed the engine to cool down sufficiently to start without any issues. When vanilla ice cream was bought, the engine was still too hot for the vapor lock to dissipate.

Always keep on learning…

In case you missed it, my last post was Kintsukuroi and Kaizen.

Is Murphy’s Law alive and well?

In today’s post, I will be trying to look into Murphy’s Law.

There are multiple versions of this law, the most common being “whatever can go wrong will go wrong”. Some other variations are as follows:

  • If there is a possibility of several things going wrong, the one that will cause the most damage will be the one to go wrong.
  • If everything seems to be going well, you have obviously overlooked something.
  • Nature always sides with the hidden flaw.

Murphy’s Law makes a pessimist out of the most optimistic man. Is it true that the universe has a tendency for causing things to fail? Does Murphy make a rational man go “Why me?”, when something unexpected happens?

A common version of Murphy’s law is the case of buttered toast: the toast always falls on the buttered side. Let’s look into this more deeply.

Does buttered toast listen to Murphy?

The following section is taken from “The Australian Journal: A Weekly Record of Literature, Science, and Art, Volume 23”, from 1888. The highlighted section shows that the idea of buttered toast/bread falling on its buttered side was common even in the 1800s.

[Image: highlighted excerpt from The Australian Journal, Volume 23 (1888)]

Interestingly, studies have shown that buttered toast falls on its buttered side almost 62% of the time. This means that it is not a fifty-fifty chance like flipping a coin. Why? This seemingly curious “bad luck” can be explained with science. Delving into the case of buttered toast, it becomes clear that the following factors always remain the same:

  • The toast always starts with the buttered side face-up
  • The height of the fall is similar (2-3 feet). This is because the toast is generally held at waist height, and in the case of falling from a table, the standard table height is between 2 and 3 feet.

These two factors increase the chances of the toast falling on the buttered side. In fact, studies have shown that when toast is thrown up in the air, the likelihood decreases to fifty-fifty. Alternatively, when the toast is dropped from a height of 7-8 feet, the odds reverse: the likelihood of the toast falling on the unbuttered side goes up to about 62%. The reader can find more about this here and here.
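
The reasoning above can be sketched with a toy model in Python. It assumes the toast starts butter-up, spins at a constant rate once it leaves the edge, and falls freely without air resistance. The spin rate of 8 rad/s is a hypothetical value of mine, picked only for illustration; it is not a figure from the studies mentioned.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def fall_time(height_m: float) -> float:
    """Free-fall time from rest, ignoring air resistance."""
    return math.sqrt(2.0 * height_m / G)

def lands_butter_down(height_m: float, spin_rad_per_s: float) -> bool:
    """True if the net rotation at impact is between a quarter and three
    quarters of a revolution, so the buttered face points at the floor."""
    angle = (spin_rad_per_s * fall_time(height_m)) % (2.0 * math.pi)
    return math.pi / 2.0 < angle < 3.0 * math.pi / 2.0

SPIN = 8.0  # rad/s, a hypothetical spin rate picked for illustration

for feet in (2.5, 7.5):
    metres = feet * 0.3048
    side = "butter-down" if lands_butter_down(metres, SPIN) else "butter-up"
    print(f"{feet} ft: {side}")
```

With this assumed spin rate, a fall from table height gives the toast time for only about half a turn, while a fall from 7.5 feet gives it time to come most of the way back around. The point is that the starting orientation and the height, not bad luck, decide the outcome.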

Does Murphy still seem threatening?

Factors which cause Murphy to visit:

I have compiled a list of factors that explain why Murphy seems to prevail.

  • Nature of humans: Humans always remember when something bad happens to them. Do you remember the last time your car broke down and you had to call for it to be towed away? Do you remember the other 99.9% of the time, when you did not have any problems with the car? Since your brain likes to avoid making mistakes, it recalls the bad times more readily so that you do not repeat the same mistakes. The downside is that you can start noticing only the bad events. Think of a large white paper with a small black dot. Our attention is on the black dot, and not on the remaining 99.9% of white space.
  • Law of large numbers: The bad thing about events with relatively small probabilities is that they will still happen. No matter how small the probability, with enough chances the event will happen. The probability of winning the Powerball lottery is 1 in 292,201,338. Even with such a small probability, people still win the lottery on a regular basis. The probability of somebody winning goes up when the prize gets really high (>$300 Million), because a larger number of tickets is sold during that time. As the Law of large numbers dictates, with enough chances even the low probability event of winning a lottery happens.

Combining the Nature of Humans, and the Law of large numbers, you have the perfect storm that allows Murphy to rule the world. The egocentric view of humans tends to make events about them, when from a probability standpoint, it could have happened to anybody. There is a profound difference between asking “What are the chances of it happening” and “What are the chances of it happening to me?”
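
The lottery argument can be put into numbers with a short Python sketch. It assumes every ticket is an independent random pick, and the ticket counts are hypothetical values chosen only to show the trend; they are not actual sales figures.

```python
POWERBALL_ODDS = 292_201_338  # 1-in-N jackpot odds per ticket, as quoted above

def prob_someone_wins(tickets_sold: int, odds: int = POWERBALL_ODDS) -> float:
    """Chance that at least one of `tickets_sold` tickets hits the jackpot,
    assuming every ticket is an independent random pick."""
    return 1.0 - (1.0 - 1.0 / odds) ** tickets_sold

# Hypothetical sales figures, chosen only to show the trend.
for tickets in (1, 10_000_000, 100_000_000, 500_000_000):
    chance = prob_someone_wins(tickets)
    print(f"{tickets:>11,} tickets: P(someone wins) = {chance:.4f}")
```

The chance for any single ticket stays negligible, but the chance that somebody wins climbs steadily as more tickets are sold. That is the difference between “what are the chances of it happening” and “what are the chances of it happening to me”.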

  • Law of Nature: It is the law of nature that everything degrades over time. Eventually, all products will fail. A good example is when you move into a new house, and after about 7 years, more than one appliance starts to break down. First it was the refrigerator, and now it is your washer as well. The fact that the two appliances were bought together might escape your mind, and you will blame Murphy.
  • Poor Processes: In relation to item 3 discussed above, if you have poor processes, the chance of multiple things failing goes up. A good example is poor preventive maintenance procedures. Multiple pieces of equipment can break down at the same time if they are not maintained properly. If one piece of equipment can go bad, there is a good likelihood of another going bad at the same time when both are covered by the same poor preventive maintenance program. A poorly designed system can become a playground for Murphy.
  • Special Causes: Sometimes the unlikely event(s) happen due to special causes. Sometimes this special cause can be an enabling condition that allows multiple things to break down. The special cause at times is people. People are inherently inconsistent, and they can add inadvertent variation to the process that makes things go wrong.
  • Complexity and Chaos: Murphy’s law is very much relevant in the presence of complexity and chaos. In the presence of disorder and uncertainty, the reliability of a system can break down easily. Any order from constraints is disrupted, and this allows more things to go wrong. I welcome the reader to visit the Cognitive Edge website to learn more about this.

Final Words and the story of Arthur Ashe:

As detailed in the buttered toast section, it is imperative that one tries to understand why something went wrong. What are the factors affecting the process? What are the chances of the event happening? Is there indeed a pattern, or is the pattern created by perception? The buttered toast is a rigged game where there is a high likelihood of the toast falling on its buttered side when dropped from a height of 2-3 feet.

WHY ME?

Arthur Ashe, the legendary Wimbledon player, was dying of AIDS, which he contracted from infected blood he received during heart surgery in 1983.

He received letters from fans the world over. One of them asked: “Why does God have to select you for such a bad disease?”

To this Arthur Ashe replied: The world over–50,000,000 children start playing tennis, 5,000,000 learn to play tennis, 500,000 learn professional tennis, 50,000 come to the circuit, 5000 reach the grand slam, 50 reach Wimbledon, 4 to semi finals, 2 to finals. When I was the one holding the cup, I never asked god “Why me?”

And today in pain, I should not be asking GOD “why me?”

Always keep on learning…

The Mysterious No Fault Found:

As a Quality Engineer working in the Medical Device field, I find nothing more frustrating than a “no-fault-found” condition on a product complaint. The product is returned by the customer due to a problem while in use, and the manufacturer cannot replicate the problem. This is commonly referred to as no-fault-found (NFF). I could not find a definitive NFF rate for medical devices. However, I did find that in the avionics industry, NFF accounts for 40-60% of all complaints.

NFF is also described as “cannot duplicate”, “trouble not identified”, “met all specifications”, “no trouble found”, or “retest ok”. This menacing condition can be quite bothersome for the customer as well as the manufacturer. In this post, I will try to define some red flags that one should watch out for, and a list of root causes that might explain the NFF condition. I will finish off with a great story from the field.

Red flags:

The following list contains some of the red flags that one should watch out for if no fault was found with the returned product. This list is by no means exhaustive, but it might provide some guidance.

  • Major incident associated with the complaint – If the return was associated with a major incident such as a serious injury or even worse, death, one should test the unit exhaustively to identify the root cause.
  • Unit was returned more than once – If the unit was returned for the same problem, it is an indicator of an inherent root cause creating the problem. Sometimes, an existing condition can act as an enabling condition and can create more than one effect. In this case, the problem may not be the same for the second or third return. Alternatively, the enabling condition can be present at the customer’s site.
  • Nonstandard Rework(s) performed on the unit during production – I am a skeptic of reworks. A rework is a deviation from normal production, and sometimes fixing one thing can cause another thing to fail.
  • The product is part of the first lots produced after a major design change – If the product validation process is not adequate or if proper stress tests were not performed, the unit can be produced with latent issues/bugs.
  • More than one customer reporting the same problem – If there is more than one source reporting the problem, it is a clear indication of an inherent issue.

Potential root causes for NFF condition:

The following list contains some of the root causes that might be associated with a no-fault condition. Again, this list is by no means exhaustive.

  • Adequacy of test methods – If the test method is susceptible to variations, it may not catch failures. This cause is self-explanatory.
  • Excess stress during use – Reliability Engineering will tell you that if the stress during use exceeds the inherent strength of the product, the product will fail. This stress can be environmental or can be due to use beyond the intended use of the product. An example is if the product is used at a wrong voltage.
  • New user or lack of training – If the end user is not familiar with the product, he/she can induce a failure that might not occur otherwise. This is not an easy root cause to figure out. Sometimes it is evident from the appearance of the product in the form of visible damage (dents, burn marks, etc.).
  • High number of features – Sometimes, the higher the number of features, the greater the complexity of the product and the worse its ease of use. If the product is not easy to use, it can create double or triple fault conditions more easily. A double or triple fault condition occurs when two or three conditions must be met together for the fault to happen. This is considered to be isolated in nature.
  • Latent bugs/issues – No matter how much a new product design is tested, all the issues cannot be identified. Some of the issues are left unidentified and thus unknown. These are referred to as latent issues/bugs. This is the reason why your mobile phone or your computer requires updates or why some cars are recalled. These bugs will result in failures that are truly random and not easy to replicate.
  • Problem caused by an external accessory or another component – The product is sometimes used as part of a system of devices. Sometimes, the fault may lie with another component. When the product is returned, it may not be accompanied by all of its accessories, and it will be quite hard to replicate the complaint.
  • Lack of proper validation methods – Not all of the problems may be caught if the validation methods are not adequate. This cause is similar to, but not the same as, latent bugs/issues. Here, if no stress testing such as transportation or environmental testing was performed, obvious failures may not be caught.
  • Customer performed repairs – Sometimes, the failure was induced by something that the customer did on the product. This may not always be evident unless revealed by the customer.
  • Customer bias – This is most likely the hardest cause to identify on this list. Sometimes, the customer may “feel” that the product is not functioning as intended. This could be because they experienced a failure of the same brand at another time, and the customer feels that the entire product brand is defective.
  • Other unknown isolated event – Murphy’s Law states that “whatever can go wrong will go wrong.” Law of Large Numbers loosely states that “with enough number of samples, even the most isolated events can happen.” Combined together, you can have an isolated incident that happened at the customer site and may never happen at the manufacturer site.
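
The last point, together with the double fault condition described in the list, can be illustrated with a rough Python sketch. Every number below is hypothetical, and the two conditions are assumed to be independent. The only aim is to show why a fault seen in the field may never appear on the manufacturer's test bench.

```python
def expected_double_faults(units: int, uses_per_unit: int,
                           p_condition_a: float, p_condition_b: float) -> float:
    """Expected number of uses in which two independent conditions coincide."""
    return units * uses_per_unit * p_condition_a * p_condition_b

# All numbers below are hypothetical and only illustrate the scale effect.
P_A = 1 / 1_000   # e.g. an unusual environmental stress during a use
P_B = 1 / 1_000   # e.g. an unusual sequence of user actions during a use

# Many units used many times at customer sites.
field = expected_double_faults(units=5_000, uses_per_unit=1_000,
                               p_condition_a=P_A, p_condition_b=P_B)
# One returned unit exercised a limited number of times by the manufacturer.
bench = expected_double_faults(units=1, uses_per_unit=200,
                               p_condition_a=P_A, p_condition_b=P_B)

print(f"expected double faults in the field: {field:.1f}")
print(f"expected double faults on the bench: {bench:.4f}")
```

With these made-up figures, the field population would be expected to see the double fault a handful of times, while a bench test of the single returned unit would almost certainly see nothing. That is one plausible route to an NFF result.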

The mystery of diathermy burns:

I got this story from the great book “Medical Devices: Use and Safety” by Bertil Jacobson MD PhD and Alan Murray PhD. Sometimes, a surgery that uses a device like an RF Generator can cause burns on the patient from the heat induced by the device. These are referred to as “diathermy burns”.

A famous neurosurgeon retired and started working at a private hospital. Curiously, after a certain date, five of his patients reported that they had developed ugly, non-healing ulcers. These were interpreted as diathermy burns. The burns were present on the cheek bones of the patients who were placed face-down for the surgery and on the neck region of the patients who were operated on in the supine position (face-upward). The surgeon had had a long, uneventful, and successful career with no similar injuries ever reported.

No issues were found with the generator used for the surgery. A new generator was purchased, and the problem persisted. The manufacturer of the generator advised replacing the wall outlet. The problem still persisted. The surgery routines were updated and rigorous routines involving specific placement of electrodes were put in place. The problem still persisted.

A clinical engineer was consulted. He also could not find any fault with any of the equipment. At that point, he asked to witness the next operation. During this, it was discovered that the new assistant surgeon was resting his hands heavily on the patient’s head during the operation. Thus, the diathermy burns were actually pressure necroses caused by the assistant surgeon. These apparently can be misinterpreted as diathermy burns!

This story, in a curious way, implies the need to go to the gemba as well! Always keep on learning…