I wanted to call this post “Failing Successfully”, but I changed my mind and decided to paraphrase the famous epistemologist of randomness and risk, Nassim Nicholas Taleb.
“Every plane crash has lowered the probability of next plane crash. That is a system that is overall anti-fragile. You never let a mistake go to waste”.
Antifragility is a powerful concept, and it goes beyond resilience. Resilience is about getting back up when you fall. Antifragility is gaining from the fall and getting back up stronger. There is a famous Japanese proverb that says: “Fall seven times, stand up eight.” To me this is the essence of resilience. Antifragility, however, is falling seven times and standing up stronger each time than before. In Taleb’s words, antifragile things gain from disorder.
We arguably learn more from mistakes and failures than from successes. A failure challenges our mental model and shows that there was something the model did not account for. From an information theory standpoint, failures carry more information than successes, which carry little or none: when success is the expected outcome, a failure is the surprising event. When we succeed, we cannot tell whether it is because our mental model is correct or because of something else, and we rarely look any further. Similarly, when we fail, we do not immediately know whether the cause is an incorrect mental model or something else. However, we are far more determined to find out why we failed. Taleb has also said:
“It does not matter how frequently something succeeds if failure is too costly to bear.”
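The information-theory point above can be made concrete with a small sketch. It uses the standard definition of surprisal (self-information), -log2(p), and an assumed success rate of 99% chosen purely for illustration; the function name `surprisal_bits` is my own.

```python
import math


def surprisal_bits(p: float) -> float:
    """Self-information, in bits, of an outcome that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# Assumption for illustration: the process succeeds 99% of the time.
p_success = 0.99

success_bits = surprisal_bits(p_success)      # a small fraction of a bit
failure_bits = surprisal_bits(1 - p_success)  # several bits

print(f"success: {success_bits:.3f} bits, failure: {failure_bits:.3f} bits")
```

The rarer the failure, the more bits it carries when it does happen, while a routine success tells us almost nothing we did not already expect.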
Safe to Fail Environment:
Our aversion to failure is generally tied to its consequences. This is where the concept of “safe to fail” probing comes in. The idea is to knowingly create environments where we might fail, but where the failures cause minimal damage. In other words, we cause failures in a controlled environment. We are encouraged to experiment as often as possible so that we can uncover potential weak spots. Dave Snowden of Cognitive Edge (co-creator of the Cynefin framework) has done a lot of work in this area. He described the importance of safe to fail experiments within a complex system as follows:
One of the main (if not the main) strategies for dealing with a complex system is to create a range of safe-fail experiments or probes that will allow the nature of emergent possibilities to become more visible.
I want to draw attention to the “emergent possibilities” in his statement. The challenge with a complex system is to understand the range of possible emergent outcomes, since there are no clear linear cause and effect relationships between the parts. This is why failures are sometimes unpredictable and can have devastating consequences. The following principles are inspired by Dave Snowden’s work.
- If it is not broke, why is it not broke? Success does not mean the absence of failure points.
- Experiment as often as possible with the anticipation of failures.
- Monitor the experiments and have resources available to react to failures.
- Teach others to experiment, and create an environment that is not only tolerant of failures but also encourages innovation and creativity.
- Be a lifelong learner and share what you have learned.
A good example of all this is Netflix’s Chaos Monkey. Chaos Monkey is a software service that creates “chaos” on purpose in a safe to fail environment. From Netflix’s blog:
We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.
Chaos Monkey runs only during certain hours when resources are available to respond, which again keeps the environment safe to fail. Netflix reported that Chaos Monkey keeps surprising their team by uncovering many hidden failure points.
There are many failure scenarios that Chaos Monkey helps us detect. Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don’t happen again.
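To illustrate the idea of a safe to fail probe, here is a minimal sketch in the spirit of Chaos Monkey. It is not Netflix’s implementation: the pool is just a set of names, the “termination” only removes a name from that set, and the weekday 09:00 to 15:00 window, the function names, and the `probability` parameter are all assumptions made for illustration.

```python
import random
from datetime import datetime, time

# Assumed window in which people are available to react to a failure.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)


def within_window(now: datetime) -> bool:
    """True on weekdays (Mon-Fri) inside the assumed staffed window."""
    return now.weekday() < 5 and WINDOW_START <= now.time() < WINDOW_END


def run_probe(instances: set, now: datetime, rng: random.Random,
              probability: float = 1.0):
    """Maybe terminate one random instance; return its name, or None.

    Nothing happens outside the window, so failures are only caused
    when someone is around to notice and respond.
    """
    if not instances or not within_window(now):
        return None
    if rng.random() >= probability:
        return None
    victim = rng.choice(sorted(instances))
    instances.discard(victim)  # stand-in for a real termination call
    return victim


pool = {"web-1", "web-2", "web-3"}  # hypothetical instance names
victim = run_probe(pool, datetime(2024, 1, 1, 10, 0), random.Random())
print(f"terminated: {victim}, remaining: {sorted(pool)}")
```

The guard on the time window is what makes the probe “safe”: the experiment is deliberately skipped whenever nobody would be on hand to monitor it and react.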
Learning from failures and getting stronger from them is an organic principle. This is how an individual or an organization grows. Getting up from a fall is resilience, but getting up from a fall, learning from it, and coming back stronger is antifragility. Either way, never let a mistake go to waste, and reduce the probability of the next failure!
I will finish with a great story about Tom Watson Jr., CEO of IBM in the 1950s.
It is said that while Tom Watson Jr. was the CEO, he encouraged people to experiment and learn from failures. One of his VPs led a project that failed and cost IBM millions of dollars. The VP was distraught when he was called to Tom Watson’s office. He expected to be fired for his mistake and quickly typed up a resignation letter. The VP gave the letter to Tom Watson and was about to leave the office. Tom Watson shook his head and said, “You think I will let you go after giving you millions of dollars worth of training?”
Always keep on learning…
In case you missed it, my last post was The Forth Bridge Principle.