Practicing To Fail In Order To Succeed Big
Introducing GameDay scenarios into some of these Web-scale companies has initiated a difficult cultural shift from a steadfast belief that systems should never fail — and if they do, focusing on who’s to blame — to actually forcing systems to fail. Rather than expending resources on building systems that don’t fail, the emphasis has started to shift to how to deal with systems swiftly and expertly once they do fail — because fail they will. Resilience Engineering: Learning to Embrace Failure, Communications of the ACM, Nov 2012.
The COO challenged me: “And this problem will never happen again, will it?” My response was, “You can’t afford a 100% guarantee!” Her idea of leadership and management was to get someone to say, “I promise it will never happen again.” That way she had someone to blame if something went wrong.
This is a classic example of the “bad management habits” I find in organizations. It sounds somewhat reasonable to demand an answer to this kind of question. If we look through the chain of events, we can almost always find a point where, had someone done something differently, the problem would not have happened. From there it is a short step to believing we can arrange things so that no one will ever make a mistake.
See Hunting For The Guilty — The Story Of The Bulb
I once got a team to sign off on a plan (long story, not a good one) by asking them, “If the inputs you get are perfect, can you make this schedule?” They said yes, while knowing there was no way such inputs could ever be perfect.
In a Dilbert comic, Dogbert says, “You need to have more ‘gotcha’ fees, that’s how airlines make their money.” It is also how credit card issuers get more money out of us. If we make one mistake, we get an increased interest rate and a fee. These companies are happy we made the mistake. Our mistakes are part of their business plan. It is easy to set up processes that appear reasonable but are in fact error-prone.
Compare with Eliminating Honesty Buffers
I had laid out the plan to deliver our next big product. I specifically showed where, if something was late or was not delivered or finished on time, we could absorb the slip without impacting the schedule. One of the business managers yelled out in absolute disbelief, “You are planning to fail?!”
The business manager’s notion was that the right way to plan was to assume everything would go as planned, to leave no room in the plan for fixing issues, and to get an agreement from everyone (on the implied threat of losing their job if they could not agree) that they could and would deliver on time. Do you think this was an environment where most projects were on time? No, of course not. In fact, no projects were ever on time; it was only a question of how late they would be.
However, in his view a late project was clearly not the business manager’s fault, even given our track record, because he had extracted a promise of perfection from each project manager. More than once I heard the lament, “They promised they could do it!”
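To make the difference between the two kinds of plan concrete, here is a minimal sketch in Python. The task names, durations, buffer sizes, and slips are all invented for illustration, not taken from the actual project, and the sketch assumes tasks run one after another with the slack pooled across the whole schedule.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    estimate_days: int  # what the team promised
    buffer_days: int    # slack planned for this task (hypothetical field)


def committed_days(tasks):
    """Delivery commitment: every estimate plus every planned buffer."""
    return sum(t.estimate_days + t.buffer_days for t in tasks)


def finish_day(tasks, slips):
    """Actual finish day for sequential tasks, given days of slip per task name."""
    return sum(t.estimate_days + slips.get(t.name, 0) for t in tasks)


def absorbs(tasks, slips):
    """True if the plan still hits its committed date despite the slips."""
    return finish_day(tasks, slips) <= committed_days(tasks)


# Hypothetical plan with explicit room for things going wrong.
buffered_plan = [
    Task("design", 10, 2),
    Task("build", 20, 5),
    Task("test", 10, 3),
]

# The "promise of perfection" version: same estimates, no slack at all.
perfect_plan = [Task(t.name, t.estimate_days, 0) for t in buffered_plan]

# Assumed slips: the build runs 6 days late, testing 2 days late.
slips = {"build": 6, "test": 2}

print("buffered plan on time:", absorbs(buffered_plan, slips))
print("perfect plan on time: ", absorbs(perfect_plan, slips))
```

With these made-up numbers, the buffered plan commits to 50 days and finishes on day 48, while the plan built on promises commits to 40 days and misses by 8. The work and the problems are identical; only the commitment differs.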
For a humorous view see Avoid This Requirements vs. Schedule Tradeoff Trap
I asked an engineering team when they could deliver their part of the project. They gave me a date. I then asked them what their recovery plan was if they missed that date. Their response? They didn’t plan to miss the date. I then asked them if they always made their dates. They said no (in fact, they never did, nor did anyone else). Again, the team had been conditioned never to admit to missing a date and not to plan for that eventuality.
The point of all of this is that errors are a part of business and life. I’ve often argued, for example, that risk management is not about eliminating risks, but about being ready to minimize the impact if a risk becomes a reality. The same goes for errors. Errors will happen, and how we plan to handle them is important. When I was in the Air Force and in charge of system security, we would regularly run security tests and evaluations (i.e., break the system, open a hole in security, etc.) and see how everyone responded. Quickly fixing security problems became efficient and routine.
Security is a process. It is a martial art you can learn to apply by study, thought, and constant practice. If you do not drill and practice regularly, you will get rusty at it, and it will not serve you when you need it. Even if you do become expert at it, an attacker may sometimes overpower you. The better you get at the process, however, the smaller the number of opponents that can do you harm, the less damage they can do, and the quicker you can recover. Who Must You Trust? Communications of the ACM, July 2014.
See more in Making Your Project Risk Free
In all the cases above, we persevered with evaluating risks and planning for contingencies. The resulting projects did encounter problems (as they all will). The only difference was that these problems affected neither the schedule nor the quality of our projects, as they had on previous projects. Planning for issues, and testing those plans, helped turn cultures that handled issues through denial and blame into cultures that routinely handled issues as they occurred.
Are you planning for and testing how your team will handle potential problems as they arise?