Deep Dive Into The Little Problems Now
Essentially, it was by luck. At ongoing hearings, senior bank executive Clive van Horen explained that the error had escaped numerous internal bank controls that should have caught it. The error was finally uncovered after troubling questions were raised over the bank’s overdraft application process by the Consumer Action Legal Centre and ASIC, and a new manager decided to take a deep dive into it. That’s when the coding error was revealed; the bank then took another 17 days to fix it. IEEE Spectrum, April 2, 2018. Commonwealth Bank of Australia Tries to Explain Coding Errors Found After 4 Years.
Our military space based early warning system had a collection of random intermittent errors. The errors only occurred occasionally and for many of them the solution was simple: reboot the mainframe. Of course when we rebooted the mainframe that processed the incoming data from our satellites, we were blind to any nuclear missile launches for that short few minutes it took to reboot. Since it would only take from 15 to 30 minutes for a surprise nuclear missile attack to impact the US from anywhere in the world, these few minutes of blindness could be exploited to achieve an advantage by an adversary.
Rebooting the mainframe was just a normal thing to do at this unit when I arrived as a new Air Force second lieutenant just out of college. Everyone knew it was a problem but it had become a common occurrence and we had what had become an acceptable way of handling it. It just struck me as not right, not something we wanted to be doing. So each and every time the problem occurred, assuming it was even reported by anyone, I would go collect up the logs and core dumps of the problem. With some persistence over what was a few months while also working on my primary projects, I slowly zeroed in on when and where the problem was happening.
One day I finally found it. The tricky part was that an operating systems API call, not the normal application logic, appeared to be failing. Normally the system calls are considered the most reliable parts of the operating system. They are supplied by the manufacturer, in this case it was IBM, and used by all their customers. The chances that it would fail were slim to none. I took the evidence and went to see the lead of the operating systems team. I told him, another lieutenant, that if this was not the problem, I’d buy him lunch. Well, later that morning, the operating system’s chief told me that he owed me lunch. I got lunch and we fixed an annoying national security issue that had persisted for many years.
The key was for someone to take the time and persistently chase down these little issues. By itself it was not a huge issue, but in fact we had about a dozen such issues that when they occurred we would simply reboot the mainframe. By the time I had rotated out of the unit for my next assignment, I had solved well over half of these intermittent issues and our daily downtime, the time we were blind to a strategic missile attack during the cold war era, had noticeably been reduced.
What seemingly little problems does your project have that warrants a deep dive to finally resolve them?