As widely reported in the media, early August was a tough time for Delta Air Lines. The firm suffered a major computer systems outage that caused widespread cancellations and stranded hundreds of thousands of passengers, both its own and those of other airlines.
Making matters more interesting, it was not the airline industry’s first computer-systems-related outage that year. The Delta system crash came about three weeks after a router failure at Southwest Airlines caused the cancellation of more than 1,000 flights. In May, computer issues at JetBlue forced passengers to be checked in manually at some airports.
I won’t dig into the specifics of these outages, other than to note the Delta problem was reportedly caused by a malfunctioning power control module. What caused the problem isn’t really the issue. The real question is how a multinational, Fortune 100 company could let such a thing happen.
Even small and medium-sized businesses can now afford backups, and large-scale systems generally have redundancy built in. Delta should have been able to switch its system operations to other servers and restore any lost data quickly—if its systems went down at all.
Delta Chief Operating Officer Gil West has indicated in a statement that Delta had backups, but for some reason not all systems switched over as intended. In other words, Delta had safeguards in place, but they failed.
That leaves us wondering: Did Delta test its backup systems adequately? Did it perform “emergency drills,” taking its systems offline during slow periods and switching to its backup systems to make sure they worked properly? Did it have sufficient redundant power supplies, such as backup batteries and generators, and if so, did it test them adequately?
We may never know exactly what caused Delta’s protections to break down, but we do know that the system didn’t have sufficient fault tolerance. That had to be a hard lesson for Delta, and one that they will be working to overcome for the foreseeable future.
Admittedly, for a company like Delta, achieving guaranteed, 100% fault tolerance isn’t easy. For large, complex systems, truly redundant disaster recovery, which ensures organizations have little, if any, downtime, is expensive. Nevertheless, one has to wonder how that cost compares with the revenue and customer goodwill that Delta lost.
For the average business owner, whose IT systems are not as extensive or complex, newer solutions make it possible and reasonably affordable to have near-perfect fault tolerance. The task for business owners is to determine their risk tolerance and then decide how much protection is enough.
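For readers who like to see the arithmetic, here is one rough way to frame the "how much protection is enough" question: compare what a safeguard costs per year with the downtime loss it is expected to prevent. The short Python sketch below is only an illustration. The function names and every figure in it are hypothetical placeholders, not numbers from Delta or any real business, and a real assessment would also need to account for harder-to-quantify losses such as customer goodwill.

```python
def expected_annual_downtime_cost(outages_per_year, hours_per_outage, loss_per_hour):
    """Expected yearly loss from downtime: frequency x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def protection_pays_off(annual_protection_cost, cost_without, cost_with):
    """True when the downtime loss avoided exceeds what the protection costs."""
    return (cost_without - cost_with) > annual_protection_cost


# All figures below are hypothetical placeholders, not data from any real business.
# Assumed: 2 outages a year, 8 hours each without failover, $5,000 lost per hour.
without_failover = expected_annual_downtime_cost(2, 8, 5_000)

# Assumed: the same 2 outages, but failover cuts each one to half an hour.
with_failover = expected_annual_downtime_cost(2, 0.5, 5_000)

# Assumed: the failover setup costs $20,000 a year to run.
worth_it = protection_pays_off(20_000, without_failover, with_failover)

print(without_failover, with_failover, worth_it)
```

Plugging in your own estimates for outage frequency, duration, and hourly loss gives a first-pass answer. If the result is close to break-even, that is where your risk tolerance decides the matter.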
Next month, we’ll talk about some of the red flags that Delta missed—and discuss how business owners can identify them and determine whether they should address them.