In Snippets over the past few months, we've explored how complex systems operate: from system encroachment and safety (following the Amazon Web Services outage) to the role of humans as loosely-coordinated system operators last week.
This week we have a story to share via James Hamilton, a VP at AWS in charge of infrastructure efficiency and stability, who also writes an excellent blog on data center design, operation & safety. A few weeks ago, he shared with us an illuminating story about hazard management in a complex environment: namely, the data center managing a major US airline's operations. (No word on whether it's the same airline that has been in the news a lot this week for, let's just say, reasons.)
His account is worth reading in full, but the basics of what happened are as follows: data centers (including the one in question) are furnished with switches and backup generators that kick in whenever utility power either goes down or veers outside of acceptable quality parameters. This technology protects the servers, networking gear, and other equipment from surges, power outages, and other bad things, and it works almost 100% of the time.
The crux of this story is that all of this protection equipment is itself expensive, running into the millions of dollars; so the airline implemented auto-shutoff provisions designed to protect the protection equipment, in turn, from being damaged.
Do we see the problem here? By protecting the protection equipment, you've partially undone the safety that you'd so painstakingly created! Murphy's Law strikes again (and again, and again):
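The failure mode is easier to see as a toy state machine. The sketch below is purely illustrative (the function and parameter names are invented, not Hamilton's actual system): the servers draw from utility power, fail over to a generator when utility drops out, and the new auto-shutoff provision refuses to run a stressed generator rather than risk damaging it.

```python
# Illustrative sketch of the failure path described above; all names here
# are hypothetical, not taken from Hamilton's account.

def data_center_power(utility_ok: bool,
                      generator_healthy: bool,
                      protect_the_generator: bool) -> str:
    """Return the source feeding the servers, or 'OUTAGE'."""
    if utility_ok:
        return "utility"
    # Utility is down or out of spec: fail over to the backup generator.
    if not generator_healthy and protect_the_generator:
        # Auto-shutoff provision: rather than risk damaging the expensive
        # generator, refuse to run it -- dropping the load entirely.
        return "OUTAGE"
    return "generator"

# Before the auto-shutoff existed, a stressed generator still carried
# the load (at some risk to itself):
assert data_center_power(False, False, protect_the_generator=False) == "generator"

# With the auto-shutoff, protecting the protection equipment becomes a
# brand-new path to total outage:
assert data_center_power(False, False, protect_the_generator=True) == "OUTAGE"
```

The point of the sketch is that the `OUTAGE` branch did not exist before the safeguard-on-a-safeguard was added; the new provision trades a well-understood, survivable failure (running a generator hard) for a rare, catastrophic one (dropping the entire load).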
As Richard Cook put it perfectly in his foundational paper, "How Complex Systems Fail": "The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures. When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures. Not uncommonly, these new, rare catastrophes have even greater impact than those eliminated by the new technology. These new forms of failure are difficult to see before the fact; attention is paid mostly to the putative beneficial characteristics of the changes. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure."
We see this principle quite clearly in Hamilton's example, where technology and process created to prevent damage to a $1M power generator introduced a path to a $100M failure for the airline. (Incidentally, this was similar to the system failure that led to a power outage at the Super Bowl a few years back.)
Two kinds of pressure were responsible: technological (the push to upgrade systems to newer technology with a lower local failure rate) and financial (the understandable desire to shield million-dollar capital equipment from blowing up!).
When we recognize these pressures, we see them everywhere, particularly around two long-term trends in tech. The first is that modern software is rapidly expanding into complex, hazardous industries: power generation, health care, financial services, transportation, and more. The second is that many of the old technological back-ends underpinning the systems that do this work, and which have been held together by sweat and duct tape for decades, are in danger of becoming so old that no one knows how to maintain them anymore.
In other words, we shouldn't be surprised to see newer, weirder, and more unexpected forms of failure begin to creep into systems that up until now were at least predictable in their problems.
This is a significant opportunity for anyone who actually understands this stuff; it wouldn't be surprising if one or more of the important software companies being created today that end up as decacorns were explicitly focused on the problem of safety and "system jiu-jitsu".
Still, it's also worth appreciating that some of the most facepalm-worthy system failures don't necessarily come from complex, nuanced forces. Many still come from good, old-fashioned poor decision making. As Rick from the comments section of Hamilton's article amusingly recounts: "I still remember how a major data center in Colorado Springs had redundant power lines into the prem, but both of the power leads looped around into a parallel structure 30 feet outside the prem, and the backhoe got them both about 10 feet later." Nice.