http://www.dallasnews.com/business/airl ... -flood.ece says
A week after a systemwide technical outage resulted in one of the largest disruptions in Southwest Airlines' history, the Dallas-based carrier has zeroed in on a root cause that is both simple and confounding.
At 1:09 p.m. on July 20, a lone router at Southwest Airlines' Love Field data center failed, creating a chokepoint that crippled hundreds of the company's software applications.
The router, like the thousands of others housed there, had a backup system in place. But according to the company's CEO Gary Kelly, the unique way the router failed, what he described as a "partial failure," didn't signal the backup that it was needed, allowing a singular disruption to metastasize into a crisis.
Kelly compared the failure to a once in a thousand year flood.
I've seen this stuff way back when I worked on 'high availability' in the 90s.
Something causes the device to look 'alive' to the outside world but it is in a state where it can't do the work the rest of the infrastructure expects it to be doing.
That's a nasty problem to solve from both a computer science and a marketing point of view.
If you want quick failover times and predictable failover behavior, you respond to incoming 'heartbeat requests' with low-level hardware and/or software that uses pre-provisioned resources. That works great for 'hard failures' like the kind that happen when someone pulls the wrong cable out of a patch panel, or a construction crew digs up a circuit by accident, etc. But it's totally incapable of handling the 'soft-fail' situations that happen when higher levels of software that need more and more resources to do their jobs can't get those resources (be they CPU cycles, memory space, bandwidth, database locks, etc). So to be more realistic you make the 'heartbeats' more complicated, but this leads to longer times to determine failure, more complexity, more frequent failure events, false failure events, etc.
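To make the tradeoff concrete, here is a rough Python sketch of the two kinds of heartbeat. Everything in it is made up for illustration (the Node class, the free_resources counter, the miss threshold); it is not how Southwest's routers or any real failover protocol work, just a toy model of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Hypothetical device: the link may be up even when it cannot do real work."""
    link_up: bool = True
    free_resources: int = 10  # stand-in for CPU cycles, memory, DB locks, etc.

    def shallow_heartbeat(self) -> bool:
        # Answered by low-level code using pre-provisioned resources,
        # so it succeeds whenever the device is reachable at all.
        return self.link_up

    def deep_heartbeat(self) -> bool:
        # Also requires that a unit of real work could get resources.
        return self.link_up and self.free_resources > 0


class FailureDetector:
    """Declares failure after `threshold` consecutive missed heartbeats."""

    def __init__(self, probe: Callable[[], bool], threshold: int) -> None:
        self.probe = probe
        self.threshold = threshold
        self.misses = 0

    def tick(self) -> bool:
        """Run one probe; return True if failover should be triggered."""
        if self.probe():
            self.misses = 0
        else:
            self.misses += 1
        return self.misses >= self.threshold


def ticks_until_failover(detector: FailureDetector, max_ticks: int = 100) -> Optional[int]:
    """Return the tick on which failover triggers, or None if it never does."""
    for n in range(1, max_ticks + 1):
        if detector.tick():
            return n
    return None


# Hard failure: the cable is pulled. The shallow heartbeat catches it at once.
hard = Node(link_up=False)
hard_result = ticks_until_failover(FailureDetector(hard.shallow_heartbeat, threshold=1))

# Soft failure: link is up but resources are exhausted ("partial failure").
soft = Node(link_up=True, free_resources=0)
soft_shallow = ticks_until_failover(FailureDetector(soft.shallow_heartbeat, threshold=1))
soft_deep = ticks_until_failover(FailureDetector(soft.deep_heartbeat, threshold=3))
```

In this toy model the shallow probe never notices the soft failure, while the deep probe does, but only after three missed beats. The threshold is there to avoid failing over on a momentary resource shortage, and raising it is exactly what stretches out the time to detect a real failure.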
WN just got caught out by this problem, big time.
Can't be any fun for the nerds who have to answer to everyone from the CEO on down.