
 
Revelation
Topic Author
Posts: 24770
Joined: Wed Feb 09, 2005 9:37 pm

WN Blames System Wide Outage On One Router

Sun Jul 31, 2016 7:25 pm

Interesting: they say their entire system crashed due to one router, and I believe them.

http://www.dallasnews.com/business/airl ... -flood.ece says

A week after a systemwide technical outage resulted in one of the largest disruptions in Southwest Airlines' history, the Dallas-based carrier has zeroed in on a root cause that is both simple and confounding.

At 1:09 p.m. on July 20, a lone router at Southwest Airlines' Love Field data center failed, creating a chokepoint that crippled hundreds of the company's software applications.

The router, like the thousands of others housed there, had a backup system in place. But according to the company's CEO Gary Kelly, the unique way the router failed, what he described as a "partial failure," didn't signal the backup that it was needed, allowing a singular disruption to metastasize into a crisis.

Kelly compared the failure to a once in a thousand year flood.


I've seen this stuff way back when I worked on 'high availability' in the 90s.

Something causes the device to look 'alive' to the outside world but it is in a state where it can't do the work the rest of the infrastructure expects it to be doing.

That's a nasty problem to solve from both a computer science and a marketing point of view.

If you want quick failover times and predictable failover behavior, you respond to incoming 'heartbeat requests' using low-level hardware and/or software running on pre-provisioned resources. That works great for 'hard failures' like the kind that happen when someone pulls the wrong cable out of a patch panel, or a construction crew digs up a circuit by accident, etc. But it's totally incapable of handling the 'soft-fail' situations that happen when higher levels of software that need more and more resources to do their jobs can't get those resources (be they CPU cycles, memory space, bandwidth, database locks, etc). So to be more realistic you make the 'heartbeats' more complicated, but this leads to longer amounts of time to determine failure, more complexity, more frequent failure events, false failure events, etc.
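
To make that trade-off concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `Device` model, the two probe functions and the `misses_needed` threshold are my own names, not anything WN or a router vendor actually uses. The shallow probe only proves the box is powered; the deep probe also checks that it is doing its job.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Device:
    """Hypothetical device state: 'powered' models hard failures, 'forwarding' soft ones."""
    powered: bool = True
    forwarding: bool = True


def shallow_heartbeat(dev: Device) -> bool:
    # Cheap liveness probe: only proves the box is up and answering.
    return dev.powered


def deep_heartbeat(dev: Device) -> bool:
    # Costlier probe: also checks that the device is doing useful work.
    return dev.powered and dev.forwarding


def should_fail_over(probe: Callable[[Device], bool],
                     history: Sequence[Device],
                     misses_needed: int) -> bool:
    """Return True once `misses_needed` consecutive probes have failed."""
    misses = 0
    for state in history:
        misses = 0 if probe(state) else misses + 1
        if misses >= misses_needed:
            return True
    return False


# A 'partial failure': still powered, no longer forwarding traffic.
partial = [Device()] + [Device(forwarding=False)] * 3
# A flapping device: one bad probe at a time, never three in a row.
flapping = [Device(), Device(forwarding=False), Device(),
            Device(forwarding=False), Device()]

print(should_fail_over(shallow_heartbeat, partial, 3))  # shallow probe misses it
print(should_fail_over(deep_heartbeat, partial, 3))     # deep probe catches it
print(should_fail_over(deep_heartbeat, flapping, 1))    # hair-trigger: false alarm
```

With the shallow probe the backup is never told it is needed, which is the 'partial failure' case. The deep probe catches it, but only after several missed probes, and lowering `misses_needed` to react faster makes the flapping device trigger a failover it probably did not need.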

WN just got caught out by this problem, big time.

Can't be any fun for the nerds who have to answer to everyone from the CEO on down.
Wake up to find out that you are the eyes of the world
The heart has its beaches, its homeland and thoughts of its own
Wake now, discover that you are the song that the morning brings
The heart has its seasons, its evenings and songs of its own
 
atypical
Posts: 797
Joined: Mon Aug 18, 2014 12:28 am

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 1:13 am

Revelation wrote:
Something causes the device to look 'alive' to the outside world but it is in a state where it can't do the work the rest of the infrastructure expects it to be doing.


I don't buy it unless their infrastructure is an old Linksys they got on sale from Circuit City. There are a variety of applications with which they should be catching SNMP messages from the router. Even on a partial failure the router was passing less traffic and probably showing other symptoms of trouble like dropped packets. But even seeing the router passing less traffic is enough to swing a manual failover without any risk in the least if they are physically redundant. This release of information just makes them look more incompetent, not less. They would have been better off not saying anything rather than a message this poor.
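
As a rough sketch of the kind of check meant here, assuming the per-interval packet counters have already been collected somehow (in practice by polling the device, for example over SNMP; that part is not shown). The `Sample` type, the function name and both thresholds are hypothetical values picked for illustration, not anything known about WN's monitoring.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass
class Sample:
    """One polling interval of (hypothetical) interface counters."""
    packets_forwarded: int
    packets_dropped: int


def looks_degraded(baseline: Sequence[Sample],
                   current: Sample,
                   min_traffic_ratio: float = 0.5,
                   max_drop_ratio: float = 0.05) -> bool:
    """Flag a device whose traffic collapsed or whose drop rate spiked.

    `baseline` must be non-empty; both thresholds are illustrative guesses.
    """
    typical = median(s.packets_forwarded for s in baseline)
    total = current.packets_forwarded + current.packets_dropped
    drop_ratio = current.packets_dropped / total if total else 0.0
    traffic_collapsed = current.packets_forwarded < min_traffic_ratio * typical
    return traffic_collapsed or drop_ratio > max_drop_ratio


baseline = [Sample(1000, 2), Sample(1100, 1), Sample(900, 3)]
print(looks_degraded(baseline, Sample(950, 4)))    # normal interval
print(looks_degraded(baseline, Sample(300, 5)))    # passing far less traffic
print(looks_degraded(baseline, Sample(900, 100)))  # dropping many packets
```

A check like this only raises a flag for a human (or an automated failover) to act on. Choosing thresholds that do not fire on ordinary quiet periods is the hard part.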
 
rg787
Posts: 117
Joined: Sun Nov 21, 2010 2:28 pm

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 2:14 am

Isn't the DHCP server on these high end structures handled by an actual server rather than a router?
 
msycajun
Posts: 1130
Joined: Thu Jun 16, 2016 4:13 am

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 2:18 am

I'm glad I'm not the only one with router troubles
 
lightsaber
Moderator
Posts: 20538
Joined: Wed Jan 19, 2005 10:55 pm

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 2:33 am

Way too long of a resolution. But then again, I worked on a project held hostage by a router for 4 hours. It restarted... but didn't. So plausible. The software was old and couldn't be reconfigured as quickly as modern software. So we had to wait for the IT team to shut down the data center and bring everything back online.

This is why WN must get to more modern/flexible software. Old school mainframe software relies on the mainframe to provide the dependability; running it on some server isn't as robust as on the old school mainframes. The mainframe OS provided the robustness. Now it must come from the database software. When does WN transition?

Lightsaber
Winter is coming.
 
stlgph
Posts: 11228
Joined: Tue Oct 12, 2004 4:19 pm

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 3:18 am

Of course it's possible it's on one router.

The entire Metro North Railroad was shut down for a night because a cleaning lady unplugged a computer so she could mop the floors.

America, never stop progressing.
if assumptions could fly, airliners.net would be the world's busiest airport
 
b747400erf
Posts: 3165
Joined: Wed Jun 19, 2013 4:33 am

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 5:14 am

wanted for questioning

Image
 
nighthawk
Posts: 4890
Joined: Sun Sep 16, 2001 2:33 am

Re: WN Blames System Wide Outage On One Router

Mon Aug 01, 2016 9:40 am

stlgph wrote:
Of course it's possible it's on one router.

The entire Metro North Railroad was shut down for a night because a cleaning lady unplugged a computer so she could mop the floors.

America, never stop progressing.


Was it some kind of newfangled electronic mop?
 
blacksoviet
Posts: 1742
Joined: Thu Apr 21, 2016 10:50 am

Re: WN Blames System Wide Outage On One Router

Thu Dec 22, 2016 11:59 pm

How many years can a router be online before failing?
 
Jerseyguy
Posts: 2183
Joined: Sun Oct 30, 2005 12:05 pm

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 5:19 am

Apparently A.net got a router on the cheap from WN :duck:
 
BobPatterson
Posts: 3416
Joined: Thu Nov 26, 2015 7:18 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 6:34 am

blacksoviet wrote:
How many years can a router be online before failing?


Years and years. I've seen reports of routers that are 15-20 years old still in use.

But, probably, most are replaced much sooner because of system design upgrades and requirements, new capabilities and features, just as we replace software and hardware that becomes obsolete.
Facts are fragile things. Treat them with care. Sources are important. Alternative facts do not exist.
 
vxg
Posts: 100
Joined: Sun May 02, 2004 12:31 pm

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 7:02 am

In this day and age why does an airline company even have their own data center? They need to be outsourcing all that stuff to a cloud provider who caters to enterprises with high compliance and security requirements - so maybe not AWS but something like IBM, Microsoft, Oracle or a mix of them. They should focus on hiring top airline talent focused on making their core business great and let the experts handle the servers, power, cooling, routers, etc. Running your own data center is going the way of the dinosaur.
 
blueflyer
Posts: 4352
Joined: Tue Jan 31, 2006 4:17 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 7:50 am

vxg wrote:
so maybe not AWS but something like IBM, Microsoft, Oracle or a mix of them.

What's wrong with AWS? Apparently, without Amazon, there wouldn't be a SYD DFW.
http://www.zdnet.com/article/qantas-fly ... ation-aws/
 
blacksoviet
Posts: 1742
Joined: Thu Apr 21, 2016 10:50 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 7:57 am

BobPatterson wrote:
blacksoviet wrote:
How many years can a router be online before failing?


Years and years. I've seen reports of routers that are 15-20 years old still in use.

But, probably, most are replaced much sooner because of system design upgrades and requirements, new capabilities and features, just as we replace software and hardware that becomes obsolete.

How fast are these 20-year old routers? 10 Mbps? Are those the routers that use the old BNC cables?
 
VSMUT
Posts: 4674
Joined: Mon Aug 08, 2016 11:40 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 12:41 pm

The Economist had an article on the server and computer systems of airlines and banks back in August when Delta suffered an outage as a result of a fire. It pretty much summed up that airlines and banks were among the first adopters of massive IT systems, and haven't done anything to replace or renew that vital piece of infrastructure ever since. They really are 20+ years old at the legacy airlines (and banks too). It's some sort of mentality along the lines of "As long as it works, we don't want to invest in it".
 
seanpmassey
Posts: 97
Joined: Mon May 23, 2016 3:22 pm

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:00 pm

VSMUT wrote:
The Economist had an article on the server and computer systems of airlines and banks back in August when Delta suffered an outage as a result of a fire. It pretty much summed up that airlines and banks were among the first adopters of massive IT systems, and haven't done anything to replace or renew that vital piece of infrastructure ever since. They really are 20+ years old at the legacy airlines (and banks too). It's some sort of mentality along the lines of "As long as it works, we don't want to invest in it".


It's not that they don't want to invest in it. It's that it's a massive undertaking, and it's going to have significant costs in terms of time, resources, and dollars. It's also a huge risk to the business, and it may not deliver on its intended results. There are business analysts who spend a lot of time figuring out the costs of rebuilding the applications, the costs associated with outages, and the costs associated with the risks to the business of both options and determining what is more financially feasible.

While the mainframe hardware is likely newer (IBM still builds mainframes), it is running software that has evolved with the business over 20-50 years, with multiple subsystems for managing different aspects of the business. Trying to rebuild the application on another platform, using a modern architecture, is a large undertaking even when you're trying to hit a static target. Airline and bank systems are a moving target, though, with new business rules being added or updated on a regular basis.

It's also worth noting that the cloud isn't outage proof. AWS has entire availability zone failures at least once per year, and Azure has had some high-profile failures in the last couple of years. While it is possible to architect around this, it adds significant expense and complexity to the application.
 
VS11
Posts: 1672
Joined: Mon Jul 02, 2001 6:34 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:27 pm

seanpmassey wrote:
It's also worth noting that the cloud isn't outage proof. AWS has entire availability zone failures at least once per year, and Azure has had some high-profile failures in the last couple of years. While it is possible to architect around this, it adds significant expense and complexity to the application.


Not only that. You still need to connect to AWS from wherever you are. It depends on the telecom infrastructure at each airport and what the airlines are allowed to do. Check-in desk computers are not necessarily part of each airline's network. At some airports they are part of shared networks. Airlines as tenants may not be allowed to use their own telecom providers, which can really restrict how they design their physical networks.
 
dtw2hyd
Posts: 8445
Joined: Wed Jan 09, 2013 12:11 pm

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:42 pm

You would be surprised how many corporations don't even do a live test of their infrastructure for a major failure. Most just do tabletop drills to pass audit.

VXG, not every application fits into a cookie-cutter AWS VM and shared storage. Cloud is great in theory and good for a lot of applications, but there are still complex mission-critical systems in need of their own dedicated infrastructure.
All posts are just opinions.
 
Revelation
Topic Author
Posts: 24770
Joined: Wed Feb 09, 2005 9:37 pm

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:44 pm

blacksoviet wrote:
BobPatterson wrote:
blacksoviet wrote:
How many years can a router be online before failing?


Years and years. I've seen reports of routers that are 15-20 years old still in use.

But, probably, most are replaced much sooner because of system design upgrades and requirements, new capabilities and features, just as we replace software and hardware that becomes obsolete.

How fast are these 20-year old routers? 10 Mbps? Are those the routers that use the old BNC cables?


A 1996 vintage router probably would have 10/100 'fast' ethernet, that's right when the transition away from coax was happening. I remember the cubicles in the building I was in getting rewired right around then. Coax was really nasty to deal with, especially given how many interfaces could be 'daisy-chained'. I remember all the T-connectors and barrel terminators for 'thin-net' ethernet. The thick stuff was even worse. Typically the 'vampire tap' was used, so you had to drill into the cable and hope you ended up with a usable connection instead of a short or open circuit.

Personally I can't see any competent IT shop keeping a 20 year old router in service. The vendor certainly would not support it. It would have all kinds of unpatched security vulnerabilities. The performance would be pitiful. It could not keep up with any modern packet source. It would be dropping so many packets that it would not be reliable.

I think we find the same to be true with home wireless routers. I know I replaced a ~5 year old one a year or so ago as it just did not keep up with all the new gadgets in the house. It would still 'work' if the load was low enough, but as newer tablets and phones showed up, it was so slow it was not viable to keep it. The new one is fast enough to keep up with phones, tablets, and media streaming devices.
 
ssteve
Posts: 1421
Joined: Fri Dec 02, 2011 8:32 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:48 pm

Been using grids for batch job submission for over a decade of my career, and the machines that keep running but just fail to work... those suck. They tend to say, "sure, give me work to do" and then proceed to fail to do it... and jobs have to be manually killed and restarted. For various reasons, detecting that situation automatically is difficult.
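
One simple heuristic for spotting those machines, sketched in Python under my own assumptions: the `Job` record, the `timeout` and the `min_jobs` cutoff are all hypothetical, and no real grid scheduler's API is used. A worker is suspect if it has accepted several jobs, finished none of them, and every one of them is overdue.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    """Hypothetical job record as a scheduler might log it."""
    worker: str
    started_at: float
    finished_at: Optional[float] = None  # None means still running


def suspect_workers(jobs: List[Job], now: float, timeout: float,
                    min_jobs: int = 2) -> List[str]:
    """Workers that took at least `min_jobs` jobs, finished none, all overdue."""
    by_worker: Dict[str, List[Job]] = {}
    for job in jobs:
        by_worker.setdefault(job.worker, []).append(job)

    suspects = []
    for worker, assigned in by_worker.items():
        finished_any = any(j.finished_at is not None for j in assigned)
        all_overdue = all(now - j.started_at > timeout for j in assigned)
        if len(assigned) >= min_jobs and not finished_any and all_overdue:
            suspects.append(worker)
    return sorted(suspects)


jobs = [
    Job("a", 0, 50), Job("a", 60),  # 'a' has finished work before
    Job("b", 0), Job("b", 10),      # 'b' accepts jobs but never finishes
    Job("c", 190),                  # 'c' only just started its one job
]
print(suspect_workers(jobs, now=200, timeout=100))
```

Part of why this is hard in practice: a legitimately long job looks the same as a stuck one, so any fixed `timeout` will either be slow to react or produce false alarms.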
 
ssteve
Posts: 1421
Joined: Fri Dec 02, 2011 8:32 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:55 pm

Revelation wrote:
A 1996 vintage router probably would have 10/100 'fast' ethernet, that's right when the transition away from coax was happening. I remember the cubicles in the building I was in getting rewired right around then.


I worked in a building where ethernet is run over the old token ring cabling... much of the ethernet cabling in that building is still from the 80s.
 
VS11
Posts: 1672
Joined: Mon Jul 02, 2001 6:34 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 2:56 pm

Revelation wrote:
A 1996 vintage router probably would have 10/100 'fast' ethernet, that's right when the transition away from coax was happening.


They certainly had Gigabit routers at the time. Moreover, 10/100 is generally for local area networks. It was a data center router that failed so probably not a LAN router. Besides the issue could have been with the telecom company operating the data center, not Southwest per se.
 
blacksoviet
Posts: 1742
Joined: Thu Apr 21, 2016 10:50 am

Re: WN Blames System Wide Outage On One Router

Fri Dec 23, 2016 6:53 pm

ssteve wrote:
Revelation wrote:
A 1996 vintage router probably would have 10/100 'fast' ethernet, that's right when the transition away from coax was happening. I remember the cubicles in the building I was in getting rewired right around then.


I worked in a building where ethernet is run over the old token ring cabling... much of the ethernet cabling in that building is still from the 80s.

That is amazing. How fast is the connection? You must have some old network cards installed.
 
luv2cattlecall
Posts: 827
Joined: Fri Sep 28, 2007 6:25 am

Re: WN Blames System Wide Outage On One Router

Sat Dec 24, 2016 1:24 am

seanpmassey wrote:
VSMUT wrote:
The Economist had an article on the server and computer systems of airlines and banks back in August when Delta suffered an outage as a result of a fire. It pretty much summed up that airlines and banks were among the first adopters of massive IT systems, and haven't done anything to replace or renew that vital piece of infrastructure ever since. They really are 20+ years old at the legacy airlines (and banks too). It's some sort of mentality along the lines of "As long as it works, we don't want to invest in it".


Analysts can easily paint a "too expensive to upgrade" picture, but it's hard to account for missed opportunities such as:

Bag fees
Change fees
Redeyes
The failed Volaris/WestJet partnership
International, until just recently
Reducing the need for call center assistance
Flights in 1 minute increments instead of 5
Upgraded front cabins
Automated irops recovery
 
77west
Posts: 972
Joined: Sat Jun 13, 2009 11:52 am

Re: WN Blames System Wide Outage On One Router

Sat Dec 24, 2016 2:17 am

I had this issue just this week at a client: the router went into "Zombie Mode" as we call it. It was alive, but behaving strangely and dropping packets and just being weird. We killed it. But it took half the day to figure out what component was to blame (it was a complex site only just inherited).
77West - AW109S - BE90 - JS31 - B1900 - Q300 - ATR72 - DC9-30 - MD80 - B733 - A320 - B738 - A300-B4 - B773 - B77W
 
memphiX
Posts: 57
Joined: Wed Dec 02, 2015 2:46 pm

Re: WN Blames System Wide Outage On One Router

Sat Dec 24, 2016 2:39 am

VS11 wrote:
Revelation wrote:
A 1996 vintage router probably would have 10/100 'fast' ethernet, that's right when the transition away from coax was happening.


They certainly had Gigabit routers at the time. Moreover, 10/100 is generally for local area networks. It was a data center router that failed so probably not a LAN router. Besides the issue could have been with the telecom company operating the data center, not Southwest per se.



Gigabit routers were available at the time, but they were VERY expensive and were rarely seen outside of the backbone hubs of major carriers like MCI, SBC, AT&T, Verizon, TWTC...
I am guessing that it was probably one of those Cisco 7200VXR routers. They were rock solid and would stay online for years with no issues. I doubt that WN had the need or wanted to pay 20 times more for a GSR (which was/is a GigE router) for LAN routing.
 
strfyr51
Posts: 5087
Joined: Tue Apr 10, 2012 5:04 pm

Re: WN Blames System Wide Outage On One Router

Sat Dec 24, 2016 11:31 am

OK, let me get this straight. WN had a meltdown due to a single router? They HAD to know what happened after the UA/CO merger, when CO tried to put the entire system on routers rather than mainframes and the system crashed in 28 minutes. WN has as many or more airplanes than United, and the routing has to be as complex as, if not more complex than, ours. CO couldn't fathom how we needed 4 mainframes to run our system. But they sure know it now! WN needs a dedicated computer center if they don't already have one. A cleaning lady is a pretty weak excuse for a major and premier airline, wouldn't you say??
