As far as I know, the kind of failure that occurred with the 737 MAX MCAS system has never occurred before. Thus, questions related to it were not on the FMEA forms for that kind of system (I'm sure they will be in the future).
As it applies to an FMEA: if the error occurred at the open-block "Can you think of..." stage, then, in my opinion, it was at the point where the failure was not reasonably foreseeable, and no criminal charges would likely be filed or stick in court. But if they directly missed on a question that was directly asked... that, in my opinion, would likely be chargeable.
Thanks 2175301, great post, fascinating insight into the workings of what has to be one of the most strictly regulated industries.
However, I do not agree with your opinion of where the error most likely occurred in the MCAS 1.0 FMEA: in my opinion, the "catastrophic" classification should have stemmed directly from the "known failure modes" section, not from the "open questions" section at the back.
It's true that the kind of MCAS failure that killed 300+ people had never occurred before, but that's only because no such design had ever been allowed on an airliner. The nearest comparable system is, AFAIK, the MCAS on the KC-767A flying for the Italian Air Force, but it's wired with dual AOA input channels and input sanitation, i.e. it automatically disables itself on an AOA disagree condition (I have no idea about its control authority).
In automation design, a single-input, unlimited-authority controller with no input sanitation and no sane manual-override option is almost guaranteed to fail catastrophically in a single-sensor-failure scenario (N.B.: in most automation applications, a "catastrophic" failure often only results in damage to equipment, but it's still the most unwanted outcome of operations). With AOA sensors having a none-too-high MTBF, the frequency calculation was straightforward (and grim reality confirmed it, sadly).
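To make the "input sanitation" point above concrete, here is a minimal sketch of the dual-channel cross-check described for the KC-767A, assuming a simple disagree-threshold scheme. All names, the 5.5-degree disagree threshold, and the trigger angle are illustrative inventions, not actual Boeing or Leonardo values:

```python
# Hypothetical sketch of dual-channel AOA input sanitation: the controller
# cross-checks both channels and disables itself on a disagree condition,
# instead of acting on a single (possibly failed) sensor.

AOA_DISAGREE_THRESHOLD_DEG = 5.5  # illustrative, not a real certification value

def mcas_command(aoa_left_deg, aoa_right_deg, aoa_trigger_deg=12.0):
    """Return a nose-down trim command, or None if input is untrusted."""
    # Input sanitation: if the two channels disagree, distrust both and
    # self-disable rather than act on potentially bad data.
    if abs(aoa_left_deg - aoa_right_deg) > AOA_DISAGREE_THRESHOLD_DEG:
        return None  # AOA disagree -> controller disables itself
    aoa = (aoa_left_deg + aoa_right_deg) / 2.0
    if aoa > aoa_trigger_deg:
        return "NOSE_DOWN_TRIM"
    return None

# A single stuck-high sensor no longer triggers a nose-down command:
assert mcas_command(74.5, 15.0) is None          # disagree -> disabled
assert mcas_command(15.0, 14.5) == "NOSE_DOWN_TRIM"  # both agree, high AOA
assert mcas_command(5.0, 5.5) is None            # both agree, normal AOA
```

The point of the sketch is the first `if`: with a single input channel there is nothing to compare against, so this entire layer of defense is structurally impossible.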
If anything, probably the flaw was difficult to identify because it was so fundamental, and none of the questions in the "known failure modes" section were anything like "is your design vulnerable to single-sensor failures and
does the controller have unlimited authority and
did you neglect to put in place even rudimentary input sanitation filters and
did you change the function of the only cut out switch that could have disabled the controller without disabling the actuator?"
Not because any of the failure modes were unknown, but because no one could imagine that a group of professionals could line up this frankly unthinkable combination of basic design criteria violations in a single piece of equipment.
We may disagree; but my experience is that unless the questions in the question-list section are rather direct (and they are often written several different ways to get you to think of a failure mode), most people will not see the failure. If the key type of questions had been there, I have high confidence that the failure that occurred from a continuous false reading (be it from a sensor, a wiring fault, etc.) would have been identified and classified appropriately, forcing a revision to the design (and again to the FMEA) in order to progress, and overall preventing the events.
As far as why such seemingly obvious errors occur: when I graduated from college with my engineering degree, I could not believe all the mistakes that people were making. Decades later, after my own series of mistakes (where I often exclaimed, at least to myself, "how could I have missed that?" - perhaps not in such kind words), after experience with high-level Human Performance work (I was a founding member of our plant's Human Performance Committee and have years of training in it, some provided by the same company that trains aerospace and medical staff on Human Performance), and after dealing with FMEAs (where sometimes we catch things and sometimes we don't; where I may see a miss, and others see misses that I missed)... I no longer try to explain it other than to say that we are human and humans make mistakes. Most everything is obvious after the fact (as in this case). But in the case of FMEAs, a team of people looked at them and thought they were correct (or as correct as they could make them). Unfortunately, misses occur - and sometimes significant misses occur.
I have never seen a significant miss when the base questions that identified the failure mode were on the FMEA. Thus, I suspect that the questions that would lead people to foresee this failure were not there. They classified the system as acceptable based on the questions that were there, and did not think enough outside of their structured box to add the issue to the "Can you think of..." box. The nature of the FMEA forms is that events that occur are constantly added, so that they can be caught beforehand in the future. But those forms did not exist long ago; I think the formal FMEA process is about 25 years old.
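The mechanism described above - a failure mode on the form gets classified, and an unacceptable classification blocks the design until it is revised - can be sketched as a toy example. The failure-mode names, the 1-5 rating scales, and the thresholds below are invented for illustration; real FMEA forms and criticality scales vary by industry and company:

```python
# Toy illustration of the FMEA gating logic discussed above: each listed
# failure mode gets severity/occurrence ratings, and a catastrophic
# severity at a credible occurrence forces a design revision before the
# process can proceed. All entries and scales are illustrative.

FAILURE_MODES = [
    # (failure mode on the form,            severity 1-5, occurrence 1-5)
    ("AOA sensor reads continuously high",  5,            3),  # 5 = catastrophic
    ("Trim motor fails to run",             3,            2),
]

def must_revise(severity, occurrence, sev_limit=5, occ_limit=2):
    # Catastrophic severity combined with non-negligible occurrence
    # blocks the design (illustrative acceptance criterion).
    return severity >= sev_limit and occurrence >= occ_limit

blocked = [name for name, sev, occ in FAILURE_MODES if must_revise(sev, occ)]
assert blocked == ["AOA sensor reads continuously high"]
```

The catch the posters are debating is the first tuple: the gate only fires if "AOA sensor reads continuously high" appears on the form at all. A failure mode that is never listed is never rated, and the design sails through.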
As far as fundamental engineering principles... it's amazing how often engineers don't apply them. I've seen it too many times to count, and with some of the most basic things. Example: how could a historic, well-known small-turbine manufacturer design a safety-related emergency-service steam-turbine-driven pump to supply water to cool the reactor - one that has to start from cold with 500 F steam, be up to full load in a minute, and run for hours - and forget to design it for thermal expansion? (I was the lead investigator on the root cause of why that turbine had problems - 3-4 months out of my life.) But I believe experiencing and seeing all of this has made me a better engineer.
Have a great day,
Of course, from the little we know now, the type of mistake you describe - "the questions that would lead people to foresee this failure were not there" - is technically still a possibility. I understand that your experience makes this possibility real. But a single sensor activating a safety-critical control surface is the kind of mistake so obvious that it couldn't have passed the required review. Go back in time to when Boeing and the FAA issued the EAD, where they had to describe the problem, and read the reactions back then. The number of people shocked by that single-sensor design was colossal: engineers, pilots, anyone with a basic interest in safety. Based on that, you have to consider other possibilities to explain the FMEA failure.
The EAD is written entirely around the pilot mitigating the single-sensor failure, to avoid the risk of erratic activation of a safety-critical control surface. Journalists have published information about how MCAS was hidden from the pilots because it was believed to be like other existing systems that can move the horizontal stabilizer actuator. Based on this, I find it far more plausible that the FMEA did correctly identify the risk of an erratic single-sensor failure, but that the mitigation of that risk was wrong. Normally, most people would logically want to add redundancy in that case, as on any other modern aircraft, but the obsolete architecture of the 737 was not designed to allow that without a major redesign (the one that actually took Boeing so many months to do). At some point, the idea of using the pilot to mitigate MCAS activation of the safety-critical control surface was proposed. There is nothing wrong with exploring all ideas, but that idea was not correctly assessed from a safety point of view. In my opinion, it was at this stage that most of the Emmental cheese holes aligned:
* Risk of single AoA sensor failure.
* Activation of a safety critical control surface.
* Abuse of the notion of "trim runaway" to make the pilot the mitigation of the risk.
* "Same type rating" objective creating an incentive to hide MCAS.
* Consequently, no appropriate documentation for the pilots.
* Consequently, no appropriate training for the pilots.
* Multiple discontinuous activations of MCAS can put the stab in an extreme position.
* Loss of elevator authority at extreme stab trim positions.
* Trim wheels too small to be usable at high speed.