More gaps being admitted in Boeing's testing methodology: https://arstechnica.com/science/2020/02 ... e-testing/
NASA and Boeing have been conducting a joint assessment of these software problems, and they're expected to report their findings in a week, on March 6. But on Friday, Mulholland was prepared to discuss two issues with Boeing's software verification that the company intends to fix.
First of all, he acknowledged the company did not run integrated, end-to-end tests for the whole mission. For example, instead of running a software test that encompassed the roughly 48-hour period from launch through docking to the station, Boeing broke the test into chunks. The first chunk ran from launch through the point at which Starliner separated from the second stage of the Atlas V booster. Unfortunately for Boeing engineers, the mission elapsed timing error occurred just after this point in time. "If we would have run the integrated test through the first orbital insertion burn time frame, we would have seen that we missed the burn," Mulholland said.
Looking at logic strings
During the validation process, Boeing engineers also did not test every complex "logic string" in the software code, Mulholland said. Essentially, this means they checked the basic code but did not follow every possibility through complex logic strings such as if/then/else branches. Boeing has performed an audit to discover the gaps in its testing, and the next step is both to do end-to-end integrated testing and to fill in those gaps.
So they conducted testing, but didn't do a full-up, integrated end-to-end test of the entire mission; they broke the testing up into chunks. Each chunk passed on its own, but because of how the mission was carved up, the mission elapsed time (MET) error fell right at the boundary between one chunk and the next, so none of the chunks ever caught it.
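To make that concrete, here's a toy Python sketch (the classes, numbers, and burn window are all invented for illustration; none of this is the actual flight software) showing how a per-chunk test can pass on a hard-coded fixture while an integrated run, which carries real state across the chunk boundary, exposes the bad clock handoff:

[code]
import unittest

class Booster:
    """Stand-in for the Atlas V timing source. The wrong value mimics the
    real incident, where Starliner picked up a MET roughly 11 hours off."""
    def mission_elapsed_time(self):
        return 40500.0  # seconds; should have been ~0 at launch

class Starliner:
    def __init__(self):
        self.met = None

    def sync_clock(self, booster):
        # The bug lives in the handoff: the value is trusted unchecked.
        self.met = booster.mission_elapsed_time()

    def should_fire_insertion_burn(self):
        # Burn window roughly 29-31 minutes after launch (invented numbers).
        return 1740.0 <= self.met <= 1860.0

class ChunkedTest(unittest.TestCase):
    def test_chunk2_orbit_insertion(self):
        craft = Starliner()
        craft.met = 1800.0  # fixture replaces the chunk-1 handoff -- hides the bug
        self.assertTrue(craft.should_fire_insertion_burn())  # passes

class IntegratedTest(unittest.TestCase):
    def test_launch_through_insertion(self):
        craft = Starliner()
        craft.sync_clock(Booster())  # real state carried across the chunk boundary
        self.assertTrue(craft.should_fire_insertion_burn())  # FAILS, exposing the bug

if __name__ == "__main__":
    unittest.main()
[/code]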
And they also failed to test every possible logic string in the software code... just the basic code paths.
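That's the gap between statement coverage and path coverage. A made-up Python example: the two asserts below execute every line of the function, yet one combination of branches is never exercised:

[code]
def burn_duration(phase, docked):
    # Two independent branches -> four paths. The two tests below execute
    # every line (100% statement coverage) but only two of the four paths.
    if phase == "insertion":
        duration = 40.0
    else:
        duration = 10.0
    if docked:
        duration = 0.0  # never burn while docked
    return duration

# Every line runs across these two cases...
assert burn_duration("insertion", docked=False) == 40.0
assert burn_duration("phasing", docked=True) == 0.0
# ...but the ("insertion", docked=True) combination is never exercised,
# so a bug on that path would slip straight through.
[/code]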
In contrast, this article is a few years old, but it talks about SpaceX's approach to software testing and validation. Big spoiler alert: SpaceX strives for 100% code coverage and is obsessive when it comes to software testing: https://lwn.net/Articles/540368/
Quoting Fred Brooks (of The Mythical Man-Month fame), Rose said "software is invisible". To make software more visible, you need to know what it is doing, he said, which means creating "metrics on everything you can think of". With a rocket, you can't just connect via JTAG and "fire up gdb", so the software needs to keep track of what it is doing. Those metrics should cover areas like performance, network utilization, CPU load, and so on.
The metrics gathered, whether from testing or real-world use, should be stored as it is "incredibly valuable" to be able to go back through them, he said. For his systems, telemetry data is stored with the program metrics, as is the version of all of the code running so that everything can be reproduced if needed.
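A minimal Python sketch of that idea (the file format, metric names, and git-hash-as-version choice are my assumptions, not SpaceX's actual telemetry system): every sample gets stamped with the exact code version that produced it, so old data can always be matched back to old code:

[code]
import json
import os
import subprocess
import time

def code_version():
    # Stamp each sample with the exact code that produced it; a git hash
    # here, but any build identifier serves the same purpose.
    return subprocess.run(["git", "rev-parse", "HEAD"],
                          capture_output=True, text=True).stdout.strip()

class MetricsLog:
    """Append-only metrics log, one JSON record per sample."""
    def __init__(self, path="metrics.jsonl"):
        self.path = path
        self.version = code_version()

    def record(self, name, value):
        sample = {"t": time.time(), "metric": name, "value": value,
                  "code_version": self.version}
        with open(self.path, "a") as f:
            f.write(json.dumps(sample) + "\n")

log = MetricsLog()
log.record("cpu_load_1min", os.getloadavg()[0])  # Unix-only load average
log.record("frame_time_ms", 16.7)                # made-up performance metric
[/code]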
SpaceX has programs to parse the metrics data and raise an alarm when "something goes bad". It is important to automate that, Rose said, because forcing a human to do it "would suck". The same programs run on the data whether it is generated from a developer's test, from a run on the spacecraft, or from a mission. Any failures should be seen as an opportunity to add new metrics. It takes a while to "get into the rhythm" of doing so, but it is "very useful". He likes to "geek out on error reporting", using tools like libSegFault and ftrace.
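And the matching alarm pass might look something like this (the limits are invented; it reads the same metrics.jsonl format as the sketch above). The point is that the exact same scan runs no matter where the data came from:

[code]
import json

# Invented per-metric limits; a real system would derive these from history.
LIMITS = {"cpu_load_1min": 4.0, "frame_time_ms": 33.0}

def scan_for_alarms(path="metrics.jsonl"):
    """Scan a metrics file -- from a developer test, a test stand, or a
    mission -- and flag every sample that exceeds its limit."""
    alarms = []
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            limit = LIMITS.get(sample["metric"])
            if limit is not None and sample["value"] > limit:
                alarms.append(sample)
    return alarms

for bad in scan_for_alarms():
    print("ALARM:", bad)
[/code]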
Automation is important, and continuous integration is "very valuable", Rose said. He suggested building for every platform all of the time, even for "things you don't use any more". SpaceX does that and has found interesting problems when building unused code. Unit tests are run from the continuous integration system any time the code changes. "Everyone here has 100% unit test coverage", he joked, but running whatever tests are available, and creating new ones is useful. When he worked on video games, they had a test to just "warp" the character to random locations in a level and had it look in the four directions, which regularly found problems.
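That warp test is cheap randomized testing, and the shape of it is easy to sketch in Python (the level bounds and renderer stub are stand-ins I made up; the real test was presumably watching for crashes and broken geometry):

[code]
import random
import unittest

# Invented level dimensions: x, y, z ranges.
LEVEL_BOUNDS = ((0.0, 1000.0), (0.0, 1000.0), (0.0, 100.0))

def render_view(position, direction):
    # Stand-in for the real renderer.
    assert direction in ("north", "south", "east", "west")
    return f"frame at {position} facing {direction}"

class WarpTest(unittest.TestCase):
    def test_random_warps(self):
        random.seed(1234)  # fixed seed so any failure is reproducible
        for _ in range(1000):
            position = tuple(random.uniform(lo, hi) for lo, hi in LEVEL_BOUNDS)
            for direction in ("north", "south", "east", "west"):
                self.assertTrue(render_view(position, direction))

if __name__ == "__main__":
    unittest.main()
[/code]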
"Automate process processes", he said. Things like coding standards, static analysis, spaces vs. tabs, or detecting the use of Emacs should be done automatically. SpaceX has a complicated process where changes cannot be made without tickets, code review, signoffs, and so forth, but all of that is checked automatically. If static analysis is part of the workflow, make it such that the code will not build unless it passes that analysis step.
When the build fails, it should "fail loudly" with a "monitor that starts flashing red" and email to everyone on the team. When that happens, you should "respond immediately" to fix the problem. In his team, they have a full-size Justin Bieber cutout that gets placed facing the team member who broke the build. They found that "100% of software engineers don't like Justin Bieber", and will work quickly to fix the build problem.
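"Fail loudly" doesn't need much infrastructure, Bieber cutout aside; it's roughly this sketch (the addresses and the local mail relay are assumptions):

[code]
import smtplib
import subprocess
from email.message import EmailMessage

TEAM = ["team@example.com"]  # placeholder address

def fail_loudly(log_tail):
    msg = EmailMessage()
    msg["Subject"] = "BUILD BROKEN -- fix immediately"
    msg["From"] = "ci@example.com"
    msg["To"] = ", ".join(TEAM)
    msg.set_content(log_tail[-2000:])  # tail of the build log
    with smtplib.SMTP("localhost") as s:  # assumes a local mail relay
        s.send_message(msg)

def run_build():
    result = subprocess.run(["make", "all"], capture_output=True, text=True)
    if result.returncode != 0:
        fail_loudly(result.stdout + result.stderr)
    return result.returncode

if __name__ == "__main__":
    raise SystemExit(run_build())
[/code]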