> The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
How does a pair determine which of the pair did the calculation correctly?
> Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.
Who sits down and determines that 8 is the correct number? Why not 4? Or 2? Or 16 or 32?
They probably set an acceptable total loss rate for the mission and worked backwards to determine how many replicas of each system they need to achieve that while minimizing total cost/weight.
So the answer is "some engineers sat down after talking to management".
For issues that have never occurred before, probabilities are the wrong tool. The right thing to do is list all the behaviour the vehicle must never exhibit and think of ways it still might, despite all redundancies -- maybe even despite every single component working as intended.
Lots of mission failures in history were caused by unexpected interactions between fully functional components. Probabilities of failures don't help with that.
I'm a big fan of Dissimilar Redundancies (but didn't know that was the term until today) for building system software.
Build for various Linux distros, and some of the BSDs. You'll encounter weird compile errors or edge cases that will pop up. Often times I've found that these will expose undefined behaviour or incorrect assumptions that you wouldn't notice if you were building for a single platform.
Not just CPUs, they run a whole different (but also simpler) fallback program in case the main computers fail. I think they were more worried about programming errors but this should avoid all shared failures between the main computers (be it programming or hardware).
Even if different teams write software in different languages, they end up creating very similar bugs because the bugs crop up in the complexities of the domain and insufficiencies of the specification.
N-version programming doesn't work as well as you think. See Knight and Leveson (1986).
(N-version programming does guard against "random" errors like typos or accidentally swapping parameters to a subroutine call. But so does a good test suite and a powerful compiler.)
How does a pair determine which of the pair did the calculation correctly?
There's also space systems that use 3 processors and a majority vote for the correct output, but that's different.
https://en.wikipedia.org/wiki/Lockstep_(computing)
Example: https://www.st.com/resource/en/datasheet/spc574k72e5.pdf
> each FCM consists of a self-checking pair of processors.
Who sits down and determines that 8 is the correct number? Why not 4? Or 2? Or 16 or 32?
So the answer is "some engineers sat down after talking to management".
Lots of mission failures in history were caused by unexpected interactions between fully functional components. Probabilities of failures don't help with that.
Build for various Linux distros, and some of the BSDs. You'll encounter weird compile errors or edge cases that will pop up. Often times I've found that these will expose undefined behaviour or incorrect assumptions that you wouldn't notice if you were building for a single platform.
Even if different teams write software in different languages, they end up creating very similar bugs because the bugs crop up in the complexities of the domain and insufficiencies of the specification.
N-version programming doesn't work as well as you think. See Knight and Leveson (1986).
(N-version programming does guard against "random" errors like typos or accidentally swapping parameters to a subroutine call. But so does a good test suite and a powerful compiler.)
I other words, how over engineered is it.