$10^{-9}$, or one-in-a-billion, is the famed number given for the maximum probability of a catastrophic failure, per hour of operation, in life-critical systems like commercial aircraft. The number is part of the folklore of the safety-critical systems literature; where does it come from?
First, it’s worth noting just how small that number is. As pointed out by Driscoll et al. in the paper Byzantine Fault Tolerance, from Theory to Reality, the probability of winning the U.K. lottery is 1 in tens of millions, and being struck by lightning (in the U.S.) is more than 1,000 times more likely than $10^{-9}$.
So where did $10^{-9}$ come from? A nice explanation comes from a recent paper by John Rushby:
If we consider the example of an airplane type with 100 members, each flying 3000 hours per year over an operational life of 33 years, then we have a total exposure of about $10^7$ flight hours. If hazard analysis reveals ten potentially catastrophic failures in each of ten subsystems, then the “budget” for each, if none are expected to occur in the life of the fleet, is a failure probability of about $10^{-9}$ per hour [1, page 37]. This serves to explain the well-known $10^{-9}$ requirement, which is stated as follows: “when using quantitative analyses. . . numerical probabilities. . . on the order of $10^{-9}$ per flight-hour. . . based on a flight of mean duration for the airplane type may be used. . . as aids to engineering judgment. . . to. . . help determine compliance” (with the requirement for extremely improbable failure conditions) [2, paragraph 10.b].
[1] E. Lloyd and W. Tye, Systematic Safety: Safety Assessment of Aircraft Systems. London, England: Civil Aviation Authority, 1982, reprinted 1992.
[2] System Design and Analysis, Federal Aviation Administration, Jun. 21, 1988, Advisory Circular 25.1309-1A.
(By the way, it’s worth reading the rest of the paper—it’s the first attempt I know of to formally connect the notions of (software) formal verification and reliability.)
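The arithmetic behind the quoted budget is easy to check. The sketch below assumes 3,000 flight hours per airplane per year, which is the value that makes the fleet exposure come out at about $10^7$ hours; the variable names are mine and purely illustrative.

```python
# Back-of-the-envelope check of the fleet "budget" in the quote above.
fleet_size = 100        # airplanes of one type
hours_per_year = 3000   # ASSUMPTION: chosen so the exposure is about 10^7 hours
service_years = 33      # operational life

exposure_hours = fleet_size * hours_per_year * service_years

# Ten potentially catastrophic failures in each of ten subsystems.
failure_conditions = 10 * 10

# If none of the failure conditions should occur over the whole exposure,
# each one gets roughly this per-hour probability budget.
budget_per_hour = 1 / (exposure_hours * failure_conditions)

print(exposure_hours)    # 9900000
print(budget_per_hour)   # roughly 1e-9
```

The exposure is a little under $10^7$ hours, so the per-condition budget lands just above $10^{-9}$ per hour, which is where the "on the order of" wording comes in.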
So there is a probabilistic argument being made, but let’s spell it out in a little more detail. If there are 10 potential failures in each of 10 subsystems, then there are $10 \times 10 = 100$ potential failures. Thus, there are $2^{100}$ possible configurations of failure/non-failure in the subsystems. Only one of these configurations is acceptable: the one in which there are no faults.
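If the probability of failure is $p$, then the probability of non-failure is $1 - p$. So if the probability of failure for each subsystem is $10^{-9}$ (and the failures are independent), then the probability of being in the one non-failure configuration is $(1 - 10^{-9})^{100}$.
We want that probability of non-failure to be at least the required probability of non-failure, given the total number of flight hours. Thus,
$$(1 - 10^{-9})^{100} \geq 1 - 10^{-7}$$
which indeed holds: expanding the left-hand side gives $(1 - 10^{-9})^{100} \approx 1 - 10^{-7} + 4.95 \times 10^{-15}$.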
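The claim that $(1 - 10^{-9})^{100} \geq 1 - 10^{-7}$ can be checked exactly. The margin is around $10^{-15}$, uncomfortably close to double-precision rounding error, so this sketch uses Python's exact rational arithmetic instead of floats; the variable names are illustrative.

```python
from fractions import Fraction

p = Fraction(1, 10**9)               # per-hour failure probability of one failure condition
required = 1 - Fraction(1, 10**7)    # required probability of non-failure

no_failure = (1 - p) ** 100          # all 100 failure conditions stay fault-free
margin = no_failure - required       # exact rational difference

print(no_failure >= required)        # True
print(float(margin))                 # about 4.95e-15
```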
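Can we generalize the inequality? The hint for how to do so is that the number of subsystems ($100$) is no more than the overall failure rate divided by the subsystem rate:
$$100 \leq \frac{10^{-7}}{10^{-9}}$$
This suggests the general form is something like
Subsystem reliability inequality:
$$(1 - C^{-n})^{C^{n-m}} \geq 1 - C^{-m}$$
where $C$, $n$, and $m$ are real numbers, and $C \geq 1$, $n \geq 0$, and $n \geq m$.
Let’s prove the inequality holds. Joe Hurd figured out the proof, sketched below (but I take responsibility for any mistakes in its presentation). For convenience, we’ll prove the inequality holds specifically when $C = e$, but the proof can be generalized.
First, if $n = m$, the inequality holds immediately, since both sides equal $1 - e^{-m}$ (and if $m \leq 0$, the right-hand side is at most zero, so there is nothing to prove). Next, we’ll show that
$$(1 - e^{-n})^{e^{n-m}}$$
is monotonically non-decreasing with respect to $n$ by showing that the derivative of its logarithm is greater than or equal to zero for all $n > 0$. The derivative of its logarithm is
$$\frac{d}{dn}\left[e^{n-m}\ln(1 - e^{-n})\right] = e^{n-m}\left(\ln(1 - e^{-n}) + \frac{e^{-n}}{1 - e^{-n}}\right)$$
Let $x = e^{-n}$, so the range of $x$ is $0 < x < 1$. Since $e^{n-m} > 0$, the derivative is non-negative exactly when
$$\frac{x}{1 - x} \geq -\ln(1 - x)$$
Now we show that in the range of $x$, the left-hand side is bounded below by the right-hand side of the inequality. Both sides tend to $0$ as $x \to 0$.
Now taking their derivatives
$$\frac{d}{dx}\,\frac{x}{1 - x} = \frac{1}{(1 - x)^2}, \qquad \frac{d}{dx}\bigl(-\ln(1 - x)\bigr) = \frac{1}{1 - x}$$
Because $\frac{1}{(1 - x)^2} \geq \frac{1}{1 - x}$ in the range of $x$, our proof holds.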
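As a sanity check rather than a proof, the general inequality can be tested exactly on a small grid of integer parameters. The sketch assumes the inequality is written as $(1 - C^{-n})^{C^{n-m}} \geq 1 - C^{-m}$ with $n \geq m \geq 0$; the symbol names and the helper `holds` are my own notation.

```python
from fractions import Fraction

def holds(C, n, m):
    """Exactly check (1 - C**-n) ** (C**(n - m)) >= 1 - C**-m for integers n >= m >= 0."""
    lhs = (1 - Fraction(1, C**n)) ** (C ** (n - m))
    rhs = 1 - Fraction(1, C**m)
    return lhs >= rhs

# The aircraft case: C = 10, n = 9, m = 7.
print(holds(10, 9, 7))

# A small sweep over integer parameters (an illustration, not a proof).
print(all(holds(C, n, m)
          for C in (2, 3, 10)
          for n in range(0, 5)
          for m in range(0, n + 1)))
```

Using `Fraction` keeps the comparison exact, which matters because the two sides are equal when $n = m$ and a floating-point comparison could go either way there.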
The purpose of this post was to clarify the folklore of ultra-reliable systems. The subsystem reliability inequality presented allows for easy generalization to other reliable systems.
Thanks again for the help, Joe!
In a previous post, I discussed the notion of Schrödinger CRCs, first described by Kevin Driscoll et al. in their paper Byzantine Fault Tolerance, from Theory to Reality. The basic idea is that error-detecting codes do not necessarily prevent two receivers from obtaining messages that are semantically different (i.e., different data) but syntactically valid (i.e., the CRC matches the respective data words received). The upshot is that even with CRCs, you can suffer Byzantine faults, with some probability.
… So what is the probability of a Schrödinger’s CRC? That’s the topic of this post, which cleans up a few of the ideas I presented earlier. I published a short paper on the topic, which I presented at Dependable Systems and Networks, 2010, while Kevin Driscoll was in the audience! If you’d prefer to read the PDF or get the slides, they’re here. The simulation code (Haskell) is here.
Software contains faults. The question is how to cost-effectively reduce the number of faults. One approach that gained traction and then fell out of favor was N-version programming. The basic idea is simple: have developer teams implement a specification independently of one another. Then we can execute the programs concurrently and compare their results. If we have, say, three separate programs, we vote their results, and if one result disagrees with the others, we presume that program contained a software bug.
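A minimal sketch of the voting step, assuming three versions and a strict-majority rule. The three `version_*` functions are hypothetical stand-ins for independently written implementations of one specification (absolute value), and one of them has a planted bug.

```python
from collections import Counter

def vote(results):
    """Return (majority_value, dissenting_indices); value is None without a strict majority."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        return None, list(range(len(results)))
    return value, [i for i, r in enumerate(results) if r != value]

# Hypothetical independently written versions of the same specification.
def version_a(x):
    return abs(x)

def version_b(x):
    return x if x >= 0 else -x

def version_c(x):
    return x   # planted bug: wrong for negative inputs

versions = [version_a, version_b, version_c]
print(vote([v(-4) for v in versions]))   # (4, [2])
print(vote([v(7) for v in versions]))    # (7, [])
```

Note that the second call shows the weakness: the buggy version is invisible on inputs where it happens to agree with the others.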
N-version programming rests on the assumption that software bugs in independently-implemented programs are random, statistically-uncorrelated events. Otherwise, multiple versions are not effective at detecting errors if the different versions are likely to suffer the same errors.
John Knight and Nancy Leveson famously debunked this assumption on which N-version programming rested in the “Knight-Leveson experiment” they published in 1986. In 1990, Knight and Leveson published a brief summary of the original experiment, as well as responses to subsequent criticisms made about it, in their paper, A Reply to the Criticisms of the Knight & Leveson Experiment.
The problem with N-version programming is subtle: it’s not that it provides zero improvement in reliability but that it provides significantly less improvement than is needed to make it cost-effective compared to other kinds of fault-tolerance (like architecture-level fault-tolerance). The problem is that even small probabilities of correlated faults lead to significant reductions in potential reliability improvements.
Lui Sha has a more recent (2001) IEEE Software article discussing N-version programming, taking into account that the software development cycle is finite: is it better to spend all your time and money on one reliable implementation or on three implementations that’ll be voted at runtime? His answer is almost always the former (even if we assume uncorrelated faults!).
But rather than N-versions of the same program, what about different programs compared at runtime? That’s the basic idea of runtime monitoring. In runtime monitoring, one program is the implementation and another is the specification; the implementation is checked against the specification at runtime. This is easier than checking before runtime (in which case you’d have to mathematically prove every possible execution satisfies the specification). As Sha points out in his article, the specification can be slow and simple. He gives the example of using the very simple Bubblesort as the runtime specification of the more complex Quicksort: if the Quicksort does its job correctly (in $O(n \log n)$, assuming a good pivot element), then checking its output (i.e., a hopefully properly sorted list) with Bubblesort will only take linear time (despite Bubblesort taking $O(n^2)$ in general).
The simple idea of simple monitors fascinates me. Of course, Bubblesort is not a full specification, though. Although Sha doesn’t suggest it, we’d probably like our monitor to compare the lengths of the input and output lists to ensure that the Quicksort implementation didn’t remove elements. And there’s still the possibility that the Quicksort implementation modifies elements, which is also unchecked by a Bubblesort monitor.
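Here is one way such an output monitor might look. This is my own sketch, not something from Sha's article: it replaces the Bubblesort pass with a single linear sortedness scan and adds a multiset comparison to catch dropped or modified elements. It assumes the elements are hashable and comparable.

```python
from collections import Counter

def check_sort_output(inp, out):
    """Monitor for a sort: `out` must be ordered and a permutation of `inp`."""
    # One linear pass to confirm the output is in non-decreasing order.
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    # Multiset equality catches dropped, added, or changed elements
    # (and therefore also any length mismatch).
    same_elements = Counter(inp) == Counter(out)
    return ordered and same_elements

data = [3, 1, 2, 3]
print(check_sort_output(data, sorted(data)))    # True
print(check_sort_output(data, [1, 2, 3]))       # False: an element was dropped
print(check_sort_output(data, [1, 2, 3, 4]))    # False: an element was changed
```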
But instead of just checking the output, we could sort the same input with both Quicksort and Bubblesort and compare the results. This is a “stronger” check insofar as different sorts would have to have exactly the same faults (e.g., not sorting, removing elements, changing elements) for an error not to be caught. The principal drawback is the latency of the slower Bubblesort check as compared to Quicksort. But sometimes, it may be ok to signal an error (shortly) after a result is provided.
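A sketch of that stronger check, with a simple Bubblesort as the reference. The `quicksort` here is a textbook list-based version and `monitored_sort` is a hypothetical wrapper name; in a real system the comparison could run after the result has already been handed back.

```python
def bubblesort(xs):
    """Slow, simple reference sort; the early exit makes it linear on sorted input."""
    xs = list(xs)
    n = len(xs)
    while True:
        swapped = False
        for i in range(n - 1):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
                swapped = True
        if not swapped:
            return xs
        n -= 1   # the largest remaining element is now in place

def quicksort(xs):
    """Textbook Quicksort standing in for the 'complex' implementation."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    return (quicksort([x for x in rest if x < pivot]) + [pivot]
            + quicksort([x for x in rest if x >= pivot]))

def monitored_sort(implementation, xs):
    """Run the implementation, then compare its result against the reference."""
    result = implementation(xs)
    return result, result == bubblesort(xs)

print(monitored_sort(quicksort, [3, 1, 2, 3]))                 # ([1, 2, 3, 3], True)
print(monitored_sort(lambda xs: sorted(set(xs)), [2, 1, 2]))   # ([1, 2], False)
```

The second call uses a deliberately faulty sort that drops duplicates, which the comparison flags.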
Just like for N-version programming, we would like the faults in our monitor to be statistically uncorrelated with those in the monitored software. I am left wondering about the following questions:
- Is there research comparing the kinds of programming errors made in radically different paradigms, such as Haskell and C? Are there any faults we can claim are statistically uncorrelated?
- Runtime monitoring itself is predicated on the belief that the implementations of different programs will fail in statistically independent ways, just like N-version programming is. While more plausible, does this assumption hold?
I’m working on a NASA-sponsored project to monitor safety-critical embedded systems at runtime, and that’s started me thinking about cyclic redundancy checks (CRCs) again. Error-checking codes are fundamental in fault-tolerant systems. The basic idea is simple: a transmitter wants to send a data word, so it computes a CRC over the word. It sends both the data word and the CRC to the receiver, which computes the same CRC over the received word. If its computed CRC and the received one differ, then there was a transmission error (there are simplifications to this approach, but that’s the basic idea).
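The send-and-check flow can be sketched in a few lines. I use CRC-32 from Python's standard `zlib` module only because it is readily available; the function names `transmit` and `receive` and the example payload are illustrative assumptions, not part of any particular protocol.

```python
import zlib

def transmit(data: bytes):
    """Transmitter side: send the data word together with its CRC."""
    return data, zlib.crc32(data)

def receive(data: bytes, crc: int) -> bool:
    """Receiver side: recompute the CRC and compare it with the one received."""
    return zlib.crc32(data) == crc

word, crc = transmit(b"sensor reading: 42")
print(receive(word, crc))        # True: nothing changed in transit

# Flip a single bit of the data word "in transit".
corrupted = bytes([word[0] ^ 0x01]) + word[1:]
print(receive(corrupted, crc))   # False: the mismatch reveals the error
```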
CRCs have been around since the 60s, and despite the simple mathematical theory on which they’re built (polynomial division in the Galois field GF(2), containing two elements, “0” and “1”), I was surprised to see that even today, their fault-tolerance properties are in some cases unknown or misunderstood. Phil Koopman at CMU has written a few nice papers over the past couple of years explaining some common misconceptions and analyzing commonly-used CRCs.
Particularly, there seems to be an over-confidence in their ability to detect errors. One fascinating result is the so-called “Schrödinger’s CRC,” so-dubbed in a paper entitled Byzantine Fault Tolerance, from Theory to Reality, by Kevin Driscoll et al. A Schrödinger’s CRC occurs when a transmitter broadcasts a data word and associated CRC to two receivers, and at least one of the data words is corrupted in transit and so is the corresponding CRC so that the faulty word and faulty CRC match! How does this happen? Let’s look at a concrete example:
|             | 11-Bit Message        | USB-5     |
|-------------|-----------------------|-----------|
| Receiver A  | 1 0 1 1 0 1 1 0 0 1 1 | 0 1 0 0 1 |
| Transmitter | 1 ½ 1 1 0 1 1 ½ 0 1 1 | ½ 1 0 0 1 |
| Receiver B  | 1 1 1 1 0 1 1 1 0 1 1 | 1 1 0 0 1 |
We illustrate a transmitter broadcasting an 11-bit message to two receivers, A and B. We use the USB-5 CRC, generally used to check USB token packets (by the way, for 11-bit messages, USB-5 has a Hamming Distance of three, meaning the CRC will catch any corruption of fewer than three bits in the combined 11-bit message and CRC). Now, suppose the transmitter has suffered some fault such as a “stuck-at-1/2” fault so that periodically, the transmitter fails to drive the signal on the bus sufficiently high or low. A receiver may interpret an intermediate signal as either a 0 or 1. In the figure, we show the transmitter sending three stuck-at-1/2 signals, two in the 11-bit message and one in the CRC. Receiver A reads all three as 0 and Receiver B reads all three as 1. The upshot is an example in which a CRC does not prevent a Byzantine fault: the two receivers obtain different messages, each of which passes its CRC.
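The example can be checked mechanically. The sketch below assumes the figure uses plain polynomial division by the USB-5 generator $x^5 + x^2 + 1$, with no initial value or bit inversion (the full USB specification adds both); under that assumption it reproduces the CRC bits shown for each receiver.

```python
GENERATOR = [1, 0, 0, 1, 0, 1]   # x^5 + x^2 + 1, the USB-5 generator polynomial

def crc5(message):
    """Remainder of message * x^5 divided by the generator over GF(2)."""
    reg = list(message) + [0] * 5          # append five zero bits
    for i in range(len(message)):
        if reg[i]:                         # leading bit set: subtract (XOR) the generator
            for j, g in enumerate(GENERATOR):
                reg[i + j] ^= g
    return reg[-5:]                        # the last five bits are the remainder

# The 11-bit words as interpreted by each receiver in the figure.
receiver_a = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1]
receiver_b = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]

print(crc5(receiver_a))   # [0, 1, 0, 0, 1]
print(crc5(receiver_b))   # [1, 1, 0, 0, 1]
```

Both receivers therefore see a word whose recomputed CRC matches the CRC bits they received, even though the two words differ in two positions.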
One question is how likely this scenario is. Paulitsch et al. write that “The probability of a Schrödinger’s CRC is hard to evaluate. A worst-case estimate of its occurrence due to a single device is the device failure rate.” It’d be interesting to know if there’s any data on this probability.