## Posts Tagged ‘reliability’

### 10 to the -9

January 24, 2010

$10^{-9}$, or one-in-a-billion, is the famed number given for the maximum probability of a catastrophic failure, per hour of operation, in life-critical systems like commercial aircraft.  The number is part of the folklore of the safety-critical systems literature; where does it come from?

First, it’s worth noting just how small that number is.  As pointed out by Driscoll et al. in the paper, Byzantine Fault Tolerance, from Theory to Reality, the probability of winning the U.K. lottery is 1 in tens of millions, and the probability of being struck by lightning (in the U.S.) is $1.6 \times 10^{-6},$ more than 1,000 times more likely than $10^{-9}.$

So where did $10^{-9}$ come from?  A nice explanation comes from a recent paper by John Rushby:

If we consider the example of an airplane type with 100 members, each flying $3000$ hours per year over an operational life of 33 years, then we have a total exposure of about $10^7$ flight hours. If hazard analysis reveals ten potentially catastrophic failures in each of ten subsystems, then the “budget” for each, if none are expected to occur in the life of the fleet, is a failure probability of about $10^{-9}$ per hour [1, page 37]. This serves to explain the well-known $10^{-9}$ requirement, which is stated as follows: “when using quantitative analyses. . . numerical probabilities. . . on the order of $10^{-9}$ per flight-hour. . . based on a flight of mean duration for the airplane type may be used. . . as aids to engineering judgment. . . to. . . help determine compliance” (with the requirement for extremely improbable failure conditions) [2, paragraph 10.b].

[1] E. Lloyd and W. Tye, Systematic Safety: Safety Assessment of Aircraft Systems. London, England: Civil Aviation Authority, 1982, reprinted 1992.

[2] System Design and Analysis, Federal Aviation Administration, Jun. 21, 1988, Advisory Circular 25.1309-1A.

(By the way, it’s worth reading the rest of the paper—it’s the first attempt I know of to formally connect the notions of (software) formal verification and reliability.)
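The arithmetic in the quoted passage can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope budget: the variable names are mine, and splitting the budget evenly across all failure conditions is an assumption made for illustration.

```python
# Fleet exposure and per-failure budget, using the figures quoted from Rushby.
aircraft = 100            # airplanes of one type
hours_per_year = 3000     # flight hours per airplane per year
service_years = 33        # operational life in years

fleet_hours = aircraft * hours_per_year * service_years  # 9,900,000, about 10**7

subsystems = 10
failures_per_subsystem = 10
failure_conditions = subsystems * failures_per_subsystem  # 100

# Allow at most one expected catastrophic failure over the fleet's life,
# split evenly across all failure conditions (the even split is an assumption).
budget_per_hour = 1 / (fleet_hours * failure_conditions)

print(f"fleet exposure: {fleet_hours:.2e} flight hours")
print(f"budget per failure condition: {budget_per_hour:.2e} per hour")
```

The budget comes out at roughly $10^{-9}$ per flight hour, matching the quoted figure.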

So there is a probabilistic argument being made, but let’s spell it out in a little more detail.  If there are 10 potential failures in each of 10 subsystems, then there are $10 \times 10 = 100$ potential failures.  Thus, there are $2^{100}$ possible configurations of failure/non-failure in the subsystems.  Only one of these configurations is acceptable: the one in which there are no faults.

If the probability of a given failure is $x,$ then the probability of its non-failure is $1 - x.$  So if the probability of each potential failure is $10^{-9},$ and the failures are independent, then the probability of being in the one non-failure configuration is

$\displaystyle(1 - 10^{-9})^{100}$

We want that probability of non-failure to be greater than the required probability of non-failure, given the total number of flight hours.  Thus,

$\displaystyle (1 - 10^{-9})^{100} > 1 - 10^{-7}$

which indeed holds:

$\displaystyle (1 - 10^{-9})^{100} - (1 - 10^{-7})$

is around $4.95 \times 10^{-15}.$
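The margin is far too small to check reliably with ordinary floating-point arithmetic, so here is a minimal sketch that checks it with exact rational arithmetic instead. The variable names are illustrative.

```python
from fractions import Fraction

per_failure = Fraction(1, 10**9)   # probability of each potential failure
overall = Fraction(1, 10**7)       # overall failure budget
count = 100                        # number of potential failures

# Probability that none of the 100 potential failures occurs.
no_failure = (1 - per_failure) ** count

# How much that exceeds the required probability of non-failure.
margin = no_failure - (1 - overall)

print(margin > 0)       # the inequality holds exactly
print(float(margin))    # the margin, converted to a float only for display
```

The exact margin is dominated by the second-order binomial term, $\binom{100}{2} \times 10^{-18} = 4.95 \times 10^{-15}.$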

Can we generalize the inequality?  The hint for how to do so is that the number of potential failures ($100$) is no more than the overall failure rate divided by the rate for each individual failure:

$\displaystyle \frac{10^{-7}}{10^{-9}}$

This suggests the general form is something like

Subsystem reliability inequality: $\displaystyle (1 - C^{-n})^{C^{n-m}} \geq 1 - C^{-m}$

where $C,$ $n,$ and $m$ are real numbers, $C \geq 1,$ $n \geq 0,$ and $n \geq m.$
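Before proving the inequality, it can be spot-checked. The sketch below restricts $C,$ $n,$ and $m$ to small non-negative integers (an assumption made only so that exact rational arithmetic can be used); it is a sanity check on a small grid, not a proof.

```python
from fractions import Fraction

def holds(C: int, n: int, m: int) -> bool:
    """Check (1 - C^-n)^(C^(n-m)) >= 1 - C^-m exactly, for integers n >= m >= 0."""
    lhs = (1 - Fraction(1, C**n)) ** (C ** (n - m))
    rhs = 1 - Fraction(1, C**m)
    return lhs >= rhs

# A small grid of integer cases with n >= m >= 0.
checked = [(C, n, m)
           for C in (2, 3, 10)
           for n in range(0, 5)
           for m in range(0, n + 1)]
violations = [case for case in checked if not holds(*case)]

print(len(checked), "cases checked,", len(violations), "violations")
print(holds(10, 9, 7))   # the aircraft example: C = 10, n = 9, m = 7
```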

Let’s prove the inequality holds.  Joe Hurd figured out the proof, sketched below (but I take responsibility for any mistakes in its presentation).  For convenience, we’ll prove the inequality holds specifically when $C = e,$ but the proof can be generalized.

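First, if $n = 0,$ the inequality holds immediately: then $m \leq 0,$ so the left-hand side is $0$ and the right-hand side is at most $0.$  Also, when $n = m$ the two sides are equal, so it suffices to show that the left-hand side does not decrease as $n$ grows. Next, we’ll show that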

$\displaystyle (1 - e^{-n})^{e^{n-m}}$

is monotonically non-decreasing with respect to $n$ by showing that the derivative of its logarithm is greater than or equal to zero for all $n > 0.$  The derivative of its logarithm is

$\displaystyle \frac{d}{dn} \; e^{n-m}\ln(1-e^{-n}) = e^{n-m}\ln(1-e^{-n})+\frac{e^{-m}}{1-e^{-n}}$

We show

$\displaystyle e^{n-m}\ln(1-e^{-n})+\frac{e^{-m}}{1-e^{-n}} \geq 0$

iff

$\displaystyle e^{-m}\left(e^{n}\ln(1-e^{-n}) + \frac{1}{1-e^{-n}}\right) \geq 0$

and since $e^{-m} > 0,$

$\displaystyle e^{n}\ln(1-e^{-n}) + \frac{1}{1-e^{-n}} \geq 0$

iff

$\displaystyle e^{n}\ln(1-e^{-n}) \geq - \frac{1}{1-e^{-n}}$

Let $x = e^{-n}$, so the range of $x$ is $0 < x < 1.$  Multiplying both sides by $x > 0,$ the inequality becomes

$\displaystyle\ln(1-x) \geq - \frac{x}{1-x}$

Now we show that in the range of $x$, the left-hand side is bounded below by the right-hand side of the inequality.  Both sides tend to zero as $x \to 0$:

$\displaystyle \lim_{x \to 0} \; \ln(1-x) = 0$

and

$\displaystyle \lim_{x \to 0} \; - \frac{x}{1-x} = 0$

Now taking their derivatives, we have

$\displaystyle \frac{d}{dx} \; \ln(1-x) = \frac{1}{x-1}$

and

$\displaystyle \frac{d}{dx} \; - \frac{x}{1-x} = - \frac{1}{(x-1)^2}$

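Because $\displaystyle \frac{1}{x-1} \geq - \frac{1}{(x-1)^2}$ in the range of $x$ (multiplying both sides by $(x-1)^2 > 0$ gives $x - 1 \geq -1,$ that is, $x \geq 0$), the left-hand side decreases no faster than the right-hand side from their common limit of zero, and our proof holds.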
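As a sanity check on this last step, the following sketch evaluates both sides of $\ln(1-x) \geq -\frac{x}{1-x}$ and the two derivatives computed above at sample points in $0 < x < 1.$ The function names and the sampling grid are my own choices, and a numerical check of this kind only illustrates the claim; it does not replace the proof.

```python
import math

def lhs(x: float) -> float:
    return math.log1p(-x)            # ln(1 - x), computed stably for small x

def rhs(x: float) -> float:
    return -x / (1.0 - x)

def d_lhs(x: float) -> float:
    return 1.0 / (x - 1.0)           # derivative of ln(1 - x)

def d_rhs(x: float) -> float:
    return -1.0 / (x - 1.0) ** 2     # derivative of -x / (1 - x)

samples = [k / 1000 for k in range(1, 1000)]   # x = 0.001, ..., 0.999

inequality_ok = all(lhs(x) >= rhs(x) for x in samples)
slopes_ok = all(d_lhs(x) >= d_rhs(x) for x in samples)

print(inequality_ok, slopes_ok)
```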

The purpose of this post was to clarify the folklore of ultra-reliable systems.  The subsystem reliability inequality presented allows for easy generalization to other reliable systems.

Thanks again for the help, Joe!

### N-Version Programming… For the nth Time

April 27, 2009

Software contains faults.  The question is how to cost-effectively reduce the number of faults.  One approach that gained traction and then fell out of favor was N-version programming.  The basic idea is simple: have developer teams implement a specification independently of one another.  Then we can execute the programs concurrently and compare their results.  If we have, say, three separate programs, we vote their results, and if one result disagrees with the others, we presume that program contained a software bug.
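To make the voting step concrete, here is a minimal sketch of a majority voter. The three "versions" are hypothetical toy implementations of squaring, with a bug deliberately seeded in the third; a real N-version system would run separately developed programs concurrently.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

def vote(versions: Sequence[Callable[[int], int]], x: int) -> Tuple[int, List[int]]:
    """Run every version on x; return (majority result, indices of dissenters)."""
    results = [f(x) for f in versions]
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError(f"no majority among results {results}")
    dissenters = [i for i, r in enumerate(results) if r != winner]
    return winner, dissenters

# Three hypothetical implementations of squaring.
def square_a(x: int) -> int:
    return x * x

def square_b(x: int) -> int:
    return x ** 2

def square_c(x: int) -> int:
    return x * abs(x)   # seeded bug: wrong sign for negative inputs

versions = [square_a, square_b, square_c]
print(vote(versions, 3))    # all three agree
print(vote(versions, -3))   # the third version is outvoted
```

Note that the voter can only flag a version that disagrees with the majority; if two versions share the same bug, the wrong answer wins the vote.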

N-version programming rests on the assumption that software bugs in independently-implemented programs are random, statistically-uncorrelated events.  If the different versions are likely to suffer the same errors, multiple versions are not effective at detecting them.

John Knight and Nancy Leveson famously debunked this assumption on which N-version programming rested in the “Knight-Leveson experiment” they published in 1986.  In 1990, Knight and Leveson published a brief summary of the original experiment, as well as responses to subsequent criticisms made about it, in their paper, A Reply to the Criticisms of the Knight & Leveson Experiment.

The problem with N-version programming is subtle: it’s not that it provides zero improvement in reliability but that it provides significantly less improvement than is needed to make it cost-effective compared to other kinds of fault-tolerance (like architecture-level fault-tolerance).  The problem is that even small probabilities of correlated faults lead to significant reductions in potential reliability improvements.

Lui Sha has a more recent (2001) IEEE Software article discussing N-version programming, taking into account that the software development cycle is finite: is it better to spend all your time and money on one reliable implementation or on three implementations that’ll be voted at runtime?  His answer is almost always the former (even if we assume uncorrelated faults!).

But rather than N-versions of the same program, what about different programs compared at runtime?  That’s the basic idea of runtime monitoring.  In runtime monitoring, one program is the implementation and another is the specification; the implementation is checked against the specification at runtime.  This is easier than checking before runtime (in which case you’d have to mathematically prove every possible execution satisfies the specification).  As Sha points out in his article, the specification can be slow and simple.  He gives the example of using the very simple Bubblesort as the runtime specification of the more complex Quicksort: if the Quicksort does its job correctly (in $O(n \log n)$ time, assuming a good pivot element), then checking its output (i.e., a hopefully properly sorted list) with Bubblesort will only take linear time (despite Bubblesort taking $O(n^2)$ time in general).
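The linear-time claim depends on Bubblesort stopping as soon as a pass makes no swaps. Here is a minimal sketch of such a monitor; the function names are hypothetical, and Python's built-in `sorted` stands in for the Quicksort implementation being monitored.

```python
def bubble_sort_passes(items):
    """Bubble sort with early exit; returns (sorted copy, number of passes made)."""
    data = list(items)
    passes = 0
    swapped = True
    while swapped:
        swapped = False
        passes += 1
        for i in range(len(data) - 1):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
    return data, passes

def monitor(output):
    """Accept the output only if Bubblesort finds nothing to swap in its first pass."""
    _, passes = bubble_sort_passes(output)
    return passes == 1

fast_output = sorted([5, 2, 9, 1])   # stand-in for the Quicksort under test
print(monitor(fast_output))          # correct output: one linear pass
print(monitor([1, 9, 2, 5]))         # faulty output: more passes are needed
```

On a correctly sorted list the monitor makes a single pass of $n - 1$ comparisons; the quadratic cost is only paid when the output is wrong.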

The simple idea of simple monitors fascinates me.  Of course, Bubblesort is not a full specification, though.  Although Sha doesn’t suggest it, we’d probably like our monitor to compare the lengths of the input and output lists to ensure that the Quicksort implementation didn’t remove elements.  And there’s still the possibility that the Quicksort implementation modifies elements, which is also unchecked by a Bubblesort monitor.
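A monitor that also covers removed or modified elements might look like the sketch below. It assumes the elements are hashable so they can be counted with `collections.Counter`; the function names are mine, not Sha's.

```python
from collections import Counter

def is_permutation(original, result):
    """True if result has exactly the same elements, with the same multiplicities."""
    return Counter(original) == Counter(result)

def is_sorted(result):
    """True if every element is no larger than its successor."""
    return all(a <= b for a, b in zip(result, result[1:]))

def check_sort(original, result):
    """Full check: same length, same elements, and in non-decreasing order."""
    return (len(original) == len(result)
            and is_permutation(original, result)
            and is_sorted(result))

print(check_sort([3, 1, 2], [1, 2, 3]))   # a correct sort
print(check_sort([3, 1, 2], [1, 2]))      # an element was removed
print(check_sort([3, 1, 2], [1, 2, 4]))   # an element was modified
```

The length comparison is implied by the multiset comparison; it is kept as a cheap first test.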

But instead of just checking the output, we could sort the same input with both Quicksort and Bubblesort and compare the results.  This is a “stronger” check insofar as different sorts would have to have exactly the same faults (e.g., not sorting, removing elements, changing elements) for an error not to be caught.  The principal drawback is the latency of the slower Bubblesort check as compared to Quicksort.  But sometimes, it may be ok to signal an error (shortly) after a result is provided.
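This stronger check can be sketched as follows. Both sorts are simple illustrative versions written for this example, and `monitored_sort` is a hypothetical wrapper; in practice the slow comparison could run after the fast result has already been handed back.

```python
def quicksort(xs):
    """Implementation under test (a simple, non-in-place Quicksort for illustration)."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))

def bubblesort(xs):
    """Slow, simple reference used as the runtime specification."""
    data = list(xs)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data

def monitored_sort(xs):
    """Return the fast result plus a flag saying whether the reference agrees."""
    result = quicksort(xs)
    agrees = result == bubblesort(xs)
    return result, agrees

print(monitored_sort([3, 1, 2, 1]))
```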

Just like for N-version programming, we would like the faults in our monitor to be statistically uncorrelated with those in the monitored software.  I am left wondering about the following questions:

• Is there research comparing the kinds of programming errors made in radically different paradigms, such as Haskell and C?  Are there any faults we can claim are statistically uncorrelated?
• Runtime monitoring itself is predicated on the belief that the implementations of different programs will fail in statistically independent ways, just like N-version programming is.  While more plausible, does this assumption hold?