In a previous post, I discussed the notion of Schrödinger CRCs, first described by Kevin Driscoll et al. in their paper Byzantine Fault Tolerance, from Theory to Reality. The basic idea is that error-detecting codes do not necessarily prevent two receivers from obtaining messages that are semantically different (i.e., different data) but syntactically valid (i.e., the CRC matches the respective data words received). The upshot is that even with CRCs, you can suffer Byzantine faults, with some probability.
… So what is that probability of a Schrödinger’s CRC? That’s the topic of this post—which cleans up a few of the ideas I presented earlier. I published a short paper on the topic, which I presented at Dependable Sensors and Networks, 2010, while Kevin Driscoll was in the audience! If you’d prefer to read the PDF or get the slides, they’re here. The simulation code (Haskell) is here.
Software contains faults. The question is how to cost-effectively reduce the number of faults. One approach that gained traction and then fell out of favor was N-version programming. The basic idea is simple: have developer teams implement a specification independent from one another. Then we can execute the programs concurrently and compare their results. If we have, say, three separate programs, we vote their results, and if one result disagrees with the others, we presume that program contained a software bug.
N-version programming rests on the assumption that software bugs in independently-implemented programs are random, statistically-uncorrelated events. Otherwise, multiple versions are not effective at detecting errors if the different versions are likely to suffer the same errors.
John Knight and Nancy Leveson famously debunked this assumption on which N-version programming rested in the “Knight-Leveson experiment” they published in 1986. In 1990, Knight and Leveson published a brief summary of the original experiment, as well as responses to subsequent criticisms made about it, in their paper, A Reply to the Criticisms of the Knight & Leveson Experiment.
The problem with N-version programming is subtle: it’s not that it provides zero improvement in reliability but that it provides significantly less improvement than is needed to make it cost-effective compared to other kinds of fault-tolerance (like architecture-level fault-tolerance). The problem is that even small probabilities of correlated faults lead to significant reductions in potential reliability improvements.
Lui Sha has a more recent (2001) IEEE Software article discussing N-version programming, taking into account that the software development cycle is finite: is it better to spend all your time and money on one reliable implementation or on three implementations that’ll be voted at runtime? His answer is almost always the former (even if we assume uncorrelated faults!).
But rather than N-versions of the same program, what about different programs compared at runtime? That’s the basic idea of runtime monitoring. In runtime monitoring, one program is the implementation and another is the specification; the implementation is checked against the specification at runtime. This is easier than checking before runtime (in which case you’d have to mathematically prove every possible execution satisfies the specification). As Sha points out in his article, the specification can be slow and simple. He gives the example of using the very simple Bubblesort as the runtime specification of the more complex Quicksort: if the Quicksort does its job correctly (in O(n log n), assuming a good pivot element), then checking its output (i.e., a hopefully properly sorted list) with Bubblesort will only take linear time (despite Bubble sort taking O(n2) in general).
The simple idea of simple monitors fascinates me. Of course, Bubblesort is not a full specification, though. Although Sha doesn’t suggest it, we’d probably like our monitor to compare the lengths of the input and output lists to ensure that the Quicksort implementation didn’t remove elements. And there’s still the possibility that the Quicksort implementation modifies elements, which is also unchecked by a Bubblesort monitor.
But instead of just checking the output, we could sort the same input with both Quickcheck and Bubblesort and compare the results. This is a “stronger” check insofar as different sorts would have to have exactly the same faults (e.g., not sorting, removing elements, changing elements) for an error not to be caught. The principal drawback is the latency of the slower Bubblesort check as compared to Quicksort. But sometimes, it may be ok to signal an error (shortly) after a result is provided.
Just like for N-version programming, we would like the faults in our monitor to be statistically uncorrelated with those in the monitored software. I am left wondering about the following questions:
- Is there research comparing the kinds of programming errors made in radically different paradigms, such as a Haskell and C? Are there any faults we can claim are statistically uncorrelated?
- Runtime monitoring itself is predicated on the belief that the implementations of different programs will fail in statistically independent ways, just like N-version programming is. While more plausible, does this assumption hold?
I’m working on a NASA-sponsored project to monitor safety-critical embedded systems at runtime, and that’s started me thinking about cyclic redundancy checks (CRCs) again. Error-checking codes are fundamental in fault-tolerant systems. The basic idea is simple: a transmitter wants to send a data word, so it computes a CRC over the word. It sends both the data word and the CRC to the receiver, which computes the same CRC over the received word. If its computed CRC and the received one differ, then there was a transmission error (there are simplifications to this approach, but that’s the basic idea).
CRCs have been around since the 60s, and despite the simple mathematical theory on which they’re built (polynomial division in the Galois Field 2, containing two elements, “0″ and “1″), I was surprised to see that even today, their fault-tolerance properties are in some cases unknown or misunderstood. Phil Koopman at CMU has written a few nice papers over the past couple of years explaining some common misconceptions and analyzing commonly-used CRCs.
Particularly, there seems to be an over-confidence in their ability to detect errors. One fascinating result is the so-called “Schrödinger’s CRC,” so-dubbed in a paper entitled Byzantine Fault Tolerance, from Theory to Reality, by Kevin Driscoll et al. A Schrödinger’s CRC occurs when a transmitter broadcasts a data word and associated CRC to two receivers. and at least one of the data words is corrupted in transit and so is the corresponding CRC so that the faulty word and faulty CRC match! How does this happen? Let’s look at a concrete example:
11-Bit Message USB-5 Receiver A 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 Transmitter 1 ½ 1 1 0 1 1 ½ 0 1 1 ½ 1 0 0 1 Receiver B 1 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1
We illustrate a transmitter broadcasting an 11-bit message to two receivers, A and B. We use USB-5 CRC, generally used to check USB token packets (by the way, for 11-bit messages, USB-5 has a Hamming Distance of three, meaning the CRC will catch any corruption of fewer than three bits in the combined 11-bit message and CRC). Now, suppose the transmitter has suffered some fault such as a “stuck-at-1/2” fault so that periodically, the transmitter fails to drive the signal on the bus sufficiently high or low. A receiver may interpret an intermediate signal as either a 0 or 1. In the ﬁgure, we show the transmitter sending three stuck-at-1/2 signals, one in the 11-bit message, and two in CRC. The upshot is an example in which a CRC does not prevent a Byzantine fault—the two receivers obtain different messages, each of which passes its CRC.
One question is how likely this scenario is. Paulitsch et al. write that The probability of a Schrödinger’s CRC is hard to evaluate. A worst-case estimate of its occurrence due to a single device is the device failure rate.” It’d be interesting to know if there’s any data on this probability.