## FM@Galois Talks

November 10, 2011

As I posted over on the Galois blog, I recently gave a talk (a couple of times now) surveying formal methods at Galois over the years (pdf). It’s been fun putting theory into practice!

## Making your Ubuntu life better

June 18, 2011

I’ve had a lot of trouble with Ubuntu 11.04 (Natty Narwhal) on a laptop (ThinkPad 420), including:

- Not being able to use dual monitors,
- Random logoffs.

I tentatively think I was able to solve them with two easy fixes:

- Turn off Unity. You can do this in the login screen.
- In the Update Manager, under “Settings”, check “Proposed updates”.

With the proposed updates installed, Unity may work; try it at your own risk. With these two fixes, I have a reasonably stable system.

## Meta-Programming and eDSLs

January 30, 2011

I’ve been working on a domain-specific language that is embedded in Haskell (an “eDSL”) that essentially takes a set of Haskell stream (infinite list) equations and turns them into a real-time C program implementing the state-machine defined by the streams. It’s called Copilot, and in fact, it’s built on top of another Haskell eDSL called Atom,^{1} which actually does the heavy lifting in generating the C code.

For example, here’s the Fibonacci sequence in Copilot:

    fib = do
      let f = varW64 "f"
      f .= [0,1] ++ f + (drop 1 f)
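For comparison, the same stream equation can be written directly as an infinite list in ordinary Haskell (this is plain Haskell, not Copilot’s API):

```haskell
-- The Fibonacci stream as a lazy infinite list: the equation
-- f = [0,1] ++ (f + drop 1 f), with zipWith giving pointwise (+).
fib :: [Integer]
fib = 0 : 1 : zipWith (+) fib (tail fib)

main :: IO ()
main = print (take 8 fib)  -- [0,1,1,2,3,5,8,13]
```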

I’ve been writing Copilot libraries recently, and I’ve had the following realization about eDSLs and meta-programming (let me know if someone has had this idea already!). Many languages treat meta-programming as a second-class feature—I’m thinking of macros used by the C preprocessor, for example (it’s probably generous even to call the C preprocessor ‘meta-programming’). One reason why Lisp-like languages were exciting is that the *language* is a first-class datatype, so meta-programming is on par with programming. In my experience, you don’t think twice about meta-programming in Lisp. (Haskell is more like C in this regard—you do think twice before using Template Haskell.)

So languages generally treat meta-programming as either second-class or first-class. What’s interesting about eDSLs, I think, is that they treat meta-programming as first-class, but programming as *second*-class! This isn’t surprising, since an eDSL is a first-class datatype, but the language is *not* first-class—its host language is.

Practically, what this means is that you spend very little time actually writing eDSL programs but rather *generating* eDSL programs. It is natural to layer eDSLs on top of other eDSLs.

This is just how Copilot came about: I was writing various Atom programs and realized that for my purposes, I just needed a restricted set of behaviors provided by Atom that are naturally represented by stream equations (and make some other tasks, like writing an interpreter, easier).

But as soon as Copilot was written, we^{2} started writing libraries implementing past-time linear temporal logic (LTL) operators, bounded LTL operators, fault-tolerant voting algorithms, regular expressions, and so on.
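To give a flavor of what such a library looks like, here is a sketch of two past-time LTL operators over Boolean streams, modeled as plain Haskell lists rather than Copilot’s actual stream type (the names `alwaysBeen` and `once` are my own illustration, not Copilot’s API):

```haskell
-- Past-time LTL over Boolean streams modeled as lazy lists.

-- alwaysBeen s is true at time t iff s was true at every point up to t.
alwaysBeen :: [Bool] -> [Bool]
alwaysBeen = scanl1 (&&)

-- once s is true at time t iff s was true at some point up to t.
once :: [Bool] -> [Bool]
once = scanl1 (||)

main :: IO ()
main = do
  print (alwaysBeen [True, True, False, True])  -- [True,True,False,False]
  print (once [False, False, True, False])      -- [False,False,True,True]
```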

You wouldn’t think about combining the intermediate languages of a compiler into the same program. The idea of a language is more fluid, more organic in the context of eDSLs, where now libraries can be quickly written and different levels can be easily combined.

^{1}Tom Hawkins wrote Atom.

^{2}Credit for Copilot also goes to Sebastian Niller, Robin Morisset, and Alwyn Goodloe.

## New Group: Functional Programming for Embedded Systems

May 30, 2010

Are you interested in how functional programming can be leveraged to make embedded-systems programming easier and more reliable? You are not alone. For example, check out what’s been happening in just the past couple of years.

- Programming the Arduino micro-controller with Atom/Haskell.
- The Atom eDSL language itself.
- The Feldspar language for DSP (embedded in Haskell).
- Writing a DO-178B-compliant compiler in OCaml (presented at ICFP’09).
- … And others.

Now Tom Hawkins (designer of Atom) has started a Google group, fp-embedded, to discuss these issues. Please join and post your projects & questions!

## An Apologia for Formal Methods

March 14, 2010

In the January 2010 issue of *IEEE Computer*, David Parnas published an article, “Really Rethinking ‘Formal Methods’” (sorry, you’ll need an IEEE subscription or purchase of the article to access it), with the following abstract:

We must question the assumptions underlying the well-known current formal software development methods to see why they have not been widely adopted and what should be changed.

I found some of the opinions therein to be antiquated, so I wrote a letter to the editor (free content!), which appears in the March 2010 edition. IEEE also published a response from David Parnas, which you can also access at the letter link above.

I’ll refrain from visiting this debate here, but please have a look at the letters, enjoy the controversy, and do not hesitate to leave a comment!

## 10 to the -9

January 24, 2010

10^{−9}, or one-in-a-billion, is the famed number given for the maximum probability of a catastrophic failure, per hour of operation, in life-critical systems like commercial aircraft. The number is part of the folklore of the safety-critical systems literature; where does it come from?

First, it’s worth noting just how small that number is. As pointed out by Driscoll *et al*. in the paper *Byzantine Fault Tolerance, from Theory to Reality*, the probability of winning the U.K. lottery is 1 in 10s of millions, and being struck by lightning (in the U.S.) is more than 1,000 times more likely than an event with probability 10^{−9}.

So where did 10^{−9} come from? A nice explanation comes from a recent paper by John Rushby:

If we consider the example of an airplane type with 100 members, each flying 3,000 hours per year over an operational life of 33 years, then we have a total exposure of about 10^{7} flight hours. If hazard analysis reveals ten potentially catastrophic failures in each of ten subsystems, then the “budget” for each, if none are expected to occur in the life of the fleet, is a failure probability of about 10^{−9} per hour [1, page 37]. This serves to explain the well-known 10^{−9} requirement, which is stated as follows: “when using quantitative analyses. . . numerical probabilities. . . on the order of 10^{−9} per flight-hour. . . based on a flight of mean duration for the airplane type may be used. . . as aids to engineering judgment. . . to. . . help determine compliance” (with the requirement for extremely improbable failure conditions) [2, paragraph 10.b].

[1] E. Lloyd and W. Tye, Systematic Safety: Safety Assessment of Aircraft Systems. London, England: Civil Aviation Authority, 1982, reprinted 1992.

[2] System Design and Analysis, Federal Aviation Administration, Jun. 21, 1988, advisory Circular 25.1309-1A.

(By the way, it’s worth reading the rest of the paper—it’s the first attempt I know of to formally connect the notions of (software) formal verification and reliability.)
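The arithmetic behind the quote is easy to replay (assuming a fleet of 100 aircraft at roughly 3,000 flight hours per aircraft per year for 33 years, the figures in Rushby’s example):

```haskell
-- Replaying the fleet-exposure arithmetic from Rushby's example.
main :: IO ()
main = do
  let fleetHours = 100 * 3000 * 33 :: Integer  -- total fleet exposure
      conditions = 10 * 10                     -- 10 failures in each of 10 subsystems
      budget     = 1e-7 / fromIntegral conditions :: Double
  print fleetHours                    -- 9900000, i.e., about 10^7 flight hours
  print (abs (budget - 1e-9) < 1e-24) -- True: the per-condition budget is 10^-9
```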

So there’s a probabilistic argument being made, but let’s spell it out in a little more detail. If there are 10 potential failures in each of 10 subsystems, then there are 10 × 10 = 100 potential failures. Thus, there are 2^{100} possible configurations of failure/non-failure in the subsystems. Only one of these configurations is acceptable—the one in which there are no faults.

If the probability of failure is p, then the probability of non-failure is 1 − p. So if the probability of failure for each subsystem is 10^{−9} per hour, then the probability of being in the one non-failure configuration is (1 − 10^{−9})^{100}.

We want that probability of non-failure to be greater than the required probability of non-failure, given the total number of flight hours. Thus,

(1 − 10^{−9})^{100} ≥ 1 − 10^{−7},

which indeed holds:

(1 − 10^{−9})^{100} is around 1 − 10^{−7} + 5 × 10^{−15}, just above 1 − 10^{−7}.
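We can sanity-check this numerically (a `Double` has enough precision here, since the two sides differ by a few times 10^{−15}):

```haskell
-- Numeric check: (1 - 10^-9)^100 >= 1 - 10^-7.
main :: IO ()
main = do
  let lhs = (1 - 1e-9) ** 100 :: Double
      rhs = 1 - 1e-7 :: Double
  print (lhs - rhs)   -- tiny but positive
  print (lhs >= rhs)  -- True
```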

Can we generalize the inequality? The hint for how to do so is that the number of subsystems (100) is no more than the overall failure rate divided by the subsystem rate: 100 ≤ 10^{−7}/10^{−9}.

This suggests the general form is something like

*Subsystem reliability inequality*: (1 − b^{−(m+n)})^{b^{n}} ≥ 1 − b^{−m},

where b, m, and n are real numbers, b ≥ 1, and m, n ≥ 0. (Taking b = 10, m = 7, and n = 2 yields the instance above.)

Let’s prove the inequality holds. Joe Hurd figured out the proof, sketched below (but I take responsibility for any mistakes in its presentation). For convenience, we’ll prove the inequality for a fixed b ≥ 1, but the proof can be generalized.

First, if n = 0, the inequality holds immediately, since the left-hand side reduces to 1 − b^{−m}. Next, we’ll show that

(1 − b^{−(m+n)})^{b^{n}}

is monotonically non-decreasing with respect to n by showing that the derivative of its logarithm is greater than or equal to zero for all n ≥ 0. The derivative of its logarithm is

ln(b) · b^{n} · (ln(1 − b^{−(m+n)}) + b^{−(m+n)}/(1 − b^{−(m+n)})).

We show

ln(b) · b^{n} · (ln(1 − b^{−(m+n)}) + b^{−(m+n)}/(1 − b^{−(m+n)})) ≥ 0

iff

ln(1 − b^{−(m+n)}) + b^{−(m+n)}/(1 − b^{−(m+n)}) ≥ 0,

and since ln(b) · b^{n} ≥ 0 when b ≥ 1, this holds

iff

−ln(1 − b^{−(m+n)}) ≤ b^{−(m+n)}/(1 − b^{−(m+n)}).

Let u = b^{−(m+n)}, so the range of u is 0 < u < 1.

Now we show that in the range of u, the left-hand side is bounded above by the right-hand side of the inequality. The two sides,

−ln(1 − u)

and

u/(1 − u),

are both zero at u = 0. Now taking their derivatives,

d/du (−ln(1 − u)) = 1/(1 − u)

and

d/du (u/(1 − u)) = 1/(1 − u)^{2}.

Because 1/(1 − u) ≤ 1/(1 − u)^{2} in the range of u, our proof holds.
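As a floating-point sanity check (not a proof), we can spot-check the general inequality over a small grid of parameters in the stated domain:

```haskell
-- Spot-check: (1 - b^-(m+n))^(b^n) >= 1 - b^-m for b >= 1 and m, n >= 0.
holds :: Double -> Double -> Double -> Bool
holds b m n = (1 - b ** (-(m + n))) ** (b ** n) >= 1 - b ** (-m)

main :: IO ()
main =
  -- Modest parameter values, so floating-point error stays far below
  -- the mathematical margin of the inequality.
  print (and [ holds b m n | b <- [1.5, 2, 10]
                           , m <- [0, 1, 2, 3]
                           , n <- [0.5, 1, 2, 3] ])  -- True
```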

The purpose of this post was to clarify the folklore of ultra-reliable systems. The subsystem reliability inequality presented allows for easy generalization to other reliable systems.

Thanks again for the help, Joe!

## Writer’s Unblock

September 30, 2009

I’ve recently got a few technical papers out the door involving Haskell, physical-layer protocols, SMT, security modeling, and run-time verification of embedded systems (phew!). One of the benefits of industrial research is getting your hands involved in a lot of different research projects.

- This paper is about using Haskell to model physical-layer protocols and using QuickCheck to test them. Physical-layer protocols are used to transmit bits from one clock-domain to another and are used in ethernet, credit card swipers, CD players, and so on. The gist of the paper is that even though Haskell is pure & lazy, it works great for modeling and testing real-time protocols and even for computing reliability statistics. I presented it at the Haskell Symposium in September ’09, which was associated with ICFP. (The talk video is online!) The paper is a short experience report—indeed, it is the only experience report that was accepted at the symposium. The Haskell Symposium was an entertaining and friendly environment for presenting.

- This paper actually precedes the Haskell paper, but it extends the results by describing how to formally verify physical-layer protocols using SMT solvers and *k*-induction (we use SRI’s SAL tool in this work). The paper is a journal article accepted at Formal Aspects of Computing. You’ll find at least two things interesting about this article: (1) for all the excitement about SMT, there don’t seem to be a lot of great examples demonstrating its efficacy—the problems described in this paper were (laboriously!) verified using theorem provers by others previously, and our approach using SMT is much more automated; (2) we provide a nice general model of cross clock-domain circuits and particularly metastability. So if you can verify physical-layer protocols, why model them in Haskell and QuickCheck them (as we did above)? There are at least two reasons. First, if you’re using SMT, then your timing constraints need to be linear inequalities to be decidable. For systems with nonlinear constraints, QuickCheck might be your only recourse. Second, QuickCheck gives you *concrete* counterexamples and test cases that you can use to test implementations (SMT solvers often return symbolic counterexamples).

- This paper describes a simple model for analyzing information flow in a system (where a “system” could be a program, a network, an OS, etc.). The main portion of the paper describes heuristics based on graph algorithms for deciding what sort of information flow policies you might want to enforce in your system. In general, there’s been a lot of work on analyzing access control policies but not so much work in figuring out *what* kind of policy you should have in the first place (if you know of such work, please tell me!). The paper isn’t deep, and it’s also preliminary insofar as I don’t describe building a complex system using the techniques. Still, there’s a small (Haskell) script available that implements the algorithms described; I’d love to see these analyses find their way into a tool to help system designers build secure systems.

- Finally, this report describes the field of run-time monitoring (or run-time verification) as it applies to safety-critical real-time embedded software. Run-time monitoring complements formal verification: when a system is too complicated to verify *a priori*, it can be monitored at run-time to ensure it conforms to its specification. Not a lot of work has been done on monitoring software that’s hard real-time, distributed, or fault-tolerant—which, ironically, is the software that could benefit the most from run-time monitoring. The report should serve as a nice, gentle introduction, and it should be published soon as a NASA Contractor Report—the work was done under a NASA-sponsored project for which I’m the PI.

Don’t hesitate to give me feedback on any of these papers. Ok, time to fill up the queue again…

## Finding Boole

August 10, 2009

Here’s a simple challenge-problem for model-checking Boolean functions: Suppose you want to compute some Boolean function `f(bs)`, where `bs` represents 0 or more Boolean arguments.

Let `f0`, `f1`, …, `fn` range over 2-ary Boolean functions (of type B × B → B), and suppose that `f` is a fixed composition of `f0`, `f1`, …, `fn`. (By the way, I’m going to talk about functions, but you can think of these as combinatorial circuits, if you prefer.)

Our question is, “Do there exist instantiations of `f0`, `f1`, …, `fn` such that, for all inputs, `f` computes the desired function?”

What is interesting to me is that our question is quantified and of the form, “exists a forall b …”, and it is “higher-order” insofar as we want to find whether there exist satisfying functions. That said, the property is easy to encode as a model-checking problem. Here, I’ll encode it into SRI’s Symbolic Analysis Laboratory (SAL) using its BDD engine. (The SAL file in its entirety is here.)

To code the problem in SAL, we’ll first define for convenience a shorthand for the built-in type, `BOOLEAN`:

    B: TYPE = BOOLEAN;

And we’ll define an enumerated data type representing the 16 possible 2-ary Boolean functions:

    B2ARY: TYPE = {False, Nor, NorNot, NotA, AndNot, NotB, Xor, Nand,
                   And, Eqv, B, NandNot, A, OrNot, Or, True};

Now we need an application function that takes an element `f` from `B2ARY` and two Boolean arguments, and depending on `f`, applies the appropriate 2-ary Boolean function:

    app(f: B2ARY, a: B, b: B): B =
      IF    f = False   THEN FALSE
      ELSIF f = Nor     THEN NOT (a OR b)
      ELSIF f = NorNot  THEN NOT a AND b
      ELSIF f = NotA    THEN NOT a
      ELSIF f = AndNot  THEN a AND NOT b
      ELSIF f = NotB    THEN NOT b
      ELSIF f = Xor     THEN a XOR b
      ELSIF f = Nand    THEN NOT (a AND b)
      ELSIF f = And     THEN a AND b
      ELSIF f = Eqv     THEN NOT (a XOR b)
      ELSIF f = B       THEN b
      ELSIF f = NandNot THEN NOT a OR b
      ELSIF f = A       THEN a
      ELSIF f = OrNot   THEN a OR NOT b
      ELSIF f = Or      THEN a OR b
      ELSE TRUE
      ENDIF;

Let’s give a concrete definition to `f` and say that it is the composition of five 2-ary Boolean functions, `f0` through `f4`. In the language of SAL:

    f(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B)
      : [[B2ARY, B2ARY, B2ARY, B2ARY, B2ARY] -> B] =
      LAMBDA (f0: B2ARY, f1: B2ARY, f2: B2ARY, f3: B2ARY, f4: B2ARY):
        app(f0, app(f1, app(f2, b0, app(f3, app(f4, b1, b2), b3)), b4), b5);

Now let’s define the `spec` function that `f` should implement:

    spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
      (b0 AND b1) OR (b2 AND b3) OR (b4 AND b5);

Now, we’ll define a module `m`; modules are SAL’s building blocks for defining state machines. However, in our case, we won’t define an actual state machine, since we’re only modeling function composition (or combinatorial circuits). The module has variables corresponding to the function inputs, the function identifiers, and a Boolean stating whether `f` is equivalent to its specification (we label the classes of variables `INPUT`, `LOCAL`, and `OUTPUT` to distinguish them, but for our purposes, the distinction doesn’t matter).

    m: MODULE =
    BEGIN
      INPUT  b0, b1, b2, b3, b4, b5 : B
      LOCAL  f0, f1, f2, f3, f4 : B2ARY
      OUTPUT equiv : B
      DEFINITION
        equiv = FORALL (b0: B, b1: B, b2: B, b3: B, b4: B, b5: B):
          spec(b0, b1, b2, b3, b4, b5) = f(b0, b1, b2, b3, b4, b5)(f0, f1, f2, f3, f4);
    END;

Notice we’ve universally quantified the free variables in `spec` and the definition of `f`.

Finally, all we have to do is state the following theorem:

    instance : THEOREM m |- NOT equiv;

This theorem asks whether `equiv` is always false in module `m`. Issuing

    $ sal-smc FindingBoole.sal instance

asks SAL’s BDD-based model-checker to solve theorem `instance`. In a couple of seconds, SAL says the theorem is proved. So `spec` can’t be implemented by `f`, for any instantiation of `f0` through `f4`! OK, what about

    spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B = TRUE;

Issuing

    $ sal-smc FindingBoole.sal instance

we get a counterexample this time:

    f0 = True
    f1 = NandNot
    f2 = NorNot
    f3 = And
    f4 = Xor

which is an assignment to the function symbols. Obviously, to compute the constant `TRUE`, only the outermost function, `f0`, matters, and as we see, it is assigned the constant function `True`.

By the way, the purpose of defining the enumerated type `B2ARY` should be clear now—if we hadn’t, SAL would just return a mess in which the value of each function `f0` through `f4` is enumerated:

    f0(false, false) = true
    f0(true, false)  = true
    f0(false, true)  = true
    f0(true, true)   = true
    f1(false, false) = true
    f1(true, false)  = true
    f1(false, true)  = false
    f1(true, true)   = true
    f2(false, false) = false
    f2(true, false)  = true
    f2(false, true)  = false
    f2(true, true)   = false
    f3(false, false) = false
    f3(true, false)  = false
    f3(false, true)  = false
    f3(true, true)   = true
    f4(false, false) = false
    f4(true, false)  = true
    f4(false, true)  = true
    f4(true, true)   = false

OK, let’s conclude with one more spec:

    spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
      (NOT (b0 AND ((b1 OR b2) XOR b3)) AND b4) XOR b5;

This is implementable by `f`, and SAL returns

    f0 = Eqv
    f1 = OrNot
    f2 = And
    f3 = Eqv
    f4 = Nor

Although these assignments compute the same function, they differ from those in our specification. Just to double-check, we can ask SAL if they’re equivalent:

    spec1(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
      ((b0 AND ((NOT (b1 OR b2)) <=> b3)) OR NOT b4) <=> b5;

specifies the assignments returned, and

    eq: THEOREM m |- spec(b0, b1, b2, b3, b4, b5) = spec1(b0, b1, b2, b3, b4, b5);

asks if the two specifications are equivalent. They are.
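As a final sanity check outside of SAL, we can replay the returned assignment in Haskell, mirroring the SAL `app` and `f` definitions above (this Haskell is my own; the enum constants are slightly renamed, e.g. `TrueF`, `AFun`, to avoid clashing with the Prelude):

```haskell
import Control.Monad (replicateM)

-- Mirror of the SAL B2ARY enumeration.
data B2ARY = FalseF | Nor | NorNot | NotA | AndNot | NotB | Xor | Nand
           | And | Eqv | BFun | NandNot | AFun | OrNot | Or | TrueF

-- Mirror of the SAL app function: interpret a 2-ary Boolean function name.
app :: B2ARY -> Bool -> Bool -> Bool
app FalseF  _ _ = False
app Nor     a b = not (a || b)
app NorNot  a b = not a && b
app NotA    a _ = not a
app AndNot  a b = a && not b
app NotB    _ b = not b
app Xor     a b = a /= b
app Nand    a b = not (a && b)
app And     a b = a && b
app Eqv     a b = a == b
app BFun    _ b = b
app NandNot a b = not a || b
app AFun    a _ = a
app OrNot   a b = a || not b
app Or      a b = a || b
app TrueF   _ _ = True

-- Mirror of the SAL composition f.
f :: (B2ARY, B2ARY, B2ARY, B2ARY, B2ARY) -> [Bool] -> Bool
f (f0, f1, f2, f3, f4) [b0, b1, b2, b3, b4, b5] =
  app f0 (app f1 (app f2 b0 (app f3 (app f4 b1 b2) b3)) b4) b5
f _ _ = error "expected six inputs"

-- The last spec above.
spec :: [Bool] -> Bool
spec [b0, b1, b2, b3, b4, b5] =
  (not (b0 && ((b1 || b2) /= b3)) && b4) /= b5
spec _ = error "expected six inputs"

main :: IO ()
main = do
  let inputs = replicateM 6 [False, True]  -- all 64 input vectors
      found  = (Eqv, OrNot, And, Eqv, Nor) -- the assignment SAL returned
  print (all (\bs -> f found bs == spec bs) inputs)  -- True
```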

## “Schrödinger’s Probability” for Error-Checking Codes

May 15, 2009

In a previous post, I discussed the notion of *Schrödinger CRCs*, first described by Kevin Driscoll et al. in their paper *Byzantine Fault Tolerance, from Theory to Reality*. The basic idea is that error-detecting codes do not necessarily prevent two receivers from obtaining messages that are semantically different (i.e., different data) but syntactically valid (i.e., the CRC matches the respective data words received). The upshot is that even with CRCs, you can suffer Byzantine faults, with some probability.

… So what *is* the probability of a Schrödinger’s CRC? That’s the topic of this post—which cleans up a few of the ideas I presented earlier. I published a short paper on the topic, which I presented at *Dependable Systems and Networks (DSN), 2010*, while Kevin Driscoll was in the audience! If you’d prefer to read the PDF or get the slides, they’re here. The simulation code (Haskell) is here.

## N-Version Programming… For the nth Time

April 27, 2009

Software contains faults. The question is how to cost-effectively reduce the number of faults. One approach that gained traction and then fell out of favor was *N*-version programming. The basic idea is simple: have developer teams implement a specification independently of one another. Then we can execute the programs concurrently and compare their results. If we have, say, three separate programs, we vote their results, and if one result disagrees with the others, we presume that program contained a software bug.
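The voting step itself is simple; here is a minimal majority-vote sketch in Haskell (my own illustration, not code from any actual N-version system):

```haskell
import Data.List (group, maximumBy, sort)
import Data.Ord (comparing)

-- Return Just the value produced by a strict majority of the versions,
-- or Nothing if no value has a majority (assumes a non-empty input).
vote :: Ord a => [a] -> Maybe a
vote xs
  | 2 * length winners > length xs = Just (head winners)
  | otherwise                      = Nothing
  where winners = maximumBy (comparing length) (group (sort xs))

main :: IO ()
main = do
  print (vote [42, 42, 17 :: Int])  -- Just 42: one version disagrees
  print (vote [1, 2, 3 :: Int])     -- Nothing: no majority
```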

*N*-version programming rests on the assumption that software bugs in independently-implemented programs are random, statistically-uncorrelated events. If the different versions are likely to suffer the same errors, then voting multiple versions is not effective at detecting them.

John Knight and Nancy Leveson famously debunked this assumption on which *N*-version programming rested in the “Knight-Leveson experiment” they published in 1986. In 1990, Knight and Leveson published a brief summary of the original experiment, as well as responses to subsequent criticisms made about it, in their paper, *A Reply to the Criticisms of the Knight & Leveson Experiment*.

The problem with *N*-version programming is subtle: it’s not that it provides *zero* improvement in reliability but that it provides significantly less improvement than is needed to make it cost-effective compared to other kinds of fault-tolerance (like architecture-level fault-tolerance). The problem is that even small probabilities of correlated faults lead to significant reductions in potential reliability improvements.

Lui Sha has a more recent (2001) *IEEE Software* article discussing *N*-version programming, taking into account that the software development cycle is finite: is it better to spend all your time and money on one reliable implementation or on three implementations that’ll be voted at runtime? His answer is almost always the former (even if we assume *un*correlated faults!).

But rather than *N*-versions of the same program, what about different programs compared at runtime? That’s the basic idea of runtime monitoring. In runtime monitoring, one program is the implementation and another is the specification; the implementation is checked against the specification at runtime. This is easier than checking before runtime (in which case you’d have to mathematically *prove* every possible execution satisfies the specification). As Sha points out in his article, the specification can be slow and simple. He gives the example of using the very simple Bubblesort as the runtime specification of the more complex Quicksort: if the Quicksort does its job correctly (in *O*(*n* log *n*), assuming a good pivot element), then checking its output (i.e., a hopefully properly sorted list) with Bubblesort will only take linear time (despite Bubblesort taking *O*(*n*^{2}) in general).

The simple idea of simple monitors fascinates me. Of course, Bubblesort is not a full specification, though. Although Sha doesn’t suggest it, we’d probably like our monitor to compare the lengths of the input and output lists to ensure that the Quicksort implementation didn’t remove elements. And there’s still the possibility that the Quicksort implementation modifies elements, which is also unchecked by a Bubblesort monitor.

But instead of just checking the output, we could sort the same input with both Quicksort and Bubblesort and compare the results. This is a “stronger” check insofar as the two sorts would have to have exactly the same faults (e.g., not sorting, removing elements, changing elements) for an error not to be caught. The principal drawback is the latency of the slower Bubblesort check as compared to Quicksort. But sometimes, it may be OK to signal an error (shortly) after a result is provided.
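Here’s a small sketch of that comparison-style monitor in Haskell, with Quicksort as the implementation and Bubblesort as the simple “specification” (the function names are mine, for illustration):

```haskell
-- Implementation under test: a simple Quicksort.
qsort :: Ord a => [a] -> [a]
qsort []     = []
qsort (p:xs) = qsort [x | x <- xs, x < p] ++ [p] ++ qsort [x | x <- xs, x >= p]

-- Simple "specification": Bubblesort, slow but easy to get right.
bsort :: Ord a => [a] -> [a]
bsort xs = if swapped then bsort xs' else xs'
  where
    (xs', swapped) = pass xs
    pass (a:b:rest)
      | a > b     = let (r, _) = pass (a:rest) in (b:r, True)
      | otherwise = let (r, s) = pass (b:rest) in (a:r, s)
    pass ys = (ys, False)

-- Runtime monitor: run both and compare the full results.
monitoredSort :: (Ord a, Show a) => [a] -> Either String [a]
monitoredSort xs
  | out == bsort xs = Right out
  | otherwise       = Left "monitor: sort results disagree"
  where out = qsort xs

main :: IO ()
main = print (monitoredSort [3, 1, 2, 3 :: Int])  -- Right [1,2,3,3]
```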

Just like for *N*-version programming, we would like the faults in our monitor to be statistically uncorrelated with those in the monitored software. I am left wondering about the following questions:

- Is there research comparing the kinds of programming errors made in radically different paradigms, such as Haskell and C? Are there any faults we can claim are statistically uncorrelated?
- Runtime monitoring itself is predicated on the belief that the implementations of different *programs* will fail in statistically independent ways, just like *N*-version programming is. While more plausible, does this assumption hold?

## Programming Languages for Unpiloted Air Vehicles

April 20, 2009

I recently presented a position paper at a workshop addressing software challenges for unpiloted air vehicles (UAVs). The paper is coauthored with Don Stewart and John Van Enk. From a software perspective, UAVs (and aircraft, in general) are fascinating. Modern aircraft are essentially computers that fly, with a pilot for backup. UAVs are computers that fly, without the backup.

Some day in the not-so-distant future, we may have UPS and FedEx cargo planes that are completely autonomous (it’ll be a while before people are comfortable with a pilot-less airplane). These planes will be in the commercial airspace and landing at commercial airports. Ultimately, UAVs must be *transparent*: an observer should not be able to discern whether an airplane is human or computer controlled by its behavior.

You won’t be surprised to know a lot of software is required to make this happen. To put things in perspective, Boeing’s 777 is said to contain over 2 million lines of newly-developed code; the Joint Strike Fighter aircraft is said to have over 5 million lines. Next-generation UAVs, with pilot AI, UAV-to-UAV and UAV-to-ground communications, and arbitrary other functionality, will have millions more lines of code. And the software must be correct.

In our paper, we argue that the only way to get a handle on the complexity of UAV software is through the use of domain-specific languages (DSLs). A good DSL relieves the application programmer from carrying out boilerplate activities and provides her with specific tools for her domain. We go on to advocate the need for *lightweight* DSLs (LwDSLs), also known as *embedded* DSLs. A LwDSL is one that is hosted in a mature, high-level language (like Haskell); it can be thought of as domain-specific libraries and domain-specific syntax. The big benefit of a LwDSL is that a new compiler and tools don’t need to be reinvented. Indeed, as we report in the paper, companies realizing the power of LwDSLs are quietly gaining a competitive advantage.

Safety-critical systems, like those on UAVs, combine multiple software subsystems. If each subsystem is programmed in its own LwDSL hosted in the same language, then it is easy to compose testing and validation across subsystem boundaries. (Only testing within each design-domain just won’t fly, pun intended.)

The “LwDSL approach” won’t magically make the problems of verifying life-critical software disappear, but “raising the abstraction level” must be our mantra moving forward.

## Byzantine Cyclic Redundancy Checks (CRC)

April 18, 2009

I’m working on a NASA-sponsored project to monitor safety-critical embedded systems at runtime, and that’s started me thinking about cyclic redundancy checks (CRCs) again. Error-checking codes are fundamental in fault-tolerant systems. The basic idea is simple: a transmitter wants to send a data word, so it computes a CRC over the word. It sends both the data word and the CRC to the receiver, which computes the same CRC over the received word. If its computed CRC and the received one differ, then there was a transmission error (there are simplifications to this approach, but that’s the basic idea).

CRCs have been around since the 60s, and despite the simple mathematical theory on which they’re built (polynomial division in the Galois Field 2, containing two elements, “0” and “1”), I was surprised to see that even today, their fault-tolerance properties are in some cases unknown or misunderstood. Phil Koopman at CMU has written a few nice papers over the past couple of years explaining some common misconceptions and analyzing commonly-used CRCs.
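To make the polynomial-division idea concrete, here is a toy shift-register implementation of CRC generation over bit lists. This is a generic mod-2 division sketch of my own; a production USB-5 implementation has bit-ordering conventions I’m glossing over:

```haskell
import Data.Bits (xor)

-- Compute the r-bit CRC of a message: the remainder of msg * x^r
-- divided by the generator polynomial, working in GF(2).
-- The generator is given most-significant bit first, including the
-- leading term; e.g., x^5 + x^2 + 1 is [True,False,False,True,False,True].
crc :: [Bool] -> [Bool] -> [Bool]
crc gen msg = foldl shiftIn (replicate r False) (msg ++ replicate r False)
  where
    r    = length gen - 1
    taps = tail gen
    shiftIn reg b =
      let reg' = tail reg ++ [b]
      in if head reg then zipWith xor reg' taps else reg'

main :: IO ()
main = do
  let gen = [True, False, False, True, False, True]  -- x^5 + x^2 + 1
      msg = [True, False, True, True, False, True, True, False, False, True, True]
  -- Appending the computed CRC to the message always yields a zero remainder:
  print (crc gen (msg ++ crc gen msg) == replicate 5 False)  -- True
```

The printed property is exactly the receiver’s check: dividing the received word (message plus CRC) by the generator and getting a zero remainder.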

Particularly, there seems to be an over-confidence in their ability to detect errors. One fascinating result is the so-called “Schrödinger’s CRC,” so-dubbed in a paper entitled *Byzantine Fault Tolerance, from Theory to Reality*, by Kevin Driscoll et al. A Schrödinger’s CRC occurs when a transmitter broadcasts a data word and associated CRC to two receivers, and a data word is corrupted in transit along with its corresponding CRC, such that the faulty word and faulty CRC match! How does this happen? Let’s look at a concrete example:

|             | 11-Bit Message          | USB-5 CRC |
|-------------|-------------------------|-----------|
| Receiver A  | 1 0 1 1 0 1 1 0 0 1 1   | 0 1 0 0 1 |
| Transmitter | 1 ½ 1 1 0 1 1 ½ 0 1 1   | ½ 1 0 0 1 |
| Receiver B  | 1 1 1 1 0 1 1 1 0 1 1   | 1 1 0 0 1 |

We illustrate a transmitter broadcasting an 11-bit message to two receivers, A and B. We use the USB-5 CRC, generally used to check USB token packets (by the way, for 11-bit messages, USB-5 has a Hamming distance of three, meaning the CRC will catch any corruption of fewer than three bits in the combined 11-bit message and CRC). Now, suppose the transmitter has suffered some fault such as a “stuck-at-1/2” fault, so that periodically, the transmitter fails to drive the signal on the bus sufficiently high or low. A receiver may interpret an intermediate signal as either a 0 or 1. In the figure, we show the transmitter sending three stuck-at-1/2 signals: two in the 11-bit message and one in the CRC. The upshot is an example in which a CRC does *not* prevent a Byzantine fault—the two receivers obtain different messages, each of which passes its CRC check.

One question is how likely this scenario is. Paulitsch et al. write that “The probability of a Schrödinger’s CRC is hard to evaluate. A worst-case estimate of its occurrence due to a single device is the device failure rate.” It’d be interesting to know if there’s any data on this probability.