A Critical Systems Blog

Thoughts on safe and secure embedded computer systems

Writer’s Unblock

I’ve recently got a few technical papers out the door involving Haskell, physical-layer protocols, SMT, security modeling, and run-time verification of embedded systems (phew!).  One of the benefits of industrial research is getting your hands involved in a lot of different research projects.

  • This paper is about using Haskell to model physical-layer protocols and using QuickCheck to test them.  Physical-layer protocols are used to transmit bits from one clock-domain to another and are used in ethernet, credit card swipers, CD players, and so on.  The gist of the paper is that even though Haskell is pure & lazy, it works great for modeling and testing real-time protocols and even for computing reliability statistics.  I presented it at the Haskell Symposium in September ‘09, which was associated with ICFP.  (The talk video is online!)  The paper is a short experience report—indeed, it is the only experience report that was accepted at the symposium.  The Haskell Symposium was an entertaining and friendly environment for presenting.
  • This paper actually precedes the Haskell paper, but it extends the results by describing how to formally verify physical-layer protocols using SMT solvers and k-induction (we use SRI’s SAL tool in this work).  The paper is a journal article accepted at Formal Aspects of Computing.  You’ll find at least two things interesting about this article: (1) For all the excitement about SMT, there don’t seem to be a lot of great examples demonstrating its efficacy—the problems described in this paper were (laboriously!) verified using theorem-provers by others previously, and our approach using SMT is much more automated.  (2) We provide a nice general model of cross clock-domain circuits and particularly metastability.

    So if you can verify physical-layer protocols, why model them in Haskell and QuickCheck them (as we did above)?  There are at least two reasons.  First, if you’re using SMT, then your timing constraints need to be linear inequalities to be decidable.  For systems that with nonlinear constraints, QuickCheck might be your only recourse.  Second, QuickCheck gives you concrete counterexamples and test-cases that you can use to test implementations (SMT solvers often return symbolic counterexamples).

  • This paper describes a simple model for analyzing information flow in a system (where a “system” could be a program, a network, an OS, etc.).  The main portion of the paper describes heuristics based on graph algorithms for deciding what sort of information flow policies you might want to enforce in your system.  In general, there’s been a lot of work on analyzing access control policies but not so much work in figuring out what kind of policy you should have in the first place (if you know of such work, please tell me!).  The paper isn’t deep, and it’s also preliminary insofar as I don’t describe building a complex system using the techniques.  Still, there’s a small (Haskell) script available that implements the algorithms described; I’d love to see these analyses find their way into a tool to help system designers build secure systems.
  • Finally, this report describes the field of run-time monitoring (or run-time verification) as it applies to safety-critical real-time embedded software.  Run-time monitoring compliments formal verification since when a system is too complicated to verify a priori, it can be monitored at run-time to ensure it conforms to its specification.  Not a lot of work has been done on monitoring software that’s hard real-time, distributed, or fault-tolerant—which ironically could benefit the most from run-time monitoring.  The report should serve as a nice, gentle introduction.  The report should be published soon as a NASA Contractor Report—the work was done under a NASA-sponsored project for which I’m the PI.

Don’t hesitate to give me feedback on any of these papers.  Ok, time to fill up the queue again…

Filed under: Software, Verification

Finding Boole

Here’s a simple challenge-problem for model-checking Boolean functions: Suppose you want to compute some Boolean function spec :: B^k \rightarrow B, where B^k represents 0 or more Boolean arguments.

Let f_0, f_1, \ldots, f_j range over 2-ary Boolean functions, (of type (Bool, Bool) \rightarrow Bool), and suppose that f is a fixed composition of f_0, f_1, \ldots, f_j. (By the way, I’m going to talk about functions, but you can think of these as combinatorial circuits, if you prefer.)

Our question is, “Do there exist instantiations of f_0, f_1, \ldots, f_j such that for all inputs, f = spec?

What is interesting to me is that our question is quantified and of the form, “exists a forall b …”, and it is “higher-order” insofar as we want to find whether there exist satisfying functions. That said, the property is easy to encode as a model-checking problem. Here, I’ll encode it into SRI’s Symbolic Analysis Laboratory (SAL) using its BDD engine.  (The SAL file in its entirety is here.)

To code the problem in SAL, we’ll first define for convenience a shorthand for the built-in type, BOOLEAN:

B: TYPE = BOOLEAN;

And we’ll define an enumerated data type representing the 16 possible 2-ary Boolean functions:

B2ARY: TYPE = { False, Nor, NorNot, NotA, AndNot, NotB, Xor, Nand
              , And, Eqv, B, NandNot, A, OrNot, Or, True};

Now we need an application function that takes an element f from B2ARY and two Boolean arguments, and depending on f, applies the appropriate 2-ary Boolean function:

app(f: B2ARY, a: B, b: B): B =
  IF    f = False   THEN FALSE
  ELSIF f = Nor     THEN NOT (a OR b)
  ELSIF f = NorNot  THEN NOT a AND b
  ELSIF f = NotA    THEN NOT a
  ELSIF f = AndNot  THEN a AND NOT b
  ELSIF f = NotB    THEN NOT b
  ELSIF f = Xor     THEN a XOR b
  ELSIF f = Nand    THEN NOT (a AND b)
  ELSIF f = And     THEN a AND b
  ELSIF f = Eqv     THEN NOT (a XOR b)
  ELSIF f = B       THEN b
  ELSIF f = NandNot THEN NOT a OR b
  ELSIF f = A       THEN a
  ELSIF f = OrNot   THEN a OR NOT b
  ELSIF f = Or      THEN a OR b
  ELSE                   TRUE
  ENDIF;

Let’s give a concrete definition to f and say that it is the composition of five 2-ary Boolean functions, f0 through f4. In the language of SAL:

f(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B):
  [[B2ARY, B2ARY, B2ARY, B2ARY, B2ARY] -> B] =
    LAMBDA (f0: B2ARY, f1: B2ARY, f2: B2ARY, f3: B2ARY, f4: B2ARY):
      app(f0, app(f1, app(f2, b0,
                              app(f3, app(f4, b1, b2),
                                      b3)),
                      b4),
              b5);

Now let’s define the spec function that f should implement:

spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
  (b0 AND b1) OR (b2 AND b3) OR (b4 AND b5);

Now, we’ll define a module m; modules are SAL’s building blocks for defining state machines. However, in our case, we won’t define an actual state machine since we’re only modeling function composition (or combinatorial circuits). The module has variables corresponding the function inputs, function identifiers, and a Boolean stating whether f is equivalent to its specification (we’ll label the classes of variables INPUT, LOCAL, and OUTPUT, to distinguish them, but for our purposes, the distinction doesn’t matter).

m: MODULE =
BEGIN
  INPUT b0, b1, b2, b3, b4, b5 : B
  LOCAL f0, f1, f2, f3, f4 : B2ARY
  OUTPUT equiv : B

  DEFINITION
    equiv = FORALL (b0: B, b1: B, b2: B, b3: B, b4: B, b5: B):
              spec(b0, b1, b2, b3, b4, b5)
            = f(b0, b1, b2, b3, b4, b5)(f0, f1, f2, f3, f4);
END;

Notice we’ve universally quantified the free variables in spec and the definition of f.

Finally, all we have to do is state the following theorem:

instance : THEOREM m |- NOT equiv;

Asking whether equiv is false in module m. Issuing

$ sal-smc FindingBoole.sal instance

asks SAL’s BDD-based model-checker to solve theorem instance. In a couple of seconds, SAL says the theorem is proved. So spec can’t be implemented by f, for any instantiation of f0 through f4! OK, what about

spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
  TRUE;

Issuing

$ sal-smc FindingBoole.sal instance

we get a counterexample this time:

f0 = True
f1 = NandNot
f2 = NorNot
f3 = And
f4 = Xor

which is an assignment to the function symbols. Obviously, to compute the constant TRUE, only the outermost function, f0, matters, and as we see, it is defined to be TRUE.

By the way, the purpose of defining the enumerated type B2ARY should be clear now—if we hadn’t, SAL would just return a mess in which the value of each function f0 through f4 is enumerated:

f0(false, false) = true
f0(true, false) = true
f0(false, true) = true
f0(true, true) = true
f1(false, false) = true
f1(true, false) = true
f1(false, true) = false
f1(true, true) = true
f2(false, false) = false
f2(true, false) = true
f2(false, true) = false
f2(true, true) = false
f3(false, false) = false
f3(true, false) = false
f3(false, true) = false
f3(true, true) = true
f4(false, false) = false
f4(true, false) = true
f4(false, true) = true
f4(true, true) = false

OK, let’s conclude with one more spec:

spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
  (NOT (b0 AND ((b1 OR b2) XOR b3)) AND b4) XOR b5;

This is implementable by f, and SAL returns

f0 = Eqv
f1 = OrNot
f2 = And
f3 = Eqv
f4 = Nor

Although these assignments compute the same function, they differ from those in our specification. Just to double-check, we can ask SAL if they’re equivalent:

spec1(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
  ((b0 AND ((NOT (b1 OR b2))  b3)) OR NOT b4)  b5;

specifies the assignments returned, and

eq: THEOREM m |- spec(b0, b1, b2, b3, b4, b5) =  spec1(b0, b1, b2, b3, b4, b5);

asks if the two specifications are equivalent. They are.

Filed under: Hardware, Verification , ,

“Schrodinger’s Probability” for Error-Checking Codes

In a previous post, I discussed the notion of Schrödinger CRCs, first described by Kevin Driscoll et al. in their paper Byzantine Fault Tolerance, from Theory to Reality. The basic idea is that error-detecting codes do not necessarily prevent two receivers from obtaining messages that are semantically different (i.e., different data) but syntactically valid (i.e., the CRC matches the respective data words received). The upshot is that even with CRCs, you can suffer Byzantine faults, with some probability.

… So what is that probability of a Schrödinger’s CRC? That’s the topic of this post.  I’m trying out these ideas for the first time, so feedback is appreciated.

Background

To begin, recall the idea of a Hamming Weight (HW).  The HW is a function on the set of data words of a fixed length and a fixed number of bit-corruptions that may occur (in either the data word or its associated CRC code).  It returns the number of corruptions that the CRC fails to detect.  For example, for the parity bit on data words of length three, the following is one instance in the HW count for 2-bit corruptions:

           data word  bit
sent       0 1 1      0
corrupted  1 0 1      0

The Hamming Distance (HD) for a fixed data word length and CRC is the smallest number of bit-corruptions resulting in a non-zero Hamming Weight (for the parity bit, it’s two).  Koopan and Chakravarty document the HDs of a number of CRCs.

While the concept of a HD might be appropriate for measuring the resilience of a CRC to random bit errors, it may not be a good metric for calculating the probability of Schrödinger’s CRCs.  Consider the following quotation from Paulitsch et al. in their paper, Coverage and the Use of Cyclic Redundancy Codes in Ultra-Dependable Systems:

…Network inter-stages can exhibit arbitrary faults, accidentally forging valid CRC check sequences. These faults can dominate system dependability issues, resulting in undetected failures at the 10^{-6} component failure rate.

The point here is that hardware that manipulates transmitted data (i.e., the inter-stages) may suffer random failures that cause correlated faults in the transmitted data.  For example, a bus driver may be “stuck at 1/2″ so that it delivers an intermediate voltage that will be interpreted by some receives as a “1″ and by others as a “0″.  Or an inter-stage may have a bad oscillator, causing it suffer a slightly-out-of-spec timing fault, so that different receivers interpret the incoming bits differently.

In data corruption caused by these kinds of hardware failures, we must change our assumptions:

  1. Data transmission faults are random but correlated.  By this I mean that we might presume that all bit errors are “in the same direction”: either 0s become 1s or 1s become 0s.  For example, with a stuck-at-1/2 fault in which a 1 is transmitted at a slightly too-high voltage (assuming 1s are denoted by a low signal), a receiver might interpret the signal arbitrarily, but consistently.
  2. Bit-error rates are much higher.  Paulitsch et al. note that typical assumed bit-error rates range from 10^{-6} to 10^{-13}, but actual rates vary wildly.  However, if there is a hardware failure (e.g., a stuck-at-1/2 fault), a dramatically higher BER (for correlated bit-errors) might be induced.

We define Schrödinger’s Hamming Distance (SHD) for a error-checking code as follows. Suppose in a data word and its associated error-checking code all bit-errors are exclusively of one of the following type:

  1. 0s may randomly be flipped to 1s.
  2. 1s may randomly be flipped to 0s.

Then the Schrödinger Hamming Weight (SHW) of the code for a fixed data word width is the number of undetected errors from (1) or (2).  We’ll refer to a bit corruption that contributes to a non-zero SHW as a Schrödinger CRC. The Schrödinger Hamming Distance is the least number of errors required for a non-zero SHW.

The SHWs for a CRC and data word size are no greater than the respective HWs.  For example, the parity-bit example above would be included in the calculation of the Hamming Weight but would be excluded in the calculation of Schrödinger Hamming Weight (becuase we flipped a 0 to a 1 and a 1 to a zero).  However, the following would be included in both, since we only flip 0s to 1s:

           data word  bit
sent       1 0 1      0
corrupted  1 1 1      1

Clearly, SHWs can never be greater than HWs.  However, I do not know whether for some CRC and data word length, the Schrödinger Hamming Distance is ever strictly less than its Hamming Distance. (My suspicion is not.)

To calculate the probability of a Schrödinger CRC involves a small analytical analysis together with a some probabilistic simulation.  We’ll describe the simulation first.

Simulations

For a fixed CRC and data word length, we’ll randomly generate data words and then compute their CRCs, within some constraints.  We do not simulate arbitrary bit-flips, since we want our simulation to be as productive as possible—we do not want to have to simulate 10s of millions of times to get reliable statisitical data.

So without loss of generality, we’ll simulate some number of 0s being flipped to 1s (we could just have easily simulated 1s being flipped to 0s) in the data word together with its CRC.  Of course, we can flip no more 0s than appear in the data word and CRC.  Furthermore, it does no good to simulate less than HD bit-errors, since the CRC is guaranteed to protect against any bit-errors up to (but not including) the HD.  So we’ll simulate at least HD bit-errors.  If there are less than HD 0s in the data word and its CRC, then it is counted as a  test run in which the CRC successfully catches errors.

To carry out the simulation, I wrote a little Haskell program.  (By the way, this is not the most beautiful or most efficient code possible, but I can do a million simulations in minutes on a commodity laptop.  Good enough for my purposes.)  The simulation uses QuickCheck.  To execute it, you will need to add a small patch to QuickCheck—see the patch and installation instructions here.

So this gives us some failure rate.  For example, for the USB-5 CRC (x^5 + x^2 + 1) on 11-bit data words (with a  HD of 3), after 10,000 simulations, we get somewhere around 3% of the generated runs producing bit-errors that result in the CRC failing to capture the fault.  For something like the CAN CRC (x^{15} + x^{14} + x^{10} + x^8 + x^7 + x^4 + x^3 + 1) on 64-bit data words (with a HD of 6), after a million simulations, we get a failure rate of around 3.9 \times 10^{-5}.

Calculations

Now for the analytical part.  For an arbitrary data word and its corresponding CRC, we need to calculate the probability of the simulation runs occurring.

We do this as follows.  First, on average, if a data word together with its CRC comprises n bits, we’ll assume that roughly half are 1s and half are 0s.  Thus, the maximum number of 0s that could be flipped is approximately half the total number of bits in the data word and its CRC.

If hd is the Hamming Distance, then we want to sum the probabilities of suffering exactly hd, hd+1, hd+2, …, \frac{n}{2} bit-errors.  The total number of possible bit-errors for each number of corrupted bits is determined with the choose function:

{n \choose i} = \dfrac{n!}{i!(n-i)!}

So, for example, the total number of coin tosses in which exactly two coins show heads out of three tosses is

{3 \choose 2} = \dfrac{3!}{2!(3-2)!} = 3

Now, if the probability of a bit-error is p, then the probability of i bit-errors is

p^i (1-p)^{n-i}

Thus, the probability of any of the bit-errors introduced in our simulation occurring (regardless of whether they led to an undetected corruption), is

\displaystyle\sum^{n/2}_{i=hd}{\frac{n}{2} \choose i} \times p^i (1-p)^{\frac{n}{2} - i}

We multiply this summation by the percentage of undetected corruptions determined from the simulation.  So for USB-5 mentioned above on 11-bit data words, recall that our simulation showed a ~3% of the simulations resulting in a uncaught error.  If the probability p of a single bit corruption is 0.01 (assuming the hardware is faulty), then

\displaystyle\sum^{\frac{16}{2}}_{i=3}{\frac{16}{2} \choose i} \times 0.01^i (1-0.01)^{\frac{16}{2} - i} = 5.39 \times 10^{-5}

5.39 \times 10^{-5} multiplied by 3% gives us 1.617 \times 10^{-6}.  For a single message, assuming a fault-arrival rate for bit corruptions, that’s the probability of a Schrödinger CRC.

Combining the Results

But that number by itself is an incomplete portrayal of the arrival rate of Schrödinger CRCs in a system.  We must also take into account the following to get a system-wide Schrödinger CRC arrival-rate per hour of operation:

  1. The number of inter-stages.
  2. The failure rate of the hardware in the inter-stages (particularly for failures that can lead to slightly-out-of-spec faults, so this includes transient faults).
  3. The number of receivers.
  4. The number of messages per hour.

Specifically, from the hardware failure rate and the number of inter-stages, we’ll calculate the probability of hardware failure, and we’ve already computed the probability of a Schrödinger CRC given a hardware failure.  Then we’ll do a final calculation to get the probability per hour.

For a concrete example, let’s assume there are 10 inter-stages messages pass through, the failure rate of the hardware is 10^{-3} per hour (again, we’re including transient faults, so we’re using a high rate), and there are 20 receivers in the broadcast.  We’ll assume a fairly slow data-rate, of 1 kilobit/second.

Let’s find the probability of a faulty inter-stage first.  If there are 10 inter-stages, and we assume the probability of hardware failures is independent, then the probability of at least one of them being faulty is 1 - ((10^{-3})^0 \times (1 - 10^{-3})^{10}) = 9.96 \times 10^{-3}.

So now we have the probability of a uncaught error per message, assuming faulty hardware (1.617 \times 10^{-6}) and the probability of faulty hardware (9.96 \times 10^{-3}).  Multiplying these, we get that the probability of a Schrödinger CRC per message is 1.61 \times 10^{-8}.  If the throughput is 1kb/s, then that’s 225,000 16-bit messages/hour.  So the probability of a Schrödinger CRC per hour of operation is 1 - ((1.61 \times 10^{-8})^0 \times (1 - 1.61 \times 10^{-8})^{225,000}) = 3.61 \times 10^{-3}.  That’s a high rate, which is not acceptable for an ultra-reliable system that depends on CRCs to protect against Byzantine faults.  Of course, the constants we used in our calculations are system-specific, but I hope this suffices to demonstrate that Schrödinger CRCs should be of concern in reliable system design.

For fun, let’s run the numbers on the CAN CRC using the same constants as above.  Our simulations for CAN on 64-bit data words gives us a probability of a Schrödinger CRC, assuming faulty hardware, of 3.9 \times 10^{-5} \times 7.248 \times 10^{-7} = 2.83 \times 10^{-11}.  This gives a probability of a Schrödinger CRC at 2.82 \times 10^{-13} per message or 1 - (2.82 \times 10^{-13})^56250 = 1.59 \times 10^{-8}, potentially still too probable for ultra-dependable systems with required failure rates of 10^{-9} or lower.

Conclusions

What are the lessons to draw from this?

  1. We’ve described new concepts of Schrödinger Hamming Weights and Schrödinger Hamming Distance
    as another metric for the quality of error-checking codes.
  2. We’ve discussed how one might use a combination of simulation and probability calculations to discover the probability of Schrödinger CRCs in a system (assuming I got all my calculations correct!).
  3. Depending on the architecture and requirements, these calculations could be as important in determining system reliability as simply using the Hamming Distance of a CRC.

Open questions include

  1. How Schrödinger Hamming Weights compare with Hamming Weights and similarly for Schrödinger Hamming Distances and Hamming Distances.  In particular, is the Schrödinger Hamming Distance ever strictly greater than the Hamming Distance?
  2. Are any CRCs well-suited to protect against Schrödinger CRCs?
  3. Under what constraints should Schrödinger HWs be the primary reliability measure in system design?  Should error-detecting codes ever be used to protect against Byzatine behavior?

Filed under: Fault Tolerance, Software , , , ,

An Atomic Fibonacci Server: Exploring the Atom (Haskell) DSL

This post is consistent with Atom 0.0.1 and not the latest version, Atom 0.0.5 (the author went off and implemented changes I and others suggested :) ).  I’ll update the post… soon.

Tom Hawkins has open-sourced Atom, a domain-specific language (DSL) for writing embedded real-time software. Atom is actually an “embedded DSL” (I prefer the term “lightweight DSL”) in the functional language Haskell. It’s a lightweight DSL (LwDSL) because you write legal Haskell and let the Haskell compiler do all the heavy lifting. The DSL is a set of special functions and data types and a “compile function” that generates embedded (i.e., no dynamic memory) C code.  You don’t have to write your own compiler from scratch.

John Van Enk has already posted a couple of blog entries on using Atom; first on adding slightly to the LwDSL (one major advantage of a LwDSL is that it’s easy to extend the language—you don’t have to re-engineer a standalone compiler) and then on using Atom to blink some LEDs on the Arduino.  Keep checking his blog for more updates!

Here, I write a little device and driver program in Atom: the driver sends an index i, and the device returns the ith Fibonacci number. The little bit of challenge in doing this is that the device and driver may run at different rates, so their communication is asynchronous.  How does this work in a language like Atom?

Writing in the Atom DSL

Let’s think about the Fibonacci device (we’ll call it fibDev) first.  The device fibDev will do three things:

  1. Wait for a new index i from the driver.
  2. Produce a result, fib(i).
  3. Give the result to the driver.

Let’s think about step (2) first.  Think for a second how we’d write this (efficiently) in Haskell:

fib :: Int -> Int
fib n = fst $ fibHlp n
    where fibHlp n =
              case n of
                0 -> (1, 1)
                _ -> let (a,b) = fibHlp (n-1)
                     in (b,a+b)

The Atom implementation will use the same algorithm, but it’ll look different.  Atom is a synchronous language, so you specify rules that fire on clock ticks.  Here’s what the core of the algorithm looks like in Atom (I haven’t shown the variable declarations, but look you can look at the full source):

atom "computeFib" $ do
  cond $ value runFib
  cond $ value i >. 0
  decr i
  snd <== (value fst) + (value snd)
  fst <== value snd

Atom is written in a monadic style.  Here, we have two conditions, both of which must be true for the rule to “fire”.  The first condition is that runFib is true (telling the device it’s in its computation step), and the second condition is that the index is greater than 0 (we stop computing at zero).  If the conditions are true, then the value of i is decremented, and we update the values of the fst and snd variables, corresponding the first and second elements, respectively, of the pair in the Haskell specification. Again, this is legal Haskell; the Atom library defines the special operators (e.g., >.).  One great thing about writing embedded code in Atom is that variable updates are synchronous.  For example, in the code above, fst is updated to the previous value of snd. That’s the core of the Fibonacci device.

The rest of the architecture handles the message passing (in the C code we’ll generate, messages are passed via global variables) and synchronization between the driver and device, as summarized below:

System Architecture

System Architecture

We do not assume that fibDvr and fibDev execute at the same rate, so we handle message passing with a series of handshakes.  First, fibDvr sends a new value x and notifies fibDev that the value is ready (by issuing newInd).  fibDev acknowledges that x has been received with valRcvd.  At this point, fibDvr knows to wait for fibDev to compute fib(x).  Once it receives the notice ansReady, it reads off the answer, ans.

All we have to do now is implement the handshakes.  For example, let’s look at step (3) of the device, sending the final answer to the driver.  It’s behavior should be clear from the architectural description.

atom "sendVal" $ do
  cond $ value i ==. 0
  cond $ value runFib
  runFib   <== false
  ans      <== value fst
  ansReady <== true
  valRcvd  <== false

And here’s step (1) for fibDev, waiting for a new index from the driver:

atom "getIndex" $ do
  cond $ not_ (value runFib)
  cond $ value newInd
  i        <== value x
  runFib   <== true
  fst      <== 1
  snd      <== 1
  ansReady <== false
  valRcvd  <== true

These three rules for fibDev define the body of fibDev’s “do” block.

fibDev :: Atom ()
fibDev = period 3 $ do ...

We tell atom that the period is 3, meaning execute each of our three rules every three clock ticks (based on the underlying clock).

Now that we’re comfortable with the language, let’s look at the entire definition of fibDvr in one go. Recall the job of fibDvr is to send a value then wait for an answer.  Our driver will increment values by 5, starting at 0.  It’ll stop sending new values if the index is bigger than 50.

fibDvr :: Atom ()
fibDvr = period 20 $ do
  x        <- word64 "x" 0 -- new index to send
  oldInd   <- word64 "oldInd" 0 -- previous index sent
  -- external signals --
  valRcvd  <- bool' "valRcvd" -- has the device received the new index?
  ans      <- word64' "ans" -- the newly-computed fib(x)
  ansReady <- bool' "ansReady" -- is an answer waiting?
  ----------------------
  valD     <- word64 "valD" 1 -- local copy of fib(x)
  newInd   <- bool "newInd" True -- a new index is ready
  waiting  <- bool "waiting" True -- waiting for a new computation

  atom "wait" $ do
    cond $ value valRcvd
    cond $ not_ $ value waiting
    newInd  <== false
    waiting <== true

  atom "getAns" $ do
    cond $ value ansReady
    cond $ value waiting
    cond $ value x <. 50
    valD    <== value ans
    x       <== value x + 5
    waiting <== false
    newInd  <== true
    oldInd  <== value x

Note that we’ve specified the period of the driver to be 20, meaning that its two rules get executed every 20 ticks.  So the driver is much slower than the device, but if our handshakes are correct, the two devices communicate correctly for any rates of execution of the two components.  (Proving it for all-time is a classic model checking problem.)

Compiling to C

We include a little Haskell function that we can call to “compile” fibDev and fibDvr into embedded C files.  (The compile function is part of Atom, and it takes a name for the generated C file and Atom specifications to compile.)

compileFib :: IO ()
compileFib = do
  compile "fibDev" $ fibDev
  compile "fibDvr" $ fibDvr

We can call this from an interpreter for Haskell; it takes about a second to compile. Doing so almost produces the source files fibDvr.c and fibDev.c. We do a few things manually:

  • Write two header files, fibDvr.h and fibDev.h and import them. This is the code we want to talk to each other through global variables.  We’ll also include stdio.h so we can printf our results.
  • Because Atom automatically (atomatically? :) ) generates variable and function names in the generated code, we declare some of the identifiers in fibDev.c to be static so they aren’t globally visible.
  • We #define the variable names from the Atom-generated identifiers back to the expected identifiers for the variables that are shared.
  • And we add a little main function to execute the code: let’s execute the driver and device for 500 clock ticks:
    int main() {
       while(__clock < 500) {
          fibDvr();
          fibDev();
       }
       return 0;
    }

Of course, Atom could be extended to handle these things itself—John Van Enk has already started doing some of it.  In all, our 80-some lines of Atom compile to over 200 lines of embedded C.  So let’s test it!

> gcc -Wall -o fibDvr fibDev.c fibDvr.c
> ./fibDvr

generates the following output:

i: 0, fib(i): 1
i: 0, fib(i): 1
i: 0, fib(i): 1
i: 5, fib(i): 8
i: 5, fib(i): 8
i: 10, fib(i): 89
i: 10, fib(i): 89
i: 10, fib(i): 89
i: 15, fib(i): 987
i: 15, fib(i): 987
i: 15, fib(i): 987
i: 15, fib(i): 987
i: 20, fib(i): 10946
i: 20, fib(i): 10946
i: 20, fib(i): 10946
i: 20, fib(i): 10946
i: 25, fib(i): 121393
i: 25, fib(i): 121393
i: 25, fib(i): 121393
i: 25, fib(i): 121393
i: 25, fib(i): 121393
i: 30, fib(i): 1346269
i: 30, fib(i): 1346269
i: 30, fib(i): 1346269
i: 30, fib(i): 1346269

Wait, why are we getting the same answers multiple times? Recall that Atom is a synchronous language, so functions are executed based on time (measured in underlying clock ticks), not events. But most times, the guards don’t hold, so state isn’t updated. That’s what we see here.

Oh, we should check our specification. We can do that using our original Haskell specification:

> map fib [0,5..30]
[1,8,89,987,10946,121393,1346269]

Looks good!

Let me know if this helps you understand Atom, or if you have thoughts on how Atom compares to other languages.

Finally, here are the sources:

Filed under: Software , , , ,

N-Version Programming… For the nth Time

Software contains faults.  The question is how to cost-effectively reduce the number of faults.  One approach that gained traction and then fell out of favor was N-version programming.  The basic idea is simple: have developer teams implement a specification independent from one another.  Then we can execute the programs concurrently and compare their results.  If we have, say, three separate programs, we vote their results, and if one result disagrees with the others, we presume that program contained a software bug.

N-version programming rests on the assumption that software bugs in independently-implemented programs are random, statistically-uncorrelated events.  Otherwise, multiple versions are not effective at detecting errors if the different versions are likely to suffer the same errors.

John Knight and Nancy Leveson famously debunked this assumption on which N-version programming rested in the “Knight-Leveson experiment” they published in 1986.  In 1990, Knight and Leveson published a brief summary of the original experiment, as well as responses to subsequent criticisms made about it, in their paper, A Reply to the Criticisms of the Knight & Leveson Experiment.

The problem with N-version programming is subtle: it’s not that it provides zero improvement in reliability but that it provides significantly less improvement than is needed to make it cost-effective compared to other kinds of fault-tolerance (like architecture-level fault-tolerance).  The problem is that even small probabilities of correlated faults lead to significant reductions in potential reliability improvements.

Lui Sha has a more recent (2001) IEEE Software article discussing N-version programming, taking into account that the software development cycle is finite: is it better to spend all your time and money on one reliable implementation or on three implementations that’ll be voted at runtime?  His answer is almost always the former (even if we assume uncorrelated faults!).

But rather than N-versions of the same program, what about different programs compared at runtime?  That’s the basic idea of runtime monitoring.  In runtime monitoring, one program is the implementation and another is the specification; the implementation is checked against the specification at runtime.  This is easier than checking before runtime (in which case you’d have to mathematically prove every possible execution satisfies the specification).  As Sha points out in his article, the specification can be slow and simple.  He gives the example of using the very simple Bubblesort as the runtime specification of the more complex Quicksort: if the Quicksort does its job correctly (in O(n log n), assuming a good pivot element), then checking its output (i.e., a hopefully properly sorted list) with Bubblesort will only take linear time (despite Bubble sort taking O(n2) in general).

The simple idea of simple monitors fascinates me.  Of course, Bubblesort is not a full specification, though.  Although Sha doesn’t suggest it, we’d probably like our monitor to compare the lengths of the input and output lists to ensure that the Quicksort implementation didn’t remove elements.  And there’s still the possibility that the Quicksort implementation modifies elements, which is also unchecked by a Bubblesort monitor.

But instead of just checking the output, we could sort the same input with both Quickcheck and Bubblesort and compare the results.  This is a “stronger” check insofar as different sorts would have to have exactly the same faults (e.g., not sorting, removing elements, changing elements) for an error not to be caught.  The principal drawback is the latency of the slower Bubblesort check as compared to Quicksort.  But sometimes, it may be ok to signal an error (shortly) after a result is provided.

Just like for N-version programming, we would like the faults in our monitor to be statistically uncorrelated with those in the monitored software.  I am left wondering about the following questions:

  • Is there research comparing the kinds of programming errors made in radically different paradigms, such as a Haskell and C?  Are there any faults we can claim are statistically uncorrelated?
  • Runtime monitoring itself is predicated on the belief that the implementations of different programs will fail in statistically independent ways, just like N-version programming is.  While more plausible, does this assumption hold?

Filed under: Fault Tolerance, Software , , ,

Programming Languages for Unpiloted Air Vehicles

I recently presented a position paper at a workshop addressing software challenges for unpiloted air vehicles (UAVs).  The paper is coauthored with Don Stewart and John Van Enk.  From a software perspective, UAVs (and aircraft, in general) are fascinating.  Modern aircraft are essentially computers that fly, with a pilot for backup.  UAVs are computers that fly, without the backup.

Some day in the not-so-distant future, we may have UPS and FedEx cargo planes that are completely autonomous (it’ll be a while before people are comfortable with a pilot-less airplane).  These planes will be in the commercial airspace and landing at commercial airports.  Ultimately, UAVs must be transparent: an observer should not be able to discern whether an airplane is human or computer controlled by its behavior.

You won’t be surprised to know a lot of software is required to make this happen.  To put things in perspective, Boeing’s 777 is said to contain over 2 million lines of newly-developed code; the Joint Strike Fighter aircraft is said to have over 5 million lines.  Next-generation UAVs, with pilot AI, UAV-to-UAV and UAV-to-ground communications, and arbitrary other functionality, will have millions more lines of code. And the software must be correct.

In our paper, we argue that the only way to get a hold on the complexity of UAV software is through the use of domain-specific languages (DSLs). A good DSL relieves the application programmer from carrying out boilerplate activities and providers her with specific tools for her domain. We go on and advocate the need for lightweight DSLs (LwDSLs), also known as embedded DSLs. A LwDSL is one that is hosted in a mature, high-level language (like Haskell); it can be thought of as domain-specific libraries and domain-specific syntax.  The big benefit of a LwDSL is that a new compiler and tools don’t need to be reinvented. Indeed, as we report in the paper, companies realizing the power of LwDSLs are quietly gaining a competitive advantage.

Safety-critical systems, like those on UAVs, combine multiple software subsystems.  If each subsystem is programmed in its own LwDSL hosted in the same language, then it is easy to compose testing and validation across subsystem boundaries. (Only testing within each design-domain just won’t fly, pun intended.)

The “LwDSL approach” won’t magically make the problems of verifying life-critical software, but “raising the abstraction level” must be our mantra moving forward.

Filed under: Software, Verification , , ,

Byzantine Cyclic Redundancy Checks (CRC)

I’m working on a NASA-sponsored project to monitor safety-critical embedded systems at runtime, and that’s started me thinking about cyclic redundancy checks (CRCs) again.  Error-checking codes are fundamental in fault-tolerant systems.  The basic idea is simple: a transmitter wants to send a data word, so it computes a CRC over the word.  It sends both the data word and the CRC to the receiver, which computes the same CRC over the received word.  If its computed CRC and the received one differ, then there was a transmission error (there are simplifications to this approach, but that’s the basic idea).

CRCs have been around since the 60s, and despite the simple mathematical theory on which they’re built (polynomial division in the Galois Field 2, containing two elements, “0″ and “1″), I was surprised to see that even today, their fault-tolerance properties are in some cases unknown or misunderstood.  Phil Koopman at CMU has written a few nice papers over the past couple of years explaining some common misconceptions and analyzing commonly-used CRCs.

Particularly, there seems to be an over-confidence in their ability to detect errors.  One fascinating result is the so-called “Schrödinger’s CRC,” so-dubbed in a paper entitled Byzantine Fault Tolerance, from Theory to Reality, by Kevin Driscoll et al. A Schrödinger’s CRC occurs when a transmitter broadcasts a data word and associated CRC to two receivers.   and at least one of the data words is corrupted in transit and so is the corresponding CRC so that the faulty word and faulty CRC match!  How does this happen?  Let’s look at a concrete example:

             11-Bit Message           USB-5
Receiver A   1 1 1 1 1 1 0 1 1 0 1    1 0 0 0 q
Transmitter  1 1 1 1 1 1 0 1 1 0 ½    1 ½ 0 ½ 1
Receiver B   1 1 1 1 1 1 0 1 1 0 0    1 1 0 1 1

We illustrate a transmitter broadcasting an 11-bit message to two receivers, A and B. We use USB-5 CRC, generally used to check USB token packets  (by the way, for 11-bit messages, USB-5 has a Hamming Distance of three, meaning the CRC will catch any corruption of fewer than three bits in the combined 11-bit message and CRC).  Now, suppose the transmitter has suffered some fault such as a “stuck-at-1/2” fault so that periodically, the transmitter fails to drive the signal on the bus sufficiently high or low. A receiver may interpret an intermediate signal as either a 0 or 1. In the figure, we show the transmitter sending three stuck-at-1/2 signals, one in the 11-bit message, and two in CRC.  The upshot is an example in which a CRC does not prevent a Byzantine fault—the two receivers obtain different messages, each of which passes its CRC.

One question is how likely this scenario is.  Paulitsch et al. write that The probability of a Schrödinger’s CRC is hard to evaluate.  A worst-case estimate of its occurrence due to a single device is the device failure rate.”  It’d be interesting to know if there’s any data on this probability.

Filed under: Fault Tolerance , , ,