Max: Phase 1 Report

May 18, 2016

I sent the following R&D report to my colleagues, but a few other folks outside Galois have expressed interest in the project, so I’m sharing it more broadly.


Subject: Max: Phase 1 Report

As some of you know, about nine months ago, I started a skunk-works R&D project with Brooke. Autonomous systems are all the rage these days, so we decided to try to create one. First, I have to be honest and say that although I participated in the initial project kickoff, I mostly played a supporting role after that. Brooke did all the major software and hardware development. (If you’ve worked with me on a project, this should sound pretty familiar.) Once Brooke started development, she really threw herself into it. She seemed to be working on things day and night, and it even looked a bit painful at times. She sensed she was getting close to an alpha release a few days ago, and after a final four hour sprint, she made the release at 2:30am on Mothers Day! We are officially out of stealth mode.

We call our project “Machine Awareness X”, or Max for short. The current system is capable of basic knowledge discovery and processing. Now, don’t get too excited; we expect at least a few years before it’s able to do something interesting with the knowledge acquired. I won’t go into the technical details, but the programming model is very biological—think “Game of Life” on steroids. At this point, we’ll have to continue to provide guidance and some rules, but its basically self-programming.

Following a “Johnny Ive” approach to design, Max has basically one notification method. It’s a fairly piercing audio signal used whenever his power supply is running low or there’s a hardware problem. (Frankly, sometimes it seems to just go off for no reason at all.) We designed it to be loud enough to hear across the house, but I wish it had a volume control. Next time! Otherwise, the package is quite attractive, in my opinion, even cute. Unfortunately, at 7lbs. 8oz., the hardware is heavier than even a decade-old laptop, and we expect new versions to require an even larger form factor. Fortunately, we designed the system to be self-propelling, although it’ll take a few years before that hardware is developed (the software isn’t ready for it anyways).

There’s still quite a bit of work to do. Our back-of-the-envelope estimate is that we’ll have to spend just short of two decades caring for Max before he’s fully autonomous. Even more disappointingly, we’re estimating having to spend up to a quarter million (in today’s dollars) in upkeep and maintenance! (Sadly, while others are interested in playing with the system intermittently, nobody seems that interested in joining us as early investors.) Despite all the training we’re planning to provide, the system seems too complicated to guarantee certain behaviors. For example, while more general than an autonomous car, it may take more than 15 years of training before his software is capable of piloting a conventional automobile.

I’m guessing some of you are wondering about commercialization opportunities. The good news: we expect Max to be useful to society (we haven’t found a killer app yet, though) and to generate quite a bit of revenue over its lifetime. The bad news: we don’t expect it to start producing reliable revenue for more than 20 years. What’s more, it has a lot of upkeep expenses that will only grow with time. This might sound like science fiction, but we imagine he might even replicate himself in the distant future, and will likely pour his revenues into his own replicants. In short, we don’t expect to make a dime from the project.

More seriously, we had a kid; mom and baby are doing fine. See you soon.

Papa Pike

Viva La Resistance! A Resistance Game Solver

February 8, 2016

Update (Feb142016): see bottom for an improved strategy.

The Game

At a December workshop, I played The Resistance, a game in which there are two teams, the resistance and the spies. The spies know everyone’s identity; each resistance player knows only one’s own. Overall, the goal of the spies is to remain undetected; the goal of the resistance is to discover who the spies are.

Play proceeds in rounds in which a player nominates a subgroup to go on a “mission”. The nomination is then voted on. If the vote succeeds, every member of a mission plays either a “success” or “failure” card for the mission. One or two failure cards (depending on the mission size) causes the mission to fail. The cards played in a mission are public, but it is secret who played which card. (If the vote fails, the next player nominates a subgroup.)

The spies’ goal is to fail missions without being detected, and the resistance goal is to have missions succeed. So the spies generally wish to play failure cards while in a mission. Furthermore, spies always want some spy in the mission to spoil it. The resistance wants no spies to go on missions. The problem for the resistance is that when a mission fails, they know one or more of the subgroup is a spy, they just don’t know which one.

The spies win the game if they can cause three missions to fail before the resistance can cause three missions to succeed.

The game is simple but engaging, and similar in spirit to the game Mafia (Werewolf).y

The Problem

I played the game with computer scientists from the top universities and research labs. We debated strategy and were generally pretty bad at the game, despite having a lot of fun. In particular, the spies seemed to win a lot.

The problem is: what is the optimal approach to game play?

The Approach

The problem is a great fit for Bayesian Analysis. In each round, we learn partial information about the players from the group outcome. Only spies play failure cards. We can use that information to update our believe in the probability that a player is a spy.

Suppose there are four players, [ a, b, c, d] and two spies. Initially each player has the same probability of being a spy, 2/4 = 0.5. Now suppose that [ a, b, c ] go on a mission, and return the set of cards \{Success, Fail, Fail\}. How do we update the spy probabilities of the group?

Bayes Theorem states that

P(A|B) = \dfrac{P(A)P(B|A)}{P(B)}

In our case, “A” is the event that a particular player is a spy, and “B” is the event that we we observed a particular set of mission cards. We wish to compute P(A|B), the updated probability that a player is a spy given the cards played in the mission.

So for each mission, we apply Bayes’ Theorem to each player, including the players not in the mission—if the spy probabilities increase (or decrease) for the mission players, then they decrease (or increase) for the non-mission players.

From the cards \{Success, Fail, Fail\}, we know that two of [ a, b, c ] are spies (and so d is definitely not a spy, since there are two spies, total). Let’s compute the updated spy probability for player a.

P(A) = 0.5, the original spy probability for player a (or any other player). To calculate P(B), we first determine every possible assignment of players in the mission to spies and non-spies (there are \binom{2}{3} such assignments). In the mission, there are three possibilities:

  1. spies = [a, b]
  2. spies = [a, c]
  3. spies = [b, c]

For each combination, we multiply the probabilities for the assignments. So in case (1), we have

0.5(0.5)(1-0.5) = 1/8

for assigning players a and b to being spies and c to being a non-spy. Then we sum the probabilities for all three combinations. In our example, we get

\frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8}

So P(B) = 0.375.

To compute P(B|A), we assume that player a is a spy, and now recompute the probability that [b, c] contains the remaining one spy. Using the same approach as for computing P(B), we get P(B|A) = 0.5. Now we can apply Bayes’ Theorem:

\dfrac{0.5(0.5)}{0.375} = 2/3

So player’s a probability of being a spy shot up to 0.66, and so did player b’s and c’s.

Player d‘s spy probability drops to 0. The updated spy probabilities for any player not in the mission can be computed just as we did for the mission players, except we take the total number of spies and subject the number of failure cards observed. In this case, however, since we know all the spies were in the group, d‘s spy probability must be 0. (Another way of thinking about it is that there is an invariant that the sum of the player’s spy probabilities must always be the total number of spies in the group.)

The Strategies

How should the game be played by the resistance and spies, respectively, to increase the odds of winning?

For the resistance, picking the players the least likely to be spies is optimal. Let’s call this the pick-lowest strategy.

One possible optimization is to always pick yourself in a mission. The rational is that you know if you are part of the resistance, and you pick yourself for a mission, then you have at least one guaranteed non-spy. So even if your probability of being a spy as known to the group is higher than others, you have perfect knowledge of yourself. Let’s call this the pick-myself strategy,

For the spies, there are a few options. By default, the spies could always play a failure card (fail-always). But a spy might also play a success card to avoid detection; doing so is especially advantageous in the early rounds; one strategy is to never fail the first round (succeed-first). If the spies can collude to ensure only one plays a failure card during a mission, that provides the least amount of information to the resistance (fail-one).

There are other strategies and other combinations of strategies, but this is a good representative sample.

Which are the best strategies?


To discover the best strategies, we use Monte-Carlo simulations on the strategies over each number of players (with different number of players, there are different number of spies and missions). I found a few interesting results:

  • The best chance of wining for the resistance, and the closest odds between the resistance and spies, is with six players. At six players, the resistance has a 30-40% chance of winning under different strategies. The worst configuration is eight players, with not more than a 14% chance of winning for the resistance. During actual game play, it seemed that the odds favored the spies.
  • The best strategy for the resistance is the pick-lowest strategy. The strategy may be counter-intuitive, but consider this: the pick-myself strategy provides the spies an opportunity to always include themselves in a mission when picking a mission. The pick-myself strategy is an instance of a local maxima (i.e., local knowledge of knowing yourself to be part of the resistance) that is non-optimal.
  • Moreover, voting becomes a no-op. The game includes a voting round in which players vote on the proposed players for a mission. But If the resistance agrees on an optimal strategy, any deviation from the strategy by a player is because the player is a spy. If the person does deviation, the resistance votes against it (and resistance outnumber spies, so will win the vote), and we now have complete assurance the proposer is a spy. The spies have no choice but to follow the optimal strategy of the resistance.
  • Of the spy strategies listed above, the best is the fail-one strategy, which is intuitive. The succeed-first strategy is another example of a local maxima that is non-optimal; while it protects that particular spy from detection, it is more valuable for the spies in general to fail the mission.

The related Mafia game has some analytical results, giving bounds on (their version of) the resistance to spies. I have not done that, nor have I done Monte-Carlo analysis to determine what proportion of resistance and spies and mission sizes gives a more even chance of winning. In Mafia, it is noted that in actual game play, the resistance wins more often than simulations/analytical analysis would suggest, with different attributions (e.g., people are bad at lying over iterative rounds).

Play Along at Home

I have implemented the Bayesian solver in a webserver hosted on Amazon Web Services. You can use the solver when playing with others.

An easier-to-remember link is

If you want to run Monte Carlo simulations, you will have to download the code and run it locally, however.


Update (Feb142016)

It has been pointed out by Eike Schulte in the comments and Iavor Diatchki that by including some additional information, the strategy might be improved. This is indeed the case. The intuition is that if a group has previously included a spy, that group should not be selected again, even if it is the lowest probability group. For example, with five players, consider the following rounds:

  • Round 0: players [0,1] are selected and there is one fail card.
  • Round 1: players [2,3,4] are selected and there is one fail card.
  • Round 2: players [2,3] are selected and there are no fail cards.

At this point, players [0,1] have a 0.5 probability of being a spy and players [2,3,4] have a 1/3 probability. So in round 3, we do not want to select players [2,3,4] even if they have the lowest probabilities. So we select a group with the lowest spy probability that has not already included a spy. The server has been updated to include this strategy. The strategy does better; for example, at 6 players, we have just over a 50% chance of winning!

15 Minutes

February 9, 2015

Some of the work I lead at Galois was highlighted in the initial story on 60 Minutes last night, a spot interviewing Dan Kaufman at DARPA. I’m Galois’ principal investigator for the HACMS program, focused on building more reliable software for automobiles and aircraft and other embedded systems. The piece provides a nice overview for the general public on why software security matters and what DARPA is doing about it; HACMS is one piece of that story.

I was busy getting married when filming was scheduled, but two of my colleagues (Dylan McNamee and Pat Hickey) appear in brief cameos in the segment (don’t blink!). Good work, folks! I’m proud of my team and the work we’ve accomplished so far.

You can see more details about how we have been building better programming languages for embedded systems and using them to build unpiloted air vehicle software here.

All Hammers are Terrible

December 29, 2014

“All computers are terrible.”

On social media, this turn of phrase is almost a meme. The sentiment is reasonable at some level—computers are complicated, buggy, and insecure. But the general public is largely indifferent, or least, they accept the situation for what it is: Computer freezes? turn it on then off again. Smart phone insecure? Well, the convenience of digital wallets outweighs the risk. My parents, who are not computer scientists, never say that all computers are terrible.

Rather, this is a complaint from within, by computer scientists themselves.  What drives this thinking? (While not specifically addressing the cause of the sentiment, this blog post provides a nice counterpoint.) No immediately obvious hypotheses seem particularly explanatory:

  • Ludditism: perhaps despite (or because of) our work in technology, computer scientists abhor it. But I find this explanation wanting, particularly because the focus of contempt is specifically about computers, and not about technological advancement generally.
  • Pessimism: perhaps there is a correlation between computer scientists and pessimists. Faced with the challenge of building good systems, they focus on the flaws rather than the progress. But it does not adequately explain the focus of the pessimism on computers rather than life in general—I don’t see posts stating that “all relationships are terrible”.
  • Realism: perhaps computers are terrible. Computer science history is measured in decades, and the approach to building systems is immature compared to other engineering disciplines. I tore down my old garage this summer. The city required a blueprint, erosion plan, a drainage plan, and a few hundred dollars for me to do so. They inspected the work afterwards. Contrast this with software: how much oversight does writing security-critical code get? Just about anyone can throw code up that purports to solve a problem. Still, given that programs are among the most complex artifacts built by humans, we’re doing pretty well.
  • Narcissism: finally, perhaps a tinge of narcissism drives these comments, something along the lines of, “where others have failed, I will succeed.” It is given as a tongue-in-cheek put-down regarding the work of fellow professionals. The motivation is there: if we thought that the tools of computer science were “good enough”, then the world wouldn’t need new programming languages, tools, or libraries. On the other hand, these kind of complaints are not just about directly competing work, and there will always be new inventions to make.

I tend not to hear the same kind of complaints from members of other professions. Yes, we all complain about our jobs, but not necessarily about our tools. Carpenters don’t say, “All hammers are terrible,” even if they complain about their company, their coworkers, the architect, the inspectors. Doctors don’t complain about X-ray machines, MRIs, or scalpels (but they might complain about their patients, insurance, and administration). Farmers don’t complain about the design of combines and tractors. Now, they might complain about a specific instance—“My tractor is terrible; it breaks down all the time.” But not about tractors, in general.

Other professions largely take their tools as a given. Indeed, most practitioners know so little about how the tools they use are built, it is difficult to critique them. A carpenter does not know how a skill saw is built and a doctor does not know how an X-rays are taken. Just whether they work or not and if they are useful or not.

Computer science is unique insofar as the tool makers are so closely connected with the tool users. There’s another glib quip in computer science: “patches welcome” meaning that if you (the user) thinks a tool should be improved, then it is your responsibility to make those improvements yourself. In other fields, the saying does not exist; it would be laughable! A taxi driver can’t be expected to invent a more fuel-efficient automobile; an architect can’t be expected to build better modeling software.

Computer science is unique in this respect, and it has philosophical, psychological, and sociological implications. Imagine, if you will, you are a carpenter, and you have some understanding of how the tools that you have are designed and built. Your skill saw is right-handed, but you are left-handed; with a few modifications, you realize its design can be abstracted, to accommodate either handedness. When your drill dies, instead of simply replacing it, you open it up, find the faulty circuit, and realize there’s a weakness in the wiring that allows it to overheat. When each of your tools fails you in some way, you uncover some flaw and realize they could be improved. Finally, you come to expect failure from every tool you might ever use: “All hammers are terrible.”

We all know that technology—whether we understand how it is built or not—has flaws. We know tradeoffs are made between cost, generality, performance, and reliability. There are more and less reliable, performant, and costly cars, saws, X-ray machines.

The general public certainly has opinions and a sense of the tradeoffs between the tools they use, e.g., Windows vs. Mac OS, Android vs. iOS. But the sentiment is that these are tools, with flaws and lifespans, just like a cars, appliances, and power tools.

I don’t have a strong opinion about whether computer scientists should complain or not about their tools, except, obviously, that general complaints don’t effect much change. My point is more philosophical: we find ourselves in an interesting profession, where we can open our very tools up, peer inside, and change them.

It’s a fascinating prospect.


October 10, 2014

I wanted to try out a recent feature in GHC this week, so I was using HEAD. When I say using, I mean it: I wasn’t developing with it, but using it to build Ivory, our safe C EDSL, as well as some libraries written in Ivory. I want to point out a few dragons that lay ahead for others using HEAD ( today.

  • The Functor, Applicative, Monad hierarchy (as well as the Alternative and MonadPlus hierarchy) is no longer a warning but an error. You’ll be adding a lot of instances to library code.
  • You’ll need to update Cabal, which comes as a submodule with the GHC sources. Unfortunately, building Cabal was a pain because of the bootstrapping problem. The script in cabal-install is broken (e.g., outdated dependency versions). Getting it to work involved using runhaskell directly, modifying a bunch of Cabal’s dependencies, and getting a little help from Iavor Diatchki.
  • Some of the datatypes in Template Haskell have changed; notably, the datatypes for creating class constraints have been folded into the datatype for constructing types (constraint kinds!). Many libraries that depend on Template Haskell breaks as a result.
  • I won’t mention the issues with package dependency upper bounds. Oops, I just did.

Fortunately, once Cabal is updated, Cabal sandboxes work just fine. I wrote a simple shell script to swap out my sandboxes to switch between GHC versions. (It would be a nice feature if Cabal recognized you are using a different version of GHC than the one the cabal sandbox was originally made and created a new sandbox automatically.)

Because I’m on OSX and use Homebrew, I tried using it to manage my GHC versions, including those installed with Brew and those built and installed manually. It works great. When building GHC, configure it to install into your Cellar, or wherever you have Brew install packages. So I ran

> ./configure --prefix=/usr/local/Cellar/ghc/7.9

which makes Brew aware of the new version of GHC, despite being manually installed. After it’s installed,

> brew switch ghc 7.9

takes care of updating all your symbolic links. No more making “my-ghc-7.9” symbolic links or writing custom shell scripts.

SmartChecking Matt Might’s Red-Black Trees

August 20, 2014

Matt Might gave a nice intro to QuickCheck via testing red-black trees recently. Of course, QuickCheck has been around for over a decade now, but it’s still useful (if underused–why aren’t you QuickChecking your programs!?).

In a couple of weeks, I’m presenting a paper on an alternative to QuickCheck called SmartCheck at the Haskell Symposium.

SmartCheck focuses on efficiently shrinking and generalizing large counterexamples. I thought it’d be fun to try some of Matt’s examples with SmartCheck.

The kinds of properties Matt Checked really aren’t in the sweet spot of SmartCheck, since the counterexamples are so small (Matt didn’t even have to define instances for shrink!). SmartCheck focuses on shrinking and generalizing large counterexamples.

Still, let’s see what it looks like. (The code can be found here.)

SmartCheck is only interesting for failed properties, so let’s look at an early example in Matt’s blog post where something goes wrong. A lot of the blog post focuses on generating sufficiently constrained arbitrary red-black trees. In the section entitled, “A property for balanced black depth”, a property is given to check that the path from the root of a tree to every leaf passes through the same number of black nodes. An early generator for trees fails to satisfy the property.

To get the code to work with SmartCheck, we derive Typeable and Generic instances for the datatypes, and use GHC Generics to automatically derive instances for SmartCheck’s typeclass. The only other main issue is that SmartCheck doesn’t support a `forall` function like in QuickCheck. So instead of a call to QuickCheck such as

> quickCheck (forAll nrrTree prop_BlackBalanced)

We change the arbitrary instance to be the nrrTree generator.

Because it is so easy to find a small counterexample, SmartCheck’s reduction algorithm does a little bit of automatic shrinking, but not too much. For example, a typical minimal counterexample returned by SmartCheck looks like

T R E 2 (T B E 5 E)

which is about as small as possible. Now onto generalization!

There are three generalization phases in SmartCheck, but we’ll look at just one, in which a formula is returned that is universally quantified if every test case fails. For the test case above, SmartCheck returns the following formula:

forall values x0 x1:
T R E 2 (T B x1 5 x0)

Intuitively, this means that for any well-typed trees chosen that could replace the variables x0 and x1, the resulting formula is still a counterexample.

The benefit to developers is seeing instantly that those subterms in the counterexample probably don’t matter. The real issue is that E on the left is unbalanced with (T B E 5 E) on the right.

One of the early design decisions in SmartCheck was focus on structurally shrinking data types and essentially ignore “base types” like Int, char, etc. The motivation was to improve efficiency on shrinking large counterexamples.

But for a case like this, generalizing base types would be interesting. We’d hypothetically get something like

forall values (x0, x1 :: RBSet Int) (x2, x3 :: Int):
T R E x2 (T B x1 x3 x0)

further generalizing the counterexample. It may be worth adding this behavior to SmartCheck.

SmartCheck’s generalization begins to bridge the gap from specific counterexamples to formulas characterizing counterexamples. The idea is related to QuickSpec, another cool tool developed by Claessen and Hughes (and SmallBone). Moreover, it’s a bridge between testing and verification, or as Matt puts it, from the 80% to the 20%.

SmartCheck: Redux

June 16, 2014

A few years ago, I started playing with the idea of making counterexamples from failed tests easier to understand. (If you’re like me, you spend a lot of time debugging.) Specifically, I started from QuickCheck, a test framework for Haskell, which has since been ported to many other languages. From this, a tool called SmartCheck was born.

In this post, I’m not going to describe the technical details of SmartCheck. There’s a paper I wrote for that, and it was just accepted for publication at the 2014 Haskell Symposium (warning: the paper linked to will change slightly as I prepare the final version).

Rather, I want to give a bit of perspective on the project.

As one reviewer for the paper noted, the paper is not fundamentally about functional programming but about testing. This is exactly right. From the perspective of software testing in general, I believe there are two novel contributions (correct me ASAP if I’m wrong!):

  1. It combines counterexample generation with counterexample reduction, as opposed to delta-debugging-based approaches, which performs deterministic reduction, given a specific counterexample. One possible benefit is SmartCheck can help avoid stopping at local minima during reduction, since while shrinking, new random values are generated.  Update: as John Regehr points out in the comments below, his group has already done this in the domain of C programs.  See the paper.
  2. Perhaps the coolest contribution is generalizing sets of counterexamples into formula that characterize them.

I’d be interested to see how the work would be received in the software testing world, but I suppose first it’d need to be ported to a more mainstream language, like C/C++/Java.

Compared to QuickCheck specifically, QuickCheck didn’t used to have generics for implementing shrinking functions; recent versions include it, and it’s quite good. In many cases, SmartCheck outperforms QuickCheck, generating smaller counterexamples faster.  Features like counterexample generalization are unique to SmartCheck, but being able to test functions is unique to QuickCheck. Moreover, I should say that SmartCheck uses QuickCheck in the implementation to do some of the work of finding a counterexample, so thanks, QuickCheck developers!

When you have a tool that purports to improve on QuickCheck in some respects, it’s natural to look for programs/libraries that use QuickCheck to test it. I found that surprisingly few Hackage packages have QuickCheck test suites, particularly given how well known QuickCheck is. The one quintessential program that does contain QuickCheck tests is Xmonad, but I challenge you to think of a few more off the top of your head! This really is a shame.

The lack of (public) real-world examples is a challenge when developing new testing tools, especially when you want to compare your tool against previous approaches. More generally, in testing, it seems there is a lack of standardized benchmarks. What we really need is analogue of the SPEC CPU2000 performance benchmarks for property-based testing, in Haskell or any other language for that matter.  I think this would be a great contribution to the community.

In 1980, Boyer and Moore developed a linear-time majority vote algorithm and verified an implementation of it. It took until 1991 to publish it after many failed tries. Indeed, in the decade between invention and publication, others had generalized their work, and it being superseded was one of the reasons reviewers gave for rejecting it! (The full story can be found here.)

To a small extent, I can empathize. I submitted a (slightly rushed) version of the paper a couple years ago to ICFP in what was a year of record submissions. One reviewer was positive, one luke-warm, and one negative. I didn’t think about the paper much over the following year or two, but I got a couple of requests to put the project on Hackage, a couple of reports on usage, and a couple of inquiries about how to cite the draft paper. So after making a few improvements to the paper and implementation, I decided to try again to publish it, and it finally will be.

As I noted above, this is not particularly a Haskell paper. However, an exciting aspect of the Haskell community, and more generally, the functional programming community, is that it is often exploring the fringes of what is considered to be core programming language research at the time. I’m happy to be part of the fringe.

Embedded Security in the Mainstream

April 1, 2014

Science Friday had an interesting podcast with Bruce Schneier worth a listen.  It discussed security in embedded systems (or as is hip to say now, the Internet of Things).  The idea of security in embedded systems seemed pretty obscure just a few years ago, so it’s nice to see it being featured on a general-audience radio show.

So who’s going to listen to it from their phone, connected to their car’s audio over Bluetooth, on the way to the store to buy a Nest smoke alarm?

Stop Regression Testing

December 15, 2013

Some of my colleagues and I have been building a large software application, the open-source SMACCMPilot autopilot system, using Haskell-based embedded domain-specific languages (EDSLs) to generate embedded C code.  Our thesis is that EDSLs both increase our productivity and increase the assurance of our application; so far, we believe our thesis is holding up: we’ve created a complex 50k lines-of-code application in under two engineer years that is (mostly) free of undefined and memory-unsafe behavior.

Of course, there are always bugs.  Even with a fully formally-verified system, there are likely bugs in the specification/requirements, bugs in the hardware, or bugs in the users.  But our hope is that most of our bugs now are logical bugs rather than, say, arithmetic bugs (e.g., underflow/overflow, division by zero) or resource bugs (e.g., null-pointer dereference or buffer overflows).

However, since the EDSL we are using, Ivory, is relatively young—we’ve been writing SMACCMPilot in the language while developing the language—there have been cases where we have not checked for unsafe programs as stringently as possible.  For example, while we insert arithmetic underflow/overflow checks in general, we (ahem, I) neglected to do so for initializers for local variables, so an initializer might overflow with an unexpected result.

If we were writing our application in a typical compiled language, even a high-level one, and found such a bug in the compiler, we would perhaps file a bug report with the developers.  If it were an in-house language, things would be simpler, but modifying a large stand-alone compiler can be difficult (and introduce new bugs).  Most likely, the compiler would not change, and we’d either make some ad-hoc work-around or introduce regression tests to make sure that the specific bug found isn’t hit again.

But with an EDSL, the situation is different.  With a small AST, it’s easy to write new passes or inspect passes for errors.  Rebuilding the compiler takes seconds.  Fixing the bug with initializers took me a few minutes.

So introducing regression tests about the behavior of initializers isn’t necessary; we’ve changed the compiler to eliminate that class of bugs in the future.

More generally, we have a different mindset programming in an EDSL: if a kind of bug occurs a few times, we change the language/compiler to eliminate it (or at least to automatically insert verification conditions to check for it).  Instead of a growing test-suite, we have a growing set of checks in the compiler, to help eliminate both our bugs and the bugs in all future Ivory programs.

Edit (16 Dec. 2013): a few folks have noted that the title is a bit strong: it absolutely is, that’s probably why you read it. :)  Yes, you should (regression) test your applications.  You might consider testing your compiler, too.  Indeed, the point is that modifying the compiler can sometimes eliminate testing for a class of bugs and sometimes simply supports testing by inserting checks automatically into your code.


Printf No More

October 23, 2013

I wondered to a colleague if there is a way to get GDB to print the value of a variable when it changes or when a certain program location is reached without breaking; in effect, placing printfs in my program automatically.

His Google-foo was better than mine he and found a stackoverflow solution.  For example, something like

  set pagination off
  break foo.c:42
  print x

prints the value of x whenever line 42 in file foo.c is reached, but just keeps running the program.  You can stick a little script like this in your .gdbinit file, which you can reload with

  source .gdbinit

This is particularly useful if you’re debugging a remote target that doesn’t support print statements. It’s a cool trick for embedded systems debugging.


October 7, 2013


Over on the Galois blog is a post about my current project, building a secure high-assurance autopilot called SMACCMPilot.  SMACCMPilot is open-source; is the project’s landing page that describes the technologies going into the project with links to the software repositories.  Check it out!

Galois’ main approach to building SMACCMPilot is to use embedded domain-specific languages (EDSLs), embedded in Haskell that compile down to restricted versions of C.  (If you’re not familiar with the “EDSL approach”, and particularly how it might improve the assurance of your systems, check out this paper we wrote about our experiences on a separate NASA-sponsored project.)  The project is quickly approaching one of the larger EDSL projects I know about, and it’s still relatively early in its development.  The EDSLs we’re developing for SMACCMPilot, Ivory (for embedded C code generation) and Tower (for OS task setup and communication), are suitable for other embedded systems projects as well.

Like many young and active open-source projects, the code is hot off the presses and under flux, and the documentation always lags the implementation, so let me know if you try using any of the artifacts and have problems.  We’re happy to have new users and contributors!

Book Review: Automate This

May 27, 2013

A family member gave me a copy of Automate This by Christopher Steiner as a gift a few months ago.  The subtitle of the book is “How algorithms came to rule our world.”  The book is a non-technical, fast, and easy read.  I did read a few interesting stories, such as NASA’s use of personality categorization algorithms in the 70s to predict which potential astronauts would work well together, or a math professor’s algorithmic dissection of Beatles songs.  The book particularly emphasizes algorithmic trading on Wall Street.  The somewhat non sequitur conclusion is that we need more science, engineering, and math graduates.

The main point of the book is that algorithms are pervasive and will become more so.  Toward this point, I think the author could have cast an even wider net, mentioning that algorithms are implemented in everything from elevator controllers to autopilots.  There is a cursory pre-computing history of algorithms at the beginning of the book that is (tenuously) tied to a Wall Street trading.

Rather than focus on algorithms broadly, the focus is patchy, hitting on machine learning, high-speed trading, and game theory.  Some mention of algorithms as might be taught in a “Analysis of Algorithms” course, covering basic topics like time-space complexity and decidability, would help prevent the general public from having a narrow-minded interpretation of algorithms.  Computer scientists invent, analyze, and implement algorithms every day, but much of that work is perhaps not sexy enough for a popular science book.

Incidentally, for the topics Steiner does cover, an excellent companion treating similar topics, but with a different focus, is Nate Silver’s (of 538 fame) book, The Signal and the Noise.

A Dane Goes to NASA

November 3, 2012

Last year, Nis Wegmann came from the University of Copenhagen to work with me on a NASA-sponsored project.  He was recently interviewed about his experiences.

Nis worked at the National Institute of Aerospace in Virginia, just outside the NASA Langley gates.  Since Galois is on the other side of the U.S., we mostly collaborated remotely.  (For a taste of what its like to intern at Galois itself, check out Ed Yang’s post.)  Nis is an outstanding programmer and great writer, so it made working together effortless, despite the distance.

Good luck in your future endeavors, Nis!

The Worst Offenders

October 13, 2012

I listened to a Freakonomics podcast the other day in which a story about hand-washing at Cedars-Sinai Medical Center was told. It turns out that of all the staff, the doctors had the lowest rate of hand-washing (around 65%, if I recall correctly) in the hospital. This was surprising, since the doctors supposedly have the best education and so should know the most about the dangers of bacterial infection.

The chief of the staff tried different means to increase hygiene. Incentives like gift cards were tried. Then he tried another approach: shame. Doctors had their hands cultured so that they could see the bacteria on their hands. Images of the cultures were even used as a hospital-wide screensaver on the computers.

I listened to this wondering if there are any parallels to formal verification engineers and researchers. Ostensibly, we are the experts on the perils of software bugs. Academic papers often begin with some story about bugs in the wild, motivating the new formal verification technique described therein.

But to what extent do we practice what we preach? How many of us might write a Python script with no unit tests? C/C++ programs without running Lint/CppCheck? Compile without fixing all the -Wall warnings the compiler emits?

These kinds of activities represent the lowest rung of software assurance; they’re the “hand washing of software assurance”, if you will. I’m certainly guilty myself of not always practicing good “software hygiene”. The justifications you might give for failing to do so is that you’re just writing a program for personal use, or you’re writing prototype software, “just for research”. I could imagine doctors telling themselves something similar: “I always scrub thoroughly before a major surgery, but it’s not such a big deal for a simple office visit.” But this justification can become a slippery slope.

There are a few examples in which verification-tool assurance shoots straight for the top. For example, John Harrison verified a model of HOL Light in HOL Light. Filip Maríc verified a SAT solver. But these are more intellectual curiosities than standard operating procedure.

It could be an interesting empirical study to analyze the software quality of open-source formal verification tools to see just how buggy they are (spoiler alert: they are buggy). What I’m interested in are not deep flaws that might be caught via theorem-proving, but simple ones that could be caught with a little better test coverage or lightweight static analysis.

For now, I’m just happy my colleagues’ screensaver isn’t a bug tracker for software that I’ve written.

Lowering the Bar

October 2, 2012

I gave a talk (video, slides, and paper) at ICFP last month, arguing that it can be easy to build a high-assurance compiler. I gave a similar talk as a keynote a couple weeks later at the very enjoyable Midwest Verification Day, hosted by Kansas University this year (thanks Andy Gill and Perry Alexander for inviting me!). This paper wraps up the Copilot project. I had a great time (I mean, how often do formal methods engineers get to be around NASA subscale jet aircraft?!?).


July 26, 2012

I’ve been working on a Haskell library for testing Haskell programs I call SmartCheck. SmartCheck is focused on testing algebraic data and generalizing counterexamples found. Below is the README for SmartCheck, which I have located on GitHub (I haven’t put it on Hackage yet). The following is a high-level explanation that doesn’t go into details about the algorithms or implementation (that’s another post!).

I’d be interested in feedback on

  • Real-world examples to try SmartCheck on.
  • Whether there are other interesting ways to generalize counterexamples.
  • If there’s similar work out there I should know about (in addition to QuickCheck and SmallCheck.
  • Your experiences, if you try the library out.



SmartCheck is a smarter QuickCheck, a powerful testing library for Haskell. The purpose of SmartCheck is to help you more quickly get to the heart of a bug and to quickly discover each possible way that a property may fail.

SmartCheck is useful for debugging programs operating on algebraic datatypes. When a property is true, SmartCheck is just like QuickCheck (SmartCheck uses QuickCheck as a backend). When a property fails, SmartCheck kicks into gear. First, it attempts to find a minimal counterexample to the property is a robust, systematic way. (You do not need to define any custom shrink instances, like with QuickCheck, but if you do, those are used. SmartCheck usually can do much better than even custom shrink instances.) Second, once a minimal counterexample is found, SmartCheck then attempts to generalize the failed value d by replacing d‘s substructures with new values to make d', and QuickChecking each new d'. If for each new d' generated, the property also fails, we claim the property fails for any substructure replaced here (of course, this is true modulo the coverage of the tests).

SmartCheck executes in a real-eval-print loop. In each iteration, all values that have the “same shape” as the generalized value are removed from possible created tests. The loop can be iterated until a fixed-point is reached, and SmartCheck is not able to create any new values that fail the property.

A typical example

In the package there is an examples directory containing a number of examples. Let’s look at the simplest, Div0.hs.

> cd SmartCheck/examples
> ghci -Wall Div0.hs

Div0 defines a toy language containing constants (C), addition (A), and division (D):

data M = C Int
       | A M M
       | D M M
  deriving (Read, Show, Typeable, Generic)

Because SmartCheck performs data-generic operations using GHC.Generics we have to derive Typeable and Generic. To use GHC.Generics, you also need the following pragmas: and the single automatically-derived instance:

{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE DeriveGeneric #-}

instance SubTypes M 

Let’s say we have a little interpreter for the language that takes care to return Nothing if there is a division by 0:

eval :: M -> Maybe Int
eval (C i) = Just i
eval (A a b) = do
  i <- eval a 
  j <- eval b
  return $ i + j
eval (D a b) = 
  if eval b == Just 0 then Nothing 
    else do i <- eval a 
            j <- eval b
            return $ i `div` j

Now suppose we define a set of values of M such that they won’t result in division by 0. We might try the following:

divSubTerms :: M -> Bool
divSubTerms (C _)       = True
divSubTerms (D _ (C 0)) = False
divSubTerms (A m0 m1)   = divSubTerms m0 && divSubTerms m1
divSubTerms (D m0 m1)   = divSubTerms m0 && divSubTerms m1

So our property (tries) to state that so long as a value satisfies divSubTerms, then we won’t have division by 0 (can you spot the problem in divSubTerms?):

div_prop :: M -> Property
div_prop m = divSubTerms m ==> eval m /= Nothing

Assuming we’ve defined an Arbitrary instance for M (just like in QuickCheck—however, we just have to implement the arbitrary method; the shrink method is superfluous), we are ready to run SmartCheck.

divTest :: IO ()
divTest = smartCheck args div_prop
  args = scStdArgs { qcArgs   = stdArgs 
                   , treeShow = PrintString }

In this example, we won’t redefine any of QuickCheck’s standard arguments, but it’s certainly possible. the treeShow field tells SmartCheck whether you want generalized counterexamples shown in a tree format or printed as a long string (the default is the tree format).

Ok, let’s try it. First, SmartCheck just runs QuickCheck:

*Div0> divTest 
*** Failed! Falsifiable (after 7 tests):
D (D (D (A (C (-20)) (D (D (C 2) (C (-19))) (C (-8)))) (D (D (C (-23)) (C 32)) (C (-7)))) (A (A (C 2) (C 10)) (A (C (-2)) (C 13)))) (D (A (C 12) (C (-7))) (D (A (C (-29)) (C 19)) (C 30)))

Oh, that’s confusing, and for such a simple property and small datatype! SmartCheck takes the output from QuickCheck and tries systematic shrinking for the one failed test-case, kind of like SmallCheck might. We get the following reduced counterexample:

*** Smart Shrinking ... 
*** Smart-shrunk value:
D (C 0) (D (C 0) (C (-1)))

Ok, that’s some progress! Now SmartCheck attempts to generalize this (local) minimal counterexample. SmartCheck has two generalization steps that we’ll explain separately although SmartCheck combines their results in practice (you can turn off each kind of generalization in the flags). First, SmartCheck tries to generalize values in the shrunk counterexample. SmartCheck returns

*** Extrapolating values ...
*** Extrapolated value:
forall x0:

D x0 (D (C 0) (C (-1)))

Ahah! We see that for any possible subvalues x0, the above value fails. Our precondition divSubTerms did not account for the possibility of a non-terminal divisor evaluating to 0; we only pattern-matched on constants.

In addition, SmartCheck tries to do something I call constructor generalization. For a datatype with a finite number of constructors, the idea is to see if for each subvalue in the counterexample, there is are subvalues that also fail the property, using every possible constructor in the datatype. So for example, for our counterexample above

*** Extrapolating constructors ...
*** Extrapolated value:
forall C0:
  there exist arguments s.t.

D (C 0) (D C0 (C (-1)))

So in the hole C0, SmartCheck was able to build a value using each of the constructors C, A, and D (well, it already knew there was a value using CC 0.

SmartCheck asks us if we want to continue:

Attempt to find a new counterexample? ('Enter' to continue; any character
then 'Enter' to quit.)

SmartCheck will omit any term that has the “same shape” as D (C 0) (D (C 0) (C (-1))) and try to find a new counterexample.

*** Failed! Falsifiable (after 9 tests):  
A (A (D (C (-20)) (A (C (-5)) (C (-32)))) (D (A (C 6) (C 19)) (A (C (-3)) (A (C (-16)) (C (-13)))))) (D (C 29) (D (C (-11)) (D (C 11) (C 23))))

*** Smart Shrinking ... 
*** Smart-shrunk value:
A (C (-1)) (D (A (C 1) (C 1)) (D (C 1) (C 2)))

*** Extrapolating values ...

*** Extrapolating Constructors ...

*** Extrapolated value:
forall values x0 x1:

A x1 (D x0 (D (C 1) (C 2)))

We find another counterexample; this time, the main constructor is addition.

We might ask SmartCheck to find another counterexample:


*** Extrapolating ...
*** Could not extrapolate a new value; done.

At this point, SmartCheck can’t find a newly-shaped counterexample. (This doesn’t mean there aren’t more—you can try to increase the standard arguments to QuickCheck to allow more failed test before giving up (maxDiscard) or increasing the size of tests (maxSize). Or you could simply just keep running the real-eval-print loop.)

A Fistful of Dollars

March 7, 2012

A while back I attended an excellent DARPA Proposers’ Day. It got me thinking, “If I had a fistful of (a few million) dollars to provide researchers, what sort of program might I start?” Two ideas struck me:

Getting software right, fast

Software is usually developed in a world with incomplete requirements and with users (and attackers) that try to break your assumptions.  Furthermore, more often than not, developers themselves (perhaps with a QA team, if they’re lucky) largely decide when they’re “done”—when a version is bug-free enough, secure enough, and meets the specifications to release.

There are a dizzying number of approaches to improving quality: testing (e.g., unit testing, penetration testing, black-box and white-box testing), compiler tools (e.g., type-checking, warnings), formal verification (e.g., model-checking, static analysis, decision procedures), programming methodologies (e.g., pair-programming, extreme programming), etc.  The question is, what combinations of approaches provide benefits and by what factor?

A major portion of the challenge would be to fund experiments that are large enough to draw statistical conclusions about different approaches.  Teams would get points for finishing faster, but lose points for bugs, security vulnerabilities, unimplemented features, etc. uncovered by a separate “read team”.  To make the experience more realistic, I’d have career programmers (not students in an undergrad course) to participate in the experiments.  A major portion of the program would be just designing the experiments to ensure statistical conclusions could be drawn from them, but I imagine the results might be enlightening.

A Believe Search Engine

Imagine a “search” engine in which you could search not for facts, but beliefs.  For example, I might query the likelihood that someone believes that the Earth revolves around the Sun in Russia in 1600.  Or I might ask what the likelihood of someone believing in global warming in Japan in 2012.  Or whatever else.  The engine would return some probability, perhaps with a confidence interval.  I’d imagine being able to query some believe based on geography, time period, religion, nationality, sex, and so forth.  Of course this is a hard, hard problem, and would likely involve search as well technology akin to IBM’s Watson.  I’d be curious to know if it were feasible at all.

When Formal Systems Kill

February 18, 2012

What happens when a philosophy professor (with a M.S. in Computer Science) and a formal methods engineer (like me) collaborate?  The outcome in my case is a paper entitled, “When Formal Systems Kill: Computer Ethics and Formal Methods” (pdf). In the paper, we give an ethical treatment of the use of formal methods. Our goal in the paper is not so much to make definitive claims on the subject but to raise awareness of the issue for further philosophical treatment. The subject has been neglected by ethics philosophers.

Darren (my coauthor) and I had a hard time deciding where to try to publish this. We ended up submitting it to the American Philosophical Association’s (APA) newsletter on Philosophy and Computers, where it was accepted for the Fall 2011 edition. (The APA is the “ACM of Philosophy” within the United States.)

The paper was a number of years in the making, being a side-project for me. I can’t say I agree with everything we wrote, but I hope it serves as a useful starting point for the dialogue between ethics and formal methods.

Who’s Afraid of Software?

January 19, 2012

Who’s afraid of software?  I mean viscerally, stomach-knotting afraid.  Afraid like you might be when you come across a snake or a bear, or when you are mugged.  Do you obsess about a phishing attack each time you open your email?  Do you worry there’s an eavesdropper when you join the open wifi access point in a coffee shop?  Do you worry your software will fail in your modern automobile or aircraft?

I listened to a Freakonomics podcast about risk, uncertainty, and beliefs.  One point made during the show was that our fears are shaped by evolution—to our ancestors, it made sense to be afraid of threatening animals.  In modern life, however, our fears don’t match risks—we’d be much better off being afraid of cheeseburgers, as pointed out in the show.  Some people are afraid of modern risks.  I know people afraid of cancer, for example.

That got me thinking about fearing software.  Software is certainly among the most complex artifacts created by humans.  Modern cars contain 100+ million lines of code.  Nearly every day there is a story about a large corporation being hacked and of cyber-warfare between nations.

My question is serious—I really do wonder if people are genuinely afraid of software.  I work in the area of software assurance, and while I take precautions against viruses, phishing attacks, etc., I don’t particularly worry about software failures, even when my life might depend on it.  This is despite issues just last year like this and this in automotive software.  I get to see a somewhat how the sausage is made, and in general, we only exercise a small fraction of the state-space of deployed software in validation and in actual usage.  There are legitimate risks, but there seems to be very little fear.

Perhaps like a medical doctors stereotypically neglecting their own health, I don’t worry day-to-day about software assurance despite working in the field.  But it seems nobody else really fears software, either.

In the podcast, the topic of polarizing claims, like global warming, is discussed.  Outside of academic circles, one’s view on the risks of software are not so polarizing—your views on the topic won’t cause your friends or colleagues to disparage you (indeed, if anything, the main risk is likely boring others in discussing the topic!).  I wonder just what the “global warming” of software might be in the future.

Copilot featured in Aerospace America

December 10, 2011

My Copilot project, sponsored by NASA, was featured in the Aerospace America magazine “Year in Review” edition (December 2011). We recently finished up the project, which focused on solving the problem of failure in complex embedded systems (like flight control systems in commercial aircraft).  Our approach was a language called Copilot for runtime monitoring of embedded systems corroborated with experiments on airspeed sensor systems and an autopilot communication link.

Here’s a link to the page from the magazine mentioning us from the author’s page. (The relevant paragraph is in the middle of the right column).

Thanks to Galois and the National Institute of Aerospace where the work was done, the NASA folks we collaborated with, and especially Alwyn Goodloe, Nis Wegmann, Sebastian Niller, and Robin Morisset who worked on the project with me.