Space age planning brings problems down to earth

NASA's remote Mars debugging tricks have lessons for earthlings

On a rocky plateau 300 million miles from here, the Mars Exploration Rover A -- known to its friends and PR operatives as Spirit -- sits quietly, conserving its strength after a near-fatal computer breakdown. As it was not named the Mars Static Nervous Wreck A, you may assume that things are not going exactly to plan.

Much closer to home, gaggles of geeks sit with furrowed brows as they work out exactly why the machine went mad just a third of the way through its mission, when everything was looking spotless. Instead of preparing to drill into a large and tempting rock, the robot had the silicon equivalent of a prolonged and devastating epileptic fit: when HQ tried to tune in, all they heard was binary gibberish.

You will not be surprised to learn that the number one suspect for the space probe's misery is buggy software, nor that Spirit's twin, Opportunity, is being handled with the kiddest of gloves as it too unfurls its sensors on the other side of the planet.

It's happened before: same planet, same people, same problem. In 1997, the Mars Pathfinder mission was also busy scurrying across the Martian surface -- not doing so much science, but testing out many of the techniques used by the Rovers. Just as with the Rovers, a mysterious problem caused the machinery to reset itself continually, never getting to the point where it could do its programmed tasks or return information. And as with the Rovers, the engineers running the mission were presented with a mystery: the local replica of the robot wasn't repeating the problem and you can't slap a logic analyser on a chip from the best part of half a billion miles away.

You might think, reasonably, that a NASA-developed space mission would fly with as many redundant, hand-crafted ultra-reliable systems as man has ever seen. It doesn't work like that, especially with robotic craft where lives aren't at stake. One of the most important factors is obtaining as much science per dollar as possible -- which means keeping launch costs down and active payload up. Redundant systems are dead weight. And far from automatically improving safety, back-up systems increase complexity and can even reduce reliability -- just ask anyone with experience of uninterruptible power supplies.

Hand-crafted code takes a very long time to create and verify: the timescales and budget are such that you want your team to be working on the unique aspects of the mission. All these factors mean that the technology on Mars looks awfully like that on your desk -- a general-purpose, standards-based platform, like many others, running a commercial operating system and doing custom tasks.

The problems with Pathfinder boiled down to priorities, both technical and human. The technical side was a classic problem, known as priority inversion, where a low priority task had taken exclusive ownership of a shared resource, only to be interrupted by a higher priority task. The high priority task also needed exclusive ownership of the same resource, so it waited for it to become free -- which it couldn't, because medium priority tasks kept the suspended low priority task from running long enough to let go. After a while, safety software on the spacecraft noticed that the high priority task hadn't completed within its designated time: the computer therefore reset itself and stopped work until it got the next day's communication from Earth.
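For readers who like to see the mechanism rather than take it on trust, here is a toy model of that lock-up. It is a sketch under loud assumptions, not the flight software: the tick-based scheduler, the task names, the workloads, the eight-tick watchdog deadline and the medium priority task that hogs the processor are all invented for illustration. The `inherit` switch models priority inheritance, a widely used remedy for this class of bug, in which the task holding the resource temporarily borrows the priority of whoever is waiting for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int      # higher number = more urgent
    start: int         # tick at which the task becomes ready
    work: int          # ticks of CPU time still needed
    needs_lock: bool   # must hold the shared resource while it runs
    done_at: Optional[int] = None


def simulate(tasks, watched, deadline, inherit=False, max_ticks=100):
    """Run a toy fixed-priority scheduler, one task per tick.

    Returns (tick the watched task finished, tick the watchdog fired);
    exactly one of the two is None.
    """
    owner = None  # task currently holding the shared resource
    for tick in range(max_ticks):
        if watched.done_at is not None:
            return watched.done_at, None
        if tick >= watched.start + deadline:
            return None, tick  # watchdog: deadline missed, reset the computer

        ready = [t for t in tasks if t.start <= tick and t.work > 0]

        def effective(t):
            # With inheritance, the lock holder borrows the priority of
            # the most urgent task waiting on the lock.
            if inherit and t is owner:
                waiting = [w.priority for w in ready
                           if w.needs_lock and w is not owner]
                return max([t.priority] + waiting)
            return t.priority

        for t in sorted(ready, key=effective, reverse=True):
            if t.needs_lock and owner is not None and owner is not t:
                continue  # blocked: someone else holds the resource
            if t.needs_lock:
                owner = t
            t.work -= 1
            if t.work == 0:
                t.done_at = tick + 1
                if owner is t:
                    owner = None  # release the resource on completion
            break
    return watched.done_at, None


def scenario():
    # Hypothetical workloads, chosen only to make the effect visible.
    low = Task("low", priority=1, start=0, work=3, needs_lock=True)
    high = Task("high", priority=3, start=1, work=2, needs_lock=True)
    medium = Task("medium", priority=2, start=1, work=10, needs_lock=False)
    return [low, high, medium], high


tasks, high = scenario()
print(simulate(tasks, high, deadline=8))                # (None, 9)

tasks, high = scenario()
print(simulate(tasks, high, deadline=8, inherit=True))  # (5, None)
```

In the first run the high priority task never gets the resource and the watchdog trips at tick 9; with inheritance switched on, the low priority task is hurried through its work and the high priority task finishes at tick 5, well inside its deadline.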

The bug had been spotted before landing, but couldn't be reproduced back at base -- it only happened when more data than anyone expected was being transferred and under certain timing conditions. Although nobody decided the bug was unimportant, it was deemed less important -- and harder to find and fix -- than many other ongoing problems, and the focus of the engineers was left on flight and landing. If the bug reappeared, they decided, the mission wouldn't be in jeopardy: the safety systems would ensure its survival and opportunities for recovery. The engineers considered that as these assumptions had been proved right and the mission was in the end a success, the prioritisation was correct: a hard conclusion to argue against.

That recovery was aided by a couple of design decisions. The software on the spacecraft had a lot of diagnostic and logging features -- the sort that normally get removed before shipping -- in place and functional. This was part of a larger philosophy, "test what you fly and fly what you test": if you're responsible for looking after a system in the field, make as few changes as possible between testing and deployment -- and don't touch the test system afterwards. Once the diagnostic logs were retrieved from the spacecraft, the problem could be replicated locally -- with confidence that the results accurately reflected what was really happening, and that a correct fix could be made, tested and deployed effectively.
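What might "diagnostics left in place" look like in practice? One common pattern, offered here as a guess at the general idea rather than a description of the Pathfinder code, is a small fixed-size trace buffer that every build carries, test and flight alike. The class name `TraceBuffer`, its methods and the event strings below are all hypothetical.

```python
import time
from collections import deque


class TraceBuffer:
    """Fixed-size event log: old entries fall off as new ones arrive."""

    def __init__(self, capacity=256):
        # deque with maxlen discards the oldest entry when full,
        # so memory use stays bounded however long the system runs.
        self._events = deque(maxlen=capacity)

    def record(self, task, event, now=None):
        # 'now' can be supplied for repeatable tests; otherwise use a
        # monotonic clock so timestamps never run backwards.
        stamp = time.monotonic() if now is None else now
        self._events.append((stamp, task, event))

    def dump(self):
        # Snapshot of the surviving events, oldest first, ready to be
        # sent home and replayed against the local replica.
        return list(self._events)


trace = TraceBuffer(capacity=3)
trace.record("low", "took resource", now=0)
trace.record("high", "waiting for resource", now=1)
trace.record("medium", "running", now=2)
trace.record("watchdog", "deadline missed", now=9)
print(trace.dump())
```

With a capacity of three, the earliest event has already been dropped by the time the watchdog entry is written: the price of bounded memory is that you only ever see the run-up to the failure, which is usually the part you want.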

For all this to happen with proprietary software that the engineers hadn't developed themselves, two further factors had to be in place. The company behind the software -- Wind River Systems -- had to be there with exemplary support: the bug wasn't in their code, but was dependent on abstruse aspects of the way the operating system worked. Linked to that, the mission engineers had to have an extraordinary knowledge of the guts of the operating system. As a report after the event said: "A good lesson when you fly COTS [commercial off-the-shelf] stuff -- make sure you know how it works."

It's tempting to say that with luck, the current problems with Spirit will be fixable -- but luck's not the factor. It may make for poor advertising copy, but the truth is that good software's got little to do with the operating system you buy, what languages you use or what rapid development system you use to cook your code. Solid knowledge, sound engineering discipline and a methodology spring-loaded for safety will save your project: if you can demand these from your suppliers and build them into your team, you too can rescue a project at half a billion miles. Lack any of these, and you'll be brought back to earth in no time.