X
Tech

Apple's 'goto fail' tells us nothing good about Cupertino's software delivery process

The fact that Apple's infamous SSL validation bug actually got out into the real world is pretty terrifying.
Written by Matt Baxter-Reynolds, Contributor

There's been a lot of press about Apple's "goto fail." To catch you up, this was a bug that caused the proper validation of SSL certificates OS X and iOS to fail.

The result was that it was easier to trick any iOS and Mac devices to accept invalid certificates, and to present them as valid to the user.

Most of the analysis focused on the use of C's goto statement, particularly around whether its being used at all constituted bad practice and, by extension, and whether the bug was somehow linked to the usage of goto.

The fact that this code made it into production at all is a shocking indictment of Apple's engineering team. Forget goto -- who cares about how the developers put their code together? The issue here is that the problem was not found, and resolved, before the code got out of the door.

Practice

Here's the actual code in question:

SSL validation setup failure

If you're not used to reading code, what's happening here is that this code starts with a certificate that needs checking. The line that does the actual magic is sslRawVerify -- this is the method that actually checks the certificate. The part with the error in it is just setting things up to make that call.

The problem with the code is that the erroneous second goto stops sslRawVerify from ever being called. The function returns without doing this work, simply returning an optimistic "everything's OK!" back to the caller.

The goto statement is quite old-fashioned, and hasn't been regarded as a "smart" thing for developers to use for a good long while. The problem is that it allows developers to create "flow" in code that is unnatural. Most programming constructs are clear to read -- they tend to go in one direction, branching when decisions need to be made, or looping around when sets of data need to be worked on. Working in defined structures in that way allows developers to glance at code and understand its meaning.

Goto allows the program to fly around all over the place, resulting in missed subtleties. This makes code hard to read.

Reviewing

There is nothing unclear about the code that contains the bug. There are no surprises, and it's easy to read. Specifically, the developer's intention comes through loud and clear. goto may be old-fashioned, but it's not inappropriate in this setting.

The second goto fail obviously should not be there. Either the line has been duplicated by accident, there was a if check above it that was deleted but left in by the developer by accident, or there was a problem merging the code written by two developers.

(Merging is a little complicated for non-programmers to understand. From time to time, two developers will work on the same code in individual copies of the master code. When two copies of the same code are sent back to that master code, the source control system that manages everything has to make decisions about how to combine the two developers' work. This process can go wrong.)

Developers always make mistakes. They always create bugs. Any software organisation understands this and knows it's to be expected. You can judge software organisations by a) how good they are writing code that's easy to maintain, and b) how good they are at stopping bugs before they go into production.

My ZDNet colleague Stilgherrian flagged one part of the process that failed here, namely that of code reviews. In code reviews, a second developer looks at the code to try and find issues with it. An issue can be an error or just an area in the code that needs refining. 

Code reviews are tricky to call in this situation. They can't cover 100% of the code base, so it's understandable that no one developer managed to spot this bug created by another developer. 

Also, this code is open source. Apparently, no one outside Cupertino using it spotted the issue either. So maybe we shouldn't give the Apple developers a hard time for that. We can, however, give them a hard time for something much worse.

Automation

The real fail in all this comes from the fact that the bug was not detected by any automated test suites.

Bugs in software are found by testing. Testing happens in two ways -- someone sits down and tries the software looking for faults (manual testing), or separate software is produced that tests the software (automated testing).

Competence at automated testing is an indicator of sophistication of the software development team and its sponsors -- the better you are at developing software, the more likely you are to use automated testing.

However, automated testing is a complex proposition--it's expensive and difficult. For many, automated testing is optional, with good enough results being achieved from manual testing.

But there are certain classes and types of software where "good enough" cannot be achieved with manual testing. Validation of SSL certificates falls into both camps. It's a core part of security infrastructure, and it's baked into the operating system.

This software needed automated testing because it would be too difficult to do manually. Since testing software requires the rejection of invalid certificates, you need a bunch of them. Practically the only way to do that is to use automated testing where invalid certificates are included with the suite.

Conclusion

The problem here is not whether goto was used or not. The problem isn't even whether the code was reviewed by another developer. The problem is that somehow Apple was able to put that code into production. Why on earth was that not caught by an automated test suite?

And, as my Twitter friend Explainoit said:

Screen Shot 2014-03-18 at 20.01.18
Indeed.

One way to read that, however, is that the lack of literal s*** falling on Cupertino implies that no one uses the Apple kit for anything important.

This bug ended up just being embarrassing. Can you imagine if this bug had been found in Windows and server rooms and enterprises all over the land had to be emergency patched with the fix SSL?

What do you think? Post a comment, or talk to me on Twitter: @mbrit.

Editorial standards