omnishambles (noun): A situation that has been comprehensively mismanaged, characterized by a string of blunders and miscalculations. Origin: Early 21st century: from omni- + shambles, first used in the British satirical television series The Thick of It. -- Oxford Dictionaries
On Tuesday, the Senate Economics Committee heard evidence as part of its inquiry into the train wreck that was Australia's 2016 Census. We already knew it was a confluence of failure, but now we know that it was worse. Much worse.
So many things went wrong, at least as far as the technology goes, and none of them should ever have happened.
One, the strategy for mitigating distributed denial of service (DDoS) attacks was the so-called "Island Australia". Cute name, but all it meant was blocking all traffic from outside Australia.
Given that only Australians physically in Australia were required to complete the census, that might seem like a fair strategy. But the prime minister's special advisor on cyber security, Alastair MacGibbon, put the committee right.
Customers of some Australia-based ISPs have their traffic routed in from overseas, he said, and IBM's own password reset function actually relied on international traffic. Did they not know this?
"So there was a fundamental failure in the logic of an Island Australia," MacGibbon said. "There certainly were better alternatives, yes."
Renowned network engineer Mark Newton was even blunter in his assessment. "If you're being DDoS'ed, geoblocking is not a mitigation unless you have literally no idea what you're talking about," Newton tweeted.
Two, Island Australia never even happened, because an upstream ISP didn't, or couldn't, block all traffic from overseas. Vocus had warned IBM that the secure network they'd built for the census was on an IP address range that Vocus couldn't completely block. IBM told the committee that reconfiguring their network would have involved re-doing too much work, however. Re-addressing the network would be a "last resort" option, they said, although what the previous options would have been was not made clear.
The question of whether it really was impossible for Vocus to block a specific address range I'll leave to a network engineer. Either way, it would seem that IBM was told upfront their plan wouldn't work, but hey, who can be bothered, right?
Three, as ZDNet has already reported, the first three DDoS attacks peaked at just 3 gigabits per second (Gbps) but overwhelmed the system. Such attacks routinely reach 500Gbps, and last weekend's DDoS attack on Dyn DNS reached 1000Gbps.
The fourth DDoS attack was a mere 568Mbps. It hit the application itself, rather than the network, and by IBM's own admission the application ran out of threads.
"Distributed denial of service attacks are eminently predictable and should have been expected," MacGibbon said. "These were eminently small attacks, and they should not have degraded the ABS system."
At less than 1 percent of the volume of common DDoS attacks, these didn't even register on public third-party DDoS reporting sites. They were nothing.
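For anyone who wants to check that "less than 1 percent" claim, here's a quick back-of-the-envelope sketch. It uses only the figures quoted above and assumes decimal units (1Gbps = 1000Mbps). The constant and function names are mine, purely for illustration.

```python
# Figures quoted in the article, all converted to megabits per second.
# Assumes decimal units: 1Gbps = 1000Mbps.
ROUTINE_ATTACK_MBPS = 500 * 1000   # attacks "routinely reach 500Gbps"
FIRST_THREE_PEAK_MBPS = 3 * 1000   # first three attacks peaked at 3Gbps
FOURTH_ATTACK_MBPS = 568           # fourth attack was 568Mbps


def share_of_routine(attack_mbps: float) -> float:
    """Return the attack size as a percentage of a routine 500Gbps attack."""
    return attack_mbps / ROUTINE_ATTACK_MBPS * 100


print(f"3Gbps peak:     {share_of_routine(FIRST_THREE_PEAK_MBPS):.2f}% of 500Gbps")
print(f"568Mbps attack: {share_of_routine(FOURTH_ATTACK_MBPS):.2f}% of 500Gbps")
```

That works out to roughly 0.6 percent for the 3Gbps peaks and about 0.11 percent for the application-layer attack.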
Four, when an overloaded router was rebooted, it failed to reload its configuration. IBM acknowledged that this would have been picked up if they'd done a hard power-cycle test before launch.
Like, WTF? Seriously, what sort of Mickey Mouse outfit doesn't perform the basic resilience test of pulling power and data cables at random, and seeing what happens?
Four point five, as an aside, too many media outlets are reporting that the census site would have stayed up if the router had simply been turned off and back on again. A fun IT Crowd reference, sure, but not quite true, as we'll soon see.
Five, IBM's performance monitoring system was, to put it mildly, lame.
"The final straw, in a sense, that broke the camel's back, was the misinterpretation of data on a load monitoring system that was interpreted at first as possibly the exfiltration of data, or an actual hack, as opposed to an attack," MacGibbon said.
Yes. And the reason that happened, according to IBM executives giving evidence, is that the data from the system was delayed in reaching the dashboard, which meant that several minutes' worth of data was added up and reported as having happened in one minute.
I guess IBM has forgotten how to timestamp data records, and forgotten that any and every network application needs to cope with propagation delays. Or maybe the work experience kid was in charge that day.
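To see how that kind of false spike can arise, here's a minimal sketch. It is not IBM's monitoring code, which hasn't been published. The sample numbers and the function name are invented for illustration. Each record carries two timestamps: the minute the traffic actually happened, and the minute the record reached the dashboard.

```python
from collections import defaultdict

# Hypothetical samples: (minute the traffic occurred,
#                        minute the record reached the dashboard,
#                        megabytes sent).
# Records for minutes 1 to 3 are delayed and all arrive in minute 4.
samples = [
    (0, 0, 100),
    (1, 4, 100),
    (2, 4, 100),
    (3, 4, 100),
    (4, 4, 100),
]


def per_minute_totals(records, key_index):
    """Sum megabytes per minute, keyed on the chosen timestamp column."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(sorted(totals.items()))


# Bucketing on arrival time piles the delayed records into one minute.
by_arrival = per_minute_totals(samples, key_index=1)
# Bucketing on the record's own timestamp shows the real, flat traffic.
by_event = per_minute_totals(samples, key_index=0)

print("by arrival time:", by_arrival)  # {0: 100, 4: 400}
print("by event time:  ", by_event)    # {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
```

Same data, two very different graphs. Keyed on arrival time, a steady 100MB a minute turns into an apparent 400MB burst, which is exactly the sort of thing that looks like data walking out the door.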
The apparent spike in unidentified traffic out of the ABS system led them to take the whole system offline, and hit the big red button marked "Australian Signals Directorate". It was the right decision, given the circumstances, but it meant there was no going back.
Down goes the Census. So who's to blame here?
I'm sure ISPs NextGen and Vocus will end up shouldering some of the blame, if for no other reason than other players want to make themselves look less rubbish.
We clearly need to point the finger at IBM, because they failed to deliver working DDoS mitigation as called for in the original tender, on both the network layer and the application layer. They continue to claim that a geoblocking strategy like Island Australia can work, even though they couldn't get a relatively simple version to work.
IBM failed to perform basic resilience tests. And they failed to write competent software, for both the monitoring system and the application itself.
All of this is inexcusable, IBM.
But the real blame has to lie with the Australian Bureau of Statistics (ABS), the "responsible" government agency.
ABS failed to make sure its contractors were doing their job. As we've said so many times, you can outsource work, but you can't outsource responsibility. Yet the ABS submission to the inquiry is studded with claims that they'd received assurances which they didn't cross-check.
Add to that the debacle that was ABS' communication during the census failure. Add to that the separate debacle that was ABS' arrogance over the privacy issues. You've got an omnishambles of fabulous proportions.
All of this is inexcusable, ABS. All of this is foreseeable, too.
I see four possible explanations for the ABS' failure.
Maybe the rumours are true, that years of cost-cutting and outsourcing have left government agencies bereft of IT-savvy staff who can manage the outsourcing. Maybe there are some IT-savvy staff at ABS, but the corporate culture is so toxic that they felt unable to speak up. Maybe there are some IT-savvy staff at ABS, and they did speak up, but were ignored.
Or maybe it's some combination of the above.
In any event, the head of the ABS, Australian Statistician David Kalisch, still seems to think everything was fine on his end, and everyone else was to blame. The captain isn't taking responsibility for navigating his ship of stats. As lawyers might say, that fact goes to character.
But no-one comes out of this covered in glory. It's the other stuff.
Prime Minister Malcolm Turnbull had hoped to show the world how nimble, agile, and innovative Australia is. Instead, global news outlets like the BBC are running headlines such as "Australia #censusfail: Derision greets simple fix failure". He must be thrilled.
The Australian government has proven to be nimble in one way, though. It's managed to go beyond a mere omnishambles to create a hypershambles. Well done, everybody.