The Truth about the Twitter crash

It wasn't an attack, and it wasn't too many Euro 2012 football fans. Twitter's crash came from a failed update to Twitter's own infrastructure.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

For over an hour, Twitter was down without even a fail whale to warn us.

Twitter explains yesterday's failure.

Yesterday, June 21st, Twitter crashed at around noon Eastern time. Later that same afternoon, after some misfires, Twitter finally came back up for good. So what happened? Was it a distributed denial of service (DDoS) attack orchestrated by UGNazi? Too many Euro 2012 football fans? The summer solstice!? In the event, Twitter reports it wasn't any of these.

Twitter Vice President of Engineering Mazen Rawashdeh blogged, “We … found that there was a cascading bug in one of our infrastructure components.” And what's that? “A ‘cascading bug’ is a bug with an effect that isn’t confined to a particular software element, but rather its effect ‘cascades’ into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter.”

We still don't know exactly what the bug was, but the rollback certainly implies that it was introduced in a new version of Twitter's infrastructure software. From the timing, 9 in the morning Pacific time, I strongly suspect that Twitter rolled out the new software and the platform broke immediately.

Rawashdeh continued, “We began recovery at around 10:10am PDT, dropped again around 10:40am PDT, and then began full recovery at 11:08am PDT. We are currently conducting a comprehensive review to ensure that we can avoid this chain of events in the future.”

While Twitter fans panicked (one sample tweet ran “OMG..twitter was down....closest thing to living without oxygen for most of us....”), Rawashdeh was correct when he wrote that “For the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds. Not today though.”
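For readers who want to check that arithmetic, here is a minimal Python sketch that converts an availability percentage into uptime per average 24-hour day. The function name `daily_uptime` is my own invention for illustration, and the only inputs are the two percentages Rawashdeh quoted; nothing here comes from Twitter's own tooling.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in an average day


def daily_uptime(availability_percent):
    """Return (hours, minutes, seconds) of uptime in a 24-hour day.

    Hypothetical helper for illustration; it simply scales one day
    by the given availability percentage.
    """
    up_seconds = SECONDS_PER_DAY * availability_percent / 100
    hours, remainder = divmod(up_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), seconds


# The two figures quoted in Twitter's blog post.
for pct in (99.96, 99.99):
    h, m, s = daily_uptime(pct)
    down = SECONDS_PER_DAY * (100 - pct) / 100
    print(f"{pct}%: up {h}h {m}m {s:.0f}s, down about {down:.1f}s per day")
```

At 99.96% the site is up for about 23 hours, 59 minutes and 25 seconds a day, and at 99.99% for about 23 hours, 59 minutes and 51 seconds, so the "40-ish seconds" in the quote sits between the two. By the same math, an outage of more than an hour uses up months' worth of that downtime allowance in a single afternoon.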

Indeed, Twitter is much more stable than it once was. Longtime Twitter users recall when a visit by the fail whale seemed like an almost daily occurrence. Still, now that we've gotten used to a reliable Twitter, our expectations are higher, and the more upset we get when things fail.

Related Stories:

Tweetless in Seattle, also New York, San Francisco, etc., etc.
Smart USA does the math on Twitter about pigeon crap
EFF’s New Privacy Scorecard: Twitter wins, Foursquare loses
CIO view: Five tips for using Twitter
Pakistan censors Twitter: all may not be what it seems
