The Truth about the Twitter crash

Summary: It wasn't an attack or too many Euro 2012 football fans; Twitter's crash came from a failed update of Twitter's own infrastructure.

For over an hour, Twitter was down without even a fail whale to warn us.

Twitter explains yesterday's failure.

Yesterday, June 21st, Twitter crashed at around noon Eastern time. Later that same afternoon, after some misfires, Twitter finally came back up for good. So what happened? Was it a distributed denial of service (DDoS) attack orchestrated by UGNazi? Too many Euro 2012 football fans? The summer solstice!? In the event, Twitter reports it wasn't any of these.

Twitter Vice President of Engineering Mazen Rawashdeh blogged, “We … found that there was a cascading bug in one of our infrastructure components.” And what's that? “A ‘cascading bug’ is a bug with an effect that isn’t confined to a particular software element, but rather its effect ‘cascades’ into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter.”

We still don't know exactly what the bug was, but the statement certainly implies that it was introduced in a new version of Twitter's infrastructure software. From the timing, 9 in the morning Pacific time, I strongly suspect that Twitter rolled out the new software and the platform broke immediately.

Rawashdeh continued, “We began recovery at around 10:10am PDT, dropped again around 10:40am PDT, and then began full recovery at 11:08am PDT. We are currently conducting a comprehensive review to ensure that we can avoid this chain of events in the future.”

While Twitter fans panicked (one sample tweet ran “OMG..twitter was down....closest thing to living without oxygen for most of us....”), Rawashdeh was correct when he wrote that “For the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds. Not today though.”
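For readers who want to check that arithmetic, here is a minimal sketch that converts an availability percentage into uptime per average 24-hour day. The function names are illustrative only and have nothing to do with how Twitter measures reliability; the sketch simply assumes availability is spread evenly across the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_uptime(availability_percent):
    """Return (hours, minutes, seconds) of uptime in an average 24-hour day."""
    up_seconds = round(SECONDS_PER_DAY * availability_percent / 100)
    hours, remainder = divmod(up_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


def daily_downtime_seconds(availability_percent):
    """Return the average number of seconds per day the site is unavailable."""
    return SECONDS_PER_DAY * (1 - availability_percent / 100)


if __name__ == "__main__":
    # The two figures quoted in Rawashdeh's post.
    for pct in (99.96, 99.99):
        h, m, s = daily_uptime(pct)
        down = daily_downtime_seconds(pct)
        print(f"{pct}% -> {h}h {m}m {s}s up, about {down:.1f}s down per day")
```

By this reckoning, 99.96% works out to about 23 hours, 59 minutes and 25 seconds of uptime a day and 99.99% to about 23 hours, 59 minutes and 51 seconds, so the "40-ish seconds" in the quote sits between the two figures.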

Indeed, Twitter is much more stable than it once was. Longtime Twitter users recall when a visit by the fail whale seemed like an almost daily occurrence. Still, now that we've gotten used to a reliable Twitter, our expectations are higher, and we'll be all the more upset when things fail.

Talkback

3 comments
  • OMG!!!

    NEW YORK (AP) -- In one of the largest social media disasters in US history, Obama administration officials confirmed that Twitter was completely offline Thursday for at least an hour. The officials, who spoke on condition of anonymity, would not speculate on what the nationwide death toll could have been. Fox News has reported worldwide deaths could number in the tens of millions, but this could not be confirmed.

    Internet physicists at the CERN research facility in Europe say they have no explanation for the outage at this time. There is speculation, however, that it may have been caused by a gravitational fluctuation in local space-time, or that the one wire that was frayed last month just shorted out. The physicists added they told that guy Steve to change the wire out, but "...he's always high anyway, so you can't get him to do anything." Steve was unavailable for comment.

    In a related development, it has been reported that Kathy, a senior technical manager at Electrocorp Industries in Atlanta, claims her project has been completed three weeks ahead of schedule. Said Kathy, "I'm not sure what happened. Everyone was saying at lunch they might not be able to make deadline, and suddenly it's on my desk, as if by magic."

    When asked if the twitter outage and the sudden surge in productivity may have been related, Kathy claimed to not understand the question.
    pishaw
    • I wish I could up-vote that by 100

      LOL!
      William Farrel
  • Still, must ask the question

    did anyone really notice the outage? I certainly didn't -
    Cynical99