Salesforce Outage - Klaatu Barada Nikto

Written by Paul Greenberg, Contributor

As I'm sure half the known and two-thirds of the yet-to-be-discovered universe is aware by now, salesforce.com had an outage on January 6. It was due to a memory allocation failure in the main servers, and it whacked the failover to the backup servers too. It was fixed manually, and the service was back in action after 38 minutes. The number of transactions on that day and the next day was the same (177,000,000 each day). Transactions were a little slower on the day of the outage (0.320 sec.) than on the next day (0.266 sec.). The problem wasn't just solved by the next day; it was solved within 38 minutes on the same day.

What's astonishing, though, is the form of the outcry - not the outcry itself. I expected that customers who use salesforce.com were going to be mad. I expected that the Twitterverse would buzz away as it always does. I find out about a lot using Twitter. But I'm also mindful that the characterizations on Twitter are each entirely personal, and that Twitter is less a source of accurate, granular information than a great source of events (as in both "events one attends" and "occurrences of note"), opinions, and pointers to information. How much anyone treats the commentary as valid is up to the reader.

The journalism around the salesforce.com outage was so poor it almost defies description. But let me try. It was ridiculous - from the subtle to the obvious. The obvious was things like "Salesforce.com.....was unreachable for the better part of an hour." Technically, I can't fault the writer, but 38 minutes - which is the better part of an hour - isn't 59 minutes, and when a system is out, that is a notable difference. The same article questions the validity of the cloud as a whole by saying "a single disruption paralyzes a small fraction of the world's economy as a whole." Again, I'm sure it's a small fraction of the world's economy - a very, very, very, very, very, very, very to the nth small fraction. The recession is a bigger disruption, I would think. Also, in this particular article, 900,000 becomes "nearly a million", which could be 999,999 - a pretty big difference. The conclusion? Maybe we shouldn't put all our eggs in one basket because the cloud has a dark side. This all comes from "The Register" in the UK, which has a subhead of "biting the hand that feeds IT," so they may feel obligated to expose this "dark underbelly." I'm not by any means an IT fanboy, nor am I a traditional journalist, and I love the idea that there are people and institutions out there that will call IT on some of its more insane claims and nuts behavior, but reasonableness needs to be the operative principle here. Fine, be strident, edgy and tough. I think I get that way. But this article is just inflammatory - though that's giving it more credit than it's probably worth.

But even more traditional IT-oriented publications like eWeek, in their article on the same thing, had a slightly subtler statement about how "the problem thwarted over 177,000,000 transactions...." How in the world can they make that judgment? There were 177,000,000 transactions for the day. I somehow doubt they all occurred in that 38-minute period. Given that NO transactions registered during that period, it's impossible to know how many were attempted. IT Pro, another of the venerables, headlined "Salesforce Outage Darkens Cloud Computing," joining The Register in leaps of lack of faith. For chrissakes, it was an OUTAGE, not a global catastrophe. Godzilla didn't invade New York or Tokyo. No one is running around saying "klaatu barada nikto" to save us from destruction by aliens who want to end human civilization. It doesn't call into question the nature of man and his relationship to the environment. This was an OUTAGE - a disruption based on a problem that got fixed. I doubt there was a measurable loss of billions of dollars. The bond market didn't join the collapse of the global economy.
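To put the "177,000,000 thwarted transactions" claim in proportion, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the day's transactions were spread evenly across 24 hours, which real traffic almost certainly is not. The function name is hypothetical, and the result says nothing about how many transactions were actually attempted during the outage.

```python
MINUTES_PER_DAY = 24 * 60


def transactions_in_window(daily_total: int, window_minutes: float) -> float:
    """Transactions expected in a window IF the daily load were perfectly even.

    The even-load assumption is illustrative only; real usage peaks
    during business hours.
    """
    return daily_total * window_minutes / MINUTES_PER_DAY


daily_total = 177_000_000  # daily transaction count quoted in the post
outage_minutes = 38        # outage length quoted in the post

estimate = transactions_in_window(daily_total, outage_minutes)
share = outage_minutes / MINUTES_PER_DAY

print(f"{estimate:,.0f} transactions, about {share:.1%} of the day")
```

Under that assumption the 38-minute window covers roughly 4.7 million transactions, or about 2.6% of a day's volume, nowhere near the full 177,000,000.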

This was an O-U-T-A-G-E.

Was it a problem? Of course. The benchmark for all cloud and on-demand providers and ISPs is 99.99% uptime. This doesn't meet that necessary benchmark. But salesforce.com has had outages in the past, and it fixed them. This one was short. Life continued and salesforce.com happily continued to grow. To call the cloud into question because their servers were down for 38 minutes is a little bit of an emotional overload. When Google went down for a while and Gmail was disrupted, we all survived. I imagine that back in the earlier pre-cloud days, when on-premise still ruled the land and a server went down at a company, no one assumed the company was so flawed that it needed to shut down. Just that they had to fix the server. When Comcast has node problems that shut down Internet access for hours at a time, I don't remember anyone calling for quarantining the neighborhood so the node breakdown disease doesn't spread to other neighborhoods.
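For a sense of what 99.99% uptime allows, the sketch below computes the downtime budget for a few measurement windows and compares the 38-minute outage against each. The windows (a day, a 30-day month, a 365-day year) are assumptions chosen for illustration; the post does not say which window the benchmark is measured over, and the function name is hypothetical.

```python
def downtime_budget_minutes(uptime_target: float, window_days: float) -> float:
    """Minutes of downtime permitted by an uptime target over a window."""
    return (1.0 - uptime_target) * window_days * 24 * 60


UPTIME_TARGET = 0.9999  # the 99.99% benchmark cited in the post
OUTAGE_MINUTES = 38     # outage length cited in the post

# Measurement windows are illustrative assumptions, not from the post.
windows = (("1 day", 1), ("30-day month", 30), ("365-day year", 365))

for label, days in windows:
    budget = downtime_budget_minutes(UPTIME_TARGET, days)
    verdict = "within" if OUTAGE_MINUTES <= budget else "over"
    print(f"{label}: budget {budget:.2f} min, 38-minute outage is {verdict} budget")
```

Measured over a single day or month, a 38-minute outage blows well past the budget. Over a full year, the budget is about 52.6 minutes, so the verdict would depend on how much other downtime occurred in the same year.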

Be smart and approach this outage the way it should be approached. First, read the very sane commentary of Denis Pombriant in this article. Then, if you're an affected salesforce.com customer, assess what problems and damage it may have caused - aside from the most likely 38 minutes of emotional distress - and talk to the salesforce.com folks if something needs to be done. I suspect, but admittedly don't know, that there was little actual damage beyond frustration for those 38 minutes or a bit more. But that's up to you as a salesforce.com customer to decide. If you're not a salesforce.com customer, put this whole thing in perspective. If a few nodes of a large ISP go down and a large group of people has no Internet access at all, that could be far more disruptive to more people than salesforce.com's outage. No one should be calling for the closure of the Internet or the destruction of the ISP as a result. We have mechanical failures in this world that can affect anywhere from one person to an uncountable number of people, more often than we would ever like. We have natural disasters that are far worse. Let's not blow this out of proportion. We're talking about an outage that was disruptive for 38 minutes.

That's it. My nearly book-length blog entry, which took me something short of a day to write, is completely done.

Klaatu barada nikto
