Salesforce Outage - Klaatu Barada Nikto

Summary: As I'm sure half the known universe and two-thirds of the yet-to-be-discovered universe are aware by now, Salesforce.com had an outage on January 6.


As I'm sure half the known universe and two-thirds of the yet-to-be-discovered universe are aware by now, Salesforce.com had an outage on January 6.  It was due to a memory allocation failure in the main servers, and it whacked the failover to the backup servers too.  It was manually fixed. Within 38 minutes it was back in action. The number of transactions on that day and the next day was the same (177,000,000 each day). Each transaction was a little slower the day of the outage (0.320 sec.) than the next day (0.266 sec.).  The problem was not only solved by the next day, it was solved within 38 minutes on the same day.  What's astonishing, though, is the outcry's form - not the outcry itself.  I expected that customers who use Salesforce.com were going to be mad. I expected that the Twitterverse would buzz away as it always does.  I find out about a lot using Twitter. But I'm also mindful that the characterizations on Twitter are each entirely personal and not necessarily the best source of accurate, granular information; they are more a great source of events (in both senses: "events one attends" and "occurrences of note"), opinions, and pointers to information.  How much anyone treats the commentary as valid is up to the reader.
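Incidentally, the per-transaction times quoted above work out to roughly a 20% slowdown on the outage day. A quick back-of-envelope check, using only the numbers cited in this post:

```python
# Average per-transaction times quoted for the outage day and the following day.
latency_outage_day = 0.320  # seconds per transaction on January 6
latency_next_day = 0.266    # seconds per transaction the next day

# Relative slowdown on the outage day versus the normal day.
slowdown = (latency_outage_day - latency_next_day) / latency_next_day
print(f"{slowdown:.1%} slower on the outage day")  # roughly 20%
```

Noticeable if you were watching a stopwatch, but hardly the end of civilization.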

The journalism around the outage was so poor it almost defies description.  But let me try.  It was ridiculous - from the subtle to the obvious.  The obvious was things like "Salesforce.com unreachable for the better part of an hour."  Technically, you can't fault the writer, but 38 minutes - which is the better part of an hour - isn't 59 minutes, which, when a system is out, is a notable time difference.   The same article questions the validity of the cloud as a whole by saying "a single disruption paralyzes a small fraction of the world's economy as a whole."   Again, I'm sure it's a small fraction of the world's economy - a very, very, very (to the nth) small fraction. The recession is a bigger disruption, I would think.  Also, in this particular article, 900,000 becomes "nearly a million," which could be 999,999 - a pretty big difference.  The conclusion? Maybe we shouldn't put all our eggs in one basket because the cloud has a dark side. This all comes from The Register in the UK, which has the subhead "biting the hand that feeds IT," so they may feel obligated to expose this "dark underbelly."  I'm not by any means an IT fanboy, nor am I a traditional journalist, and I love the idea that there are people and institutions out there that will check IT on some of its more insane claims and nuts behavior, but reasonable needs to be the operative principle here. Fine, be strident, edgy and tough.  I think I get that way. But this article is just inflammatory - though that's giving it more credit than it's probably worth.

But even more traditional IT-oriented publications like eWeek, in their article on the same thing, had a slightly subtler misstatement about how "the problem thwarted over 177,000,000 transactions...."  How in the world can they make that judgment? There were 177,000,000 transactions for the day. I somehow doubt they all occurred in that 38-minute period.  Given that NO transactions registered during that period, it's impossible to know how many were attempted.  IT Pro, another of the venerables, headlined "Salesforce Outage Darkens Cloud Computing," joining The Register in leaps of lack of faith. For chrissakes, it was an OUTAGE, not a global catastrophe.  Godzilla didn't invade New York or Tokyo.  No one is running around saying "klaatu barada nikto" to save us from destruction by aliens who want to end human civilization.  It doesn't call into question the nature of man and his relationship to the environment.  This was an OUTAGE - a disruption based on a problem that got fixed.   I doubt there was a measurable loss of billions of dollars.  The bond market didn't join the collapse of the global economy.
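To put a rough number on it: if you assume (and this is purely my assumption, not anything eWeek reported) that those 177,000,000 daily transactions were spread evenly across the day, the 38-minute window accounts for well under 5 million of them - a far cry from 177 million:

```python
# Rough estimate of transactions falling inside the outage window,
# ASSUMING a uniform arrival rate over the day (an assumption, not a fact).
DAILY_TRANSACTIONS = 177_000_000
OUTAGE_MINUTES = 38
MINUTES_PER_DAY = 24 * 60  # 1,440

est_in_window = DAILY_TRANSACTIONS * OUTAGE_MINUTES / MINUTES_PER_DAY
print(f"~{est_in_window:,.0f} transactions in a typical 38-minute window")
```

Real traffic isn't uniform, of course, but even a peak-hour window wouldn't get you anywhere near "over 177,000,000 thwarted transactions."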

This was an O-U-T-A-G-E.

Was it a problem? Of course.  The benchmark for all cloud, on-demand and ISP services is 99.99% uptime.  This doesn't meet that necessary benchmark.  But Salesforce.com has had some outages in the past. This one was short. They fixed it. Life continued, and Salesforce.com happily continued to grow.  To call the cloud into question because Salesforce.com's servers were down for 38 minutes is a little bit of an emotional overload.  When Google went down for a while and Gmail was disrupted, we all survived.  I imagine back in the earlier pre-cloud days, when on-premise still ruled the land and a server went down at a company, no one assumed the company was so flawed that it needed to shut down - just that it had to fix the server.  When Comcast has node problems that shut down Internet access for hours at a time, I don't remember anyone calling for quarantining the neighborhood so the node-breakdown disease doesn't spread to other neighborhoods.
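For perspective on that benchmark: 99.99% uptime ("four nines") allows only about 52.6 minutes of downtime per year, so yes, a single 38-minute outage burns most of an annual budget in one shot. The arithmetic is simple:

```python
# Downtime budget implied by a 99.99% ("four nines") uptime target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
UPTIME_TARGET = 0.9999

budget_year = MINUTES_PER_YEAR * (1 - UPTIME_TARGET)  # ~52.6 minutes/year
budget_month = budget_year / 12                       # ~4.4 minutes/month
print(f"Allowed downtime: {budget_year:.1f} min/year, {budget_month:.1f} min/month")
```

So the 38 minutes genuinely blows the monthly budget and eats most of the yearly one - which is exactly why it's a problem worth fixing, and exactly why it still isn't an apocalypse.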

Be smart and approach this outage the way it should be approached.  First, read the very sane commentary of Denis Pombriant in this article.  Then, if you're an affected customer, assess what problems and damage it may have caused - aside from the most likely 38 minutes of emotional distress - and talk to the Salesforce.com folks if something needs to be done. I suspect, but admittedly don't know, that there was little actual damage beyond frustration for those 38 minutes or a bit more.  But that's up to you as a customer to decide.  If you're not a customer, put this whole thing in perspective.  If a few nodes of a large ISP go down and there is no internet access at all for a large group of people, that could be far more disruptive to more people than Salesforce.com's outage.  No one should be calling for the closure of the internet or the destruction of the ISP as a result. We have mechanical failures in this world that can affect anywhere from one person to an uncountable number, more often than we would ever like. We have natural disasters far worse.  Let's not blow this out of proportion.  We're talking about an outage that was disruptive for 38 minutes.

That's it.  My nearly-a-book of a blog entry, which took me something short of a day to write, is completely done.

Klaatu barada nikto

Topics: Enterprise Software, Outage


  • Dammit

    I wanted to use that phrase in a title first!
  • Sigh . . .

    "Oh, don't worry, these people have a gazillion redundancies that will make them more reliable than a personal server or network."

    Bull. I can have a Linux box running for years without interruption.

    Somebody please tell me [b]again[/b] the supposed advantages of SaaS, which, everywhere I turn, I have yet to see materialize.

    Frankly, I just don't see the point. People argue up and down that it's better, but when the rubber hits the road and the facts roll in, it's totally different. Frankly, I think people are in denial about the drawbacks of this technology.
    • Sigh again

      Hey CobraA1,
      All technologies and all their architectures have drawbacks - that means on-premise and SaaS delivery models, that means SOA and REST architectures, that means Linux, Windows and Mac operating systems, that means the applications that are out there. None are perfect for everything. SaaS's benefit is in the cost savings, which, regardless of SaaS hype claiming 100%, are actually in the realm of 20-25% comparably. Its flexibility of access - meaning its anytime-anywhere possibilities - is excellent. Etc. But it's not "more reliable"; it's as reliable. Technology always has drawbacks, but each has benefits too. SaaS included.
  • RE: Salesforce Outage - Klaatu Barada Nikto


    Great review of what was a mindless dogpile. As a large Salesforce.com customer I was anxious about the downtime. When it came back up after 38 minutes I was actually amazed...not angered. And the fact that they continue to be transparent about the problems continues to astonish me.

    This is a good example of how to handle a problem with your customer base. Recognize the problem, fix it and come clean with the issue. It's a simple thought but one of the toughest things to do.

    What this outage provided was an opportunity for both sides of the cloud argument to push their agendas. Although disappointing, it called the warring parties into the open. Maybe now we can go back to an open discussion focused on what's best for the customer.
    • I so agree

      Thanks Kev,
      I'm absolutely in your camp. Given that even the universe is self-perfecting, it's more than likely that a company or an individual is not always going to be perfect and will have something to fix. So, the best thing in the world to do if there's something to fix is to fix it. And if it causes inconveniences to people/customers/other companies, fix that too. But agendas being what they are, and the fact that one of our guilty human pleasures is complaining (not bad, just is), you get these things.

      Oh well, life goes on.
  • RE: Salesforce Outage - Klaatu Barada Nikto

    I agree Paul. Salesforce.com responded quickly and got things back to normal. We have certainly survived worse than 38 minutes of downtime. Having locally installed software does not give any better guarantee of uptime. I would rather leave things to the smart folks at my SaaS provider, hosting company, etc. to manage things and respond quickly in the event of an outage.

    Human error, acts of God, or other culprits will always prevent five-nine'esque uptimes. Even Rackspace, a top hosting company, was down for a spell when a truck crashed into a transformer and the backup generators failed.

    What I liked about Salesforce.com is that they are now (but were not always) much more open about their uptime/downtime and even provide a system status site (trust.salesforce.com).
    • Move along, nothing to see here folks

      As a Salesforce.com user, this was not a big deal. That's just one user's opinion. If the downtime had been an hour or two, I'd feel differently.

      My complaint was that I was not able to access trust.salesforce.com at the time of the outage to get a status. Might have been just me, but I think it was unavailable as well.

      If I could have checked "trust", I would have saved a few minutes confirming that it wasn't an internal issue.

      • You were right

        I heard and read that trust.salesforce.com went down with the rest at the time. The outages are now part of its history though.
  • Where is the rugged redundancy?

    All I heard for the last year is how fault-tolerant and resilient the magical "cloud" is. I think the cloud just proved to be a thunderstorm. This was an accidental outage. God forbid hackers ever target the "cloud".
    Hates Idiots
    • If ever?

      If you think that "hackers" aren't currently targeting SaaS providers with everything from DOS attacks to infiltration attempts, you're ignorant.
      • Waiting for shoe to drop.

        I know they are. I'm just waiting for the 'big incident' to happen.

        Or an admission by a disgruntled former salesforce employee that hackers have already gotten in.

        Companies will be running for the doors to get off their system.
        Hates Idiots
  • Should SaaS Be Held To Higher Standard?

    I'm sure some portion of the emotional outcry surrounding this event is based on how poorly our expectations have been set by SaaS vendors.

    Nowhere in the marketing literature of these services can I find any admonishment that, if the process you're hoping to manage through SaaS is critical to the daily operation of your business, you should think twice about migrating it into the cloud.

    During this outage, not only did Salesforce the service bounce up and down, but their main website did as well which strongly suggests the kind of "coupling" between systems that make software brittle and unreliable.

    Armed only with what I experienced and have subsequently read, it seems Salesforce hasn't yet designed their service to minimize points-of-failure and route around such catastrophic events.
    • Resiliency or multiple entry points?

      The level of redundancy required to make Salesforce.com as reliable as the server in my data center would require nodes across the country with incomprehensible levels of data replication to protect from power outages, natural disasters or even simple accidents.

      A train wreck in a Baltimore tunnel took out data circuits for the entire East Coast a few years ago. Statewide and even regional power outages are happening more often as power grids age and are overextended. Katrina knocked out power, telephone and data services across 3 states. The examples go on and on.

      Sure, build it out and disperse your cloud's data centers to make it more resilient, but how secure is your data when it is scattered across a dozen servers around the globe, with terabytes of bandwidth exposing it to the world over dozens if not hundreds of circuits that provide the cloud its wonderful resiliency?

      I run a small company like those often targeted by Salesforce.com marketing; I have 1 server behind 1 T1 to worry about. Who sleeps at night?
      Hates Idiots
  • Maybe if Salesforce didn't arrogantly disparage their competitors

    Salesforce's Benioff disparaged all the other CRM providers with a holier-than-thou attitude, so when there are hiccups in Salesforce's SaaS model, it is not surprising that there is a backlash. I am not against SaaS, but as Paul Greenberg says, there are advantages and disadvantages to each model.
    Roque Mocan