0.1% downtime is more than 8 hours a year

Summary: What matters is not the downtime itself but the perception of it. Salesforce.com broke two cardinal rules in its handling of Tuesday's outage.

TOPICS: Outage

A promise of 99.9% uptime sounds impressive until you do the math. With a total of 8766* hours in a year, that 0.1% of downtime still adds up to eight and three-quarter hours. So salesforce.com could have a further three hours' downtime on top of the five and three-quarter hours that some customers suffered on Tuesday before it got above 0.1% downtime for the year — which is still quite a decent performance.

That calculation is worth bearing in mind before anyone is tempted either to jump ship from salesforce.com to another provider or to dismiss the whole notion of on-demand completely. TalkBack poster jmjames, for example, writes "why in the world would anyone outsource not just a critical business process such as CRM, but critical, confidential data?" I wonder how many readers work for IT shops that manage to consistently deliver 99.9% uptime to all of their users 24/7/365, let alone commit to it? Does jmjames? If not, there is the answer to his question.

Of course there are certain applications (and IT shops) where 8 hours of annual downtime is still too much, in which case it's worth making the extra investment. But the very high cost of that extra reliability has to be weighed up against the commercial benefit. Having your customer call center out of action for an hour or two would be a disaster for many businesses. Having your salespeople briefly unable to make sales calls or review their performance is something most businesses can live with — those individuals have other work they can get on with in the meantime.

I suspect the vast majority of salesforce.com users would be perfectly content with just 8 hours of annual downtime — although naturally they'd prefer if it didn't come all on the same day. In fact I'd be prepared to bet that most would turn you down if you offered them a premium service to reduce the figure to two-and-a-half hours (99.97%) or five minutes (99.999%). What matters is not the downtime itself — provided the vendor has made every effort to maintain an appropriate service level — but the perception of the downtime. This is where I believe salesforce.com has slipped up badly in its handling of Tuesday's outage and indeed in its overall approach to service levels.

In my view, salesforce.com has broken two cardinal rules that I believe on-demand providers must adhere to:

"... while unscheduled down-time is unavoidable, companies should alert customers immediately when there's an outage and keep in touch with status reports. Salesnet has four tiers of customers; those at the top can expect hourly calls from account executives and engineers during a 'code red,' while the lower tiers can expect e-mails to their administrators."

  • Be upfront about service levels. Providers should spell out to customers the service levels they'll commit to — and in what circumstances they'll forfeit penalties, if any. Amazingly, Salesforce.com's generic Master Subscription Agreement makes no undertakings whatsoever, beyond this vaguest of assertions:

"Salesforce.com represents and warrants that it will provide the Service in a manner consistent with general industry standards reasonably applicable to the provision thereof and that the Service will perform substantially in accordance with the online salesforce.com help documentation under normal use and circumstances."

Customers have a responsibility too, never to take for granted anything that's not a contractual commitment. Any salesforce.com users that couldn't afford to be offline for most of Tuesday really only have themselves to blame for not reading the small print.

* An earlier version of this posting quoted the erroneous figure of 8736 hours in a year, which was based on calculating 52x7x24 (ie 364 days) rather than 365x24. Add on another 6 hours to average in the effect of leap years and you reach the correct figure of 8766. Thanks to the first two TalkBack posters for spotting this.

Here's a quick reminder of how much downtime customers are exposed to at various service levels hosting providers often boast about:

  • 99.5% — 43.83 hours (an entire working week, and more)
  • 99.7% — 26.30 hours (more than three working days)
  • 99.9% —  8.77 hours (more than one working day)
  • 99.95% — 4.38 hours (half a working day)
  • 99.97% — 2.63 hours (an extended lunch break)
  • 99.99% — 0.88 hours (about 50 minutes)
  • 99.995% — 0.44 hours (a half hour)
  • 99.999% — 0.09 hours (five minutes)
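The arithmetic behind these figures is simple enough to check for yourself. Here is a minimal Python sketch, assuming the 8766-hour average year used above (365.25 x 24); the function and constant names are illustrative only and do not come from any provider's tooling.

```python
# Average hours in a year, with leap years averaged in (365.25 days x 24 hours).
HOURS_PER_YEAR = 365.25 * 24  # 8766.0

# Service levels taken from the list above.
LEVELS = [99.5, 99.7, 99.9, 99.95, 99.97, 99.99, 99.995, 99.999]


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted at a given availability percentage."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


def downtime_table(levels=LEVELS):
    """Return (service level, permitted downtime in hours rounded to 2 places)."""
    return [(level, round(annual_downtime_hours(level), 2)) for level in levels]


if __name__ == "__main__":
    for level, hours in downtime_table():
        print(f"{level}% uptime allows {hours:.2f} hours of downtime a year")
```

Swapping 8766 for 8760 (a non-leap year) changes the 99.9% figure by less than a minute, so the choice of year length barely matters at these service levels.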

Phil Wainewright

About Phil Wainewright

Since 1998, Phil Wainewright has been a thought leader in cloud computing as a blogger, analyst and consultant.

Talkback

11 comments
  • 8736 Hours?

    365 days x 24 hrs = 8760 hours per year. Or is there a day off I missed?
    ewan.innes9
    • 8766 is even more accurate...

      if you consider the average of a leap year over the 4-year span (365.25 * 24).

      This gives salesforce.com extra downtime! Merry Christmas!
      Paul C.
  • You might want to hit on security

    The thought on downtime is right on. The reality is that that 8 hours a year is probably a lot better than most in-house systems. You might also want to consider the thoughts on security as well. Who's more qualified to manage your data securely? "Bob" your tech guy who is also responsible for cleaning spam off the CEO's laptop? Or a company that manages data full time for a living?

    I may be disproven, but I believe that every security breach that hit the media this past summer happened with "in-house" solutions.
    chrisbaggott
    • Not really

      If you run a system in-house, you're in a position to take more scheduled outages because you're not dependent on other companies. Scheduled outages during off-peak or off times don't disrupt anything. Some of the organizations I've been with, we'd have been fired if we pulled a 5 hour outage during the day.
      george_ou
    • Large organizations also have specialists.

      I have never worked in an IT shop where the database folks were also the PC support folks.

      That situation might be true in a very small company with only a few IT people who are forced to play jack-of-all-trades, but that is hardly the normal situation in corporate IT.

      My background is airline operational systems, and the in-house systems I have worked on not only had the benefit of dedicated and specialized in-house support people, but also had the benefit of having those resources truly and clearly understand the overall operational impact that the system(s) being supported had (and that outages of those system(s) would have) on the organization as a whole.

      That second form of understanding is one that is often lost when support is farmed out to external specialists, who are often quite astute technically but who might know very little about the details of the system(s) in question that are not explicitly stated in the SLA associated with the support contract.

      Sometimes seeing both the forest and the trees is important, especially on operationally-critical systems.
      rsteiner9
  • Accountability vs. perfection

    Mr. Wainewright makes some excellent points here. My personal favorite is this one: "Of course there are certain applications (and IT shops) where 8 hours of annual downtime is still too much, in which case it's worth making the extra investment. But the very high cost of that extra reliability has to be weighed up against the commercial benefit." That is, at the end of the day, what it all boils down to. I have worked at companies where there would be at least one multiple-hour outage in the middle of a working day per month. The problems would cascade down, causing the company to miss SLA commitments on 20% - 75% of their technician dispatches, which should have cost them millions of dollars for each outage. Yet, it did not. Why?

    Accountability.

    Do you honestly think that a TPV is going to accurately report to you their downtime or SLA misses? No way! I should know, I worked for quite a while for a major TPV in managed services, and I got to see first hand the lies that the customer was fed. Indeed, I made up some of the more creative ones. Remember that scene in "Total Recall", where they start fiddling with the radio and say "we're losing you sir, we're getting sunspots"? That's your typical TPV. Doing TalkBacks about the horrors of TPVs is part of my personal atonement for all the wrongs I committed in the employment of one.

    More to the point, how do you propose to measure this uptime? Are you going to hire someone to sit on their system 24x7 randomly using portions of it to ensure error-free operation? Are you going to write a custom application to test each individual portion of that website? Or do you expect the vendor to kindly hand you a MIB for their application (how many web applications send out SNMP traps, anyways?) as well as the MIBs and passwords to all of their infrastructure so you can monitor their uptime yourself?

    Mr. Wainewright responds to my previous TalkBack with:

    "I wonder how many readers work for IT shops that manage to consistently deliver 99.9% uptime to all of their users 24/7/365, let alone commit to it? Does jmjames? If not, there is the answer to his question."

    I'm sorry, but that is hardly an answer. As the person responsible for my current employer's IT infrastructure, I most certainly do not make any guarantees regarding uptime, because I know that without the budget of my dreams, I cannot make any commitments to uptime. I know how to achieve uptime along those lines, it isn't rocket science.

    What I do guarantee my organization is accountability. When our network is down for whatever reason, my boss knows that I will do whatever it takes to get it fixed, provide him with an honest answer as to why it went down, and make recommendations as to what we can do to ensure it doesn't happen again. You simply do not get that with a TPV. What you get is a call center in another country blowing smoke up your shirt about how the problem must be on your end. "Sir, are you sure that our website is the only site you can't reach?" "What do you mean our technician isn't there yet? He told my co-worker thirty minutes ago that no one answered the door? I'll have him return to site." "I'm sorry sir, but I don't see that part in a warehouse anywhere near you, someone must have accidentally used it for another customer." These are the typical TPV lies (I've said them all myself) for SLA misses that my boss does NOT hear from me.

    Earlier this year, I accidentally nuked a significant portion of my company's data during a RAID installation. Sure it could have been prevented with proper backups, but when the chips were down, I did not even mention the fact that my request earlier in the year for backup equipment had been postponed for further consideration (I may note, it was quickly approved after that). Instead, I owned up to my responsibility for the mistake, and offered to resign. That's accountability.

    Sure, a TPV may have economies of scale that allow them to build in redundancy that would be outside the budgetary constraints of an in-house IT department, which allows them to promise some sort of uptime. But even if you catch them missing their uptime (and let's face it, they can easily miss 40+ hours over the course of the year if they do it outside your business hours, and you'll never notice), all that will happen is they will suffer some sort of "penalty". They may have to give you a discount or refund some money. It certainly won't trigger the end of the contract. Even if it did, what would be the chances that you would be able to move off of their system reasonably easily anyways? With an in-house IT department, consistent failure equals (or should equal) termination of employment. It's that simple. IT workers are disposable, replaceable parts. If we weren't, our jobs would be much harder to ship overseas or be outsourced to TPVs. If I am a faulty part, like any other piece of equipment, I get scrapped and replaced. You simply do not get that with a TPV.

    I think most reasonable bosses in (major qualification here) a privately owned company would much prefer accountability and honest answers from an in-house IT department over the supposed "cost-savings" and lies that a TPV shovels out. Publicly traded companies are another story; as long as they can tell the stockholders that their losses were someone else's fault, they really don't care.

    That's the true reason so many companies use TPVs. At the TPV I worked for, we used to have our own in-house supply chain. We outsourced that to a major supply chain company, not because it saved any money (it cost more), not because they were better (we missed SLA less often when it was in-house), but because it allowed the turds of responsibility to roll downhill. Even if the technician was late, it didn't matter because the TPV we used was late with the parts anyways. And the contract said that a miss on parts, when handled by our TPV, didn't ding us. What a load of baloney.

    To quote myself, "What if this article had been not about a database bug that took Salesforce.com offline, but about a database bug that revealed one client's data to another client?" Achieving extremely high levels of uptime is a great goal, and an extremely important one at that (if not the most important one). How much uptime are you willing to sacrifice in order to keep your data confidential? Sure, you will quite possibly lose large amounts of money depending on which system is offline. But, in certain industries, you will be out of business and/or prosecuted by your state, if not the federal government, for letting your data slip into someone else's hands. Health care. Financial services. Military/defense. And so forth.

    At the end of the day, I still think that sending mission critical portions of a company's IT infrastructure to a TPV is a catastrophe waiting to happen. You lose all control and all accountability, in exchange for some nearly unenforceable promises like uptime. TPVs play "no harm, no foul" with contractual agreements all of the time. If the customer doesn't notice, they won't tell the customer. And for that reason, I'll bet my business on people that I personally trust and can hold accountable, as opposed to a company that I'm stuck with until end of contract (if I can migrate away from them anyways) any day of the week. Uptime is hardly the be-all-and-end-all of IT "success".

    J.Ja
    Justin James
    • Measuring uptime -- heartbeats.

      That is what we did when I worked at an airline, and it is what we do at the airline communications and services provider I currently work for.

      Testing a link with periodic automated messages and responses is a method that has been used for years, and it seems to be a fairly good way to ensure that communications links and core application interfaces are constantly up and running. Two concurrent failures generate an alert to the operations center, or one failure if the link is one which guarantees message delivery.
      rsteiner9
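The alerting rule described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "two concurrent failures" is read here as two missed heartbeats in a row, and the function name, the `guaranteed_delivery` flag and the list-of-booleans input are all hypothetical, not taken from any airline system.

```python
from typing import Iterable, List


def alert_indices(results: Iterable[bool], guaranteed_delivery: bool = False) -> List[int]:
    """Return the positions of heartbeat probes at which an alert would be raised.

    results: True for a probe that received its response, False for a missed one.
    guaranteed_delivery: assumed flag for links that guarantee message delivery,
        where a single missed heartbeat is enough to raise an alert.
    """
    threshold = 1 if guaranteed_delivery else 2
    failures_in_a_row = 0
    alerts = []
    for i, ok in enumerate(results):
        if ok:
            failures_in_a_row = 0  # a good response clears the failure streak
            continue
        failures_in_a_row += 1
        if failures_in_a_row == threshold:
            alerts.append(i)  # alert once per outage, when the threshold is first reached
    return alerts
```

In a live monitor the booleans would come from timed probes against the link rather than a prepared list; feeding in a recorded sequence simply makes the rule easy to test.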
  • Thanks

    Thanks for this article. It's really surprising for our clients, too, when they hear what 99.99% really means.

    Best regards
    Gernot
    http://www.ssh-gmbh.de
    Goerni