Small outage at Salesforce.com, not many upset

Performance problems hit three of Salesforce.com's eight server clusters yesterday, including a two-hour blanket outage for European users and intermittent problems during the business day for some customers in North America.
Written by Phil Wainewright, Contributor

Compared to the furore that used to greet Salesforce.com outages, yesterday's problems on three of its eight instances, including one of the five North American instances, have aroused little ire [disclosure: Salesforce.com is a client].

eWeek's coverage illustrates how difficult it is to stir up a storm of discontent when all the available information about the outage is published on the trust.salesforce.com console, which the vendor established in the wake of its earlier problems.

Tipped off by a customer, the journalist only had to click the link and copy and paste the information posted there:

"... at 8:22 a.m. Pacific time, the company's internal server information Web page said, 'NA5 Service Degradation: The technology operations team has been made aware of intermittent service disruptions to NA5. Please check back for further updates.'

"Salesforce.com, based in San Francisco, subsequently reported similar 'service degradations' at 9:26 am, 10:19 am, 11:23 pm and 12:20 pm before announcing at 2:04 pm that 'the Salesforce.com Technology team has restored the service issue with NA5 at 22:11 UTC. We apologize for any inconvenience this may have caused you.'"

All instances are operating normally today, according to the status console. The latest report on yesterday's NA5 problems is as follows (times are given in UTC, which is 8 hours ahead of PST and 5 hours ahead of EST):

Performance Degradation

Time: 2/11/08 8:22 am PST

Detail: NA5 Performance Degradation from 1622 UTC to 2204 UTC on Monday, February 11th.

Root cause: Starting at 1622 UTC, customers on the NA5 instance began experiencing intermittent performance degradations. The salesforce.com technology team worked on troubleshooting the issue throughout the day and took corrective actions to restore normal service levels by 2204 UTC. We believe that the problem occurred due to changes in database utilization introduced in the Spring '08 release which went live on the NA5 instance on Friday night. We have subsequently changed the configuration of our servers to address the problem and do not expect further issues.

There were also serious problems on two other instances. The EU0 instance for users in EMEA was out for two hours in the afternoon, local time:

"Starting at 1443 UTC, customers on the EU0 instance experienced a service interruption lasting approximately 2 hours. The salesforce.com technology team worked to isolate the issue and restored the service at 1644 UTC. While the salesforce.com technical team still is going through the forensics, we believe that the root cause for the outage was related to a significant slowdown in IO throughput to the database storage sub-systems. This had a cascading effect for the clustering software as well as the database software which all had to be validated and restarted."

A similar problem affected the North American SSL instance (NA0), which shut down for 19 minutes:

"Starting at 1443 UTC, customers on the NA0 instance experienced a service interruption lasting approximately 19 minutes. The salesforce.com technology team worked to isolate the issue and restored the service at 1502 UTC."
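For readers who want to check the console's UTC timestamps against local time, here is a minimal Python sketch. It assumes only the start and end times quoted in the reports above and a fixed UTC-8 offset for PST (valid in February); the `utc` and `summarize` helpers are illustrative names of my own, not anything Salesforce.com publishes.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset, since February falls in standard time.
PST = timezone(timedelta(hours=-8), "PST")

def utc(hhmm: str) -> datetime:
    """Parse an 'HHMM' UTC time on 11 February 2008."""
    return datetime(2008, 2, 11, int(hhmm[:2]), int(hhmm[2:]), tzinfo=timezone.utc)

# (start, end) in UTC, as quoted from the status console reports.
incidents = {
    "NA5": ("1622", "2204"),
    "EU0": ("1443", "1644"),
    "NA0": ("1443", "1502"),
}

def summarize(start: str, end: str) -> tuple[str, int]:
    """Return the PST start time and the duration in whole minutes."""
    s, e = utc(start), utc(end)
    minutes = int((e - s).total_seconds() // 60)
    return s.astimezone(PST).strftime("%H:%M"), minutes

for name, (start, end) in incidents.items():
    local_start, minutes = summarize(start, end)
    print(f"{name}: began {local_start} PST, lasted {minutes} min")
```

On these figures the NA5 degradation spans 342 minutes (5 hours 42 minutes) from 08:22 PST, while EU0 and NA0 both began at 06:43 PST and lasted 121 and 19 minutes respectively, which matches the "approximately 2 hours" and "19 minutes" in the console text.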

This is the most serious outage at Salesforce.com for some while — to tell the truth, I'm having trouble remembering the last time there was a problem. The last outage mentioned on independent user site Salesforcewatch.com was in June last year. Please post in Talkback if you can remember anything more recent. That track record, coupled with the transparency and comfort provided by the real-time status console, probably explains why users haven't kicked up a storm this time.

Google could do with taking a leaf out of Salesforce.com's book, given the growing discontent over unexplained Gmail delivery delays.

UPDATE [added 6:30 am PST]: Michael Krigsman points out that "SaaS customers pay to be insulated from release problems, and that’s what they should get ... Testing should have caught these problems before they were released into the wild." It's a valid point, even though Salesforce.com has never promised totally uninterrupted service — Michael's post highlights how the 'out-with-the-old' rhetoric of SaaS evangelists places a heavy burden on SaaS vendors to constantly raise the bar on availability and deliver overall quality of service that's whiter-than-white.
