Salesforce.com, a leading Software as a Service (SaaS) vendor, faced outages across parts of its network yesterday. Here's a screen capture from the company's status dashboard:
The main outage lasted about six hours, as reported by the company:
4:40 am PST : Root Cause Messaging - NA5 Performance Degradation 2/11/2008
Starting at 1622 UTC, customers on the NA5 instance began experiencing intermittent performance degradations. The salesforce.com technology team worked on troubleshooting the issue throughout the day and took corrective actions to restore normal service levels by 2204 UTC. We believe that the problem occurred due to changes in database utilization introduced in the Spring '08 release which went live on the NA5 instance on Friday night. We have subsequently changed the configuration of our servers to address the problem and do not expect further issues.
Secondary problems also occurred:
Time: 2/11/08 6:43 am PST
Detail: NA0 (SSL) Service Disruption from 1443 UTC to 1502 UTC on Monday, February 11th.
Root cause: Starting at 1443 UTC, customers on the NA0 instance experienced a service interruption lasting approximately 19 minutes. The salesforce.com technology team worked to isolate the issue and restored the service at 1502 UTC. While the salesforce.com technical team still is going through the forensics, we believe that the root cause for the outage was related to a significant slowdown in IO throughput to the database storage sub-systems.
Salesforce.com is continuing to work with our vendors to further understand what could have caused the IO slowdown and will take corrective actions to ensure that these issues are addressed appropriately.
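As a quick sanity check on the durations in the two notices above, the arithmetic can be sketched in a few lines of Python. The helper name `outage_minutes` is illustrative only, and the sketch assumes both timestamps in each pair fall on the same UTC day, as the notices indicate.

```python
from datetime import datetime


def outage_minutes(start_utc: str, end_utc: str) -> int:
    """Return whole minutes between two same-day 24-hour 'HHMM' timestamps."""
    fmt = "%H%M"
    delta = datetime.strptime(end_utc, fmt) - datetime.strptime(start_utc, fmt)
    return int(delta.total_seconds() // 60)


# Times taken from the NA5 and NA0 notices quoted above.
na5 = outage_minutes("1622", "2204")
na0 = outage_minutes("1443", "1502")

print(f"NA5: {na5 // 60} h {na5 % 60} min")
print(f"NA0: {na0} min")
```

The NA5 figure comes out a little under the "about six hours" cited above, and the NA0 figure matches the 19 minutes stated in the notice.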
THE PROJECT FAILURES ANALYSIS
Software as a service vendors claim non-disruptive upgrades are an important reason customers should buy online, rather than on-premises, software. In this instance, Salesforce.com did not insulate its customers from downtime and inconvenience associated with an upgrade.
Phil Wainewright, ZDNet blogger and premier SaaS analyst, notes the outage was small. With all due respect to Phil, I think this "minor" outage points to deeper issues.
Two points stand out:
Customers don't want this hassle. SaaS customers pay to be insulated from release problems, and that's what they should get, with no reasons, excuses, or explanations.
Salesforce.com's release testing process is broken. Testing should have caught these problems before they were released into the wild.
Outsourced software services remain a great option for many customers, but the marketing hype sometimes outstrips reality. Whether online or on your own server, critical points in the software lifecycle, such as upgrades, can (and do) cause end-user disruption.
In the meantime, salesforce.com should do some remedial work on its release and testing processes.