Circuit breaker failure was initial cause of Salesforce service outage

Cloud company shares details of investigation into outage.
Written by Danny Palmer, Senior Writer

Salesforce says it has traced the cause of a database failure that disrupted services across the US last week: a circuit breaker failure at the company's Washington D.C. datacenter, which was then compounded by a firmware bug.

The outage left some of Salesforce's US clients unable to access services for more than a day, and prompted CEO Marc Benioff to step in and personally apologise to disgruntled customers via Twitter.

Now, on its help page, Salesforce has issued an apology for the disruption and moved to reassure customers that if a similar event were to occur in future, it would be resolved much more quickly.

The disruption began on 9 May, when a circuit breaker responsible for controlling power to the Washington datacenter failed.

"The breakers are used to segment power from the data center universal power supply ring and direct the power into the different rooms. This failed board caused a portion of the power distribution system to enter a fault condition. The fault created an uncertain power condition, which led to a redundant breaker not closing to activate the backup feed because that electronic circuit breaker could not confirm the state of the problem board," the company said.
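The failover behaviour Salesforce describes can be pictured with a small sketch. It is only an illustration of the general idea, not the logic of any real breaker: the state names and the `should_close_backup` function are hypothetical, and the assumption is that a redundant breaker closes onto the backup feed only when it can positively confirm that the problem board is isolated.

```python
from enum import Enum


class BoardState(Enum):
    """Hypothetical states a redundant breaker might read from the primary board."""
    HEALTHY = "healthy"    # primary feed is working, no failover needed
    ISOLATED = "isolated"  # fault confirmed and the board is disconnected
    UNKNOWN = "unknown"    # the state of the board cannot be confirmed


def should_close_backup(primary_state: BoardState) -> bool:
    """Return True if the redundant breaker may close to activate the backup feed.

    Assumption: the breaker acts only on a confirmed state, so an
    unconfirmed (UNKNOWN) board leaves the backup feed inactive.
    """
    return primary_state is BoardState.ISOLATED


for state in BoardState:
    print(state.value, "->", "close" if should_close_backup(state) else "stay open")
```

Under these assumptions, an uncertain power condition maps to `UNKNOWN`, and the backup feed is never activated, which matches the outcome described in the statement.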

This was only the start. In an effort to restore service to the NA14 instance as quickly as possible, the team moved it from its primary datacenter in Washington to its secondary datacenter in Chicago. But hours after this was done, the technology team "observed a degradation in performance on the NA14 instance", which meant customers on NA14 were unable to access the Salesforce service. The company said this second problem was caused by a firmware bug on the storage array, which significantly increased the time it took the database to write to the array.

"Because the time to write to the storage array increased, the database began to experience timeout conditions when writing to the storage tier. Once these timeout conditions began, a single database write was unable to successfully complete, which caused the file discrepancy condition to become present in the database. Once this discrepancy occurred, the database cluster failed and could not be restarted."
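How slower storage can turn into an incomplete write can be sketched in a few lines. This is a toy model under stated assumptions, not a description of Salesforce's database or its storage array: the `write_blocks` function, the `WriteTimeout` exception and the latency figures are all hypothetical, and time is simulated rather than measured.

```python
class WriteTimeout(Exception):
    """Raised when a simulated write exceeds its time budget."""


def write_blocks(blocks, latency_ms, timeout_ms, storage):
    """Append blocks to storage one at a time, charging latency_ms per block.

    Raises WriteTimeout as soon as the simulated elapsed time exceeds
    timeout_ms, leaving any blocks already written in storage.
    """
    elapsed = 0
    for written, block in enumerate(blocks):
        elapsed += latency_ms
        if elapsed > timeout_ms:
            raise WriteTimeout(f"timed out after {written} of {len(blocks)} blocks")
        storage.append(block)


# Normal latency: all three blocks fit inside the 50 ms budget.
healthy = []
write_blocks(["a", "b", "c"], latency_ms=10, timeout_ms=50, storage=healthy)

# Increased latency: the write times out part-way through.
slow = []
try:
    write_blocks(["a", "b", "c"], latency_ms=30, timeout_ms=50, storage=slow)
except WriteTimeout as exc:
    print(exc)
```

In this model the slow write leaves only part of the data on disk, a simplified stand-in for the kind of mismatch the statement calls a file discrepancy. The actual mechanism was still under investigation with the database vendor.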

All functionality was restored to the NA14 instance, including sandbox copy and weekly export functionality, on 15 May.

Despite an ongoing investigation, it's not yet clear what caused the circuit breaker to fail, and Salesforce is keen to point out that the part in question passed load testing as recently as March 2016. The company is working with its supplier to determine the cause of the failure.

In the aftermath of the failure, Salesforce has replaced the faulty components at its Washington datacenter and is carrying out a full audit of power and failover systems to ensure power distribution failover will respond correctly to similar failures in the future.

"Investigation is continuing alongside our database vendor to determine the root cause for the file discrepancies in the database when it encountered timeout conditions writing to the storage layer. From that investigation, corrective steps will be determined and implemented," said Salesforce.

"We sincerely apologize for the impact this incident caused you and your organization," the company added.

