Software-as-a-service (SaaS) vendor Workday, which sells human resources applications, recently had a 15-hour outage during which its system was unavailable to customers. In an unusual twist, this post is about success and not failure.
Background. The story begins when I heard about the outage through an anonymous source. To learn more, I sent out this Twitter message:
Following Naomi's suggestion, I checked Workday's blog for details:
[T]he network attached storage (NAS) device that stores operating system files for our production servers detected a corrupted node within a backup RAID array. Rather than simply log the error, which is what it is supposed to do, the NAS took itself off-line. It is ironic that the redundant backup to a system with built-in redundancy caused the failure.
This type of error should not have caused the array to go offline, but it did. The most important result is that our failover plans worked as expected. Within hours, all customers were live in our secondary datacenter with all their data intact.
Workday gets in touch. Two days later, Workday's Communications Director, Andrew McCarthy, sent me an unsolicited invitation to discuss the outage, even though I had never previously had contact with the company.
The note caught me off-guard because it's the first and only time a vendor has reached out to me proactively following a failure. I've written almost 750 blog posts related to IT failure, and Andrew's invitation is unique in my experience. Here's the full text of that email:
Since we're connected on twitter I decided to go old school and drop you a note.
Give me a shout if you have any questions on the outage, etc. Not my favorite subject, but we'll do our best to answer any questions, etc.
During the ensuing conversation, Andrew offered extensive internal details about the outage and Workday's response. He also invited me to talk with Aneel Bhusri, the company's co-founder and co-CEO.
Andrew's offer was surprising because vendors don't often invite me to chat with their senior executives about specific implementation failures or downtime incidents. Although I talk with industry leaders on a daily basis, those conversations are usually more general in nature.
Aneel began our conversation by explaining what caused the disruption:
An obscure file system bug in our redundant storage system cascaded into a chain reaction that took us offline. Once we realized the storage vendor could not immediately fix the problem, we initiated our disaster recovery process. Although the disaster process worked exactly according to plan and no customer data was lost, we are going to examine ways to make recovery faster in the future.
The discussion quickly turned to Workday's relationship with customers:
Workday has a strong relationship with its customers. During the outage I spoke with almost every major customer to let them know what happened and how we are working to solve the problem. Keeping customers informed was of primary concern; we sent emails, called them, posted announcements on our forum, and so on.
Almost universally, the CIOs I spoke with recognize that technical outages are an unpleasant fact of life. Therefore, how a company deals with those disruptions is all-important.
In the future, the market will judge SaaS vendors by how well they handle such situations. It's a matter of trust.
Customer perspective. Both Aneel and Andrew said that supportive customers sent in positive emails during or after the outage. I pointedly asked Aneel whether he also received negative feedback from customers regarding the situation. His definite response:
No, we did not. Would you like to speak with a customer?
Aneel put me in touch with Manjit Singh, CIO of Chiquita Brands International, famously known as a worldwide supplier of foods such as bananas. No Workday representatives were present during my conversation with Manjit, there were no pre-conditions, and no part of the discussion was declared off the record.
Manjit said the Workday outage had two primary impacts on his company:
First, we lost the ability to process HR transactions during the normal course of that day's business.
Second, and more significantly, we were preparing to go live with our Costa Rica implementation, so this outage had the potential to delay our schedule. However, we worked around it and went live as planned.
Manjit's comments are an important data point supporting Aneel's assertions about relationship and trust between Workday and its customers:
Outages are never good, but they do happen. Workday's communication was fantastic: they kept us informed of the problem, steps they were taking to resolve it, and expected time to solution.
Are we a happy customer despite this? Yes, we are, absolutely.
These things happen, so communication to stakeholders and flawless execution are most important. In fact, this situation reinforces our decision to go with Workday. They are capable of executing well and we have a strong relationship with them.
THE PROJECT FAILURES ANALYSIS
When evaluating IT failures, one should examine the failure's effect against concrete business criteria, such as:
- Cost, schedule, or other negative business results that hurt customers
- Extent of negative impact on the vendor's relationships with customers
- Negative publicity for either vendor or customers
- Damage to morale for either vendor or customers
- Degree to which the situation is a one-time event or else symptomatic of a broader systemic problem
Measuring the Workday outage against this list, I conclude that we must view it as little more than a technical failure. There is no systemic problem, nor was trust broken between vendor and customer.
Even if rare, such outages are always unacceptable. Aneel emphasized that the company plans to review disaster scenarios and implement steps to improve recovery times should future outages occur. During our conversation, Aneel spent considerable effort sharing his thoughts on this crucial point.
Public relations impact. Remarkably, a Google search brings up only one press article about this outage and only a single blog post (aside from Workday's). That post, written by my friend Vinnie Mirchandani, reinforces the company's claims of relationship and trust with customers:
So, how much pissed off customer feedback did they get?
Dave Duffield, his Workday co-founder: “Unbelievably, I got emails from couple of our customers basically saying “Better you than me”. They are so glad they are not being woken up middle of the night. That’s our job now”
In fairness, Workday is a relatively small company with only about 100 customers. However, these customers are generally large organizations and Workday actually has over 100,000 users, a number that is growing rapidly. Any single one of these folks could have called the press--if they wanted to. After all, anyone attempting to use the Workday system knew it was down during that 15-hour period.
Personal skepticism. Researching this post, I was skeptical of Workday's claims regarding the depth of its customer relationships and the extent to which the company is transparent. To verify, I called HR analyst/guru Naomi Bloom, whose willingness to be helpful matches her stellar reputation. In other words, she's someone I trust.
Naomi explained it this way:
When this happened, Workday did not equivocate. Senior executives called each customer through the night, doing whatever it took to treat those customers as respected partners. From the inception of the company, Workday made a commitment to be straightforward with customers, and they have kept that promise.
In addition, Workday customers signed up as early adopters and pioneers in a new way of doing things. The company set appropriate expectations up front, so customers knew it might be a little bumpy along the way. In addition, Workday co-founder, Dave Duffield, has proven his customer care bona fides on the enterprise stage at PeopleSoft, so those signing up knew that his words weren't hollow.
I asked Naomi whether Workday now sets a standard for trust among SaaS vendors. Her response:
My take. Due to careful advance disaster planning and superb management during the recovery, I believe Workday emerges from this outage with greater customer loyalty than it had before. Of course, retaining such an intense customer focus becomes more difficult as a company grows, and only time will tell whether Workday can maintain such lofty achievements in the future.
For the present, however, Workday deserves congratulations for doing the right thing by customers when the chips were truly down.
[Disclosures: This situation is such an outlier from ordinary discussions of failure that I'm open to the possibility of missing something important. If you have substantive information contradicting my understanding of this situation, please let me know. Until contradictory facts emerge, however, I stand by my position.
Naomi Bloom has tracked Workday closely from its inception. She has performed a small amount of paid work for the company (white paper, webinar, limited consulting); she has also been a paid consultant to Workday's competitors.]