SaaS brings speed, innovation, and enterprise capability…
Today's enterprises embrace SaaS to add capability quickly without huge investments in infrastructure and staff to build it out internally. A business leader chooses a service, signs a contract, and very rapidly has a CI/CD tool, a full HRM system, or another business application.
… along with new, unfamiliar, and often poorly understood risks
Technology and business risks morph with changes in technology and how it is delivered. While cloud services are often considered more dependable, businesses face new risks with SaaS and public cloud, risks that are unfamiliar or not completely understood. People's eyes pop open and ears perk up when they witness prolonged outage events like the current issue with Atlassian. Suddenly, SaaS dependencies and resilience issues become relevant as a business can't access its favorite SaaS tool. The unique risk of using SaaS is that you don't control the application or tool and cannot reimplement it yourself. It is also important to understand the cascading risks, as some well-known SaaS services are hosted on a leading hyperscaler's infrastructure. You need to analyze the business impact of SaaS and cloud service outages just as you would for any other technology in your portfolio.
When crafting resilience for SaaS, two things matter: What your vendor does and what you do.
Define vendor responsibility
Trust but verify vendor claims about service level agreements supporting operations and resilience plans. To ensure your SaaS providers deliver on their promises:
Demand SaaS providers share their resilience capabilities. Understand the design, architecture, and deployment model for these SaaS services. These should be transparent, not opaque. What resilience capabilities has the SaaS provider built to withstand failures? Insist the provider be clear about which failure scenarios it covers and which it does not.
Inquire about IT operations and controls. While some SaaS providers may identify their design and architecture as a secret sauce, don't settle for boilerplate responses. Engage your recovery practice people to inquire about how SaaS providers manage their services, including their operational practices.
Build SLAs with real vendor consequences into contracts. Downtime for a vendor represents more than a lack of service to your business. Depending on the particular SaaS tool, an outage can mean a whole lot of cost to your business – idle employees, missed deadlines, inability to sell or ship products, loss of physical or digital security, a threat to life, and reputational risks. Make your SLAs with the vendor match the importance of the service to your business. One company writes into its vendor contracts that SLA-violation payments must be signed by each member of the board of directors so that outages get escalated to the highest level.
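To make the link between an SLA figure and business cost concrete, the sketch below converts an availability percentage into the downtime it still permits and multiplies that by an hourly cost of outage. The function names and the cost figure are illustrative assumptions, not values from any vendor contract.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability target still permits over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def outage_cost(downtime_minutes: float, cost_per_hour: float) -> float:
    """Estimated business cost of an outage; cost_per_hour is your own estimate."""
    return downtime_minutes / 60 * cost_per_hour


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% monthly SLA and $20,000 per hour of lost work.
    minutes = allowed_downtime_minutes(99.9, period_days=30)
    print(f"Permitted downtime: {minutes:.1f} minutes per 30 days")
    print(f"Cost if fully used: ${outage_cost(minutes, 20_000):,.0f}")
```

Comparing the permitted downtime against what the business can actually tolerate for a given tool is one way to judge whether the contracted SLA matches the importance of the service.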
Implement your own controls
The resilience of your business is your concern; don't pass the buck to your vendor. With SaaS, you avoid running and maintaining an application, but in the case of service outages, you still incur the business losses, and you don't run the infrastructure, so you can't put it all back together yourself. Prepare for the risk scenarios that your SaaS provider does not cover, and develop a plan of controls and mitigations your business can use to minimize the impact of SaaS outages.
Risks and controls for SaaS resilience vary – act accordingly
Data Loss or Corruption
Back up your data
For the most part, SaaS vendors don't take responsibility for client data. Your data may be part of their backups, but they aren't guarding against accidental deletion or corruption, and there is no easy way to initiate a restore. Let's be clear: backing up data from SaaS does not mean you can restore your business operations during an outage. Data backups provide a safety net in case of corruption, letting you restore the data back onto the SaaS platform. Backups may also make a service migration possible if staying with the current provider becomes untenable.
Dependent Infrastructure Outage
Monitor key cloud service dependencies
Ascertain whether an outage at the infrastructure provider will have a downstream effect on your SaaS vendor's offering. For instance, if your provider has significant infrastructure in AWS US-East, you should monitor the service availability of that region in a resilience dashboard.
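As a minimal sketch of what sits behind such a dashboard, the snippet below keeps a map of which cloud regions each SaaS tool is believed to depend on and reports which tools are exposed when a region degrades. The vendor names and the dependency map are hypothetical; in practice the map would come from vendor disclosures, and the degraded set from whatever status feed you monitor.

```python
from typing import Dict, List, Set, Tuple

Region = Tuple[str, str]  # (cloud provider, region identifier)

# Hypothetical dependency map; fill it in from what your vendors disclose.
DEPENDENCIES: Dict[str, Set[Region]] = {
    "ci-cd-tool": {("aws", "us-east-1")},
    "hrm-system": {("aws", "us-east-1"), ("aws", "eu-west-1")},
    "collab-suite": {("azure", "westeurope")},
}


def vendors_at_risk(dependencies: Dict[str, Set[Region]],
                    degraded: Set[Region]) -> List[str]:
    """Return SaaS tools with at least one dependency in a degraded region."""
    return sorted(
        vendor for vendor, regions in dependencies.items()
        if regions & degraded
    )


if __name__ == "__main__":
    # Assume the status feed reports trouble in one AWS region.
    print(vendors_at_risk(DEPENDENCIES, {("aws", "us-east-1")}))
```

A set intersection keeps the check simple. A tool that spans several regions is flagged as soon as any one of them is degraded, which errs on the side of early warning.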
Short-Term Outage
Identify tolerance for service outage
Most cloud and SaaS outages are relatively short, and while disruption is inconvenient, the value that SaaS provides exceeds the hiccups. Identify internally when that equation changes and action must be taken, such as workarounds or service migrations.
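One way to write down when the equation changes is as explicit duration thresholds per service, agreed in advance. The sketch below is illustrative only: the `Tolerance` type, the action labels, and the threshold values are assumptions, and each business would set its own.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tolerance:
    """Internally agreed outage tolerance for one SaaS service (hypothetical)."""
    workaround_after: timedelta  # when to switch to documented workarounds
    migrate_after: timedelta     # when to start moving to an alternate vendor


def recommended_action(outage_duration: timedelta, tolerance: Tolerance) -> str:
    """Map how long a service has been down to the pre-agreed response."""
    if outage_duration >= tolerance.migrate_after:
        return "migrate"
    if outage_duration >= tolerance.workaround_after:
        return "workaround"
    return "wait"


if __name__ == "__main__":
    # Example thresholds only: workarounds after 4 hours, migration after 7 days.
    ci_cd = Tolerance(workaround_after=timedelta(hours=4),
                      migrate_after=timedelta(days=7))
    for hours in (1, 12, 24 * 10):
        print(hours, "hours down:", recommended_action(timedelta(hours=hours), ci_cd))
```

Agreeing these thresholds before an incident means that deciding when to act is not left to whoever happens to be on call during the outage.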
Medium-Term Outage
Workarounds and outage planning
Identify key processes and operations that require workarounds to keep the business running, even in a degraded state. When planning for outage scenarios, ask key questions. For example, if your CI/CD pipeline fails, how will the developers write and publish code? If your collaboration system is unavailable, how will teams share key documents until service is restored? Is there a hybrid option, or an alternative available from the vendor?
Long-Term Outage
Most SaaS companies have a healthy set of competitors ready to help you transition to their platform. Identify in advance which vendors would be the best fit for your needs. If possible, work with potential vendors to test what it would take to transform and migrate data backups from your existing vendor onto the new platform. Also perform rigorous due diligence on your alternate provider, as it may expose you to risks similar to those of your current provider.
Vendor Shuts Down or Discontinues Service
Software Escrow/SaaS Escrow
Companies like NCC Group in the UK provide a unique escrow service. They contract with customers and the software or SaaS vendor to hold (incremental) code and operational expertise in escrow, reducing the risk to customers if a vendor discontinues a product or goes out of business.
Practice and test your recovery and resilience options
Every athlete practices their sport, gauging their performance with the help of coaches or other athletes to determine how to improve. Your resilience operations should be practiced, tested, and improved in the same way. Resilience and recovery are a sport, and executing requires everyone in your organization to know what they are doing in the case of a key application or service being offline. Your sales teams need to know what to do if Salesforce is not available, your HR team needs to know what workarounds to implement if Workday has an outage, and your DevOps teams need to understand how to stay productive if Atlassian goes down.
Just like with self-managed infrastructure, the key to surviving a SaaS outage is knowing the risks, implementing controls to mitigate those risks, and then testing your plans to make sure those work and everyone knows how to execute in the case of crisis.
This post was written by Senior Analysts Brent Ellis and Naveen Chhabra and it originally appeared here.