A few years back, I helped conduct a survey in which we asked IT managers how they first learn of incidents or slowdowns in their services. The leading method was via phone calls or emails from users or customers. The second leading method was via phone calls or emails from (gulp) their top executives.
That's why methodologies and certification programs such as the Information Technology Infrastructure Library (ITIL) came about, providing IT teams with a proven and standardized roadmap for delivering applications and functions as reliable services. However, with the growing use of outside cloud services, ITIL -- designed in the on-premises era -- may be stretched beyond its limits.
The proliferation of cloud means more complications when it comes to running things smoothly and service-like, according to a new report out of Constellation Research. "Most enterprise IT teams are struggling to cope with the newer cloud operations -- demand-based scaling, cloud-native monitoring, observability, and incident management," says Andy Thurai, Constellation analyst and author of the report. "Most enterprises today are still not set up to handle all the IT-related incidents, or crises, in real-time. Classic legacy enterprises are set up to deal with IT incidents in old-fashioned ways, without considering the cloud, software-as-a-service nuances, or the social media venting and demand by customers that puts pressure on enterprises to fix the incidents faster than ever."
The old-fashioned method, "raising a ticket and waiting for it to progress through support levels to reach the proper subject matter expert to solve that incident, can be a disaster waiting to happen," he cautions.
Thurai points to an emerging generation of tool vendors that ostensibly cater to the hybrid environments seen at many enterprises.
Thurai provides the following guidelines for handling incidents:
Avoid incidents when possible.
Be prepared for unexpected and unplanned outages.
Identify the incident before the customers do.
Act quickly and decisively to solve the problem immediately.
Take ownership of the incident. Communicate well and in full. Own the story in digital channels.
Capture all details about the incident.
Do a blameless, detailed postmortem.
Invest in proper observability tools.
Invest in a centralized incident management system.
Invest in AIOps tools.
Break things regularly and see if your theory holds.
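To make one of these guidelines concrete, here is a minimal sketch of what "identify the incident before the customers do" might look like in practice: a rolling error-rate check that flags trouble once failures in a recent window of requests pass a threshold. It is an illustration only, not something drawn from the Constellation report. The class name `ErrorRateMonitor` and the parameters `window_size`, `threshold`, and `min_samples` are hypothetical, and a real deployment would more likely lean on the observability and incident management tooling the guidelines recommend.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical helper: tracks recent request outcomes and flags a likely incident."""

    def __init__(self, window_size=100, threshold=0.05, min_samples=20):
        # Only the most recent `window_size` outcomes are kept; True means the request failed.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold      # assumed alerting threshold, e.g. 5% failures
        self.min_samples = min_samples  # assumed minimum data before alerting

    def record(self, failed):
        """Store the outcome of one request."""
        self.outcomes.append(bool(failed))

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def incident_suspected(self):
        """True once enough samples exist and the error rate exceeds the threshold."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.error_rate() > self.threshold


if __name__ == "__main__":
    monitor = ErrorRateMonitor(window_size=50, threshold=0.10, min_samples=10)
    for _ in range(46):
        monitor.record(False)
    for _ in range(7):
        monitor.record(True)
    if monitor.incident_suspected():
        print("Error rate is elevated; page the on-call engineer.")
```

The `min_samples` guard is a design choice to keep a single early failure from triggering an alert; the right window and threshold depend on the traffic and the tolerance for false alarms of the service in question.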
"Making assumptions is a dangerous thing in the digital economy," Thurai cautions. "Enterprises are one major incident away from disaster, which can happen anytime. Every business leader or board member should be asking these questions of their IT executives: If a major incident happens to us, how would we manage it? Would we be able to handle it and prove to our customers that we are worthy of their trust, or will we botch it up and cease to exist? If we are not prepared now, how can we get prepared? Ask for a plan of action and proof. Be willing to fund what's necessary to make this happen."