Cloud outages: Why one status page is better than many

A clear picture of cloud outages matters more than ever, now that more businesses use multiple clouds for their sites and software. But most dashboards are basic, at best.
Written by Jack Clark, Contributor

When a site like CloudApp goes down, does the fault lie in its own application, in the Heroku platform-as-a-service it sits on, or in the Amazon Web Services infrastructure-as-a-service cloud that Heroku in turn relies on?

These are the questions administrators need to ask themselves when something goes wrong with a modern web application — but finding the answer can be tricky. 

The composite nature of modern websites means they can be damaged by flaws with their own technologies, as well as by problems in the cloud services they use. With the rise in the use of third-party technologies for anything from ads, to databases, to login and payment areas, large websites frequently access a multitude of services — all of which can, and do, fail. 
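To make the layering concrete, here is a minimal Python sketch of the question an administrator is really asking: of everything a site depends on, which failing service has no failing dependency beneath it? The dependency map, the service names as keys and the `probable_root_causes` helper are all hypothetical, written for illustration only; they do not describe how any real tool works.

```python
# Hypothetical dependency map: each service lists the services it relies on.
DEPENDS_ON = {
    "cloudapp": ["heroku"],
    "heroku": ["aws"],
    "aws": [],
}


def reachable(service, depends_on):
    """Return the service plus everything it depends on, directly or indirectly."""
    stack, seen = [service], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(depends_on.get(name, []))
    return seen


def probable_root_causes(service, unhealthy, depends_on=DEPENDS_ON):
    """Return unhealthy services in the chain whose own dependencies all look healthy.

    `unhealthy` is an assumed input: the set of services currently failing checks.
    """
    unhealthy = set(unhealthy)
    candidates = reachable(service, depends_on) & unhealthy
    return sorted(
        name
        for name in candidates
        if not any(dep in unhealthy for dep in depends_on.get(name, []))
    )
```

If all three layers are failing, the sketch points at the bottom of the stack; if only the application is failing, it points at the application itself. The hard part in practice is knowing which services are unhealthy in the first place.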

How, then, can you efficiently diagnose a problem? 

Many companies have tried to build tools to let administrators see through the fog of cloud disruption. On Wednesday, Compuware released its own attempt. 

The Outage Analyser site lets administrators track outages as they happen. Image: Compuware

Outage Analyser is a free website that compiles data gathered by 150,000 Compuware application performance management (APM) software agents used by its customers across the world. This data is amalgamated to give administrators the information they need to determine the root cause of a problem.

The Compuware tool shows the probable cause of the outage, the regions affected and a list of potentially hit websites and other dependent services. It also has an option to display a timeline to show how the outage evolved. 

With Outage Analyser, admins can view outages as they happen and track their spread across the globe. Unlike other comparison tools, it can also make a stab at telling them which sites are dependent on services that have gone down. 
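Compuware has not published how it amalgamates its measurements, so the following Python sketch is only a guess at the general shape of such an aggregation. It assumes each measurement is a record with hypothetical `region`, `site`, `service` and `ok` fields, and it flags a service once an assumed share of its checks fail.

```python
from collections import defaultdict


def summarise_outages(measurements, threshold=0.5):
    """Group measurements by third-party service and report the ones that look down.

    Each measurement is assumed to be a dict with 'region', 'site', 'service'
    and 'ok' keys. The 0.5 failure-rate threshold is an arbitrary illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    regions = defaultdict(set)  # regions where the service was seen failing
    sites = defaultdict(set)    # sites observed calling the service

    for m in measurements:
        service = m["service"]
        totals[service] += 1
        sites[service].add(m["site"])
        if not m["ok"]:
            failures[service] += 1
            regions[service].add(m["region"])

    report = {}
    for service, total in totals.items():
        rate = failures[service] / total
        if rate >= threshold:
            report[service] = {
                "failure_rate": rate,
                "regions": sorted(regions[service]),
                "sites": sorted(sites[service]),
            }
    return report
```

Even this toy version shows the limit of a sampled view: a service appears in the report only if some monitored site actually calls it.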

"It is a fairly sophisticated approach that is required to do something like this. The first ingredient is to have the insight and visibility across the internet," Steve Tack, a product manager for Compuware's APM products, tells me, noting that Compuware is using its APM technology to take around eight billion measurements a day.

Though all the major clouds — Amazon, Microsoft, Google — and some of the minor ones have comprehensive status pages, there are few services that pull together information from multiple providers. 

"There's a benefit of having a neutral provider deliver this information," Tack says. "All the cloud providers themselves will present back information. What makes this unique is it's testing from the user experience... you have all these third-party services coming together at the browser."

"If I'm delivering a web property, I don't want to go to each of my providers [status pages]. I care about the whole experience," he adds.

Other companies have attempted to produce a tool like Compuware's. Cedexis has a range of services that let businesses monitor the performance of clouds, but these all cost money and do not display dependencies.

In 2010, Compuware released CloudSleuth, which showed the latencies and availability of various clouds, but it did not list dependencies or have timeline features. 

Urgent need

In my view, Outage Analyser is a handy tool, though the interface is a bit clunky. There is an urgent need for better information about clouds and cloud interdependencies, and this site is a step in the right direction.

However, until we have more tools like Outage Analyser, it will be difficult to assess the true scale of an outage. This is because the data on Compuware's site is sampled from Compuware APM customers, so a service may not appear at all if none of those businesses access it.

Unfortunately, this creates a complicated situation. One service evaluating multiple clouds needs to be compared with another to get a full picture, and then their results have to be normalised. The layers of complexity just increase. I hope that other companies produce similar tools, so administrators can have an easier time quickly diagnosing cloud failures in the future.
