Virgin's cloud failure: Rebuttal and a deeper perspective

Summary: Yesterday's post characterized Virgin Blue airline's downtime as a "cloud" failure. Several informed readers believe that assessment was misguided and limited.


Yesterday's post characterized Virgin Blue airline's downtime as a "cloud" failure. Several informed readers, especially among my colleagues inside the Enterprise Irregulars, believe that assessment was misguided and limited.

Related: Cloud-based IT failure halts Virgin flights

Serial entrepreneur and generally smart enterprise guy, Bob Warfield, was among those who took issue with my post. Given the strength and clarity of his disagreement, I invited Bob to write a guest post, which is presented here.

In this guest post, Bob links success in the cloud to business strategy and people issues.

In comments about Oracle's recent OpenWorld conference, Gartner BPM analyst, Elise Olding, articulates a similar view:

Cloud is a strategic conversation, not just a technical one. Corporate strategy and leadership need to embrace and understand what those changes are and prepare the organization. Internal roles and skills will change and customers will continue to expect more. One of the speakers suggested that the cloud will significantly impact collaboration, decision making and particularly knowledge work. He stated that how fast a company can solve a problem will become a competitive advantage. However, at the conference, the focus feels heavily slanted towards the technology with very little consideration of how to tackle these issues.

The cloud question. Regarding the narrow issue of whether Virgin Blue actually constitutes a "cloud" failure, one Enterprise Irregular noted:

Beware false clouds.

If this was truly a shared cloud, why did the outage affect only one airline? What would [outsourcing provider] Navitaire have done if its system brought down four airlines instead of one? Would Navitaire have designed its system differently?

In this case, at worst, Virgin Blue will leave Navitaire. If Navitaire were a real cloud, many of their customers would be angry and the company would do MORE to avoid such failures.

Outsourcing is not cloud computing. Private clouds are not cloud computing.

Thanks to Bob Warfield for writing this guest post.


Articles about Cloud disasters are popular these days. The Cloud is relatively new and people realize they don't know what they don't know. Although Michael's post on Virgin Blue is informative, I disagree with his perspective.

It's all fine and well to wring our hands about another "Cloud" disaster with a "seamy underbelly", but it is more interesting and useful to explore several deeper questions.

The Virgin Blue incident raises broad questions that go beyond the Cloud. For example, how can you perform proper diligence, and prepare disaster recovery and failover options, when a third party substantially controls your infrastructure?

Viewed from that broader perspective, I don't see Virgin's incident, or the many similar ones, as Cloud issues at all. I don't even think it is a "Hosted" issue, though that question came up in the Enterprise Irregulars' discussion of the Virgin Blue situation, as another fellow Enterprise Irregular, David Dobrin, writes.

I'm coming from the perspective of having dealt with large Enterprise customers who had the software on their own machines in their own data centers, but it was operated by so few of their own employees (Accenture or some other provider was doing the work) that it may as well have been in a Cloud. The company's employees had very little insight into, or control over, what was going on.

Companies that follow such a course can easily find that they have the same problem as Virgin: the system goes down and a third party was at the controls, did the design, set up the software, yada, yada. Tell me you haven't seen situations like that? Heck, I have seen a very prominent SI get thrown out of a Fortune 500 IT account for 2 years over a screw-up that cost a lot of money. That company was not just down for a couple of days; it took nearly 9 months to get that project back on track.

There is a seamier side to outsourcing of any kind that has little to do with the Cloud, hosting, or indeed technology in general.  It's already here and breathing down your neck in IT whether you've even lifted a finger to move into the Cloud.

At a very high level, you've got to ask why the Out-sourcer / Cloud-sourcer / SaaS-sourcer will do a better job than you can.  Maybe it doesn't matter because you can't afford to do the job or don't have the talent, but it still does matter in terms of asking why the vendor you pick can do a better job than the others.

Scale and experience matter a lot in that kind of evaluation too. Many IT departments are faced with doing it for the first time if they do it themselves. Presumably, your vendor is not. Scale means they can afford to do things like build multiple data centers with your data guaranteed to be in more than one of them at all times, something that may be prohibitive for a smaller organization. It means they've had to combat things like DoS and other attacks on a larger scale than your organization may have, though it also means they may be exposed to more of those attacks than you are.

Be that as it may, scale and experience are also not the whole story.  You need to know what decisions they've made on disaster recovery and so on. How do they ensure their people are not looking to steal vital information?

But even after all that is said, and you've done your homework, you are going to be placing yourself in a situation where if an outage occurs (and no matter how good the story is, one will eventually occur), the real problem is that you will feel powerless. You won't have the information about what's going on, the ability to make the decisions, or the ability to individually beat on people to get it fixed that you have when it is your own shop.   That's a psychological factor, by the way, that may have little to do with whether those things matter towards getting past the disaster.

Outsourcing is the ultimate delegation. You'd better be sure you hired the right people, because in the end, it is people, not a Cloud, that you will have to blame if it fails.


Thank you to Bob Warfield for writing this guest post.

Topics: Outsourcing, CXO, Cloud, Data Centers, Enterprise Software, Hardware, Storage, IT Employment

  • RE: Virgin's cloud failure: Rebuttal and a deeper perspective

    Michael and Bob, thanks for adding your perspective. I am among those that don't think this was a true Cloud failure. Virgin Blue specifically mentioned that the app was hosted in Navitaire's data center in Australia and that the fail was due to a solid state storage malfunction. So I think it's pretty clear that at best this is a Private Cloud and at least a poor implementation of it as it's hard to believe the entire app could be brought down by a storage failure unless it was a cascading failure of massive proportions.

    Travel's kind of a unique industry with so much relying on older technology, some of which dates back 50 years. This may not be one of them, but it's not cutting edge either. Much of the travel industry is still getting up to speed on Cloud and how it might impact the industry.

    FYI, I'm a contributing writer for Tnooz, which was Michael's original source for the news item. And today my Part 1 of 2 on the importance of understanding Cloud and SaaS in the segment ran.
    • RE: Virgin's cloud failure: Rebuttal and a deeper perspective


      Yawn. A true fanatic can be identified by one thing. Either denying the facts or trying to redefine what a cloud is. It doesn't matter as to whether it was outsourced, a private or a public cloud - all that differs there is how many people get affected, the weakness is still the same. At least they could call Navitaire directly and get some answers, rather than joining a queue with thousands of other companies trying to get support from a cloud vendor.
  • RE: Virgin's cloud failure: Rebuttal and a deeper perspective

    Well, all the "authors" in this article seem to have said a mouthful of nothing, actually. They even contradict themselves, let alone rebut anything clearly. I'd put clarity here at about 10%, maybe a little less. Sorry, that's my take on it anyway.
  • RE: Virgin's cloud failure: Rebuttal and a deeper perspective

    Great discussion.

    It is not about the cloud. It is about management and diligence. If the HVAC portion of the building construction is subcontracted out and it fails to deliver on-time, we do not immediately assume that all subcontracting is a failure.
  • RE: Virgin's cloud failure: Rebuttal and a deeper perspective

    Who cares - this argument is just semantics. Let's re-arrange the deckchairs on the Titanic whilst we are about it. Whether you outsource to a cloud or to a private cloud or any other sort of cloud, the end result is the same in the event of catastrophic failure. Outsourcing equals loss of control and expertise within the organisation. The old saying is you get what you pay for, and when you pay to rely on a third party to save money on dedicated IT you are totally helpless in the event of a serious problem. As I type, Virgin Blue staff in Brisbane, Australia are having to resort to manually checking passengers in. Remember how travel agents used to write those hard-copy tickets? Well, Virgin is now back in the equivalent of the Stone Age. Those putting all their eggs in the cloud "basket" are always going to have to pray that a serious problem doesn't take out their cloud partner or their link to the cloud for days on end - what does this do to the productivity and reputation of a company that has heavily reliant partners or customers? Ask Virgin!!!
  • RE: Virgin's cloud failure: Rebuttal and a deeper perspective

    In my experience it usually takes a disaster before business continuance is adopted correctly in an organization. Today's technology allows for seamless failover and it should have nothing to do with Cloud computing. Install the right solution and test failover, as technical problems do occur.
    Peter O.
  • Have you noticed the pattern

    Week after week, we read about the FAILURES with "the cloud".

    But week after week, proponents continue to downplay the very obvious reasons it is a failure and continue to rewrite the definition to justify their point of view.
  • The point made, missed

    The salient point in Bob's article is made finally. First, the consideration to outsource comes from either lack of efficiencies, or lack of expertise, from running an internal IT organization. To prevent the former you'll, likely, need to prevent the latter. Most organizations (esp. larger ones) can never prevent the latter, because of the nature of IT personnel and the bell curve. A small, intelligent, reliable (internal) staff is the best way to protect against disaster. What "the cloud" provides, specifically those like Amazon who expose an API, is an architecture that minimizes the number of humans needed to maintain your systems. Underneath, the hardware is the same whether it's in the cloud or in the office closet, and like all hardware it will eventually fail. Cloud computing is a service, not an infrastructure panacea, and needs to be understood sensibly. Virgin should have recognized this.
  • RE: Virgin's cloud failure: Rebuttal and a deeper perspective

    As one said - 'a mouthful of nothing'

    Whether Cloud or Hosted App failure, Virgin Blue still had a huge outage - massively affecting staff and customers.

    Their Business Continuity/DR plans need torn apart and rebuilt.
  • It's about more than just cheap hosting...

    Having read the original article, the rebuttal and the comments above, I see many consider "cloud" to be synonymous with "hosting" (public, private, whatever).

    From this belief we are led into the main debate about accessibility (or lack thereof) of the underlying hardware when something goes wrong, and ultimately the conclusion that cloud is bad because you can't kick someone's arse and send them into the fridge with a spanner and a spare HD to get the thing running again.

    The cloud is so much more than an alternative hosting solution. If it were just that then yes, the above arguments are valid and probably reason enough to not trust it with LOB applications.

    But it's not. It presents a new (ok - not new, but newly affordable) way of building resilient and scalable IT solutions from commodity hardware (albeit virtualized). Amazon's line isn't "Host your old IT infrastructure in our cloud, it never fails". It's "Everything fails, design for failure and you never will". I recommend anyone with a passing interest in architecting applications for The Cloud to check out a couple of presentations on SlideShare.

    On demand servers, elastic scaling, programmable load balancers, message queues, distributed caches, CDNs, monitoring, geographically diverse yet logically combined physical infrastructures... all of these goodies are the building blocks of the next generation of resilient and scalable server applications that will consume hardware and expect it, even command it to change, rather than rely on it to remain constant and then throw a wobbly when something breaks.

    Of course in travel we rely on a lot of old IT, and we can't just rebuild everything today to take advantage of this. But to see The Cloud as nothing more than a cheap hosting solution and then get upset when it goes wrong is not evidence that The Cloud has failed. It's evidence that someone has failed to design for failure - and that's inexcusable when the Cloud APIs are there to monitor, back up, replicate, shut down, reassign, start up, unassign and scale up for pennies.
  • Single-tenant dedicated hosting is ASP, not cloud

    Single-tenant hosting of a client-specific (likely customized and on an old version) application by a third party is not cloud, it's ASP which died a deserved death a decade ago because it doesn't offer client benefits in cost of ownership, agility, or high availability.

    While economies of scale and standardization are critical elements of the cloud, the most important factor for high-availability is operational knowledge. A cloud/SaaS provider develops many times the operational knowledge any single client (or ASP outsourcer) can hope to acquire, and focuses it all on a single standardized technology stack and single most-recent version of the application software.

    Did Navitaire write the application software? Do they run it for many clients on the same technology stack? Did they even get to select the technology stack vs. inheriting it from Virgin? Do they continuously invest in stability/performance/availability innovations in the software or the infrastructure because it would improve the client/user experience across their entire client base? The answer to all these questions is likely no, and that's the real reason why this 21-hour downtime is due to limitations of single-tenant ASP hosting rather than a failure of the cloud.
    John F. Martin
  • So Tell Me If I Got This Right

    It's 'Cloud Computing' until it fails, then it's not.