Cloud-based IT failure halts Virgin flights


Summary: An IT failure disrupted travel for 50,000 customers of Virgin Blue airlines in Australia. The situation offers lessons for business buyers of cloud services.

TOPICS: CXO

A catastrophic systems failure at cloud-based software provider Navitaire, a business process outsourcing (BPO) unit of Accenture, disrupted travel for 50,000 customers of Virgin Blue airlines in Australia. The situation offers important lessons for buyers of cloud-based outsourcing services.

Related: Virgin's cloud failure: Rebuttal and a deeper perspective

Virgin Blue provided details in a press release:

Navitaire is the supplier of Virgin Blue’s reservation and distribution software platform and also hosts that platform on its own server infrastructure at a data centre in Sydney.

At 0800 (AEST) yesterday the solid state disk server infrastructure used to host Virgin Blue failed resulting in the outage of our guest facing service technology systems.

We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful and also contributed to the delay in initiating a cutover to a contingency hardware platform.

The service agreement Virgin Blue has with Navitaire requires any mission critical system outages to be remedied within a short period of time. This did not happen in this instance. We did get our check-in and online booking systems operational again by just after 0500 (AEST) today.

Navitaire has given us an assurance that they are thoroughly investigating all circumstances which led to the hardware device failure and the delay in getting an alternative platform up and running. They have given an undertaking to get a full report to us as soon as possible.

According to travel technology website Tnooz, Virgin Blue recently transitioned from Navitaire's Open Skies platform to the same company's New Skies system. Navitaire's website describes New Skies:

New Skies is a comprehensive airline passenger sales and management solution providing capabilities for integrated Internet booking, call center reservations, travel agency global distribution connectivity, inter-airline and alliance code-share itineraries, real-time reporting, ancillary revenue generation and departure control.

A Navitaire representative promised to contact me with additional details, but never did.

STRATEGIC ANALYSIS

The Virgin Blue situation raises several key issues for business buyers of cloud-based services:

  • SaaS buyers cannot always ascertain all details of an outsourcing vendor's capability to withstand unexpected problems.
  • This case clearly demonstrates that contracts and SLAs alone do not offer sufficient protection against downtime in a SaaS environment. Virgin Blue's problems illustrate the importance of performing thorough due diligence before selecting a cloud supplier.

I asked Phil Fersht, CEO of top BPO and sourcing analyst firm Horses for Sources, for his view:

This incident highlights the advantages of using a single provider to manage both the business processes and related IT services within a cloud-based business services model. The Navitaire team responded relatively quickly to solve the problem, without Virgin having to deal with multiple points of blame. These things happen all the time; at least Virgin has a "single throat to choke."

Implications for buyers. Outages are an unpleasant reality in both the on-premise and cloud worlds.

To uncover potential Navitaire system weaknesses in advance, Virgin Blue would have needed to perform extraordinary and impractical levels of due diligence, digging deeply into Navitaire's technology, policies, and training procedures. Even then, it is unclear whether Virgin could have anticipated this particular point of failure.

Buyers of mission-critical outsourcing services should consider developing their own plans and procedures to handle external failures. In the end, process redundancy is the best form of failure prevention. Virgin Blue reverted to a poorly executed manual system to handle the outage, which caused the extended inconvenience its customers experienced.

While not excusing Navitaire, we must recognize that all parties have a responsibility to plan and prepare for predictable, and even inevitable, failures.

Update 9/27/10, 1:30 PM ET: A few readers question whether this is actually a "cloud" situation or merely traditional outsourcing. Accenture, Navitaire's owner, titles the Navitaire web page, "Navitaire: Cloud computing for airlines: Accenture." Accenture also lists Navitaire under its Cloud Services set of offerings. All this raises questions around what "cloud" actually means. It's a tough question without an easy answer.

Update 10/14/10, 7:30 AM ET: CIO Magazine reports that the outage cost Virgin Blue $15-20 million.



Talkback

11 comments
  • Accenture -- that's most of the problem

    Having worked for Accenture under its previous name, i.e. Andersen Consulting, I can tell you with a high degree of confidence that it is an organization that is long on sales but short on delivery.

    But that's par for the course when dealing with the large consulting houses. Sun's Scott McNealy used to quip about IBM's Consulting Services draining your wallet all the time.

    I'll also point out IT failures that have previously appeared in this blog:

    http://www.zdnet.com/blog/projectfailures/texas-warns-ibm-on-failed-data-center-consolidation/10370?tag=mantle_skin;content

    http://www.zdnet.com/blog/projectfailures/marin-county-abandons-30-million-erp-failure/10905?tag=mantle_skin;content

    When you have a large fat organization incapable of managing a cadre of vendors to come up with an effective IT solution but instead said fat organization relies on another fat organization to "come up with a solution" you have a situation of the blind leading the blind and the outcome isn't good.

    Don't believe me? Imagine you're a major stock exchange and:

    http://www.theregister.co.uk/2009/11/26/lse_crash_again/

    The end result of that (where Accenture played a part) was:

    http://blogs.computerworld.com/14876/london_stock_exchange_dumps_windows_for_linux

    -M
    betelgeuse68
  • haha solid state disk failure.... well duh.

    These things are not made for enterprise level yet. Standard users.. yea... not enterprise and most certainly NOT data storage.

    Such fail. I imagine someone lost their job and some people are going to be running test simulations for a while.
    Been_Done_Before
  • RE: Cloud-based IT failure halts Virgin flights

    @Michael, since when have we started using the word 'cloud' to describe every form of managed hosting under the sun? ... oh, I get it, every managed hoster wants to jump on the cloud bandwagon and use it in their marketing spiel. But that doesn't mean we have to uncritically accept their self-description and reuse it in our headlines, does it?

    What unique elements of cloud are implicit in this solution consumed by Virgin Blue:
    # Virtualization? Not mentioned, but the stack seems to be tied to a specific solid-state disk implementation so there seems to be a lot of hardware dependency involved. So not even a cloud architecture at the foundation layer.
    # Scalable automated provisioning? Well the DR certainly wasn't automated was it. Furthermore, this is a BPO contract so I imagine Virgin's commitment is pretty fixed rather than pay-as-you-go.
    # Multi-tenant? How many airlines' reservation and check-in systems fell over when the system went down? Oh, one. Not multi-tenant then.

    Just because clouds run on boxes, it doesn't mean that every outsourced box is a cloud.
    phil wainewright
    • RE: Cloud-based IT failure halts Virgin flights

      Phil, great points and thanks for commenting.

      However, you don't call it cloud and yet the vendor does. Who should we accept as the reference and does it even matter?

      Why not let the customer choose whatever architecture / features / vendor they want, and put definitions aside?
      mkrigsman
  • RE: Cloud-based IT failure halts Virgin flights

    It simply amazes me that a single disk failure can bring down such a business critical function. Ever hear of RAID, or mirroring? You talk about Virgin doing some due diligence, but it would require such a detailed level to really assess the vendor. It doesn't take much due diligence to detect unprotected DASD in production designs.

    This kind of failure just doesn't happen in the truly "Enterprise-Class" data center environment. I ran a mainframe environment for a top Wall Street trading client (we were an outsourcing provider). Their disk had RAID-1 mirrors, kept two synchronously replicated copies on local RAID-5 disks at all times, plus made an additional two copies through asynchronous replication to a separate site 1500 miles away. If I lost my entire data center, they were completely back up in 2-2.5 hours with a maximum of 30 seconds of data loss from the primary site.

    You mentioned Cloud. I "sort-of" agree this has nothing to do with cloud. However, it is a relatively inexperienced "service provider" similar to the start-ups and "want-a-be's" that we are seeing in the cloud game, none of which have the maturity required to run an enterprise-class mission critical environment. This includes the Googles, Amazons, MS', etc. Recently, even Microsoft had a major outage. They admitted it was due to a change - a change that was made mid-week (Monday), mid-day, and, oh, by the way, they did not have a backout plan for the change, nor a failover capability. In any major enterprise, you would say "Terminated for Cause"! That's assuming the control mechanisms in place in the enterprise even let someone close enough to production to make such an ill-advised change.
    Ken Cameron
  • Shall I whip out the violin and tears now?

    A lot of us have been mentioning the downsides to 'cloud' practices quite often.

    So, when we say "Told ya so", it's not out of vitriol. I think the colloquialism is: "Jus' sayin'."
    HypnoToad72
  • Great article

    Great article, Michael. You bring up the dark side of cloud solutions, which people don't talk about enough. Those dark sides can be mitigated via your suggestions (SLAs, etc.), but too many CIOs think the cloud is a magical world with little cost and risk.
    Eric Kimberling
  • RE: Cloud-based IT failure halts Virgin flights

    Not unlike the T-Mobile/Sidekick/Microsoft outage of 09-Oct-2009.

    http://en.wikipedia.org/wiki/Microsoft_data_loss_2009
    stevej098
  • Fog based computing

    This is where cloud computing becomes Fog Computing. Old fashioned computing stretched to the point where due diligence or intelligent trust become impossible. So, "The system is safe*" ... because the SLA says it is! Or "The system is backed up" ... to a facility some bloke I never met says is OK. *When you can't precisely say for what values of the word "safe" and whose balls are on the line, you are involved in Fog Computing.
    techrepublicaaa20
  • RE: Cloud-based IT failure halts Virgin flights

    In my view the author is being so naive and talking about the business issue (strategy, vendor selection, SLA) rather than the actual tech issue! Do you think Virgin would not have done the due diligence while selecting their vendor?

    @Ken has raised some meaningful questions and we need to wait for the report to come to make/blame any comment/one entity.
    JMR_Chennai
  • RE: Cloud-based IT failure halts Virgin flights

    I have to agree with Phil above that this doesn't even seem to be a true cloud implementation. At least there was no mention of any cloud capabilities such as auto-provisioning, virtualization, etc. Just because Accenture 'markets' it as cloud doesn't mean it is. Rather this is an example of poor technology management and lack of an adequate DR/BC plan for a business critical application.
    app1rtb