What to do if your supercomputing supplier fails

High-performance computing operates at extremes of technological innovation and risk. That combination makes supplier failure a distinct possibility, but it needn't be the end of the world, says supercomputing expert Andrew Jones.
High-performance computing (HPC) delivers some of its greatest benefits because niche providers, or niche divisions of large suppliers, push the limits of technology and business. But rapid technological innovation, the shortage of HPC experts and the need for providers to operate at times near the limits of the possible mean the risk of suppliers disappearing is inevitable.
Users of HPC have always worried about supplier viability; those concerns are not new to the present economic situation, nor will they cease when things improve. The trick for the customer is to realise that supplier failure doesn't matter, provided it is properly managed. Indeed, turnover in HPC providers is a consequence of the innovation and technology-chasing that helps make the industry so potent in the first place.
Stable vendors only, please
I am fortunate enough to work for a healthy HPC company, but not all HPC suppliers are in that situation. Some appear to be perpetually on the brink. Many more are moving closer to it because of the downturn.
Customers naturally worry whether a prospective supplier will stay in business for the duration of a contract. That worry often translates into an overemphasis in the procurement process on the long-term financial position of suppliers, rather than on the overall business relationship and the technology. Bidders are often made to jump through far more financial hoops than in other areas of procurement.
But the reality for most HPC services is that the most likely risk to smooth delivery will come from technology — especially relating to service installations or refreshes — or from people, especially towards the end of a contract.
Even in cases where most of the hardware and integration work will come from a single risky supplier, the main question is usually "Will they survive long enough to complete my installation and establish a stable service?" rather than "Will they survive the whole service lifetime?"
Alternative sources of support
That focus on the installation phase makes sense because the technology is usually based on commodity hardware, perhaps with a specific differentiation, as discussed in my last column for ZDNet UK. Being aware of alternative sources of support for this technology is both good competitive business practice and sensible risk management.
Indeed, the concern should be about committing to a single source of technology support, rather than about supplier viability. And in any case, many people would contend that any piece of hardware is obsolete within a few months of installation.
The key message is that, with proper risk management, the failure of an individual supplier in one part of the overall provider mix is no more or less of a threat to service delivery than the many more likely technology risks. Those range from roadmap and product delivery delays, through architecture limitations and component failures, to integration issues.
OK, sometimes it matters
For some aspects of the service, the end-user organisation will have concerns throughout the contract, and not just during the phase of creating a stable operation. For example, user and application support require a sustained relationship between customer and provider to be most effective.
The risk here is as much about people leaving as companies going bust. Indeed, the movement of people in the HPC industry, especially between the big players, is probably a greater threat to continuity and quality of service than the risk of a smaller dedicated provider disappearing.
One obvious risk-mitigation strategy if a smaller provider disappears is to hire its recently redundant staff to work for you directly, rather than through the managed service you would ideally have. That is not perfect, but it is a good backup plan.
Clearly, though, for these aspects the relationship over the service lifetime is core to the quality of service to users, so a stronger emphasis should be placed on the track record and viability of the prospective supplier.
The personnel risk can also be managed by selecting a provider that can deliver support or other services, such as technology or procurement consulting, with a sufficient depth of staff. But some turnover of staff is good, because it can help with innovation and in sustaining the enthusiasm of the service provider.
Risk makes it possible
The inherent instability of the HPC industry — some companies operating near the edge of viability, others with good margins, new companies emerging, some disappearing, different people coming to the fore, and of course the technology race — is what makes HPC able to offer such an edge to its users' business.
In the HPC industry, as elsewhere, reward and risk are closely intertwined. Indeed, clichés are often simply well-repeated truths — "risk brings rewards", or "nothing ventured, nothing gained". Not to mention "cover your a**".
As vice president of HPC at the Numerical Algorithms Group, Andrew Jones leads the company's HPC services and consulting business, providing expertise in parallel, scalable and robust software development. Jones is well known in the supercomputing community. He is a former head of HPC at the University of Manchester and has more than 10 years' experience in HPC as an end user.