Here are just some of the technical challenges ahead before the cloud hits prime time.
Lessons from supercomputing
So far we know that the following things are likely to happen: there will be larger clouds. Some of these clouds will link to others. Many services that businesses consume will sit on top of clouds. Software will be much, much larger.
The question is how companies can deal with this and what challenges they're going to face in getting these technologies up and running. In particular, operating at cloud scale means there will be more hardware and software failures, and dealing with them will be an important issue.
To get an idea of the type of failures that cloud companies will be forced to deal with, it's helpful to look to supercomputing — an area that uses many of the technologies and methods that eventually make their way into the cloud.
"You're scaling up the number of cores, so the number of failures — hardware and system software failures — go up dramatically," says Richard Kenway, head of the Scientific Steering Committee of PRACE — a European scheme that aims to pool the resources of various supercomputers across the region to create a system capable of exascale computation. "I've heard claims it could be an undetected failure every few minutes."
This means new software systems will have to be developed to help deal with the likelihood of minute-to-minute failures in the underlying infrastructure.
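To make the scaling argument concrete, here is a rough back-of-envelope sketch in Python. It assumes component failures are independent and occur at a constant rate, so the system-wide mean time between failures is simply the per-component figure divided by the number of components. The function name and the 50-year figure are illustrative assumptions, not numbers from PRACE.

```python
def system_mtbf_hours(component_mtbf_hours: float, components: int) -> float:
    """Mean time between failures anywhere in the system.

    Assumes independent components with a constant failure rate,
    so the individual failure rates simply add up.
    """
    if components <= 0:
        raise ValueError("components must be positive")
    return component_mtbf_hours / components


# Hypothetical figure: each core fails once every 50 years on average.
core_mtbf = 50 * 365 * 24  # hours

for cores in (1_000, 100_000, 10_000_000):
    minutes = system_mtbf_hours(core_mtbf, cores) * 60
    print(f"{cores:>10,} cores: one failure every {minutes:,.1f} minutes")
```

Under these assumed figures, a machine with ten million cores would see a failure roughly every 2.6 minutes, which is the order of magnitude Kenway describes.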
The core failure rate will be compounded by a reduction in memory bandwidth to each core, Kenway says, as he expects the number of cores to grow faster than bandwidth does. This will add to the software issues that programmers face, and could force them to think about new ways of developing software.
New ways of thinking about software development
What separates the cloud from supercomputers are the types of hardware choices that need to be made. With supercomputers, huge emphasis is placed on single-threaded performance, so we can expect these systems to focus on high-end chips (Xeons, for example), while clouds care more about 'dumb' workloads that require less single-threaded performance.
This means less capable — but lower-power — chips will start to make inroads into the cloud, hugely increasing the number of chips stewarded by clouds and causing a shift in software development strategies.
By 2020, Fujitsu Technology Solutions' chief technology officer, Josef Reger, expects the needs of the cloud will favour low-power chips with many cores. But that brings its own complications.
Companies will need to standardise infrastructures and bring their application development in line with the chips they use. They will also need to program their cloud operating systems to be much more parallel to deal with the memory crunch.
These two issues, combined with the larger scale at which clouds are likely to operate, could cause headaches for developers.
Though new networking technologies — faster interconnects, better on-chip communication, and so on — will go some way toward speeding the pace at which updates can ripple through a system, challenges will remain, especially around automating the update process within large applications.
Need for management efficiency
But there's not much point having these huge clouds running off low-power chips if silly mistakes mean you do not get as much efficiency out of them as you could, according to Facebook's VP of hardware design and supply chain, Frank Frankovsky.
"One of the big orchestration layer challenges that I think is interesting to me or anyone running a large-scale datacentre is not how you build efficient infrastructure, but how do you operate it efficiently?" he says.
Frankovsky believes capping and managing power consumption is an area that needs more investment. And while vendors today solve these issues at the server or rack level, it will take an industry-wide open initiative to solve them for large datacentres, he says.
Facebook has embarked on the Open Compute Project, which hopes to standardise the chunks of infrastructure that go into the datacentre, to make life easier for people in charge of maintaining cloud datacentres.
Though hardware standardisation leads to an increased emphasis on software, which has benefits for management, the proliferation of hardware as clouds grow will mean serviceability will become ever more important.
"Because there's tens of thousands of devices, the ability for the technician to identify the faulty device, replace the component and get it back into operation is a really important part of operations at scale," Frankovsky says.
Vendors are already working to solve these problems. Some HP servers come with a technology HP has called a "sea of sensors" that lets them self-diagnose problems and specific equipment failures, while Facebook has created various software agents that let them reconfigure servers over the network, without having to physically go and manipulate them to modify their BIOS.
It seems likely this scheme will gain traction. Adrian Cockcroft, Netflix's cloud architect, advocates exactly the same open source-esque hardware approach that Facebook has called for. He says when Netflix decided to build its own content distribution network (CDN) to make sure its online video service worked smoothly for users, it designed its own hardware as well.
Making sure everyone talks to everyone else
As with any technology, a lot of the true problems could come in implementation. Even large providers with a wealth of experience can struggle to deal with their scale.
As Bryan Ford, a Yale academic who researches cloud stability, puts it: "Such risks are still extremely challenging, even with one organisation, because of the standard 'left hand doesn't know what the right hand is doing' issue."
Cloud computing brings with it a whole new set of applications that will sit on multiple tiers of cloud infrastructure, so there will be a need for communication between all the parties involved in any one cloud. This is an organisational issue as much as a technological one, and some companies are already trying to solve it.
Netflix, for instance, has moved to a "no-ops" development strategy where each developer is responsible for their own code and making sure it can deal with failures in any of the other bits of code it talks to. This strikes Cockcroft as the best way to avoid problems of interdependency at scale.
"All of our components are designed to assume that the things around them will fail and they have to keep working," he says. "At any point in time there's a fair amount of stuff that's probably broken that customers don't notice because the service routes around [it]. I think probably in two years time you'll see that become relatively mainstream."
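The idea of a component that assumes its neighbours will fail can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's code: the helper `call_with_fallback` and the two example services are hypothetical names invented here. The caller tries its preferred dependency a couple of times and, if that keeps failing, serves a degraded but still useful answer instead of an error.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       attempts: int = 2) -> T:
    """Try the primary dependency a few times, then degrade gracefully."""
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            continue  # assume the dependency is broken and try again
    return fallback()


# Hypothetical services, for illustration only.
def personalised_recommendations() -> List[str]:
    raise ConnectionError("recommendation service is down")


def popular_titles() -> List[str]:
    return ["Title A", "Title B"]


rows = call_with_fallback(personalised_recommendations, popular_titles)
print(rows)
```

Here the customer still sees a list of titles even though the personalised service is down, which is the "service routes around it" behaviour Cockcroft describes. A production system would likely add timeouts and stop retrying a dependency that is known to be unhealthy.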
The thorny problem of multi-tenancy
Related to the problems of communication, both from an organisational and a technical perspective, is the challenge of developing true multi-tenant software applications.
As recent outages at Amazon have shown, developing this technology is non-trivial, even for the world's largest public cloud company, and it's going to get more difficult as more organisations consume multiple applications from multiple clouds.
John Manley, who runs HP Labs' automated infrastructure lab, says "the challenge ahead is going to be the production of true multi-tenant software by a software author."
This is because, as multiple organisations access a single piece of software from one cloud, the developer needs to make sure that data is kept separate and that chargeback is handled effectively.
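The two requirements, data separation and chargeback, can be illustrated with a toy Python sketch. The `TenantStore` class, its method names, and the flat per-operation price are all assumptions made for illustration; real multi-tenant software would also need authentication, persistence, and far more careful isolation.

```python
from collections import defaultdict


class TenantStore:
    """Toy multi-tenant key-value store with per-tenant metering."""

    def __init__(self) -> None:
        self._data = defaultdict(dict)  # tenant id -> that tenant's records
        self._ops = defaultdict(int)    # tenant id -> billable operations

    def put(self, tenant: str, key: str, value: str) -> None:
        self._data[tenant][key] = value
        self._ops[tenant] += 1

    def get(self, tenant: str, key: str) -> str:
        self._ops[tenant] += 1
        # Lookups never leave the caller's own namespace; a key that
        # belongs only to another tenant raises KeyError.
        return self._data[tenant][key]

    def invoice(self, tenant: str, price_per_op: float) -> float:
        """Charge back usage at a flat, hypothetical price per operation."""
        return self._ops[tenant] * price_per_op


store = TenantStore()
store.put("acme", "report", "Q3 figures")
store.put("globex", "report", "merger plan")
print(store.get("acme", "report"))   # acme sees only its own record
print(store.invoice("acme", 0.5))    # two operations billed to acme
```

Both tenants use the same key in the same piece of software, yet neither can read the other's record, and each is billed only for its own operations.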
At the end of every session the instance needs to be torn down and returned to the compute pool, Manley says, which can be tricky if companies are renting software for a long period of time.
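One way to picture the tear-down step is as a session that always cleans up after itself, even when the tenant's work fails. The Python sketch below is an illustration under assumed names (`InstancePool`, `session`, `_teardown`), not a description of HP's approach; the actual scrubbing of tenant state is reduced to recording which instance was cleaned.

```python
from contextlib import contextmanager


class InstancePool:
    """Toy compute pool that reclaims an instance when a session ends."""

    def __init__(self, size: int) -> None:
        self._free = [f"instance-{i}" for i in range(size)]
        self.scrubbed = []  # record of torn-down instances, for illustration

    @contextmanager
    def session(self):
        if not self._free:
            raise RuntimeError("pool exhausted")
        instance = self._free.pop()
        try:
            yield instance
        finally:
            # Runs whether the session ended normally or with an error.
            self._teardown(instance)
            self._free.append(instance)

    def _teardown(self, instance: str) -> None:
        # Placeholder for wiping tenant data and resetting configuration.
        self.scrubbed.append(instance)

    def available(self) -> int:
        return len(self._free)


pool = InstancePool(size=2)
with pool.session() as instance:
    print(f"working on {instance}, {pool.available()} left in pool")
print(f"{pool.available()} available after the session")
```

The difficulty Manley points to shows up here as a session that stays open for months: the instance is never returned, so the pool cannot reuse it for other tenants.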
Being open is the key, but will vendors resist?
Taken together, all the barriers to the cloud can be solved if industry adopts two things: standards and full technical disclosure. Standards will make it easier to manage software and hardware at scale, while full technical disclosure will stop interdependencies causing problems.
But enterprise incumbents are going to fight a pitched battle against both of these things. Standards beget commoditisation and have the potential to cut into the juicy profit margins that major vendors can charge for their proprietary technology. On top of this, full technical disclosure of cloud architectures should make it easier for start-ups to enter the market with technology on a par with incumbents, further disrupting the market.
Some vendors have acknowledged this threat; HP and Huawei have joined Facebook's Open Compute Project, though they continue to try to carve off bits of the open technology and use them for their proprietary endeavours.
If the past is anything to go on, between now and 2020, we can look forward to a raft of conflicting standards from interested parties — and regular quibbling over exactly how open any one vendor needs to be.