What do recommendation engines for Amazon and Netflix have to do with better cloud computing? Thanks to a ground-breaking system from Stanford University, one is inspiring the other.
Most servers use around 20 percent of their capacity when running typical workloads. There are several reasons for this, the main one being that cloud service users tend to overestimate the amount of compute they'll need. Another is the inefficiency introduced as workloads are passed between newer and older physical processing cores or hardware. Changes to applications also affect efficiency: a code rewrite may impose a bigger load on the server. On top of all this, other applications that share the same hardware may interfere with performance.
Christos Kozyrakis, associate professor of electrical engineering and computer science and a member of Stanford's multi-scale architecture and systems team, said: "Everyone has the problem of underperforming servers."
Datacentres today are the base stations for countless processes and computations running all over the networked world at once. The better we can schedule application workloads in shared environments, the more everybody benefits.
Whether it's Google, Amazon, or your own infrastructure, those workloads are spread across different infrastructures, different locations and sometimes even different providers all at different times, according to their needs.
"How do you decide how many resources they need and which resources you give to each application?" Kozyrakis said. "What we tried to do is figure out all this critical information that lets us do a good job of that."
The secret is that every computation you run on a server will perform better under some circumstances than others (newer cores, higher bandwidth back to base, low data burst and so on), and the way to exploit that is to figure out the particular parameters under which your application runs best.
"You want to run it on every kind of machine you have, with every amount of interference possible and every scale factor to see what happens," Kozyrakis explained. "But to do that, you'd have to run it a few thousand times, which is obviously stupid."
Instead, the Stanford system, called "Quasar", samples a short glimpse of the program in action (often just a few milliseconds) and looks for similarities in other workloads it's already seen. When it has a few matches, the system directs the new application to the best possible infrastructure and scheduling based on the informed guess about how it will perform.
If the above description makes the whole process sound a little hit and miss, think of the way heuristic antivirus works by scanning the code of incoming files. If something looks a little too much like something in the database that's already been identified as a "cybernasty", it's flagged for checking.
Quasar does something similar, but instead of scanning the actual code of an incoming application, it fires it up for long enough to see how it will behave, then checks against a repository of knowledge to find matches.
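To make the idea concrete, here is a minimal Python sketch of matching a short sample against workloads seen before. It is an illustration only, not Quasar's actual code: the workload names, the three-number profiles and the server labels are all invented, and a plain nearest-match lookup stands in for whatever matching Quasar really performs.

```python
import math

# Hypothetical repository of workloads already seen. Each profile is a tuple of
# measurements from a brief run (for example CPU, memory and network pressure).
# All names and values are invented for illustration.
KNOWN_WORKLOADS = {
    "web-frontend": {"profile": (0.2, 0.1, 0.7), "best_server": "small-many-cores"},
    "analytics-batch": {"profile": (0.9, 0.8, 0.1), "best_server": "large-memory"},
    "cache-service": {"profile": (0.3, 0.9, 0.4), "best_server": "high-bandwidth"},
}


def distance(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_workload(sample_profile):
    """Return the name of the known workload most similar to the sample."""
    return min(
        KNOWN_WORKLOADS,
        key=lambda name: distance(sample_profile, KNOWN_WORKLOADS[name]["profile"]),
    )


def recommend_server(sample_profile):
    """Suggest the server type that worked best for the closest known workload."""
    return KNOWN_WORKLOADS[closest_workload(sample_profile)]["best_server"]


if __name__ == "__main__":
    # A new application is run briefly and yields this (invented) profile.
    print(recommend_server((0.85, 0.75, 0.15)))
```

The new application is never run on every machine type. One short sample is enough to borrow the placement that suited its nearest neighbour.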
"In the experiments we've done, we've increased utilisation from 20 percent up to 60, 70, and in some cases 80 [percent]," Kozyrakis said. He hastened to add that raising utilisation on its own isn't difficult — the trick is whether you can do it while maintaining good application performance.
"How does it perform with more cores or more memory? How well does an application run when you schedule it on the same machines as others? If you know this stuff, you can do a good job of scheduling it."
How is that similar to recommendation engines that try to sell you books or movies? As Kozyrakis put it: "There are similarities between people and it's the same for applications." Netflix doesn't wait for you to watch its whole catalogue of comedies before it realises you like comedies and recommends them; it compares your viewing habits to thousands of other users. If enough people with a similar viewing profile to you end up watching a lot of comedies, the system takes a guess that you'll enjoy them too.
Scale that up to countless processes in datacentres, and Quasar can take an at-a-glance look at an application, check the server configurations that give similar applications the best performance, and send that application to similar infrastructure in the environment.
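The recommendation-engine analogy can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method the Stanford team uses: the applications, configuration names and scores are invented, and a basic similarity-weighted average stands in for a real collaborative filtering algorithm. A new application is measured on one configuration, and its scores on the others are estimated from applications that behaved similarly.

```python
# Performance scores (higher is better) of known applications on each server
# configuration. All names and numbers are invented for illustration.
HISTORY = {
    "app_a": {"cfg_small": 0.9, "cfg_medium": 0.7, "cfg_large": 0.4},
    "app_b": {"cfg_small": 0.8, "cfg_medium": 0.8, "cfg_large": 0.5},
    "app_c": {"cfg_small": 0.2, "cfg_medium": 0.5, "cfg_large": 0.9},
}


def predict_scores(sample):
    """Fill in scores for unmeasured configurations from similar applications.

    `sample` maps the configurations the new application was briefly run on
    to the scores it achieved there.
    """
    # Weight each known application by how closely it matches the sample
    # on the configurations they share (a smaller gap gives a larger weight).
    weights = {}
    for app, scores in HISTORY.items():
        shared = [cfg for cfg in sample if cfg in scores]
        if not shared:
            continue
        gap = sum(abs(sample[cfg] - scores[cfg]) for cfg in shared) / len(shared)
        weights[app] = 1.0 / (gap + 0.01)  # 0.01 avoids division by zero

    predicted = dict(sample)  # measured scores are kept as they are
    all_configs = {cfg for scores in HISTORY.values() for cfg in scores}
    for cfg in all_configs:
        if cfg in sample:
            continue
        voters = [app for app in weights if cfg in HISTORY[app]]
        total = sum(weights[app] for app in voters)
        if total > 0:
            predicted[cfg] = sum(weights[app] * HISTORY[app][cfg] for app in voters) / total
    return predicted


def best_config(sample):
    """Return the configuration with the highest measured or predicted score."""
    scores = predict_scores(sample)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # The new application scored poorly on the small configuration, much like app_c.
    print(best_config({"cfg_small": 0.25}))
```

Just as a viewer is never asked to rate the whole catalogue, the application is measured once and the remaining scores are inferred from its closest matches.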
In the wild
At first glance, it seems the market for such a system would be cloud service providers themselves — if only to assure themselves of the highest possible utilisation while maintaining the best performance for customers.
Kozyrakis was quick to point out that he can't see behind the curtain at any of the large services to comment on whether Quasar would be attractive to them. He doesn't think the usual players do anything similar in public cloud services, and the major providers we contacted for this story declined to comment either on how much server utilisation they achieve, or on whether something like Quasar would benefit them.
It also sounds like the kind of advance that might generate a revenue stream among today's cloud service system integrators or "cloud brokers". Such a service could take your cloud computing needs, spread them across multiple services or providers, and not concern you with where your work's being done, just that it's getting the best infrastructure for its needs.
Kozyrakis said his team has considered the market for whatever form Quasar takes. For one thing, if you're using two datacentres or a private and public environment, it's a real challenge figuring out the right way to overflow from one to the other.
But there's an even better way to look at it — approaching your data needs from the point of view of the application rather than the infrastructure.
"The big departure for the user is that typically, when you go to a datacentre, you tell them what resources you need to run the application," Kozyrakis said. "That's where people typically get conservative. They ask for too much to be on the safe side."
Instead, Quasar asks the user what performance they're hoping for, then figures out the best use of resources needed to run it based on similar candidates it already knows. It's just like Netflix figuring you'll like a new thriller because other people with similar viewing histories rated it highly.
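The shift from asking for resources to asking for performance can be illustrated with a small Python sketch. It assumes, hypothetically, that the scheduler already holds a predicted throughput for each possible allocation. The figures and the function name are invented, and this is not Quasar's interface.

```python
# Hypothetical predictions for one application:
# (cores, memory in GB) -> requests per second. Values are invented.
PREDICTED_THROUGHPUT = {
    (2, 4): 400,
    (4, 8): 900,
    (8, 16): 1600,
    (16, 32): 2100,
}


def smallest_allocation(target_rps):
    """Return the smallest allocation predicted to meet the target, or None."""
    candidates = [
        allocation
        for allocation, rps in PREDICTED_THROUGHPUT.items()
        if rps >= target_rps
    ]
    if not candidates:
        return None  # no known allocation is predicted to reach the target
    # Tuples compare by cores first, then memory, so min() picks the leanest fit.
    return min(candidates)


if __name__ == "__main__":
    # The user states a goal (1,000 requests per second), not a machine size.
    print(smallest_allocation(1000))
```

A user playing it safe might have asked for the largest allocation. Choosing from a performance target avoids that over-provisioning.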
Kozyrakis and his team are still fine-tuning the system, and he said the next step is to figure out how to commercialise it. He's not sure whether it would work independently as a private cluster management tool or as a service layer on top of a product like Azure or AWS.
But whatever form Quasar takes in the commercial world, the chance to triple the utilisation of your datacentre is enough to make anyone sit up and take notice.