What ever happened to Dryad?

Written by Simon Bisson, Contributor, and Mary Branscombe, Contributor

Microsoft may be adopting Hadoop and making it run on both Azure and Windows to add 'big data' to big databases in SQL Server, so that you can look at data you don't yet have a useful structure for as well as data you've structured into submission. But it hasn't exactly abandoned Dryad, its own big data project, which it designed to perform massive computations in parallel over large quantities of unstructured data. Microsoft isn't taking the LINQ to HPC tools based on Dryad any further, though, so if you want to work with big data yourself, you'll want to look into Hadoop.

But Dryad itself is alive and well internally. It's what runs Bing, Senior Product Manager Saptak Sen told us last month.

"We run upwards of 40,000 nodes of Dryad in Microsoft. Bing runs on it." Search engines might be the canonical big data problem; extracting information about what Web pages are useful and interesting from the pattern of how people use them. Bing ingests over seven petabytes of new data every month, Sen says.

It's also on a fast update schedule that's more like a startup's than a typical Microsoft team's. For example, according to a job advert last year, the Bing maps and geospatial team ships updates to its production service every two weeks (using the Scrum development process to manage updates).

That experience with big data stands Microsoft in good stead for its work on Hadoop, according to Sen, and will help it improve the project. "Our engineers who worked on the Dryad nodes; some of them are now working for this new data team. We have first hand experience how to scale to that level. Anywhere outside Microsoft, the biggest Hadoop cluster so far is about 4,000 nodes. We are working with the Apache foundation to make some of those architecture exchanges that will take us beyond that in terms of scalability."

And Microsoft's vision for Hadoop is much wider than the simple (or depending on how you look at it, simplistic) MapReduce engine Hadoop implements. "We want to make it flexible enough so that it's not just MapReduce; not all algorithms you can express are efficient to be expressed in MapReduce," Sen points out. "We want to plug in other runtimes as part of this infrastructure."
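
To give a flavour of what Sen means, this is roughly what even a trivial job looks like when it's expressed as MapReduce on Hadoop: the classic word count, written against the standard org.apache.hadoop.mapreduce Java API. It's a generic sketch of the programming model rather than anything Microsoft has built, and the point is that any algorithm which doesn't decompose cleanly into a map step and a reduce step gets a lot more awkward to write this way.

    // Minimal Hadoop MapReduce job: count how often each word appears in the input.
    // A generic illustration of the MapReduce programming model, not Microsoft code.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: the framework groups values by key (the shuffle),
      // then the reducer sums the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }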

He talks about working with partners as well as the Apache Foundation to make that happen. But it's very tempting to wonder if one of those runtimes might turn out to be Dryad…

Mary Branscombe

[Alternate runtimes also imply alternate programming models for NoSQL data on Hadoop: perhaps even a full managed .NET platform, a LINQ for Hadoop where data explorations are written in C# or even C++, rather than JavaScript, and embedded into line-of-business applications; a rough sketch of what that declarative style might look like follows below.

Simon Bisson]
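
As a purely hypothetical illustration of that contrast, here is the same word count again, this time as a declarative pipeline embedded straight in application code. It's plain in-memory Java rather than C#, and it isn't a real Hadoop or LINQ to HPC API; it just stands in for the kind of query a LINQ-style provider over Hadoop data might let a developer write inside a line-of-business application.

    // Hypothetical flavour of a declarative, LINQ-style word count.
    // Plain in-memory Java streams; not a real distributed or Hadoop API.
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class DeclarativeWordCount {
      public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "dryad runs bing",
            "hadoop runs elsewhere");

        // Split each line into words, group by word, count each group.
        Map<String, Long> counts = lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
      }
    }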
