Hadoop creator: 'Google is living a few years in the future and sending the rest of us messages'

The co-creator of the Hadoop distributed computing platform on how Google's systems of today are shaping the business systems of tomorrow.
Written by Nick Heath, Contributor

Want to understand the type of systems global businesses will be using in five years? Look at the technology used by Google today.

Enterprise has a history of riding in Google's slipstream. It was in 2004 that Google revealed the technologies that inspired the creation of Hadoop, the platform that is only today starting to be used by business for big data analytics.

Hadoop's co-creator Doug Cutting believes industry will continue to borrow from Google's toolbox, and sees a bright future in enterprise for the recently announced Google Spanner.

"Google is living a few years in the future and sending the rest of us messages," he said at the O'Reilly Strata Conference in London.

Hadoop co-creator Doug Cutting. Image: Tim Bray (http://en.wikipedia.org/wiki/User:TimBray) under licence (http://creativecommons.org/licenses/by-sa/3.0/)

Spanner was unveiled by Google last year as the technology that allows the search giant to provide almost instantaneous access to its services to millions of people worldwide without its software falling over. Primarily, it stops Google's systems from getting tangled up while each tries to keep up to date with what the others are doing.

In creating Spanner, Google had built a planet-spanning distributed database that allowed its global datacentres to keep in sync without suffering huge latencies. 

At the heart of Spanner is Google's TrueTime service, which allows systems to get accurate timestamps based on readings from atomic clocks and GPS receivers installed in each of Google's datacentres. Because Google can rely on TrueTime systems in different datacentres being in sync, it can ensure applications situated on opposite sides of the world are able to read, write and replicate data without falling out of step with each other.
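
To make the idea concrete, here is a minimal Python sketch of how a bounded-uncertainty clock can be used to order writes. It is an illustration only, not Google's code: Google's published Spanner paper describes TrueTime as returning a time interval rather than a single instant, and describes writers waiting out that uncertainty before a commit becomes visible. The class `SimulatedTrueTime`, the function `commit` and the 5ms uncertainty figure are all hypothetical names and values chosen for this example, and the local system clock stands in for the atomic clocks and GPS receivers.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A span guaranteed (by assumption) to contain the true current time."""
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical stand-in for a TrueTime-like clock; not Google's API.

    The local clock is assumed to be within `uncertainty_s` seconds
    of true time, so every reading is widened by that amount.
    """

    def __init__(self, uncertainty_s: float = 0.005, clock=time.time):
        self.uncertainty_s = uncertainty_s
        self._clock = clock

    def now(self) -> TimeInterval:
        t = self._clock()
        return TimeInterval(t - self.uncertainty_s, t + self.uncertainty_s)

    def after(self, timestamp: float) -> bool:
        """True once `timestamp` has definitely passed."""
        return self.now().earliest > timestamp


def commit(tt: SimulatedTrueTime, sleep=time.sleep) -> float:
    """Pick a commit timestamp, then wait until it is definitely in the past.

    Because the writer waits out the clock uncertainty, any commit that
    starts after this one returns is assigned a strictly later timestamp.
    """
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):
        sleep(tt.uncertainty_s / 4)
    return commit_ts
```

In this sketch the cost of each write is a pause of roughly twice the clock uncertainty, which suggests why keeping the clocks in different datacentres tightly in sync matters: the smaller the uncertainty, the shorter the wait.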

For Cutting, Spanner shows the future possibilities for open source distributed processing platforms like Hadoop.

Hadoop allows data to be spread over large clusters of commodity servers and processed in parallel. Today the platform is generally used to analyse data that sits outside of online transaction processing (OLTP) systems that are the engine of businesses – the likes of e-commerce, CRM and HR systems.
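
The split-then-combine pattern behind that parallel processing can be sketched in a few lines of Python. This is a single-machine illustration under stated assumptions, not Hadoop's API: the functions `map_chunk`, `reduce_counts` and `word_count` are hypothetical names, and a thread pool stands in for a cluster of commodity servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, n_chunks=4):
    """Split the input into chunks, count each in parallel, then merge."""
    # Each chunk plays the role of a block of data held on one server.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)
```

Calling `word_count(["big data", "Big clusters", "data data"])` counts "data" three times and "big" twice, whichever chunk each line happens to land in.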

Spanner demonstrates how major corporations may soon use a Hadoop-like platform to run these OLTP systems at a globally distributed scale, said Cutting, who is also chief architect at Hadoop specialist Cloudera.

"I think it [Spanner] is the Holy Grail for big data," he said. "Just a couple of years ago people would talk about OLTP and say 'You can't do that sort of stuff on a Hadoop-like platform'. Google demonstrated that you can."

Google created Spanner because it needed a technology with global reach to underpin its massive software platforms, said Cutting, a need that other large enterprises may struggle with in future.

"In a lot of cases people are served just fine by their existing relational solutions to OLTP problems, and there's no need to drive it to Hadoop.

"[However] as enterprises become more Google-like that might not be satisfactory. I think the rest of us will be driven there as well. In the next couple of years I think we'll see it."

Facebook is already demonstrating one way of effectively linking a Hadoop cluster spanning multiple datacentres worldwide with its Prism system.

What's next for Hadoop?

The type of processing that can be carried out on a Hadoop cluster is evolving, with the general availability of Hadoop 2 last month refining Hadoop's software tools to make it easier to use clusters for more than batch processing.

As Hortonworks co-founder Arun Murthy told ZDNet, the introduction of a separate job scheduler called YARN widens potential uses for Hadoop.

"It opens up Hadoop to so many new use cases, whether it's real-time event processing, or interactive SQL. Machine learning is another example — people are building native machine-learning apps on top of Hadoop right now, thanks to YARN," he said.

Another use for YARN, said Cutting, is for dynamically reassigning computing resources from the Hadoop cluster according to factors that are important to an organisation, such as who is running the job or at what time, which could be spelled out in SLAs.

Cutting sees Hadoop evolving into an enterprise data hub, a platform for running a variety of enterprise workloads — including batch processing, interactive SQL, enterprise search and advanced analytics — that integrates with existing corporate systems.

"Changes are going very quickly. I think there will be a trend towards seeing it as a kernel for datacentres in the way that Linux is the kernel for single nodes. More and more it will be the centre of a wide range of applications," he said.

"In the last year we've seen huge numbers of the traditional enterprise software vendors start to move and make their software available on top of Hadoop.

"We've had SAS, Tableau [Software]. Across the board, people are starting to see this as a platform that their customers want. I think that's going to accelerate until it's the default platform for most vendors."
