Hadoop: How it became big data's lynchpin, and where it's going next

Doug Cutting, creator of the distributed computing platform Hadoop, on why it is in an almost unassailable position and what's in store for it.
Written by Nick Heath, Contributor

The creator of the Hadoop platform, Doug Cutting, is surprised Hadoop didn't have to battle it out with more competitors in its early years but thinks it's now too late for a competitor to take on the platform.

Cutting began creating the distributed computing platform in the mid-2000s, basing it on two papers written by Google about technologies powering its search engine: Google File System and MapReduce. At that time he didn't have any sense of how demand for the platform would build to where it is today, with adoption by firms growing by 60 percent annually, according to analyst house IDC.

"I was coming at it from a web search context, I saw it was useful in processing the sort of datasets we were getting crawling the web. I wasn't thinking much outside of that. I don't think I was looking at the industries in general," said Cutting, who is chief architect at Hadoop software and services firm Cloudera.

Hadoop's logo is a yellow elephant. Photo: Erik Eldridge (http://www.flickr.com/photos/erikeldridge/3614786392/sizes/m/in/photostream/) under Creative Commons licence

After businesses began showing interest in the platform, Cutting anticipated one of the large tech firms would launch another platform that distributed files and processing across clusters of commodity hardware, and was surprised when that didn't happen.

"When it started to become something people were paying attention to I figured that some big companies would try and come up with a proprietary competitor," he said.

"But in the last couple of years it's become pretty clear that everybody's accepting Hadoop as the standard in this area. I think it would be very difficult to start a competitor afresh.

"You've got all the major players, from the companies like Cloudera that have been developed around the technology to existing folks like Microsoft, Oracle, IBM who've all endorsed this as the platform technology for big data. It's hard to go against that. You're swimming upstream and it's a big stream."

User explosion

The use of Hadoop is exploding, according to Cutting. He said that Cloudera, a Hadoop pure-play which sells software and services for the platform, has doubled in customers, size and revenues each year.

"As it rolls out we are seeing we need to add more and more features to permit it to continue to grow," Cutting said.

"There's no question it's exploding across every industry. Every industry has lots of data, and they want to be able to store it and process it economically. As they start to look at it they start to realise the issues specific to those industries and we can work with them to figure out how to make sure it works in these cases."

The range of potential uses for Hadoop has been opened up by recent enhancements that have added real-time SQL queries and Google-like keyword search.

Cloudera has recently added the Impala engine to its distribution of Hadoop, allowing SQL queries to be run against data stored in the Hadoop Distributed File System or the non-relational HBase database that sits on Hadoop, and results to be returned in "real-time".

"Impala lets you get 10 or more times faster results for SQL queries over data that's in Hadoop" compared to using older query engines such as Hive running jobs on Hadoop MapReduce, Cutting said.

Cloudera also added the ability to carry out Google-like keyword searches over billions of documents in a Hadoop cluster in a "fraction of a second", Cutting said, with the recent introduction of Cloudera Search, based on Apache Solr 4.3.

"I think SQL and search are the two primary search methods that people have for querying data, so we've got those covered," he said.

The future of Hadoop

A key focus for future Hadoop development will be the addition of security and auditing tools. For Hadoop to gain even wider adoption, users need it to comply with the legal requirements governing more heavily regulated industries, such as finance, government and healthcare, Cutting said.

"We're working on adding these different tools specific to different regulatory regimes," he said.

"Some of these regulations are arcane and require the ability to track over time as data is transformed. You need to know the provenance, to see who accessed what and figure out who can see what aspects of both primary and derivative datasets."

Cutting said work was continuing to build on additions made to Apache Hadoop over the past year, such as the introduction of finer-grained authorisation controls.

"Within a HBase table you can now control the access to different rows and columns; eventually we'd like to get cell-level authorisation in HBase," he said.

Data going into and out of Apache Hadoop is encrypted, while third-party products, such as those provided by Vormetric, can also encrypt data at rest inside Hadoop. Cutting said he imagines that feature will eventually be built into Hadoop itself.
