Search for Big Data: Cloudera and Lucene get hitched

Instead of a data scientist, what if you only needed your GoogleBing-fu to analyze data in Hadoop? That may be a stretch, but it's exactly what Cloudera is working toward.
Written by Andrew Brust, Contributor

Several years ago, many business intelligence companies worked to add Web search-like query interfaces to their product stacks.  Those efforts focused on enterprise search technology, which put them more in the realm of big dollars than Big Data.  Success was mixed at best.  

But that doesn't mean it was a bad idea, and in the age of Hadoop, the relevance of search is arguably more pronounced.  After all, a search interface over Big Data makes a lot of sense, because that's exactly what Web search is.  If search engines can work over the entire Web, why can't a search interface in Hadoop work over its own cluster?

The open source Lucene/Solr project offers a sophisticated full-text search engine.  Moreover, the creation of both Hadoop and Lucene was spearheaded by Doug Cutting, who just happens to be the Chief Architect at Cloudera.  So consider the dots connected: Cloudera has now integrated Lucene with Hadoop to bring plain-language, Web-style search to its Cloudera Distribution including Apache Hadoop (CDH).  The public Beta of this search technology is now available, and builds on a private beta that has been ongoing.

From the source
I spoke with Cloudera's CEO, Mike Olson, and the PM of Cloudera Search, Eva Andreasson, last week about the new technology.  As explained to me, the integration sounds elegant and extensive.

Effectively, Cloudera has ported the code from the SolrCloud project to work over the Hadoop Distributed File System (HDFS), in a fully distributed manner.  More specifically, the code is integrated with Apache ZooKeeper, which is used by Hadoop to coordinate distributed processing, and yet, according to Olson and Andreasson, it maintains compatibility with the standard Solr application programming interfaces (APIs).

Deep integration and more to come
The integration goes beyond the engine level.  Cloudera has integrated the search code not only with HDFS and ZooKeeper, but also with MapReduce, Flume (effectively allowing real-time indexing of streaming data), Oozie (allowing the execution of scheduled search jobs) and Hue (providing a browser-based user interface to perform Big Data search).

Cloudera is also working on integrating the search code with HBase to allow free-text search over NoSQL data, and says it will make that functionality available in a future public Beta.  Speaking of the future, Cloudera Search will be integrated with YARN/MapReduce 2, but at present it is integrated with version 1 of the MapReduce engine.

Battle of the Lucene Titans?
LucidWorks, which has been the major commercial entity behind Lucene, also has a Hadoop-based search offering on the market, called LucidWorks Big Data.  That product is CDH-compatible and is also integrated with MapR's Hadoop distribution.  Admittedly, this could cause some market confusion, and it's not obvious which search technology may be the more sound investment, given LucidWorks' close association with the Lucene project.

I posed this question to Cloudera and, in a convincing rebuttal, Olson pointed out to me that Cloudera, in addition to having Lucene's creator Doug Cutting on board, also employs SolrCloud lead Mark Miller and ZooKeeper lead Patrick Hunt.  Perhaps more important, all of Cloudera's work is being conducted within the Apache Lucene community and the code will be submitted back to the core project.  That should mitigate a number of misgivings.  But I'm still keen to see how this shakes out in both the broad Lucene and Hadoop ecosystems.

Pick an interface, any interface
Cloudera is working hard to make Hadoop/HDFS data accessible through any number of interfaces, including classic MapReduce, NoSQL/HBase, SQL/Impala and now Web-style search/Lucene.  No matter the final evolution of Hadoop-based search, providing broader access to Hadoop data is something Cloudera is serious about.  And it should be, because Hadoop's barriers need to be much lower for Cloudera to grow its market.

