If you thought Big Data open source project names were already whimsical, get ready for a new entry in the trend. Splunk has taken its machine-data analytics platform and generalized it to work over any data stored in Hadoop. Take Hadoop + Splunk, and what do you get? Why, "Hunk," of course.
Hunk is interesting in a couple of ways. First is Splunk’s approach to applying schema to unstructured data. While many Hadoop-based query technologies (including Apache Pig) allow schema to be declared at query time, they nonetheless require that declaration to be explicit. Splunk, on the other hand, explores data on the fly, suggests column/entity names and boundaries of its own, and allows the user to approve these suggestions, tweak them slightly, or supply explicit schema information instead. This approach of discovering schema, rather than declaring it, is very useful in the analysis of mainstream Hadoop data, where files often arrive with no schema attached.
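To make the idea of discovered schema concrete, here is a minimal Python sketch. It is not Splunk's algorithm, which the demo did not expose; it simply assumes the raw lines contain key=value pairs, proposes the field names it finds, and then applies whichever of them the user approves. The names `KV_PATTERN`, `suggest_fields` and `extract` are hypothetical.

```python
import re
from collections import Counter

# Assumption: fields appear as key=value, with the value either
# double-quoted or a run of non-space characters.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')

def suggest_fields(lines):
    """Propose field names, most frequently seen first (ties by name)."""
    counts = Counter()
    for line in lines:
        # Count each field at most once per line.
        counts.update({key for key, _ in KV_PATTERN.findall(line)})
    return sorted(counts, key=lambda name: (-counts[name], name))

def extract(line, approved):
    """Apply the user-approved field list to one raw line."""
    pairs = dict(KV_PATTERN.findall(line))
    return {name: pairs[name].strip('"') for name in approved if name in pairs}

if __name__ == "__main__":
    sample = ['ts=1 status=200 user="al b"', "ts=2 status=404", "ts=3"]
    print(suggest_fields(sample))
    print(extract(sample[0], ["status", "user"]))
```

The point of the sketch is the order of operations: the data is inspected first, the schema is suggested second, and the user's approval comes last, which is the reverse of declaring a schema up front.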
The architecture of Hunk is interesting as well. Splunk uses the Hadoop "Streaming" feature to integrate its own code and algorithms into the Hadoop MapReduce engine. While many vendors and users treat Streaming as a mechanism to write standard MapReduce code in languages other than Java, Splunk is using it to port its code to the Hadoop platform. A side benefit of this approach is that Splunk users can monitor interim data as it trickles out, and can stop MapReduce jobs – through the Splunk user interface – if changes are deemed necessary, based on inspection of the output.
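For readers who have not used Streaming, the contract is simple: Hadoop feeds input lines to an arbitrary executable on standard input and reads tab-separated key/value lines back from standard output, with the reducer receiving its input sorted by key. The Python sketch below shows that contract with a toy count-by-status job. It is not Splunk's code, and the input layout (status code as the second whitespace-separated token) is an assumption made for illustration.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'status<TAB>1' per line (field position is assumed)."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield f"{parts[1]}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Run as "script.py map" or "script.py reduce"; Streaming supplies stdin.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

A script like this would be handed to the Streaming jar through its `-mapper` and `-reducer` options. Because the executable is opaque to Hadoop, a vendor can place far more elaborate logic behind the same stdin/stdout interface, which is the opening Splunk is exploiting.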
Hunk’s architecture allows the Splunk analytics code to operate over Hadoop data without having to move that data out of Hadoop. Such data movement is something that becomes more prohibitive as data volumes increase. Meanwhile, since MapReduce-based analyses are batch-based and non-interactive, Hunk can, optionally, move certain data into its own column store database, facilitating more interactive dashboard analysis work.
What I’m writing here is based on a demo of Hunk, provided by Clint Sharp, Splunk's Principal Product Manager for Big Data, and Sanjay Mehta, its VP of Product Marketing. During that demo, I also saw how, with rather minimal configuration, Hunk can be used against virtually any Hadoop cluster, be it set up on-premises or in the cloud.
This is good stuff, and the announcement is nicely timed to coincide with the opening of Hadoop Summit in San Jose today. As ZDNet's Kevin Kwang explained in a recent article, Splunk is branching out beyond analysis of data center and other machine-generated data, into the wider world of Big Data analytics overall.