Making Hadoop Optimizations Pervasive

As innovative as Hadoop is in toto, its components can benefit from optimization, perhaps significantly. One vendor that’s been in the database business for three decades isn’t just talking about those optimizations. It’s building products around them.

Hadoop is the master of distributed Big Data processing across clusters of commodity hardware.  But what about distributed computing within a node in the cluster?  There’s been a multi-year, multi-core revolution going on in hardware, but not much software seems to have heard the revolutionary forces charging forward.  How well does Hadoop scale within a worker node?

To be honest, I’m still learning.  But my gut tells me that Hadoop could be doing more in this regard.  Sure, Hadoop is about scaling out, and not necessarily up.  And it’s about commodity hardware and disk storage.  But multi-core technology is part of commodity hardware now, and there’s no point in wasting it.

Last week I had the opportunity to speak with Mike Hoskins of Pervasive Software.  If you’re an old-timer in the industry then you’ll know Pervasive for its Btrieve embedded database product (now called PSQL), which is still making the company a pretty penny.  But Hoskins, who is Pervasive’s CTO, is now also leading the company’s Big Data efforts, and he’s helping the company do some pretty cool stuff.

I’m impressed by the database and BI veterans who are finding their way into Big Data.  They see the potential of Big Data technology, but they see it in the context of relational database technology, where performance and optimization are hard-fought and resource waste can be heresy.

Pervasive, for its part, has taken the massively parallel technology it designed for single nodes in its DataRush product and implemented it in custom readers and writers for HBase.  According to Hoskins, this increases throughput, measured in records per second per node, by 100x or more.  Couple that single-node power with the inter-node distributed power of Hadoop’s MapReduce, and there’s potential for great things.
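
To make the within-node idea a little more concrete, here is a minimal Python sketch of fanning one worker node's records out across its local cores and then merging the partial results.  To be clear, this is not DataRush or anything Pervasive showed me; the function names (split_into_chunks, count_words, process_on_node) and the word-count workload are my own hypothetical stand-ins.  It uses threads by default so it runs anywhere; for genuinely CPU-bound work in CPython you would likely pass ProcessPoolExecutor as executor_cls instead.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import os


def split_into_chunks(records, n_chunks):
    """Split a list of records into at most n_chunks contiguous slices."""
    n_chunks = max(1, min(n_chunks, len(records) or 1))
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional record each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def count_words(chunk):
    """Per-worker task: a partial word count over one slice of the node's records."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def process_on_node(records, workers=None, executor_cls=ThreadPoolExecutor):
    """Fan one node's records out across local workers, then merge the partials."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(records, workers)
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)  # Counter.update adds counts together
    return total
```

In a real cluster, each worker node would run something like process_on_node over its own share of the data, and the framework's inter-node machinery would combine the per-node totals.  The sketch only illustrates the shape of the second layer of parallelism, not its performance.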

Pervasive’s stack also includes TurboRush, for optimizing Hive; DataRush for KNIME (an Eclipse-based, open source platform for flow-based data integration and analysis); and RushAnalyzer for machine learning. There’s a lot to dig into.

Hadoop brings formidable raw power to the game of processing data.  But it leaves a lot of white space for optimization and improvement.  That goes for MapReduce, HDFS, and pieces higher in the stack like Hive and HBase.  As Hadoop goes more corporate, expect to see more corporations imbue it with enterprise computing optimizations at the molecular and atomic levels.  It’s the logical place for companies to distinguish themselves from their many competitors at the Big Data jamboree.