Making Hadoop Optimizations Pervasive

Summary: As innovative as Hadoop is in toto, its components can benefit from optimization, perhaps significantly. One vendor that’s been in the database business for three decades isn’t just talking about those optimizations. It’s building products around them.

Hadoop is the master of distributed Big Data processing, across clusters of commodity hardware.  But what about distributed computing within a node in the cluster?  There’s been a multi-year, multi-core revolution going on in hardware, but not a lot of software seems to have heard the revolutionary forces charging forward.  How well does Hadoop scale within a worker node?

To be honest, I’m still learning.  But my gut tells me that Hadoop could be doing more in this regard.  Sure, Hadoop is about scaling out, and not necessarily up.  And it’s about commodity hardware and disk storage.  But multi-core technology is part of commodity hardware now, and there’s no point in wasting it.
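
To make that concrete, here's a minimal sketch of the kind of within-node parallelism I mean: fanning record processing out across every core a worker has, rather than grinding through records on a single thread. It's plain Java, not Hadoop API; the record format and the per-record work are invented purely for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class NodeLocalParallelism {

        // Stand-in for whatever real per-record work a task does.
        static long process(String record) {
            return record.hashCode();
        }

        public static void main(String[] args) throws Exception {
            // Pretend this is the slice of input one worker node received.
            List<String> records = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                records.add("record-" + i);
            }

            // Use every core the node has rather than a single thread.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            // One chunk of the node-local input per core.
            int chunk = (records.size() + cores - 1) / cores;
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < records.size(); start += chunk) {
                final List<String> slice =
                        records.subList(start, Math.min(start + chunk, records.size()));
                partials.add(pool.submit((Callable<Long>) () -> {
                    long sum = 0;
                    for (String r : slice) {
                        sum += process(r);
                    }
                    return sum;
                }));
            }

            // Combine the per-core partial results.
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get();
            }
            pool.shutdown();
            System.out.println("cores=" + cores + ", checksum=" + total);
        }
    }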

Last week I had the opportunity to speak with Mike Hoskins of Pervasive Software.  If you’re an old-timer in the industry then you’ll know Pervasive for its Btrieve embedded database product (now called PSQL), which is still making the company a pretty penny.  But Hoskins, who is Pervasive’s CTO, is now also leading the company’s Big Data efforts, and he’s helping the company do some pretty cool stuff.

I find myself impressed by the database and BI stalwarts who are making their way into Big Data.  They see the potential of Big Data technology, but they see it in the context of relational database technology, where performance and optimization are hard-fought and resource waste can be heresy.

In Pervasive’s case, it’s taken the massively parallel technology designed for single nodes in its DataRush product and implemented it in custom readers and writers for HBase.  According to Hoskins, this accelerates throughput, expressed in records per second per node, by 100x or more.  Couple that single-node power with the inter-node distributed power of Hadoop’s MapReduce, and there’s potential for great things.
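
Pervasive hasn't published the internals of those readers and writers, so what follows is only a rough sketch of the general idea as I understand it: several scanner threads working disjoint key ranges of the same HBase table inside a single node, written against the stock HBase client API. The table name, key ranges and batch size are made up.

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelHBaseRead {

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Connection is thread-safe and meant to be shared across threads.
            final Connection conn = ConnectionFactory.createConnection(conf);

            // Hypothetical disjoint row-key ranges; a real reader would derive
            // these from the table's region boundaries.
            List<String[]> ranges = Arrays.asList(
                new String[] {"a", "h"},
                new String[] {"h", "p"},
                new String[] {"p", "z"});

            ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

            for (final String[] range : ranges) {
                pool.submit(() -> {
                    // Table instances are lightweight and not thread-safe,
                    // so each worker thread opens its own.
                    try (Table table = conn.getTable(TableName.valueOf("events"))) {
                        Scan scan = new Scan(Bytes.toBytes(range[0]),
                                             Bytes.toBytes(range[1]));
                        scan.setCaching(1000);  // fetch rows in batches
                        long rows = 0;
                        try (ResultScanner scanner = table.getScanner(scan)) {
                            for (Result r : scanner) {
                                rows++;  // real code would parse/aggregate here
                            }
                        }
                        System.out.println(range[0] + "-" + range[1] + ": " + rows);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
        }
    }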

Pervasive’s stack also includes TurboRush, for optimizing Hive; DataRush for KNIME (an Eclipse-based, open source platform for flow-based data integration and analysis); and RushAnalyzer for machine learning. There’s a lot to dig into.

Hadoop brings formidable raw power to the game of processing data.  But it leaves a lot of white space for optimization and improvement.  That goes for MapReduce, HDFS, and pieces higher in the stack like Hive and HBase.  As Hadoop goes more corporate, expect to see more corporations imbue it with enterprise computing optimizations at the molecular and atomic levels.  It’s the logical place for companies to distinguish themselves from their many competitors at the Big Data jamboree.

Talkback

  • "But multi-core technology is part of commodity hardware now"

    Multi-pipe has been around longer than that. It's the tools we use to build the software that are holding back the scale-up. Manually dealing with threads is no fun.
    happyharry_z
    • One reason why the future is declarative

      and where (in the case of a DBMS) the DBMS engine deals with parallelism and multi-threading without the programmer needing to be aware of it at all.

      SQL-DBMSs are already far advanced down this road.
      jorwell
  • Why your developers say you need big data, and why you really don't need it

    There are a number of fundamental mistakes that are commonly made by developers working with SQL-DBMSs that can be guaranteed to lead to poor performance and scalability.

    When this happens, there are the open-minded developers who accept they have to learn how to work properly with SQL-DBMSs, and the others who blame the SQL-DBMS and demand big data - so that they can keep up with fashion, perhaps.

    Error #1: Using loops to read and update data. The relational model is set-based and SQL is a set-based language. Using set-level operators rather than loops can yield orders-of-magnitude improvements in performance. You can make the performance even worse with misguided approaches such as Object-Relational Mapping, which in some cases can lead to a database call (over the network) for every single column.
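
    A minimal sketch of the contrast via JDBC, against a hypothetical orders table (the schema, connection URL and discount logic are invented purely for illustration):

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;

        public class SetBasedVsLoop {

            // Anti-pattern: one round trip per row to apply a 10% discount.
            static void discountRowByRow(Connection conn) throws Exception {
                try (PreparedStatement select = conn.prepareStatement(
                         "SELECT order_id, amount FROM orders WHERE status = 'OPEN'");
                     PreparedStatement update = conn.prepareStatement(
                         "UPDATE orders SET amount = ? WHERE order_id = ?")) {
                    ResultSet rs = select.executeQuery();
                    while (rs.next()) {                    // N network round trips
                        update.setDouble(1, rs.getDouble("amount") * 0.9);
                        update.setLong(2, rs.getLong("order_id"));
                        update.executeUpdate();
                    }
                }
            }

            // Set-based: the same change as a single statement the DBMS
            // engine can optimize (and parallelize) as it sees fit.
            static void discountSetBased(Connection conn) throws Exception {
                try (PreparedStatement update = conn.prepareStatement(
                         "UPDATE orders SET amount = amount * 0.9 WHERE status = 'OPEN'")) {
                    update.executeUpdate();                // one round trip
                }
            }

            public static void main(String[] args) throws Exception {
                // Hypothetical connection details.
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://dbhost/shop", "app", "secret")) {
                    discountSetBased(conn);  // prefer this; the loop is shown only for contrast
                }
            }
        }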

    Error #2: Denormalization. Many programmers seem to believe that the only way to improve the performance of a slow-running query is to materialize the result. The consequence is that data which was previously quite small is now big (hey, my data is big - I need big data!).

    It cannot be said often enough that denormalization is a tool of last resort. It is almost always possible to tune the query to get acceptable results.

    However, the real problem with this approach is that the denormalized data is almost always out of date - which means queries return the wrong answer, and that is a showstopper.
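
    As a sketch of the alternative (the index, table and column names are invented): instead of materializing the join into a wide copy that immediately starts going stale, give the optimizer an index that supports the slow query and keep the normalized tables as the single source of truth.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class TuneInsteadOfDenormalize {
            public static void main(String[] args) throws Exception {
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://dbhost/shop", "app", "secret");
                     Statement stmt = conn.createStatement()) {

                    // Give the optimizer an index covering the join and filter
                    // columns, rather than materializing a denormalized copy.
                    stmt.execute("CREATE INDEX idx_orders_customer_date "
                               + "ON orders (customer_id, order_date)");

                    // The normalized query remains the single source of truth,
                    // so the answer can never be stale.
                    ResultSet rs = stmt.executeQuery(
                        "SELECT c.name, SUM(o.amount) AS total "
                      + "FROM customers c "
                      + "JOIN orders o ON o.customer_id = c.customer_id "
                      + "WHERE o.order_date >= DATE '2012-01-01' "
                      + "GROUP BY c.name");
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " " + rs.getDouble("total"));
                    }
                }
            }
        }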

    Error #3: Middleware caching. Modern SQL-DBMSs automatically cache the most recently used data (along with query plans and stored procedures). Putting caching in application code has the following consequences:
    - some special process needs to run to refresh the cached data, which means the cached data is usually out of date and you therefore get the wrong answer to queries (which is, as ever, a showstopper).
    - whenever you have to restart the middleware server, startup takes ages because all the cached data has to be reloaded in one go.
    The DBMS caching mechanism is cleverer than the majority of programmers, and you get it without any additional coding.

    So basically you don't need big data, just better education for your programmers about SQL-DBMSs.
    jorwell