EMC on Monday rolled out its own distribution of Hadoop in a move that integrates the open source big data software directly with its Greenplum intellectual property. The aim: Take on Cloudera.
The distribution, called Pivotal HD, is notable because it puts EMC in competition with Cloudera, which has a bevy of partners and is often seen as the Red Hat of big data. Many data warehousing players have built Hadoop connectors, but EMC went for a distribution so it could improve query response time. EMC's Pivotal HD could also drive sales of its Greenplum software and appliances.
Josh Klahr, vice president of products at EMC Greenplum, wasn't shy about the Cloudera comparison:
"We want to be competitive with Cloudera. When we beta (Pivotal HD) with customers we've been able to stop a Cloudera purchase decision. Every account we go into there's increasing interest and adoption of Hadoop. The interest ranges from experimental to large production deployments."
Klahr noted that Pivotal HD combines parts of Apache Hadoop, additions from the 100 developers EMC has on the project, and proprietary database tools.
Among the key points about Pivotal HD:
- EMC's Hadoop distribution natively integrates Greenplum's massively parallel processing database with Apache Hadoop.
- EMC said that its Hadoop distribution will bring SQL processing to the table and integrate with traditional business intelligence tools. Pivotal HD supports SQL-based data mining tools and allows them to use Hadoop's file system. EMC also outlined Project Hawq, an effort to bring database services to Hadoop.
- Pivotal HD includes cluster management tools so developers can deploy, configure and manage big data tasks.
- The storage giant claims that Pivotal HD is the most powerful Hadoop distribution because it uses EMC's dynamic pipelining technology. Based on its own tests, EMC is claiming response times 10x to 600x faster than SQL interfaces for Hadoop. EMC also provided its own benchmarks comparing Hawq to Hive.
- The distribution bundles VMware's Hadoop Virtualization Extensions. That move isn't surprising given EMC owns VMware.
EMC said the rationale for its own distribution is that Hadoop interfaces in the enterprise aren't up to snuff and connectors are too slow. With Greenplum, EMC is looking to bring components and tools to bridge big data and business intelligence software via SQL.
The biggest issue for the big data market is that Hadoop distributions are piling up. Cloudera, IBM and Hortonworks are a few key players and the field is growing.
Pivotal HD will be available as software only or embedded in an appliance.