When the Big Data moniker is applied to a discussion, it’s often assumed that Hadoop is, or should be, involved. But perhaps that’s just doctrinaire.
Hadoop, at its core, consists of HDFS (the Hadoop Distributed File System) and MapReduce. The latter is a computational approach that involves breaking large volumes of data down into smaller batches, and processing them separately. A cluster of computing nodes, each one built on commodity hardware, will scan the batches and aggregate their data. Then the multiple nodes’ output gets merged to generate the final result data. In a separate post, I’ll provide a more detailed and precise explanation of MapReduce, but this high-level explanation will do for now.
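To make that high-level description a little more concrete, here is a minimal sketch of the split, process and merge pattern in plain Python. It is not Hadoop code: the function names (split_into_batches, map_batch, merge_outputs, word_count) and the word-counting task are assumptions chosen purely for illustration, and the batches are processed one after another in a single process rather than on separate nodes.

```python
from collections import Counter
from functools import reduce


def split_into_batches(records, batch_size):
    """Break the full list of records into smaller batches."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def map_batch(batch):
    """The work one node would do: scan its batch and aggregate locally."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def merge_outputs(partials):
    """Merge the per-node outputs into the final result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def word_count(records, batch_size=2):
    batches = split_into_batches(records, batch_size)
    # On a real cluster each batch would be handled by a different node,
    # in parallel; here the batches are simply processed in a loop.
    partials = [map_batch(batch) for batch in batches]
    return merge_outputs(partials)


if __name__ == "__main__":
    lines = ["big data is big", "hadoop is big", "mpp is fast"]
    print(word_count(lines))
```

The point of the sketch is only the shape of the computation: each batch can be aggregated without looking at any other batch, which is what lets the work be spread across many machines.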
But Big Data’s not all about MapReduce. There’s another computational approach to distributed query processing, called Massively Parallel Processing, or MPP. MPP has a lot in common with MapReduce. In MPP, as in MapReduce, processing of data is distributed across a bank of compute nodes; these separate nodes process their data in parallel, and the node-level output sets are assembled to produce a final result set. MapReduce and MPP are relatives. They might be siblings, parent-and-child or maybe just kissing cousins.
But, for a variety of reasons, MPP and MapReduce are used in rather different scenarios. You will find MPP employed in high-end data warehousing appliances. Almost all of these products started out as offerings from pure-play companies, but there’s been a lot of recent M&A activity that has taken MPP mainstream. MPP products like Teradata and ParAccel are independent to this day. But other MPP appliance products have been assimilated into the mega-vendor world. Netezza was acquired by IBM, Vertica by HP and Greenplum by EMC, and Microsoft’s acquisition of DATAllegro resulted in an MPP version of SQL Server, called Parallel Data Warehouse Edition (SQL PDW, for short).
MPP gets used on expensive, specialized hardware tuned for CPU, storage and network performance. MapReduce and Hadoop find themselves deployed to clusters of commodity servers that in turn use commodity disks. The commodity nature of typical Hadoop hardware (and the free nature of Hadoop software) means that clusters can grow as data volumes do, whereas MPP products are bound by the cost of, and finite hardware in, the appliance and the relatively high cost of the software.
MPP and MapReduce are separated by more than just hardware. MapReduce’s native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL (Structured Query Language). “Hive,” a subproject of the overall Apache Hadoop project, essentially provides a SQL abstraction over MapReduce. Nonetheless, Hadoop is natively controlled through imperative code while MPP appliances are queried through declarative queries. In a great many cases, SQL is easier and more productive than writing MapReduce jobs, and database professionals with the SQL skill set are more plentiful and less costly than Hadoop specialists.
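The imperative-versus-declarative contrast is easier to see side by side. The sketch below computes the same per-region totals twice: once with hand-written map and reduce steps, and once with a single SQL statement. Several things here are assumptions for illustration only: the SALES data, the sales table and the function names are made up, the code is Python rather than the Java that MapReduce natively uses, and SQLite (via Python’s built-in sqlite3 module) is a single-process stand-in for an SQL engine, not an MPP product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) pairs.
SALES = [("east", 100), ("west", 250), ("east", 50), ("west", 25), ("north", 75)]


def totals_imperative(rows):
    """MapReduce style: the programmer spells out how to compute the result."""
    # Map: emit a (key, value) pair for every input row.
    mapped = [(region, amount) for region, amount in rows]
    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: collapse each key's values into a single total.
    return {key: sum(values) for key, values in grouped.items()}


def totals_declarative(rows):
    """SQL style: the query states what is wanted; the engine decides how."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(totals_imperative(SALES))
    print(totals_declarative(SALES))
```

Both functions return the same totals. The difference is in who plans the work: in the first, the grouping and summing logic is written out step by step; in the second, one GROUP BY clause leaves that planning to the engine.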
But there’s no reason that SQL + MPP couldn’t be implemented on commodity hardware and, for that matter, no reason why MapReduce couldn’t be used in data warehouse appliance environments. MPP and MapReduce are both Big Data technologies. They’re also products of different communities and cultures, but that doesn’t justify their continued separate evolution.
The MPP and Hadoop/MapReduce worlds are destined for unification. Perhaps that’s why Teradata’s Aster Data nCluster mashes up SQL, MPP and MapReduce. Or why Teradata and Hortonworks (an offshoot of Yahoo’s Hadoop team) have announced a partnership to make Hadoop and Teradata work together. And that’s probably why Microsoft is also working with Hortonworks, not only to implement Hadoop on Windows Azure (Microsoft’s cloud computing platform) and Windows Server, but also to integrate it with SQL Server business intelligence products and technologies.
Big Data is data, and it’s big, whether in a hulking data warehouse or a sprawling Hadoop cluster. Data warehouse and Hadoop practitioners have more in common than they might care to admit. Sure, one group has been more corporate and the other more academic- or research-oriented. But those delineations are subsiding and the technology delineations should subside as well.
For now, expect to see lots of permutations of Hadoop and its ecosystem components with data warehouse, business intelligence, predictive analytics and data visualization technologies. In the future, be prepared to see these specialty areas more unified, rationalized and seamlessly combined. The companies that get there first will have a real competitive advantage. Companies that continue to just jam these things together will have a tougher time.