MapReduce and MPP: Two sides of the Big Data coin?

Summary: To many, Big Data goes hand-in-hand with Hadoop + MapReduce. But MPP (Massively Parallel Processing) and data warehouse appliances are Big Data technologies too. The MapReduce and MPP worlds have been pretty separate, but are now starting to collide. And that’s a good thing.

TOPICS: Big Data, Software

When the Big Data moniker is applied to a discussion, it’s often assumed that Hadoop is, or should be, involved. But perhaps that’s just doctrinaire.

Hadoop, at its core, consists of HDFS (the Hadoop Distributed File System) and MapReduce. The latter is a computational approach that involves breaking large volumes of data down into smaller batches, and processing them separately. A cluster of computing nodes, each one built on commodity hardware, will scan the batches and aggregate their data. Then the multiple nodes’ output gets merged to generate the final result data. In a separate post, I’ll provide a more detailed and precise explanation of MapReduce, but this high-level explanation will do for now.
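As a toy illustration of that high-level flow — plain Python standing in for the cluster, not the actual Hadoop API — a word count might look like this, with each "split" playing the role of a batch processed on a separate node:

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle step: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate each key's values into the final result."""
    return {key: sum(values) for key, values in groups.items()}

# Each split stands in for a batch scanned on a separate commodity node.
splits = ["big data big deal", "big data"]
mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'deal': 1}
```

In real Hadoop, the map and reduce functions run on many machines and the shuffle moves data across the network; the division of labor, though, is the same.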

But Big Data is not all about MapReduce. There’s another computational approach to distributed query processing, called Massively Parallel Processing, or MPP. MPP has a lot in common with MapReduce: in both, processing is distributed across a bank of compute nodes; those nodes work on their data in parallel; and the node-level output sets are assembled into a final result set. MapReduce and MPP are relatives. They might be siblings, parent-and-child, or maybe just kissing cousins.

But, for a variety of reasons, MPP and MapReduce are used in rather different scenarios. You will find MPP employed in high-end data warehousing appliances. Almost all of these products started out as offerings from pure-play companies, but there’s been a lot of recent M&A activity that has taken MPP mainstream. MPP products like Teradata and ParAccel are independent to this day. But other MPP appliance products have been assimilated into the mega-vendor world: Netezza was acquired by IBM; Vertica by HP; Greenplum by EMC; and Microsoft’s acquisition of DATAllegro resulted in an MPP version of SQL Server, called Parallel Data Warehouse Edition (SQL PDW, for short).

MPP gets used on expensive, specialized hardware tuned for CPU, storage and network performance. MapReduce and Hadoop find themselves deployed to clusters of commodity servers that in turn use commodity disks. The commodity nature of typical Hadoop hardware (and the free nature of Hadoop software) means that clusters can grow as data volumes do, whereas MPP products are bound by the cost of, and finite hardware in, the appliance, and by the relatively high cost of the software.

MPP and MapReduce are separated by more than just hardware. MapReduce’s native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL (Structured Query Language). “Hive,” a subproject of the overall Apache Hadoop project, essentially provides a SQL abstraction over MapReduce. Nonetheless, Hadoop is natively controlled through imperative code while MPP appliances are queried through declarative queries. In a great many cases, SQL is easier and more productive than writing MapReduce jobs, and database professionals with SQL skills are more plentiful and less costly than Hadoop specialists.
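The imperative-versus-declarative contrast can be made concrete with a toy example — here using Python and its built-in SQLite engine rather than Hadoop or a real MPP appliance. The first half spells out *how* to aggregate, as MapReduce code must; the second merely states *what* result is wanted and lets the engine plan execution, which is the model Hive layers over MapReduce:

```python
import sqlite3

orders = [("east", 10), ("west", 5), ("east", 7)]

# Imperative: we dictate every step of the aggregation ourselves.
totals = {}
for region, amount in orders:
    totals[region] = totals.get(region, 0) + amount

# Declarative: we describe the desired result set in SQL and the
# engine decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))
conn.close()

print(totals, sql_totals)  # both: {'east': 17, 'west': 5}
```

The GROUP BY query stays three lines whether the table holds three rows or three billion; the imperative version is where the programmer, not the engine, owns the scaling problem.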

But there’s no reason that SQL + MPP couldn’t be implemented on commodity hardware and, for that matter, no reason why MapReduce couldn’t be used in data warehouse appliance environments. MPP and MapReduce are both Big Data technologies. They’re also products of different communities and cultures, but that doesn’t justify their continued separate evolution.

The MPP and Hadoop/MapReduce worlds are destined for unification. Perhaps that’s why Teradata’s Aster Data nCluster mashes up SQL, MPP and MapReduce. Or why Teradata and Hortonworks (an offshoot of Yahoo’s Hadoop team) have announced a partnership to make Hadoop and Teradata work together. And that’s probably why Microsoft is also working with Hortonworks, not only to implement Hadoop on Windows Azure (Microsoft’s cloud computing platform) and Windows Server, but also to integrate it with SQL Server business intelligence products and technologies.

Big Data is data, and it’s big, whether in a hulking data warehouse or a sprawling Hadoop cluster. Data warehouse and Hadoop practitioners have more in common than they might care to admit. Sure, one group has been more corporate and the other more academic- or research-oriented. But those delineations are subsiding and the technology delineations should subside as well.

For now, expect to see lots of permutations of Hadoop and its ecosystem components with data warehouse, business intelligence, predictive analytics and data visualization technologies. In the future, be prepared to see these specialty areas more unified, rationalized and seamlessly combined. The companies that get there first will have real competitive advantage. Companies that continue to just jam these things together will have a tougher time.


Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.



  • These are just optimization techniques

    Surely we just add these techniques to the existing set of techniques used in SQL-DBMSs?

    Otherwise you have to rewrite all your software to accommodate Big Data. Consequently Big Data looks like a huge waste of time, effort and money.

    Optimization improvements should be completely invisible at the logical level at which all users and programmers should work.

    The Big Data approach is too low-level for long-term practical use; it's the data management equivalent of programming in assembler. I don't want to know about the implementation level, only the logical level. Big Data fails totally on this count.
  • MPP

    Many SQL-DBMSs already implement this, of course, as one of a number of optimization techniques; parallelism is not the best approach for all queries and operations.
  • Big data is data, not technologies

    Andrew, thanks for this clear explanation of the differences and commonalities between MapReduce and MPP. The fact is, a lot of companies did not wait for Hadoop to "do" big data, but then they had to use expensive and (often) complicated technologies. I recently discussed some of the pre-Hadoop big data use cases in a blog post. What we are starting to see now is "big data for the masses": the democratization of big data thanks to more "modern" technologies.
  • MPP and MapReduce

    The central tenet that MapReduce and MPP have a lot in common is correct. From the user's perspective, the main difference is that MapReduce supports procedural languages (Java, etc.) and MPP systems are typically SQL-only databases. Both run by default on clusters of SMP nodes in a 'shared nothing' architecture.

    Teradata has been owned by NCR and AT&T (which bought, then sold, NCR), so it has not been independent for most of its almost 30 years. The MPP usual suspects are indeed Teradata, IBM Netezza and EMC Greenplum, with Microsoft's PDW yet to really make much of an appearance in the field.

    MPP systems do not have to get 'used on expensive, specialized hardware'.

    Teradata uses Dell SMP servers and LSI storage, although the BYNET interconnect is proprietary. The software version of Greenplum offers choice of SMP node, storage, OS and filesystem and can be scaled to as many nodes as you choose, all on COTS hardware. Netezza now uses IBM blades after migrating away from home-grown hardware, although the blade that hosts the FPGA is proprietary.

    Not all MPP products are supplied in fixed appliance form. Teradata's non-appliance 'enterprise' offerings can be grown incrementally. This has been the case for decades. Only the relatively new 'appliance' offerings - a response to the competition Netezza brought to the MPP market - are non-expandable, and that's a product positioning choice not a technology limitation.

    The software-only version of Greenplum is similarly unhindered. You can scale Greenplum on as many commodity nodes as you wish. The DBMS licence is charged per TB of input data, not per node, so you are only bounded by how much data you wish to process, not the cost of the tin or the DBMS. Seems very reasonable.

    SQL is not just 'easier and more productive', it's also understood by millions of users and developers worldwide. It is a standard after all. SQL is also generated by every query tool out there in the enterprise. For this reason alone it will never be displaced.

    SQL and MPP are very much being implemented on commodity hardware, and MapReduce is being implemented in data warehouse environments. Greenplum already has MapReduce from MapR built into its offering, and Teradata has them combined in its Aster stack. This evolution will continue at pace.

    The next stage in this MPP + MapReduce evolution is a scalable cloud offering that can be spun up on demand on an arbitrary number of nodes with usable and stable inter-node and intra-node IO bandwidth.

    Paul Johnson
  • MPP for mainstream BigData

    One of the major Hadoop players, Cloudera, has recently launched Impala, a real-time query engine for Hadoop. Unlike Hive, it uses MPP techniques instead of MapReduce.