MapR: A Hadoop rebel with a cause

MapR's open core strategy is no longer an outlier in the Hadoop ecosystem, but its converged platform remains unique. While gaps remain with regard to data governance, MapR could have a strong TCO story to tell.
Written by Tony Baer (dbInsight), Contributor

When it emerged from stealth in 2011, MapR was an outlier in the Hadoop community. At the time, Hadoop was defined largely by two projects adapted from Google research: MapReduce, which brought linearly scalable, massively parallel processing to big data, and the HDFS file system, which could accommodate data by the petabyte. Hadoop was also defined, then, as an Apache open source platform.

MapR's message, then and now, is that open source technology is not always sufficient for going the last mile to enterprise grade. Its strategy is not all that unusual -- it's the embodiment of open core, where you supplement open source technology with proprietary value-add. It began with a shot across the bow at one of Apache Hadoop's pillars -- the file system, which MapR replaced with its own (MapR-FS), providing the update and delete capabilities that HDFS lacked.

MapR, the rebel, has increasingly gone mainstream, but not necessarily by joining the crowd. Instead, the rest of the industry ended up embracing the same open core model, each with its own twist on implementing it. Cloudera has been open core from the start, but unlike MapR, it keeps its unique content away from the runtime (e.g., monitoring, data governance, SQL optimization, and encryption key management). Hortonworks, the embodiment of 100 percent open source, has recently taken a more realpolitik tack that began with OEM agreements with AtScale, Syncsort, and Pivotal, followed by a more drastic departure: its Hortonworks Data Cloud for AWS, which swaps in Amazon S3 cloud storage for HDFS.

MapR just concluded its first-ever analyst event, a milestone for a company that has traditionally operated well below the radar. It has clawed out a foothold: 40 percent of its customer base was poached from Cloudera and Hortonworks, it has achieved a 99 percent renewal rate, and its subscriptions typically grow 35 percent year over year. Like Cloudera, MapR has predicted that it will go cash flow neutral in 2017-18, hinting at a looming IPO.

It brands its product as a Converged Data Platform that, in effect, flattens the Lambda Architecture by supporting streaming, interactive, and batch processing on the same cluster. The company emphasizes that its platform is not a collection of disparate pieces, but file, table, and streaming engines springing from common building blocks. Those encompass the NFS-mountable, POSIX-compliant MapR-FS file system, on which the tables of its NoSQL database (MapR-DB) co-reside. And with the open source Drill project that MapR leads, you can use the same query engine to target relational and JSON data.
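Drill's pitch -- one query surface over both flat, table-like rows and nested, schema-on-read JSON -- can be illustrated in miniature. The sketch below is not Drill's API (real Drill speaks ANSI SQL over files, e.g. SELECT name FROM dfs.`/data/users.json`); it is a toy Python analogy showing the same query function applied to both shapes of data:

```python
import json

# Toy illustration of Drill's idea: one query function that works the same
# over flat relational-style rows and nested JSON documents.
def select(records, field, predicate):
    """Return the values of `field` for records matching `predicate`.
    A dotted field path descends into nested JSON objects."""
    out = []
    for rec in records:
        value = rec
        for part in field.split("."):
            value = value.get(part)
            if value is None:
                break
        if value is not None and predicate(value):
            out.append(value)
    return out

# Flat, table-like rows (what you'd get from CSV or an RDBMS dump).
rows = [{"name": "amex", "tier": 1}, {"name": "acme", "tier": 3}]

# Nested JSON documents, parsed straight from text -- no schema declared.
docs = [json.loads('{"name": "iot-1", "meta": {"tier": 1}}')]

print(select(rows, "tier", lambda v: v <= 1))       # [1]
print(select(docs, "meta.tier", lambda v: v <= 1))  # [1]
```

The point of the analogy: the caller's query doesn't change when the data goes from flat to nested, which is the convenience Drill claims at full SQL scale.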

Those building blocks were designed to pick up where the original open source Hadoop project technologies left off. In contrast to HDFS, MapR-FS supports update and delete, allowing storage to occupy a far more compact footprint. And a distributed metadata management system in place of the HDFS NameNode makes possible point-in-time snapshots and multi-master replication, just in case you want to run across multiple data centers and/or clouds concurrently. And its ability to control data placement yields advantages to clients who need to physically segregate specific pools of data. Topping it off, MapR claims its file system is far more scalable, capable of supporting at least 100x more files than HDFS.
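Because MapR-FS presents an NFS/POSIX interface, ordinary file operations can overwrite bytes in place -- exactly what HDFS's append-only model disallows. A minimal sketch of what update-in-place looks like, using a local temp file to stand in for an NFS-mounted MapR-FS path (the record format here is made up for illustration):

```python
import os
import tempfile

# Stand-in for a file on an NFS-mounted MapR-FS volume (illustrative path).
fd, path = tempfile.mkstemp()
os.close(fd)

# Write an initial record.
with open(path, "wb") as f:
    f.write(b"status=PENDING;id=12345")

# Update it in place -- a plain POSIX seek-and-overwrite, which
# HDFS's append-only semantics cannot do without rewriting the file.
with open(path, "r+b") as f:
    f.seek(7)              # jump to the value of the "status" field
    f.write(b"SHIPPED")    # overwrite 7 bytes in place

with open(path, "rb") as f:
    print(f.read())        # b'status=SHIPPED;id=12345'

os.remove(path)
```

On HDFS, the equivalent change means rewriting the whole file (or appending a new version and compacting later), which is where MapR's compact-footprint claim comes from.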

MapR initially supported its own proprietary implementation of HBase that eliminated region servers and the need for compaction operations; it later added its own alternative, MapR-DB, which it claims offers higher reliability than HBase, better scalability than MongoDB, and stronger consistency than Cassandra.

MapR has also tilted at windmills when it comes to distributed pub/sub message queuing. While Apache Kafka has drawn widespread third-party support from data platforms and streaming engine providers alike, MapR introduced its own alternative, MapR Streams, which also resides on the same cluster where the file system and NoSQL databases live. Kafka, by contrast, would require a separate cluster. Strike another blow for flattened architecture.

And so, given the theme of convergence, it's not surprising that MapR also boasts of bringing analytic and operational processing together. Yes, they run on the same cluster, but not the same storage engines; analytics is still the domain of the Hadoop file system side, while operational is confined to the NoSQL database -- although you could use Drill to query across both. MapR's narrative of bringing analytics and operational applications together is not unusual.

Household names like Oracle and IBM DB2 address it with in-memory column stores that live alongside their row stores, while emerging transaction systems like MemSQL and Splice Machine offer Spark interfaces. The same holds for the name-brand NoSQL databases -- MongoDB, Couchbase, and Cassandra.

MapR also makes the case that converged platforms are more efficient for IoT, because they mean fewer round trips to different clusters for aggregating, streaming, parsing, analyzing, and persisting the data. But convergence for use cases as distributed as IoT only goes so far, literally: the dispersed nature of IoT data and bandwidth limitations argue for filtering, pre-processing, and aggregation out at the edges. The question is whether MapR will follow in Amazon's footsteps in offering a Greengrass-like pre-processing facility at the edge.
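The edge-side pattern described above -- filter noise locally, aggregate, and ship only a compact summary upstream -- can be sketched in a few lines. The sensor readings, valid range, and summary format below are invented for illustration, not taken from any MapR or Greengrass API:

```python
from collections import defaultdict

def summarize(readings, lo=-40.0, hi=85.0):
    """Drop implausible temperature readings at the edge, then return a
    per-sensor (count, min, max, mean) summary -- the only payload that
    would cross the wire to the central cluster."""
    buckets = defaultdict(list)
    for sensor_id, temp in readings:
        if lo <= temp <= hi:          # edge-side filtering
            buckets[sensor_id].append(temp)
    return {
        sid: {"count": len(vals), "min": min(vals),
              "max": max(vals), "mean": sum(vals) / len(vals)}
        for sid, vals in buckets.items()
    }

raw = [("s1", 21.5), ("s1", 22.5), ("s1", 999.0),  # 999.0: sensor glitch
       ("s2", -5.0)]
print(summarize(raw))
```

Four raw readings shrink to two small summary records, which is the bandwidth argument in miniature: the central cluster sees aggregates, not every sample.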

MapR has made considerable headway with customers like American Express, which required Spark-like analytic performance about five years before Spark emerged. It also required the ability to update files in place, but commercial file systems like Isilon at the time weren't equipped to handle large-scale parallel data access.

There's still a major gap that MapR has yet to fill: its rivals, Cloudera and Hortonworks, have been far more proactive with data governance and security. While its converged platform offers a common target for third-party offerings to encrypt or protect data, MapR has yet to provide its own solutions for data discovery, audit, lineage, metadata management, and policy enforcement.

Its rebel stance has meant that MapR must prove its innocence: its underlying components are different from the core Apache Hadoop stack, but at the API level they are compatible, meaning you can move files between MapR and rival Hadoop platforms and run the same Spark jobs.

And back to our original point: MapR's practices aren't all that different from the rest of the field anymore. But set aside acceptance of open core for the moment; the existence of open source or standards won't guarantee interoperability. For instance, in the Hadoop world there are at least a dozen interactive SQL engines -- and in the SQL relational database world, there are even more variants floating around.

Nonetheless, there are limits to the degree that MapR can tout advantages like high performance or convergence. In the era of Spark (which MapR supports), real-time performance on Hadoop is not so unusual. And the advantages of a converged platform may not suit all scenarios. No matter how performant your platform, if you have highly demanding dueling workloads -- say, highly complex batch jobs involving tens of petabytes of data running alongside real-time, continuous IoT applications powering a smart city -- it may still be advantageous to run them on separate clusters.

And for a rebel, there are sometimes limits to the advantages of thinking different. Take Kafka: with growing third-party support, will MapR Streams be left at the altar? Of course, you can run MapR with a separate Kafka cluster, but then you've lost part of the advantage of having a converged platform.

Nonetheless, with its converged platform, this rebel has an interesting TCO trick up its sleeve. A platform that can run the same workloads with a lot less infrastructure makes for far more compact clusters. Sure, the hardware is cheap, but too much of cheap becomes very expensive.
