MapR platform update brings major AI and Analytics innovation...through the file system

MapR goes back to its roots by innovating on the file system, resulting in a major boost for AI and analytics. If that sounds like a non sequitur, read on to see why it works.
Written by Andrew Brust, Contributor

From the very beginning of its entry to the market, MapR focused on the file system as an axis of innovation. Recognizing that the native Hadoop Distributed File System's (HDFS's) inability to accommodate updates to files was a major blocker for lots of Enterprise customers, MapR put an HDFS interface over the standard Network File System (NFS) to make that constraint go away.

While seemingly simple, the ability for Hadoop to see a standard file system as its own meant that lots of data already present in the Enterprise could be processed by Hadoop. It also meant that non-Hadoop systems could share that data and work cooperatively with Hadoop. That made Hadoop more economical, more actionable and more relevant. For many customers, it transformed Hadoop from a marginal technology to a critical one.
Back to the file system
While MapR has subsequently innovated on the database, streaming and edge computing layers, and has embraced container technology, it is today announcing a major platform update that goes back to file system innovation. But this time it's not just about making files updatable; it's about integrating multiple file system technologies, on-premises and in the cloud, and making them work together.

Also read: Kafka 0.9 and MapR Streams put streaming data in the spotlight
Also read: MapR gets container religion with Platform for Docker

The core of the innovation is around integration between the MapR file system (MapR-FS) and Amazon's Simple Storage Service (S3) file system protocols. This integration manifests in more than one form, and there's some subtlety here, so bear with me.

S3, for two
The first integration point is support for an S3 interface over MapR-FS, via the new MapR Object Data Service. This allows applications that are S3-compatible to read and write data stored in MapR-FS. Since the S3 protocol is supported not just by S3 itself, but also by on-premises file systems, the ecosystem support for the protocol is robust. Now MapR-FS is part of that ecosystem.


Figure: MapR's Object Data Services (Credit: MapR)

But the integration doesn't end there; it works in the other direction too. That is to say that S3-compatible storage volumes, including actual S3 buckets in the Amazon Web Services (AWS) cloud, can be federated into MapR-FS, providing a more economical storage option to accommodate data to which applications need only infrequent access.

Premium tiers
MapR-FS now also incorporates erasure coding for fast ingest, ideally on solid state disk (SSD) media. Together with standard S3-compatible storage and native MapR-FS, this allows for full-on storage tiering, enabling what MapR calls a "multi-temperature" data platform. Customers can put hot (frequently accessed) data on the performance-optimized SSDs; warm (infrequently accessed) data on conventional spinning disks; and cold (rarely accessed) data on S3-compatible storage, including Amazon S3 itself.
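To make the hot/warm/cold idea concrete, here is a minimal Python sketch that sorts data into a temperature by how recently it was accessed, then maps each temperature to the media described above. The 7-day and 90-day thresholds, the function names and the media labels are illustrative assumptions; MapR has not published how it defines each temperature.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)

# Temperature -> media, following the article's description of the tiers.
TIER_MEDIA = {
    "hot": "ssd",
    "warm": "hdd",
    "cold": "s3-compatible",
}


def classify_temperature(last_access: datetime, now: datetime) -> str:
    """Return 'hot', 'warm' or 'cold' based on time since last access."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


def media_for(last_access: datetime, now: datetime) -> str:
    """Return the media type a file of this age would land on."""
    return TIER_MEDIA[classify_temperature(last_access, now)]
```

A real system would likely weigh more than last-access time (size, access frequency, business rules), but the mapping from temperature to media is the core of the idea.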

Tiered storage is the enabler for keeping all data accessible, in an economically efficient fashion. That in turn allows for analytics and AI to be far more effective and powerful. You never know when that old data will be important in a particular analysis exercise. And sometimes the best machine learning models are the ones that have been built on deep, historical data, in addition to the more recently collected variety.

Don't just make it possible; make it easy
But tiered storage can't enable all that if it's just a manual storage strategy. Luckily, this new MapR platform release makes the placement of different data on different media automated, through declarative policy, and all the data tiers are federated in a single namespace so that they feel like a single file system.
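As a rough illustration of what "declarative policy" plus a single namespace might look like, the Python sketch below expresses placement rules as data and keeps logical paths stable while the backing tier changes. Everything here is hypothetical: the `Rule` fields, the glob patterns, the tier labels and the `Namespace` class are assumptions for the sake of the example, not MapR's actual policy syntax or API.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    pattern: str        # glob matched against the logical path
    max_idle_days: int  # rule applies while idle time is at or below this
    tier: str           # where matching data should live


# Hypothetical policy: the first matching rule wins.
POLICY = [
    Rule("/logs/*", 7, "ssd"),
    Rule("/logs/*", 90, "hdd"),
    Rule("*", 30, "hdd"),
]
DEFAULT_TIER = "s3"  # anything no rule claims goes to the cold tier


def place(path: str, idle_days: int, policy=POLICY) -> str:
    """Return the tier the policy assigns to a path with this idle time."""
    for rule in policy:
        if fnmatchcase(path, rule.pattern) and idle_days <= rule.max_idle_days:
            return rule.tier
    return DEFAULT_TIER


class Namespace:
    """One logical view of all files, whichever tier holds the bytes."""

    def __init__(self, policy=POLICY):
        self.policy = policy
        self.tier_of = {}  # logical path -> current tier

    def apply_policy(self, idle_days_by_path: dict) -> None:
        """Re-evaluate placement; logical paths never change."""
        for path, idle_days in idle_days_by_path.items():
            self.tier_of[path] = place(path, idle_days, self.policy)

    def listing(self) -> list:
        """What an application sees: paths only, no tier information."""
        return sorted(self.tier_of)
```

The point of the design is that an administrator edits the rules, not the data, and applications keep using the same paths before and after a file changes tiers.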

There's much more:

  • Important performance optimizations, including the location of metadata and file stubs in the native MapR-FS layer for S3 data
  • Security features like automatic encryption of all data by default and secure file-based services with NFSv4
  • Simple GET and PUT operations to move data physically between tiers
  • Scheduled or automatic file recall, which moves data from higher-latency tiers to lower-latency tiers when it becomes newly relevant
  • Support for fault tolerance features like disaster recovery clusters in the cloud through mirroring from the MapR cluster to MapR-XD cloud storage in AWS, Google Cloud Platform and Microsoft Azure
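The stub and recall items in the list above can be pictured with a toy in-memory model. This Python sketch is only an illustration under stated assumptions: the `TieredStore` class, its method names and the object-key scheme are invented for the example, and plain dictionaries stand in for both the native layer and the S3-compatible tier. It shows how a small stub kept in the fast layer can answer metadata questions without touching remote storage, and how a read can trigger an automatic recall.

```python
class TieredStore:
    """Toy model of offload, stubs and recall (not MapR's implementation)."""

    def __init__(self):
        self.local = {}   # path -> bytes, the low-latency tier
        self.remote = {}  # object key -> bytes, stands in for S3-compatible storage
        self.stubs = {}   # path -> metadata kept locally for offloaded files

    def write(self, path: str, data: bytes) -> None:
        self.local[path] = data

    def offload(self, path: str) -> None:
        """Move the bytes to the remote tier (a PUT) and leave a stub behind."""
        data = self.local.pop(path)
        key = "obj" + path.replace("/", "_")  # hypothetical key scheme
        self.remote[key] = data
        self.stubs[path] = {"key": key, "size": len(data)}

    def stat(self, path: str) -> int:
        """Return the file size; offloaded files are answered from the stub."""
        if path in self.local:
            return len(self.local[path])
        return self.stubs[path]["size"]

    def recall(self, path: str) -> None:
        """Bring the bytes back to the local tier (a GET) and drop the stub."""
        stub = self.stubs.pop(path)
        self.local[path] = self.remote.pop(stub["key"])

    def read(self, path: str) -> bytes:
        """Read by logical path, recalling automatically if needed."""
        if path not in self.local:
            self.recall(path)
        return self.local[path]
```

In this sketch a recall removes the remote copy; a real system might well keep it, and would also need to handle concurrency, failures and partial reads.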

Also read: MapR diversifies to cloud storage market
Also read: MapR File System selected by SAP for cloud storage layer

In addition to the above, MapR's integration of Apache Spark 2.3 and Drill 1.14; support for Kafka KSQL; and MapR-DB language bindings for Python and Node.js make analytics and AI more accessible to a variety of developers and business users. This accessibility is an excellent complement to the extra enablement provided by the tiered storage.

Parting thoughts
The heart of big data analytics and indeed AI involves high volumes of raw data stored as flat (delimited, JSON, XML, etc.) files. That makes the file system itself critical in operationalizing and optimizing analytics and AI. Adding abstraction layers across the many different storage technologies and locations available today, both on-prem and cloud-based, is key to breaking data silos and making the necessary data easily accessible. And that, in turn, is what makes superior analytics and machine learning possible.

This latest MapR platform release will be available in the third quarter of this year, i.e. within the next three calendar months.
