When the open source Apache Arrow project was launched early last year, I covered it with great interest. The project's active contributors hailed from 13 other open source projects as wide-ranging as Cassandra, Impala, Pandas, Spark, and Hadoop itself. All of these projects have occasion to place data in memory in a column-oriented format, and they've all done it their own way. The Arrow project is all about creating a standard that the other projects can share, so that they can also share data between themselves, without having to convert the data between in-memory representations.
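To make the idea concrete, here's a minimal sketch of a column-oriented in-memory table, using the pyarrow library. The example is mine, for illustration; the point is that each column lives in a contiguous, typed buffer that any Arrow-aware engine can read without conversion.

```python
import pyarrow as pa

# Build a small table; each column is stored as a contiguous, typed
# buffer rather than as a collection of rows.
table = pa.table({
    "user_id": pa.array([101, 102, 103], type=pa.int64()),
    "score": pa.array([0.87, 0.42, 0.99], type=pa.float64()),
})

print(table.schema)           # user_id: int64, score: double
print(table.column("score"))  # the whole column, ready for zero-copy sharing
```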
In addition to the many companies, like Hortonworks, Cisco, and LinkedIn, that lent personnel to this project, a new startup, called Dremio, was the major force behind it. Though the company has been in stealth until today, its support of, and work on, Arrow was explicit. Two of Dremio's founders, Tomer Shiran (Dremio's CEO) and Jacques Nadeau (Dremio's CTO and Project Management Committee chair of Arrow), both hail from MapR (where Shiran was VP of product) and, significantly, from the Apache Drill project as well.
Also read: SQL and Hadoop: It's complicated
Drill serves as a single query engine that, in turn, can query and join data from among several other systems. Drill can certainly make use of an in-memory columnar data standard. But while Dremio was still in stealth, it wasn't immediately obvious what Drill's strong intersection with Arrow might mean. That made it hard to guess what Dremio was up to.
Introducing Dremio, the product
With Dremio emerging from stealth today, the association is clearer, because today the company is launching a namesake product that also acts as a single SQL engine, one that can query and join data from among several other systems, and that accelerates those queries using Arrow.
Let's back off the comparison with Drill though, and understand Dremio in its own right. It all stems from Dremio's credo that BI today involves too many layers. Source systems, via ETL processes, feed into data warehouses, which may then feed into OLAP cubes. BI tools themselves may add another layer, building their own in-memory models in order to accelerate query performance. Dremio thinks that's a huge mess.
Data lingua franca
Dremio disintermediates by providing a direct bridge between BI tools and the source systems they're querying. The BI tools connect to Dremio as if it were a primary data source, and query it via SQL. Dremio then delegates the query work to the true back-end systems through push-down queries that it issues. Dremio can connect to relational databases (both commercial and open source), NoSQL stores, Hadoop, cloud blob stores, and Elasticsearch, among others.
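As a rough sketch of what this looks like from the client side, the snippet below stands in for a BI tool connecting over ODBC. The DSN name and table are hypothetical, my own stand-ins; only the general pattern -- connect to Dremio as a data source, issue ordinary SQL -- reflects the description above.

```python
import pyodbc  # generic ODBC client library

# "DSN=Dremio" and the "orders" table are hypothetical, for illustration.
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

# To the client this is ordinary SQL; behind the scenes, the broker
# translates it into push-down queries against the back-end systems.
cursor.execute("SELECT region, SUM(sales) FROM orders GROUP BY region")
for row in cursor.fetchall():
    print(row)
```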
In an interview last week, Shiran and Nadeau told me that Dremio does not materialize its own data store in between the BI tool and the physical back-end databases, and yet it makes queries against that back-end data -- even when it's true Big Data -- perform like queries against "small data" that a BI tool might have in its own local model. It does this using a universal relational layer that utilizes an optimizer and cached data fragments.
In other words...
Here's how it works: all data pulled from the back-end data sources, say Shiran and Nadeau, is represented in memory using Arrow. Combined with vectorized (in-CPU processing) querying, this can yield up to a 5x performance improvement over conventional systems, the Dremio founders told me.
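For a sense of what vectorized querying over Arrow data means in practice, here's an illustrative comparison using pyarrow's compute kernels -- my example, not Dremio's internals:

```python
import pyarrow as pa
import pyarrow.compute as pc

scores = pa.array(range(1_000_000), type=pa.int64())

# Vectorized: a single compute kernel runs over the contiguous buffer.
fast_total = pc.sum(scores)

# Row-at-a-time equivalent, shown for contrast; this is the style of
# processing that columnar, vectorized engines avoid.
slow_total = sum(scores.to_pylist())

assert fast_total.as_py() == slow_total
```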
But a perhaps even more important optimization is Dremio's use of what it calls "Reflections," which are materialized data structures that optimize Dremio's raw data and aggregation operations. Reflections are sorted, partitioned, and indexed, stored as Parquet files on disk, and handled in-memory as Arrow-formatted columnar data. They may be built automatically by Dremio, based on query usage patterns it observes; they can also be created directly by those with administrative permissions.
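The Parquet-on-disk, Arrow-in-memory pairing is easy to demonstrate with pyarrow. The snippet below is a toy illustration of that pairing, not Dremio's actual Reflection machinery:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A toy materialized structure (the file name is illustrative).
reflection = pa.table({
    "region": ["east", "west", "east"],
    "sales": [120.0, 340.0, 95.0],
})
pq.write_table(reflection, "reflection.parquet")  # columnar format on disk

# Reading it back yields an Arrow table, ready for vectorized querying.
loaded = pq.read_table("reflection.parquet")
print(loaded.num_rows)  # 3
```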
Cube? No. Reflections? Yes.
Dremio may not build an OLAP cube per se, but Reflections sound similar to the aggregation tables built by Relational OLAP (ROLAP) systems (including the likes of AtScale), which don't materialize cubes either. Be that as it may, Dremio acts as a broker that interfaces BI tools like Tableau, Qlik, and Microsoft's Power BI with a variety of back-end databases, and that handles all query tasks on its own.
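To illustrate the aggregation-table idea in miniature -- again, a generic sketch rather than Dremio's implementation -- a rollup computed once can serve later queries in place of the raw detail rows:

```python
import pyarrow as pa

detail = pa.table({
    "region": ["east", "west", "east", "west"],
    "sales": [120.0, 340.0, 95.0, 60.0],
})

# Pre-aggregate once (Table.group_by requires pyarrow >= 7.0); later
# queries can scan this small rollup instead of every detail row.
rollup = detail.group_by("region").aggregate([("sales", "sum")])
print(rollup.to_pydict())
# e.g. {'region': ['east', 'west'], 'sales_sum': [215.0, 400.0]}
```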
Because Dremio plays this query broker role, it can also provide lineage information to help analysts understand the full path the data took, from back-end system to front-end analysis. (Some of the lineage experience is shown in the screenshot above.) Speaking more generally, Shiran and Nadeau tell me Dremio handles data ingest and curation as well.
Dremio is available in an open source Community edition as well as a commercial Enterprise edition. The Community edition is not scale-limited; rather, the Enterprise edition offers greater capabilities around security, governance, and data lineage. It also includes support, of course.
Enterprise subscriptions are priced based on the number of nodes Dremio is deployed to. Dremio can run in the cloud or on-premises, and it can run on a Hadoop cluster, as a YARN application, but it doesn't have to. In addition, support for Mesos and Kubernetes is on the roadmap.
Beyond Nadeau and Shiran's MapR pedigrees, other members of Dremio's leadership hail from MongoDB, MarkLogic, IBM, and Mesosphere. With a team assembled from the worlds of Hadoop, NoSQL, and Enterprise computing, and from the open source and commercial software worlds, it's clear Dremio's mission is both to respect the diversity of data repository technologies out there and, at the same time, break through the silos those diverse technologies have created.
While Dremio's approach to this is novel, and may break a performance barrier that heretofore has not been well-addressed, the company is nonetheless entering a very crowded space. The product will need to work on a fairly plug-and-play basis and live up to its performance promises, not to mention build a real community and ecosystem. These are areas where Apache Drill has had only limited success. Dremio will have to have a bigger hammer, not just an Arrow.