Dremio and its eponymous platform have always been focused on high-performance data virtualization. Such platforms are centralized brokers that connect to and query multiple data sources on a user's or application's behalf. A hallmark of data virtualization, and one of its biggest hurdles, is performing federated queries that return a single result set composed of data from more than one of the connected sources. But today, Dremio is turning its attention to a single data source: the cloud data lake.
Rather than obsessing on the performance of querying multiple sources, Dremio is introducing technology that optimizes access to cloud data lakes. There are good reasons for this. Cloud data lakes are seeing increasing adoption in the Enterprise, acting as the de facto collection point for corporate Big Data. But the very architectural premise that makes the cloud data lake economically efficient -- its basis in cloud object storage -- also makes for an often-unspoken downside: data access is slow.
Rather than leave that flaw an obfuscated detail, Dremio is taking it head-on and working to solve the problem. Today, the company is introducing new Data Lake Engine technology, as part of its Dremio 4.0 release, that significantly increases query performance against data in Amazon Web Services' (AWS') Simple Storage Service (S3) and Azure Data Lake Storage (ADLS) Gen2.
In a briefing with ZDNet, Dremio CEO and co-founder Tomer Shiran explained the new architectural approach, which is fairly straightforward (with hindsight). Since Dremio itself is deployed on a cluster of cloud virtual machines (VMs) and since VM instance types that use high-speed Non-Volatile Memory Express (NVMe) solid state drive (SSD) technology are plentiful, Dremio's Data Lake Engine caches data from the cloud storage-based lake into those much faster SSDs for better performance. Dremio's name for this technology is "C3" (the Columnar Cloud Cache).
This is a smart approach based on robust precedent: various cloud data warehouse platforms, including Snowflake and Azure SQL Data Warehouse Gen2, use this very same technique of SSD caching to maximize performance while still leveraging economical cloud object storage as the persistent backing layer. In the cloud data warehouse world, this enables the clusters to be paused and resumed, while keeping bulk storage costs low.
Warehouse perf, lake flexibility, cool optimizations
In the data lake world, this approach will allow the same flexibility (with even Dremio's Apache Arrow-based "reflections" able to be persisted to cloud object storage), while still upholding the basic tenet of a data lake: maintaining a single repository of data which can be queried and processed by multiple processing and execution engines.
Dremio's Data Lake Engine technology goes beyond mere SSD caching, however. The company is also providing something it calls Predictive Pipelining, which will fetch data in advance from columnar data sources like Apache Parquet and ORC files on S3, ADLS or even HDFS (the Hadoop Distributed File System). Shiran says Predictive Pipelining results in 3x-5x faster query response times. Dremio has also brought to general availability (GA) its upgraded execution engine kernel based on the open source Gandiva technology (which it developed). Gandiva generates CPU-native native code, optimized for fast vector operations. The company says this yields performance increases of up to 70x (note that's an "x" and not a "%").
The combination of the data lake's agility, cloud security, C3's performance optimization and BI-friendly access means Dremio is facilitating the concept of data-driven culture in an important new way. If everything works as Dremio says, it would take the combination of open source data technology and cloud storage technology and make it performance-optimized and self-service-ready.