Apache Arrow: The little data accelerator that could

One of those behind-the-scenes projects, Arrow addresses the age-old problem of getting the compute-storage balance right for in-memory big data processing. Operating as a black box, many practitioners don't know it’s around. But with over a million downloads each month, Arrow’s presence is starting to become known.


A few years back, we noted the emergence of Apache Arrow; what piqued our attention was that the backers consisted of "a who's who list" of over 20 committers from the likes of Cloudera, MapR, Hortonworks, Salesforce.com, DataStax, Twitter, AWS, and Dremio.

As we characterized it then, Arrow was about big data, almost literally, lining its duck up in a… column. Arrow is a standard columnar format for persisting data efficiently in memory. You'd think that in-memory compute would simply brute force performance, which was one of the original draws of Spark. But memory isn't just a fast black box. There's a trick to loading data so it can be read efficiently; that's why developers often ran out of memory.

The challenge with balancing, in this case the loading of data into memory and the feeding of data from memory to compute is the symptom of a larger age-old problem that we used to refer to as balance of system. For years, Teradata sold different data warehousing appliances designed for data-intensive, compute-intensive, or mixed workload scenarios. Just because we've either brought compute to data, as in Hadoop clusters, or separated data from compute, as in the cloud, the challenge remains the same. In fact, it grows even more complicated when you have problems that really take resource requirements to the mix: deep learning AI problems that crunch lots of data while using special GPU hardware for compute.

Apache Arrow was conceived to solve a balance of system problem for data scientists: making sure that they didn't run out of memory when running their models or run out of budget because they overallocated memory. This wasn't an imaginary problem. For Wes McKinney, the Python and serial open source developer who created the Pandas data analysis library for Python (among other projects), it was the frustration of hitting that memory wall. He learned the hard way that, if you had a large data set, you'd better allocate at least 5 – 10x the amount of RAM to it. There needed to be a better way to reduce the trial, error, cost, and complexity of moving data into memory and accessing it.

McKinney, who is currently director of Ursa Labs, and Jacques Nadeau, CTO of Dremio, helped launch Arrow, which became an Apache project back in 2016. Arrow got supported by a critical mass of big data open source projects from the get-go: the Hadoop ecosystem, Spark, Storm, Cassandra, Pandas, HBase, and Kudu. It can read data from popular storage formats such as Apache Parquet, CSV files, Apache ORC, and JSON. Its initial support for Python has grown to 11 programming languages, among them C++, Java, Python, R, C#, JavaScript, and Ruby. When deployed, for instance with Spark, it has been benchmarked as improving performance up to 25x.

Given the wide support, the Apache project page listing a sampling of products and projects using Arrow is a bit underwhelming, as few of them are household names. Examples include Fletcher, a framework for converting an Arrow schema to work with FPGAs; Graphistry, a visual investigation platform used for security, anti-fraud, and related investigations; and Ray, a high-performance distributed execution framework designed for machine learning and AI applications. But where there's smoke, there's fire; download rates from the project portal are averaging about 1 million monthly. The community remains active; over the past year nearly 300 individuals have submitted more than 3000 contributions.

So where is Arrow pointing from here? The most exciting project involves its role as the foundation for cuDF, the DataFrame foundation library for RAPIDS that is built around Arrow. There is Gandiva, the emerging SQL execution kernel for Arrow developed by Dremio that is based on the LLVM open source compiler. Another initiative is around transport so data marshaled on one Arrow node can be efficiently replicated or moved to another.

But the missing piece is streaming, where the velocity of incoming data poses a special challenge. There are some early experiments to populate Arrow nodes in microbatches from Kafka. And, as the edge gets smarter (especially as machine learning is applied), it will also make sense for Arrow to emerge in a small footprint version, and with it, harvesting some of the work around transport for feeding filtered or aggregated data up to the cloud.

It's not time for the Arrow folks to pack up and go home yet.