A year ago, Turing Award winner Dr. Michael Stonebraker made the point that when you try to manage more than a handful of data sets, manual approaches run out of gas and the machine must come in to help. He was referring to the task of cataloging data sets, in the context of the capabilities offered by his latest startup, Tamr. If your typical data warehouse or data mart involves three or four data sources, it's possible to get your head around the idiosyncrasies of each data set and how to integrate them for analytics.
But push that number to dozens, if not hundreds or thousands, of data sets, and any human brain is going to hit the wall -- maybe literally. And that's where machine learning first made big data navigable, not just to data scientists, but to business users. Introduced by Paxata, and since offered by a long tail of startups and household names, data preparation tools applied machine learning to help the user wrangle data through a new kind of iterative process. More recently, analytic tools such as IBM's Watson Analytics have been employing machine learning to help end users perform predictive analytics.
Walking the floor of last week's Strata Hadoop World in New York, we saw machine learning powering "emergent" approaches to building data warehouses. Infoworks monitors what data end users are targeting with their queries by taking a change data capture-like approach to monitoring logs; but instead of just tracking changes (which is useful for data lineage), it deduces the data model and builds OLAP cubes. Alation, another startup, uses a similar approach for crawling data sets to build catalogs with Google PageRank-like rankings showing which tables and queries are the most popular. It's supplemented with a collaboration environment where people add context, and a natural language query capability that browses the catalog.
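To make the idea of ranking tables by popularity concrete, here is a minimal sketch in Python. It is not Alation's method: it swaps in a plain frequency count over a query log for anything PageRank-like, and the log contents, the regular expression, and the `table_popularity` helper are all hypothetical, made up for illustration.

```python
import re
from collections import Counter

# Hypothetical query log; a real catalog would read the database's query history.
QUERY_LOG = [
    "SELECT * FROM sales JOIN customers ON sales.cust_id = customers.id",
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT name FROM customers WHERE country = 'US'",
    "SELECT * FROM sales WHERE amount > 100",
]

# Naive pattern: the identifier that follows FROM or JOIN is treated as a table name.
# This is an assumption for the sketch; it does not handle subqueries or aliases.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE)


def table_popularity(queries):
    """Return (table, query_count) pairs, most frequently queried first."""
    counts = Counter()
    for query in queries:
        # Use a set so a table is counted at most once per query.
        counts.update({name.lower() for name in TABLE_REF.findall(query)})
    return counts.most_common()


if __name__ == "__main__":
    for table, hits in table_popularity(QUERY_LOG):
        print(f"{table}: referenced in {hits} queries")
```

A production catalog would need a real SQL parser and would likely weight references by who ran the query and how recently, but the counting step conveys the basic signal.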
Just as machine learning is reshaping the data transformation process to help business users navigate their way through big data, it's also starting to provide the intelligence to make them more effective with exploratory analytics. Over the past couple of years, interactive SQL was the most competitive battleground for Hadoop providers -- enabling established BI tools to treat Hadoop as simply a larger data warehouse. Going forward, machine learning will become essential to making users productive with exploratory analytics on big data.
What makes machine learning possible within an interactive experience is the emerging Spark compute engine. Spark is what's turning Hadoop from a Big Data platform into a Fast Data one. By now, every commercial Hadoop distro includes a Spark implementation, although which Spark engines are included (e.g., SQL, Streaming, Machine Learning, and Graph) still varies by vendor. A few months back, IBM declared it would invest $300 million and dedicate 3,500 developers to Spark machine learning product development, followed by Cloudera's announcement of a One Platform initiative to plug Spark's gaps.
And so our attention was piqued by Netflix's Strata session on running Spark at petabyte scale. Among Spark's weaknesses are that it hasn't consistently scaled beyond a thousand nodes and that it is not known for high concurrency. Netflix's data warehouse currently tops out at 20 petabytes and serves roughly 350 users (we presume they are technically savvy data scientists and data engineers). Spark is still in its infancy at Netflix; while workloads are growing, they are not at a level that would merit a dedicated cluster (Netflix runs its computing in the Amazon cloud, on S3 storage). Many of the Spark workloads are for streaming and run under YARN. And that leads to a number of issues showing that at high scale, and high concurrency, Spark is a work in progress.
Netflix is tackling several issues to make Spark scale. One is adding caching steps to accelerate the loading of large data sets. Related to that is reducing the latency of retrieving the large metadata sets ("list calls") that are often associated with large data sets; Netflix is working on an optimization that would apply to Amazon's S3. Another scaling issue relates to file scanning (Spark normally scans all Hive tables when a query is first run); Netflix has designed a workaround that pushes down predicate processing so that queries only scan relevant tables.
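The intent behind pushing predicates down can be illustrated with a small, self-contained Python sketch. This is not Netflix's workaround or Spark's actual code path; it only shows the general idea of applying a filter to metadata before any data is read. The `Partition` type, the `prune` helper, the bucket paths, and the choice of a date-keyed partition as the unit of scanning are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    path: str  # hypothetical storage location
    date: str  # partition key value in ISO format (YYYY-MM-DD)


# Hypothetical metadata listing for one table.
PARTITIONS = [
    Partition("s3://example-bucket/events/date=2015-09-28", "2015-09-28"),
    Partition("s3://example-bucket/events/date=2015-09-29", "2015-09-29"),
    Partition("s3://example-bucket/events/date=2015-09-30", "2015-09-30"),
    Partition("s3://example-bucket/events/date=2015-10-01", "2015-10-01"),
]


def prune(partitions, start, end):
    """Keep only partitions whose date falls within [start, end].

    ISO dates sort lexicographically, so plain string comparison is enough.
    """
    return [p for p in partitions if start <= p.date <= end]


if __name__ == "__main__":
    # Only these locations would need to be listed and scanned for the query.
    for part in prune(PARTITIONS, "2015-09-29", "2015-09-30"):
        print(part.path)
```

The payoff comes from the ratio: if a query's predicate matches two of four partitions, half the listing and scanning work disappears, and the savings grow with the size of the table.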
For most business users, the issue of Spark scaling won't be relevant, as their queries are not routinely expected to involve multiple petabytes of data. But for Spark to fulfill its promise of supplanting MapReduce for iterative, complex, data-intensive workloads, scale will prove an essential hurdle to clear. We have little doubt that the sizable Spark community will rise to the task. But the future won't necessarily be all Spark all the time. Keep your eye out for the Apex streaming project; it's drawn some key principals who have been known for backing Storm.