Strata HadoopWorld Fall 2016 postmortem: Maybe AI's the future, but can we make the data science work?

At Strata, we had several discussions on the supposed inevitability of AI and machine learning. But just because you have a Hadoop cluster doesn't mean that your ML models are going to run.
Written by Tony Baer (dbInsight), Contributor

Given all the hype over artificial intelligence (AI) these days, at first glance it would seem surprising that it appeared as almost an afterthought at Strata last week.

There were a handful of product announcements. Maana added semantic search-like capabilities in the newest release of its knowledge management platform for resource-intensive industries like oil and gas, and Splunk grafted machine learning onto its offerings for identifying and resolving incidents from IT system log files.

And in a keynote talk entitled "Connected Eyes," Microsoft's Joseph Sirosh spoke of a project with India's leading eye institute that applied machine learning over large patient populations to improve outcomes for eye surgery.

But that short list obscures the bigger picture. Conference sponsor O'Reilly acknowledged as much by breaking out AI into a separate pre-event track the day before. And anyway, this wasn't a Google Cloud event, where AI was front and center.

So, get used to it. There's plenty of hype going around over whether AI can, will, or should replace humans (spoiler alert: the answers are "not"). But even if present-day AI is no smarter than a bunch of idiot savants, there are plenty of practical and often unglamorous jobs that AI's core ingredient, machine learning (ML), is already performing.

Last year at Strata, we saw ML becoming almost ubiquitous in tooling for data management and governance of data lakes, with providers spanning A to Z.

The rationale for using ML, rather than static governance rules, lies in the nature of data lakes. Unlike with data warehouses, you won't know exactly what data will flow in, so it won't be practical to build rules ahead of time that dictate schema, data quality, and de-duping, or that identify what data is likely to be sensitive (even weblogs could give PII data away).

Governance, whether it involves preparing data, building a catalog, or identifying master or reference data, may be a moving target requiring the system to "learn" how the norms are changing.
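To make the weblog point concrete, here is a minimal Python sketch. It is not machine learning, and it is not any vendor's method; it is a simple stand-in that profiles values instead of trusting column names. The column names, the sample records, and the 50 percent threshold are all hypothetical, invented for illustration.

```python
import re

# Deliberately simple email pattern; real PII detection needs far more than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Static rule written ahead of time: flag columns by name (hypothetical list).
SENSITIVE_COLUMN_NAMES = {"email", "ssn", "phone"}


def flag_by_name(records):
    """Flag columns whose names appear on the predefined sensitive list."""
    columns = set()
    for record in records:
        columns.update(record.keys())
    return sorted(c for c in columns if c.lower() in SENSITIVE_COLUMN_NAMES)


def flag_by_content(records, min_hit_ratio=0.5):
    """Flag columns where enough of the observed values contain an email address."""
    hits, totals = {}, {}
    for record in records:
        for column, value in record.items():
            totals[column] = totals.get(column, 0) + 1
            if EMAIL_PATTERN.search(str(value)):
                hits[column] = hits.get(column, 0) + 1
    return sorted(c for c in totals if hits.get(c, 0) / totals[c] >= min_hit_ratio)


# Hypothetical weblog records: no column is named "email", yet addresses leak via the URL.
WEBLOG = [
    {"ts": "2016-10-01T09:00:00", "url": "/reset?user=ann@example.com", "status": 200},
    {"ts": "2016-10-01T09:00:05", "url": "/reset?user=bob@example.org", "status": 200},
    {"ts": "2016-10-01T09:00:09", "url": "/home", "status": 304},
]

if __name__ == "__main__":
    print("flagged by name:   ", flag_by_name(WEBLOG))
    print("flagged by content:", flag_by_content(WEBLOG))
```

The name-based rule finds nothing here, while the content check flags the url column. An ML-based tool goes further by adapting to patterns nobody wrote down in advance, but the failure mode of the static rule is the same.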

And there's ML elsewhere as well. Providers like Cloudera build ML into the trouble ticket tracking that backs the automated "phone home" function of subscriber client technical support.

As we noted with our take on DataRobot, there is a growing array of tools aimed at simplifying or accelerating different aspects of the lifecycle of building and deploying ML programs.

And ML is showing up in end user analytic tools that help humans parse the signals in data, wrangle it into shape, suggest which questions to ask, and help piece together the narrative.

In other words, when it comes to the packaged software tools that govern big data or analyze it, we're probably starting to take embedded machine learning for granted.

But what if your own data scientists want to get their own hands dirty? As we noted a few weeks back, there's a lot of pent-up enthusiasm among R and Python programmers for ML, which many look at as the latest shiny new thing.

But for all the enthusiasm, at least among Spark users, SQL and streaming are more frequent workloads, according to the 2016 Spark Survey just released by Databricks.

Part of the disconnect between enthusiasm and action is that the R or Python programs that machine learning developers write don't necessarily work as well with Spark as programs written in Scala (which is Spark's native language). Their ML libraries of choice, drawn from sources like CRAN or Scikit-Learn, are not easily ported to Spark.

SparkR and PySpark, the APIs that R and Python programmers use to access Spark, also have their limitations. For instance, SparkR does not support a number of functions (e.g., splitting data sets) to which R programmers are accustomed and only supports a subset of Spark MLlib machine learning libraries.

Meanwhile, PySpark does not yet support all of Spark's API calls. Furthermore, Spark's DataFrame differs in syntax from the Pandas DataFrame, its Python equivalent.

Admittedly, all this is occurring as the targets are moving with the transition to Spark 2.0. Spark MLlib will eventually be subsumed by Spark ML, while DataFrames are being unified with Datasets to provide more streamlined targets. And hopefully, with refactored targets will come extensibility that might address some of the impedance mismatches.

And the R community is also taking the law into its own hands. At Strata, RStudio announced sparklyr, a new adaptation of the popular dplyr R data manipulation package for Spark.

Maybe the problem's even more basic: there's data out there, but your organization hasn't figured out the use case. Providers like Datameer are putting their own skin in the game with offerings like no-obligation half-day workshops to hammer out some blueprints.

But let's say your team has gotten past that: they know what the problem is and your data scientists are already coding predictive models against it. Now what?

The cold hard reality is that translating programs and getting them to run are separate challenges. The data scientist may know his or her way around algorithms, but may not have the skills of the data engineer for physically deploying the programs and marshalling the nodes and data sets to get them to execute.

This scenario shouldn't be shocking because, although data scientists may enjoy exalted status these days, their day-to-day issues are rather mundane.

So don't be surprised when the models that data scientists develop don't always get deployed. Matt Brandwein, a product manager at Cloudera, found an all-too-frequent scenario among customers where models got no further than PDF files that may or may not have made their way to somebody else in the organization.

And if the model does get to someone, don't be surprised if they code it in the language they know. And in that case, let's hope that the logic of the model survived translation.
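One hedge against a lossy translation is a parity check: score the same sample records with the data scientist's original model and with the re-coded version, and flag any disagreement. What follows is a minimal Python sketch of that idea, not anything presented at Strata. The churn-style score, its coefficients, the sample records, and the function names are all hypothetical, and both versions are written in Python here only to keep the example self-contained.

```python
import math


def original_score(tenure_months, support_calls):
    """Hypothetical model as the data scientist wrote it: a logistic score."""
    z = 0.8 * support_calls - 0.05 * tenure_months - 1.0
    return 1.0 / (1.0 + math.exp(-z))


def ported_score(tenure_months, support_calls):
    """Hypothetical re-implementation by another team.

    It contains a deliberate bug: the -1.0 intercept was lost in translation.
    """
    z = 0.8 * support_calls - 0.05 * tenure_months
    return 1.0 / (1.0 + math.exp(-z))


def parity_failures(reference, candidate, samples, tolerance=1e-9):
    """Return (sample, expected, actual) for every sample where the two disagree."""
    failures = []
    for sample in samples:
        expected = reference(*sample)
        actual = candidate(*sample)
        if abs(expected - actual) > tolerance:
            failures.append((sample, expected, actual))
    return failures


# Hypothetical (tenure_months, support_calls) records used as the shared test set.
SAMPLES = [(1, 0), (12, 3), (36, 1), (60, 5)]

if __name__ == "__main__":
    for sample, expected, actual in parity_failures(original_score, ported_score, SAMPLES):
        print(f"mismatch on {sample}: expected {expected:.4f}, got {actual:.4f}")
```

In practice the candidate would be the production version, called through whatever interface it exposes, with the reference scores exported once from the data scientist's environment.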

Providers like IBM have addressed the gap with collaboration offerings like the Data Science Experience, which supports accessing notebooks, managing data science projects, scheduling analytic compute runs, and managing access to, and tracking the lineage of, different sources of data.

Providers like Alpine Data, Dataiku, and Domino Data Lab, in turn, offer tooling that attempts to bridge the gap between data scientists and the business and, in some cases, to track deployment. But again, there's scant automation of the physical deployment step.

All too often, the reality of deploying ML models to production is that there's a gap that depends on the kindness of strangers -- or more likely, that of your friendly local data engineer.

This is the second of two posts reviewing our take on Strata Fall 2016.
