Hats off to my New York-based Big on Data brother Andrew Brust for catching the theme that it's time to pick up the pieces and consolidate the Hadoop sprawl. And in fact, the Hadoop world has come to the realization that it's not about any particular platform, but about the data. Channeling his inner Fred Armisen, Doug Cutting traipsed onto the stage to announce that forthwith, the conference would be rechristened Strata Data. While we're on the topic, the upcoming Hadoop Summit will now be known as DataWorks Summit.
Is there something in the water?
Part of the identity switch is due to the realization that with a proliferation of mix and match projects, it's getting futile to ask what makes Hadoop, Hadoop. And more to the point, as enterprises increasingly take to the cloud, whether you fire up a Hadoop cluster or an a la carte managed Spark or machine learning service is growing less and less relevant - the emphasis is on the results, regardless of whether there is an elephant in the room or not. As Merv Adrian observed, "stack expansion has ground to a halt." Maybe, as Cloudera's Mike Olson predicted a couple years back, Hadoop is finally becoming invisible.
The good news is some of those zoo animals are coming together. We've been impressed by, but also a bit critical of, the proliferation of tools for curating and tracking the inventory of data lakes. But Trifacta and Alation have put a couple pieces together, literally, with a new integration that brings together data wrangling and data cataloging. So, as the business user or data engineer catalogs and annotates the data in Alation, they can clean it up within the tool by pushing the Trifacta button, or vice versa. For Trifacta, it's been a good couple weeks, as this follows up their coup, becoming the OEM data wrangling tool powering the forthcoming Google Cloud Dataprep that was just released for private beta.
Actually, this being 2017, surprise, surprise, AI was front and center. The coolest demo of the conference came from Thomson Reuters VP of R&D Khalid Al-Kofahi, showing how machine learning can be used for identifying fake news within 40 milliseconds. The model takes a layered approach, using natural language processing to parse the headline or social network post, tag it, assess the credibility of the sources based on their history, and check whether there are multiple sources and whether those other sources are bots.
"If you want to do data science, don't be like Watson. Be like Holmes." - Cloudera's Mike Olson
With a dig at IBM, Mike Olson came up with the sound bite of the week: for AI, "If you want to do data science, don't be like Watson. Be like [Sherlock] Holmes" - which led to some rather interesting interpretations on the retweets. But as Andrew reported last week, Cloudera came up with a contribution of its own, a Data Science Workbench with the goal of bridging the gap between data science and data engineering to make sure that those models can actually run on the cluster. Cloudera's workbench provides a lifecycle management approach so that data scientists and data engineers can save, reuse, and version models.
But beyond the AI headlines, we found new attention to the real-time, Fast Data side of Big Data. There's nothing more real-time than online gaming. Phil Keslin, CTO of Niantic Labs - the Pokémon Go folks - recounted some lessons learned during the unexpected spike of the game's debut. Not only did it test the elasticity of the Google cloud, but it also reminded us that every software application must depend on the kindness of strangers - in this case, open source third-party libraries that tended to crash at 2:30 each afternoon when the west coast player crowd joined the east coast group in sufficient numbers to almost break the cloud. The bug was discovered by scanning an obscure developer forum, and a workaround was then implemented.
If the Hadoop zoo animals are getting less visible, their counterparts in the streaming world are becoming more so. Just as AI experienced a winter where software and hardware could not meet expectations for sentient machines, streaming - or at least its predecessor, complex event processing - experienced a similar shortfall over the past decade.
We've been burnt before. Back in 2012, in these pages, we forecast such a rebirth. In the years that followed, we saw rapid uptake for Fast Data at rest, through in-memory and flash-based databases. We've seen multiple streaming engines emerge. But if Ovum client queries were a good yardstick, we saw little interest in streaming. In the recent past, it's been overshadowed by AI and data lakes.
So it's interesting that, although Spark has gained momentum, there remains white space over streaming. Yes, many in the Spark community are taking advantage of Spark Streaming (which for now is still micro-batching), but consensus has yet to coalesce over which streaming engines are going to emerge as de facto standards. And so at Strata, we saw several open source engines (e.g., Apache Apex, supported by DataTorrent; Flink, supported by data Artisans) fight it out, contested by SQLstream. We're seeing a rash of data flow engines for managing and integrating data pipes from Hortonworks, MapR, Confluent, Striim, and others.
When it comes to streaming, there's clearly smoke; the question is where's the fire? IoT, a topic that provided the undercurrent for much of the agenda, could be forcing the issue. Some announcements at the conference, such as MapR Edge, covered by Andrew, acknowledge the reality that you don't want raw data from proliferating devices congesting the backbone. The reality of IoT means that, at some point, enterprises are going to have to make decisions about streaming.