Strata NYC 2018: AI, data governance, containers and the production-ready data lake

Another Strata Data Conference in NYC has come and gone. Here's a roll-up of the news from and during the show, organized by the themes that emerged.
Written by Andrew Brust, Contributor

It's now a Fall ritual for me: emerge from the haze of summer, walk the kids to school and jump on the 34th Street crosstown over to Jacob Javits Convention Center. Once I get there, I badge up and join all my Big Data buddies who've come to town for Strata Data Conference New York, to show off what they did on their summer vacations.

The other part of the ritual is to collect all the press releases and briefing notes and put together a summary of the news, including a few announcements from vendors who weren't even at the show. This post constitutes the 2018 edition of that summary.

Typically, after so many briefings (I had 15 this year), some common themes emerge. This year the big ones were: the production-readiness of the open source data lake/analytics stack; the integration of container technology (Docker and Kubernetes, primarily) into that stack; the importance of data governance; and the continued march forward of machine learning and AI. I'll use these themes as an organizing tool to discuss all the news.

The Hadoop generation comes of age
Perhaps the capstone of my briefings this year was a discussion with Cloudera's Doug Cutting, the creator of Apache Hadoop. We'd never met before, and I was struck by the timing, given that the Big Data ecosystem is huge, but the importance of Hadoop itself within it has receded -- a phenomenon that was pronounced even at last year's conference:

Also read: Strata NYC 2017 to Hadoop: Go jump in a data lake

I asked Cutting how he feels about the status and role of Hadoop in what some consider to be the post-Hadoop era. His response was a two-parter:

  • The entire Big Data ecosystem is an outgrowth of Hadoop and related technologies, and it's going gangbusters
  • Hadoop has made open source data technology, consisting of a group of loosely-coupled projects, a mature, working reality

Cutting's latter point contrasts with the old world of Enterprise data and BI stacks, wherein Enterprises would buy an array of interlocking products from one vendor. Many of those same customers are now bringing together numerous open source technologies that sometimes require a bigger integration effort. But today, through the evolution of the products and the skill sets in the buyer community, taking these products to production is much more feasible.

As an example, Cloudera announced the sixth major release of its distribution this week...more than four years after the release of its fifth. I can't really call it a "Hadoop distribution" anymore, because it now bundles 26 different open source projects within it (as Mike Olson, the company's chief strategy officer, told me in a separate conversation this week). But Hadoop 3.x is a major part of the release, as is the Impala-based data warehouse technology that was also announced recently. Along with an IoT-centered partnership with Red Hat, Cloudera has had a lot to chat about recently.

Also read: Cloudera's a data warehouse player now

Another announcement in the Strata time frame, this time on the Enterprise BI front, was Information Builders' relaunch of its flagship WebFOCUS product. The decades-old company, whose headquarters are just a few blocks east of Javits Center, nonetheless made its announcement outside the auspices of the event. The company states WebFOCUS boasts a new user interface (shown below); it also sports data science functions, a new dynamic metadata layer and new data management features. There's new connectivity to cloud data warehouse technologies, including Amazon Redshift and Google BigQuery, too.


The revamped WebFOCUS UI

Credit: Information Builders

And, speaking of Redshift and BigQuery, online data connectivity player Fivetran just this week released its 2018 Data Warehouse Benchmark, measuring performance and cost of both of those products, along with Snowflake, Azure SQL Data Warehouse, and the Presto open source SQL query engine.

In other platform maturity news, Trifacta keeps plugging away at its market -- the company told me it's doubling revenue and tripling its customer count each year. It's entered into a partnership with IoT/machine data player Sumo Logic, and it's added scheduling, alerting, workload management and other features to boost the rigor of its use in production settings. Trifacta isn't just for casual self-service data prep anymore.

On the subject of IoT, quite separately from the Strata event, Sprint announced this week its new Curiosity IoT platform, a combination of a "dedicated, virtualized and distributed IoT core" network, and a new operating system, developed with Ericsson and based on technology from Arm.

Moving on, NoSQL databases are stepping up to production challenges, too. This comes about through efforts by the NoSQL vendors themselves, as well as third parties. As an example of the latter, Rubrik announced its Datos IO 3.0 release, which now provides full backup and recovery capabilities for both Cassandra/DataStax and MongoDB. Datos IO 3.0 can run in containers and across multiple public clouds, including Microsoft Azure and Oracle Cloud, which join Amazon Web Services and Google Cloud Platform as supported environments.

Contain yourself
Speaking of containers and the public cloud, the two together form another big theme at this year's Strata New York event. For instance, Hadoop 3.x itself has introduced the ability for Docker containers to be deployed as YARN jobs.

But, just prior to Strata's kickoff, Hortonworks announced its Open Hybrid Architecture Initiative, which is an effort to containerize the entirety of Hadoop. Another facet of this is the separation of storage and compute in the Hadoop platform, leveraging the work of the Ozone File System. This is a big departure in the Hadoop world but, along with containerization and Kubernetes-compatibility efforts, should make Hadoop much more cloud-ready and much more portable between on-premises and public cloud environments.

Also read: Hortonworks unveils roadmap to make Hadoop cloud-native

El gobernador
Another common refrain at Strata was the importance of data governance. Part of this is driven by the need for compliance with regulatory frameworks like the EU's General Data Protection Regulation (GDPR), which went into effect in May of this year.

Also read: GDPR: What the data companies are offering

But there also seemed to be a general consensus that data governance and data cataloging are super-important to the effort of making the corporate data lake something that's usable and a true enabler of corporate digital transformation.

In that vein, Waterline Data and MapR announced a partnership, whereby the latter company will sell an integrated version of the former's product as Waterline Data Catalog for MapR, a new, optional component in MapR's Converged Data Platform. And Alation announced a partnership with First San Francisco Partners "to deliver best practices for modernizing data governance with data catalogs."

Okera, which only recently came out of stealth, has already announced a v1.2 release of its platform, which combines a data catalog and a permissions-driven governed data fabric. The new release brings connectivity to relational databases, in addition to the data lake sources that were already supported; dynamically-generated role-based views; analytics on top of Okera's usage and audit data (useful for regulatory compliance and breach-detection); and fine-grained permissions allowing for varied data steward roles, so that data stewardship capabilities are not an all-or-nothing feature. The new Okera release is available now.

All about connections
By the way, you can't govern data if you can't connect to it. Accordingly, Simba Technologies, which co-developed ODBC with Microsoft in the 1990s and is now a unit of Magnitude Software, announced its new Magnitude Gateway product. Now, rather than buying individual data connectors, or even a big library of them, users connect to the Gateway product, which connects through to multiple back-end databases and applications via a framework of "Intelligent," "Standard" and "Universal" adapters.

Another facet of connectivity is access to public data sets. In that regard, Bloomberg announced its Enterprise Access Point, providing standardized reference, pricing, regulatory and historical datasets for Bloomberg Data License clients, developers and data scientists.

Artificial intelligence, naturally
A data service for data scientists is one thing, but, on the other end of the spectrum, SAP announced its new Analytics Cloud, a platform that lets business users harness machine learning without necessarily needing data scientists. Given that SAP manages customers' sales, supply chain and other business-oriented data, its offering contrasts with the Bloomberg service, which provides public/open data.

According to SAP, Analytics Cloud gives business users the capability to do things like "forecast future performance with just a single click" and "provide risk and correlation detection, autonomous creation of advanced dashboards and storyboards, and hyper-personalized insights into data about suppliers, vendors and customers, including anomaly detection."

But what if you're a data scientist and want to get more hands-on with the data and predictive modeling? Dataiku announced today its Dataiku 5 release, which adds support for deep learning libraries (TensorFlow and Keras) and, just to prove my earlier point, can generate Docker containers that are deployable to Kubernetes clusters, as well.

That's all well and good on the modeling side, but Nvidia, the GPU chip maker that has become all about AI, made several announcements around AI infrastructure and inferencing. The announcements were made this week, not at Strata, but at GTC (The GPU Technology Conference) in Japan. These include:

  • The TensorRT Hyperscale Platform: a new AI data center platform
  • Tesla T4: an AI inference accelerator
  • TensorRT 5: a new version of Nvidia's deep learning inference optimizer and runtime
  • TensorRT inference server: a "microservice that enables applications to use AI models in data center production." (And guess what? It's containerized and scales using Kubernetes on Nvidia GPUs.)
  • CUDA 10: the latest release of Nvidia's parallel GPU programming model

Also read: NVIDIA morphs from graphics and gaming to AI and deep learning
Also read: NVIDIA swings for the AI fences
Also read: Nvidia doubles down on AI

And the kitchen sink
That's just about all the data news that's fit to "print" this week. And it's a lot. But, just as with big data, I find the higher the volume of news, the easier it is to draw out a small set of insights: production rigor, containerization, data governance/data access and AI are the big trends out of this year's Strata. They will likely be the big industry trends for the remainder of the year, and beyond, as well.
