Hortonworks, Confluent and Waterline attempt to make Big Data easier

In the last two weeks, three vendors, in separate announcements, have shown that the analytics industry recognizes the complexity of its products and is ready to do something about it.

Three vendors, with announcements over the last two weeks, are providing evidence of some soul searching in the analytics industry around the complexity of its technologies. Hortonworks, a spinoff of the original Hadoop development team at Yahoo, is offering managed services for its own Hadoop distribution. Confluent, the company behind the very popular Apache Kafka streaming data platform, put its SQL query layer into general availability. And Waterline Data, an important player in the data governance space, added new GDPR and data optimization features to its core data catalog functionality.

Horton hears a help cry
Hortonworks' offering may be the most radical, in its own way. In addition to preexisting support subscriptions for its Hortonworks Data Platform (HDP) and Hortonworks DataFlow (HDF) products, the company is now offering a new subscription level for both technologies called Hortonworks Operational Services.

According to the company's press release, the subscription, announced today, offers "a fully managed environment for customers...and ongoing access to Hortonworks engineers and support teams." In that same release, Hortonworks CTO Scott Gnau says, "At its heart, Hortonworks Operational Services is designed to reduce complexity when building, deploying and managing big data, whether it is on-prem or in the cloud..."

In the old days, such an arrangement would be called a turnkey installation, with outsourced management. It's reminiscent of the way IBM used to sell and service its mainframe computers, and even of the way many enterprises ran their data centers. It's also an admission that Hadoop has many moving parts, and that Hortonworks can probably shorten sales cycles and increase sales volumes by offering to take the management burden off the customer's shoulders.

Confluent's special K
Confluent, like Hortonworks, is a company founded to support an open source technology by the group of people who built it at a bigger company. In Confluent's case, the team at LinkedIn that built Apache Kafka founded the company to support it, enhance it and build an enterprise distribution around it.

Many new technologies first surface in the open source distribution at the beta level and are later added to the enterprise distribution for general availability. That's exactly what happened with KSQL, an adaptation of the Structured Query Language for streaming data. KSQL works with data streams and allows them to be addressed and queried much like static tables, to which the streams can also be joined. KSQL was first announced in August 2017 and, as of March 7th of this year, is generally available as part of the Confluent Platform distribution of Apache Kafka.
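To make the stream-to-table idea concrete, here is a minimal sketch in plain Python. It is not KSQL and does not use any Kafka API. It only illustrates, under simplifying assumptions, what joining a stream of events to a static table means. The names `users_table`, `clicks` and `join_stream_to_table` are hypothetical, and the choice of a left-style join, which keeps events that have no match, is an assumption made for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical "table": the latest known value per key (user_id -> region).
users_table: Dict[str, str] = {"u1": "EMEA", "u2": "APAC"}


def join_stream_to_table(
    events: Iterable[dict], table: Dict[str, str]
) -> Iterator[dict]:
    """Enrich each event, as it arrives, with the matching table row.

    Left-style join (an assumption for this sketch): events with no
    matching key are kept, with region set to None.
    """
    for event in events:
        region: Optional[str] = table.get(event["user_id"])
        yield {**event, "region": region}


# Hypothetical "stream": in a real system this would be unbounded;
# a short list stands in for it here.
clicks = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u3", "page": "/pricing"},
    {"user_id": "u2", "page": "/docs"},
]

enriched = list(join_stream_to_table(iter(clicks), users_table))
for row in enriched:
    print(row)
```

The point of a SQL layer is that a developer states this kind of join declaratively, as a query, rather than writing and operating the event-handling loop by hand.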

KSQL brings streaming data into the comfort zone of many more developers, who previously would have had to work with Kafka's own application programming interfaces (APIs). It provides a relatively simple layer of abstraction over an otherwise complex set of data structures and commands. Overall, this makes streaming analytics, IoT analytics included, much more accessible to mainstream developers than it was before.

Governed through the Waterline
On March 5th, just two days before the Confluent announcement, Waterline Data announced the transformation of its data catalog product into a full-fledged platform, which now sports applications for GDPR (the European Union's General Data Protection Regulation) compliance and data optimization/virtual data lake management. (Actually, the platform was there all along, but now multiple applications have been built on top of it.)

The GDPR application, geared towards Data Protection Officers (DPOs) and Data Stewards, includes risk assessment through audit reports, monitoring through an operational dashboard and case management through a workflow facility. The data optimization piece helps merge redundant data sets and generate a "super schema" that represents all data sets, providing a virtual data lake.

Complexity be gone?
As in the Hortonworks and Confluent cases, Waterline's solution, especially its data optimizer, is meant as a lifeline that simplifies otherwise complex work. So whether it's data governance and schema management, querying of streaming data, or management of Hadoop itself, it's clear the industry has realized that it must reduce the complexity it has created.

With so many different underlying open source components and the stringency of emerging regulatory regimes, the analytics industry has to make things easier. These three announcements show that vendors are getting the message. What remains to be seen is how quickly they can grab it off the service bus and process it.