StreamSets updates ETL to the cloud data pipeline

Real-time streaming has moved the center of gravity for data transformation off the cluster to serverless data pipelines. Cofounded by veterans of Informatica, StreamSets is providing a third-party alternative in a landscape populated by cloud provider dataflow services.
Written by Tony Baer (dbInsight), Contributor

The emergence of real-time streaming analytics use cases has shifted the center of gravity for managing real-time processes. Because they operate in the moment, streaming engines have by nature been confined to rudimentary operations such as monitoring, filtering, and light transformations of data.

But as the need for more complex operations has grown, such as using streaming data to retrain machine learning models, data pipelines have gained new prominence. Data pipelines pick up where streaming and message queuing systems leave off. They provide end-to-end management of data flows from ingest through buffering, filtering, transformation and enrichment, and basic analytic functions that can be squeezed into real time. Typical use cases encompass anything involving IoT, cybersecurity, real-time fraud detection, live clickstream analytics, choreographing online gaming sessions, and so on.
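
To make those stages concrete, here is a minimal sketch in plain Python of a flow that ingests, buffers, filters, transforms, and enriches records. It is illustrative only and does not use StreamSets or any other vendor's API; the function names, the record layout, and the `DEVICE_SITES` lookup table are all assumptions invented for this example.

```python
from collections import deque

# Hypothetical reference data for the enrichment step (not from any real system).
DEVICE_SITES = {"sensor-1": "plant-a", "sensor-2": "plant-b"}


def ingest(raw_lines):
    """Parse raw 'device,reading' lines into records."""
    for line in raw_lines:
        device, _, reading = line.partition(",")
        yield {"device": device.strip(), "reading": reading.strip()}


def buffer(records, size=2):
    """Hold records in a small in-memory queue and release them in batches."""
    pending = deque()
    for record in records:
        pending.append(record)
        if len(pending) >= size:
            while pending:
                yield pending.popleft()
    while pending:  # flush whatever is left at end of input
        yield pending.popleft()


def keep_valid(records):
    """Filter: drop records whose reading is not numeric."""
    for record in records:
        try:
            celsius = float(record["reading"])
        except ValueError:
            continue
        yield {"device": record["device"], "celsius": celsius}


def transform(records):
    """Light transformation: convert Celsius to Fahrenheit."""
    for record in records:
        yield {"device": record["device"],
               "reading_f": record["celsius"] * 9 / 5 + 32}


def enrich(records):
    """Enrichment: attach the site that owns each device, if known."""
    for record in records:
        yield {**record, "site": DEVICE_SITES.get(record["device"], "unknown")}


def run_pipeline(raw_lines):
    """Chain the stages end to end and collect the results."""
    return list(enrich(transform(keep_valid(buffer(ingest(raw_lines))))))
```

Because each stage is a generator, records move through one at a time rather than waiting for a full batch, which is the basic distinction between this style of pipeline and classic batch ETL. A production pipeline would add durable buffering, error handling, and monitoring, none of which this sketch attempts.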

Given the breadth of use cases, it's no wonder that cloud providers Amazon, Microsoft Azure, and Google Cloud are each offering their own data flow services for managing data pipelines, and that data platform providers like SAP and Hortonworks are also getting in on the act.

StreamSets follows in the tradition of third-party data integration providers like Informatica or Talend, which promote themselves as data Switzerlands that are independent of database and cloud platforms. The resemblance is more than coincidental, as the CEO was chief product officer for Informatica in a previous life.

StreamSets offers a cloud-based service (initially on AWS, but now expanding to Azure) that is built around an open source development environment and transformation engine, with a subscription enterprise offering that provides the requisite support. StreamSets Data Collector provides a web-based user interface to configure pipelines, preview data, monitor pipelines, and review snapshots of data. It shouldn't be surprising that, at first glance, Data Collector looks like a visual ETL tool. The difference is that you are configuring real-time rather than batch operations, and they run in a serverless cloud environment rather than on a traditional staging server. Data Collector is supplemented by a monitoring piece, StreamSets Dataflow Performance Manager, which provides a control pane for monitoring and resolving data flow bottlenecks.

StreamSets has recently introduced several extensions to its product, including Data Collector Edge, which provides an agent shrunk down to below 5 MB that runs natively on Linux, Windows, or Mac machines, along with Android or iOS devices. It's a logical extension of the collector pipeline product for accommodating IoT use cases, and follows in the footsteps of similar offerings from most of the other data pipeline providers. For now, the Edge offering supports routing and filtering, but StreamSets plans to add support for deep learning frameworks such as TensorFlow.

This week, StreamSets is adding a higher-level management tool for choreographing multiple data pipelines. StreamSets Control Hub, which has been added to the enterprise edition subscription product, adds a cloud-based data pipeline repository that enables the entire team to share, develop, and refine data pipelines. It adds automatic deployment of pipelines, and enables elastic scaling of pipelines via Kubernetes. As an enterprise, team-focused offering, Control Hub integrates with Cloudera Navigator and Apache Atlas for data governance.

Over the years, conventional wisdom about where and how to transform data has swung back and forth like a pendulum. When data warehouses emerged, the operative notion was to treat data transformation as middleware, and that's where the staging server came in. When data got too big and varied, the center of action shifted and was pushed down onto Hadoop clusters, where you could run the same batch processes on commodity infrastructure with cheaper compute. Although Hadoop, with YARN, could adapt by confining real-time processing to specific parts of the cluster, pipelines proved a more expedient approach for moving data transformation off the cluster, to the point of ingestion.

With data pipelines, history is repeating itself in another way. When databases offered their own data integration capabilities, a third-party ecosystem pioneered by Informatica emerged to provide a Switzerland approach that allowed IT organizations to become database-independent. Now that enterprises are looking to cloud deployment, providers like StreamSets are aiming to provide that same data Switzerland when it comes to managing data pipelines.
