StreamSets, purveyor of a DataOps suite for open source analytics and commercial databases alike, is today announcing support for Microsoft's recently announced SQL Server 2019 Big Data Clusters. StreamSets' solution adds operational rigor and, perhaps even more important, ecosystem acceptance for Microsoft's new analytics solution.
To assess that statement, a little background is in order. In November, at its Ignite conference in Orlando, FL, Microsoft announced the general availability of SQL Server 2019. In addition to being the newest version of Microsoft's veteran relational database platform, this release includes a new deployment option: the Big Data Cluster (BDC). The BDC solution unites SQL Server, the Hadoop Distributed File System (HDFS), Apache Spark and Kubernetes (K8s), the white-hot open source container orchestration technology. If you're interested in more details, check out my coverage from the day of the announcement.
Also read: SQL Server 2019 reaches general availability
Dose of Reality
As cool as all this tech is, there are a number of challenges in its way. For example, BDC's flavor of Spark is not quite a "vanilla" implementation, and BDC's usage of K8s introduces some special circumstances of its own. This is all by design, since SQL Server is primarily an on-premises database platform with the stringent, sometimes isolationist security characteristics such a platform must honor. The consequence of this is that third-party products built generically for Spark, and even for SQL Server, may not "light up" all the functionality BDC offers. And BDC's newness creates risks of its own; SQL Server's been around since the 90s (or even the late 80s, depending on how you count) and many of its long-time practitioners tend to be skeptical of adopting newer technologies Microsoft integrates into the platform.
That's why StreamSets' announcement is so significant. Jobi George, StreamSets' General Manager, Cloud Business, explained to me that the company had been working with Microsoft for over a year to make this integration happen. As a result, StreamSets' solution is the first third-party one that can be directly installed into the K8s "pods" in a BDC. This enables StreamSets' Data Collector product to ingest data directly into a BDC's HDFS implementation; its Transformer product to provide a visual/no-code ETL (extract, transform and load) and visual debugging solution directly over BDC's Apache Spark implementation; and StreamSets Control Hub to run in the cloud or on the BDC itself.
A little help from Redmond's friends
As it happens, I have been teaching SQL Server professionals the ins and outs of SQL Server 2019 BDC in full-day SQL Server developer workshops over the last year, and can report that I've seen a mixture of excitement and bewilderment from workshop attendees in reaction to the technology. Third-party ecosystem backing and support of SQL 2019 BDC is critical to its gaining traction in the market and the trust of customers, including those new to Microsoft's platforms, as well as those who have standardized on them for years, or even decades.
StreamSets' integration provides needed endorsement. It also enables data scientists, data engineers and analysts familiar with open source analytics and machine learning stacks to be part of the SQL Server ecosystem too. Lots of enterprise data, for reasons of regulatory compliance or corporate policy, cannot just be flung into any old data lake. SQL 2019 BDC provides a solution for new constituencies to get at this enterprise data, safely, using their own tool chain. StreamSets' support should at once make that easier and make BDC a more vetted, defensible technology investment.