Streaming data is nice and all, especially with the growth of Internet of Things (IoT) technology, and all the data thrown off by the sensors in IoT devices. And, yes, a growing number of streaming data platforms can "land" their data into cloud storage repositories like Amazon Simple Storage Service (S3).
But it's not like you can just wipe your hands clean at that point and get on with life, at least not if you want to do some serious analysis of the IoT (or other streaming) data that's sitting in your cloud storage account. If you're using a data warehouse platform, you'll still need to load the data into it from S3.
Snowflake Computing's Snowpipe, being announced this morning at Amazon Web Services' re:Invent conference in Las Vegas, does just that. The "zero management" service, as Snowflake describes it, watches for newly arrived data in S3 and immediately loads it into the Snowflake cloud data warehouse platform.
At first blush, Snowpipe may seem kind of tactical, straightforward and no big deal. After all, there are lots of ways to load data from an S3 bucket into a database. But a deeper analysis is in order; the raw capability here isn't the breakthrough.
Think about what data loaders need to do: an agent needs to monitor an S3 bucket at some interval, and then it must kick off some code or logic to load the data, optionally transforming it first. You could write your own script to do this, but you'd have to maintain the code, configure a scheduler and provision a virtual machine for all this to run on. You'd need to monitor its operation, make certain you're alerted if there is a failure, and respond quickly in that event.
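To make that do-it-yourself burden concrete, here is a minimal sketch of such a polling loader in Python. It is not how Snowpipe or any particular product works. The `list_keys` and `load_key` callables are hypothetical stand-ins for "list the objects in the bucket" and "load one object into the warehouse," and the in-memory `seen` set is an assumption; a real loader would need durable state, alerting and retries.

```python
import time
from typing import Callable, Iterable, Set


def poll_once(list_keys: Callable[[], Iterable[str]],
              load_key: Callable[[str], None],
              seen: Set[str]) -> int:
    """Load every key not seen before and return how many were loaded."""
    loaded = 0
    for key in sorted(list_keys()):
        if key in seen:
            continue
        # If load_key raises, the key is not marked as seen,
        # so it will be retried on the next poll.
        load_key(key)
        seen.add(key)
        loaded += 1
    return loaded


def run_loader(list_keys: Callable[[], Iterable[str]],
               load_key: Callable[[str], None],
               interval_seconds: float,
               max_polls: int) -> None:
    """Poll on a fixed interval; max_polls keeps the sketch finite."""
    seen: Set[str] = set()
    for _ in range(max_polls):
        poll_once(list_keys, load_key, seen)
        time.sleep(interval_seconds)
```

Even this toy version leaves the scheduler, the machine it runs on and the failure monitoring as your problem.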
You'd also need to pick the interval at which your process would run. If it runs too frequently, you're wasting cycles, and probably dollars, too. If it runs too seldom, then you're introducing latency on the analysis side, preventing the newly arrived data from being queryable until your loader runs.
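The trade-off is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic only. It assumes files arrive at uniformly random times between polls and ignores how long the load itself takes. The function name is made up for illustration.

```python
def polling_tradeoff(interval_seconds: float) -> dict:
    """Relate a polling interval to work done and data freshness."""
    seconds_per_day = 24 * 60 * 60
    return {
        # More polls means more compute spent checking for new files.
        "polls_per_day": seconds_per_day / interval_seconds,
        # A file landing just after a poll waits a full interval.
        "worst_case_wait_seconds": interval_seconds,
        # Assumes arrivals are uniformly spread between polls.
        "average_wait_seconds": interval_seconds / 2,
    }


for interval in (60, 900, 3600):
    print(interval, polling_tradeoff(interval))
```

Polling every minute means 1,440 runs a day, most of which may find nothing; polling hourly cuts that to 24 runs but leaves data waiting up to an hour before it can be queried.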
And, yes, you could use AWS Lambda to run your code in an event-driven and serverless fashion. You could also use AWS services like Data Pipeline or Glue to do this, but the former involves some non-trivial workflow configuration and the latter involves the generation of Python code that then runs on Apache Spark.
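For a sense of what the event-driven route looks like, here is a rough sketch of a Lambda-style handler in Python. The parsing follows the shape of S3 event notifications (a `Records` list with the bucket name and a URL-encoded object key). The `load_into_warehouse` function is a hypothetical placeholder for whatever load logic you would write yourself, not a real API.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Pull (bucket, key) pairs out of an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded, with spaces as '+'.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def load_into_warehouse(bucket: str, key: str) -> None:
    """Hypothetical placeholder: a real version would run the load."""
    print(f"would load s3://{bucket}/{key}")


def handler(event, context):
    """Entry point invoked once per notification; returns files handled."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        load_into_warehouse(bucket, key)
    return len(objects)
```

This removes the scheduler and the polling interval, but the load logic, error handling and deployment are still yours to build and maintain.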
These are all great solutions for the general case, but if you're a Snowflake customer, wouldn't you rather just have a feature in the product that takes care of all this for you, where all you have to do is point it at an S3 bucket and a destination table in your warehouse? That's what Snowpipe offers, in the form of a serverless computing service that's billed based on the amount of data ingested.
The real breakthrough here is the simplicity, the convenience, the single vendor and the low number of moving parts.
Swing your partner
Snowpipe also offers REST APIs, and Java and Python SDKs, that customers and partner companies can tap into, such that other products could serve as additional Snowpipe data sources, or could "listen" in on the loader pipeline and kick off their own logic to read, catalog or process the data as it's loaded into Snowflake. Snowpipe offers a platform where other products can process data on-ingest and, indeed, on-arrival. This essentially enables real-time streaming scenarios for products heretofore limited to operating on data at rest.
For Snowflake customers operating S3-based data lakes, this is a great enhancement to the platform. No, the capability of data loading isn't new. But taking an automated ingest engine that works in real time, and making it accessible to an entire population of data warehouse customers? That's a good get. And Snowflake got it.