Cloud data warehouse race heats up

This week, startup Snowflake Computing GA'd its cloud data warehouse and Microsoft took its Azure SQL Data Warehouse service to public preview. Amazon Redshift isn't the only player anymore.

Running a data warehouse in the cloud was a fairly novel idea when Amazon Web Services launched its Redshift service in November of 2012. Most on-premises data warehouse (DW) platforms are appliance-based, which makes them difficult to expand. Buyers therefore have to purchase headroom for growth up front, which also makes the platforms expensive to acquire. In the cloud, though, the economics are better, elasticity is realistic and logistics are streamlined. Combine that with the ability to handle "big data" volumes with the familiar SQL/relational model that Redshift uses and it's hardly surprising that the service has been one of Amazon's fastest growing since its launch.

Amazon has essentially had the cloud DW space to itself this whole time, but that changed in a big way this week. One new competitor, Snowflake Computing, went into general availability on Tuesday with its cloud DW service, as it closed on a $45M Series C funding round. And Microsoft, Amazon's growing nemesis in the cloud, launched a limited but public preview of its Azure SQL Data Warehouse service on Wednesday.

Things in common
The two new kids on the block have a couple of things in common. In a sense, they both have a Microsoft SQL Server pedigree, as Azure SQL DW is based on SQL Server technology and Snowflake's CEO, Bob Muglia, used to run Microsoft's Server and Tools Business (now its Cloud and Enterprise Division), under which the SQL Server organization falls.

The two challengers have an architectural approach in common as well, and it's one which Redshift doesn't share: they each scale computing resources independently of storage. In other words, while Redshift lets you add nodes to your DW cluster, each of which adds more processing power and storage capacity, Azure SQL DW and Snowflake let you add processing capacity, or storage space, separately.
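To make the contrast concrete, here is a minimal Python sketch of the two scaling models. It is a toy model only: the class names, the per-node compute units and the per-node storage size are all hypothetical, and none of this reflects the actual APIs or node specifications of Redshift, Azure SQL DW or Snowflake.

```python
from dataclasses import dataclass


@dataclass
class CoupledCluster:
    """Toy model: each node brings a fixed slice of compute AND storage."""
    nodes: int
    compute_per_node: int = 1      # arbitrary compute units (assumed value)
    storage_per_node_tb: int = 2   # assumed value, not a real node spec

    @property
    def compute(self) -> int:
        return self.nodes * self.compute_per_node

    @property
    def storage_tb(self) -> int:
        return self.nodes * self.storage_per_node_tb

    def add_nodes(self, count: int) -> None:
        # Adding nodes always grows both dimensions together.
        self.nodes += count


@dataclass
class DecoupledWarehouse:
    """Toy model: compute and storage are sized independently."""
    compute: int
    storage_tb: int

    def scale_compute(self, units: int) -> None:
        self.compute = units       # storage is untouched

    def scale_storage(self, terabytes: int) -> None:
        self.storage_tb = terabytes  # compute is untouched


coupled = CoupledCluster(nodes=4)
coupled.add_nodes(2)               # more compute, but more storage too

decoupled = DecoupledWarehouse(compute=4, storage_tb=8)
decoupled.scale_compute(8)         # double compute, storage unchanged
```

In the coupled model, a customer who only needs more processing power still ends up with (and presumably pays for) extra storage. In the decoupled model the two dials move separately.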

In the case of Azure SQL DW, that storage is provisioned from Azure Blob Storage (similar in concept to Amazon's S3). Storing the DW's data in Azure Blobs means the compute nodes can be shut down, and later resumed, without any damage to the data. It also means that processing capacity can be grown or shrunk very quickly, because storage does not have to be rebalanced as compute nodes are added or dropped.
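The rebalancing point can be illustrated with a small Python sketch. It assumes, purely for illustration, that a coupled cluster places each row on a node by taking the row key modulo the node count. That placement scheme and the function name are assumptions, not a description of how Redshift or any other product actually distributes data.

```python
def rows_moved_on_resize(keys, old_nodes, new_nodes):
    """Count rows whose owning node changes when the node count changes.

    Assumes a simple `key % node_count` placement (a hypothetical scheme
    chosen for illustration only).
    """
    return sum(1 for key in keys if key % old_nodes != key % new_nodes)


keys = range(1000)

# Node-local storage: growing from 4 to 5 nodes relocates most rows.
moved_with_local_storage = rows_moved_on_resize(keys, 4, 5)

# Shared storage: rows live outside the compute nodes in this model,
# so adding a compute node relocates nothing.
moved_with_shared_storage = 0

print(moved_with_local_storage, moved_with_shared_storage)  # prints "800 0"
```

Under this simplified scheme, adding a single node forces 800 of the 1,000 rows to change nodes. Real systems use smarter placement to move less data, but the underlying cost is what a shared storage layer sidesteps.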

While Redshift's coupling of storage and compute in the same node may look arbitrary and unwise, it's important to realize that this approach can offer better performance. Cloud storage is economical, but it's relatively slow compared to the local drives that can be installed in the physical servers on which the nodes are based. Most Redshift customers I have spoken with have told me that they get phenomenal performance from the service. We need to wait and see if Microsoft and Snowflake customers say likewise.

Unstructured (data) play
Snowflake, while it seemingly does not offer the option to pause and resume your DW, separately scales compute and storage nonetheless. And, according to Snowflake's Web site, its service can "load and store semi-structured data in native form" (including data in Avro and JSON formats) alongside conventional, relational data, and can process the two in tandem.
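As a rough idea of what processing relational and semi-structured data "in tandem" means, here is a minimal Python sketch that joins a relational-style table with raw JSON documents. The table, the JSON fields and the function name are all made up for illustration; this is not Snowflake's SQL syntax or API, which the service exposes in its own way.

```python
import json

# Relational-style rows: (customer_id, name). Hypothetical sample data.
customers = [(1, "Acme"), (2, "Globex")]

# Semi-structured events kept in their native JSON form (hypothetical).
events_raw = [
    '{"customer_id": 1, "type": "click", "meta": {"page": "/home"}}',
    '{"customer_id": 2, "type": "purchase", "meta": {"amount": 30}}',
    '{"customer_id": 1, "type": "purchase", "meta": {"amount": 12}}',
]


def purchase_totals(customers, events_raw):
    """Join JSON events to customer rows and total purchases per customer."""
    names = dict(customers)                  # customer_id -> name
    totals = {name: 0 for name in names.values()}
    for raw in events_raw:
        event = json.loads(raw)              # parse only at query time
        if event.get("type") != "purchase":
            continue
        name = names.get(event.get("customer_id"))
        if name is None:
            continue                         # event for an unknown customer
        # Nested fields may be absent; treat a missing amount as zero.
        totals[name] += event.get("meta", {}).get("amount", 0)
    return totals


print(purchase_totals(customers, events_raw))
```

The point of the sketch is that the JSON is stored as-is and interpreted at query time, without first being flattened into a fixed relational schema.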

Azure SQL DW is based on the same technology found in Microsoft's on-premises Analytics Platform System (APS), formerly known as SQL Server Parallel Data Warehouse. That product is a massively parallel processing (MPP) version of SQL Server. And given that SQL Server 2016, the next version of Microsoft's relational database management system, will have native support for JSON data, it's likely that Azure SQL DW will get such support as well.

Home field advantages
One of Redshift's unique advantages is that it offers integration with Amazon's S3 storage service, its DynamoDB NoSQL service and its Elastic MapReduce (EMR) Hadoop service. But Snowflake's service runs on Amazon Web Services' cloud, which means that movement of S3, DynamoDB and EMR data into Snowflake should offer relatively low latency, even if the integration is not as seamless as with Redshift. Since a lot of companies store data in the AWS cloud, this is an advantage for Snowflake.

That said, the Azure cloud is gaining momentum very quickly, especially amongst enterprise accounts. In addition to Blob Storage, Azure offers its own NoSQL services (Azure Table Storage for key-value storage; DocumentDB for JSON document storage) and a Hadoop service (HDInsight). Customers using those services may find Azure SQL DW more attractive than its competitors.

Quo vadis?
Where will the cloud data warehouse market go, now that it's a real category, with multiple competitors? Up, I suspect, as petabyte-scale, SQL/relational-based services offer a balanced combination of familiar technology and hyperscale. Lack of elasticity is likely the biggest strike against on-prem DW platforms, and bringing them to the cloud all but removes it.

This is now a space to watch.