Microsoft is making a slew of Azure data announcements today, on both the data lake and data warehouse fronts.
First, Microsoft's Azure Data Explorer (ADX) product is being released into general availability (GA). ADX, which I wrote about just last week, is a Big Data storage, query and visualization platform, with a special knack for time series analysis.
Also read: Fastly, Microsoft partner on real-time analytics with Azure Data Explorer
Next, Azure Data Lake Storage (ADLS) Gen2 hits GA today as well. Unlike the initial version of ADLS, the Gen2 release operates as a superset of Azure Blob Storage, but layers a true hierarchical file system on top of it, along with the ability to handle arbitrarily large files. Hierarchical file systems have first-class support for folder structures. That's important in Big Data applications, where data is often partitioned as groups of sequence files segregated by folder, necessitating folder-level operations that can treat all the files in a folder as a single unit of data.
Standard cloud object storage systems keep all files in a root-level container, and create the "illusion" of folders by embedding directory names into files' metadata. The availability of ADLS Gen2 will essentially give Microsoft a two-tiered storage solution to counter Amazon's S3. While Amazon's one-size-fits-all story has the advantage of simplicity, ADLS gives Microsoft a great Big Data solution, and one that is based on its object store technology, rather than being a completely separate product.
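To make the distinction concrete, here is a minimal sketch (with invented key names) of how a flat object store simulates folders with key prefixes, and why folder-level operations are expensive there:

```python
# A flat object store has no real directories: "folders" are just
# shared key prefixes. So renaming a "folder" means rewriting every
# key under it -- one operation per object.

flat_store = {
    "sales/2019/part-0000.seq": b"...",
    "sales/2019/part-0001.seq": b"...",
    "sales/2018/part-0000.seq": b"...",
}

def rename_prefix(store, old, new):
    """Simulate a folder rename in a flat namespace: O(n) key rewrites."""
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)

rename_prefix(flat_store, "sales/2019/", "sales/archive-2019/")
# Every matching object had to be moved to a new key. A true
# hierarchical file system can instead rename the directory itself
# as a single metadata operation, leaving the files untouched.
```

That single-metadata-operation rename is exactly the kind of folder-level capability that matters when a Big Data job needs to atomically commit or retire an entire partition of sequence files at once.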
In this first release, ADLS Gen2 file systems will lack backward compatibility with Blob Storage APIs, but that will be added later. Plus, there's plenty of direct support for Gen2, out of the gate. For example, Apache Hadoop 3.2, released last month, offers direct support for ADLS Gen2. Various Big Data ecosystem vendors, including Cloudera, Dremio and Arcadia Data, have also committed to ADLS. And in the Microsoft world, numerous cloud data services, including Azure Databricks, HDInsight, Power BI and Azure Data Factory, support ADLS Gen2 directly, too.
Speaking of Azure Data Factory (ADF), that service will now offer a visual data flow facility, in public preview. While ADF has for some time provided a visual designer for the orchestrations it manages, actual data engineering work had to be done in external scripts that ADF could run. Visual data flows will allow the data engineering work itself to be done in a visual designer, which will generate code behind the scenes.
Also read: Azure Data Factory v2: Hands-on overview
ADF visual data flows should not be confused with the dataflow feature in Power BI, the public preview for which was announced three months ago. Power BI dataflows are a cloud implementation of the company's Power Query technology, which also allows for visual data engineering work (under the moniker of "data prep") to be carried out and which also generates code (in a Microsoft-proprietary language called M) to do it. The name collision is unfortunate, but hopefully Microsoft will rectify it.
For what it's worth, Power BI dataflows utilize ADLS Gen2 storage, behind the scenes.
The last installment in Microsoft's set of cloud data announcements today involves Azure SQL Data Warehouse (SQL DW) and, to a lesser extent, Power BI. In two rounds of benchmark tests carried out by GigaOm Research (see disclosure at end of this post), SQL DW was found to be 67 per cent faster than Amazon Redshift and up to 14x faster than Google BigQuery. Microsoft will begin a major push around this news, touting its overall price/performance advantage over rival public cloud data warehouses and summing it up as outperforming the competition by up to 14x while being up to 94 per cent cheaper.
Also read: Azure SQL Data Warehouse "Gen 2": Microsoft's shot across Amazon's bow
Microsoft will also be pitching the combination of the price/performance-efficient SQL DW service with Power BI and two features recently added to the latter: composite models and aggregations. Together, these two features allow Power BI users to store aggregated data locally in a Power BI model while leaving the more voluminous detail data in an external store. For a given data model, Power BI users used to have to choose between the local "import" and external "DirectQuery" modes, but now they can mix and match. Paired with an external store like SQL DW, these features make Power BI genuinely Big Data-capable.
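The aggregation idea behind that mix-and-match can be sketched roughly as follows. This is a hypothetical illustration, not Power BI's actual implementation; the table, column, and function names are invented. A query is answered from a small local aggregate when the requested grain is covered, and falls through to the remote detail store otherwise:

```python
# Pre-aggregated sales by (year, region), held locally ("import" mode).
local_agg = {
    (2018, "EMEA"): 1_200_000,
    (2018, "Americas"): 2_500_000,
}

def detail_query(year, region, product):
    """Stand-in for a DirectQuery round trip to the external warehouse."""
    detail_rows = [  # toy detail data living in the remote store
        (2018, "EMEA", "Widget", 700_000),
        (2018, "EMEA", "Gadget", 500_000),
    ]
    return sum(v for y, r, p, v in detail_rows
               if y == year and r == region and (product is None or p == product))

def total_sales(year, region, product=None):
    # Coarse grain covered by the local aggregate? Answer instantly.
    if product is None and (year, region) in local_agg:
        return local_agg[(year, region)]
    # Finer grain (per-product): fall through to the detail store.
    return detail_query(year, region, product)

total_sales(2018, "EMEA")            # served from the local aggregate
total_sales(2018, "EMEA", "Widget")  # finer grain: goes to detail data
```

The payoff is that routine dashboard queries never touch the big external store, while drill-downs still can, which is what makes the composite-model arrangement practical at Big Data scale.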
Microsoft has now launched "Gen2" iterations of Data Lake Storage and Data Warehouse and a "v2" iteration of Data Factory. And with Power BI being updated each and every month, that product is arguably at about Gen42 now.
Also read: Cortana Analytics: Microsoft's cloud analytics prix fixe
Microsoft is all-in on the cloud, the cloud is all-in on data, and the cloud is now mature. The result? In an effort to win the Enterprise, the major public cloud providers are revving their data services to achieve, then exceed, parity with the best on-premises offerings. That goes not just for basic database services, but data warehousing, BI, data engineering and Big Data analytics. Today marks the start of Microsoft's next big chapter in that epic tale.
Disclosure: I do data- and analytics-focused analyst work for GigaOm, but I was not involved in the SQL DW benchmark work.