On September 28, Microsoft made it official: The technological underpinnings of the coming Azure Data Lake service are based on the very ones that the company uses internally as part of its "Cosmos" big-data storage and analytics service.
We've known since April that Azure Data Lake -- Microsoft's self-described "hyperscale repository for big data analytic workloads in the cloud" -- would be compatible with the Hortonworks Hadoop Distributed File System (HDFS). At that time, Microsoft advised those interested to sign up for notification for the upcoming preview of the Azure Data Lake store.
Microsoft officials said today that the analytics engine and store will be available in public preview later this year.
Azure Data Lake will work with HDInsight, Microsoft's Hadoop-on-Azure service for Windows and Linux. (The Linux version of HDInsight, which works on Ubuntu, is generally available as of today; the Windows version has been available since 2013.)
Microsoft's overarching goal for Azure Data Lake is to allow customers "to extract the maximum insight from all data, anywhere," said T.K. "Ranga" Rengarajan, Microsoft's Data Platform Corporate Vice President.
Cosmos was built using Microsoft's Dryad distributed-processing technology. Microsoft uses Cosmos internally to process telemetry data; to perform analysis and reporting on large datasets, such as those created via Bing and Office 365; and to curate and perform back-end processing on many kinds of data. A lot of the data used for these various purposes is shared. Queries on this data can run on anywhere from one to 40,000 machines in parallel.
"This (Azure Data Lake) is more ambitious than Cosmos," said Rengarajan. "It's also inspired by Apache Spark, data warehousing, and more. We've been thinking about this problem for years."
While Microsoft's internal usage of Cosmos has taught the company a lot about parallel computation, "Cosmos ws built in a different way and in a different age," Rengarajan said. These days, users are looking for solutions as to how to debug something running on thousands of machines in a few hours or how to execute a query across thousands of machines, but still have it look very familiar, he said.