Although I'm down in Orlando, Florida for the SQL Server Live! and Visual Studio Live! conferences, Microsoft is putting on its annual Connect(); developer event up in Manhattan, where I normally spend most of my time. And though I'm missing the live event itself, Microsoft was kind enough to brief me on a slew of data-related announcements the company is making at Connect(); today. I cover them in detail here.
Brick by brick
The lead item is a real biggie: Microsoft is getting the Apache Spark religion, introducing a new cloud service in preview, called Azure Databricks. This is noteworthy for a number of reasons. First, the service was developed jointly by Microsoft and Databricks (the company whose founders are Spark's very creators), to deliver this Spark-based Big Data analytics service as a first-party Azure offering, and not a mere partner service on the Azure Marketplace. Second, the service works independently of Databricks' own cloud service for Spark and of Azure HDInsight, Microsoft's own Big Data as a Service platform, on which Spark also runs.
Azure Databricks has nonetheless been designed from the ground up to take advantage of, and be fully optimized for, various Azure services, including blob storage, Data Lake Store, virtual networking, Azure Active Directory and Azure Container Service. While Azure Databricks, like HDInsight, is still based on the creation of a dedicated cluster, with the number and type of nodes (servers) being determined by the customer, it nonetheless has built-in auto-scaling and auto-termination, to grow the cluster as necessary and shut it down once it's no longer needed.
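To make the auto-scaling and auto-termination idea concrete, here is a minimal Python sketch of what a cluster definition with those two settings might look like. The field names (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`) follow the Databricks Clusters REST API as I understand it; the helper function, the cluster name and the placeholder values are assumptions for illustration only, not something Microsoft showed in its briefing.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster definition with autoscaling bounds and an idle timeout.

    Hypothetical helper: it only assembles a dictionary and does not call
    any Azure or Databricks service.
    """
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        # Placeholders: real values depend on the workspace and region.
        "spark_version": "<spark-version>",
        "node_type_id": "<node-type>",
        # The service may grow or shrink the cluster within these bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("demo-cluster", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

The customer still picks the node type and the bounds, as described above; the scaling and shutdown decisions inside those bounds are left to the service.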
Like most Spark environments, Azure Databricks features a browser-based notebook facility as its primary user interface. But the Azure Databricks implementation allows notebooks to be edited by multiple users simultaneously, to accommodate collaborative data science and data engineering. Microsoft says Azure Databricks notebooks also provide an integrated debugging experience and include a number of sample notebooks to aid users in connecting to common data sources and performing machine learning tasks in Python or R. Azure Databricks is also integrated with Power BI, Azure SQL Database and Azure SQL Data Warehouse, as well as Cosmos DB - the Spark connector for which is being released to general availability (GA).
And speaking of Cosmos DB (Azure's globally distributed NoSQL database service that began commercial life as DocumentDB), there's news there too, also taking a cue from an Apache Software Foundation open source project. Microsoft is announcing Apache Cassandra as a Service, powered by Azure Cosmos DB. Now, in addition to Cosmos DB's supported SQL, Gremlin, and MongoDB APIs, Apache Cassandra developers will be able to take their applications, whether written for Apache Cassandra or DataStax Enterprise, and run the code on Cosmos DB. Microsoft does this in fulfillment of its promise to make Cosmos DB a true multi-model database, supporting Cassandra's wide column store NoSQL approach in addition to MongoDB's document store paradigm, Gremlin's graph database constructs and the Azure Table Storage key-value store approach.
The Cassandra API for Cosmos DB is in public preview. Meanwhile, the Azure Table API is being released to GA, and Microsoft is announcing that the Gremlin API will reach GA next month. And while the MongoDB API has been GA for quite some time, support for unique indexes and the aggregation framework pipeline is being added to it, in preview form.
Users of any of the APIs get access to all five of Cosmos DB's database consistency models, including strong and eventual consistency, as well as three consistency levels in between those two extremes. And Microsoft is announcing the availability of the strong consistency level for multi-region databases, spanning beyond the single-region scope for which that consistency model worked previously. On the other side, Microsoft is upping its service level agreement (SLA) to "five nines" (99.999%) availability for multi-region reads. The SLA had been "four nines" (99.99%) up until now; SLAs for throughput, consistency and latency remain unchanged.
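For a sense of what the jump from four nines to five nines means in practice, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: it spreads the allowed unavailability over a 365-day year, which is my own simplifying assumption, whereas the measurement period that actually applies is defined in Microsoft's SLA terms.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent):
    """Maximum unavailability per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


four_nines = max_downtime_minutes(99.99)
five_nines = max_downtime_minutes(99.999)

print(f"99.99%  allows about {four_nines:.1f} minutes per year")
print(f"99.999% allows about {five_nines:.1f} minutes per year")
```

Under that assumption, the old SLA tolerated roughly 52.6 minutes of read unavailability per year, and the new one tolerates roughly 5.3 minutes, a tenfold reduction.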
Enterprise Devs get AI
First off, new cross-platform tooling, in the form of SQL Operations Studio, is being released in preview. To further the cross-platform database ethos, Microsoft has joined the MariaDB Foundation and is announcing that a new Azure Database for MariaDB service will be forthcoming, to join the existing Azure Database for MySQL and Azure Database for PostgreSQL preview services. MariaDB, by the way, is a fork of MySQL, created in the wake of Oracle's 2010 acquisition of Sun Microsystems, which gave it ownership of MySQL AB and stewardship of the MySQL database.
Microsoft is also adding an implementation of the SQL Server on-premises feature it now calls Machine Learning (ML) Services to the cloud-based version of the product, Azure SQL Database. In SQL Server 2016, the feature was called R Services, and this first release for Azure SQL DB will, in fact, support only the integration of the R language into T-SQL scripts and stored procedures. Python language integration, which was added to SQL Server 2017, will come at a later time.
Regardless of the language used, each of these implementations facilitates the creation and training of machine learning models - as well as scoring data against them to make predictions - in the database, without requiring any data to be queried and streamed out into another environment. ML Services in SQL Server 2017 also added a capability called "native scoring," which allows data to be scored against models directly from T-SQL (using the new PREDICT function), requiring no code written in R or Python. That's a nice feature, and it's included in Azure SQL Database Machine Learning Services, too.
This notion of bringing machine learning services to application developers is further amplified with the introduction of Visual Studio Tools for AI (artificial intelligence), with tie-ins to running models in Microsoft's IoT (Internet of things) Edge. Please see a separate post, by ZDNet's Mary Jo Foley, for detailed coverage of both of these items.
Considering that Microsoft has made AI and underlying data analytics technologies one of its biggest company-wide bets, none of these announcements is surprising. Regardless, the combination of relational, non-relational, BI, Big Data and Machine Learning/AI capabilities - and tooling - provided by today's Microsoft data and developer platforms is unprecedented in scope and speed of delivery. There's a lot for developers, data engineers and data scientists to keep track of here, but the reward to Microsoft in the relevancy of its platform will likely warrant all the disruption.
This post was updated on November 15th, 2017 at 1:06pm ET to correct the original statement that the Cassandra API for Cosmos DB was being released to private preview. It has, in fact, been released to public preview.