VIDEO: Microsoft shifts gear in the data world
Microsoft has a tricky job in the data world. On the one hand, it has a 25-year legacy in the on-premises relational database business with SQL Server and needs to keep that lucrative business relevant and stable. On the other hand, as the company pivots toward the cloud, it needs to proffer relational OLTP, data warehouse, NoSQL, Big Data and machine learning technologies. And it needs to make them credible and competitive against offerings from so many startups in the data and analytics world.
And then there was Strata...
Do the things Microsoft announced at Strata meet the difficult challenge before it? Let's go through them and see if we can make some determination.
Hortonworks goes cloud-first with HDInsight
To begin with, let's take a look at HDInsight, Redmond's cloud Hadoop/Spark core Big Data offering. First off, Microsoft announced that new releases of the Hortonworks Data Platform (HDP) distribution of Hadoop, on which HDInsight is based, would now surface on HDInsight before Hortonworks releases them to the on-premises Hadoop market. This starts now, with the incorporation of HDP 2.6 into HDInsight, and it's a sea change from the days when HDInsight used versions of HDP that were one or two releases behind.
By incorporating HDP 2.6, HDInsight will now also include Spark 2.1. And because HDInsight's 99.9% uptime service level agreement (SLA) applies, Microsoft says it's offering the only Spark 2.1 service with that level of uptime guarantee. For good measure, Apache Kafka is on board in this HDInsight release as well, along with Spark Structured Streaming/Kafka integration. In fact, Microsoft is providing integration between Spark and its own Azure Event Hubs streaming platform too.
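To put the 99.9% figure in perspective, here is a back-of-the-envelope sketch of how much downtime such a guarantee leaves room for. The function name and the 30-day and 365-day windows are assumptions chosen for illustration; the measurement period and terms Microsoft actually applies to the HDInsight SLA are not specified here.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an uptime SLA over a given period.

    Hypothetical helper for illustration; the real SLA's measurement
    window and exclusions are not covered here.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Assumed windows: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.9, 30)
yearly = allowed_downtime_minutes(99.9, 365)

print(f"Per 30-day month: {monthly:.1f} minutes")
print(f"Per 365-day year: {yearly / 60:.2f} hours")
```

Under those assumptions, "three nines" works out to roughly 43 minutes of permitted downtime per month, or a little under nine hours per year.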
Security and notebooks
Building HDInsight on HDP 2.6 means that the Apache Ranger role-based access controls now extend to both Hive LLAP ("Live Long and Process") and Spark. Apache Zeppelin notebooks, popular with the data science crowd, come along for that ride as well. LLAP, which Microsoft sometimes refers to as "interactive Hive," is competitive with Spark SQL in terms of performance, and Microsoft contributed to the effort of getting it there. So it's no wonder that Microsoft wants folks who use Zeppelin to have access to both SQL-on-Hadoop platforms.
Microsoft's also a champion of R, ever since its acquisition of Revolution Analytics. So to round out the story, Jupyter, the other major data science notebook platform, will now have access to the version of R Server for HDInsight on those clusters which are configured to include it.
There are lots of developers beyond the Zeppelin/Jupyter notebook world though, and Microsoft needs to reach them too. The company doesn't disappoint here either, as it now offers HDInsight tooling for its own Visual Studio integrated development environment (IDE) as well as for open source IDEs Eclipse and IntelliJ.
What about third-party products? Well, Dataiku and H2O.ai can now themselves be provisioned in tandem with the provisioning of HDInsight clusters, joining Cask, StreamSets and, my employer, Datameer in the club of vendors that support such integration with HDInsight. Beyond that, Microsoft Power BI, as well as competitors Tableau, Qlik and SAP Lumira, are now supported query clients against Spark on HDInsight.
The story so far
There's one more lobe of announcements to discuss, but first let's take an inventory of all the bases Microsoft is covering with the announcements we've highlighted to this point. Microsoft continues to modernize its Hadoop platform, now to an extent that goes beyond what's available on-premises; Enterprise-grade SLAs and security are part of that; so is data science notebook support, for Spark, for Hive LLAP and for R; developer tools continue to enjoy tighter integration; and so do third-party data science and BI tools.
This means Big Data purists, data scientists, Enterprise IT and corporate developer constituencies are all being accommodated. That's a lot of technology "subcultures" to corral, manage and, in a sense, bring together. And that's exactly what Microsoft needs to do in order to make progress on all fronts.
But what about the database world, both NoSQL and relational? Remember, they're on the roster as well. Well, to start with, Microsoft released Community Technology Preview (CTP) 1.4 of the upcoming version of SQL Server (known only as "vNext" for now). This is the version that's going to run on Linux as well as Windows, and the CTP is available for both platforms. And don't forget that SQL Server includes technologies like PolyBase and R Server that tie it back to the Big Data and data science worlds.
Again, these inter-technology, inter-generational tie-ins are important, as Microsoft needs to provide points of entry into its entire data stack, regardless of which part of that stack a given professional may specialize in.
In that spirit, Microsoft has created native integration between Spark (on HDInsight as well as other distributions) and its own NoSQL database, DocumentDB. The Spark Connector for DocumentDB makes that happen and does it in a non-trivial way: the connector supports "predicate push-down," meaning that when Spark queries DocumentDB, it will delegate as much of the work in that query's execution as possible to DocDB itself. That maximizes efficiency and minimizes data movement. I've seen a comparison of Spark-to-DocDB queries run with and without predicate push-down and, believe me, it aids in performance too.
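The idea behind predicate push-down can be illustrated with a toy model. The sketch below is not the Spark Connector for DocumentDB and uses none of its API; `ToyDocumentStore`, its `rows_shipped` counter and the sample data are all hypothetical, made up purely to show why filtering at the source moves less data than filtering after the fact.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


@dataclass
class ToyDocumentStore:
    """Hypothetical stand-in for a remote document database."""
    rows: List[Row]
    rows_shipped: int = 0  # rows sent across the pretend network

    def scan(self, predicate: Optional[Predicate] = None) -> List[Row]:
        # With a predicate, the store filters before shipping anything.
        if predicate is None:
            result = list(self.rows)
        else:
            result = [r for r in self.rows if predicate(r)]
        self.rows_shipped += len(result)
        return result


def query_without_pushdown(store: ToyDocumentStore, predicate: Predicate) -> List[Row]:
    # Pull every row to the client, then filter locally.
    return [r for r in store.scan() if predicate(r)]


def query_with_pushdown(store: ToyDocumentStore, predicate: Predicate) -> List[Row]:
    # Hand the filter to the store so only matching rows travel.
    return store.scan(predicate)


def make_store() -> ToyDocumentStore:
    # Made-up sample data: 1,000 rows, one in ten tagged "US".
    rows = [{"id": i, "country": "US" if i % 10 == 0 else "DE"} for i in range(1000)]
    return ToyDocumentStore(rows)


def is_us(row: Row) -> bool:
    return row["country"] == "US"


naive_store = make_store()
naive = query_without_pushdown(naive_store, is_us)

pushed_store = make_store()
pushed = query_with_pushdown(pushed_store, is_us)

print("without push-down:", len(naive), "results,", naive_store.rows_shipped, "rows shipped")
print("with push-down:   ", len(pushed), "results,", pushed_store.rows_shipped, "rows shipped")
```

Both queries return the same rows; the difference is in how many rows have to leave the store to get there. In a real connector the filter would be translated into the database's own query language rather than passed as a Python function.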
DocumentDB may be Microsoft's own NoSQL database, but Spark connectivity opens it up. So too does its compatibility with the MongoDB API, allowing it to function as a replacement for that product, compatible with applications written to use it. Given Mongo's popularity with devs, this can be seen as yet another developer outreach initiative, and as another manifestation of Microsoft catering to as many data and analytics ecosystems as it can.
No good deed goes unpunished?
If there's a worrisome part in all of this, it's what Microsoft can do for an encore. Is this level of enhancement across platforms, technologies and data access paradigms sustainable? Is it even wise to carry out?
As someone who has worked with and observed Microsoft for most of my career, I'll say this much: I've not seen the company fire on all cylinders the way it is doing now, and it's creating a cycle which I think is virtuous.
Sure, the company would do well to evaluate whether this continues to be the best way forward, in order to avoid burnout or being spread too thin. Right now, though, not only is the approach working, but it's inspiring, to partners, to third-party ISVs and, most important, internally at Microsoft. Progress is begetting more progress, and the proverbial flywheel is running. The last thing Microsoft should do now is pull the plug on that.