R, the open source programming language that is extremely popular with data professionals, seems to be Microsoft's latest religion. And, with the planned June 1st General Availability date for SQL Server 2016, which features an integrated version of R, the Bar Mitzvah/Confirmation is upon us.
Why is Microsoft integrating an open source programming language into its flagship commercial database, a product among its biggest revenue breadwinners? And even if we can answer that one, we need to consider that Microsoft has also integrated R into its HDInsight, Azure Machine Learning and Power BI cloud offerings, as well as its Visual Studio developer environment.
So, really, what's going on? In addition to the technology itself, factors of policy, leadership, and strategy are at play here. If you want to understand Microsoft in the post-Gates/Ballmer era, this question is worth exploring.
First off, Microsoft's embrace of open source is now a fact, rather than an issue. The company gets that open source platforms are de facto industry standards, and that customers like products that support them. Microsoft already has a version of HDInsight, its Big Data platform based on open source Hadoop and Spark technologies, that runs on Linux. It is also developing a version of SQL Server itself for Linux. Then there's Visual Studio Code, which runs on Windows, Mac or Linux. And a large portion of the virtual machines in the Microsoft Azure cloud are running Linux too.
But Microsoft did more than adopt the open source standard. It took a leadership position in it by acquiring Revolution Analytics last April.
Microsoft bought wisely. Before the deal was contemplated, Revolution had taken the R language, which was a client-side, memory-bound technology, and made it server-based and multi-threaded. As a second act, Revolution built a distributed version of the product that could run in clustered/grid environments, so it could take on Big Data workloads in full rather than resorting to sampling.
Finally, to combat the inefficiencies (and occasional non-feasibility) of moving data from its origin to R (on the client or server), Revolution created integrated versions of its server product for Hadoop and various data warehouse platforms. While, perhaps ironically, Revolution didn't create such a version for SQL Server, the efficacy of the approach was already proven. And so was R, as an enterprise technology.
Beyond the technology fit, consider the personnel at play here. Microsoft CEO Satya Nadella, who had been on board for over a year when the Revolution deal closed, was a huge proponent of open source and data technology. Nadella had come out of the Bing operation before becoming President of Cloud and Enterprise (née Server and Tools), and had a great appreciation for what analytics at scale was all about. He's also the one who pushed Microsoft toward Hadoop and away from its home-grown Dryad technology, and he has spoken eloquently on "systems of intelligence," "ambient intelligence" and "data culture."
Then there's Joseph Sirosh, who before coming to Microsoft was at Amazon.com, in a VP/CTO role. Sirosh's title at Microsoft is now "Corporate Vice President, Data Group," a promotion from his original role, which was more specifically focused on machine learning. Sirosh is famous (in a certain analytics universe) for his Strata keynote presentation on the "Connected Cow." Azure Machine Learning was his baby and so -- by all accounts -- was the Revolution acquisition. Sirosh reports directly to Scott Guthrie, Executive Vice President for Cloud and Enterprise and himself a champion of open source technologies at Microsoft. Guthrie reports directly to Nadella.
The integration of R in SQL Server is clever, if a little bit Rube Goldberg. Data professionals and analysts who are comfortable working with R on their workstations can keep doing so, and yet delegate the actual compute work to the SQL Server. Revolution provides a set of R functions that emulate standard R functions. By using them, R pros can set the "compute context" to SQL Server in their R scripts, and everything then executes remotely.
In addition, code in SQL Server's native language, Transact-SQL (T-SQL), can run in a "polyglot" fashion, with R code embedded inside it. Unfortunately, T-SQL sees the R code as a simple text string, which it sends to a special system stored procedure for execution. That means the R code is neither syntax-highlighted nor checked before execution. Developers are well-advised to test their R code in a client tool first, and then bring it to SQL Server.
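To make the "simple text string" point concrete, here is a minimal Python sketch that assembles such a T-SQL batch. It assumes the system stored procedure in question is sp_execute_external_script, with its @language, @script and @input_data_1 parameters; the helper name build_r_batch, the table dbo.Sales and the column amount are hypothetical, chosen only for illustration. Nothing here connects to a database.

```python
def build_r_batch(r_script: str, input_query: str) -> str:
    """Return a T-SQL batch that hands r_script, as plain text, to SQL Server."""

    def quote(text: str) -> str:
        # T-SQL escapes a single quote inside a string literal by doubling it.
        return "N'" + text.replace("'", "''") + "'"

    # Assumption: sp_execute_external_script is the procedure the article means.
    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'R',\n"
        f"    @script = {quote(r_script)},\n"
        f"    @input_data_1 = {quote(input_query)};"
    )


# Hypothetical R snippet: average one column of the input data set.
r_script = "OutputDataSet <- data.frame(avg = mean(InputDataSet$amount))"
batch = build_r_batch(r_script, "SELECT amount FROM dbo.Sales")
print(batch)

# The same R code with a missing closing parenthesis still yields a
# well-formed batch, because to T-SQL the script is only a string.
broken_batch = build_r_batch(r_script[:-1], "SELECT amount FROM dbo.Sales")
```

Because the R script is only a quoted literal, the broken variant builds just as cleanly as the working one; the mistake would surface only when R runs on the server, which is why testing in a client tool first is sound advice.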
The upside of this approach is that R code doesn't require a special mode to be enabled in order for it to run. Compare this to SQL Server's adoption of .NET code, dating back to 2005, which does require explicit enablement, and which has not enjoyed widespread adoption, as a result. R's integration has a simpler, more native feel to it, and for things like user-defined aggregates, where .NET code was arguably most useful in SQL Server, R will provide a lot more power.
But, ultimately, this is about more than just R; it's about Microsoft's very identity. Microsoft has decided -- and I think rightly so -- that the next era of computing, while enabled by the cloud, will feature data-driven intelligence: in platforms, in applications and in devices.
The cloud is where things monetize, though, and having a suite of powerful data services in the Azure cloud establishes Microsoft's credibility and should ultimately drive a lot of revenue.
So, really, the company's "mobile first, cloud first" mantra is an allusion to another: "data and machine intelligence, first, last and always."