I've mentioned before that I've done a lot of work with Microsoft. Recently, I was visiting the company's corporate campus in Redmond, Washington, for the Global Summit of its Most Valuable Professionals program, in which I participate. Since I was on campus the week before O'Reilly's big data-focused Strata conference, of which Microsoft is a big sponsor, I took the opportunity to sit down with Kamal Hathi, Microsoft's Director of Program Management for BI in its Data Platform Group.
It's not just about Strata, either. Hathi is gearing up to deliver the keynote address at the PASS Business Analytics Conference in Chicago next month and so his mind is pretty well immersed in strategic questions around big data that are relevant to the software giant that employs him.
Redmond's big data worldview
My goal was to find out how Redmond views the worlds of big data, analytics, and business intelligence, and what motivates those views, too. What I found out is that Microsoft sees big data mostly through two lenses: its business intelligence sensibility, formed over more than a decade of being in that market, and the perspective of its other lines of business, including online services, gaming, and cloud platforms.
This pairing makes Microsoft's analytics profile a blend of old-school, mega-vendor BI market contender and modern-day customer of analytics technologies. And because Microsoft has had to use its own BI tools in the service of big data analyses, it's been forced to find a way to make the two work together, to ponder the mismatch between them, and to work out how best to productize a solution to that mismatch.
I mentioned last week's Strata conference, and that really is germane to my conversation with Hathi, because Microsoft made three key announcements, all of which tie into the ideas Hathi and I discussed. Those announcements are as follows:
Version 2 of its SQL Server Parallel Data Warehouse product is complete, with Dell and HP standing by to take orders now for delivery of appliances this month. PDW v2 includes PolyBase, which integrates PDW's Massively Parallel Processing (MPP) SQL query engine with data stored in Hadoop.
Microsoft released a preview of its "Data Explorer" add-in for Excel. Data Explorer can be used to import data from a variety of sources, including Facebook and Hadoop's Distributed File System, and can import data from the web much more adeptly than can Excel on its own. Data Explorer can import from conventional relational data sources as well. All data imported by Data Explorer can be added to PowerPivot data models and then analyzed and visualized in Power View.
Hortonworks, Microsoft's partner in all things Hadoop, has released a beta of its own distribution of the Hortonworks Data Platform (HDP) for Windows. This more "vanilla" HDP for Windows will coexist with Microsoft's HDInsight distribution of Hadoop, which is itself based on the HDP for Windows code base.
As I said, these announcements tie into the ideas Hathi discussed with me, but I haven't told you what they are yet. Hathi explained to me that Microsoft's strategy for "Insights" (the term it typically applies to BI and analytics) is woven around a few key pillars: "democratization", cloud, and in-memory. I'll try now to relay Hathi's elaboration of each pillar.
"Democratization" is a concept Microsoft has always seen as key to its own value proposition. It's based on the idea that new areas of technology, in their early stages, are typically catered to by smaller, pure-play specialist companies, whose products are sometimes quite expensive. In addition, the skills required to take advantage of these technologies are usually in short supply, driving costs up even further. Democratization disrupts this exclusivity with products that are often less expensive, integrate more easily in the corporate datacenter and, importantly, are accessible to mainstream information workers and developers using the skills they already have.
In the case of Hadoop, which is based on Apache Software Foundation open-source software projects, democratization is less about the cost savings aspect and much more about datacenter integration and skill set accessibility. The on-premises version of Microsoft's HDInsight distribution of Hadoop will integrate with Active Directory, System Center, and other back-end products; the Azure cloud-based version integrates with Azure cloud storage and with the Azure SQL Database offering as well.
In terms of skill set accessibility, Microsoft's integration of Excel/PowerPivot and Hadoop through Hive and ODBC means any Excel user who even aspires to power user status will be able to analyze big data on her own, using the familiar spreadsheet tool that has been established for decades.
The other thing to keep in mind is that HDInsight runs on Windows Server, rather than Linux. Given that a majority of Intel-based servers run Windows and that a majority of corporate IT personnel are trained on it, providing a Hadoop distribution that runs there, in and of itself, enlarges the Hadoop tent.
Big data in the cloud; big data and the cloud
The cloud isn't just about the Azure version of Hadoop. Rather, it's about ease of provisioning (itself another democratizing benefit) and access to public datasets. Cloud-based Hadoop clusters can be built by filling out a web-based form, clicking a submit button, and waiting for about 10 minutes. And once that cluster is up, it has access to data sources that are themselves cloud based. Yes, on-premises clusters can get to that cloud-based data too. But the effort and fixed infrastructure required to do that work on-premises are more significant.
Overall, Microsoft wants its big data technologies to scale from the desktop to the cloud. Hathi likened the goal to having a "flying car", permitting you to go from ground based to airborne (or desktop to private/public cloud) without having to switch to a plane. Writing that up makes it sound fairly corny, but Hathi was in earnest. It's just too inefficient to make trained information workers and database specialists change to a plane (Hadoop and MapReduce?) just because they want to work with really large datasets.
In-memory is another very important area for Microsoft in the analytics world. Starting with the release of SQL Server 2008 R2 and its companion PowerPivot add-ins for Excel and SharePoint, Microsoft has seized on the value of in-memory technology. With the release of SQL Server 2012, the columnar engine in PowerPivot was brought to the company's Analysis Services product and even to its relational database in the form of special columnstore indexes.
The next version of SQL Server (currently referred to as SQL Server "14") will enhance columnstore indexes and will introduce a second in-memory database engine, code-named "Hekaton," designed for transactional workloads. Hekaton not only keeps data in memory, but turns database stored procedures that query and manipulate that data into fast, compiled, native code.
Hekaton will bring big performance gains for certain workloads. And if you think those transactional workloads don't impact analytics, think again. As it turns out, Hekaton should be very beneficial for certain data extract, transform and load (ETL) applications as well.
Predicting predictive analytics
I got a lot of good information and insight (in the plain-English sense) chatting with Kamal Hathi, but a few things still concerned me. The biggest thing on my mind was the topic of predictive analytics, sometimes called machine learning, or data mining. Microsoft added a data mining engine to its Analysis Services product all the way back in 2000. That engine was enhanced rather dramatically with the release of SQL Server 2005, and brought the kind of democratization with it that Microsoft seeks to bring now for big data overall.
But since 2005, not much has been done with SQL Server Data Mining. It's still a very useful product, and it still ships with SQL Server, indicating that it's still important to Microsoft, even if the company hasn't significantly invested in it for eight years. But at this point in the market, many of Microsoft's competitors are in the predictive analytics game and Microsoft has fallen way behind. So what does it plan to do about it?
Hathi was a bit cagey with me in his answer. In other words, he didn't tell me much of substance. But he assured me that predictive analytics is "super important" to Microsoft and that we will see progress on this front from the company. He indicated something similar for the territory of streaming data/complex event processing (CEP). Right now, the only offering Microsoft has in that arena is StreamInsight, a rather raw CEP engine, geared mostly to developers, that ships with SQL Server.
As I've written previously, Microsoft had excellent showings in Gartner's latest Magic Quadrants on data warehousing and business intelligence. In the last few years, the company has revamped almost its entire analytics stack, and has embraced big data and Hadoop, despite their being so oriented toward Linux and open source. If Microsoft could just get its mobile BI (i.e., data visualization for major smartphone and tablet platforms) story going, it would find itself in a fantastic position as the big data and enterprise BI worlds converge.
Disclosure: I'll be speaking at the PASS Business Analytics Conference in Chicago next month myself.