Hadoop, which has been the poster child for Big Data, started out using an open source implementation of Google's MapReduce as its exclusive processing engine. But for many small and medium-sized data sets, the overhead of MapReduce's batch-mode, two-pass approach made it hard to realize any real benefit.
Hadoop 2.0's YARN cluster management layer opened up Hadoop to other processing engines, and its Tez DAG (directed acyclic graph) engine makes interactive processing much more feasible. Nonetheless, Hadoop is painted with the broad brush of batch processing.
Other engines, most notably Apache Spark, have earned the opposite reputation. Because Spark can manipulate data in-memory, it offers the interactivity of an OLAP or Data Warehouse column store solution. Yet it maintains the approach of processing data on a distributed basis across servers in a cluster.
Cloudera jumped on the Spark bandwagon early; MapR added Spark to its own Hadoop distribution soon after. Hortonworks, the company that contributed much of the engineering effort and leadership to YARN and Tez, was less interested, but eventually got on board. Next came IBM, with a splashy set of announcements at Spark Summit in San Francisco last month.
Now Microsoft has joined in, launching today a public preview of Spark on Azure HDInsight, its cloud-based Hadoop offering. Spark on HDI (as Microsoft sometimes refers to HDInsight) includes integration of Spark's dashboard user interface into its own. It also includes support for Jupyter and Apache Zeppelin notebooks, which provide browser-based environments for creating Python and Scala code (respectively) as well as SQL queries, for manipulating, querying and visualizing Spark data.
What's especially noteworthy about Microsoft's Spark implementation is that the company has integrated it directly with its own Power BI service, which it announced today will enter general availability on July 24.
Alongside a number of other data and cloud services, Power BI allows users to select Spark on HDInsight as a service connection from which to build a dataset.
Microsoft's adoption of Spark, and simultaneous integration of it with its strategic BI platform, sends a clear message. In-memory technologies that can handle small and medium-sized data sets in quick, interactive form herald the convergence of Big Data and traditional BI. Whether Microsoft means to convey this message explicitly or not, one look at the above screenshot, which places Spark next to Microsoft's own OLAP platform (SQL Server Analysis Services), makes this point crystal clear.
Hadoop is changing. It is becoming more defined by the ecosystem of tools and projects compatible with its HDFS storage system, and less by its own processing infrastructure.
Is this good? Is it misguided? Frankly, it doesn't matter. The state of the art in the analytics space is more empirical than prescribed. And the unavoidable observation is that Spark is mainstream in the industry and helping break down the walls between Big Data and BI.