After an absence of about a year, and a stint as Research Director at the now-defunct Gigaom Research, I've returned to ZDNet to cover Big Data. The year went by pretty quickly, but a number of things have changed:
- SQL-on-Hadoop has become so ubiquitous that almost every Hadoop and relational database vendor has its own solution.
- Industry consolidation has begun. Companies like Jaspersoft, Pentaho, Hadapt, RainStor and Revolution Analytics have been acquired or soon will be.
- YARN and Hadoop 2.x now have all the mindshare, and old-school MapReduce is in retreat.
But one change that has become especially noteworthy is the degree to which Apache Spark has captured the attention and excitement of the industry.
Spark can run independently of Hadoop or as a YARN application on a Hadoop cluster. In the latter configuration it can read data in the Hadoop Distributed File System (HDFS) and then run a range of workloads against that data. Spark SQL provides a HiveQL-compatible SQL execution environment; Spark's MLlib enables machine learning; Spark Streaming provides high-speed stream processing of data; and GraphX provides graph processing.
See Spark run
In addition to the familiarity that Spark SQL provides, Spark code can be written in Scala, Java and Python. Spark can (but does not have to) work in memory, and when it does, it does so in a distributed fashion across the RAM of its cluster's nodes. Getting a sample application running in Spark is fairly straightforward. That, combined with its memory-based, non-batch processing capabilities, provides interactive experimentation and near-instant gratification - something that has not been the norm in the Hadoop world.
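To give a flavor of what that kind of quick experiment looks like, here is a minimal sketch in plain Python. To be clear, this is not Spark code and it runs on a single machine: the helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins of my own, loosely modeled on the `flatMap` and `reduceByKey` operations that Spark exposes on its datasets, and the sample lines are made up. It only illustrates the shape of the classic word-count pipeline.

```python
from itertools import chain

# Made-up sample data; a real Spark job would typically read this from HDFS.
lines = [
    "spark runs on yarn",
    "spark runs without hadoop",
]


def flat_map(func, items):
    """Apply func to each item and flatten the results into one sequence."""
    return chain.from_iterable(func(item) for item in items)


def reduce_by_key(func, pairs):
    """Combine the values of all pairs that share a key, using func."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


# Split lines into words, pair each word with a count of 1, then sum per word.
words = flat_map(str.split, lines)
pairs = ((word, 1) for word in words)
counts = reduce_by_key(lambda a, b: a + b, pairs)

for word, count in sorted(counts.items()):
    print(word, count)
```

The point is the style rather than the code itself: a short chain of transformations, typed at a prompt, with results back right away instead of after a batch job.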
That relatively friction-free experience, even if at the command line, can be intoxicating. And intoxicated the industry is. While Spark is still quite new, and several people have reported to me that it's not ready for prime time, industry support for Spark is intense. Cloudera has promised to re-platform most Hadoop ecosystem components in its distribution onto Spark. MapR includes Spark in its distro, and Hortonworks, once a Spark holdout, has jumped on the bandwagon as well, including Spark in HDP (Hortonworks Data Platform), its own Hadoop distribution.
Getting started is easy
While neither Amazon's Elastic MapReduce nor Microsoft's Azure HDInsight cloud Hadoop service includes Spark automatically, both companies have enabled installation of Spark via custom script steps that simply require specifying a URL when a cluster is created. Both companies also provide samples and tutorials that make it easy to run quick-and-dirty Scala code or SQL queries.
And if none of that works for you, then Databricks, the company founded by Spark's creators, has its Databricks Cloud offering (something you might wish to call Spark as a Service, if that didn't overload an already well-worn acronym) waiting in the wings.
Some companies, like Paxata and ClearStory Data, have built their products on Spark. Others, like Platfora, have deployed new product capabilities that have dependencies on, and certain integrations with, the Apache Software Foundation project. Adoption of Spark in the enterprise may be low so far, but industry adoption is formidable.
So what happens next with Spark? Some in the industry have predicted that Spark's popularity, and its ability to run without Hadoop, mean it may overtake Hadoop. Others, myself included, are more skeptical, given that HDFS alone has become enough of a standard to keep Hadoop entrenched, and YARN allows challengers to run as applications on the cluster.
In general, vendors seem so far ahead of customers on Spark that it's almost worrisome. If Spark isn't yet stable and robust enough for big enterprise production jobs, if even the companies that have standardized on Spark say they have had to write their own enhancements to make it work for them (something I have been told by important vendors in the Big Data space), then is Spark just hype?
Readiness is in the eye of the beholder. Robin Bloor of Bloor Research, a well-respected industry analyst firm, once told me this (and I'm paraphrasing): when platforms get beyond a certain critical mass of support, they eventually become what the hype has made them out to be. In other words, belief in the quality of a platform tends to self-fulfill. Once the industry commits to something, it creates an imperative around getting it stable and well-performing, even if the committers themselves have to pitch in.
We're now a bit more than three months into the year; I saw my first Mr. Softee truck yesterday, a sure sign that spring has finally come to New York. Before the big Christmas tree goes up in Rockefeller Center at the end of the year, Spark seems likely to achieve at least some of its own self-fulfilling maturation and reliability. There are plenty of shopping days to go in the interim; let's wait and see the outcome.