Like the Docker container technology, Apache Spark has emerged as a new darling of the open-source world, with widespread take-up by data teams and developers, backed by a highly active community.
Started in 2009 as a UC Berkeley research project to create a cluster computing framework for workloads poorly served by Hadoop, Spark went open source in 2010, and its 1.1 release in September counted over 170 contributors.
One of Spark's main attractions is that it is optimised not only for on-disk but also for in-memory computation, according to Ion Stoica, CEO of Databricks, the company started last year by the creators of Spark's various components.
"It also has a more general and easy-to-use API. So when you write applications in Spark, you don't need to cast them as a bunch of maps and reduces. You can almost write arbitrary applications," Stoica said. "When we do regular surveys and ask people why they like Spark, half say it's speed and half say it's ease of use."
That general API has also facilitated the creation of a set of libraries on top of Spark, which target workloads ranging from streaming, SQL queries, machine learning and analytics to graph computation.
The original goal of the Spark project was to provide a cluster computing framework that could handle interactive queries and iterative computation, such as machine-learning algorithms — two areas not addressed well by Hadoop at that point.
"Machine-learning people wanted to run algorithms at scale, but each iteration of the algorithm was a MapReduce job, and between iterations you had to save and load the data from HDFS [Hadoop Distributed File System]. That was very slow," Stoica said.
"At that time the companies we were collaborating with had already started to have these real-time interactive computation needs versus the batch computation that was provided by Hadoop."
He said Spark's underlying appeal is in providing a unified framework to create sophisticated applications involving workloads that until now might have required several systems.
"Imagine you want to perform interactive queries on streaming data — the data that has just arrived in the system. Today you have to use Storm for streaming the data and then maybe Impala [Cloudera's query engine] or something like that for interactive queries," Stoica said.
"However, these two systems are separate: you have to maintain them and develop in different languages. But beyond that, you need to move the data from Storm to Impala by writing it to HDFS. That takes time, which almost negates the advantage of doing interactive queries on streaming data, because now the data is delayed."
With Spark, that task becomes easier because the data is not copied. Instead, the streaming component, Spark Streaming, writes incoming data into the same in-memory data structures that the Spark SQL component, running on top of Spark, can read directly.
Another example is Spark's ability to provide online machine learning: it can stream the data and then call machine-learning libraries, functions or algorithms directly, removing the need to combine a streaming system with a separate library such as Apache Mahout.
"One of the advantages of Spark is that it unifies multiple workloads. So even if you are just thinking of using it for in-memory computation, you soon start using its other capabilities," Stoica said.
"Why should you go and use a different system when once you learn Spark, you can use the same API, the same development environment, to run your other applications?"
With a goal of commercialising the technology, Databricks' main business model is providing unmodified Spark as a service in the cloud, offering management tools to simplify the creation, maintenance and scaling of Spark clusters.
Stoica said Spark is being used for a number of purposes, thanks to its API and also because people are applying the technology from various starting points to solve a range of problems.
"We see everything from ETL [extract, transform, load] to machine learning, and some of the most interesting applications are combining streaming with interactive queries and machine learning," he said.
"Spark handles unstructured data very well and you can write very general programs with it, not only MapReduce. We also see — because we have Spark SQL, which is interactive, in particular on data that's in memory — it being used with BI tools. Many of the BI tools today are working on top of Spark, that is, Spark SQL."
Because Spark is a processing engine, it can also work on data that is not necessarily stored in HDFS.
"One of the most popular use cases is Spark over HDFS and we see a lot of that. But Spark can also run on different storage engines. So for instance, in Amazon it will run over S3. We also have Spark running on top of Cassandra," Stoica said.
"So companies that adopt Cassandra, which is a large-scale transactional storage engine, and want to do data analytics beyond just reading, writing and querying the store, use Spark. They don't go to Hadoop."
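Pointing Spark at a store other than HDFS is typically a matter of configuration. The fragment below is illustrative only — the package coordinates and versions are examples, not pinned recommendations, and the hosts and filenames are hypothetical.

```shell
# Spark over S3, via Hadoop's s3a connector:
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.access.key=... \
  app.py

# Spark over Cassandra, via the DataStax spark-cassandra-connector:
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.cassandra.connection.host=cassandra-host \
  app.py
```

In both cases the application code stays the same Spark program; only the data-source configuration changes.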
Looking ahead for Spark, Stoica said on the open-source side there are two priorities.
"The first one is, of course, increasing the scale, efficiency and robustness of the Spark core. The second one is improving and developing libraries — where now we have streaming, SQL, MLlib for machine learning, and GraphX, which is in alpha, for graph computation," he said.
As well as adding new libraries, such as SparkR for the R language, the project will add new machine-learning algorithms to the existing libraries.
"If you look into the future, all these applications are emerging on top of Spark. A strong application ecosystem also depends on how easy it is to build these applications. How easy it is to build applications depends on having a good API but also very strong libraries," Stoica said.
"All these successful platforms for building applications, they all have very strong libraries — so that's one of our key focuses in terms of open source."