Apache Spark becomes top-level project

The Hadoop-based in-memory cluster computing engine moves from Apache incubator status to top-level project

The Apache Software Foundation announced this morning that Spark, the distributed, in-memory cluster computing framework that runs on Hadoop, has graduated from incubator status to top-level project.

It's seemed for a while now that 2014 would be Spark's year. Spark moves Hadoop beyond the disk-based, batch-mode operation of MapReduce and turns it into a full-on interactive, distributed, in-memory processing colossus. According to the Spark website's home page, the engine can "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
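To make the disk-versus-memory distinction concrete, here is a minimal sketch in plain Python. It does not use Spark's API, and it says nothing about the 10x/100x figures. The CountingSource class, the function names, and the sample log lines are all hypothetical stand-ins, chosen only to show why re-reading a data set for every query costs more than loading it once and keeping it in memory.

```python
# Conceptual sketch only: plain Python, NOT Spark's API.
# CountingSource, batch_style, in_memory_style and LOG_LINES are
# hypothetical names invented for this illustration.

LOG_LINES = [
    "error disk full",
    "info job started",
    "error task timeout",
    "info job finished",
]


class CountingSource:
    """Stand-in for a data set on disk; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def batch_style(source):
    """Each query goes back to the source, like a series of separate batch jobs."""
    errors = sum(1 for line in source.read() if line.startswith("error"))
    infos = sum(1 for line in source.read() if line.startswith("info"))
    return errors, infos


def in_memory_style(source):
    """Read once, keep the working set in memory, run both queries against it."""
    cached = source.read()
    errors = sum(1 for line in cached if line.startswith("error"))
    infos = sum(1 for line in cached if line.startswith("info"))
    return errors, infos


if __name__ == "__main__":
    disk = CountingSource(LOG_LINES)
    print(batch_style(disk), "reads:", disk.reads)

    disk = CountingSource(LOG_LINES)
    print(in_memory_style(disk), "reads:", disk.reads)
```

Both functions return the same counts. The difference is how many times the source is read, which is the cost an interactive session pays on every follow-up question.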

Born at UC Berkeley's AMPLab and now commercially backed by startup Databricks, Spark has been the talk of the Big Data town, and holds the promise of making Hadoop into a real-time computing engine. Leading Hadoop vendor Cloudera has already taken the plunge and integrated Spark into its Hadoop distribution, CDH (Cloudera Distribution including Apache Hadoop).

Spark has all the in-style Hadoop accouterments. For example, it can run on the YARN component of Hadoop 2.0, and its companion project, Shark, implements a SQL-on-Hadoop engine that is syntax-compatible with Apache Hive while claiming the same 10x/100x performance gains over Hive that Spark claims over raw MapReduce.

The Apache Software Foundation explains that Spark has APIs that allow developers to quickly write applications for it in Java, Python, or Scala.  The foundation's press release also explains that "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source."

It's not just the engine that's fast. Spark's own rise has been rapid as well: the project only entered the Apache incubator this past June.

It may be moving quickly, but keep your eye on this ball. By combining fast, interactive computing with the cooperative power and economics of Hadoop and its file system (HDFS), Spark has the potential to transform Hadoop from the thing people know they should appreciate into the technology they just have to have.