It's seemed for a while now that 2014 would be Spark's year. Spark moves Hadoop beyond the disk-based, batch-mode operation of MapReduce, turning it into a full-on interactive, distributed, in-memory processing colossus. According to the Spark Web site home page, the engine can "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
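To make the disk-versus-memory distinction concrete, here is a minimal sketch in plain Python. It is not Spark or MapReduce code, and every name in it (write_dataset, disk_based_passes, in_memory_passes, demo) is made up for illustration. It assumes a job that makes several passes over the same data, and it only counts how often the input is read from disk; it makes no attempt to reproduce the 100x or 10x figures.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def read_dataset(path):
    """Read the whole file back from disk into a list of integers."""
    with open(path) as f:
        return [int(line) for line in f]


def disk_based_passes(path, passes):
    """Re-read the input from disk on every pass (the batch-style pattern)."""
    disk_reads = 0
    total = 0
    for _ in range(passes):
        data = read_dataset(path)  # each pass goes back to disk
        disk_reads += 1
        total += sum(data)
    return total, disk_reads


def in_memory_passes(path, passes):
    """Read the input once, keep it in memory, and reuse it for every pass."""
    data = read_dataset(path)  # a single trip to disk
    disk_reads = 1
    total = 0
    for _ in range(passes):
        total += sum(data)  # later passes touch only memory
    return total, disk_reads


def demo():
    """Run both patterns on the same temporary file and return their results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "numbers.txt")
        write_dataset(path, 1000)
        return disk_based_passes(path, 5), in_memory_passes(path, 5)


if __name__ == "__main__":
    disk_result, memory_result = demo()
    print("disk-based (total, disk reads):", disk_result)
    print("in-memory  (total, disk reads):", memory_result)
```

Both functions compute the same total; the difference is that the second reads the file once rather than once per pass. The more passes a workload makes over the same data, as iterative and interactive jobs tend to, the more that saved I/O matters.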
Begun at UC Berkeley's AMPLab and now commercially backed by startup Databricks, Spark has been the talk of the Big Data town and holds the promise of turning Hadoop into a real-time computing engine. Leading Hadoop vendor Cloudera has already taken the plunge and integrated Spark into its Hadoop distribution, CDH (Cloudera Distribution including Apache Hadoop).
Spark has all the in-style Hadoop accouterments. For example, it can run on the YARN component of Hadoop 2.0, and its companion project, Shark, implements a SQL-on-Hadoop engine that is syntax-compatible with Apache Hive but claims the same 10x/100x performance gains over Hive that Spark claims over raw MapReduce.
The Apache Software Foundation explains that Spark has APIs that allow developers to quickly write applications for it in Java, Python, or Scala. The foundation's press release also explains that "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source."
It's not just the engine that's fast. Spark's rise has been rapid as well: the project entered the Apache Incubator only this past June.
While it may be moving quickly, keep your eye on this ball. By combining fast, interactive computing with the cooperative power and economics of Hadoop and its file system (HDFS), Spark has the potential to transform Hadoop from the thing people know they should appreciate to the technology they just have to have.