Apache Spark becomes top-level project

Summary: The Hadoop-based in-memory cluster computing engine moves from Apache incubator status to top-level project

TOPICS: Big Data

The Apache Software Foundation announced this morning that Spark, the distributed, in-memory cluster computing framework that runs on Hadoop, has graduated from incubator status to top-level project.

It has seemed for a while now that 2014 would be Spark's year. Spark moves Hadoop beyond the disk-based, batch-mode operation of MapReduce and turns it into a full-on interactive, distributed, in-memory processing colossus. According to the Spark Web site home page, the engine can "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
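To see why keeping data in memory matters when the same data set is queried repeatedly, consider a toy sketch in plain Python. It does not use Spark's API at all; the class and function names (`DiskBackedDataset`, `CachedDataset`, `run_queries`, `demo`) are hypothetical and exist only for illustration. It simply counts how often a file is read when three queries run against a data set that is re-read from disk each time, versus one that is read once and then held in memory.

```python
import os
import tempfile


class DiskBackedDataset:
    """Re-reads its file on every pass, like a chain of disk-based batch jobs."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0  # how many times the file has been opened and read

    def records(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return [int(line) for line in f]


class CachedDataset(DiskBackedDataset):
    """Reads the file once, then serves every later pass from memory."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def records(self):
        if self._cache is None:
            self._cache = super().records()
        return self._cache


def run_queries(dataset):
    """Run three separate passes over the same data."""
    total = sum(dataset.records())
    largest = max(dataset.records())
    evens = sum(1 for x in dataset.records() if x % 2 == 0)
    return total, largest, evens


def demo():
    # Write the numbers 1..10 to a temporary file, one per line.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(str(n) for n in range(1, 11)))
        path = f.name
    try:
        disk = DiskBackedDataset(path)
        cached = CachedDataset(path)
        results = (run_queries(disk), run_queries(cached))
        return results, disk.disk_reads, cached.disk_reads
    finally:
        os.remove(path)


if __name__ == "__main__":
    results, disk_reads, cached_reads = demo()
    print("same answers:", results[0] == results[1])
    print("disk-backed reads:", disk_reads, "| cached reads:", cached_reads)
```

Both variants return identical answers, but the cached one touches the disk only once. A real cluster engine has far more to contend with (partitioning, fault tolerance, network shuffles), so this illustrates the general principle only and says nothing about how Spark reaches its quoted speedups.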

Born in UC Berkeley's AMPLab and now commercially backed by startup company Databricks, Spark has been the talk of the Big Data town, and holds the promise of making Hadoop into a real-time computing engine. Leading Hadoop vendor Cloudera has already taken the plunge and integrated Spark into its Hadoop distribution, CDH (Cloudera Distribution including Apache Hadoop).

Spark has all the in-style Hadoop accouterments. For example, it can run on the YARN component of Hadoop 2.0. Its companion project, Shark, implements a SQL-on-Hadoop engine that is syntax-compatible with Apache Hive, and claims the same 10x/100x performance gains over Hive that Spark claims over raw MapReduce.

The Apache Software Foundation explains that Spark has APIs that allow developers to quickly write applications for it in Java, Python, or Scala.  The foundation's press release also explains that "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source."

It's not just the engine that's fast. Spark's own rise has been rapid as well, given that the project only entered the Apache incubator this past June.

While it may be moving quickly, keep your eye on this ball.  By combining fast, interactive computing with the cooperative power and economics of Hadoop and its file system (HDFS), Spark has the potential to transform Hadoop from the thing people know they should appreciate to the technology they just have to have.

Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.
