IBM makes cluster compute engine Apache Spark core to its cloud

IBM is to offer Apache Spark as a service through its Bluemix cloud platform and will deploy thousands of developers to work on the distributed computing framework for big data analytics.

IBM today pledged to put the Apache Spark data processing platform center stage in its cloud services.

The technology giant plans to embed Spark into its analytics and commerce offerings, and to offer Spark as a cloud service on its Bluemix platform.

Spark was started in 2009 as a UC Berkeley research project to create a cluster computing framework for workloads poorly served by Hadoop. It went open source in 2010 and last year had more than 450 contributors. Its creators went on to found Databricks.

Spark has several advantages over Hadoop's MapReduce execution engine for processing big data: it runs batch processing jobs faster and handles a wider range of computing workloads. Spark SQL provides a HiveQL-compatible SQL execution environment; MLlib enables machine learning; Spark Streaming provides high-speed stream processing; and GraphX provides graph processing.

Big Blue sees a role for Spark in providing the backend for apps and Internet of Things appliances - supporting real-time analysis and predictions from big data.

IBM will also put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its IBM SystemML machine learning technology to the Spark open source ecosystem; and help provide training for more than one million data scientists and data engineers on Spark. This training will be provided in partnership with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Spark will also be used to power the insight platform for IBM's Watson Health Cloud - which IBM claims will deliver faster results to doctors and medical researchers when analysing population health data.

One of the organizations that will use the Spark service on Bluemix is the SETI Institute, which is working with IBM and NASA to analyze terabytes of deep space radio signals using Spark's machine learning capabilities, in a hunt for patterns that suggest the existence of intelligent extraterrestrial life.

"With Spark as a Service on Bluemix, we'll be able to work with IBM to develop promising new ways to analyze signal data as we hunt for evidence of intelligence elsewhere in the cosmos," said Dr. Seth Shostak, senior astronomer and director of the Center for SETI Research.

IBM is one of four founding members of the UC Berkeley AMPLab, where Spark was created, and as a result works closely with AMPLab researchers on projects of mutual interest.
