Ion Stoica, CEO of Databricks and a professor of computer science at the University of California, Berkeley, and Arsalan Tavakoli-Shiraji, Head of Business Development and Partnerships, recently stopped by to talk about Apache Spark, the role Databricks plays in that project, and how it helps organizations extract real value from the operational data they already have.
What is Apache Spark?
Apache Spark is a project designed to accelerate Hadoop and other big data applications through the use of an in-memory, clustered data engine. The Apache Foundation describes the Spark project this way:
Spark is a fast and powerful engine for processing Hadoop data. It runs in Hadoop clusters through Hadoop YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
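The MapReduce-style processing mentioned above chains simple per-record transformations (Spark calls them flatMap, map, and reduceByKey) into a pipeline. As a rough single-machine sketch of that data flow in plain Python (no Spark installation assumed; the sample lines are made up), the classic word count looks like:

```python
from collections import Counter

# A few input lines standing in for records stored in HDFS.
lines = [
    "spark runs in memory",
    "spark processes hadoop data",
]

# flatMap: split each line into individual words.
words = (word for line in lines for word in line.split())

# map + reduceByKey: pair each word with a count of one,
# then sum the counts per word.
counts = Counter()
for word in words:
    counts[word] += 1
```

Spark expresses the same pipeline as transformations on a distributed dataset, keeping intermediate results in memory across the cluster rather than writing them to disk between stages.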
Which languages does Spark support?
Spark supports Scala, Java and Python.
How large a cluster can Spark scale to?
We are aware of multiple deployments on over 1,000 nodes.
Who is Databricks?
Databricks is a company founded by the creators of Apache Spark and a number of executives with strong experience starting up companies such as Conviva, Opsware, and Nicira.
The company offers a cloud service, Databricks Cloud, that makes it possible for organizations to get started with Apache Spark quickly. Databricks Cloud handles the metadata, launches and provisions a Spark cluster, and makes it easy for that cluster to process an organization's data stored in Amazon's S3 service.
Databricks Cloud helps analysts by organizing data into "notebooks" and making it easy to visualize that data through dashboards. It also makes it easy to analyze data using machine learning (MLlib), GraphX, and Spark SQL.
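Spark SQL lets analysts express this kind of analysis as declarative queries over cluster-resident data. As a rough single-machine analogue (using Python's built-in sqlite3 as a stand-in, not Spark SQL itself, with made-up sample rows), the sort of aggregation a notebook cell or dashboard might be built on looks like:

```python
import sqlite3

# In-memory table standing in for a dataset an analyst would
# query from a notebook.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 120), ("bob", 300), ("alice", 80)],
)

# An interactive aggregation: total bytes per user. Spark SQL
# would run an equivalent query across a cluster instead of a
# single process.
rows = conn.execute(
    "SELECT user, SUM(bytes) FROM events GROUP BY user ORDER BY user"
).fetchall()
```

The point of the service is that the analyst writes only this kind of query; provisioning the cluster and wiring the query engine to the data is handled for them.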
Getting real value out of big data
How does Databricks help organizations get real value out of their data? The challenge, Databricks points out, is that Apache Hadoop consists of a number of independent but related projects. First, an organization must learn about these projects, what each technology does, and how to put the pieces together to solve the organization's problems. Then it must learn how to build a Hadoop cluster and how to prepare the data. Only then can it start the process of exploring the data and gaining insights.
Databricks wants to reduce that to signing up for its service, pointing it at the organization's data, and beginning the process of cataloging and analyzing that data. Databricks has done the work of collecting the appropriate tools, configuring them, and turning a collection of independent projects into a tool an organization can put to use quickly.
Although we really didn't have time to get into the details of working through an organization's data, it appeared that Databricks has significantly simplified the process. If your organization is beginning a big data project, Databricks would be a good company to know.