Apache Spark 1.4 adds R language support and hardened machine learning

With support for stats language R, along with a range of new features, the latest update to in-memory data-processing engine Apache Spark is now out.

By providing access to the popular R statistical programming language, the latest iteration of fast-growing analytics cluster framework Spark is aiming to make life easier for data scientists.

Along with support for Python 3, Spark 1.4, which is now generally available, lets R users work directly on large datasets through the SparkR R API.

"Because SparkR uses Spark's parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs," Patrick Wendell, Spark committer and software engineer at Spark firm Databricks, said in a blogpost.

SparkR is an R package initially developed at UC Berkeley's AMPLab to provide an R frontend to Apache Spark. By using Spark's distributed computation engine, users can run large-scale data analysis from the R shell, Wendell wrote in an earlier post.

Spark 1.4 also brings improvements and new features to Spark's DataFrame API, adding window functions to both Spark SQL and the DataFrame library. Window functions let users compute statistics, such as running totals, over a range of rows related to the current row.
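To make the idea of a window function concrete, here is a minimal sketch in plain Python rather than Spark itself. It assumes a small in-memory list of rows, and the function name `running_total` and the field names `region`, `day` and `amount` are hypothetical, chosen only for illustration. It mimics what a sum over a partitioned, ordered window would compute.

```python
from itertools import groupby
from operator import itemgetter


def running_total(rows, partition_key, order_key, value_key):
    """Attach a per-partition running total to every row.

    Conceptually similar to SUM(value) OVER (PARTITION BY ... ORDER BY ...),
    but computed on a plain list of dicts instead of a Spark DataFrame.
    """
    # Sort so that rows of the same partition are adjacent and ordered.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        total = 0
        for row in group:
            total += row[value_key]
            # Unlike a GROUP BY, every input row is kept in the output.
            result.append({**row, "running_total": total})
    return result


# Hypothetical sample data.
sales = [
    {"region": "east", "day": 1, "amount": 10},
    {"region": "west", "day": 1, "amount": 5},
    {"region": "east", "day": 2, "amount": 20},
    {"region": "west", "day": 2, "amount": 7},
]
totals = running_total(sales, "region", "day", "amount")
```

The point of the sketch is that, unlike an ordinary aggregation, a window function returns one result per input row rather than one per group.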

"In addition, we have also implemented many new features for DataFrames, including enriched support for statistics and mathematical functions - random data generation, descriptive statistics and correlations, and contingency tables - as well as functionalities for working with missing data," Wendell said.
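The kinds of operations Wendell lists can be illustrated with a short plain-Python sketch. This is not the DataFrame API: the helper names `drop_missing`, `pearson` and `crosstab`, and the sample fields `visits`, `spend`, `plan` and `churned`, are assumptions made for the example. It shows dropping rows with missing values, a correlation between two columns, and a contingency table of two categorical columns.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def drop_missing(rows, keys):
    """Keep only the rows where every listed field has a value."""
    return [r for r in rows if all(r.get(k) is not None for k in keys)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def crosstab(rows, a, b):
    """Contingency table: count of rows for each (a, b) value pair."""
    return Counter((r[a], r[b]) for r in rows)


# Hypothetical sample data with one missing value.
rows = [
    {"visits": 1, "spend": 2, "plan": "free", "churned": "no"},
    {"visits": 2, "spend": 4, "plan": "paid", "churned": "no"},
    {"visits": 3, "spend": None, "plan": "paid", "churned": "yes"},
    {"visits": 4, "spend": 8, "plan": "free", "churned": "yes"},
]

clean = drop_missing(rows, ["visits", "spend"])
corr = pearson([r["visits"] for r in clean], [r["spend"] for r in clean])
table = crosstab(clean, "plan", "churned")
```

On a single machine these are a few lines of code; the appeal of having them built into DataFrames is that the same operations run on data distributed across a cluster.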

"To make DataFrame operations execute quickly, this release also ships the initial pieces of Project Tungsten, a broad performance initiative which will be a central theme in Spark's upcoming 1.5 release. Spark 1.4 adds improvements to serializer memory use and options to enable fast binary aggregations."

According to Wendell, the machine-learning pipelines API introduced in Spark 1.2, which allows workloads consisting of many steps, is now stable and production-ready.

"This release brings the Python API into parity with the Java and Scala interfaces. Pipelines also add a variety of new feature transformers such as RegexTokenizer, OneHotEncoder, and VectorAssembler, and new algorithms like linear models with elastic-net and tree models are now available within the pipeline API," he said.
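As a rough illustration of what a pipeline of feature transformers does, the sketch below chains three stages in plain Python. It is a conceptual stand-in and not the Spark pipeline API: the functions `regex_tokenize`, `one_hot`, `assemble` and `run_pipeline`, the `CATEGORIES` list and the record fields are all hypothetical, and they only loosely mirror the roles of the transformers named in the quote.

```python
import re

# Hypothetical fixed vocabulary of categories for the one-hot stage.
CATEGORIES = ["blog", "news", "docs"]


def regex_tokenize(text, pattern=r"\W+"):
    """Split lower-cased text on a regex and drop empty tokens."""
    return [t for t in re.split(pattern, text.lower()) if t]


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate scalars and lists into one flat feature vector."""
    out = []
    for part in parts:
        out.extend(part if isinstance(part, list) else [float(part)])
    return out


def run_pipeline(record, stages):
    """Apply each stage in order; every stage returns an enriched record."""
    for stage in stages:
        record = stage(record)
    return record


stages = [
    lambda r: {**r, "tokens": regex_tokenize(r["text"])},
    lambda r: {**r, "category_vec": one_hot(CATEGORIES.index(r["category"]),
                                            len(CATEGORIES))},
    lambda r: {**r, "features": assemble(len(r["tokens"]), r["category_vec"])},
]

result = run_pipeline({"text": "Spark 1.4 adds SparkR", "category": "news"},
                      stages)
```

Each stage reads fields produced by earlier stages and adds a new one, which is the basic reason multi-step workloads are convenient to express as a pipeline.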

Because production Spark programs can be complicated, with workflows consisting of many stages, Spark 1.4 adds visual debugging and monitoring utilities designed to help users understand how Spark apps are running.

For example, an application timeline viewer is now available to show the completion of stages and tasks inside a running program. Spark 1.4 offers a visual representation of the underlying computation graph tied directly to the metrics of physical execution. Visual monitoring also enables users to track the latency and throughput of data streams.

Started in 2009 as a UC Berkeley research project to create a cluster computing framework for workloads poorly served by Hadoop, Spark went open source in 2010. Last year Spark had more than 450 contributors. Its creators went on to found Databricks.

This year's Spark Summit conference takes place next week in San Francisco.
