
DataStax's 4.5 Cassandra fires up Apache Spark in-memory analytics

With DataStax 4.5, the NoSQL company is offering fast analytics through Apache Spark as well as the option to merge Cassandra and Hadoop data.
Written by Toby Wolpe, Contributor

DataStax says the latest version of its Apache Cassandra NoSQL database puts the focus on analytics, offering for the first time in-memory processing via the Apache Spark open-source engine.

The use of Spark in DataStax Enterprise 4.5, now generally available, also means the database offers in-memory analytics in addition to its existing in-memory transactional processing, the company said.

Along with improved visual management tools, and automated diagnostics and performance tuning, DataStax 4.5 is also certified to run on the Cloudera and Hortonworks Hadoop distributions, allowing the integration of operational and historical data.

"What we're bringing to the table with 4.5 are two new analytic options. We're enabling more near real-time capabilities with the Spark integration, so that's number one," DataStax products VP Robin Schumacher said.

"Number two is the need to link up your operational database, your transactional database, with historical Hadoop data warehouses or data lakes. There are times you need to link the two of them together to satisfy certain operational use cases."

Schumacher said online apps often need different analytic tempos: some parts of an application demand fast analytic response times, while others require longer-running sets of analytics.

"They may be programmatic in nature, computational, crunching a bunch of stuff, and they're going to be slower running," he said.

"With Spark we're able to handle the very near real-time analytics. With the external Hadoop capabilities, we're enabling the longer-running, batch analytic nature where you want to be able to link up your operational database with an external Hadoop system, be that on Cloudera or Hortonworks.

"This capability has been present in the relational world for a while. For example, you might join an Oracle table and a SQL Server table together because they're satisfying different use cases, different applications. Now we're bringing that same capability to the modern NoSQL-Hadoop world."

A practical example might be a Hadoop Hive query that joins a Cassandra table with a Cloudera Hive table and returns an analytic result set that either stays on DataStax Enterprise or is transferred to the Hadoop deployment.
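The shape of such a cross-system join can be sketched in a few lines of Python. This is only an illustration under loud assumptions: two in-memory SQLite databases stand in for the operational store and the Hadoop warehouse, and the table and column names (`orders`, `customer_totals`, `recent_spend`) are hypothetical. It does not use Hive, Cassandra, or any DataStax API.

```python
import sqlite3

# Assumption: SQLite stands in for both systems. The main database plays
# the operational (Cassandra-like) store; the attached one plays the
# historical Hadoop/Hive warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

# Hypothetical tables: recent orders live in the operational store,
# lifetime totals live in the warehouse.
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute(
    "CREATE TABLE warehouse.customer_totals "
    "(customer_id INTEGER, lifetime_spend REAL)"
)
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 40.0)],
)
conn.executemany(
    "INSERT INTO warehouse.customer_totals VALUES (?, ?)",
    [(1, 300.0), (2, 1200.0)],
)

# One query joins objects from both stores and returns a single result set.
query = """
    SELECT o.customer_id,
           SUM(o.amount) AS recent_spend,
           h.lifetime_spend
    FROM main.orders AS o
    JOIN warehouse.customer_totals AS h
      ON h.customer_id = o.customer_id
    GROUP BY o.customer_id, h.lifetime_spend
    ORDER BY o.customer_id
"""
rows = conn.execute(query).fetchall()
for customer_id, recent_spend, lifetime_spend in rows:
    print(customer_id, recent_spend, lifetime_spend)
```

The point of the sketch is that the caller writes one query and gets one result set, even though the two tables belong to different stores.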

Schumacher said that, in practical terms, how DataStax Enterprise uses Spark is determined by the admin when specifying the role of each node in a cluster, which can be, for example, transactional, analytic, or search.

"Now one of the options you have for your analytic workloads is that the nodes that handle analytic operations are Spark. You just start up those nodes in Spark mode and you're able to run Spark on top of Cassandra," he said.

"There's no need for HDFS or anything like that. It operates directly on Cassandra data. The end result is much faster response times for analytic queries over what they've had in the past with Hadoop Hive queries and things like that that ran on Cassandra data."

DataStax has been working with Databricks, the commercial company behind Spark. Cassandra and HDFS are the data sources that Spark can target.

In addition to analytics, the other theme running through the DataStax Enterprise 4.5 release is performance and the tools to improve its management, Schumacher said.

"In the past we've enabled various sets of statistics that [people] can monitor but it's not been well organised," he said.

Enterprise 4.5 offers a performance service consisting of a set of diagnostic objects, based on the Cassandra Query Language (CQL), for answering questions such as what problems a cluster has and what is causing them, which objects are running hottest, and which statements are consuming the most resources.

"What we're doing is we're helping two different types of personas when it comes to database troubleshooting. There's going to be the person who wants to work at the command line. They like to write queries and do all that type of thing," Schumacher said.

"So for the command-line people you've got our new performance service. Then for people who want a point-and-click way of doing things there's our OpsCenter 5.0 release."

OpsCenter is a web-based visual management utility for tasks such as creating new clusters, restoring from backups, and system monitoring. The new release is more scalable, so a single installation can support a cluster of up to 1,000 nodes.

Together with tighter security over which clusters individuals can manage and monitor, OpsCenter 5.0 also contains a service that automatically scans clusters for departures from best practice.

"Maybe you haven't configured your security right, maybe you haven't set up your memory configuration parameters optimally. It scans your clusters automatically for you, brings back any deviations it finds in that best practice, and then gives you expert advice on how to fix it," Schumacher said.

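A best-practice scan of the kind Schumacher describes can be pictured as a list of rules, each pairing a check with a piece of advice. The Python below is a minimal sketch under stated assumptions, not how OpsCenter is implemented: the rule names, the configuration keys (`authentication`, `max_heap_mb`), and the 4,096MB threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A hypothetical best-practice rule: a check plus advice if it fails."""
    name: str
    passes: Callable[[dict], bool]
    advice: str


# Assumption: these rules and thresholds are invented for illustration only.
RULES = [
    Rule(
        name="authentication enabled",
        passes=lambda cfg: cfg.get("authentication") is True,
        advice="Turn on authentication for the cluster.",
    ),
    Rule(
        name="heap size configured",
        passes=lambda cfg: cfg.get("max_heap_mb", 0) >= 4096,
        advice="Raise the maximum heap size to at least 4096MB.",
    ),
]


def scan(cluster_config: dict) -> list[tuple[str, str]]:
    """Return (rule name, advice) for every rule the configuration fails."""
    return [
        (rule.name, rule.advice)
        for rule in RULES
        if not rule.passes(cluster_config)
    ]


# Example: security is misconfigured, memory settings are acceptable.
findings = scan({"authentication": False, "max_heap_mb": 8192})
for name, advice in findings:
    print(f"{name}: {advice}")
```

Keeping each rule as data makes it simple to add checks without changing the scanning loop.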
DataStax will be open-sourcing part of the work it has done on Apache Spark.

"The connectivity to Cassandra from Spark, the data type mapping, some performance optimisations that we've made — we're giving all that back to the open-source Spark and Cassandra communities," Schumacher said.

"We're keeping a few things on the commercial side for Spark. Automatic failover in the tool, very easy setup and configuration, point-and-click management and things like that are some of the things we're retaining inside DataStax Enterprise — and certification between Spark and Cassandra."
