DataStax Brisk: Marrying big data tools Hadoop and Apache Cassandra

What happens when a company marries big data and analytics? DataStax Brisk.
Written by Dan Kusnetzky, Contributor

A while ago, I had a chance to speak with some folks from DataStax: Matt Pfeil, CEO and co-founder; Ben Werther, VP of Products; and Michael Weir, VP of Marketing. It was an interesting discussion of what DataStax was announcing and an exploration of one of the newer catchphrases, NoSQL, and what it really means for organizations.

What was DataStax announcing?

DataStax was announcing Brisk, a new distribution that enhances the Hadoop and Hive platform with scalable low-latency data capabilities.

DataStax's goal was to produce a single platform that can act as the low-latency database for extremely high-volume web and real-time applications while providing tightly coupled Hadoop and Hive analytics.

What does that mean? It means having the tools both to support extremely fast applications that access and update large, rapidly changing stores of data and to apply powerful analytics that reveal how that data is being used.

Here's what DataStax has to say about Brisk:

DataStax's Brisk is an enhanced open-source Hadoop and Hive distribution that uses Cassandra for many of its core services. Brisk provides integrated Hadoop MapReduce, Hive, and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by Cassandra. It also exposes the full power of Cassandra for real-time applications. The result is a single integrated solution that offers increased reliability, simpler deployment and lower TCO than traditional Hadoop solutions.

A key benefit of DataStax's Brisk is the tight feedback loop it allows between real-time applications and the analytics that follow. Traditionally, users have been forced either to move data between systems via complex ETL processes or to perform both functions on the same system, with the risk of one workload impacting the other.
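As a rough illustration of that feedback loop, consider this conceptual Python sketch. It is not Brisk's actual API; the in-memory dictionary is a stand-in for the shared Cassandra store, and the point is simply that the real-time write path and the analytic read path operate on the same live data, with no ETL step between them:

```python
from collections import defaultdict

# Stand-in for the shared Cassandra-backed store: the real-time
# application and the analytics both work against the same data.
page_views = defaultdict(int)

def record_view(page: str) -> None:
    """Real-time path: the web application records a page view."""
    page_views[page] += 1

def top_pages(n: int) -> list:
    """Analytics path: a Hive-style aggregate over the same live data."""
    return sorted(page_views.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Simulated traffic, followed immediately by analysis of the latest data.
for page in ["/home", "/cart", "/home", "/home", "/cart", "/about"]:
    record_view(page)

print(top_pages(2))  # insights can feed straight back into application behavior
```

In a traditional split deployment, `top_pages` would run against a separate warehouse loaded by a nightly ETL job; the unified model is what lets the insight reach the application immediately.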

DataStax's Brisk uses:

  • High-volume websites – Provide real-time data access and storage for millions of simultaneous users. Directly perform Hive analysis on the latest data, and immediately feed analytic insights back into the application behavior.
  • Finance and capital markets – Process, store and trigger actions based on a high-volume real-time event stream. Perform analytics on historical data, and push updated models directly into the application.
  • Retail – Maintain real-time summaries and aggregates to allow a continuously up-to-date view of important business metrics. Alert when anomalies occur.
  • High-volume event processing – Track and react instantly to millions of sensors or other distributed feeds, while allowing deeper analytic questions to be asked of the historical data at any moment.
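The retail and event-processing cases above share a pattern: a continuously updated aggregate plus an alert when a new value looks anomalous. Here is a minimal Python sketch of that pattern, assuming a naive rule (flag any value that exceeds the recent mean by a fixed factor); production systems would use far more robust statistics:

```python
from collections import deque

class RunningMetric:
    """Continuously updated aggregate with a naive anomaly check:
    flag a new value that exceeds the recent mean by a fixed factor."""

    def __init__(self, window: int = 10, threshold: float = 3.0):
        self.recent = deque(maxlen=window)  # rolling window of recent values
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Record a value; return True if it looks anomalous."""
        anomalous = bool(self.recent) and \
            value > self.threshold * (sum(self.recent) / len(self.recent))
        self.recent.append(value)
        return anomalous

# Hypothetical retail metric: orders per minute, with one sudden spike.
orders_per_minute = RunningMetric()
readings = [100, 105, 98, 102, 500]
alerts = [orders_per_minute.update(v) for v in readings]
print(alerts)  # only the spike trips the alert
```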

Snapshot analysis

In my article, What is "Big Data?", I examined the concept of "Big Data" and the impact of extreme transaction processing, extremely large databases that change very rapidly, and the need to manage both structured and unstructured data efficiently. Here's a segment of that article:

In simplest terms, the phrase refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities. Does this mean terabytes, petabytes or even larger collections of data? The answer offered by these suppliers is “yes.” They would go on to say, “you need our product to manage and make best use of that mass of data.” Just thinking about the problems created by the maintenance of huge, dynamic sets of data gives me a headache.

An example often cited is how much weather data is collected on a daily basis by the U.S. National Oceanic and Atmospheric Administration (NOAA) to aid in climate, ecosystem, weather and commercial research. Add that to the masses of data collected by the U.S. National Aeronautics and Space Administration (NASA) for its research and the numbers get pretty big. The commercial sector has its poster children as well. Energy companies have amassed huge amounts of geophysical data. Pharmaceutical companies routinely munch their way through enormous amounts of drug testing data. What about the data your organization maintains in all of its datacenters, regional offices and on all of its user-facing systems (desktops, laptops and handheld devices)?

Large organizations increasingly face the need to maintain large amounts of structured and unstructured data to comply with government regulations. Recent court cases have also led them to keep large masses of documents, email messages and other forms of electronic communication that may be required if they face litigation.

If we consider the need for real-time analytics on how this data is used, we get to the heart of what DataStax is doing.

Marrying two open source projects isn't an easy task. The folks at DataStax seem to be up to the challenge, and Brisk is the result. If your organization is working with Big Data, DataStax could be an interesting partner to have on your team.
