John Bantleman, CEO, and Deirdre Mahon, VP of marketing, of RainStor, introduced me to some enhancements the company was just about to announce. The goal was to make Hadoop easier to use for corporate developers, improve the performance of Hadoop and also dramatically reduce the number of systems needed to process Hadoop-based analytics.
Before we get into what RainStor had to say, let's take a moment to look at Hadoop.
What is Hadoop?
Hadoop is a set of Apache open source projects that has attracted quite a bit of interest recently. Hadoop is mentioned almost every time the catch phrase "Big Data" is discussed. It has had a strong impact on organizations needing to analyze huge volumes of rapidly changing data.
The Apache Software Foundation describes Hadoop in the following way:
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these subprojects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper™: A high-performance coordination service for distributed applications.
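The "simple programming model" behind Hadoop MapReduce can be illustrated with a small sketch. The following is plain Python running in a single process, not Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the shape of the idea: a map step emits key/value pairs, the framework groups them by key, and a reduce step combines each group.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["big data big clusters", "data on clusters"]
    print(word_count(lines))
```

On a real cluster, the map and reduce steps would run in parallel on many machines against data stored in HDFS, with the framework handling distribution and failures.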
What did RainStor have to say?
RainStor claims to have added "the first enterprise database running natively on Hadoop." Furthermore, the company states that its product enables faster, more flexible analytics on multi-structured data, without the need to move data out of the Hadoop Distributed File System (HDFS) environment.
RainStor has added the following enhancements to the Hadoop environment:
- RainStor has added compression technology that can reduce the size of Hadoop data sets by up to 40 times. According to RainStor, the compressed multi-structured data set running on HDFS improves overall processing efficiency and reduces the size of clusters by 50 to 80 percent. This one factor, the company points out, would significantly lower operating costs.
- The company has provided SQL access to Hadoop so that it can be used alongside the more traditional MapReduce access mechanism. RainStor claims performance improvements of 10 to 100 times for analytic applications.
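To put the vendor's figures in perspective, here is a rough back-of-the-envelope sketch. The 400 TB data set and the 100-node cluster are hypothetical numbers chosen only for illustration; the 40x ratio and the 50 to 80 percent range are RainStor's own claims, and "up to 40 times" is a best case, not a typical result.

```python
def compressed_size_tb(raw_tb, compression_ratio):
    """Size of a data set after compression at the given ratio (e.g. 40 for 40x)."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_tb / compression_ratio


def nodes_after_reduction(current_nodes, reduction_percent):
    """Nodes remaining after shrinking a cluster by a whole-number percentage."""
    if not 0 <= reduction_percent <= 100:
        raise ValueError("reduction_percent must be between 0 and 100")
    remaining = current_nodes * (100 - reduction_percent)
    # Integer ceiling division, so a partial node is rounded up to a whole one.
    return -(-remaining // 100)


if __name__ == "__main__":
    raw_tb = 400   # hypothetical raw data set size in terabytes
    nodes = 100    # hypothetical current cluster size

    # Best case for the claimed "up to 40 times" compression.
    print(compressed_size_tb(raw_tb, 40))

    # Claimed range of cluster reduction: 50 to 80 percent.
    print(nodes_after_reduction(nodes, 50), nodes_after_reduction(nodes, 80))
```

Under these assumptions, the hypothetical cluster would shrink to somewhere between 20 and 50 nodes, which is where the claimed operating cost savings would come from.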
If your organization is using Hadoop or thinking about using Hadoop for business analytics, it would be worth the time to talk with RainStor.