Real-time Hadoop analytics: ScaleOut turns on in-memory tech

Open-source big-data platform Hadoop excels at batch-mode processing at scale, but it was never designed for real-time analytics. ScaleOut Software has middleware that it thinks addresses the issue.
Written by Toby Wolpe, Contributor

ScaleOut Software says its new in-memory processing middleware addresses a gap in Hadoop's capabilities by bringing real-time analytics to the open-source distributed computing software.

Conventionally, Hadoop is used to analyse large static offline datasets from an historical perspective. ScaleOut's hServer software, which runs on commodity servers, enables a Hadoop cluster to work on live data held in memory rather than on disk.

ScaleOut expects the hServer Hadoop technology to find uses in areas such as equity trading, e-commerce, reservations systems, and credit-card fraud detection.

Updating data while Hadoop executes

According to ScaleOut CEO Bill Bain, with hServer, the analytics capability — the MapReduce algorithm — is used not just to analyse the data but also to update that data in parallel.

"This is an important new capability that the traditional analytics community does not consider. They're looking at a static dataset and seeing what they can mine out of it," Bain said.

"But the thought of actually using that same set of algorithms to manage that dataset and update it and keep it current as inputs come in, that's a new capability that's closer to high-performance computing," he said.

COO Dave Brinker said ScaleOut's approach also differed from complex-event processing, which allows organisations to look at a stream of data as it passes and pick out trends.

"That technology is out there. What we're bringing is the ability to analyse a complete set of data that's changing rapidly. We're able to look at operational data like an entire set of stock portfolios, for example, as they're changing and analyse that entire dataset as a whole," he said.
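The idea Bain and Brinker describe, a MapReduce-style pass that both analyses and updates a complete in-memory dataset, can be illustrated with a minimal sketch. It is plain Python rather than hServer or Hadoop code, and the portfolio records, prices, and function names are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory dataset: portfolio id -> holdings and last computed value.
portfolios = {
    "p1": {"holdings": {"AAA": 10, "BBB": 5}, "value": 0.0},
    "p2": {"holdings": {"AAA": 2}, "value": 0.0},
    "p3": {"holdings": {"BBB": 8, "CCC": 1}, "value": 0.0},
}


def map_portfolio(item, prices):
    """Map step: revalue one portfolio in place and emit (symbol, exposure) pairs."""
    _, record = item
    pairs = [(sym, shares * prices[sym]) for sym, shares in record["holdings"].items()]
    # The "update" half: the record is kept current as part of the same pass.
    record["value"] = sum(amount for _, amount in pairs)
    return pairs


def reduce_exposure(mapped):
    """Reduce step: total exposure per symbol across the whole dataset."""
    totals = defaultdict(float)
    for pairs in mapped:
        for sym, amount in pairs:
            totals[sym] += amount
    return dict(totals)


def run_pass(data, prices, workers=4):
    """Run one analyse-and-update pass; each record is touched by one worker only."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda item: map_portfolio(item, prices), data.items()))
    return reduce_exposure(mapped)


# Hypothetical latest prices arriving as input.
prices = {"AAA": 10.0, "BBB": 20.0, "CCC": 5.0}
exposure = run_pass(portfolios, prices)
```

After the pass, `exposure` holds an aggregate over every portfolio, and each record's `value` field has been refreshed by the same computation. A real deployment would spread the records across servers rather than threads.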

ScaleOut's Bill Bain said hServer is based on a MapReduce style of computation that provides answers in one to five seconds. "On the other hand, standard Hadoop is giving you answers in minutes or even hours, so we're coming down from minutes and hours to seconds," he said.

"But there are technologies that are millisecond and also other technologies — algorithmic trading platforms — that will give you answers in microseconds. There are many orders of magnitude. Real time means different things to different people."
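To put those "orders of magnitude" in numbers, the short calculation below compares a few representative latencies. The specific points (one hour, one minute, one second, and so on) are illustrative assumptions chosen from the ranges Bain quotes, not measurements.

```python
import math

# Representative latencies in microseconds (illustrative assumptions only).
LATENCY_US = {
    "standard Hadoop (one hour)": 3_600_000_000,
    "standard Hadoop (one minute)": 60_000_000,
    "hServer (one second)": 1_000_000,
    "millisecond technologies": 1_000,
    "algorithmic trading (one microsecond)": 1,
}


def orders_of_magnitude(slower_us, faster_us):
    """Return how many powers of ten separate two latencies."""
    return math.log10(slower_us / faster_us)


hour_to_second = orders_of_magnitude(
    LATENCY_US["standard Hadoop (one hour)"], LATENCY_US["hServer (one second)"]
)
second_to_microsecond = orders_of_magnitude(
    LATENCY_US["hServer (one second)"],
    LATENCY_US["algorithmic trading (one microsecond)"],
)
```

On these assumed figures, moving from an hour to a second is a gain of about three and a half orders of magnitude, while a second is still six orders of magnitude slower than a microsecond.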

In-memory computing

Massimo Pezzini, vice president and Gartner Fellow, said in-memory computing technologies — such as in-memory data grids, but also complex-event processing platforms and in-memory DBMS — are effective at addressing the velocity aspect of big data.

"In-memory computing and big data can potentially drive a lot of benefits in terms of mixing real-time and historical analysis for better operational and strategic business insights," he said.

ScaleOut's Dave Brinker said the issue with Hadoop is that the data is stored on disk in the Hadoop Distributed File System (HDFS).

"It's not by its nature an in-memory kind of system. What we're doing with [hServer] is we're allowing Hadoop to access data from our in-memory data grid," he said.

ScaleOut's Bill Bain conceded that other in-memory data grids had connectors to Hadoop but argued that hServer, which automatically caches HDFS data in the grid, is integrated more deeply.

"Their level of connectivity is more superficial than what we're offering. This transparent HDFS cache in addition to the ability to update data live while it is being analysed by Hadoop — I think those two capabilities may be unique to our offering," he said.
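The article does not describe how hServer's HDFS cache is implemented. As a rough illustration of the general idea of a transparent cache, the sketch below shows a generic read-through cache in plain Python: the first read of a block goes to the slow store, and later reads are served from memory. The class, the `slow_store_read` function, and the block names are hypothetical stand-ins, not part of any ScaleOut or Hadoop API.

```python
class ReadThroughCache:
    """Serve blocks from memory, loading from the slow store only on a miss."""

    def __init__(self, load_block):
        self._load_block = load_block  # hypothetical stand-in for a disk read
        self._blocks = {}              # in-memory copies keyed by block id
        self.misses = 0                # how many reads had to go to the slow store

    def read(self, block_id):
        if block_id not in self._blocks:
            self.misses += 1
            self._blocks[block_id] = self._load_block(block_id)
        return self._blocks[block_id]


def slow_store_read(block_id):
    """Hypothetical stand-in for reading a block from disk-based storage."""
    return f"contents of {block_id}"


cache = ReadThroughCache(slow_store_read)
first = cache.read("block-0")   # miss: loaded from the slow store
second = cache.read("block-0")  # hit: served from memory
```

The caller only ever calls `read`, which is what makes the cache transparent: the code doing the analysis does not need to know whether a block came from memory or from disk.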

hServer's first release

The first release of hServer consists of Grid Record Reader, which provides Hadoop with access to data held in the in-memory data grid, and Dataset Record Reader, which caches HDFS data in the grid.

As well as providing the commercial version, ScaleOut will be offering the proprietary software in a free community edition that can be used for evaluation and production purposes on up to four servers with dataset sizes of up to 256GB. It is also starting a community forum for discussion on hServer and other related topics.

"The Hadoop community is very open source-oriented and we want to be friendly to that community and do things as close as we can to the way they're used to using products," said ScaleOut's Dave Brinker.

"The API libraries will be open source," Brinker added.
