1,000-node big data workbench to crunch analytics' toughest problems

EMC launches Greenplum Analytics Workbench to analyse petabyte-scale datasets and pave the way for future big data platforms

EMC has unveiled a system to analyse petabyte-scale datasets and help develop the next generation of big data analytics platforms.

The Greenplum Analytics Workbench, which was revealed at EMC World 2012 in Las Vegas on Tuesday, will be available free of charge for analysis of extremely large data volumes.

It can process both structured and unstructured data using the open-source data analytics system Hadoop and EMC's Greenplum Database, a heavily customised version of the open-source PostgreSQL database that can carry out massively parallel processing.

Hadoop is suited to analysing petabyte-scale datasets because it splits a job across the cluster, with each node processing its own share of the data in parallel.
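As a rough illustration of that split-and-combine pattern, the Python sketch below counts words across several partitions and then merges the partial results. It is not Hadoop code: the function names are hypothetical, and threads on one machine stand in for the nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_partition(lines):
    """Map step: count words in one partition, as a single node would."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, nodes=4):
    """Split the input into `nodes` partitions and process them in parallel.

    Threads are only a stand-in for cluster nodes (an assumption made for
    illustration); a real Hadoop job distributes both data and work
    across separate machines.
    """
    partitions = [lines[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_counts(partials)
```

Calling `word_count(["big data", "big cluster", "data data"], nodes=2)` returns a counter with `data` at 3, `big` at 2 and `cluster` at 1. Because each partition is counted independently, adding nodes shortens the map step without changing the result.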

Each node in the 1,000-node workbench cluster has two Intel X5670 processors, 24TB of storage and 48GB of RAM. The workbench uses 10/40GbE and FDR 56Gb/s InfiniBand interconnects provided by Mellanox Technologies, along with Mellanox's Unstructured Data Accelerator software, which speeds up Hadoop jobs.
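Multiplying the per-node figures out gives a sense of the cluster's scale. The short Python sketch below does that arithmetic using decimal units (1PB = 1,000TB). The replication factor of three is Hadoop's usual HDFS default and is an assumption here, since EMC has not said how the workbench is configured.

```python
NODES = 1_000
CPUS_PER_NODE = 2
STORAGE_TB_PER_NODE = 24
RAM_GB_PER_NODE = 48


def cluster_totals(nodes=NODES, replication=3):
    """Aggregate the per-node figures quoted for the workbench.

    `replication` is an assumed HDFS replication factor, not a figure
    from EMC; it only affects the usable-storage estimate.
    """
    raw_storage_pb = nodes * STORAGE_TB_PER_NODE / 1000
    return {
        "processors": nodes * CPUS_PER_NODE,
        "raw_storage_pb": raw_storage_pb,
        "usable_storage_pb": raw_storage_pb / replication,
        "ram_tb": nodes * RAM_GB_PER_NODE / 1000,
    }
```

With the defaults, `cluster_totals()` gives 2,000 processors, 24PB of raw storage and 48TB of RAM. If data were stored in triplicate, about 8PB of that storage would hold unique data.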

Scott Yara, senior VP for products at Greenplum, said there is already a long queue of organisations waiting to get time on the machine: "They span healthcare research, manufacturing, drilling, mining, fraud detection in financial services - there's a lot of really advanced use cases."

Results from the workbench analysis will be made available to the open-source Hadoop community, and will be used to inform future development of Hadoop and converged Hadoop/SQL analytics platforms.

"We're really trying to look for use cases that either require access to a large-scale of compute nodes or that pushes the limit in terms of analytic work that's been done before," said Yara.

The Workbench runs on hardware operated by EMC and is accessible via the cloud. The announcement puts an end to years of rumours that EMC was planning to take its Greenplum analytics into the cloud, though the cluster functions more like a rentable supercomputer than a provisionable service.