
Greenplum takes the open source road to datawarehousing

Yesterday I talked with executives from a company making the transition from selling a proprietary enterprise application to embracing the open source way. The newly minted Greenplum is actually a new iteration of Metapa, a company with an application for data distribution and query execution across a large cluster of commodity Linux machines to boost database performance on large tables.
Written by Dan Farber
 
With a fresh round of investment and some new management, Greenplum is taking the open source plunge for business, rather than religious, reasons. "Taking the open source route is a terrific opportunity with regard to broadening our audience and marketing. We can leverage the technical contributors and the marketing power of the community," Luke Lonergan, CTO and co-founder of Metapa and Greenplum, told me. "The free, open source version expands our community of users, and has possible support revenue, similar to the Red Hat model," he added. But the major financial benefit will come from customers with higher-end needs who download and try the free version and then upgrade to the commercial version, which starts at $100,000 and can run into the millions. Lonergan said that the company has a direct sales force to cater to high-end customers.

Available on Monday, the free datawarehousing product, DeepGreen PostgreSQL (named after IBM's Deep Blue chess-playing supercomputing cluster), is built on top of the open source PostgreSQL database and runs on Linux and OpenSolaris on x86 systems. It uses a BSD-style license and operates best in a single-server environment with 10 to 300 gigabyte workloads.

Greenplum is also sponsoring the Bizgres open source project (www.bizgres.org won't be live until Monday), which is focused on making PostgreSQL a leading database for gleaning business intelligence from data. "If you are trying to establish a de facto standard in the software industry, one of the best ways to do it is with a BSD or other liberal license," said Josh Berkus, a steering committee member of the PostgreSQL community. "Then other companies can improve on the stuff and it can become a de facto standard. Greenplum and Fujitsu, for example, don't plan to make money on PostgreSQL, but they can make money on top of it."

Berkus is working half-time for Greenplum, which he said is contributing code, expertise and his time to work on scalability issues and new features, such as an interactive performance configuration utility. Greenplum has already contributed an improved bulk data loader that works over a network, Berkus said. Lonergan said that Greenplum is also funding some I/O enhancements to PostgreSQL.

The commercial product, DeepGreen MPP (Massively Parallel Processing), is built for multi-terabyte environments and based on the open source version and supercomputing concepts. It includes proprietary, secret sauce features, including the clustering technology and massively parallel interconnects for dealing with complex join patterns and a database optimizer. General availability is scheduled for July.

Greenplum claims that DeepGreen will bring the power of open source to business intelligence. To be clear, the company doesn't provide the analytical tools associated with BI applications. Instead, it is beginning to integrate with tools from companies like MicroStrategy. Greenplum also claims that it can outperform the enterprise datawarehousing giants Teradata and Tandem in cost per terabyte benchmarks. "We are 10 to 50 times faster and cheaper than anything on the market," company officials said. Philip Howard of Bloor Research is dubious about those claims. In a recent article, he wrote about the 10x-50x faster claim:

It was also exactly the same claim that was made by Metapa. Unfortunately, no-one put forward any evidence to justify this and I don't believe it. Metapa presumably didn't manage to convince anyone of this otherwise why did the company become GreenPlum?
Frankly, this is crass. The whole 10-50x faster and cheaper thing is ridiculous. Ten plus times cheaper than SQL Server? Ten plus times faster than Teradata or WhiteCross? Ten plus times cheaper and faster than DataAllegro? I don't think so.

Lonergan responded to Howard's critique via e-mail:

The strength of our solution is that it provides customers the choice of hardware to create their own version of “best bang for the buck”.  We have provided some sample configurations that achieve either deep, or fast solutions, based on customer needs.  In all cases, our customers benefit from the MPP architecture that enables speed increases of 10-50x over comparable shared disk approaches, at prices that are far below MPP appliance vendors.
The price per terabyte for deep and fast configurations is based on “net usable” database storage, including the effect of increased data volume due to mirroring and indexes.  The price per terabyte ranges from $15K/TB to $225K/TB and includes estimated hardware and software costs.
The fast configuration provides 96 x 72 GB SCSI hard drives for a total of 6TB of gross storage, and we use approximately 1/3 of that for primary access, yielding 2TB usable database capacity per rack.  The estimated hardware cost is $128K (from Dell) and software is $320K giving a price per usable Terabyte of $450K / 2TB = $225K/TB.
By contrast, the deep configuration as chosen by a customer in early trials of our software uses a system from Rackable that has 16 SATA drives per server for a total of 128 drives per rack.  They also chose a 400GB capacity per drive, leading to a usable capacity (with indexes and mirroring) of 2TB per server.  With ten servers per rack, the hardware plus software price is $300K and rack capacity is 20TB, for a price per TB of $15K/TB.
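
As a rough check on the arithmetic in Lonergan's e-mail, the price per usable terabyte is simply the combined hardware and software cost divided by the usable capacity. The short Python sketch below reproduces both figures from the numbers quoted above. The function and variable names are illustrative assumptions of this sketch and do not come from Greenplum. Note that the fast configuration's stated costs sum to $448K, which the e-mail rounds to $450K before dividing.

```python
def price_per_usable_tb(total_cost_k, usable_tb):
    """Return the price per usable terabyte, in thousands of dollars.

    total_cost_k: combined hardware and software cost, in $K
    usable_tb:    usable database capacity after mirroring and indexes, in TB
    """
    if usable_tb <= 0:
        raise ValueError("usable capacity must be positive")
    return total_cost_k / usable_tb


# Fast configuration: $128K hardware plus $320K software, 2 TB usable per rack.
fast_total_k = 128 + 320
fast_price_k = price_per_usable_tb(fast_total_k, 2)

# Deep configuration: $300K hardware plus software, 20 TB usable per rack.
deep_price_k = price_per_usable_tb(300, 20)

print(f"fast: ${fast_price_k:.0f}K/TB")
print(f"deep: ${deep_price_k:.0f}K/TB")
```

The unrounded fast figure comes out at $224K/TB, in line with the quoted $225K/TB, and the deep figure matches the quoted $15K/TB.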

Customers will ultimately sort out whether Greenplum's claims hold water. In any case, Greenplum represents a growing trend of companies leveraging the open source world to seed a market and deliver differentiation on top of commodity systems and "free" software. If Greenplum continues to give back enough to the community, the commercial interests and open source dynamic can nicely co-exist. If not, then I would place my bet on the open source community...
