Back in July, Data Warehouse vendor ParAccel announced it had a new investor: Amazon. Then yesterday, Amazon announced its new cloud Data Warehouse as a service offering: Redshift. And, none too surprisingly, it turns out that Redshift is based on ParAccel’s technology. I spoke to Rich Ghiossi and John Santaferraro, ParAccel’s VPs of Marketing and Solutions/Product Marketing, respectively, who explained some of the subtleties to me and helped me think through some others.
- Also read: Amazon announces “Redshift” cloud data warehouse, with Jaspersoft support
- Also read: Amazon Web Services launches Redshift, datawarehousing as a service
We don't need no stinkin' appliances
ParAccel takes a rather radical approach compared to other vendors in the Massively Parallel Processing (MPP) Data Warehouse category: the company designed its software to run on commodity hardware. Most MPP vendors (including Teradata, HP/Vertica, IBM/Netezza, EMC/Greenplum and Microsoft) either sell their MPP products only in the form of appliances, or are owned by hardware or storage companies that may prefer to sell them that way. Inside those MPP appliance cabinets, typically, lies a cluster of finely tuned server, storage and networking hardware. It’s an optimized, high-performance approach to data warehousing. It’s also expensive, and it keeps certain customers out. ParAccel decouples MPP technology from expensive appliance hardware.
Down with false choices
Hadoop, of course, takes the commodity hardware approach as well. And that likely accounts for its runaway popularity as a Big Data platform. But MPP is big data technology too, as I’ve said many times before.
The problem with Hadoop, though, is that its native query mechanism is MapReduce code, rendering it incompatible with the massive product and skillset ecosystem around SQL. Over the last several months, vendors such as Cloudera and Microsoft have sought a fusion of SQL and Hadoop. Other vendors, like Rainstor and Hadapt, have been pursuing that fusion for a while.
But why hybridize SQL with Hadoop, when MPP data warehouses that can handle Petabyte-scale big data workloads use SQL natively? Chiefly, the reason has been that MPP carried the appliance barrier to entry, so you had to choose between SQL on an appliance and Hadoop on commodity hardware. ParAccel smashed that dichotomy, but the company is still growing and so, for many, the dichotomy has stood.
But Amazon is attacking that dichotomy further, because now ParAccel-based, petabyte-scale MPP technology is elastic. It’s available in the cloud, on-demand, running on a cluster sized according to your needs. You don’t have to build the cluster, and you don’t have to provision the hardware.
Appliances only scale up to what’s inside them, and that may be a lot more than needed initially. As far as elasticity goes, that’s the worst of both worlds. With Redshift, and these are Amazon’s own words, "Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime."
This opens up all sorts of scenarios. Amazon claims the cost of Redshift is under $1000 US per Terabyte, per year. So many organizations could quite easily keep their core data warehouse in the cloud. But Redshift seems to lend itself to ephemeral use too: why spin up an Elastic MapReduce Hadoop cluster to analyze your data when you can spin up an MPP data warehouse (that your existing BI tools can query) just as easily?
On-prem, and off
Of course, $1000/TB/year means you’ll be paying close to $1 million/year for a Petabyte data warehouse. But when you factor in the hardware, storage, personnel/management, power and other costs of running such a large warehouse on premise, that ain’t so bad. If you’re really working at Petabyte-scale, that number shouldn’t bother you.
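For anyone who wants to plug in their own numbers, here is a back-of-the-envelope sketch of that arithmetic. It assumes a flat rate of $1000 per Terabyte per year (the ceiling of Amazon's "under $1000" claim) and decimal units, so 1 Petabyte = 1000 Terabytes. The function and constant names are mine, purely for illustration, and real pricing will depend on node types and commitment terms.

```python
# Assumption: flat $1000 per TB per year, the ceiling of the quoted
# "under $1000" figure, so results are an upper bound, not a quote.
PRICE_PER_TB_YEAR_USD = 1000

# Assumption: decimal storage units (1 PB = 1000 TB).
TB_PER_PB = 1000


def annual_cost_usd(terabytes, price_per_tb_year=PRICE_PER_TB_YEAR_USD):
    """Return the upper-bound yearly cost in US dollars for a warehouse of the given size."""
    return terabytes * price_per_tb_year


if __name__ == "__main__":
    # A small, a mid-sized and a Petabyte-scale warehouse.
    for size_tb in (2, 100, TB_PER_PB):
        print(f"{size_tb:>5} TB -> ${annual_cost_usd(size_tb):,} per year")
```

At these assumed rates the cost scales linearly, which is why the Petabyte case lands at roughly $1 million a year.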
Does that mean on-premise MPP data warehouses are passé? I wouldn’t say so. First, there’s the issue of bandwidth constraints on data movement that I cited in my news piece on Redshift yesterday. But second, the full on-premise ParAccel product includes features like On-Demand Integration Services, extensibility, user-defined functions, embedded analytics and certain optimizations that Redshift doesn’t offer.
This is definitely a case of "use the right tool for the right job." But the appliance-shy, who have been trying to run their data warehouses on conventional, non-MPP relational databases and have found performance lacking, now have some choices, including the ability to try-before-they-buy by using Redshift in the cloud.
And which conventional relational database might Amazon wish customers to "shift" that warehouse from? Well, there’s a big one that uses a lot of red in its logo. Just sayin’.