Greenplum 6 ventures outside the analytic box

Pivotal’s Greenplum database is about to finally align with the open source PostgreSQL project. What will that mean for the platform?


It's about six months early, but Pivotal is talking about Greenplum version 6. It's a milestone release, as v6 is the one that will finally put the Greenplum database in full sync with the open source PostgreSQL trunk. And in turn, that has freed the development team to spread its wings to cover ground outside Greenplum's traditional MPP analytics footprint. At the Postgres Conference in New York this week, the company provided a peek into the roadmap for the next version, which is currently scheduled for release in September.

A prime advantage of being on the PostgreSQL trunk is that Pivotal doesn't have to reinvent the wheel each time a version changes. For v6, that cleared the way for the Greenplum database to add some transactional features. Analytic workloads can now draw on transaction-oriented features such as block range indexes, which can shorten data lookup times, or "upsert" for real-time conditional updates and inserts.
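To make those two features concrete, here is a minimal conceptual sketch in Python. It is not how PostgreSQL or Greenplum implements either feature, and every name in it (`BlockSummary`, `build_block_range_index`, `range_lookup`, `upsert`) is hypothetical. The first part shows the idea behind a block range index: keep only a minimum and maximum per block of rows, then skip any block whose range cannot contain the value being searched for. The second part shows upsert semantics: update the row if the key already exists, otherwise insert it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class BlockSummary:
    """Min/max summary for one contiguous block of rows (hypothetical structure)."""
    start: int  # position of the first row in the block
    end: int    # position one past the last row in the block
    lo: int     # smallest value stored in the block
    hi: int     # largest value stored in the block


def build_block_range_index(values: List[int], block_size: int) -> List[BlockSummary]:
    """Summarize each block of `block_size` rows by its min and max value."""
    summaries = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        summaries.append(BlockSummary(start, start + len(chunk), min(chunk), max(chunk)))
    return summaries


def range_lookup(values: List[int], index: List[BlockSummary],
                 lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] plus the number of blocks actually scanned."""
    matches: List[int] = []
    blocks_scanned = 0
    for summary in index:
        # Skip blocks whose min/max range cannot overlap the query range.
        if summary.hi < lo or summary.lo > hi:
            continue
        blocks_scanned += 1
        matches.extend(v for v in values[summary.start:summary.end] if lo <= v <= hi)
    return matches, blocks_scanned


def upsert(table: Dict[str, dict], key: str, new_row: dict,
           merge: Callable[[dict, dict], dict]) -> None:
    """Insert `new_row` under `key`, or merge it into the existing row."""
    if key in table:
        table[key] = merge(table[key], new_row)
    else:
        table[key] = new_row
```

The lookup pays off most when values arrive in roughly sorted order, such as timestamps on incoming readings, because each block then covers a narrow range and most blocks can be skipped. In PostgreSQL itself, the corresponding features are BRIN indexes and `INSERT ... ON CONFLICT`.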

This is part of a broader trend for databases to add overlapping capabilities. While Microsoft, with Cosmos DB, is the poster child for the multi-model database, on a more modest level, transaction platforms like Oracle and SQL Server have long handled mixed workloads. Even Amazon, which focuses on fit-for-purpose databases, has added some light analytics capabilities to its Aurora transaction platform with parallel query.

The real reason for Greenplum to add some transaction support is not to turn it into an Oracle or SQL Server replacement for back office financial applications, but instead for IoT. The database is still an analytic column store, but it supports faster reads and writes to make it suited for operational analytics. For IoT, the benefits are compounded with Apache Kafka support. Kafka support enables Greenplum to exploit its massive parallelism to process incoming IoT streams that have requirements for real-time processing and scale.

Another major enhancement for v6 is Kubernetes support. It provides the means for simplifying deployment in private cloud environments. It handles provisioning (and deprovisioning), installation of packages, scaling, and recovery – essentially giving the firing up of a cluster the convenience of a self-service cloud.

Containerization is starting to come to databases, but most confine it to putting the entire database in a container. Greenplum's container support is far more granular: you can containerize "segments" that are logically isolated workloads and groups of resources. The notion of supporting isolation within Greenplum is not new; it had equivalent support for Linux control groups, so you could logically isolate multiple workloads across a cluster. For Kubernetes, Pivotal had to develop an operator for configuring stateful workloads because neither the Kubernetes nor the Postgres open source community had yet stepped up to the plate to develop one (Kubernetes has been more associated with stateless workloads up until now). Pivotal claims that the operator it developed could be generalized for PostgreSQL.

The upcoming version of Greenplum adds more machine learning support, and clears the way for deep learning. Apache MADlib, the open source machine learning library project that Pivotal has led, has added new support for Keras with TensorFlow as the back end, and also adds GPU support. There are new capabilities for version management of models and comparing the performance of different models. Combined with containerization, it facilitates deployment techniques such as champion/challenger or canaries.

Getting on the PostgreSQL trunk will accelerate the onramping of new features to Greenplum. It is still a work in progress; the current development version has gotten to PostgreSQL 8.4 (which adds column-level permissions), but the goal is to get Greenplum 6 up to PostgreSQL 9 when it goes GA. That will add hot standby and streaming replication, among other goodies. But PostgreSQL is now getting to v11, which AWS has just released on its RDS service. That means that goodies like table-level partitioning and hash partitioning – which ease load balancing – will have to wait, but probably not that long.
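For readers wondering why hash partitioning eases load balancing, the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PostgreSQL's implementation: the function names and the `device-N` keys are hypothetical, and CRC32 stands in for whatever hash function a real database would use. Each row's key is hashed and the hash modulo the partition count picks the partition, so keys spread across partitions without anyone having to choose range boundaries by hand.

```python
import zlib
from collections import Counter
from typing import Iterable


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys: Iterable[str], num_partitions: int) -> Counter:
    """Count how many keys land in each partition."""
    return Counter(partition_for(key, num_partitions) for key in keys)
```

Calling `partition_counts((f"device-{i}" for i in range(1000)), 4)` shows how a thousand hypothetical device IDs spread over four partitions. PostgreSQL 11 expresses the same idea declaratively, with each hash partition defined by a modulus and a remainder.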