Hortonworks and Hive make beeline for better SQL on petabyte-scale Hadoop

Enterprise Hadoop firm Hortonworks has unveiled its HDP 2.2 distribution, which bears the first fruits of efforts to speed up Apache Hive for SQL queries over petabytes of data.
Written by Toby Wolpe, Contributor

Hortonworks says the latest version of its Hadoop platform will allow users to extract information from petabyte-scale datasets far more rapidly and simply.

Hortonworks Data Platform 2.2, due for general availability in November, offers updated SQL semantics for Hive data warehouse transactions as well as a cost-based optimiser to improve performance.

"We've been doing work to advance ACID transactions within Hive and phase one of that has been delivered in 2.2," Hortonworks director of product strategy Jim Walker said.

"If you think about ACID transactions, it isn't ACID in terms of replacing OLTP, where you have millions of transactions across millions of concurrent users. It's more to accommodate use cases where you're doing things like stream ingest, where you do need to update tables in real time. Adding insert, update and delete is absolutely critical."
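As a rough illustration of the insert, update and delete operations Walker describes, the sketch below collects a few HiveQL statements in Python. The table name, column names and the `run_all` helper are hypothetical, and the statements assume that transactional Hive tables are stored as ORC, bucketed, and flagged with the 'transactional' table property. Check the Hive documentation for the exact requirements of a given release.

```python
# Illustrative HiveQL only; the table and column names are hypothetical.
ACID_STATEMENTS = [
    # Assumption: a transactional table needs ORC storage, bucketing
    # and the 'transactional' table property.
    """CREATE TABLE events (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true')""",
    "INSERT INTO TABLE events VALUES (1, 'new'), (2, 'new')",
    "UPDATE events SET status = 'seen' WHERE id = 1",
    "DELETE FROM events WHERE id = 2",
]


def run_all(execute, statements=ACID_STATEMENTS):
    """Pass each statement to `execute`, a caller-supplied callable
    (for example, the execute method of a DB-API cursor)."""
    for statement in statements:
        execute(statement)
    return len(statements)
```

The helper takes the executor as an argument so the same statements can be sent through whichever Hive client is in use, or simply collected into a list for inspection.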

When it comes to improvements in Hive and SQL, there is a tendency to think purely about speed, according to Walker.

"But it's not about just speed. It's also about SQL semantics. Can I actually execute what I want to execute? Can I use the engine in the way I want for some analytic purpose? So it's speed, it's the SQL semantics, but it's also that these things have to be done at scale. We're talking about terabytes and petabytes of data," he said.

The optimiser uses statistics to create execution plans and picks the most efficient in terms of system resources.

"The other side of the big advancement in this release is that the community has also added the concept of a cost-based optimiser within Hive, which really speeds performance when you're doing more complex joins."
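The general idea of a cost-based optimiser, using statistics to compare candidate plans and keeping the cheapest, can be shown with a toy example. This is a minimal sketch and not Hive's actual algorithm: the table names, row counts, selectivities and the cost formula (the sum of estimated intermediate result sizes) are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical statistics: row counts an optimiser might read from a catalogue.
ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "regions": 20}

# Hypothetical join selectivities: the fraction of row pairs that match.
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "regions"}): 1 / 20,
}


def estimate_cost(order):
    """Estimate the cost of a left-deep join order as the sum of the
    estimated sizes of its intermediate results."""
    joined = {order[0]}
    rows = float(ROW_COUNTS[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivity of every predicate linking the new table
        # to tables already joined; with no predicate it is a cross product.
        selectivity = 1.0
        for other in joined:
            selectivity *= SELECTIVITY.get(frozenset({table, other}), 1.0)
        rows = rows * ROW_COUNTS[table] * selectivity
        cost += rows
        joined.add(table)
    return cost


def cheapest_plan(tables):
    """Enumerate every join order and return the one with the lowest estimate."""
    return min(permutations(tables), key=estimate_cost)


best = cheapest_plan(["orders", "customers", "regions"])
```

With these made-up numbers, the cheapest orders join the two small tables first and bring in the large `orders` table last, which keeps the intermediate results small. Exhaustive enumeration only works for a handful of tables; real optimisers prune the search space.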

Walker said advances such as transactions for streaming ingest will enable enterprises to develop more advanced uses for Hadoop.

"If I stream data in and I want Hive queries to be executed on that in real time, that opens up more advanced use cases," he said.

"Then the cost-based optimiser is critical as we move higher to do more rich analytics, feeding into some of the more complex dashboards and visualisation tools. Speed, scale, and SQL are the key things here."

Hortonworks has supported the Spark in-memory analytics framework as a tech preview for the past few months. It is now integrating Spark into HDP 2.2 so that it runs on the YARN resource management layer and works better with Hive 0.13, with ORCFile support due by the end of 2014.

"Integration across YARN and making sure that Spark is going to run as a first-class citizen in the cluster was really critical," Walker said.

"The ORCFile support is awesome as well, because now the same tables and queries that I'm using in Hive, I now can use in Spark as well, without having to transform the data or do anything else at the same time."

Also added in the HDP 2.2 distribution is the Apache Kafka high-throughput message broker.

"This is a project that allows you to monitor events that are coming off any wire that's feeding into Hadoop. It's really for real-time streaming, so using Kafka streaming in via Storm data, that data is now being written natively into HDFS," Walker said.

"Maybe I'm writing it and using the ORC file that I'll be using for Hive, so I'm ingesting via Storm and running SQL queries at the same time — and you know what? My Spark environment and Spark tools are accessing that same set of data as well."

Hortonworks said HDP 2.2 also introduces work from the Apache Argus incubator project, which is designed to provide comprehensive security policies across the Apache Hadoop ecosystem through authorisation and auditing.

There are also more than a dozen improvements in the way Hadoop clusters are managed and monitored, according to Hortonworks, including custom views for the Ambari monitoring and provisioning tool, and Ambari Blueprints declarative cluster definitions.

Blueprints enable users to specify a stack, component layout and configurations to spin up a Hadoop cluster instance, without using the Ambari cluster install wizard.
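To give a sense of what such a declarative definition might look like, the sketch below builds a small blueprint document in Python and serialises it to JSON. The blueprint name, host group names and component layout are hypothetical, and the field names follow the Ambari Blueprints format as commonly documented. Treat them as assumptions and confirm them against the Ambari documentation for the release in use.

```python
import json


def build_blueprint(name, stack_name, stack_version):
    """Return a dict describing a stack, a component layout and configurations.
    The layout below is a hypothetical two-group cluster."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": stack_name,
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": "master",
                "cardinality": "1",
                "components": [
                    {"name": "NAMENODE"},
                    {"name": "RESOURCEMANAGER"},
                    {"name": "ZOOKEEPER_SERVER"},
                ],
            },
            {
                "name": "workers",
                "cardinality": "3",
                "components": [
                    {"name": "DATANODE"},
                    {"name": "NODEMANAGER"},
                ],
            },
        ],
        # Configuration overrides would be listed here; left empty in this sketch.
        "configurations": [],
    }


blueprint = build_blueprint("small-cluster", "HDP", "2.2")
blueprint_json = json.dumps(blueprint, indent=2)
```

The resulting JSON is the kind of document that would be registered with Ambari in place of stepping through the install wizard; how it is submitted is outside the scope of this sketch.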
