Hadoop's Tez: Why winning Apache's top-level status matters

This week's announcement by the Apache Software Foundation of top-level project status for the Tez framework is a significant step, according to Hortonworks' Shaun Connolly.
Written by Toby Wolpe, Contributor
Shaun Connolly: Top-level status will accelerate Tez's momentum. Image: Hortonworks.

The Apache Software Foundation's promotion of Tez to a top-level project endorses not only the technology but also the strength of the community behind it, according to Hortonworks, the Hadoop distribution and services company that originally developed the framework.

Tez, which entered the Apache Incubator in February 2013, is backed by code contributions from Cloudera, Facebook, Hortonworks, LinkedIn, Microsoft, Twitter, and Yahoo.

It's an extensible framework for building high-performance batch and interactive data-processing apps that need to integrate easily with the YARN resource management layer and handle petabyte-scale datasets.

"The significance is not only the maturity of the technology itself but the maturity of the community," Hortonworks product strategy vice president Shaun Connolly said.

"The fact that it gets top-level status will continue to accelerate its momentum. It's an important step."

The project currently has 31 committers, the engineers who can commit code to the project. Of those, 15 are from Hortonworks because of its involvement in incubating the technology.

"I expect more to come out of that [community] as others, particularly commercial software vendors beyond just Microsoft and those who are focused on it right now, begin to join it and bring some of their technologies and data-processing techniques to the project," Connolly said.

He added that some people have been confused about the role of Tez, which is an enabling API and framework that developers can embed in tools and engines that need to do high-performance and high-scale batch and interactive data processing.

Connolly defined batch as processing that takes minutes, hours, or days, while interactive means response times of a handful of seconds, suited to a person waiting on a result. That is distinct from sub-second real-time processing, which Tez does not target.

"It's a framework. It's not really an engine which is where some of the confusion comes into play. It enables things like Apache Hive and [scripting platform] Apache Pig, which use the framework, to build their own purpose-built engine and embed it in those technologies," Connolly said.

"So Hive with Tez effectively has its own embedded high-scale data processing engine."

Apache Tez has been embedded in the Apache Hive Hadoop data warehouse infrastructure for several months and was one of the technologies that enabled Apache Hive to achieve "interactive performance characteristics of a handful-of-seconds response times running SQL queries while retaining petabyte-scale capabilities", Connolly said.

"It really helped drive 10 times the throughput in queries that were expressed through Hive and correlated performance with that improved throughput," he said.

According to Connolly, it is incumbent on the community to ensure that whatever engine is used plugs cleanly into YARN so that its resources are managed centrally.

"Tez very much helps do that. But it also plugs into things like [Hadoop cluster management framework] Ambari for provisioning and monitoring and management and it plugs into security mechanisms consistently, as well as into governance-type technologies like Apache Falcon," he said.

However, when a new engine is brought into the platform, it is important not only that it has the rest of the platform's capabilities and solves a particular problem for developers but also that it operates at scale.

"You can achieve both in open source, as long as you have an architecture that plugs into YARN and into operations and security and governance cleanly," Connolly said.

"Then these new engines like [analytics framework] Spark and others can come into the platform in a consistent way and in a way that enterprises can embrace."

He said it was important to understand Tez in the context of the distinction between purpose-built and general-purpose engines.

"Hive with SQL is an example of a purpose-built engine. It's intended to do petabyte-scale SQL processing, interactive and batch. Spark and arguably even classic MapReduce are more general-purpose engines, where the APIs were intended for mainstream developers to program against," he said.

"Spark, for example, does that very well. It has very nice, simple, elegant APIs. It's a multipurpose engine, mostly for interactive workloads, since it takes advantage of memory quite well at scale. It doesn't go up to petabyte scale but it's a good general-purpose engine for that need.

"Whereas Tez really enables things like Hive, Pig and others to express their purpose-built needs. It's not a general-purpose engine but more of a framework for tools to express their purpose-built needs."
