How Apache's open-source 'survival of the fittest' ethos breeds better software

The Apache Software Foundation operates on the soundest Darwinian principles, according to Hadoop firm Cloudera's CTO, Amr Awadallah.
Written by Toby Wolpe, Contributor
Cloudera CTO Amr Awadallah: More than one project solving the same exact problem.
Image: Cloudera

From HTTP Server, to Hadoop and Cassandra, there's no doubting the effectiveness of the Apache Software Foundation in fostering open-source innovation.

Yet the other side of its collaborative, consensual approach is the freedom it gives people to duplicate software engineering efforts, which in other contexts might be seen as wasteful.

"That's the genius of Apache. The Apache Software Foundation allows for there to be more than one project solving the same exact problem, such that the best project can win," Cloudera CTO and co-founder Amr Awadallah said.

"There are many examples of that in the past. One of the most recent would be the Spark-versus-Tez fight that took place over the past two years."

Spark started in 2009 as a UC Berkeley research project to create a cluster-computing framework addressing target workloads poorly served by the big-data technology Hadoop.

Spark went open source in 2010 and its popularity has soared over the past 12 months. Tez is a rival - although very different, according to some - Apache technology promulgated by Cloudera's Hadoop competitor, Hortonworks.

"Spark was supposed to be a much better high-performance in-memory job-processing system, and at the same time Hortonworks launched an effort called Apache Tez, trying to do that exact same thing," Awadallah said.

"Over the course of the past year, almost all the industry, including Hortonworks, has now gravitated towards Apache Spark. That's one of the key operating tenets of the Apache Software Foundation - to have this Darwinian effect of, 'Let's have more than one project trying to solve the same problem and let's have the best project win'."

However, the profusion of Apache software projects in the area of Hadoop security, and the absence of coordination among them, have the potential to cause confusion among customers.

Last month Hortonworks co-founder Arun Murthy discussed Apache Atlas, a Hadoop metadata and data-governance project, which will combine with existing Apache projects: Knox for perimeter security, Ranger for central security policies, and Falcon for data lifecycle management.

Cloudera in turn has espoused the Apache Sentry and Apache Rhino initiatives, which merged in June last year, for authentication and single sign-on for Hadoop services.

"Now, obviously I'm biased - I work for Cloudera - but the proof is in the pudding, as they say. We have people today, deployed, running Apache Sentry, running Cloudera Navigator [governance technology], and running our encryption solution," Awadallah said.

"Our primary competitor Hortonworks has made a big mistake. They should have leveraged the existing investments that Cloudera made, meaning Sentry and Rhino and so on, as opposed to creating new projects. But they've chosen to go down the path of creating a number of new projects that frankly are not anywhere near as mature as what we provide."

The Apache Software Foundation is a non-profit corporation set up in 1999 to provide software for the public good by offering services and support to development projects.

Despite those numerous projects covering various aspects of security, and the undoubted advances in that area, Awadallah said the Hadoop access-control layer could stand further refinement.

"Right now, our access-control layer is not very uniform. For some of our engines, like the SQL engine, it's very well done and you can do very fine-grained access control and specify at the level of a table and a column who can access, who can change, who can see," he said.

"But for some of our other workloads, like MapReduce or Spark, the access control is not as fine-grained - it's more coarse-grained. You can access that whole table, you can access that whole file but you cannot really control within the file what you can do. That's really a key area of investment for us."
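The gap Awadallah describes can be illustrated with a minimal Java sketch: a coarse-grained check grants or denies a whole file or table, while a fine-grained check decides per column, as a SQL engine with Sentry-style grants can. All of the names and data structures here are invented for illustration - this is not Sentry's or Hadoop's actual API.

```java
import java.util.Map;
import java.util.Set;

public class AccessControl {
    // Coarse-grained: one decision per file or table -- all or nothing,
    // as with today's MapReduce and Spark workloads in the quote above.
    static boolean canReadFile(Set<String> fileAcl, String user) {
        return fileAcl.contains(user);
    }

    // Fine-grained: a separate decision per column, as a SQL engine
    // enforcing column-level grants can make.
    static boolean canReadColumn(Map<String, Set<String>> columnAcl,
                                 String user, String column) {
        return columnAcl.getOrDefault(column, Set.of()).contains(user);
    }

    public static void main(String[] args) {
        // Hypothetical per-column ACL for one table.
        Map<String, Set<String>> acl = Map.of(
            "customer_name", Set.of("analyst", "admin"),
            "credit_card",   Set.of("admin"));   // sensitive column

        System.out.println(canReadColumn(acl, "analyst", "customer_name")); // true
        System.out.println(canReadColumn(acl, "analyst", "credit_card"));   // false
    }
}
```

The point of the sketch is the shape of the decision: with only the file-level check, the analyst either sees the credit-card column or sees nothing at all.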

As well as working on improving Hadoop security through software, Cloudera has been pursuing hardware-based measures through its close relationship with chip-maker Intel.

Intel invested $740m in Cloudera in March 2014 for an 18 percent stake in the company, simultaneously discontinuing work on its own Hadoop distribution and instead throwing its weight behind Cloudera's.

The work between Intel and Cloudera has so far focused primarily on the area of encryption.

"Authorisation and access control - they're not very compute-heavy. They're really look-up services: you're trying to look up whether you are who you say you are and trying to look up what you can do and what you can't do. They don't require a lot of hardware and doing it in software is fine," Awadallah said.

"The problem is encryption. When we were doing encryption before the integration with Intel, when a customer chose to turn on encryption, the performance of their cluster would go down by 30 percent because the CPU becomes very busy encrypting and decrypting data in software."

The optimisation work Cloudera conducted with Intel took advantage of AES-NI, the Advanced Encryption Standard New Instructions that the chip giant had added to its processors.

"AESNI essentially is a bunch of new instructions in the Intel chipset that does encryption and decryption in hardware. You just point it to a block in memory and give it the address of that block and give it the private key to encrypt and then the chip will take care of doing the work," he said.

"That's reduced the performance impact from 30 percent to less than three percent. Now it becomes a no-brainer that it can become encrypted by default, meaning a customer faced with that choice will say, 'Yes, sure. For a three percent overhead to have all my data have the extra protection of being encrypted, let's do that'."
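The acceleration Awadallah describes is typically transparent to application code. In Java, for example, recent HotSpot JVMs compile the standard `javax.crypto` AES path down to AES-NI instructions when the CPU supports them, so an ordinary encrypt/decrypt round trip like the sketch below gets the hardware speed-up automatically. This is a generic illustration of hardware-accelerated AES, not Cloudera's actual HDFS encryption code.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class AesRoundTrip {
    // Encrypt or decrypt a buffer with AES in CTR mode. On CPUs with
    // AES-NI, HotSpot replaces the inner AES rounds with the hardware
    // instructions -- no change to this code is needed.
    static byte[] crypt(int mode, SecretKey key, byte[] iv, byte[] data)
            throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(mode, key, new IvParameterSpec(iv));
        return cipher.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        byte[] iv = new byte[16];               // CTR counter block
        new SecureRandom().nextBytes(iv);

        byte[] block = new byte[1 << 20];       // 1 MiB of sample data
        new SecureRandom().nextBytes(block);

        byte[] ct = crypt(Cipher.ENCRYPT_MODE, key, iv, block);
        byte[] pt = crypt(Cipher.DECRYPT_MODE, key, iv, ct);
        System.out.println(Arrays.equals(block, pt)); // round trip succeeds: true
    }
}
```

Because the same API is used either way, turning encryption on "by default", as Awadallah suggests, is a deployment decision rather than a code change; the hardware simply makes the per-block cost small.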

Intel earlier this month refreshed its Xeon E7 processor family with new chips aimed at helping businesses conduct real-time analytics on large datasets.

Cloudera is also working with Intel on other integrations that are expected to hit the market over the next two years, in the area of memory and transactional memory.

The company is already working on preparing its software to take advantage of the eventual appearance of the Intel technology, and those modifications will be made available to the open-source community.

"Our commitment with Intel is that this all goes back into the Apache Software Foundation. Cloudera will have a few months' advantage," Awadallah said.

"Whenever we do something, obviously it hits our releases a bit earlier than it hits the Apache Software Foundation. But at the end of the day, this will trickle through to all our competition as well. We just have a brief time advantage."
