Hadoop is not what it used to be, and we mean this in a good way.
If we were to speak in hype cycle terms, we would say sometime in the past few years Hadoop exited the peak of inflated expectations and entered the plateau of productivity. So the fact that you may not hear about it as much is a good thing.
Hadoop has kept adding new features in the meantime. There are the eye-catching ones, such as support for data science, machine learning and artificial intelligence, GPUs, and edge computing. And then there's also progress on less shiny items, but ones which could arguably make an even bigger difference in day-to-day operations, such as metadata and governance, containers, and storage in the cloud.
Among Hadoop vendors, Hortonworks is the one closest to Hadoop's open source foundation. Last week, Hortonworks' partners and clients showcased progress on all of those fronts, and discussed the way forward at the DataWorks EMEA Summit in Berlin. Let's take a look at what's going on, and how the Big Data landscape is evolving.
The key theme last year was Hadoop 3.0, and a good deal of the new items we mentioned are under the Hadoop 3.0 umbrella. Picking up from where we left off last year, the progress in adding support for containers and the improvements in the file system are clear. What was in alpha stage then has now been released as the Hadoop 3.x series.
Just like last year, Sanjay Radia, Hortonworks co-founder, Hadoop PMC member, and HDFS architect, gave a talk on the bleeding edge of Hadoop's filesystem. This bleeding edge goes by the name of Ozone, and it allows Hadoop to scale to tens of billions of files and blocks and, in the future, to an ever larger number of smaller objects.
Cloud storage was the backdrop for this, as well as other features, as part of what Hortonworks calls Global Data Management. This is the term Hortonworks uses to refer to the ability to manage data on premises and across clouds, as well as at the edge. NiFi, the basis for Hortonworks' streaming platform HDF, is now complemented by MiNiFi, a version with a minimal footprint that can run on devices with limited capabilities.
Another key development in the new Hadoop is support for AI. We have wondered before whether it would make sense to use Hadoop as a platform for ML & AI, since it is where a significant part of the world's data lives. The question is so obvious, it could not possibly remain unanswered for long.
At DataWorks, emerging support for running libraries such as TensorFlow, MXNet, and Caffe on Hadoop clusters was highlighted. But for these libraries to work optimally, adding GPUs to the mix is also needed. Now, Hadoop boasts GPU and FPGA support as well, as part of its new YARN environment.
The new YARN also boasts federation, containerized apps, support for long running services, seamless application upgrades, powerful scheduling features, and operational enhancements.
Even though Hortonworks contributes to all of the above, these features are yet to be included in its own platform. The reason? Apache Foundation policy on the one hand, maturity on the other.
As Hortonworks engineers and executives explained, since Apache Hadoop 3.1 was just released, there are some constitutional procedures to be observed before incorporating this into Hortonworks' platform.
Furthermore, we may add, Hortonworks would probably want to harden something released just a few days ago before including it in its platform. Although there's no official date yet, we expect to see the new features added in the next couple of months.
Metadata, Governance, Innovation, and GDPR
Hortonworks positions itself as a vendor that wants to get the basics right and then build on them for more advanced features. Metadata and governance are something Hortonworks has traditionally emphasized. The fact that GDPR is just one month away from becoming effective, and that this was an event with a European focus, only helped to amplify that message.
Hortonworks has unveiled a new product called Dataplane. Dataplane offers capabilities in data source integration, cataloguing, and security controls. Its goal is to help users access and understand data assets and apply security and governance policies. Therefore, goes Hortonworks' message, it's a perfect match to address GDPR concerns.
Dataplane primarily builds on two open-source projects in the Hadoop ecosystem: Apache Ranger, used to define policies; and Apache Atlas, used to manage metadata. This was highlighted in one of the keynotes that showed how these can be leveraged to comply with GDPR. But that's not all.
GDPR and Dataplane are a bit of a chicken and egg thing according to Hortonworks. It's not so much that it thought of developing a product to deal with GDPR, but more like seeing GDPR as an opportunity for users to get their metadata and governance right, and advocating accordingly.
This touches upon a number of things. Take Hive, for example, one of the traditional solutions for SQL-on-Hadoop. Hortonworks, which has been standing behind Hive for a long time, mentioned Hive is now getting ACID capabilities. This looks strange initially: Getting access to data stored in Hadoop is one thing, but would the goal be to use Hadoop as an operational database?
Not really. But as Hadoop's file system is append only, deleting records is not very straightforward. Yet, this is exactly what GDPR's right to be forgotten dictates. So, this is an example of regulation pushing innovation, as giving Hive the ability to do deletes and offer ACID compliance was driven to a great extent by this requirement.
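To illustrate why deletes are awkward on append-only storage, here is a minimal Python sketch of the general tombstone-and-compact idea: a delete is appended as a new entry, readers replay the log to hide deleted rows, and a later compaction physically drops them. This is a toy model under our own assumptions, not Hive's actual implementation; the class `AppendOnlyTable` and its methods are hypothetical names.

```python
# Toy model of deletes on append-only storage. All names are hypothetical;
# this is not Hive's implementation, only the tombstone-and-compact idea.
from typing import Dict, List, Tuple

Record = Tuple[str, int, Dict[str, str]]  # (operation, row id, payload)


class AppendOnlyTable:
    def __init__(self) -> None:
        self.log: List[Record] = []  # entries are only ever appended

    def insert(self, row_id: int, payload: Dict[str, str]) -> None:
        self.log.append(("insert", row_id, payload))

    def delete(self, row_id: int) -> None:
        # A delete is recorded as a new entry (a tombstone), not an edit.
        self.log.append(("delete", row_id, {}))

    def read(self) -> Dict[int, Dict[str, str]]:
        """Replay the log in order to get the currently visible rows."""
        rows: Dict[int, Dict[str, str]] = {}
        for op, row_id, payload in self.log:
            if op == "insert":
                rows[row_id] = payload
            else:
                rows.pop(row_id, None)
        return rows

    def compact(self) -> None:
        """Rewrite the log so that deleted rows are physically gone."""
        self.log = [("insert", rid, p) for rid, p in self.read().items()]


table = AppendOnlyTable()
table.insert(1, {"name": "Ada"})
table.insert(2, {"name": "Bob"})
table.delete(1)        # row 1 is hidden from readers from now on
visible = table.read()
table.compact()        # row 1 is now removed from the stored log too
```

In this model, the deleted row stays in the stored log until compaction runs. That is the kind of detail that matters for a right-to-be-forgotten request, where hiding a record from queries may not be enough.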
But what is even more interesting is the emphasis on Atlas. Hortonworks, in collaboration with IBM, is pushing toward a major upgrade in Atlas. The vision is to use Atlas as a standard repository for metadata across enterprises, and there are a number of things to note there.
First, Atlas is being developed in the direction of a distributed architecture, which means that different repositories will be able to exchange metadata. Second, Atlas is going the way of schema.org. Schema.org is the most prominent use case of semantic web vocabularies being used in the real world. Atlas utilizes standards such as SKOS and DCAT to handle metadata.
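To give a feel for what describing metadata with SKOS and DCAT looks like, here is a minimal Python sketch that builds a JSON-LD description of a dataset. It only shows the vocabularies at work and is not the Atlas API. The namespace IRIs are the published ones, while the function `describe_dataset` and the `urn:example:` identifiers are made up for illustration.

```python
import json

# Published namespace IRIs for DCAT, Dublin Core Terms, and SKOS.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def describe_dataset(dataset_id, title, theme_id, theme_label):
    """Build a JSON-LD description of a dataset (hypothetical helper).

    The dataset is typed as dcat:Dataset and linked through dcat:theme to
    a skos:Concept, so the category is itself a describable resource.
    """
    return {
        "@context": CONTEXT,
        "@id": dataset_id,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:theme": {
            "@id": theme_id,
            "@type": "skos:Concept",
            "skos:prefLabel": theme_label,
        },
    }


# Identifiers below are placeholders, not real catalog entries.
record = describe_dataset(
    "urn:example:dataset:customers",
    "Customer master data",
    "urn:example:concept:personal-data",
    "Personal data",
)
print(json.dumps(record, indent=2))
```

Because the theme is a shared concept rather than a free-text label, two repositories that use the same concept identifier can agree that their datasets fall under the same category, which is what makes exchanging metadata between them practical.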
These are advanced technical approaches. However, they point to some key non-technical facts.
Hadoop dynamics and strategy
Metadata is a part of data management that has traditionally been underserved and underutilized. GDPR can indeed serve as an opportunity to get data management right. Metadata is the key, and adding such advanced capabilities to Hadoop's ecosystem is a great way to facilitate this. But it's not enough in and of itself.
These capabilities need to be usable. Not many people are familiar with meta-models such as SKOS and DCAT, for example. There are repositories with domain models that Atlas could draw from to give users a quick win. Hortonworks and IBM people said they are aware, and added the quick win they are aiming for is to give Atlas the ability to instantly capture metadata upon connecting to Hadoop.
But even that may not be enough. In the end, it all comes down to adoption and industry support. Schema.org succeeded in part because of the industry support it garnered. With practically all major search engines supporting it, it would be hard to ignore. What about Atlas, though? What kind of support can we expect to see there?
To answer this question, we need to understand vendor dynamics and strategy. The fact that Hortonworks and IBM are in this together is part of a broader strategic alliance between them that started last year.
IBM has given up on its own Hadoop distribution, and formed a partnership with Hortonworks. We've seen something similar in the past, when Intel dropped its own distribution to go with Cloudera's, but this is different.
Hortonworks and IBM go back to the ODPi days. ODPi is an initiative aiming to standardize Hadoop distributions, but it never got much traction. It was basically Hortonworks and IBM, as Cloudera and MapR were not interested. It makes sense on their side, as this would be undermining their strategies.
Now that IBM and Hortonworks are working closely together, ODPi is being repurposed into what they hope will become an ecosystem for Dataplane. Even though Dataplane is not production ready in its entirety yet (Dataplane went GA in September 2017, but not all extensible services are there at this point), the vision is to enable other vendors to plug into it to do tasks such as data wrangling.
How this will play out remains to be seen. What we can note is that IBM is a key partner for Hortonworks. Being the Hadoop distribution of choice for IBM means not only getting access to its clients, but also embarking on joint development projects, as Dataplane goes to show. IBM's DataWorks, for example, is now on its way to becoming part of Dataplane as DSX (Data Science Experience).
The theme is common across vendors: Hadoop is hosting a good part of the world's datasets, and facilitating their growth. It makes sense to use it as a platform to bring the applications to where the data is, and we will continue to see this happening.
There is enough innovation to get attention, but more importantly, there is enough maturity to be productive.
Note: The original version of this article was amended to clarify that Dataplane has been GA since September 2017, but not all of its extensible services are GA at this point.