The new Cloudera-Hortonworks Hadoop: 100 percent open source, 50 percent boring

How do you bring Hadoop to the AI, hybrid, and multi-cloud era, making it so easy to use and reliable that it's boring? How do you build a sustainable business doing that while switching to a 100-percent open source model? At this point, the new Cloudera raises more questions than it answers.


Hadoop is the operating system for big data in the enterprise. So when Cloudera and Hortonworks, the two leading Hadoop distributions and vendors, merged, that was big news in and of itself. Last week's DataWorks Summit Europe was the first big public event for the new Cloudera after the merger, and it sure was not short of interesting news, on both the technology and the business fronts.

Also: Cloudera eyes Cloudera Data Platform launch 

First off, in case you're wondering, it's all Cloudera from now on. That's the name the new company will go by, and there's a new-ish logo and branding to go with this too. DataWorks historically was a Hortonworks event, and a few people noted they will miss those Hadoop elephants.

If anecdotal evidence is anything to go by, however, keeping Cloudera as the new name may be a good choice business-wise: In a quick poll we did with people beyond the Hadoop scene, more of them seemed to be familiar with Cloudera as a brand than Hortonworks. 

Now, there are many aspects to every merger, and many hard decisions to be made. The brand-name decision may have gone in Cloudera's favor, but if you think that means Cloudera has set the tone in the new company, well, not so fast.

Cloudera drops the bomb: From open core to 100 percent open source

For any open source company these days, how to go about its business model and licensing is probably the most important decision to be made. As we have argued, and as fellow ZDNet contributor and Ovum analyst Tony Baer recently noted too, open source is becoming the new default for enterprise software. It has proven to be a better development model, for a number of reasons.

As Baer and a number of others have pointed out in the past, enterprise software vendors based on a 100 percent open source model find it very hard to scale: it essentially means the only viable path to revenue is services. This is why, in the cloud era, open source enterprise software vendors practically have to choose between two strategies.

The first is to go open core: not open sourcing the entirety of the software, but keeping some parts proprietary and charging for those. The second is to keep all the software open source, but rely on offering it as a service in the cloud for revenue. Cloudera used to be a firm believer in open core. That's not the case anymore, so let's ponder what this means, and how it could play out.

During the analyst-day briefing held at DataWorks before the event officially kicked off, Cloudera executives made some statements on new developments, strategy, and so on. As part of those, they repeatedly referred to a "100 percent open source platform."

Had this been the old Hortonworks days, nobody would have batted an eyelid. But as Cloudera has historically been a strong proponent of the open core strategy, asking for clarification was in order. So, we had the fortune of hearing the bombshell news first: the new platform will be 100 percent open source. Does this mean we'll see a Commons Clause in its license? No comment on that from Cloudera executives.

If there is no Commons Clause, what's to stop AWS from appropriating Cloudera's codebase, as it has done with others before it? If there is one, what does it mean for the shifting open source licensing battleground? This does warrant further analysis, and we will embark on it. Teaser for open source enthusiasts: hold your horses. But let's first see what this new Cloudera platform will be, exactly.

The new Cloudera platform: It will get complicated before it gets boring

Many people we spoke to at DataWorks were of the opinion that the merger made a lot of sense, and that is an opinion we share. Cloudera executives themselves pointed out that there was something like 75 percent overlap in the clients the two companies were competing for, as well as in the codebase they were developing. But that does not necessarily mean integration will be easy.

There were a lot of Xs in Cloudera's marchitecture slides. And in this case, X did not mark the spot, nor did it stand for some piece of the platform targeted for obsolescence. The idea is that no component will be thrown out of the new platform. Current users will continue to get support for their distribution, be it Cloudera or Hortonworks.

Also: Cloudera's Hilary Mason: To make AI useful, make it more boring

The goal is for the new, merged platform to be available in Q2 2019. When this happens, customers will be offered a clear migration path. But they will also have the option to keep using their current distribution. Eventually, the idea is that the codebase will merge and everyone will be on the same platform. But that's going to be complicated. 


Merging codebases, and catering to things such as data management, governance, and security, may come across as boring. But this is what it takes to be the data fabric for the enterprise.

Let's take one example to see how this would work. To access data stored in Hadoop using SQL today, Cloudera and Hortonworks users rely on Impala and Hive, respectively. In the short term, the new Cloudera platform will integrate and support both. In the mid-to-long term, the goal is to have one solution there. That explains all those Xs -- not even the names of the new, integrated components have been figured out yet.

What seems to have been figured out, however, is the focus of the new company: It's all about the enterprise. According to Cloudera executives, Cloudera is not interested in getting your local bank in its clientèle. It's only interested in its parent bank, or holding company, which it most likely has in its clientèle already. In other words, Cloudera is going boring. 

Also: Cloudera and Hortonworks' merger closes; quo vadis Big Data?

The "make AI boring" notion was something Hilary Mason, Cloudera's GM of Machine Learning, shared with ZDNet's Andrew Brust before sharing with the DataWorks crowd. The essence of Mason's plea is something others have argued for as well: the cool machine learning algorithms are just a part of the so-called AI stack, and not even a big part for that matter.

To make this work, you need the infrastructure to collect the data that trains the algorithms, and to deploy those algorithms in production: things like data management, governance, and security. These may sound boring, but they are the substrate cool algorithms need to work in production environments. This is what Cloudera is aiming for, and managing Hadoop, the platform on which tons of enterprise data live, is at least 50 percent of the job.

Hadoop is passé: It's all about the enterprise data cloud operating system for AI and the cloud era

Ah, Hadoop. Not many people use the "H word" much these days. Perhaps having a legacy, and the code to go with it, is part of being boring. Hadoop certainly has both. To recap: Hadoop's original main premises seem increasingly less relevant today. Cloudera promotes its platform not as Hadoop, but as an enterprise data cloud, making the point that it is in a position to leverage both on-premises and cloud resources. This is a common argument data platform vendors use to differentiate themselves from cloud vendors.

Hadoop was built to deal with big chunks of data on premises, organized in large files. To do this, the idea was to co-locate compute and storage, organizing storage around HDFS and compute around MapReduce, based on a batch-processing abstraction.
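To make that batch abstraction concrete, here is a minimal pure-Python sketch of the MapReduce model -- map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group. This illustrates the programming model only; it is not Hadoop code, and the sample "splits" are made up for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values; here, sum the counts per word.
    return {key: sum(values) for key, values in groups.items()}

# Two hypothetical input splits, as HDFS would hand them to map tasks.
splits = ["big data on big files", "big files in HDFS"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

The point of the abstraction is that each phase parallelizes trivially across machines -- which is also why it proved cumbersome for workloads that don't fit a batch pipeline.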

Also: Cloudera Machine Learning release takes cloud-native path

Today, data seems to be organized in many small files, often originating and/or stored in the cloud -- think S3. MapReduce has proven to be a rather cumbersome API, although lots of higher-level APIs have been built on top of it. Batch processing has not gone away, but increasingly, real-time processing is becoming the norm.

Hadoop was built to disrupt data warehouses, dealing with their inefficiencies. It did that, bringing a wave of innovation which also transformed data warehouses, at least in part. Data warehouses are still around, but this time it's Hadoop that's being disrupted by the cloud, AI, and real-time processing. The question is: Can Hadoop react fast enough to avoid being the data warehouse of the future -- i.e. still around, but not as relevant in a few years? 


Hadoop is moving forward, reinventing its core premises

Unlike data warehouses, Hadoop is well positioned to deal with disruption. Its key strengths are its open source nature and its decoupled architecture. Open source means the pace of innovation can be faster, and a decoupled architecture means components can change individually while the whole remains in place. The parts of Hadoop at the forefront of this race today are Ozone, Submarine, and the push toward a cloud-native platform.

Ozone is the codename for the ongoing work to enable Hadoop to operate seamlessly across HDFS and S3. Last year it was Sanjay Radia, one of Hortonworks' co-founders, who introduced Ozone; this time Marton Elek, a Hortonworks lead software engineer, presented the latest developments. We observed that Ozone essentially seems to recreate S3 on premises. Elek concurred, adding that Ozone will be strongly consistent, not eventually consistent.
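If Ozone recreates S3 on premises, the natural consequence is that existing S3 tooling can point at it. Below is a minimal, hypothetical sketch using boto3 (the AWS SDK for Python) against an Ozone S3 gateway; the endpoint address, credentials, and bucket name are all assumptions for illustration, not Cloudera's documented setup.

```python
OZONE_S3_ENDPOINT = "http://localhost:9878"  # assumed local Ozone S3 gateway address

def make_ozone_client():
    """Build an S3 client pointed at an Ozone gateway instead of AWS."""
    import boto3  # third-party AWS SDK; any S3-compatible client would do

    return boto3.client(
        "s3",
        endpoint_url=OZONE_S3_ENDPOINT,       # redirect S3 API calls to Ozone
        aws_access_key_id="placeholder",      # hypothetical credentials
        aws_secret_access_key="placeholder",
    )

# Hypothetical usage against a running gateway:
#   s3 = make_ozone_client()
#   s3.create_bucket(Bucket="demo")
#   s3.put_object(Bucket="demo", Key="hello.txt", Body=b"stored on premises")
```

The design choice matters for the consistency point Elek made: an S3-compatible API backed by a strongly consistent store means clients see writes immediately, rather than eventually.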

Also: Cloudera and Hortonworks merger: A win-win for all

Submarine is the codename for the ongoing work to make distributed deep learning/machine learning applications easy to launch, manage, and monitor in Hadoop. Support for GPUs in Hadoop was already there; now Submarine is working on improvements such as container DNS support, scheduling on YARN, and more.

YARN, Hadoop's resource manager and job scheduler, is another key area for Hadoop's future. YARN is mature and battle-tested, but if Hadoop wants to be cloud-native, it will have to adapt to Kubernetes, which comes with its own scheduler. We briefly discussed this with Tristan Zajonc, Cloudera's CTO for Machine Learning. Zajonc presented future directions for Cloudera Data Science Workbench, and much of this revolved around Kubernetes.

Our takeaway is that these projects will likely move faster than the core Hadoop codebase. They are the front runners needed to bring Hadoop into the AI and cloud era. At this point, Hadoop's core value proposition seems to be this "boring" middle layer, unifying data access in the enterprise. That's no small feat, and it should be enough to secure the new Cloudera a spot in the enterprise.

But to stay relevant in the long term, the platform formerly known as Hadoop will need to make rapid progress on those fronts. We'll be here to keep track.