Have we reached Peak Hadoop?

The fact that IBM is not using Hadoop for its new Watson Data Platform cloud service provides a fresh reminder that the Hadoop vs. Spark debate isn't dead. But don't count Hadoop out. Its data management capabilities are not yet being matched by Spark or other fit-for-purpose big data cloud services.
Written by Tony Baer (dbInsight), Contributor

Leaving Las Vegas, or more specifically IBM's World of Watson, we're reflecting on the progress that IBM has made towards delivering cognitive computing and the reality that the onramp remains a long one. For the most part, we're seeing the steady rollout of point services, such as the Watson Virtual Agent reported by Larry Dignan that offers modest chatbot capabilities.

But the biggest release of this cycle, for which IBM has been taking a rolling-thunder strategy, is what's now finally called the Watson Data Platform. My fellow Big on Data blogger George Anadiotis covered the initial announcement, which was all about the strategy (and an interim product name), while Larry's report documented the final announcement.

At the conference, we had a chance to take a deep dive into the new platform, not to mention stepping up to the bar and letting Watson pick out a craft brew based on our profile. For the record, Watson blew it in deciphering my taste, concluding that I prefer lagers because I like blackberries, chocolate cake, and drinking beers in wintertime. We'll do a much deeper dive later this week into the Watson Data Platform and what it means for IBM's big data strategy.

See also: IBM expands Watson's reach with data platform, iOS integration, bots, education efforts | IBM DataWorks, a holistic approach to leveraging data | SQL on Hadoop benchmarks get serious

Ironically, the rollout of Watson Data Platform reminded us of the ongoing Hadoop vs. Spark meme, as the new cloud-based offering emphasizes Spark technology and is not actually based on Hadoop. While IBM clearly continues to offer BigInsights, which is its packaging based on the ODPi platform, the implied question is this: if you're running a big data service in the cloud, is Hadoop really the best place to deploy it?

The rap against Hadoop is that unlike databases, it is not a monolithic platform, but an assemblage of projects or components. The core HDFS file system was very much a bare bones invention designed for rapid scanning at scale on commodity hardware. Hadoop was never meant to be or designed as a database. And even after years of commercial packaging, Hadoop is still a vendor-curated collection of projects that is complicated to deploy.

Of course, pitting Spark against Hadoop is a case of comparing apples against oranges. Spark is a compute engine, while Hadoop is a storage and compute platform that runs many compute engines, including Spark. Hadoop has the beginnings of resource management, security and data governance, while Spark has none.

Nonetheless, the fact that IBM is creating a cloud-based big data collaboration hub is not necessarily a question of Spark vs. Hadoop, but cloud vs. Hadoop.

If you just have a complex analytic problem that you want to run on a fast cluster without worrying about security, data governance, or resource utilization, Spark works fine as long as there's a JVM or a lightweight cluster manager like Mesos. Otherwise, reinventing the wheel in making that Spark standalone cluster a good enterprise IT system citizen won't be worth the effort.

But that's where the cloud is different - specifically, the Spark Platform as a Service (PaaS) cloud. Maybe it doesn't make sense for an enterprise to deploy the storage engine, security infrastructure, and data governance, but it should be part of the core competency of any cloud PaaS provider, because the first name of PaaS is platform.

And the trend is clear that enterprises are increasingly embracing cloud. Amazon's AWS cloud service is now the company's profit engine, as the business is on track for a lucrative $10 billion year. Ovum's survey of global enterprise IT spending priorities shows that roughly a third of respondents plan to grow their cloud spend in 2017. And Oracle, which has become a more recent convert to cloud, expects 80% of its client base to embrace cloud over the next decade.

We've believed that cloud adoption would become increasingly mainstream for Hadoop. Yet if you ask Cloudera, Hortonworks, or MapR, cloud deployment in their base runs around 15 to 20%. That sample, however, skews toward early adopters that had the skills, resources, and drive to implement on premises. Our take has been that the next wave of Hadoop adopters would of necessity consist of less IT-savvy organizations for which the platform would have to get simpler. Managed cloud services are the obvious path, yet the question is whether Spark is stealing the thunder there.

Does that mean that Hadoop's best days are behind it? There's a silver lining, in (or out of) the cloud: the data lake. Hadoop provides a platform where governance capabilities are steadily improving; the alternative is storing miscellaneous data in a cloud-based object store. But as with Spark clusters, enterprises will then have to improvise security and governance. IBM's roadmap for Watson Data Platform will encompass data governance. Nonetheless, rumors that we have reached peak Hadoop are still exaggerated.
