Hadoop and big data: Where Apache Slider slots in and why it matters

Hadoop veteran Arun Murthy says Apache Slider will have a major bearing on the future versatility and take-up of the distributed big-data technology.
Written by Toby Wolpe, Contributor
Arun Murthy. Image: Hortonworks

Code submitted this week for inclusion in the Hadoop stack will help speed the spread of the distributed big-data platform, according to Hortonworks co-founder Arun Murthy.

The submission of the Slider framework to the Apache Software Foundation Incubator is intended to let existing applications, such as NoSQL databases, run unmodified on Hadoop and its YARN resource-management layer.

"It's looking ahead to the future of Hadoop and YARN. This is a really important step forward because it allows us to expand the spectrum of applications and use cases that you can actually service with Hadoop and YARN," Murthy said.

"A NoSQL database is an example; an analytic service is an example. We expect these things to use Slider to bridge the gap between a silo that they live in today to run natively in Hadoop."

Work on Apache Slider has been under way for the past eight or nine months, and the framework is expected to be available to the wider market in the second half of this year.

YARN, released last October in Hadoop 2.0, separates resource management from MapReduce's processing engine, allowing other processing models to run on the same cluster.

Murthy, who has worked on Hadoop since day one in 2006, described Slider as broadening Hadoop beyond just processing data.

"It allows us to run services like [open-source Apache NoSQL database] HBase and machine-learning apps all in the context of YARN. This takes YARN beyond one or two use cases into hundreds if not more," he said.

"Slider is a framework that allows you to bridge existing always-on services and makes sure they work really well on top of YARN without having to modify the application itself. That's really important.

"Right now it's HBase and Accumulo but it could be Cassandra, it could be MongoDB, it could be anything in the world. That's the key part."

Murthy said those willing to modify applications can already use YARN directly and don't need Slider.

"But a lot of customers and partners don't want to modify an existing application, so that's us making it really easy to bridge that gap between an existing app and Hadoop and YARN," he said.

The goal is for YARN to become the datacentre operating system, capable of running other types of always-on services as well as data processing.

Murthy said Apache HBase, which is a distributed database based on Google's BigTable and written in Java, is a simple example.

"People run HBase, they run MapReduce. If they're independent systems, they're still running on the same physical box. HBase is consuming some CPU and RAM and disk. MapReduce is consuming some CPU and RAM and disk," he said.

"If they don't know about each other, they're going to step on each other and the customer's SLA is going to suffer. At some point HBase is going to do something nasty and MapReduce is going to do something else."
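The contention Murthy describes can be illustrated with a toy model. The sketch below is not YARN's or Slider's API; the `Node` class, its `request` method and the memory and core figures are all hypothetical, chosen only to show how a single resource manager that tracks every grant can refuse a request that would oversubscribe a machine.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one machine with fixed memory and CPU."""
    mem_mb: int
    vcores: int
    # service name -> (mem_mb, vcores) currently granted
    grants: dict = field(default_factory=dict)

    def free(self):
        """Return (free memory in MB, free vcores) after all grants."""
        used_mem = sum(m for m, _ in self.grants.values())
        used_cpu = sum(c for _, c in self.grants.values())
        return self.mem_mb - used_mem, self.vcores - used_cpu

    def request(self, service, mem_mb, vcores):
        """Grant the request only if it fits in what is still free."""
        free_mem, free_cpu = self.free()
        if mem_mb > free_mem or vcores > free_cpu:
            return False
        self.grants[service] = (mem_mb, vcores)
        return True


# Illustrative numbers only: a 16 GB, 8-core machine.
node = Node(mem_mb=16384, vcores=8)
hbase_ok = node.request("hbase", mem_mb=12288, vcores=4)
# This request would oversubscribe memory, so it is refused.
mapreduce_ok = node.request("mapreduce", mem_mb=8192, vcores=4)
# A smaller request fits in the remaining capacity.
mapreduce_retry = node.request("mapreduce", mem_mb=4096, vcores=4)
```

Because both services ask the same bookkeeper for capacity, the second request is turned down up front instead of the two workloads competing for the same memory at run time.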

It's a question of thinking about the wider architecture: "As we bring Hadoop to the masses, it's really important to provide a clean, consistent, resource-management framework," Murthy said.

For that framework to be truly consistent, it must be capable of supporting data services and not just data applications.

"Because we're really talking about over time people starting off with 10, 15 or 20 nodes and they quickly get to 200. If we do our job well, they'll get to 2,000 and 5,000 and 20,000. So if you go at that scale, you're talking about tens of millions of dollars of capex and opex," he said.

"If you want to be able to do that, you've got to be able to manage all these resources in a consistent manner, whether it's CPU, or disk or memory, or network."

According to Murthy, software vendors that have built a service on an application traditionally have to negotiate with the customer and its in-house IT shop over how to install that service in the datacentre, which can entail many months of talks.

"We expect Hadoop and YARN to be in everybody's datacentre. It's increasingly true today but in six or 12 months it's going to be absolutely true," he said.

"So if you can assume that Hadoop and YARN exist and you can work on top of YARN-Slider, you can assume so much more about the environment.

"You're now no longer in a conversation with the IT shop or the IT side of the business; you're now in a conversation with the actual line of business where you can demonstrate a use case and you can demonstrate value."

Consequently, Murthy said the target audience for Slider is not so much the individual developer as independent software vendors and partners.

"What we'll do is we'll work with these guys to bind Slider in their database, bind Slider in their analytic app, bind Slider in their ETL [extract, transform, load] app — essentially any service, and a service is something that runs forever and provides a service to multiple users, not just one user," he said.

"We're already seeing partners taking their existing applications — whether they are analytics applications or ETL applications — and they don't want to change a lot.

"But they want to come on top of Hadoop and have access to the resources that Hadoop has and have access to the data that Hadoop has."

Murthy said people have ambitious plans for projects on top of YARN-Slider and Hadoop.

"We are seeing people build web farms (JBoss, Tomcat containers) on top of Hadoop. But we've got to take it a step at a time," he said.

"We're big believers in crawl, walk, run. The crawl phase is HBase, Accumulo, and other databases running on top of YARN."
