Having worked on Hadoop since day one in 2006, Hortonworks co-founder Arun Murthy is clear about the significance of the latest version of the open-source big-data technology.
"Hadoop 2 is a big step. I've worked with Hadoop for seven and a half years and this is the first big architectural change they've done, especially with YARN," Murthy said.
Earlier this month, the Apache Software Foundation announced the general availability of Hadoop 2, and Hortonworks launched what it describes as the first commercial distribution based on the release.
YARN is the next-generation Hadoop MapReduce project that Murthy has been leading.
"It opens up Hadoop to so many new use cases, whether it's real-time event processing, or interactive SQL. Machine learning is another example — people are building native machine-learning apps on top of Hadoop right now, thanks to YARN," he said.
Murthy says he started thinking about YARN as a concept in late 2007, with development work starting in earnest in 2010 while he was at Yahoo.
"You have this entire data lifecycle from real-time event processing, to human interactive query capability, to batch. Hadoop was only good at one of them in terms of processing. So what we did was — this was 2008 — we looked at it and said we've got to do something better for Hadoop and the ecosystem and the customer."
Hadoop consists of two main elements: HDFS and MapReduce. HDFS is the storage aspect and you can put any data you want on it, Murthy said, but MapReduce has been the only way to process it.
"And MapReduce, although it's great for a lot of tasks, is not a silver bullet. The insight was that MapReduce was actually two things. It was the system to take the application from the user and run it. So it's sort of the operating system, if you will," he said.
"That was the system and then there was the user-facing framework and APIs and so on that the end-user used. So you have the system and framework. What we did was we separated the MapReduce system from the MapReduce framework."
The result is a system to run applications, with MapReduce becoming just one of the applications running on that system, instead of it being the only application.
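The split Murthy describes can be pictured with a toy sketch. It is not YARN's API: `ClusterSystem`, `make_batch_job` and `make_stream_filter` are hypothetical names invented for illustration. The "system" only tracks capacity and runs whatever application it is handed, and the batch job is just one such application.

```python
from collections import Counter
from typing import Callable, Iterable, List


class ClusterSystem:
    """Hypothetical 'system' half: owns capacity and runs any application."""

    def __init__(self, total_containers: int):
        self.free = total_containers
        self.completed: List[str] = []

    def submit(self, name: str, containers: int, app: Callable[[int], object]):
        # The system knows nothing about what the application does.
        if containers > self.free:
            raise RuntimeError("not enough free containers for " + name)
        self.free -= containers
        try:
            result = app(containers)
        finally:
            self.free += containers  # always hand the capacity back
        self.completed.append(name)
        return result


def make_batch_job(lines: List[str]) -> Callable[[int], dict]:
    """Hypothetical batch 'framework': word count split across containers."""
    def app(containers: int) -> dict:
        total = Counter()
        for i in range(containers):
            for line in lines[i::containers]:  # one slice per container
                total.update(line.split())
        return dict(total)
    return app


def make_stream_filter(events: Iterable, predicate: Callable) -> Callable[[int], list]:
    """Hypothetical event-processing 'framework': keep matching events."""
    def app(containers: int) -> list:
        # The container count is ignored in this toy version.
        return [e for e in events if predicate(e)]
    return app


cluster = ClusterSystem(total_containers=4)
word_counts = cluster.submit("batch", 2, make_batch_job(["a b", "a"]))
big_events = cluster.submit("stream", 1, make_stream_filter([1, 5, 9], lambda e: e > 4))
```

Both jobs go through the same `submit` call, which is the point of the separation: adding a new kind of application does not require changing the system.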
"That's where YARN comes in. It becomes this datacentre Hadoop operating system."
Murthy's analogy is that Hadoop with only MapReduce was like Microsoft Windows having Notepad as its only application.
"The operating system and the app were fused. We took it apart and now we have Windows with Notepad and Word and PowerPoint and everything else," he said.
Murthy said the integration with the Apache Storm real-time event processing system is an example of the applications you can have running on top of YARN.
"It sits there watching events stream in memory and you can write a bit of code to catch something in the event and process it in real time. So, for example, you could say, 'What's the number of errors coming in from this cell phone tower?' or 'What are the locations at which this credit card has been used?'," he said.
"If one location is London and one is California, then you can have a bit of logic that says if within 30 minutes you see transactions from the same credit card coming in from locations a very large distance apart, then something's wrong."
Murthy said that once you have YARN you can still use MapReduce for batch processing, but you can also run Storm for real-time event processing, as well as new engines such as Apache TEZ, which allows you to do interactive SQL at large scale with short latencies.
"On top of YARN, Facebook and LinkedIn have built the Giraph application, a specialised framework for graph processing and used for social-network analytics — who's connected to whom. So you're seeing a lot of projects in the ecosystem being built on top of YARN," Murthy said.
"If you think of YARN as a sort of operating system on which you build applications, now a lot of people are building the applications, not just open source but also proprietary vendors."
Murthy believes two of the fundamental advantages that YARN provides are the ability to experiment more rapidly and the ability to create alternative frameworks.
"Separating the MapReduce system from the framework allows us to build alternatives — TEZ is an example," he said.
"MapReduce is actually a really simple framework. It's got map and reduce — you can think of it as a graph with just two nodes — a map node and a reduce node. With TEZ, you actually get much more complexity. You can have maps and maps and maps, reduces and reduces and reduces and do different aggregations in different ways."
Although the creation of YARN has involved at least three years' work, much development lies ahead to turn it and Hadoop into a more integral part of the overall enterprise data architecture.
"The way we look at it is you've got Hadoop, which is going to be a really core part of the enterprise architecture, but it's also got to play really well with the rest of the ecosystem. So one of the things we've done is spend a lot of time making sure Hadoop and YARN work really well on Microsoft Windows," Murthy said.
"YARN itself, if you think of it as a datacentre operating system, we've got lots of development ahead of us to add more scheduling features, higher availability, more enterprise-class features and to scale even more. We're confident we can scale to maybe 5,000 to 10,000 nodes at this point but why stop there? Why not go to 30,000, 40,000 nodes?"
Murthy said Hortonworks focuses on open source not only because it provides the best way to innovate, "because you can get help from the Facebooks and the LinkedIns and the Yahoos and the Twitters of this world", but also because it is the best way to build the Hadoop ecosystem rapidly.
"It's also so gratifying to see so many people use and — I won't say abuse — but at least scream at my code. In my relatively short career, I've learned that people either use what you've done or they just ignore it. They either scream at you or ignore you. I'd rather be screamed at."