Pepperdata CEO Sean Suchter can claim, with justifiable pride, a notable place in the history of Apache Hadoop. When he worked at Yahoo, he managed the world's first production deployment of the distributed computing platform.
His experience running Yahoo's web search engine team for many years has shaped his perceptions of Hadoop technology — its strengths and limitations — and feeds directly into his present project.
"At the beginning — this was before it was in production — it was on 10 nodes. We were happy if it stayed up for a day," Suchter said.
"But then as we started going to production on it, we were using it to do full web-scale processing of all the data we had from the web."
That was in 2006. The second deployment of Hadoop was carried out by Suchter's Pepperdata co-founder, Chad Carson, who was working on advert optimisation for Yahoo's sponsored-search product.
Although Hadoop is now a far more potent framework than it was when Suchter first used it at Yahoo all those years ago, he believes some of the issues people experience now would be instantly recognisable to anyone who had used it in its earliest days.
"We have seen these symptoms before and been in the guts of how Hadoop was used in the very early days and how it formed," Suchter said.
"We saw these patterns emerge that once people start counting on Hadoop in production, they hit the same issues over and over again. It was a very familiar pattern.
"Hadoop is so powerful and tries to use so much hardware. But it has really no idea of what it's actually doing and how hard it's hitting things and when it's hitting limits and when it's not hitting the limits. Because at its core it's a distributed scheduling system."
Suchter gave one example of the resource problems that can span CPU, RAM, disk and network.
"There was one time in which Yahoo search — like the actual production-facing search — was taken down by Hadoop. Let me explain how that happened," he said.
"The original version of Hadoop was meant to do large-scale batch processing. Now it does a lot more. But Hadoop at the time could launch a lot of things really quickly and use up a lot of hardware resource. There was one thing — a big job — that got launched and saturated the network.
"This Hadoop job ran and took 100 percent of the network bandwidth for several minutes and that killed us.
"But of course the Hadoop job didn't know. Hadoop had no idea of what was going on around it. It just said, 'My job is to run this as fast as possible and let me do so'.
"If it had run at 95 percent of the network bandwidth, this would have been a non-issue. It would have taken one-twentieth longer to execute. That's not a big deal. But it was a huge deal, of course — with a very large business impact."
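Suchter's back-of-the-envelope arithmetic holds up: a job throttled to 95 percent of bandwidth takes 1/0.95 of its original time, a slowdown of roughly five percent, or about one-twentieth. A quick check:

```python
# Verifying the slowdown Suchter describes: capping a network-bound job
# at 95% of bandwidth stretches its runtime by 1/0.95.
full_speed_time = 1.0                      # normalised runtime at 100% bandwidth
throttled_time = full_speed_time / 0.95    # runtime at 95% bandwidth
slowdown = throttled_time - full_speed_time

print(f"extra runtime: {slowdown:.3f}")    # ~0.053, i.e. roughly one-twentieth longer
```

The trade he describes is that five percent of extra batch runtime, in exchange for never saturating the link that production search depended on.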
The arrival of the YARN resource-management layer has made Hadoop even more versatile while also adding to the optimisation issues.
"The more complexity you put on the cluster, the more different conflicts and esoteric conflicts and things that happen second by second, where the issue may exist for a few seconds and cause a lot of impact and then go away. There's no way you can plan for that," Suchter said.
Nevertheless, he is adamant that YARN is a great step forward because it allows people to do more powerful things with Hadoop.
"An analogy is that Hadoop was this great innovation, like paved highways. You could drive a lot of trucks on it and they'd pretty reliably get to your destination," Suchter said.
"With YARN you don't have to have just trucks. You can put cars and motorcycles and Formula 1 racers and donkeys and whatever the heck you want on the road. That's a great innovation.
"But if you, for example, have an ambulance that's coming along on some emergency, it needs to get through in a very predictable way. That ambulance is going to need to come through a second before it hits the next car. Let's move that car out of the way so it can get through and they can come back right after."
This is where the real-time cluster supervisor that Pepperdata has been developing comes in, according to Suchter.
"Different things are going to run on the cluster and you're going to have things that are very important to your business with very tight SLAs compared with batch MapReduce jobs. Every time somebody uses some compute, every time somebody sends a packet, every time somebody seeks a disk, the real-time cluster optimiser is aware of that," Suchter said.
"It's watching on every node everything that's happening. It's making global and local decisions that say, 'Hey, this is a really important application to the business. I can see some ad hoc job that's really not important. Let me tell it to go a little bit slower — just enough that the high-priority job can get prioritised access'.
"The flipside of this is, of course, the optimiser knows what the hardware usage across all four of those resources is at any given point, so it knows all the holes. Not only are there problems where people hit the real limits of the hardware but there are also problems where they don't and they leave capacity on the table."
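The behaviour Suchter describes can be sketched in miniature. This is not Pepperdata's implementation, just an illustrative model under assumed names (`Job`, `rebalance`, the 95 percent limit): watch each job's usage of the four resources and, when any resource nears saturation, tell low-priority jobs to slow down so high-priority work keeps its headroom.

```python
# A minimal sketch (hypothetical, not Pepperdata's actual code) of
# priority-aware throttling across the four resources Suchter names.
from dataclasses import dataclass, field

RESOURCES = ("cpu", "ram", "disk", "network")

@dataclass
class Job:
    name: str
    high_priority: bool
    usage: dict = field(default_factory=dict)  # fraction of each resource in use
    throttled: bool = False

def rebalance(jobs, limit=0.95):
    """Throttle low-priority jobs on any resource that is near saturation."""
    for res in RESOURCES:
        total = sum(j.usage.get(res, 0.0) for j in jobs)
        if total > limit:                      # resource is close to its hardware limit
            for j in jobs:
                if not j.high_priority:
                    j.throttled = True         # tell the ad hoc job to go a bit slower
    return jobs

jobs = [
    Job("search-index", high_priority=True,  usage={"network": 0.6}),
    Job("ad-hoc-query", high_priority=False, usage={"network": 0.5}),
]
rebalance(jobs)
print([j.name for j in jobs if j.throttled])   # only the ad hoc job is slowed
```

The same bookkeeping also exposes the "holes" he mentions: when no resource is near its limit, the totals show exactly how much spare capacity is being left on the table.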
Suchter argues that the way Hadoop has historically been adopted means businesses find themselves dependent on it but without appropriate enterprise-level resource management in place.
"When people just get into Hadoop and they just start using it, they're not counting on it at all. It's a new technology. It's like a trial and they're not sure.
"Somebody started using it, got a really great data product as a result and all of a sudden it's, 'Let's put that in production because we can make a lot more money if we do that. And by the way, that needs to be regenerated every day, every half-day, every hour. We need an SLA on that.' Poof, you're in production. That's the pattern. Sometimes it's more deliberate than that but a lot of the time it's organic."