Big data projects: Is the hardware infrastructure overlooked?

Most conversations about big data revolve around business cases, keeping every bit of information possible and discovering game-changing insights.

But aside from a few detours into storage, the actual infrastructure underneath big data applications is often overlooked. It shouldn't be.

I caught up with Gary Tyreman, Univa CEO, to talk hardware infrastructure and big data. Tyreman's theory is that big data will lead to more high-performance computing (HPC) in the enterprise. Univa develops Grid Engine, HPC software originally developed by Sun Microsystems.

Sun released the Grid Engine code in 2001, and it was used in more than 10,000 data centers when Oracle bought Sun in January 2010. By the end of 2010, Oracle closed Grid Engine's open source community and wound down the HPC business Sun created. In January 2011, Univa hired the core Grid Engine development team and continued developing the software. Now Univa competes with Oracle's Grid Engine.

Univa has been entering the big data market as its customers ask for help. Univa developed the architecture that Archimedes, a TechLines panelist, uses for its Hadoop workloads.

Here are the highlights of my conversation with Tyreman:

Are hardware issues overlooked in all the big data talk? "I don't know if they are forgetting or just not appreciating the challenges," said Tyreman. "Hadoop nodes today are 10 or less so it's not hard to get it working. Companies are underestimating how much it takes to roll into production and get it running." In a nutshell, there's a big jump from running a Hadoop pilot to scaling it in production.

What's the solution? Tyreman said that clusters today are one way to get big data environments set up. Time has to be put in to configure the software behind the infrastructure, set up storage and fix network settings. "If those configurations take two days it's not a big deal, but then it is rolled into production and there are more complications," he said.

Why isn't hardware a consideration? At this juncture, companies are primarily focused on the outcome of big data and what can be done. Enterprises need to focus on the outcome as well as what they want to know. Existing business intelligence tools also have to be considered.

The companies that will get the big data game down initially are the ones that have already invested in high-performance computing. "When CIOs get to the point where they actually have to decide where big data infrastructure goes they'll have to consider hardware," said Tyreman. Enterprises will talk to Oracle, HP, IBM and Dell to get the best deal. Storage costs will be tricky. Big data needs more than a few "beefy drives" since data scientists want to be packrats.

What are you seeing in the field? Tyreman said customers have been approaching his company for advice. "If customers have an investment in HPC that's what they are leveraging," he said. "The reason isn't the hardware but operational expertise." Companies need to figure out how to move data from storage systems made by the likes of NetApp and EMC into a Hadoop framework and back again.

Are appliances a cure? Appliances solve the initial problems with set-up and configuration, but don't address the core issues with operations on an ongoing basis. A server focused on big data doesn't operate the way one cut out for ERP would.