Big data is the latest trend to obsess the technology industry, but companies need to be careful about where they deploy its tools or they could find themselves at the sharp end of a stinging bill.
Many cloud computing vendors offer big-data tools that, paired with the ability to rent scalable on-demand compute and storage resources, can provide a potent on-paper justification for analysing data in the cloud.
However, the cost and time involved in moving large quantities of data into and out of a cloud mean that businesses crunching big datasets there could be hit by unforeseen bills.
"Data has a lot of inertia," Charles Zedlewski, Hadoop-vendor Cloudera's vice president of products, says. "If you run [Hadoop] in the public cloud, that application is going to spit off data. If your data is generated in the [cloud] datacentre, your Hadoop [cluster] is going to be in that datacentre."
Hadoop is a popular open-source platform for data storage and big data analysis. It is based on technologies originally developed by Google to help the search giant store and query massive amounts of data.
Cloudera aims to be to Hadoop what Canonical is to Ubuntu, a proposition that saw the company close a $65m funding round on Thursday.
Location, location, location
Hadoop is representative of a problem facing the big data industry as a whole: where you locate your data storage and analysis engine will determine where the data ultimately resides. And it's for that reason that choosing to locate your big data in the cloud, or in your own datacentre, could have huge ramifications down the line.
For example, if a company wants to analyse its customer data, it could buy in several low-cost servers with large amounts of storage based on chassis designs from companies like Supermicro and run a Hadoop cluster on top. That would give the company control over its infrastructure, where its data resides and the cost of its kit.
However, if the business gets a sudden spike in data that it doesn't have the capacity to process in a timely manner, it will need to kick this data up into the cloud to process and analyse it as an entire set.
For that, it will pay the typical charges to its ISP, along with the fees for renting the associated storage and compute in Amazon, Google, Microsoft or other vendors' public clouds. Upon completing the processing, it may even have to pay additional charges to get data out of the cloud and back into its datacentre.
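As a rough illustration of how those charges add up, the sketch below totals the cloud-side cost of a single burst job. The CloudRates fields, the burst_cost function and every price used are hypothetical placeholders rather than any vendor's actual tariff, and ISP charges are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudRates:
    """Hypothetical per-unit prices; real tariffs vary by vendor and region."""
    ingress_per_gb: float         # charge to move data into the cloud
    egress_per_gb: float          # charge to move data back out
    storage_per_gb_month: float   # charge to keep data in cloud storage
    compute_per_node_hour: float  # charge to rent one compute node for an hour


def burst_cost(rates: CloudRates, data_gb: float, result_gb: float,
               nodes: int, hours: float, months_stored: float) -> dict:
    """Break down the cloud-side cost of one burst job into its components."""
    upload = data_gb * rates.ingress_per_gb
    storage = data_gb * rates.storage_per_gb_month * months_stored
    compute = nodes * hours * rates.compute_per_node_hour
    download = result_gb * rates.egress_per_gb
    return {
        "upload": upload,
        "storage": storage,
        "compute": compute,
        "download": download,
        "total": upload + storage + compute + download,
    }


if __name__ == "__main__":
    # Placeholder prices chosen only to make the arithmetic easy to follow.
    example_rates = CloudRates(
        ingress_per_gb=0.0,
        egress_per_gb=0.10,
        storage_per_gb_month=0.05,
        compute_per_node_hour=0.50,
    )
    for item, cost in burst_cost(example_rates, data_gb=1000, result_gb=200,
                                 nodes=10, hours=24, months_stored=1).items():
        print(f"{item}: {cost:.2f}")
```

Keeping the components separate makes it easier to see which one dominates as the dataset grows: storage and transfer scale with data volume, while compute scales with the size and duration of the rented cluster.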
This is an example of an effect known as "data gravity", which has been outlined by researcher and former EMC employee Dave McCrory. Put simply, data gravity means that the infrastructure where you perform actions upon a dataset will attract more and more data over time and get more and more difficult to drastically change.
This example illustrates why data location can have a big impact on a company's bottom line. For some web applications it will make sense for data analysis to be done in the cloud, but for others the value is doubtful.
"Where public cloud shines is the more moderate amounts of data, where you care about bursting and care about expandability," Zedlewski says.
Workloads that should be kept in the datacentre are those that generate a huge amount of information that will need to be repeatedly worked on and enlarged, such as those that convert information from the physical world into digital information, like genetic sequencing.
For example, a small sequencing machine can generate a minimum of a terabyte of information for each operation. This data then needs to be fed through the network and into Hadoop, where it is stored and then worked upon. Companies that do this kind of work have many of these machines running in parallel. Uploading, processing and downloading terabytes of information to and from a remote public cloud is not a trivial matter, and the cost and time involved can be considerable.
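To get a feel for the time involved, the following back-of-the-envelope sketch estimates how long a dataset takes to cross a network link. The function name, the link speeds and the 80 percent efficiency factor are assumptions made for illustration; real throughput depends on the connection and on protocol overhead.

```python
def transfer_hours(size_terabytes: float, link_mbps: float,
                   efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move a dataset over a network link.

    size_terabytes: dataset size in decimal terabytes (1 TB = 10**12 bytes)
    link_mbps:      nominal link speed in megabits per second
    efficiency:     assumed fraction of nominal speed actually achieved
    """
    if size_terabytes < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, speed > 0, efficiency in (0, 1]")
    total_bits = size_terabytes * 1e12 * 8
    effective_bits_per_second = link_mbps * 1e6 * efficiency
    return total_bits / effective_bits_per_second / 3600


if __name__ == "__main__":
    # Illustrative link speeds only; one terabyte matches the per-run
    # minimum quoted for a small sequencing machine.
    for mbps in (100, 1000, 10000):
        print(f"1 TB over {mbps} Mbps: {transfer_hours(1, mbps):.1f} hours")
```

Even with a fully used 1Gbps link, a single terabyte works out to a little over two hours each way, and that figure multiplies with every machine running in parallel.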
"The rental fees you're paying in public cloud are five-10 times higher" than on-premise, Zedlewski says. (Cloudera's distribution of Hadoop is run both in public clouds and on premise.)
It all goes to show that although the economics of rentable technology are clear in some cases, such as for scaling websites or one-time number crunching, in others they can be rather cloudy.