Big data: all you need to know

What is big data?

As with cloud, what one person means when they talk about big data might not necessarily match up with the next person's understanding.

The easy definition

Just by looking at the term, one might presume that big data simply refers to the handling and analysis of large volumes of data.

According to the McKinsey Global Institute's report "Big data: The next frontier for innovation, competition and productivity", big data refers to datasets where the size is beyond the ability of typical database software tools to capture, store, manage and analyse. And the world's data repositories have certainly been growing.

In IDC's mid-year 2011 Digital Universe Study (sponsored by EMC), it was predicted that 1.8 zettabytes (1.8 trillion gigabytes) of data would be created and replicated in 2011 — a ninefold increase over what was produced in 2006.

The more complicated definition

Yet, big data is more than just analysing large amounts of data. Not only are organisations creating a lot of data, but much of this data isn't in a format that sits well in traditional, structured databases — weblogs, videos, text documents, machine-to-machine data or geospatial data, for example.

This data also resides in a number of different silos (sometimes even outside of the organisation), which means that although businesses might have access to an enormous amount of information, they probably don't have the tools to link the data together and draw conclusions from it.

Add to that the fact that data is being updated at shorter and shorter intervals (giving it high velocity), and you've got a situation where traditional data-analysis methods cannot keep up with the large volumes of constantly updated data, paving the way for big-data technologies.

The best definition

In essence, big data is about liberating data that is large in volume, broad in variety and high in velocity from multiple sources in order to create efficiencies, develop new products and be more competitive. Forrester puts it succinctly in saying that big data encompasses "techniques and technologies that make capturing value from data at an extreme scale economical".

Real trend or just hype?


The doubters

Not everyone in the IT industry is convinced that big data is really as "big" as the hype that it has created. Some experts say that just because you have access to piles of data and the ability to analyse it doesn't mean that you'll do it well.

A report by the Economist Intelligence Unit, "Big data: Harnessing a game-changing asset", sponsored by SAS, quotes Peter Fader, professor of marketing at the University of Pennsylvania's Wharton School, as saying that the big-data trend is not a boon to businesses right now, as the volume and velocity of the data reduce the time we spend analysing it.

"In some ways, we are going in the wrong direction," he said. "Back in the old days, companies like Nielsen would put together these big, syndicated reports. They would look at market share, wallet share and all that good stuff. But there used to be time to digest the information between data dumps. Companies would spend time thinking about the numbers, looking at benchmarks and making thoughtful decisions. But that idea of forecasting and diagnosing is getting lost today, because the data are coming so rapidly. In some ways we are processing the data less thoughtfully."

One might argue that there's limited competitive advantage to spending hours mulling over the ramifications of data that everyone's got, and that big data is about using new data and creating insights that no one else has. Even so, it's important to assign meaning and context to data quickly, and in some cases this might be difficult.

Henry Sedden, VP of global field marketing for Qlikview, a company that specialises in business intelligence (BI) products, calls the masses of data that organisations are hoping to pull into their big-data analyses "exhaust data". He said that in his experience, companies aren't even managing to extract information from their enterprise resource-planning systems, and are therefore not ready for more complex data analysis.

"I think it's a very popular conversation for vendors to have," he said, "but most companies, they are struggling to deal with the normal data in their business rather than what I call the exhaust data."

Deloitte director Greg Szwartz agrees.

"Sure, if we could crack the code on big data, we'd all be swimming in game-changing insights. Sounds great. But in my day-to-day work with clients, I know better. They're already waging a war to make sense of the growing pile of data that's right under their noses. Forget big data — those more immediate insights alone could be game changers, and most companies still aren't even there yet. Even worse, all this noise about big data threatens to throw them off the trail at exactly the wrong moment."

However, Gartner analyst Mark Beyer believes there can be no such thing as data overload, because big data is a fundamental change in the way that data is seen. Firms that don't grapple with the masses of information that big data makes available will miss out on an opportunity that will see their peers outperform them by 20 per cent in 2015.

A recent O'Reilly Strata Conference survey of 100 conference attendees found that:

18 per cent already had a big-data solution
28 per cent had no plans at the time
22 per cent planned to have a big-data solution in six months
17 per cent planned to have a big-data solution in 12 months
15 per cent planned to have a big-data solution in two years.

A US survey by Techaisle of 800 small to medium businesses (SMBs) showed that despite their size, one third of the companies that responded were interested in introducing big data. A lack of expertise was their main problem.

Seeing these numbers, can companies afford not to jump on the bandwagon?


Is there a time when it's not appropriate?

Szwartz doesn't think that companies should dive into big data if they don't think it will deliver the answers they're looking for. This is something that Jill Dyché, vice president of Thought Leadership for DataFlux Corporation, agrees with.

"Business leaders must be able to provide guidance on the problem they want big data to solve, whether you're trying to speed up existing processes (like fraud detection) or introduce new ones that have heretofore been expensive or impractical (like streaming data from "smart meters" or tracking weather spikes that affect sales). If you can't define the goal of a big-data effort, don't pursue it," she said in a Harvard Business Review post.

This process requires an understanding of which data will provide the best decision support. If that data is best analysed using big-data technologies, then it's likely time to go down that path. If it is best analysed using regular BI technologies, then perhaps it's better to give big data a miss.

How is big data different to BI?

Fujitsu Australia executive general manager of marketing and chief technology officer Craig Baty said that while BI is descriptive, looking at what the business has done in a certain period of time, the velocity of big data allows it to be predictive, providing information on what the business will do. Big data can also analyse more types of data than BI, which moves it on from the structured data warehouse, Baty said.

Matt Slocum from O'Reilly Radar said that while big data and BI both have the same aim — answering questions — big data is different to BI in three ways:

1. It's about more data than BI, and this is certainly a traditional definition of big data
2. It's about faster data than BI, which means exploration and interactivity, and in some cases delivering results in less time than it takes to load a web page
3. It's about unstructured data, which we only decide how to use after we've collected it, and [we] need algorithms and interactivity in order to find the patterns it contains.

According to an Oracle whitepaper titled "Oracle Information Architecture: An Architect's Guide to Big Data", we also treat data differently in big data than we do in BI:

Big data is unlike conventional business intelligence, where the simple summing of a known value reveals a result, such as order sales becoming year-to-date sales. With big data, the value is discovered through a refining modelling process: make a hypothesis, create statistical, visual or semantic models, validate, then make a new hypothesis. It either takes a person interpreting visualisations or making interactive knowledge-based queries, or by developing "machine-learning" adaptive algorithms that can discover meaning. And, in the end, the algorithm may be short lived.

How can we harness big data?


The technologies

RDBMS

Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational database model, where data and the relationship between the data were stored in tables. The data was processed and stored in rows.

Databases have progressed over the years, however, and are now using massively parallel processing (MPP) to break data up into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data in rows, the databases can also employ columnar architectures, which enable the processing of only the columns that have the data needed to answer the query and enable the storage of unstructured data.
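To illustrate the difference between row and columnar storage, here is a minimal Python sketch. The sales records, field names and helper functions are all made up for illustration; a real columnar database also adds compression, indexing and on-disk layout tricks that aren't shown here.

```python
# Hypothetical sales records, used only for illustration.
rows = [
    {"order_id": 1, "region": "APAC", "amount": 120.0},
    {"order_id": 2, "region": "EMEA", "amount": 75.5},
    {"order_id": 3, "region": "APAC", "amount": 30.0},
]


def total_amount_row_store(records):
    """Row layout: every full record is visited to read a single field."""
    return sum(record["amount"] for record in records)


def to_column_store(records):
    """Re-arrange the records so that each column is its own list."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_column_store(columns):
    """Column layout: only the 'amount' column is read for this query."""
    return sum(columns["amount"])


columns = to_column_store(rows)
print(total_amount_row_store(rows), total_amount_column_store(columns))
```

Both functions return the same total; the point is that the columnar version never has to touch the order IDs or regions to answer the question.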

MapReduce

MapReduce combines two functions to process data in parallel. First, the data is split across multiple nodes, and the map function is run on each portion in parallel. The reduce function then combines the results of those calculations into a set of responses.

Google used MapReduce to index the web, and has been granted a patent for its MapReduce framework. However, the MapReduce method has now become commonly used, with the most famous implementation being in an open-source project called Hadoop (see below).
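The map and reduce steps can be sketched in a few lines of Python using the classic word-count example. This is a single-machine illustration only: the function names and the sample text are assumptions, and in a real cluster each chunk would be mapped on a different node, with the framework handling the grouping step in between.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all of the values for one key into one result."""
    return key, sum(values)


# Each string stands in for a portion of data held on a separate node.
chunks = ["big data big value", "data at scale"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```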

Massively parallel processing (MPP)

Like MapReduce, MPP processes data by distributing it across a number of nodes, which each process an allocation of data in parallel. The output is then assembled to create a result.

However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally used on expensive specialised hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on commodity hardware.
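As a rough illustration of the MPP idea, the Python sketch below answers the equivalent of a SQL "SELECT SUM(amount)" query by giving each "node" a slice of the data and then assembling the partial results. Threads stand in for the separate machines of a real MPP system, and the function names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, node_count):
    """Deal the values out across the nodes, round-robin."""
    return [values[i::node_count] for i in range(node_count)]


def node_sum(part):
    """The work that each node does on its own allocation of data."""
    return sum(part)


def parallel_sum(values, node_count=4):
    """Run the per-node work in parallel, then assemble the result."""
    parts = partition(values, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_sum, parts))
    return sum(partials)


amounts = list(range(1, 101))  # stand-in for an 'amount' column
total = parallel_sum(amounts)
print(total)
```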

Complex event processing (CEP)

Complex event processing involves processing time-based information in real time from various sources; for example, location data from mobile phones or information from sensors to predict, highlight or define events of interest. For example, information from sensors might lead to predicting equipment failures, even if the information from the sensors seems completely unrelated. Conducting complex event processing on large amounts of data can be enabled using MapReduce, by splitting the data into portions that aren't related to one another. For example, the sensor data for each piece of equipment could be sent to a different node for processing.
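The equipment example can be sketched as follows in Python. Everything here is an assumption made for illustration: the equipment names, the threshold, the window length, and the rule itself (flag a machine once three consecutive readings are above the threshold). Readings are routed to a node by equipment ID, so each node can look for events without needing data from the others.

```python
from collections import defaultdict, deque

NODE_COUNT = 3    # hypothetical number of processing nodes
THRESHOLD = 90.0  # hypothetical temperature limit
WINDOW = 3        # consecutive readings needed to raise an event


def node_for(equipment_id):
    """Route every reading from one piece of equipment to the same node."""
    return sum(equipment_id.encode("utf-8")) % NODE_COUNT


def detect_events(readings):
    """Flag equipment whose last WINDOW readings all exceed THRESHOLD."""
    windows = defaultdict(lambda: deque(maxlen=WINDOW))
    alerts = []
    for timestamp, equipment_id, temperature in readings:
        window = windows[equipment_id]
        window.append(temperature)
        if len(window) == WINDOW and all(t > THRESHOLD for t in window):
            alerts.append((timestamp, equipment_id))
    return alerts


# (timestamp, equipment ID, temperature) readings in arrival order.
readings = [
    (1, "pump-a", 88.0), (2, "pump-b", 95.0), (3, "pump-a", 91.0),
    (4, "pump-b", 96.0), (5, "pump-a", 92.0), (6, "pump-b", 97.0),
    (7, "pump-a", 93.0),
]

# Split the stream into unrelated portions, one per node.
by_node = defaultdict(list)
for reading in readings:
    by_node[node_for(reading[1])].append(reading)

alerts = sorted(
    alert
    for node_readings in by_node.values()
    for alert in detect_events(node_readings)
)
print(alerts)
```

Because the window is kept per piece of equipment, the alerts are the same however the equipment is spread across the nodes.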

Hadoop

Derived from MapReduce technology, Hadoop is an open-source framework to process large amounts of data over multiple nodes in parallel, running on inexpensive hardware.

Data is split into sections and loaded into a file store — for example, the Hadoop Distributed File System (HDFS), which is made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes. The data is replicated over more than one node, so that even if a node fails, there's still a copy of the data.
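A simplified picture of this arrangement, in Python, is sketched below. The node names, block size and round-robin placement are assumptions chosen to keep the example short; real HDFS uses far larger blocks and a more sophisticated placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size sections (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=3):
    """Build a name-node-style index: block ID -> nodes holding a copy."""
    index = {}
    for block_id in range(len(blocks)):
        index[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return index


def readable_blocks(index, failed_node):
    """Blocks that still have at least one copy if failed_node is lost."""
    return {
        block_id
        for block_id, holders in index.items()
        if any(node != failed_node for node in holders)
    }


nodes = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical cluster
blocks = split_into_blocks(b"x" * 1000, block_size=256)
index = place_blocks(blocks, nodes, replication=3)

for block_id, holders in index.items():
    print(block_id, holders)
```

With three copies of each block on different nodes, losing any single node in this sketch still leaves every block readable.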

The data can then be analysed using MapReduce, which discovers from the name node where the data needed for calculations resides. Processing is then done at those nodes in parallel. The results are aggregated to determine the answer to the query and then loaded onto a node, where they can be further analysed using other tools. Alternatively, the data can be loaded into traditional data warehouses for use with transactional processing.

Apache Hadoop, the Apache Software Foundation's own release, is considered to be the most noteworthy Hadoop distribution.

NoSQL

NoSQL database-management systems are unlike relational database-management systems, in that they do not use SQL as their query language. The idea behind these systems is that they are better for handling data that doesn't fit easily into tables. They dispense with the overhead of indexing, schema and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.

Cassandra

Cassandra is a NoSQL database alternative to Hadoop's HDFS.

Hive

File stores like Hadoop's make ad hoc query and analysis difficult, as programming the map/reduce functions that are required can be hard. Realising this when working with Hadoop, Facebook created Hive, which converts SQL queries to map/reduce jobs to be executed using Hadoop.
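To give a feel for what that conversion involves, the Python sketch below shows one way that a simple SQL aggregate could be expressed as a map step and a reduce step. It is a conceptual illustration only, not the plan that Hive actually generates, and the table, columns and function names are hypothetical.

```python
from collections import defaultdict

# The query being illustrated (hypothetical table and columns).
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region"

# Rows of the hypothetical 'sales' table: (region, amount).
sales = [("APAC", 120.0), ("EMEA", 75.5), ("APAC", 30.0)]


def map_row(row):
    """Map: emit the GROUP BY column as the key, the summed column as the value."""
    region, amount = row
    return region, amount


def reduce_group(values):
    """Reduce: apply the aggregate (SUM) to all values for one key."""
    return sum(values)


groups = defaultdict(list)
for key, value in map(map_row, sales):
    groups[key].append(value)

result = {region: reduce_group(values) for region, values in groups.items()}
print(result)
```

The analyst writes only the one-line query; the map and reduce functions are the part that would otherwise have to be hand-coded.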

Vendors

There is scarcely a vendor that doesn't have a big-data plan in train, with many companies combining their proprietary database products with the open-source Hadoop technology as their strategy to tackle velocity, variety and volume. For an idea of how many vendors are operating in each area of the big-data realm, this big-data graphic from Forbes is useful.

Many of the early big-data technologies came out of open source, posing a threat to traditional IT vendors that have packaged their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has also provided an opportunity for traditional IT vendors, because enterprise and government often find open-source tools off-putting.

Therefore, traditional vendors have welcomed Hadoop with open arms, packaging it into their own proprietary systems so they can sell the result to enterprise as more comfortable and familiar packaged solutions.

Below, we've laid out the plans of some of the larger vendors.

Cloudera

Cloudera was founded in 2008 by employees who worked on Hadoop at Yahoo and Facebook. It contributes to the Hadoop open-source project, offering its own distribution of the software for free. It also sells a subscription-based, Hadoop-based distribution for the enterprise, which includes production support and tools to make it easier to run Hadoop.

Since its creation, various vendors have chosen Cloudera's Hadoop distribution for their own big-data products. In 2010, Teradata was one of the first to jump on the Cloudera bandwagon, with the two companies agreeing to connect the Hadoop distribution to Teradata's data warehouse so that customers could move information between the two. Around the same time, EMC made a similar arrangement for its Greenplum data warehouse. SGI and Dell signed agreements with Cloudera from the hardware side in 2011, while Oracle and IBM joined the party in 2012.

Hortonworks

Cloudera rival Hortonworks was founded by key architects from the Yahoo Hadoop software engineering team. In June 2012, the company launched the Hortonworks Data Platform, a high-availability version of Apache Hadoop. It collaborated with VMware on the release, as the goal was to target companies deploying Hadoop on VMware's vSphere.

Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and better ways".

Teradata

Teradata made its move out of the "old-world" data-warehouse space by buying Aster Data Systems and Aprimo in 2011. Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor networks, social networks, genomics, video and photographs.

Teradata has now gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualisation and analysis are enabled through the Aster Data visual-development environment and suite of analytic modules. The Hadoop connector, enabled by its agreement with Cloudera, allows for a transfer of information between nCluster and Hadoop.


Oracle

Oracle made its big-data appliance available earlier this year — a full rack of 18 Oracle Sun servers with 864GB of main memory; 216 CPU cores; 648TB of raw disk storage; 40Gbps InfiniBand connectivity between nodes and engineered systems; and 10Gbps Ethernet connectivity.

The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and a distribution of R (an open-source statistical computing and graphics environment).

It integrates with Oracle's 11g database, with the idea being that customers can use Hadoop MapReduce to create optimised datasets to load and analyse in the database.

The appliance costs US$450,000, which puts it at the high end of big-data deployments, and not at the test and development end, according to analysts.

IBM

IBM combined Hadoop and its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core technologies for its big-data push.

The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to "withstand the demands of your enterprise", according to IBM. It adds administrative, workflow, provisioning and security features into the open-source distribution. Meanwhile, streams analysis has a more complex event-processing focus, allowing the continuous analysis of streaming data so that companies can respond to events.

IBM has partnered with Cloudera to integrate its Hadoop distribution and Cloudera Manager with IBM BigInsights. Like Oracle's big-data product, IBM's BigInsights links to the company's other offerings: IBM DB2; its Netezza data-warehouse appliance (its high-performance, massively parallel advanced analytic platform that can crunch petascale data volumes); its InfoSphere Warehouse; and its Smart Analytics System.

SAP

At the core of SAP's big-data strategy sits its high-performance analytic appliance (HANA), a data-warehouse appliance released in 2011. It exploits in-memory computing, processing large amounts of data in the main memory of a server to provide real-time results for analysis and transactions (Oracle's rival product, called Exalytics, hit the market earlier this year). Business applications, like SAP's Business Objects, can sit on the HANA platform to receive a real-time boost.

SAP has integrated HANA with Hadoop, enabling customers to move data between Hive and Hadoop's Distributed File System and SAP HANA or SAP Sybase IQ server. It has also set up a "big-data" partner council, which will work to provide products that make use of HANA and Hadoop. One of the key partners is Cloudera. SAP wants it to be easy to connect to data, whether it's in SAP software or software from another vendor.

Microsoft

Microsoft is integrating Hadoop into its current products. It has been working with Hortonworks to make Hadoop available on its cloud platform Azure, and on Windows Server. The former is available in developer preview. It already has connectors between Hadoop, SQL Server and SQL Server Parallel Data Warehouse, as well as the ability for customers to move data from Hive into Excel and Microsoft BI tools, such as PowerPivot.

EMC

EMC has centred its big-data offering on technology that it acquired when it bought Greenplum in 2010. It offers a unified analytics platform that deals with web, social, document, mobile, machine and multimedia data using Hadoop's MapReduce and HDFS, while ERP, CRM and POS data is put into SQL stores. Data mining, neural nets and statistical analysis are carried out using data from both sets, and the results are fed into dashboards.

What are firms doing with these products?

Now that there are products that make use of big data, what are companies' plans in the space? We've outlined some of them below.

Ford

Ford is experimenting with Hadoop to see whether it can gain value out of the data it generates from its business operations, vehicle research and even its customers' cars.

"There are many, many sensors in each vehicle; until now, most of that information was [just] in the vehicle, but we think there's an opportunity to grab that data and understand better how the car operates and how consumers use the vehicles, and feed that information back into our design process and help optimise the user's experience in the future, as well," Ford's big-data analytics leader John Ginder said.

HCF

HCF has adopted IBM's big-data analytics solution, including the Netezza big-data appliance, to better analyse claims as they are made in real time. This helps to more easily detect fraud and provide ailing members with information they might need to stay fit and healthy.

Klout

Klout's job is to create insights from the vast amounts of data coming in from the 100 million social-network users indexed by the company, and to provide those insights to customers. For example, Klout might provide information on how certain people's influence on social networks (or Klout score) might affect word-of-mouth advertising, or provide information on changes in demand. To deliver the analysis on a shoestring, Klout built custom infrastructure on Apache Hadoop, with a separate data silo for each social network. It used custom web services to extract data from the silos. However, maintaining this customised service was very complicated and took too long, so the company implemented a BI product based on Microsoft SQL Server 2012 and the Hive data-warehouse system, in which it consolidated the data from the silos. It is now able to analyse 35 billion rows of data each day, with an average response time of 10 seconds for a query.

Mitsui Knowledge Industry

Mitsui analyses genomes for cancer research. Using HANA, R and Hadoop to pre-process DNA sequences, the company was able to shorten genome-analysis time from several days to 20 minutes.

Nokia

Nokia has many uses for the information generated by its phones around the world; for example, using that information to build maps that predict traffic density or create layered elevation models. Developers had been putting the information from each mobile application into data silos, but the company wanted all of the data that's collected globally to be combined and cross-referenced. It therefore needed an infrastructure that could support terabyte-scale streams of unstructured data from phones, services, log files and other sources, and computational tools to carry out analyses of that data. Deciding that it would be too expensive to pull the unstructured data into a structured environment, the company experimented with Apache Hadoop and Cloudera's CDH. Because Nokia didn't have much Hadoop expertise, it looked to Cloudera for help. In 2011, Nokia's central CDH cluster went into production to serve as the company's enterprise-wide information core. Nokia now uses the system to pull together information to create 3D maps that show traffic, inclusive of speed categories, elevation, current events and video.

Walmart

Walmart uses a product it bought, called Muppet, as well as Hadoop to analyse social-media data from Twitter, Facebook, Foursquare and other sources. Among other things, this allows Walmart to analyse in real time which stores will have the biggest crowds, based on Foursquare check-ins.

What are the pitfalls?


Do you know where your data is?

It's no use setting up a big-data product for analysis only to realise that critical data is spread across the organisation in inaccessible and possibly unknown locations.

As mentioned earlier, Qlikview's VP of global field marketing, Henry Sedden, said that most companies aren't on top of the data inside their organisations, and would get lost if they tried to analyse extra data to get value from the big-data ideal.

A lack of direction

According to IDC, the big-data market is expected to grow from US$3.2 billion in 2010 to US$16.9 billion in 2015, a compound annual growth rate (CAGR) of 40 per cent, which is about seven times the growth of the overall ICT market.
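The growth rate can be checked from the two market-size figures quoted by IDC. The short Python calculation below uses the standard CAGR formula (end value divided by start value, raised to the power of one over the number of years, minus one); the variable names are our own.

```python
start_value = 3.2     # US$ billion, 2010
end_value = 16.9      # US$ billion, 2015
years = 2015 - 2010   # five annual growth periods

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"CAGR: {cagr:.1%}")
```

The result comes out just under 40 per cent, which matches the rounded figure above.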

Unfortunately, Gartner said that through to 2015, more than 85 per cent of the Fortune 500 organisations will fail to exploit big data to gain a competitive advantage.

"Collecting and analysing the data is not enough; it must be presented in a timely fashion, so that decisions are made as a direct consequence that have a material impact on the productivity, profitability or efficiency of the organisation. Most organisations are ill prepared to address both the technical and management challenges posed by big data; as a direct result, few will be able to effectively exploit this trend for competitive advantage."

Unless firms know what questions they want to answer and what business objectives they hope to achieve, big-data projects just won't bear fruit, according to commentators.

Ovum advised in its report "2012 Trends to Watch: Big Data" that firms should not analyse data just because it's there, but should build a business case for doing so.

"Look to existing business issues, such as maximising customer retention or improving operational efficiency, and determine whether expanding and deepening the scope of the analytics will deliver tangible business value," Ovum said.


Skills shortages

Even if a company decides to go down the big-data path, it may be difficult to hire the right people.

According to Australian research firm Longhaus:

The data scientist requires a unique blend of skills, including a strong statistical and mathematical background, a good command of statistical tools such as SAS, SPSS or the open-source R and an ability to detect patterns in data (like a data-mining specialist), all backed by the domain knowledge and communications skills to understand what to look for and how to deliver it.

This is already proving to be a rare combination; according to McKinsey, the United States faces a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts to analyse big data and make decisions based on their findings.

It's important for staff members to know what they're doing, according to Stuart Long, chief technology officer of Systems at Oracle Asia Pacific.

"[Big data] creates a relationship, and then it's up to you to determine whether that relationship is statistically valid or not," he said.

"The amount of permutations and possibilities you can start to do means that a lot of people can start to spin their wheels. Understanding what you're looking for is the key."

Data scientist DJ Patil, who until last year was LinkedIn's head of data products, said in his paper "Building data science teams" that he looks for people who have technical expertise in a scientific discipline; the curiosity to work on a problem until they have a hypothesis that can be tested; a storytelling ability to use data to tell a story; and enough cleverness to be able to look at a problem in different ways.

He said that companies will either need to hire people who have histories of playing with data to create something new, or hire people who are straight out of university and put them into an intern program. He also believes in using competitions to attract data scientist hires.

Privacy

Tracking individuals' data in order to be able to sell to them better will be attractive to a company, but not necessarily to the consumer who is being sold the products. Not everyone wants to have an analysis carried out on their lives, and depending on how privacy regulations develop, which is likely to vary from country to country, companies will need to be careful with how invasive they are with big-data efforts, including how they collect data. Regulations could lead to fines for invasive policies, but perhaps the greater risk is loss of trust.

One illustration of distrust arising from companies using data from people's lives is the famous example from Target, where the company sent coupons to a teenager for pregnancy-related products. Based on her purchasing behaviour, Target's algorithms believed her to be pregnant. Unfortunately, the teenager's father had no idea about the pregnancy, and he verbally abused the company. However, he was forced to admit later that his daughter actually was pregnant. Target later said that it understands people might feel like their privacy is being invaded by Target using buying data to figure out that a customer is pregnant. The company was forced to change its coupon strategy as a result.

Security

Individuals trust companies to keep their data safe. However, because big data is such a new area, products haven't been built with security in mind, despite the fact that the large volumes of data stored mean that there is more at stake than ever before if data goes missing.

There have been a number of highly publicised data breaches in the last year or two, including the breach of hundreds of thousands of Nvidia customer accounts, millions of Sony customer accounts and hundreds of thousands of Telstra customer accounts. The Australian Government has been promising to consider data breach-notification laws since it conducted a privacy review in 2008, but, according to the Office of the Australian Information Commissioner (OAIC), the wait is almost over. The OAIC advised companies to become prepared for a world where they have to notify customers when data is lost. It also said that it would be taking a hard line on companies that are reckless with data.

Steps to big data
