How can we harness big data?
Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational database model, where data and the relationship between the data were stored in tables. The data was processed and stored in rows.
Databases have progressed over the years, however, and are now using massively parallel processing (MPP) to break data up into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data in rows, the databases can also employ columnar architectures, which enable the processing of only the columns that have the data needed to answer the query and enable the storage of unstructured data.
MapReduce is the combination of two functions to better process data. First, the map function separates data over multiple nodes, which are then processed in parallel. The reduce function then combines the results of the calculations into a set of responses.
Google used MapReduce to index the web, and has been granted a patent for its MapReduce framework. However, the MapReduce method has now become commonly used, with the most famous implementation being in an open-source project called Hadoop (see below).
Massively parallel processing (MPP)
Like MapReduce, MPP processes data by distributing it across a number of nodes, which each process an allocation of data in parallel. The output is then assembled to create a result.
However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally used on expensive specialised hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on commodity hardware.
Complex event processing (CEP)
Complex event processing involves processing time-based information in real time from various sources; for example, location data from mobile phones or information from sensors to predict, highlight or define events of interest. For example, information from sensors might lead to predicting equipment failures, even if the information from the sensors seems completely unrelated. Conducting complex event processing on large amounts of data can be enabled using MapReduce, by splitting the data into portions that aren't related to one another. For example, the sensor data for each piece of equipment could be sent to a different node for processing.
Derived from MapReduce technology, Hadoop is an open-source framework to process large amounts of data over multiple nodes in parallel, running on inexpensive hardware.
Data is split into sections and loaded into a file store — for example, the Hadoop Distributed File System (HDFS), which is made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes. The data is replicated over more than one node, so that even if a node fails, there's still a copy of the data.
The data can then be analysed using MapReduce, which discovers from the name node where the data needed for calculations resides. Processing is then done at the node in parallel. The results are aggregated to determine the answer to the query and then loaded onto a node, which can be further analysed using other tools. Alternatively, the data can be loaded into traditional data warehouses for use with transactional processing.
Apache is considered to be the most noteworthy Hadoop distribution.
NoSQL database-management systems are unlike relational database-management systems, in that they do not use SQL as their query language. The idea behind these systems is that that they are better for handling data that doesn't fit easily into tables. They dispense with the overhead of indexing, schema and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.
Cassandra is a NoSQL database alternative to Hadoop's HDFS.
Databases like Hadoop's file store make ad hoc query and analysis difficult, as the programming map/reduce functions that are required can be difficult. Realising this when working with Hadoop, Facebook created Hive, which converts SQL queries to map/reduce jobs to be executed using Hadoop.
There is scarcely a vendor that doesn't have a big-data plan in train, with many companies combining their proprietary database products with the open-source Hadoop technology as their strategy to tackle velocity, variety and volume. For an idea of how many vendors are operating in each area of the big-data realm, this big-data graphic from Forbes is useful.
Many of the early big-data technologies came out of open source, posing a threat to traditional IT vendors that have packaged their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has also provided an opportunity for traditional IT vendors, because enterprise and government often find open-source tools off-putting.
Therefore, traditional vendors have welcomed Hadoop with open arms, packaging it in to their own proprietary systems so they can sell the result to enterprise as more comfortable and familiar packaged solutions.
Below, we've laid out the plans of some of the larger vendors.
Cloudera was founded in 2008 by employees who worked on Hadoop at Yahoo and Facebook. It contributes to the Hadoop open-source project, offering its own distribution of the software for free. It also sells a subscription-based, Hadoop-based distribution for the enterprise, which includes production support and tools to make it easier to run Hadoop.
Since its creation, various vendors have chosen Hadoop distribution for their own big-data products. In 2010, Teradata was one of the first to jump on the Cloudera bandwagon, with the two companies agreeing to connect the Hadoop distribution to Teradata's data warehouse so that customers could move information between the two. Around the same time, EMC made a similar arrangement for its Greenplum data warehouse. SGI and Dell signed agreements with Cloudera from the hardware side in 2011, while Oracle and IBM joined the party in 2012.
Cloudera rival Hortonworks was birthed by key architects from the Yahoo Hadoop software engineering team. In June 2012, the company launched a high-availability version of Apache Hadoop, the Hortonworks Data Platform on which it collaborated with VMware, as the goal was to target companies deploying Hadoop on VMware's vSphere.
Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and better ways".
Teradata made its move out of the "old-world" data-warehouse space by buying Aster Data Systems and Aprimo in 2011. Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor networks, social networks, genomics, video and photographs.
Teradata has now gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualisation and analysis is enabled through the Aster Data visual-development environment and suite of analytic modules. The Hadoop connecter, enabled by its agreement with Cloudera, allows for a transfer of information between nCluster and Hadoop.
Oracle made its big-data appliance available earlier this year — a full rack of 18 Oracle Sun servers with 864GB of main memory; 216 CPU cores; 648TB of raw disk storage; 40Gbps InfiniBand connectivity between nodes and engineered systems; and 10Gbps Ethernet connectivity.
The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and a distribution of R (an open-source statistical computing and graphics environment).
It integrates with Oracle's 11g database, with the idea being that customers can use Hadoop MapReduce to create optimised datasets to load and analyse in the database.
The appliance costs US$450,000, which puts it at the high end of big-data deployments, and not at the test and development end, according to analysts.
IBM combined Hadoop and its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core technologies for its big-data push.
The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to "withstand the demands of your enterprise", according to IBM. It adds administrative, workflow, provisioning and security features into the open-source distribution. Meanwhile, streams analysis has a more complex event-processing focus, allowing the continuous analysis of streaming data so that companies can respond to events.
IBM has partnered with Cloudera to integrate its Hadoop distribution and Cloudera manger with IBM BigInsights. Like Oracle's big-data product, IBM's BigInsights links to: IBM DB2, its Netezza data-warehouse appliance (its high-performance, massively parallel advanced analytic platform that can crunch petascale data volumes); its InfoSphere Warehouse; and its Smart Analytics System.
At the core of SAP's big-data strategy sits a high-performance analytic appliance (HANA) data-warehouse appliance, unleashed in 2011. It exploits in-memory computing, processing large amounts of data in the main memory of a server to provide real-time results for analysis and transactions (Oracle's rival product, called Exalytics, hit the market earlier this year). Business applications, like SAP's Business Objects, can sit on the HANA platform to receive a real-time boost.
SAP has integrated HANA with Hadoop, enabling customers to move data between Hive and Hadoop's Distributed File System and SAP HANA or SAP Sybase IQ server. It has also set up a "big-data" partner council, which will work to provide products that make use of HANA and Hadoop. One of the key partners is Cloudera. SAP wants it to be easy to connect to data, whether it's in SAP software or software from another vendor.
Microsoft is integrating Hadoop into its current products. It has been working with Hortonworks to make Hadoop available on its cloud platform Azure, and on Windows Servers. The former is available in developer preview. It already has connectors between Hadoop, SQL Server and SQL Server Parallel Data Warehouse, as well as the ability for customers to move data from Hive into Excel and Microsoft BI tools, such as PowerPivot.
EMC has centred its big-data technology on technology that it acquired when it bought Greenplum in 2010. It offers a unified analytics platform that deals with web, social, document, mobile machine and multimedia data using Hadoop's MapReduce and HDFS, while ERP, CRM and POS data is put into SQL stores. The data mining, neural nets and statistics analysis is carried out using data from both sets, which is fed in to dashboards.
What are firms doing with these products?
Now that there are products that make use of big data, what are companies' plans in the space? We've outlined some of them below.
Ford is experimenting with Hadoop to see whether it can gain value out of the data it generates from its business operations, vehicle research and even its customers' cars.
"There are many, many sensors in each vehicle; until now, most of that information was [just] in the vehicle, but we think there's an opportunity to grab that data and understand better how the car operates and how consumers use the vehicles, and feed that information back into our design process and help optimise the user's experience in the future, as well," Ford's big-data analytics leader John Ginder said.
HCF has adopted IBM's big-data analytics solution, including the Netezza big-data appliance, to better analyse claims as they are made in real time. This helps to more easily detect fraud and provide ailing members with information they might need to stay fit and healthy.
Klout's job is to create insights from the vast amounts of data coming in from the 100 million social-network users indexed by the company, and to provide those insights to customers. For example, Klout might provide information on how certain peoples' influence on social networks (or Klout score) might affect word-of-mouth advertising, or provide information on changes in demand. To deliver the analysis on a shoestring, Klout built custom infrastructure on Apache Hadoop, with a separate data silo for each social network. It used custom web services to extract data from the silos. However, maintaining this customised service was very complicated and took too long, so the company implemented a BI product based on Microsoft SQL Server 2012 and the Hive data-warehouse system, in which it consolidated the data from the silos. It is now able to analyse 35 billion rows of data each day, with an average response time of 10 seconds for a query.
Mitsui knowledge industry
Mitsui analyses genomes for cancer research. Using HANA, R and Hadoop to pre-process DNA sequences, the company was able to shorten genome-analysis time from several days to 20 minutes.
Nokia has many uses for the information generated by its phones around the world; for example, using that information to build maps that predict traffic density or create layered elevation models. Developers had been putting the information from each mobile application into data silos, but the company wanted to have all of the data that's collected globally to be combined and cross referenced. It therefore needed an infrastructure that could support terabyte-scale streams of unstructured data from phones, services, log files and other sources, and computational tools to carry out analyses of that data. Deciding that it would be too expensive to pull the unstructured data into a structured environment, the company experimented with Apache Hadoop and Cloudera's CDH (PDF). Because Nokia didn't have much Hadoop expertise, it looked to Cloudera for help. In 2011, Nokia's central CDH cluster went into production to serve as the company's enterprise-wide information core. Nokia now uses the system to pull together information to create 3D maps that show traffic, inclusive of speed categories, elevation, current events and video.
Walmart uses a product it bought, called Muppet, as well as Hadoop to analyse social-media data from Twitter, Facebook, Foursquare and other sources. Among other things, this allows Walmart to analyse in real time which stores will have the biggest crowds, based on Foursquare check-ins.