I recently ran across an old press release about a study by IDC which stated that at that time (June 2011), there were 1.8 zettabytes of data in the world. I don’t know what a zettabyte is, but the press release helped put this in perspective by saying it was equivalent to 200 billion feature-length HD movies.
Oh yes, and there was one more thing. The press release said that the amount of data in the world is doubling every 2 years. That means that there must be about 3.6 zettabytes out there now. That’s seriously big data.
Big data is high-volume, high-velocity, often real-time data that comes from all kinds of sources: people collecting information as part of their work, people interacting with search engines and social sites, consumers generating data when they make online and offline purchases, and systems collecting data generated by financial services, businesses, health care organizations, government, and all forms of media. Machines themselves generate huge amounts of data. For example, modern jet engines have self-monitoring capabilities, sending continuous streams of information and alerts about their own operating parameters.
All this data at some point ends up in a datacenter. Remember how I said a few posts back that IT budgets, calculated in inflation-adjusted dollars, have actually declined over the past 10 years? And notice that I just said the amount of data in the world is doubling every 2 years? But I digress.
The demand for big data is growing because businesses are learning how to strategically analyze data in ways that give them a competitive advantage. They analyze data to understand buying patterns among their customers and link those patterns to other market events. They use analytics to streamline their supply chains and reduce unnecessary overhead. Increasingly, big data and predictive analytics are being used in highly sophisticated personalization strategies that identify individuals and make timely offers based on their location and other information that is known about them. All these applications put huge demands on the datacenters where data is stored and analyzed.
The tendency today is to retain all data rather than summarizing or discarding anything not considered essential. This is partly due to reductions in storage costs, but it is also happening because data analysis is advancing so rapidly that it is no longer possible to say what data is not important or valuable.
To take full advantage of all this data, organizations need highly scalable storage and servers as well as the applications and frameworks to process all of the incoming data. Traditional databases are based on SQL, which lends itself well to transactional processing but is not as well optimized for high-performance analysis. Nonetheless, SQL databases have the advantage that they already hold a very large part of the relevant information. Extending them is therefore often the quickest and easiest way to add more data processing capacity.
Most Big Data activity is now focused on Hadoop, an open-source software framework that supports data-intensive distributed applications. Hadoop implements a scale-out computational paradigm named map/reduce, which splits the data into many small fragments and distributes processing of the application logic to all the nodes in a given cluster. Some applications combine Hadoop and in-memory computing for ultra-fast, real-time analytics based on high volumes of data.
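The map/reduce idea described above can be sketched in a few lines of plain Python (no Hadoop required). This is only an illustration of the paradigm, not Hadoop's actual API: the fragment list and the function names here are made up for the example. It counts word frequencies across several data fragments, the classic map/reduce demonstration.

```python
from collections import defaultdict

def map_phase(fragment):
    # Map: emit a (key, value) pair for each word in one data fragment.
    return [(word.lower(), 1) for word in fragment.split()]

def reduce_phase(pairs):
    # Reduce: sum the values for each key. In Hadoop, a shuffle step
    # first groups identical keys together across the cluster.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Two small "fragments" standing in for splits of a large data set.
fragments = [
    "big data is high volume",
    "big data is high velocity",
]

# In Hadoop, each fragment would be processed on a different cluster
# node in parallel; here we simply chain the phases sequentially.
mapped = [pair for frag in fragments for pair in map_phase(frag)]
print(reduce_phase(mapped))
# → {'big': 2, 'data': 2, 'is': 2, 'high': 2, 'volume': 1, 'velocity': 1}
```

The point of splitting the work this way is that the map phase has no shared state, so it parallelizes trivially across as many nodes as the cluster has.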
There are many implementations of Hadoop, including Microsoft’s HDInsight, which is available both on Windows Server 2012 and as a Windows Azure service. This ability to process data on internal infrastructure as well as in a public cloud is fundamental to the discussion. There are advantages to running Big Data applications in the public cloud, especially as a proof of concept or for one-off analyses. At the same time, latency and security considerations may require keeping processing on-premises. It is valuable to develop a strategy that supports both delivery models.