Last week, Dan Kusnetzky and I participated in a ZDNet Great Debate titled “Big Data: Revolution or evolution?” As you might expect, I advocated for the “revolution” position. The fact is I probably could have argued either side, as sometimes I view Big Data products and technologies as BI (business intelligence) in we-can-connect-to-Hadoop-too clothing.
But in the end, I really do see Big Data as different and significantly so. And the debate really helped me articulate my position, even to myself. So I present here an abridged version of my debate assertions and rebuttals.
Big Data’s manifesto: don’t be afraid
Big Data is unmistakably revolutionary. For the first time in the technology world, we’re thinking about how to collect more data and analyze it, instead of how to reduce data and archive what’s left. We’re no longer intimidated by data volumes; now we seek out extra data to help us gain even further insight into our businesses, our governments, and our society.
The advent of distributed processing over clusters of commodity servers and disks is a big part of what’s driving this, but so too is the low and falling price of storage. While the technology, and indeed the need, to collect, process and analyze Big Data have been with us for quite some time, doing so hasn’t been efficient or economical until recently. And therein lies the revolution: everything we always wanted to know about our data but were afraid to ask. Now we don’t have to be afraid.
A Big Data definition
My primary definition of Big Data is the area of tech concerned with procurement and analysis of very granular, event-driven data. That involves Internet-derived data that scales well beyond Web site analytics, as well as sensor data, much of which we’ve thrown away until recently. Data that used to be cast off as exhaust is now the fuel for deeper understanding about operations, customer interactions and natural phenomena. To me, that’s the Big Data standard.
Event-driven data sets are too big for transactional database systems to handle efficiently. Big Data technologies like Hadoop, complex event processing (CEP) and massively parallel processing (MPP) systems are built for these workloads. Transactional systems will improve, but there will always be a threshold beyond which they are not designed to operate.
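To make that distinction concrete, here is a toy sketch of the kind of pattern-over-a-stream query a CEP engine handles: flagging a burst of same-typed events inside a sliding time window. This is my own minimal Python illustration, not code from any CEP product, and all names in it are hypothetical; a row-at-a-time transactional database isn't built to express this kind of continuous query.

```python
from collections import deque

def detect_bursts(events, window_seconds=60, threshold=3):
    """Minimal CEP-style pattern match: report every moment at which
    `threshold` or more events of the same type have arrived within
    the trailing `window_seconds`.

    `events` is a time-ordered iterable of (timestamp_seconds, event_type).
    """
    recent = {}   # event_type -> deque of timestamps still inside the window
    alerts = []
    for ts, kind in events:
        q = recent.setdefault(kind, deque())
        q.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while q and ts - q[0] > window_seconds:
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, kind))
    return alerts

if __name__ == "__main__":
    stream = [(0, "login_fail"), (10, "login_fail"), (20, "login_fail")]
    print(detect_bursts(stream))  # [(20, 'login_fail')]
```

A real CEP system evaluates many such windowed patterns continuously over live streams; the point of the sketch is only the shape of the workload, not the engineering.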
2012: Year of Big Data?
Big Data is becoming mainstream…it’s moving from specialized use in science and tech companies to Enterprise IT applications. That has major implications, as mainstream IT standards for tooling, usability and ease of setup are higher than in scientific and tech company circles. That’s why we’re seeing companies like Microsoft get into the game with cloud-based implementations of Big Data technology that can be requested and configured from a Web browser.
The quest to make Big Data more Enterprise-friendly should result in the refinement of the technology and lowering the costs of operating it. Right now, Big Data tools have a lot of rough edges and require expensive, highly-specialized technologists to implement and operate them. That is changing though, which is further proof of its revolutionary quality.
Spreadmarts aren't Big Data, but they have a role
Is Big Data any different from the spreadsheet models and number crunching we’ve grown accustomed to? What the spreadsheet jocks have been doing can legitimately be called analytics, but certainly not Big Data, as Excel just can't accommodate Big Data sets as defined earlier. It wasn't until Excel 2007 that a worksheet could even hold more than 65,536 rows (the ceiling is now 1,048,576). Excel can't handle larger operational data loads, much less Big Data loads.
But the results of Big Data analyses can be further crunched and explored in Excel. In fact, Microsoft has developed an add-in that connects Excel to Hive, the SQL-like data warehouse interface to Hadoop, the emblematic Big Data technology. Think of Big Data work as coarse editing and Excel-based analysis as post-production.
The fact that BI and DW are complementary to Big Data is a good thing. Big Data lets older, conventional technologies provide insights on data sets that cover a much wider scope of operations and interactions than they could before. The fact that we can continue to use familiar tools in completely new contexts makes something seemingly impossible suddenly accessible, even casual. That is revolutionary.
Natural language processing and Big Data
There are solutions for carrying out Natural Language Processing (NLP) with Hadoop (and thus Big Data). One involves pairing the Python programming language with NLTK (the Natural Language Toolkit), a library of text-processing tools. Another example is Apple’s Siri technology on the iPhone: users simply talk to Siri to get answers drawn from a huge array of domain expertise.
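As a hedged sketch of that Python route: Hadoop Streaming runs any script that reads lines and emits tab-separated key-value records, so a basic NLP job can be as small as the mapper/reducer pair below. The regex tokenizer is a stand-in I chose for self-containment (in practice NLTK's tokenizers would replace it), and the function names are mine, not from any article or product.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z']+")

def mapper(lines):
    """Mapper: emit one 'word<TAB>1' record per token.
    In a real NLP job, an NLTK tokenizer could replace this regex."""
    for line in lines:
        for word in TOKEN.findall(line.lower()):
            yield f"{word}\t1"

def reducer(records):
    """Reducer: sum the counts per word.
    (Hadoop sorts and groups records by key between the two phases.)"""
    counts = Counter()
    for record in records:
        word, n = record.split("\t")
        counts[word] += int(n)
    return counts

if __name__ == "__main__":
    corpus = ["Big Data will help Big Data become easier to use"]
    print(reducer(mapper(corpus)))
```

Run for real, the same two functions would be wired up as `-mapper` and `-reducer` scripts to Hadoop's streaming jar, with the corpus sitting in HDFS rather than an in-memory list.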
Sometimes Siri works remarkably well; at other times it’s a bit clunky. Interestingly, Big Data technology itself will help to improve natural language technology, as it will allow greater volumes of written works to be processed and algorithmically understood. So Big Data will help itself become easier to use.
Big Data specialists and developers: can they all get along?
We don't need to make this an either/or question. Just as there have long been developers and database specialists, there will continue to be a call for those who build software and those who specialize in the procurement and analysis of the data that software produces and consumes. The two are complementary.
But in my mind, people who develop strong competency in both will have very high value indeed. This will be especially true as most tech professionals seem to self-select as one or the other. I've never thought there was a strong justification for this, but I’ve long observed it as a trend in the industry. People who buck that trend will be rare, and thus in demand and very well-compensated.
The feds and Big Data?
The recent $200 million investment in Big Data announced by the U.S. Federal government received lots of coverage, but how important is it, really? It has symbolic significance, but I also think it has flaws. $200 million is a relatively small amount of money, especially when split over numerous Federal agencies.
But when the administration speaks to the importance of harnessing Big Data in the work of the government and its importance to society, that tells you the technology has power and impact. The US Federal Government collects reams of data; the Obama administration makes it clear the data has huge latent value.
Big Data and BI are separate, but connected
Getting back to my introductory point, is Big Data just the next generation of BI? Big Data is its own subcategory and will likely remain there. But it's part of the same food chain as BI and data warehousing, and these categories will exist along a continuum rather than as discrete, perfectly distinct fields.
That's exactly where things have stood for more than a decade with database administrators and modelers versus BI and data mining specialists. Some people do both, others specialize in one or the other. They're not mutually exclusive, nor is one merely a newer manifestation of the other.
And so it will be with Big Data: an area of data expertise with its own technologies, products and constructs, but with an affinity to other data-focused tech specializations. Connections exist throughout the tech industry and computer science, and yet distinctions are still legitimate, helpful and real.
Where does this leave us?
In the debate, we discussed a number of scenarios where Big Data ties into more established database, data warehouse, BI and analysis technologies. The tie-ins are numerous indeed, which may make Big Data’s advances seem merely incremental. After all, if we can continue to use established tools, how can the change be "Big?"
But the revolution isn’t televised through these tools. It’s happening away from them.
We're taking huge amounts of data, much of it unstructured, and sifting it with cheap servers and disks. Then we're on-boarding that sifted data into our traditional systems. We're answering new, bigger questions, and a lot of them. We're using data we once threw away, because storage was too expensive, processing too slow and, going further back, broadband too scarce. Now we're working with that data, in familiar ways -- with little re-tooling or disruption. This is empowering and unprecedented, but at the same time, it feels intuitive.