Future direction of Big Data
Where do you think Big Data is going? Will it become its own subcategory of IT, or is it simply the next phase of BI and DW?
Andrew Brust
Revolution
Evolution
Dan Kusnetzky
The moderator has delivered a final verdict.
Andrew Brust: Big Data is unmistakably revolutionary. For the first time in the technology world, we’re thinking about how to collect more data and analyze it, instead of how to reduce data and archive what’s left. We’re no longer intimidated by data volumes; now we seek out extra data to help us gain even further insight into our businesses, our governments, and our society.
The advent of distributed processing over clusters of commodity servers and disks is a big part of what’s driving this, but so too is the low and falling price of storage. While the technology, and indeed the need, to collect, process and analyze Big Data has been with us for quite some time, doing so hasn’t been efficient or economical until recently. And therein lies the revolution: everything we always wanted to know about our data but were afraid to ask. Now we don’t have to be afraid.
Dan Kusnetzky: Big data isn't really new. What we now know as Big Data comes out of the ancient and honorable analysis of log data, from a long line of analytical tools that deal with rapidly moving, large amounts of data. Analyzing log data coming out of operating systems, application frameworks, database engines, networking giblets and storage systems has been around for decades as a “big data” task. Just ask vendors such as Splunk, Loggly, or RainStor.
Where do you think Big Data is going? Will it become its own subcategory of IT, or is it simply the next phase of BI and DW?
Big Data already is its own subcategory and will likely remain there. But it's part of the same food chain as BI and DW, and these categories will exist along a continuum rather than as discrete and perfectly distinct fields. That's exactly where things have stood for more than a decade with database administrators and modelers versus BI and data mining specialists. Some people do both, others specialize in one or the other. They're not mutually exclusive, nor is one merely a newer manifestation of the other. And so it will be with Big Data: an area of data expertise with its own technologies, products and constructs, but with an affinity to other data-focused tech specializations. Connections exist throughout the tech industry and computer science, and yet distinctions are still legitimate, helpful and real.
I am for Revolution
Big Data is going to become part of several IT disciplines rather than replacing any of them. Some of the likely categories are IT management, business analysis, retail systems and the like. IT management will be able to better sift through operational data found in operating system, networking, application framework, application and database logs to understand what leads up to failures. They'll be able to head those failures off at the pass rather than allowing them to come into town. Business analysts will be able to do their thing without having to always pester their IT colleagues to develop new code or change database schemas. Retail companies will be able to learn more about their customers so they can be better served. Big data is just providing some new tools to add to the tool kit in use today.
I am for Evolution
Andrew and Dan will post their closing arguments tomorrow and I will declare a winner on Thursday. Between now and then, don't forget to cast your vote and jump into the discussion below to post your thoughts on this topic.
There is reportedly going to be a need for 1.2 million new jobs in Big Data analytics over the next decade. Is this about to become the hottest job in IT, or will software engineers continue to be the hottest commodity?
There will be demand for both. We don't need to make it an either/or question. Just as there have long been developers and database specialists, there will continue to be call for those who build software and those who specialize in the procurement and analysis of data that software produces and consumes. The two are complementary. But in my mind, people who develop strong competency in both will have very high value indeed. This will be especially true as most tech professionals seem to self-select as one or the other. I've never thought there was a strong justification for this, but I've long observed it as a trend in the industry. People who buck that trend will be rare, and thus in demand and very well-compensated.
I am for Revolution
It is not at all clear how many new positions will be created or where they will be created. It is far more likely that this will be yet another specialization for software engineers rather than something totally new. The key part of this evolution is that non-IT analysts can now take part without necessarily having to become systems experts.
I am for Evolution
The U.S. government just announced a $200 million investment in Big Data and likened it to the rise of the supercomputer and the Internet in terms of its potential impact. How significant is this investment?
I think the investment has symbolic significance, but I also think it has flaws. $200 million is a relatively small amount of money, especially when split over numerous Federal agencies. It's difficult to tell if any of this money will be awarded in the form of grants to independent researchers or if all of the expenditure is for in-house Federal research. If the latter, then I worry that agency inefficiencies may further dilute the impact of this investment. But when the administration speaks to the importance of harnessing Big Data in the work of the government and the importance to society, that tells you it has power and impact. And when it mentions that there's a workforce need around Big Data, and not just around technology in general, that shows an even deeper conviction. The US Federal Government collects reams of data; the Obama administration makes it clear the data has huge latent value.
I am for Revolution
Just because the U.S. government invests in something doesn't mean it will become a broad trend. Anyone remember Ada, the programming language that was supposed to combine the best features of COBOL, Fortran and PL/I? While Ada is still important in some government projects, it didn't take over the world. I hope the investment allows the U.S. government to be more efficient and effective. Only time will tell if that dream will become a reality.
I am for Evolution
Big Data is also launching a new job title: Data Scientist. However, aren't these new data wonks more about asking the right questions and using data analysis to tell stories than the data wonks of the past?
If Big Data's definition suffers from abuse, then that of Data Scientist suffers an order of magnitude more. To me, the field of Data Science is about statistics, data analysis, modeling and computational thinking. Unfortunately, the term is getting dumbed down a bit to describe people with Big Data technology skill sets. For example, someone who understands how to configure and use Hadoop, and maybe knows a little bit about the R programming language (an open source statistics and analysis package) may be described as a Data Scientist, but really should be called a Hadoop specialist.
I am for Revolution
It appears that analysts are sifting through data and don't often know what question to ask at first. This is one of the key benefits of the new tools. It is possible to sift through massive amounts of data without first knowing what you're looking for. Traditional BI and DW tools often require that an analyst already know what they're seeking.
I am for Evolution
Part of the promise of Big Data is better tools that allow non-database experts to run more natural language queries. Is this realistic? Are there already examples of tools that do this?
There are solutions for carrying out Natural Language Processing (NLP) with Hadoop (and thus Big Data). One involves taking the Python programming language and a set of libraries called NLTK (Natural Language Toolkit) and mashing them up with a feature of Hadoop called "Streaming," which allows the Big Data engine to be controlled by almost any programming language. Another example of both the potential and the challenges of natural language technology and Big Data is Apple's Siri technology on the iPhone. Users simply talk to Siri to get answers from a huge array of domain expertise. Sometimes it works remarkably well; other times it's a bit clunky. The former is testament to the power and value of Big Data; the latter to the shortcomings of speech processing and semantic understanding in machine learning. Interestingly, Big Data technology itself will help to improve natural language technology as it will allow greater volumes of written works to be processed and algorithmically understood. So Big Data will help itself become easier to use.
I am for Revolution
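As a rough illustration of the Streaming approach mentioned above: Hadoop Streaming runs ordinary programs as the map and reduce steps, exchanging tab-separated key/value lines over standard input and output. The sketch below is a minimal word count written in that style. The regular-expression tokenizer is a stdlib stand-in chosen to keep the example self-contained (an NLTK tokenizer could take its place), and the script name and the "map"/"reduce" arguments are hypothetical conventions, not anything Hadoop requires.

```python
#!/usr/bin/env python3
"""Word-count mapper and reducer in the Hadoop Streaming style (sketch)."""
import re
import sys
from itertools import groupby

# Simplistic stand-in tokenizer (assumption): runs of letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def map_lines(lines):
    """Emit one 'word<TAB>1' record for every token in the input lines."""
    for line in lines:
        for word in TOKEN.findall(line.lower()):
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: 'wordcount.py map' or 'wordcount.py reduce'.
    steps = {"map": map_lines, "reduce": reduce_lines}
    step = steps.get(sys.argv[1]) if len(sys.argv) > 1 else None
    if step is not None:
        for record in step(sys.stdin):
            print(record)
```

Locally, the same flow can be imitated with a shell pipeline such as `cat reports.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce` (file names are placeholders). On a cluster, the framework performs the sort between the two steps and runs many copies of each in parallel.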
This is only one of many promises the suppliers of Big Data tools are making. It isn't the most important in many cases. A more important promise is that data analysts will be empowered to sift through data in real time to learn more about the business. This learning is far more important than if the queries are made using a set of check boxes or in natural language statements.
I am for Evolution
Let's drill down a little bit on unstructured data as part of the Big Data movement. What are some examples and why is it significant?
Text is a good example to start with. Books, papers and reports are only as structured as their sentences and paragraphs, but patterns in that data still exist. Imagine looking at all the annual and quarterly reports submitted by public companies to the Securities and Exchange Commission, over the agency's history, and correlating phrases and passages to economic phenomena in the reports. That's a terrific unstructured/Big Data scenario. Other media, including audio and video, are good fodder as well. Since both are either digital or digitize-able, patterns could be mined from them for the purposes of optimizing public safety, customer service or operational improvement. If you start to contemplate the volume of data contained in 24/7 security or traffic camera video, or 911/customer service call center phone audio, you can understand why the intersection of big data and unstructured data is important. Event-driven data is often unstructured.
I am for Revolution
The ability to search documents, presentations, wikis, blogs, videos and audios can help an organization better understand content they've created, content that customers have sent them in the form of messages, and the like. Listening to customers regardless of where and how they comment can help a company be much more successful. This goes far beyond simply analyzing shopping baskets to glean some level of understanding of what customers want.
I am for Evolution
How does Big Data differ from the Business Intelligence and Data Warehousing of the past decade?
Again, it's a question of the granularity (and therefore scale) of the data. Certain Data Warehousing products, including Massively Parallel Processing (MPP) appliances, can legitimately be called Big Data technology. Various data visualization products can be employed in both BI and Big Data contexts. Tableau is a great example of this as it natively connects to Hadoop and Hive, but also to Data Warehouse appliances, relational databases, and even spreadsheets and flat files. The fact that BI and DW are complementary to Big Data is a good thing. Big Data lets older, conventional technologies provide insights on data sets that cover a much wider scope of operations and interactions than they could before. The fact that we can continue to use familiar tools in completely new contexts makes something seemingly impossible suddenly become accessible, even casual. That is revolutionary.
I am for Revolution
The three Vs come into play here once again. Most BI and Data Warehousing tools rely on well-defined, structured data. Big Data includes many types of data, both structured and unstructured. For example, a Data Warehouse wouldn't be able to answer a question like, "How many company presentations included the catch phrase Big Data?"
I am for Evolution
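The presentation question above can be made concrete with a toy sketch. It assumes the text of each presentation has already been extracted into plain strings; the function name, file names and sample text are all made up for illustration, and a real deployment would run this kind of scan across a cluster rather than a dictionary in memory.

```python
def count_documents_mentioning(phrase, documents):
    """Count documents whose text contains the phrase, ignoring case."""
    needle = phrase.casefold()
    return sum(1 for text in documents.values() if needle in text.casefold())


# Made-up sample data standing in for extracted presentation text.
presentations = {
    "q1_roadmap.txt": "Our Big Data strategy for the year",
    "sales_kickoff.txt": "Pipeline review and quota planning",
    "it_summit.txt": "Why big data matters to operations",
}

print(count_documents_mentioning("Big Data", presentations))  # prints 2
```

The point of the sketch is that the query runs directly over free text, with no schema defined in advance, which is the kind of question a warehouse of structured fields is not set up to answer.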
For business professionals who are trying to understand all of the buzz around Big Data, what would you tell them is the most important thing to understand about Big Data for 2012?
The most important thing is that Big Data is becoming mainstream: it's moving from specialized use in science and tech companies to Enterprise IT applications. That has major implications, as mainstream IT standards for tooling, usability and ease of setup are higher than in scientific and tech company circles. That's why we're seeing companies like Microsoft get into the game with cloud-based implementations of Big Data technology that can be requested and configured from a Web browser. The quest to make Big Data more Enterprise-friendly should result in the refinement of the technology and lowering the costs of operating it. Right now, the technology has a lot of rough edges and requires expensive, highly-specialized technologists to implement and operate it. That is changing though, which is further proof of its revolutionary quality.
I am for Revolution
Big Data is a catch phrase that has been bubbling up from the high performance computing niche of the IT market. It is largely the newest attempt to make sense of the ever-larger pile of data organizations have. What's new this time is that many suppliers are offering powerful tools that are relatively easy to learn. Several open source projects, such as Apache Hadoop, Cassandra, Solr and the like are making tools available at low cost.
I am for Evolution
How does Big Data differ from what the Excel spreadsheet wizards have been doing for most businesses for the past couple decades?
What the spreadsheet jocks have been doing can legitimately be called analytics, but certainly not Big Data, as Excel just can't accommodate Big Data sets as defined earlier. It wasn't until 2007 that Excel could even handle more than 65,536 rows per worksheet. It can't handle larger operational data loads, much less Big Data loads. Now all that said, the results of Big Data analyses can be further crunched and explored in Excel. In fact, Microsoft has developed an add-in that connects Excel to Hive, the relational/data warehouse interface to Hadoop, the emblematic Big Data technology. Here's the low-down: the refined exploration and analysis on smaller data sets often done in Excel augments very nicely the comparatively simple work done with Big Data technology and data sets. Think of Big Data work as coarse editing and Excel-based analysis as post-production.
I am for Revolution
The three Vs come into play here. The goal is making it easy to tease useful information out of masses of data. This data is usually measured in the millions or billions of records. That is far beyond what a personal productivity tool, such as Excel, can handle.
I am for Evolution
Are my two debaters online and ready to go?
I'm ready
I am for Revolution
I'm online and looking forward to the conversation.
I am for Evolution
As a term, "Big Data" is already starting to get as overused and overhyped as "Cloud Computing." How would you define Big Data?
My primary definition of Big Data is the procurement and analysis of very granular, event-driven data. That involves Internet-derived data that scales well beyond Web site analytics, as well as sensor data, much of which we've thrown away until recently. Data that used to be cast off as exhaust is now the fuel for deeper understanding about operations, customer interactions and natural phenomena. To me, that's the Big Data standard. Event-driven data sets are too big for transactional database systems to handle efficiently. Big Data technologies like Hadoop, complex event processing (CEP) and massively parallel processing (MPP) systems are built for these workloads. Transactional systems will improve, but there will always be a threshold beyond which they were not designed to be used. Other definitions are out there, but I go with the study of event data scaling beyond what operational databases were designed to handle.
I am for Revolution
In simplest terms, the phrase refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities. Think three Vs.
Volume - The sheer amount of data, whether from a user base, such as Twitter, LinkedIn or Facebook, or a huge amount of machine/sensor data.
Variety - Data is more than validated strings in fields: it's text, images, video, and all sorts of machine data formats.
Velocity - Wherever and whoever it's coming from, you have to capture tens or hundreds of thousands of writes per second, maybe even millions.
People analyzed this data before. What's new is that tools are now available that allow business analysts or non-IT people to do the analysis.
I am for Evolution
Andrew Brust
In this debate, we discussed a number of scenarios where Big Data ties into more established database, Data Warehouse, BI and analysis technologies. The tie-ins are numerous indeed, which may make Big Data’s advances seem merely incremental. After all, if we can continue to use established tools, how can the change be "Big?"
But the revolution isn’t televised through these tools. It’s happening away from them.
We're taking huge amounts of data, much of it unstructured, and sifting through it using cheap servers and disks. And then we're on-boarding that sifted data into our traditional systems. We're answering new, bigger questions, and a lot of them. We're using data we once threw away, because storage was too expensive and processing too slow. And then we're working with it, in familiar ways -- with little re-tooling or disruption. It's empowering. It's unprecedented. And at the same time, it feels intuitive.
That's revolutionary.
Dan Kusnetzky
I find that my role is often that of a "systems archeologist." I have learned a great deal by watching the market grow and evolve over the years. Big data is clearly an evolution rather than something entirely new and different.
Suppliers come forward with new products or services and declare that they are both unique and new. I’m often forced to rain on their parade by telling them of products from the 1970s, 1980s, 1990s, or 2000s that did the same thing. Often the only thing new is the platform upon which they've built their product. I see the same thing when suppliers of big data products and services take time to visit me.
Although the tools that big data suppliers are offering make the analytical process easier and allow IT analysts and non-IT analysts to sift through larger mounds of data, the analytical process is still the same.
What’s new is the sources of data, the volume of data, the different formats of that data and how fast the data is coming in -- not the basic process.
Big data is just an evolutionary step rather than something entirely new.
Jason Hiner
Posted by Jason Hiner