Missing some key points with Big Data

As companies increasingly engage with customers online we can expect the deluge of data to increase ... and increase ... and increase. This is the world of Big Data.
Written by Sid Probstein, Attivio, Contributor

Commentary - It's not surprising to see innovative companies like EMC getting involved with Hadoop. Interest in analyzing and extracting value from the deluge of customer, machine and sensor-generated data (log files, click streams, position data, etc.) is at an all-time high. As companies increasingly engage with customers online, we can expect the deluge to increase ... and increase ... and increase. This is the world of Big Data.

Many people say the Big Data movement is about "unstructured data". But they are missing an important point. Log files and click streams are not really unstructured; they just have relatively unfamiliar, and sometimes variable, structures. But even if all this information were traditional, structured data, the average database still couldn't handle the deluge cost-effectively. The sheer volume is a key aspect of the Big Data challenge, and the more successfully you engage with end users, and the more interaction you offer them, the more data you have to deal with in return.
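To make the point about log files concrete, here is a minimal Python sketch that pulls named fields out of a single web access-log line. It assumes the line follows the widely used Common Log Format; the sample line, the pattern and the function name parse_log_line are illustrative assumptions, not taken from any particular product.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample line, made up for illustration.
sample = '203.0.113.7 - - [10/Oct/2011:13:55:36 -0700] "GET /products/42 HTTP/1.1" 200 2326'
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"])
```

Once parsed, each line is a record with named fields, much like a row in a table. The structure was there all along; it was just unfamiliar.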

What about the other sources that contain important information about customers and buying habits? They hold a wealth of value that is today largely untapped. Emails, open-ended survey questions, web forms, call logs, discussion boards, SharePoint and Wiki sites: this is the true "unstructured content" that completes the picture of customer perception. It is also the best source for building a useful internal view, of employee and partner behavior for example.

Unstructured data is not that different from structured data. It tells you what happened, and probably where. Unstructured content, on the other hand, explains WHY things happen. The inability to process and analyze this unstructured content is what prevents most of the Big Data players from presenting a comprehensive view.

The challenge of aggregating and analyzing unstructured content is significant. Human expression is shockingly diverse, varies by location and changes over time. Assembling the elements required to analyze and mine unstructured content requires a lot of expertise and software.

Another Big Data challenge can be the rate or "velocity" at which the information arrives, and the rate at which it may be desirable to analyze it. And beyond velocity, complexity is a big challenge. Hadoop, for example, assumes all data is equal, and that analysis need consider no more than a single "slice" of that data. But many analytics require analysis and correlation across the entire set.
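The difference between slice-local analysis and correlation across the whole set can be shown with a small Python sketch. The records, the two "slices" and the function names are hypothetical, and the code only illustrates the distinction drawn above; it is not a model of how Hadoop itself works.

```python
from collections import Counter, defaultdict

# Hypothetical (customer, event) records, split into two slices
# the way a cluster might split a large data set.
slices = [
    [("alice", "view"), ("bob", "view"), ("alice", "purchase")],
    [("bob", "complaint"), ("alice", "view")],
]

def count_events(slice_):
    """Slice-local analysis: needs nothing outside its own slice."""
    return Counter(event for _, event in slice_)

# Slice-local results are computed independently and simply added together.
total_counts = sum((count_events(s) for s in slices), Counter())

def events_by_customer(all_slices):
    """Cross-set analysis: a customer's history is scattered across slices,
    so the records must be brought together before they can be correlated."""
    history = defaultdict(list)
    for slice_ in all_slices:
        for customer, event in slice_:
            history[customer].append(event)
    return dict(history)

history = events_by_customer(slices)

# Customers who both viewed a page and complained.
# Neither slice can answer this on its own.
viewed_and_complained = sorted(
    customer for customer, events in history.items()
    if "view" in events and "complaint" in events
)
print(total_counts["view"], viewed_and_complained)
```

Counting events parallelizes trivially because each slice is self-contained. Linking one customer's view to the same customer's complaint does not, because the two records live in different slices.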

On the plus side: vendors in the unified information access (UIA) space have been focused on aggregating, enriching and analyzing unstructured content, as well as data, for years. These vendors provide technology that complements Big Data infrastructure by bringing unstructured content into the analysis framework and by presenting Big Data in context to remove information blind spots in business applications and automated business processes. The technology includes essential text analytic capabilities, such as entity, concept, key phrase and sentiment analysis, that help transform unstructured content into meaningful insight.
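As a rough illustration of what sentiment and key phrase analysis do, here is a deliberately tiny Python sketch. The word lists, the sample comments and the function names are all made up for this example; real text analytics relies on far richer linguistic models, and this is not how any particular UIA product works.

```python
import re
from collections import Counter

# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "rude", "confusing"}
STOPWORDS = {"the", "a", "is", "was", "and", "but", "i", "it", "to", "of"}
IGNORED = STOPWORDS | POSITIVE | NEGATIVE

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Positive minus negative word count: above 0 leans positive, below 0 negative."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def key_terms(texts, top_n=3):
    """Most frequent words across all texts, ignoring stopwords and sentiment words."""
    counts = Counter(
        t for text in texts for t in tokenize(text) if t not in IGNORED
    )
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical open-ended survey answers.
comments = [
    "I love the product but checkout was slow and confusing",
    "Checkout is broken",
    "Support was helpful and fast",
]
scores = [sentiment_score(c) for c in comments]
print(scores, key_terms(comments, top_n=1))
```

Even this crude version hints at the WHY: the negative comments cluster around "checkout", something a click stream alone would not reveal.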

Ultimately, these UIA vendors can complete the Big Data picture by addressing what Gartner has defined as Extreme Information: volume, velocity, variety and complexity. Complete the Big Data picture. Add unstructured content to your Big Data stack.

Sid Probstein, currently CTO at Attivio, has more than 16 years of experience leading successful engineering organizations and building complex, high-performance systems.
