Top 10 categories for Big Data sources and mining technologies

Getting over the gee-whiz factor of Big Data can be tough. Enumerating important Big Data sources and technologies can give us a good start in moving the discussion forward.
Written by Andrew Brust, Contributor

This guest post is by Jeff Morris, Vice President of Product Marketing at Actuate Corporation, the company behind the popular open source reporting product, BIRT.

Most discussions on organizing Big Data center on repository frameworks – specifically Hadoop clusters and MapReduce frameworks. This technology-focused view often overlooks the most important question, “What are you planning to do with the data you’re collecting?”

Every answer will be different, so there’s no one-size-fits-all solution. Success lies in recognizing the different types of Big Data sources, using the proper mining technologies to find the treasure within each type, and then integrating and presenting those new insights according to your unique goals, so that your organization can make more effective steering decisions.

A Taxonomy of Big Data sources and technologies
For this process let’s define the two buckets for organizing your Big Data – the sources for Big Data, and the technologies to mine those sources. 

Here are the Top 10 Big Data source types and the corresponding mining techniques that might be applied to find your gold nuggets.
1.    Social network profiles—Tapping user profiles from Facebook, LinkedIn, Yahoo, Google, and specific-interest social or travel sites, to cull individuals’ profiles and demographic information, and extend that to capture their hopefully-like-minded networks.   (This requires a fairly straightforward API integration for importing pre-defined fields and values – for example, a social network API integration that gathers every B2B marketer on Twitter.)

2.    Social influencers—Editor, analyst and subject-matter expert blog comments, user forums, Twitter & Facebook “likes,” Yelp-style catalog and review sites, and other review-centric sites like Apple’s App Store, Amazon, ZDNet, etc.   (Accessing this data requires Natural Language Processing and/or text-based search capability to evaluate the positive/negative nature of words and phrases, derive meaning, index, and write the results).

3.    Activity-generated data—Computer and mobile device log files, aka “The Internet of Things.” This category includes web site tracking information, application logs, and sensor data – such as check-ins and other location tracking – among other machine-generated content.  But consider also the data generated by the processors found within vehicles, video games, cable boxes or, soon, household appliances.  (Parsing technologies such as those from Splunk or Xenos help make sense of these types of semi-structured text files and documents.)

4.    Software as a Service (SaaS) and cloud applications—Systems like Salesforce.com, Netsuite, SuccessFactors, etc. all represent data that’s already in the Cloud but is difficult to move and merge with internal data.  (Distributed data integration technology, in-memory caching technology and API integration work may be appropriate here.)

5.    Public—Microsoft Azure MarketPlace/DataMarket, The World Bank, SEC/Edgar, Wikipedia, IMDb, etc. – data that is publicly available on the Web and may enhance the types of analysis that can be performed.  (Use the same types of parsing, usage, search and categorization techniques as for the three previously mentioned sources.)

6.    Hadoop MapReduce application results—The next generation technology architectures for handling and parallel parsing of data from logs, Web posts, etc., promise to create a new generation of pre- and post-processed data.   We foresee a ton of new products that will address application use cases for any kind of Big Data – just look at the partner lists of Cloudera and Hortonworks.   In fact, we won’t be surprised if layers of MapReduce applications blending everything mentioned above (consolidating, “reducing” and aggregating Big Data in a layered or hierarchical approach) become their own “Big Data”.

7.    Data warehouse appliances—Teradata, IBM Netezza, EMC Greenplum, etc. are collecting from operational systems the internal, transactional data that is already prepared for analysis.  These will likely become an integration target that will assist in enhancing the parsed and reduced results from your Big Data installation. 

8.    Columnar/NoSQL data sources—MongoDB, Cassandra, InfoBright, etc. – examples of a new type of MapReduce repository and data aggregator.  These are specialty applications that fill gaps in Hadoop-based environments, for example Cassandra’s use in collecting large volumes of real-time, distributed data.

9.    Network and in-stream monitoring technologies—Packet evaluation, distributed query processing-like applications and email parsers are also areas likely to explode with new startup technologies.

10.  Legacy documents—Archives of statements, insurance forms, medical records and customer correspondence are still an untapped resource.  (Many archives are full of old PDF documents and print stream files that contain the original and only systems of record between organizations and their customers. Parsing this semi-structured legacy content can be challenging without specialty tools like Xenos.)
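
To make items 3 and 6 a little more concrete, here is a minimal Python sketch of the idea: parse semi-structured log lines, reduce them to counts, then reduce those counts again in a second layer. It is an illustration only, not how Splunk, Xenos or Hadoop work internally. The log format, the sample lines and every function name below are assumptions invented for this example.

```python
import re
from collections import Counter

# Hypothetical log format, invented for this sketch:
#   "<timestamp> <device> <event>"
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<device>\S+)\s+(?P<event>\S+)$")


def parse_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def map_events(lines):
    """Map step: emit one (device, event) key per parseable line."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield (record["device"], record["event"])


def reduce_counts(keys):
    """Reduce step: count how often each key was emitted."""
    return Counter(keys)


def rollup_by_event(device_event_counts):
    """Second layer: reduce per-device counts to per-event totals."""
    totals = Counter()
    for (_device, event), count in device_event_counts.items():
        totals[event] += count
    return totals


# Made-up sample data; the second-to-last line is deliberately malformed.
SAMPLE_LINES = [
    "2012-07-01T10:00:00 phone-1 checkin",
    "2012-07-01T10:00:05 phone-2 checkin",
    "2012-07-01T10:01:00 phone-1 pageview",
    "not a log line",
    "2012-07-01T10:02:00 phone-1 checkin",
]

first_layer = reduce_counts(map_events(SAMPLE_LINES))
second_layer = rollup_by_event(first_layer)

if __name__ == "__main__":
    print(dict(first_layer))
    print(dict(second_layer))
```

The output of the first layer is itself a data set that the second layer consumes, which is the sense in which layered MapReduce results can become a Big Data source in their own right.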

It’s how you use it
We’ve yet to see a conversation that treats an organization’s management of Big Data as the multi-layered process that it is. Our litmus test will not just be how well we capture Big Data, but also how we organize it, visualize it, and operationalize it – to derive big value from Big Data investments.  Choosing the right technologies for culling value from the variety of Big Data sources is the next discussion we need to have, once we move beyond high-fiving each other because “it works with Hadoop!”

Of course, BIRT (Business Intelligence and Reporting Tools), the Eclipse open source project that serves as the foundation for the ActuateOne product suite, supports Hadoop.  But the real question to ask is, "For what?" This is where the really interesting discussion begins, because it marks the crossover point for traditional Business Intelligence and Big Data.

Today the data architect is the Big Data expert, but imagine what will happen when you and I are reaping personal benefits from the Big Data that affects our own lives—traffic congestion might lessen; coupons will arrive on our phones for products we need, as we enter Target or WalMart; our grocery stores might warn us to throw out the milk before we have to sniff it; or we could discover the fundamentals of the Big Bang.

With the right management of Big Data, its potential is unbounded.  And this isn’t about technology “futures”…it’s happening right now.
