Big Data doesn't always involve Hadoop and MapReduce. This is a point I have made before, and I probably won't shut up about it anytime soon. Hadoop is good for a lot, but it has a batch-oriented architecture, so it works on the premise of MapReduce "jobs" that are submitted and processed, and that then produce an output file. That can be a lot of work when you want a fast answer from quickly streaming data.
That's why other technologies are sometimes better than Hadoop, especially when it comes to doing analysis in real time. So even in hot, topical areas like social media analytics, "NoHadoop" technologies can work very well.
Welcome back, data

I got a better sense of this through a recent conversation. Oddly, the conversation took place at an alumni gathering for my high school, with a gentleman who graduated four years before I started at the school. My conversation was with Alain Chesnais, a member of a very mathematical and computationally oriented family. Alain's mother was an adored math teacher at the school; Alain's brother helped run the computer lab, and his name was Pascal. Are you starting to get the picture?
As Alain and I chatted, I discovered that he is the current president of the Association for Computing Machinery (the ACM - sort of the AMA of Computer Science). And if that weren't enough, he's also the co-founder and Chief Scientist of streaming data analytics company Trendspottr.
Clairvoyance now

Trendspottr's eponymous product does what its one-vowel-short name implies: it analyzes social media data, tweets chief among them, and finds trends. For social media, your inner voice may react by saying "but Twitter already does that." But Trendspottr's different -- because the trends it calls out haven't necessarily emerged yet.
While most predictive analytics technologies ironically premise their predictions of the future on historical data, Trendspottr predicates its forecast of trends on current data. Doing this with a batch-based approach like MapReduce on Hadoop would make things very difficult.
All the topics that are fit to tweet

In the case of Twitter, Trendspottr monitors hashtags, links and full-text tweet content, to discover terms that seem to be on an upward trajectory. The topics Trendspottr finds may not be trending when it finds them, but they may be achieving social support according to patterns that have previously been observed as precursors to trending. And that means Trendspottr can sometimes tell you about social media groundswells before they really pop -- sometimes an hour earlier or more, in fact.
Trendspottr scores topics based on how recently they've been discussed, as well as how much and how often. If the chatter isn't accelerating, or at least holding its own, the topic's score will decay.
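To make the idea of a decaying score concrete, here is a minimal Python sketch. It is not Trendspottr's actual scoring, which isn't public in this article; it assumes a simple exponential decay in which each mention adds one point and a topic's score halves every `half_life` seconds. The class name, the method names and the half-life value are all hypothetical.

```python
import math


class DecayingTopicScores:
    """Hypothetical scorer: each mention adds 1.0, and a topic's score
    halves every `half_life` seconds when nobody mentions it."""

    def __init__(self, half_life=600.0):
        # Assumed decay constant; a real system would tune this.
        self.decay_rate = math.log(2) / half_life
        self.scores = {}     # topic -> score as of its last mention
        self.last_seen = {}  # topic -> timestamp of its last mention

    def score(self, topic, now):
        """Return the topic's score decayed to time `now` (in seconds)."""
        last = self.last_seen.get(topic)
        if last is None:
            return 0.0
        elapsed = now - last
        return self.scores[topic] * math.exp(-self.decay_rate * elapsed)

    def mention(self, topic, now):
        """Record one mention. Timestamps are assumed to be non-decreasing."""
        self.scores[topic] = self.score(topic, now) + 1.0
        self.last_seen[topic] = now


if __name__ == "__main__":
    tracker = DecayingTopicScores(half_life=600.0)
    # A burst of chatter: one mention every 30 seconds for 5 minutes.
    for t in range(0, 300, 30):
        tracker.mention("#bigdata", float(t))
    print("right after the burst:", tracker.score("#bigdata", 300.0))
    print("an hour later:        ", tracker.score("#bigdata", 3900.0))
```

With this kind of scheme, a topic mentioned faster than its score decays keeps climbing, one mentioned at a steady rate levels off, and one that goes quiet fades toward zero, which matches the behavior described above in spirit if not in detail.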
Trendspottr is available as an app add-in for Hootsuite, and is free for personal use. Off the shelf, Trendspottr works with Twitter and Facebook data, but the service also supports an API that lets you unleash it on virtually any quickly streaming data.
Old Math, New Tricks

Trendspottr has a corpus of knowledge that's based on data it's seen. And it keeps up in real time because in its algorithm -- based on Chesnais' doctoral thesis -- ongoing data is analyzed to update the predictive models, incrementally, and constantly. This makes the algorithm highly parallelizable, just like MapReduce. But unlike MapReduce, Trendspottr works well on a single computing node, or a very small cluster.
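The article doesn't describe the thesis algorithm itself, so the following is only a generic illustration of why incremental updates and parallelism tend to go together. It uses a well-known technique, Welford's running mean and variance with a pairwise merge step, as a stand-in: each new observation updates a small summary in constant time, and summaries built on separate streams can be combined afterward. The `RunningStats` name and its use here are assumptions for illustration, not anything Trendspottr is known to use.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean and variance, updated one value at a time."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the summary (Welford's method)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two summaries built independently, e.g. on two workers."""
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    # Hypothetical mentions-per-minute counts, split across two workers.
    worker_a, worker_b = RunningStats(), RunningStats()
    for x in [1, 2, 3]:
        worker_a.update(x)
    for x in [4, 5, 6]:
        worker_b.update(x)
    combined = worker_a.merge(worker_b)
    print(combined.n, combined.mean, combined.variance)
```

Because the summary never needs the raw history, it stays small enough to run on a single node, and because summaries merge cleanly, the same update can be spread across a few machines when the stream gets heavier.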
Without the brute force of MapReduce, Trendspottr needs to work smart. Technologies come and go, but algorithms have staying power, and are the core of real innovation. And so it is with Trendspottr. Its technology, which looks for trending topics amongst the social media data streams that are so important in the Big Data scene right now, is based on work Chesnais did as part of his doctoral thesis 30 years ago.
Sometimes you have to go back to school to make a Big Data discovery. Chesnais went back to his doctoral work. All I had to do was go to a high school alumni event.