Performing analytics on clickstream and log file data is a canonical example of Big Data in action; it represents a sweet spot for the technology. Hadoop is also a canonical example of Big Data in action and, more and more, Hadoop and Big Data are taken as synonymous. But from recent discussions with Web analytics stalwart Webtrends and log file analytics startup Loggly, I’ve come to learn that these core Big Data scenarios do not always mesh well with Big Data's best known technology.
A week ago, I talked to the CEOs of both companies, Alex Yoder of Webtrends and newly appointed Charlie Oppenheimer of Loggly, in separate interviews on the same day. The interviews were eye-opening and, for me, rendered certain assumed norms of Big Data null and void.
Webtrends: still data-crazy after all these years
Webtrends has been in the Web analytics game since 1993, essentially since the consumer Web began. Webtrends does leverage contemporary Big Data technology, but it also uses other technologies that the company claims provide even better scale, and it invests in new technologies constantly. The company employs a number of PhDs and core mathematicians and has tripled its data science staff in the last year.
Webtrends has always viewed the tenets of database marketing as a key driver for what it does. That may sound old-fashioned, but Yoder provided a very reasonable defense for the timelessness of this mindset. In the age of direct mail, tracking and analyzing the response to campaigns produced very useful market data. Then the advent of the Web introduced scale and anonymity that made those same insights elusive. When you think about it, a lot of the Big Data story in commerce is about getting those insights back.
But the Web also introduced competitive pressure to obtain those insights in real time. Meanwhile, Hadoop is a batch-processing (i.e., non-interactive) technology that, by its nature, doesn't work well for real-time analysis.
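To see why batch processing and real-time analysis are at odds, consider a toy MapReduce-style job. This is an illustrative Python sketch, not actual Hadoop code (real Hadoop jobs are typically Java programs run over HDFS splits): the key point is that the reduce step can only produce an answer after the entire input has been collected, which is what makes results non-interactive.

```python
from collections import defaultdict

# Toy MapReduce-style batch job: count page hits per URL from a full log.
# Illustrative sketch only -- not Hadoop itself.

def map_phase(log_lines):
    """Emit (url, 1) pairs, one per log line."""
    for line in log_lines:
        url = line.split()[1]          # assumes "timestamp url" format
        yield (url, 1)

def reduce_phase(pairs):
    """Sum counts per key -- only possible once ALL pairs have arrived."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

log = ["t1 /home", "t2 /cart", "t3 /home"]
print(reduce_phase(map_phase(log)))    # {'/home': 2, '/cart': 1}
```

A real-time system, by contrast, must update its answer incrementally as each event arrives, rather than waiting for the job's input to be complete.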
Webtrends must maintain its customers’ service level agreements (SLAs) while processing 13 billion daily events. Webtrends can’t just sample and aggregate the data, but instead must track customer-level information, across Web, Mobile and Social Media, as well as emerging channels including Smart TV platforms, connected appliances and even uploaded data from personal devices like runners’ pedometers. This allows for mathematical, algorithmic modeling and predictive segmentation that the company says mere data sampling doesn’t support nearly as well.
Loggly does Big DevOps
Unlike the 19-year-old Webtrends and its orientation to customer analytics, Loggly was founded in 2009 and focuses on operational Big Data. Rather than mining consumer clickstream information, Loggly runs a hosted service that focuses on server log files, and the health of its customers’ infrastructure. Loggly calls this Application Intelligence, and states unabashedly that its constituency lies squarely in the developer/operations (DevOps) space.
Another customer in Loggly’s roster is itself. The company uses its own service for its own operations, calling the phenomenon “logfooding.” Oppenheimer told me the company is the leader in its space, that it has more than 2500 active customers, including modern cloud computing operations like Heroku as well as old-line firms like Sony Music. Oppenheimer also told me that Loggly indexes more data daily than Twitter produces (which Oppenheimer says is in the neighborhood of 400 million tweets). 400 million may be less than 13 billion, but it’s still a lot of data.
Like Webtrends, Loggly must perform its analyses in real time. In the operational world in which Loggly lives, that is essential. So, while Loggly can use Hadoop for certain background processing, it can’t rely on the batch mode product for its front-line, real-time log processing, monitoring and analytics work. As it turns out, Loggly does make heavy use of another open source technology for much of its analysis: Lucene, something I hope to discuss in a future post.
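Lucene itself is a Java full-text search library, but the core idea that makes it fast for log search is the inverted index: map each term to the set of documents containing it, so a query becomes a cheap set lookup rather than a scan of every log line. Here is a rough Python sketch of that idea (illustrative only; real Lucene adds analyzers, relevance scoring, segment files, and much more):

```python
from collections import defaultdict

# Minimal inverted index, the data structure at the heart of
# Lucene-style full-text search over log lines.

class LogIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of line ids
        self.lines = []

    def add(self, line):
        doc_id = len(self.lines)
        self.lines.append(line)
        for term in line.lower().split():
            self.postings[term].add(doc_id)

    def search(self, *terms):
        """Return lines containing ALL query terms (AND semantics)."""
        ids = set.intersection(*(self.postings[t.lower()] for t in terms))
        return [self.lines[i] for i in sorted(ids)]

idx = LogIndex()
idx.add("ERROR disk full on node7")
idx.add("INFO backup complete")
idx.add("ERROR timeout on node7")
print(idx.search("error", "node7"))
# ['ERROR disk full on node7', 'ERROR timeout on node7']
```

Because the index is updated as each line is added, queries stay fast even as the corpus grows, which is what makes this approach a fit for interactive log search.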
Lucene is not the whole story though. Loggly has its own complex event processing (CEP) engine that represents the bulk of its intellectual property.
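Loggly's actual CEP engine is proprietary, but the general pattern a CEP engine implements can be sketched simply: evaluate rules over a sliding window of an event stream as events arrive, rather than in batch afterward. Below is a hypothetical example of one such rule (the rule, hosts, and thresholds are all invented for illustration):

```python
from collections import deque

# Toy complex event processing (CEP) rule: raise an alert when a host
# logs 3 or more ERROR events within a 60-second sliding window.
# A hypothetical sketch of the general CEP pattern, not Loggly's engine.

def detect_error_bursts(events, threshold=3, window=60):
    """events: time-ordered iterable of (timestamp, host, level) tuples."""
    recent = {}            # host -> deque of recent ERROR timestamps
    alerts = []
    for ts, host, level in events:
        if level != "ERROR":
            continue
        q = recent.setdefault(host, deque())
        q.append(ts)
        while q and ts - q[0] > window:    # drop events outside the window
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, host))
    return alerts

events = [(0, "web1", "ERROR"), (10, "web1", "INFO"),
          (20, "web1", "ERROR"), (45, "web1", "ERROR"),
          (200, "web1", "ERROR")]
print(detect_error_bursts(events))    # [(45, 'web1')]
```

Each event is processed once, on arrival, so alerts fire in real time — the opposite of the collect-everything-then-compute model of a batch job.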
- Also read: CEP and MapReduce: Connected in complex ways
And that IP is valuable indeed; on July 17th, the day after I interviewed Oppenheimer, Loggly announced a $5.7 million round of funding, with participation from True Ventures, Trinity Ventures and Matrix Partners.
The common thread
The fact that I interviewed the CEOs of Loggly and Webtrends on the same day was a coincidence, but it made all the more apparent that, despite their different products, the two companies have much in common when it comes to Big Data. Both need real-time analytics, and both have strict SLAs with which to comply. While Hadoop is helpful for some ancillary tasks (albeit ones involving enormous data sets), it’s not suitable for either company’s core offerings.
Real-time requirements appear to be Hadoop’s Achilles' heel. While the sheer capability of processing huge volumes of data has given rise to Hadoop’s popularity and adoption, that widespread acceptance has created a challenge for the open source project and its batch-oriented approach to Big Data. The question is whether Hadoop’s popularity will lead to it gaining interactive, real-time capabilities in the future, or whether a new technology without the batch handicap will come along to disrupt Hadoop’s dominance.