This blog is about an industry area that has come to be called “Big Data.” The excitement around Big Data is huge; the mere fact that the term is capitalized implies a lot of respect. A number of technologies and terms get mentioned in the context of Big Data, with Hadoop chief among them, “data scientist” often not far behind and sometimes NoSQL thrown in for good measure.
It’s a bit unorthodox to start a blog post – especially a first post for a new blog – with a bunch of terms unaccompanied by definitions. But that’s a perfect metaphor for Big Data itself because, frankly, it’s not rigorously defined. Meanwhile the term is already entrenched – not just in the industry lexicon but in the mainstream vernacular as well.
What about Big Data is concrete and certain? We can safely say that Big Data is about the technologies and practice of handling data sets so large that conventional database management systems cannot handle them efficiently, and sometimes cannot handle them at all. Often these data sets are fast-streaming too, meaning practitioners don’t have lots of time to analyze them in a slow, deliberate manner, because the data just keeps coming.
Sources for Big Data include financial markets, sensors in manufacturing or logistics environments, cell towers, or traffic cameras throughout a major metropolis. Another source is the Web, including Web server log data, social media material (tweets, status messages, likes, follows, etc.), e-commerce transactions and site crawling output, to list just a few examples.
Really, Big Data can come from anywhere, as long as it’s disruptive to today's operational, transactional database systems. And while those systems will be able to handle larger data sets in the future, Big Data volumes will grow as well, so the disruptions will continue. The technologies used for creating and maintaining data, it turns out, are just not that well-suited to gathering data from a variety of systems, triaging it and consolidating it for precise analysis.
Perhaps you’ve heard of other terms, like Business Intelligence, Decision Support, Data Mining and Analytics, and wondered whether they’re part of Big Data or technically distinct from it. While these fields may have started out as distinct endeavors, they are often folded in to the Big Data discussion. Sometimes when that happens, it may seem that people are merely conflating things. But it turns out that Big Data is still evolving, and as a term it’s malleable. In a way, Big Data is a startup that’s still working out its business model.
I’ve been working with database, data access and business intelligence technologies since the mid-1980s, so Big Data quickly became a logical interest for me. What’s interesting, though, is that Big Data purists sometimes seem unaware of the data technologies that have come before, and miss out on the knowledge and experience those technologies represent. A few Big Data wheels have been re-invented ones.
This blog will investigate and explain what Big Data is about, based on the premise that there’s no perfect consensus on that definition and that it is, in any case, changing. I’ll be talking to customers, vendors and implementers, and I’ll be getting hands-on with the technology as well. I’ll look at some combination of terms, technologies, algorithms, languages, products, services, vendors, industry alliances, and more. Whatever the area of coverage, I’ll do my best to report it, explain it, analyze it, and monitor how it changes.
That’s not the whole story though. I’d very much like your input on which areas interest you the most, and if you have additional ideas. I’d like to hear your thoughts on Big Data in general, as well. My goal is to create a blog that deepens understanding of Big Data, and deepens interest in it too. Your feedback will be the best enabler in achievement of that goal.