A little over two years ago, I was given the opportunity to write about Big Data for ZDNet. I was, and am, quite proud of that achievement. I've worked since that time to educate myself about this burgeoning part of the database and analytics scene, and to share what I learned with readers here. I've also tried to accompany reported facts with some opinion and analysis.
Kids these days
When I got here, I knew a lot about Business Intelligence (BI), but not so much about Big Data itself. I suspected the two were separated only arbitrarily, and over that two-year period, I've confirmed that suspicion. Meanwhile, Big Data was (and still is) mostly about Hadoop. And Hadoop had its own separate ecosystem of tools, vendors and practitioners.
It also had its own querying paradigm in MapReduce, a batch-mode, procedural (i.e., non-declarative) approach to working with data that is natively programmed in Java. MapReduce can scale over huge numbers of server nodes in Hadoop clusters, handling huge volumes of data. But it carries overhead and, again, it's a batch-mode rather than an interactive technology.
I'll come clean now and tell you that I always thought this was primitive and absurd. In the BI world, practitioners have SQL (and sometimes MDX) skills with which they create declarative queries that return result sets interactively and, with proper tuning, very quickly. Java? Batch mode? Give me a break.
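To make the contrast concrete, here's a minimal sketch (in Python rather than Java, purely for brevity) of the procedural map-and-reduce style of word counting, next to the one-line declarative SQL a BI practitioner would write instead. The function names and sample data are illustrative, not from any real Hadoop job.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (word, 1) pairs, as a MapReduce mapper would.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group by key and sum the values, as a MapReduce reducer would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big", "data tools"]
word_counts = reduce_phase(map_phase(lines))

# The declarative equivalent a SQL-skilled BI practitioner would write:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
```

In the procedural version, you spell out *how* to compute the answer; in the SQL version, you state *what* you want and let the engine decide how, which is exactly the gap the SQL-on-Hadoop projects discussed below set out to close.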
And what about role-based security and manageability? Data governance features, like master data management and lineage? Or even some front-end tools with a user-friendly interface, instead of forcing folks to work at the command line? The BI world had confronted and largely addressed these requirements, and yet the Big Data world was (somewhat arrogantly) ignoring them.
We've come a long way, baby
Fast forward to today and these worlds are rapidly coming together. 2013 was the year of SQL-on-Hadoop solutions that accommodated mainstream database specialists' SQL skills and bypassed MapReduce. This year, Hadoop 2.0 and its YARN component put MapReduce in its place: as just one processing framework among many that Hadoop will accommodate.
Hue provides a nice browser-based user interface for Hadoop; Ambari provides manageability extensions; Sentry begins to provide role-based security; and the Stinger initiative and its progeny, Tez, take advantage of YARN to bake interactive SQL-on-Hadoop right into Apache Hive. And now Spark brings in-memory technology to Hadoop, too.
A ways to go
The Hadoop ecosystem's evolution has moved at impressively high velocity. We've closed a lot of gaps in the last two years. But we ain't done yet.
As I said before, data scientists don't scale. Big Data needs the same kind of self-service revolution that BI has had. And the next area that's ripe for increasing maturity is predictive analytics/machine learning (once called "data mining," before that term became politically incorrect). The tools in this space are still pretty hard to use, even the few that provide visual interfaces.
It's great that we can build predictive models with relative ease. But the notion that a small population of folks who know the tools, and program in R, are necessary to do this is unsustainable. The tools need to be accessible to business users and, ironically, they need to be smarter than they are now. The notion that machine learning algorithms need to be selected and tuned manually is silly — it just makes data scientists into middle men, when they should instead be our concierges and advisers.
Even here though, the ice is starting to break. I met three days ago with Stephen Purpura, the CEO of start-up Context Relevant, which offers machine learning technology that is intelligent and does shield users from having to pick algorithms. It analyzes data for its users, determines the best algorithm to use, and cleverly flattens its models' complexity back onto the number of input variables, allowing it to scale out and work incredibly fast. The company's demo has the engine learning Heron's formula for the area of a triangle. From scratch. In under 10 seconds.
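For reference (the demo's internals aren't public; this is just the well-known formula the engine rediscovered), Heron's formula gives a triangle's area from its three side lengths alone:

```python
import math

def heron_area(a, b, c):
    # Heron's formula: area = sqrt(s(s-a)(s-b)(s-c)),
    # where s is the semi-perimeter of the triangle.
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))
```

A 3-4-5 right triangle, for instance, comes out to an area of 6 — which is what makes "learning" this relationship from raw examples, with no algorithm hand-picked, an impressive demonstration.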
As much as I chide the Hadoop world for having started out artificially siloed and aloof, it did the industry a great service: it took the mostly-ossified world of databases, data warehouses and BI and made it dynamic again.
Suddenly, the incumbent players had to respond, add value to their products, and innovate rapidly. It's hard to imagine that having happened without Hadoop. And so it is with my own change. I'll be joining Gigaom Research next week as Director of Research for Big Data and Analytics.
Without the vast changes that have taken place in the industry, I would not have had the opportunity to write about them, think through them or present analyses on them. I certainly wouldn't be in a position to join an analyst firm as a Director, or anything else.
My whole career has been built on data, going back to the dBase II development I did in 1985, through my work with client/server databases and BI. I built my career that way because all software relates to data. That part isn't new. The change that Big Data has brought about is how often we record data, how much of it we can keep, and how we use and analyze that data for our benefit.
I expect improvement on all fronts. I'm excited about the future.