Internet-scale data sets and Web-scale analytics have placed a different set of requirements on software infrastructure and data processing techniques. More types of companies and organizations are seeking new inferences and insights across a variety of massive data sets -- some reaching petabyte scale.
How can all this data be sifted and analyzed quickly, and how can we deliver the results to an inclusive class of business-focused users? Following the lead of such Web-scale innovators as Google, and by exploiting the performance of parallel computing on industry-standard hardware, such companies as Greenplum are now focusing on how MapReduce approaches are changing business intelligence (BI) and the data-management game.
BI has become a killer application over the past few years, and we're now extending that beyond enterprise-class computing into cloud-class computing. The amount of data and content -- and the need for innovative analytics from across the Internet -- is still growing rapidly, despite harsh economic times.
To provide an in-depth look at how parallelism, modern data infrastructure, and MapReduce technologies come together in the new age, BriefingsDirect's Dana Gardner recently spoke with Tim O’Reilly, CEO and founder of O’Reilly Media and blogger; Jim Kobielus, senior analyst at Forrester Research; and Scott Yara, president and co-founder at Greenplum.
Here are some excerpts:
Kobielus: A number of things are happening ... and the trend continues to grow. In terms of the data sets, it’s becoming ever more massive for analytics. It’s equivalent to Moore’s Law, in the sense that every several years, the size of the average data warehouse or data mart grows by an order of magnitude.
Why are data warehouses bulking up so rapidly? One key thing is that organizations, especially in tough times when they're trying to cut costs, continue to consolidate a lot of disparate data sets into fewer data centers, onto fewer servers, and into fewer data warehouses that become ever-more important for their BI and advanced analytics.
What we're seeing is that more data warehouses are becoming enterprise data warehouses and are becoming multi-domain and multi-subject. You used to have tactical data marts, one for your customer data, one for your product data, one for your finance data, and so forth. Now, the enterprise data warehouse is becoming the be all and end all -- one hub for all of those sets.
Also, the data warehouse is becoming more than a data warehouse. It's becoming a full-fledged content warehouse, not just structured relational data, but unstructured and semi-structured data -- from XML, from your enterprise content management system, from the Web, from various formats.
O'Reilly: In the first age of computing, business models were dominated by hardware. In the second age, they were dominated by software. What started to happen in the 1990s ... open source started to create new business models around data, and, in particular, around network applications that built huge data sets through user participation. That’s the essence of what I call Web 2.0.
Look at Google. It's a BI company, based on massive data sets, where, first of all, they are spidering all the activity off of the Web, and that’s one layer. Then, they do this detailed analysis of the link structure of that Web, and that’s another layer. Then, they start saying, "Well, what else can we find?" They start looking at click stream data. They start looking at browsing history, and where people go afterward. Think of all the data. Then, they deliver service against that.
That’s the essence of Web 2.0, building a massive data set, doing real-time analytics against it, and then figuring out what services you can deliver. What’s happening today is that movement is transferring from the consumer Web into business.
... When we think about where this is going, we first have to understand that everybody is connected all the time via applications, and this is accelerating, for example, via mobile. The need for real-time analytics against massive data sets is universal. ... This is a real frontier of competitive advantage. You look at the way that new technologies are being explored by startups. So many of the advantages are in data.
Yara: We're now entering this new cycle, where companies are going to be defined by their ability to capture and make use of the data and the user contributions that are coming from their customers and community. That is really being able to make parallel computing a reality.
... If you look at running applications on a much cheaper and much more efficient set of commodity systems and consolidating applications through virtualization, that would be a really compelling thing, and we've seen a multi-billion dollar industry born of that.
... We're talking about using parallel computing techniques, open-source software, and commodity hardware. It’s literally a 10- to 100-fold improvement in price performance. When the cost of data analysis comes down 10 to 100 times, that’s when new things become possible.
... Business is now driven by Web 2.0, by the success of Google, and by their own use and actions of the Web realizing how important data is to their own businesses. That’s become a very big driver, because it turns out that parallel computing, combined with commodity hardware, is a very disruptive platform for doing large-scale data analysis. ... Google has become a thought leader in how to do this, and there are a lot of companies creating technologies and models that are emblematic of that.
Kobielus: ... Power users are the ones who are going to do the bulk of the BI and analytics application development in this new paradigm. This will mean that for the traditional high priesthood of data modelers and developers and data mining specialists, more and more of this development will be offloaded from them, so they can do more sophisticated statistical analysis. ... The front office is the actual end user.
O'Reilly: ... The breakthroughs are coming from the ability of people to discern meaning in data. That meaning sometimes is very difficult to extract, but the more data you have, the better you can be at it. ... Getting more tools for handling larger and more complex data sets, and in particular, being able to mix data sets, is critical. ... That fits with this idea of crossing data sets being one of the new competencies that people are going to have to get better at.
Kobielus: Traditionally, data warehouses existed to provide you with perfect hindsight on the customer -- historical data, massive historical data, hopefully on the customer, and that 360 degree view of everything about the customer and everything they have ever done in the past, back to the dawn of recorded time.
Now, it’s coming down to managing that customer relationship and evolving and growing with that relationship. You have to have not so much a past or historical view, but a future view on that customer. You need to know that customer and where they are going better than they know themselves. ... That’s where the killer app of the online recommendation engine becomes critical.
Feed all [possible data and content] into a recommendation engine, which is a predictive-analytics model running inside the data warehouse. That can optimize that customer’s interaction at every touch point. Let’s say they're dealing with a call-center person live. The call-center person knows exactly how the world looks to that customer right now and has a really good sense for what that customer might need now or might need in three months, six months, or a year, in terms of new services or products, because other customers like them are doing similar things.
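[Editor's illustration, not part of the interview: a minimal sketch of the "other customers like them are doing similar things" idea, assuming a simple similarity-weighted approach. The customer names, product names, and the `recommend` and `jaccard` helpers are all hypothetical; a recommendation engine running inside a data warehouse would use far richer data and models.]

```python
from collections import Counter

# Hypothetical purchase history: customer -> set of products or services held.
purchases = {
    "alice": {"broadband", "mobile", "tv"},
    "bob": {"broadband", "mobile"},
    "carol": {"broadband", "tv", "cloud_backup"},
    "dave": {"mobile", "cloud_backup"},
}


def jaccard(a, b):
    """Overlap between two sets of purchases, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(customer, purchases, top_n=3):
    """Rank items the customer lacks, weighted by how similar their owners are."""
    owned = purchases[customer]
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        similarity = jaccard(owned, items)
        # Each item the similar customer has, and this one lacks, gains weight.
        for item in items - owned:
            scores[item] += similarity
    return [item for item, score in scores.most_common(top_n) if score > 0]


if __name__ == "__main__":
    print(recommend("bob", purchases))
```

[In this toy data, the customer most similar to "bob" also holds "tv", so that item ranks first; the same scoring could in principle be computed in parallel across partitions of customers.]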
Yara: ... You're going to see lots of cases where for traditional businesses that are selling services and products to other businesses, the aggregation of data is going to be interesting and relevant. At the same time, you have companies where even the internal analysis of their data is something they haven’t been able to do before.
... These companies actually have access to amazing amounts of information about the customers and businesses. They are saying, "Why can’t we, at the point of interaction -- like eBay, Amazon, or some of these recommendation engines -- start to take some of this aggregate information and turn it into improving businesses in the way that the Web companies have done so successfully?" That’s going to be true for B2C businesses, as well as for B2B companies.
We're just at the beginning of that. That’s fundamentally what’s so exciting about Greenplum and where we're headed.