There's gold in them thar databases




Finding hidden patterns and trends in masses of corporate data--in real time--takes up enormous processing power, but it can be very profitable.

The buzz of the corporate data explosion has meant near-constant headaches for corporate data managers charged with making sense of ever-growing mountains of information. Facing pressure to extract meaningful value from this data, many have turned to complex data mining systems that use preternaturally intelligent mathematical algorithms to sift through massive volumes of information looking for previously unknown relationships.

Although data mining has been sold as a way to maximise the value of any type of corporate data, its user base has largely been limited to the upper tier of business: financial services, large retail, telecommunications, and government users.

Tax offices use it to look for discrepancies in lodged forms; banks use it to spot credit card fraud and to profile customers likely to declare bankruptcy or default on loans; law enforcement agencies use it to detect money laundering; insurance companies use it to pick out fraudulent claims; and retailers analyse sales trends to model consumers' buying behaviour and adjust stores' stock levels accordingly.

In a well-known example of the technology's power, the Australian Securities and Investments Commission used NetMap, which cross-references relationships between data in a case, to root out the 1996 insider trading conducted by then-Macquarie Bank executive Simon Hannes.

The technology has even found a home at Italian soccer club AC Milan, which is using Computer Associates' CleverPath Predictive Analysis Server to analyse physiological, orthopaedic, and mechanical data culled from a variety of sources. The system digs through data to identify factors that may have led to player injuries in the past, then uses CA's Neugents technology to watch incoming data and warn specialists about the emergence of a similar situation that could potentially lead to player injury.

At its core, data mining is about creating useful information by identifying previously unknown relationships within a data set. Yet despite its usefulness, practical limits on data mining's uptake have stemmed from its high cost (implementations often run into seven figures) and the complexity involved in integrating applications and databases with tools that have never been designed for user-friendliness.

That's led to a frustrating truth: although many companies have successfully implemented data warehouses--massive databases containing large volumes of historical data for analysis and reuse--many more have struggled to do more with that data than run basic reports using simple tools. Indeed, while relatively mature business analytics tools have been available for years, a recent survey of 50 Australian executives, conducted by Teradata, revealed that nearly half feel they don't have enough information to make intelligent business decisions.

Always looking for new ways to leverage the data they've collected, companies are continuing to turn to data mining to find out things they never knew about their businesses. IDC's latest assessment of the market for business intelligence (BI) solutions — a category that includes data mining, executive reporting and other data analysis tools — predicted the Asia-Pacific market would grow at 23 per cent annually to be worth US$3.3 billion by 2006. In a separate report, Aberdeen Group predicted the worldwide business analytics market would grow from US$4 billion in 2001 to US$11 billion by 2005.

Much of this growth will be driven by the use of data mining to turn data contained in corporate data warehouses into better pictures of customer behaviour — which, in turn, can fundamentally modify business strategies.

Clearly, companies want to know more about their operations and their best chances for success. IDC's projection is doubly interesting as it confirms that BI is one of the few sectors to enjoy positive growth despite the recent slump in overall IT spending. Despite its cost, it seems data mining is recession-proof: its value makes it a key ally in identifying and setting strategic priorities that can turn around a business hit hard by tough economic times.

In a study of business analytics' business value, IDC recently found that among North American and European companies, the successful introduction and use of analytics has delivered returns on investment of anywhere from 17 per cent to 2000 per cent, with a median ROI of 112 per cent; more than half saw ROI of 101 per cent or more, and fully one-fifth of respondents saw an ROI of 1000 per cent or more. Overall, the mean payback period was 1.6 years, with projects averaging US$4.5 million. When paired with business process change such as CRM, analytics delivered a median ROI of 55 per cent.
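To make those percentages concrete, here is a minimal sketch of the arithmetic, assuming ROI is net benefit divided by cost and payback is cost divided by annual net benefit. IDC's exact methodology is not given here, and the benefit figures below are invented purely for illustration.

```python
def roi_percent(total_benefit: float, total_cost: float) -> float:
    """Return on investment as a percentage: net gain divided by cost."""
    return (total_benefit - total_cost) / total_cost * 100.0


def payback_years(total_cost: float, annual_net_benefit: float) -> float:
    """Years needed for annual net benefits to cover the up-front cost."""
    return total_cost / annual_net_benefit


# Hypothetical project: the US$4.5 million cost echoes the study's average,
# but both benefit figures are made up for this example.
cost = 4_500_000.0
total_benefit = 9_540_000.0
annual_net_benefit = 2_812_500.0

print(f"ROI: {roi_percent(total_benefit, cost):.0f} per cent")
print(f"Payback: {payback_years(cost, annual_net_benefit):.1f} years")
```

On these assumed inputs, a project returning a little more than twice its cost lands at the study's median ROI.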

These kinds of numbers make a compelling argument for business analytics, and data mining is a significant part of those analytics. To this end, it's possible to construct a quite compelling business case around the introduction of such technologies, which are both a worthwhile investment in new environments and a strong step for companies that have invested heavily in data warehouses in the past. As the IDC figures show, the potential returns are truly limited only by the imagination.

A stitch in time saves (data) mine

In the past, the data mining market was defined by proprietary products designed to sit on top of enterprise data warehouses, or by extensions to those warehouses themselves. Data mining was seen as a discrete activity in which a skilled analyst would define complex search parameters, then sic the tools onto the data warehouse and wait for results.

When those results came, they were usually helpful: two or more products might be selling well together, or a particular type of customer might have filed more lost property claims in the past month. Whatever the data, however, data mining's biggest shortcoming was that it could take some time for the business to act upon the relationships that the tools had found.
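The "products selling well together" finding can be illustrated with a simple co-occurrence count. This is only a sketch of the idea, not any vendor's algorithm; the baskets and the `pair_counts` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical sales transactions, invented for illustration.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "nappies"],
    ["bread", "butter", "nappies"],
]

top_pair, times = pair_counts(baskets).most_common(1)[0]
print(top_pair, times)
```

Commercial tools go much further (weighing how surprising a pairing is, for instance), but the raw material is the same kind of count.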

In many ways, this problem was a result of the architectural and technical issues that data mining presented. Thus was born OLAP (OnLine Analytical Processing), a category of tool that allows data analysts to pull out several dimensions of data from a larger data set, then explore relationships between those sets. In the years since it was introduced, OLAP has gained a steady following as a way of finding patterns in large masses of data, yet it is still human-directed, only revealing patterns as it's instructed to. OLAP is also, by its nature, extremely limiting, since it is predicated on ignoring many other data sets.
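A rough sketch of what "pulling out several dimensions" means in practice: the analyst picks the dimensions, and everything else in the data is summed away. The `roll_up` function and the sales records are assumptions made for illustration, not part of any OLAP product.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum a measure over the chosen dimensions, ignoring all the others."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical fact records: three dimensions and one measure.
facts = [
    {"region": "NSW", "product": "handset", "quarter": "Q1", "sales": 120.0},
    {"region": "NSW", "product": "prepaid", "quarter": "Q1", "sales": 80.0},
    {"region": "VIC", "product": "handset", "quarter": "Q1", "sales": 95.0},
    {"region": "NSW", "product": "handset", "quarter": "Q2", "sales": 130.0},
]

by_region_product = roll_up(facts, ["region", "product"], "sales")
by_region = roll_up(facts, ["region"], "sales")
print(by_region_product)
print(by_region)
```

The sketch also shows the limitation: a pattern hiding in the quarter dimension stays invisible until someone thinks to ask for it.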

Data mining, in its purest sense, is about establishing a data analysis methodology that can be a superset of the narrow scope of OLAP. Yet making this happen has historically been difficult, if only because of the massive volume of data (and the attendant time required to process it) that most companies faced.

Recognising that improving this situation presents a significant opportunity for product differentiation, database and enterprise application vendors have recently committed themselves to improving the utility of data mining by positioning it as a real-time function built into business information systems.

"We've pushed data mining algorithms into DB2 to be able to do things like real-time segmentation as transactions are coming in," says Janet Perna, worldwide general manager of data management with IBM. "The big problem with data mining technology has been that there's been a sense that it requires a PhD to be able to utilise the technology. But these technologies are becoming part and parcel of the data infrastructure, and it's going to become more and more mainstream as more capability is pushed into the data engine."

To this end, IBM recently added a host of data mining capabilities into its DB2 OLAP Server and offers more features through its DB2 Intelligent Miner, which conducts real-time scoring on DB2 data and PMML (Predictive Model Markup Language) data. PMML is an XML-based markup language, now in version 2.1 and supported in a number of data mining and enterprise applications, that describes inputs to data mining models, the transformations used in preparing data for data mining, and the parameters defining the data mining models.
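To give a feel for what a PMML document carries, the sketch below reads a simplified, hand-written fragment with Python's standard XML parser. The fragment is an assumption made for illustration: namespace declarations, the tree's nodes, and other elements a real PMML file would need are left out, and the field and model names are invented.

```python
import xml.etree.ElementTree as ET

# Simplified fragment in the style of PMML. A complete document would carry
# a namespace, a header, and the model's actual tree nodes.
PMML_SNIPPET = """
<PMML version="2.1">
  <DataDictionary numberOfFields="2">
    <DataField name="complaints" optype="continuous" dataType="integer"/>
    <DataField name="churn" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="churn_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="complaints"/>
      <MiningField name="churn" usageType="predicted"/>
    </MiningSchema>
  </TreeModel>
</PMML>
"""

root = ET.fromstring(PMML_SNIPPET.strip())

# The data dictionary declares every field the model knows about.
fields = [(f.get("name"), f.get("dataType")) for f in root.iter("DataField")]

# The mining schema says which of those fields the model predicts.
model = root.find("TreeModel")
predicted = [
    f.get("name")
    for f in model.iter("MiningField")
    if f.get("usageType") == "predicted"
]

print(fields)
print(model.get("modelName"), "predicts", predicted)
```

Because the model travels as plain XML, one tool can build it and another can score with it, which is the point of the standard.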

For its part, Microsoft plans to add seven additional data mining algorithms to Yukon, the next major release of SQL Server, which gained OLAP capabilities and support for clustering and "decision tree" data mining algorithms in its SQL Server 2000 version.

Microsoft data access products marketing manager Terry Clancy concedes the company's focus on OLAP has diluted its brand strength in data mining, but says Microsoft is planning a major partner and marketing push to remedy this situation, and to commoditise the overall data mining market, when Yukon appears. Yet Microsoft won't be alone: SAP, PeopleSoft, i2, and other enterprise application vendors are adding analytics to their core applications to improve their ability to process data as it's generated.

Because it's so data-intensive, real-time data mining requires a significant investment in hardware in order to allow analytics to run alongside the everyday transaction processing that keeps the company running. All operating system and database combinations have benefited from the ongoing acceleration of server processor speed, which has provided the grunt to make real-time mining possible.

For its part, Oracle has addressed this problem by promoting its Oracle-on-Linux solution for data mining, largely because of the recognised scalability and robustness of the Linux platform and Oracle's own RAC (Real Application Clusters) clustering technology.

"It can be deployed ... on clusters of Linux servers so you can have enterprise data management, but at high reliability," says Roland Slee, director of business and technology solutions with Oracle Australia. "We're seeing customers deploying databases on Intel-based servers with very high clock speeds. Because it supports transparent clustering, you can get scalability and affordability, then take advantage of the data mining features. In Oracle's experience, those [computing] cycles are available more affordably and with better performance using Linux clusters than other environments."

That's just one perspective, however; with both RISC and Intel processors continuing their ascent up the exponential curve of Moore's Law, the servers necessary to make real-time data mining work are quickly dropping in price. That means today's data mining has become both cheaper and more accessible to a broader range of companies than ever before.

Applied data mining 101

Now that real-time data mining is becoming a practical reality, it's time to consider how it might improve existing business processes. Rather than forcing companies to run regular reports about customer preferences, real-time data mining allows, for example, a call centre application to query the data warehouse the instant a customer calls on the phone. This decision tree approach allows the call centre agent's system to adapt on-screen prompts so as to better guide the conversation to a positive and more profitable end.

Consider the case of a mobile phone company, which is constantly struggling to retain customers in an environment of high churn, intense competition and very low customer tolerance. A customer may have had several support calls in the past in which she complained about poor reception, and is making yet another one.

Through conventional methods, the call centre agent is unlikely to know much about the customer's past contacts with the company, unless they've already had a heated argument. That leaves both sides unaware of the trap they're about to walk into, which in turn creates the potential for a confrontational exchange.

If a data mining engine is working in the background while the two talk through the problem, it might notice not only that the caller has had several complaints — and should thus be handled with kid gloves — but also that the caller fits the demographics of people that have historically been high churners. Recognising that the customer isn't likely to stick around much longer if service doesn't improve, the call centre agent could offer her sweeteners such as free calls, on-site replacement, or a handset upgrade if she renews or extends her current contract.
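The call centre scenario can be sketched as a small decision tree that turns what is known about the caller into a prompt for the agent. Everything here is hypothetical: the `Caller` fields, the thresholds, and the prompts are invented, and a real system would learn its tree from historical data rather than have it written by hand.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    complaints_last_year: int
    months_left_on_contract: int
    in_high_churn_segment: bool


def next_best_action(caller: Caller) -> str:
    """Walk a hand-written decision tree and return a prompt for the agent."""
    if caller.complaints_last_year >= 3:
        if caller.in_high_churn_segment:
            # Repeat complainer who also matches the churn profile
            return "offer retention sweetener"
        return "handle with care"
    if caller.in_high_churn_segment and caller.months_left_on_contract <= 3:
        return "offer contract renewal"
    return "standard script"


# The customer from the example: several complaints, high-churn demographics.
unhappy = Caller(complaints_last_year=4, months_left_on_contract=2,
                 in_high_churn_segment=True)
print(next_best_action(unhappy))
```

The lookup is cheap enough to run while the phone is still ringing, which is what makes the real-time case practical.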

This type of selling would be impossible without having a way to identify those significant opportunities that only arise during the time a company is in live contact with its customers. Since those contacts are most likely to be phone calls associated with billing or service problems, real-time data mining becomes invaluable in allowing companies to seize an up-selling opportunity that might not come again.

"The data mining model is able to go into relationships in a more granular way and find out more unusual combinations that a human mind wouldn't be able to cover," says Richard Lees, principal consultant with Microsoft Australia.

"These have tended to be head-office applications, where a [technical] team built a data mine to find out more about customers. But someone technical will tend to look for things they didn't know about, and if those things are no use to the organisation they're of no use to anybody. We can now create tools for coalface workers to exploit the value of the data mining model, without realising they're doing it."

In something of a confession that earlier data mining tools have been too obscure and esoteric for many customers to use, many vendors are working to package their applications into bundles with specific purposes. SPSS, which claims a number of blue-chip banks and telcos amongst its local user base, recently took this approach with the launch of SPSS Predictive Marketing, which wraps data mining techniques around best-practice marketing templates. Using a standard Web browser interface, users can pick out historical trends and explore what-if models; results are framed in terms that non-mathematical employees can relate to and act upon.

Similar offerings are appearing from virtually every data warehousing vendor. Teradata, which has its core user base in high-volume transactional processing systems, recently released an analysis system targeted at financial services companies, while SAS Institute has developed a credit scoring data mining application that's used at organisations including the ANZ Bank.

SPSS also offers Text Mining for Clementine, which tackles the very real problem of mining unstructured data such as comments entered by call centre operators into customer records. Whereas OLAP cubes are well-architected for crunching large volumes of numerical data, they fall short when it comes to textual analysis because unstructured content cannot be neatly categorised for crunching by an OLAP engine. This is where text mining engines, offered as standalone products from a number of key vendors, really come into their own.
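At its simplest, mining free-text call notes starts with counting which terms keep turning up. The sketch below is a deliberately crude illustration, not how Clementine or any other product works; the notes and the stopword list are invented.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list, an assumption for this example only.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "at", "about",
             "again", "also", "out", "customer"}


def term_counts(notes):
    """Count non-stopword terms across a collection of free-text notes."""
    counts = Counter()
    for note in notes:
        words = re.findall(r"[a-z']+", note.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


# Hypothetical comments typed into customer records by call centre operators.
notes = [
    "Customer complained about poor reception at home",
    "Reception dropping out again, customer wants refund",
    "Billing query, also mentioned poor reception",
]

print(term_counts(notes).most_common(3))
```

Real text mining engines add linguistic analysis and concept extraction on top, but even a raw count surfaces a recurring complaint that no numeric column records.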



Australia’s first-world economy relies on first-rate IT and telecommunications innovation. David Braue, an award-winning IT journalist and former Macworld editor, covers its challenges, successes and lessons learned as it uses ICT to assert its leadership in the developing Asia-Pacific region – and strengthen its reputation on the world stage.


