There's gold in them thar databases

Written by David Braue, Contributor

Finding hidden patterns and trends in masses of corporate data--in real time--takes up enormous processing power, but it can be very profitable.

The buzz of the corporate data explosion has meant near-constant headaches for corporate data managers charged with making sense of ever-growing mountains of information. Facing pressure to extract meaningful value from this data, many have turned to complex data mining systems that use preternaturally intelligent mathematical algorithms to sift through massive volumes of information looking for previously unknown relationships.

Although data mining has been sold as a way to maximise the value of any type of corporate data, its user base has largely been limited to the upper tier of business: financial services, large retail, telecommunications, and government users.

Tax offices use it to look for discrepancies in lodged forms; banks use it to spot credit card fraud and to profile customers likely to declare bankruptcy or default on loans; law enforcement agencies use it to detect money laundering; insurance companies use it to pick out fraudulent claims; and retailers analyse sales trends to model consumers' buying behaviour and adjust stores' stock levels accordingly.

In a well-known example of the technology's power, the Australian Securities and Investments Commission used NetMap, which cross-references relationships between data in a case, to root out the 1996 insider trading conducted by then-Macquarie Bank executive Simon Hannes.

The technology has even found a home at Italian soccer club AC Milan, which is using Computer Associates' CleverPath Predictive Analysis Server to analyse physiological, orthopaedic, and mechanical data culled from a variety of sources. The system digs through data to identify factors that may have led to player injuries in the past, then uses CA's Neugents technology to watch incoming data and warn specialists about the emergence of a similar situation that could potentially lead to player injury.

At its core, data mining is about creating useful information by identifying previously unknown relationships within a data set. Yet despite its usefulness, practical limits on data mining's uptake have stemmed from its high cost (implementations often run into seven figures) and the complexity involved in integrating applications and databases with tools that have never been designed for user-friendliness.

That's led to a frustrating truth: although many companies have successfully implemented data warehouses--massive databases containing large volumes of historical data for analysis and reuse--many more have struggled to do more with that data than run basic reports using simple tools. Indeed, while relatively mature business analytics tools have been available for years, a recent survey of 50 Australian executives, conducted by Teradata, revealed that nearly half feel they don't have enough information to make intelligent business decisions.

Always looking for new ways to leverage the data they've collected, companies are continuing to turn to data mining to find out things they never knew about their businesses. IDC's latest assessment of the market for business intelligence (BI) solutions — a category that includes data mining, executive reporting and other data analysis tools — predicted the Asia-Pacific market would grow at 23 per cent annually to be worth US$3.3 billion by 2006. In a separate report, Aberdeen Group predicted the worldwide business analytics market would grow from US$4 billion in 2001 to US$11 billion by 2005.

Much of this growth will be driven by the use of data mining to turn data contained in corporate data warehouses into better pictures of customer behaviour — which, in turn, can fundamentally modify business strategies.

Clearly, companies want to know more about their operations and their best chances for success. IDC's projection is doubly interesting as it confirms that BI is one of the few sectors to enjoy positive growth despite the recent slump in overall IT spending. Despite its cost, it seems data mining is recession-proof: its value makes it a key ally in identifying and setting strategic priorities that can turn around a business hit hard by tough economic times.

In a study of business analytics' business value, IDC recently found that among North American and European companies, the successful introduction and use of analytics has delivered returns on investment of anywhere from 17 per cent to 2000 per cent, with a median ROI of 112 per cent; more than half saw ROI of 101 per cent or more, and fully one-fifth of respondents saw an ROI of 1000 per cent or more. Overall, the mean payback period was 1.6 years, with projects averaging US$4.5 million. When paired with business process change such as CRM, analytics delivered a median ROI of 55 per cent.
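To put those figures in perspective, here is a back-of-the-envelope sketch in Python. It assumes the simplest possible definitions (ROI as net benefit divided by investment, payback as investment divided by a steady annual net benefit); IDC's own methodology is not described here and may well differ. The function names are illustrative only, and the sketch pairs an average project size with a median ROI purely for illustration.

```python
def simple_roi(net_benefit, investment):
    """Return on investment as a fraction: net benefit divided by cost."""
    return net_benefit / investment


def payback_years(investment, annual_net_benefit):
    """Years needed for cumulative net benefit to cover the investment."""
    return investment / annual_net_benefit


investment = 4.5   # US$ million, the average project size quoted above
median_roi = 1.12  # 112 per cent, the median ROI quoted above

# Net benefit a project of that size would need to return to hit the median ROI.
required_net_benefit = median_roi * investment

# Steady annual net benefit that would produce the quoted 1.6-year payback.
required_annual_benefit = investment / 1.6

print(f"Net benefit for median ROI: US${required_net_benefit:.2f} million")
print(f"Annual benefit for 1.6-year payback: US${required_annual_benefit:.2f} million")
```

Under these assumptions, a US$4.5 million project would have to return roughly US$5 million in net benefit to match the median, or a little under US$3 million a year to pay for itself in 1.6 years.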

These kinds of numbers make a compelling argument for business analytics, and data mining is a significant part of those analytics. To this end, it's possible to construct a quite compelling business case around the introduction of such technologies, which are both a worthwhile investment in new environments and a strong step for companies that have invested heavily in data warehouses in the past. As the IDC figures show, the potential returns are truly limited only by the imagination.

A stitch in time saves (data) mine

In the past, the data mining market was defined by proprietary products designed to sit on top of enterprise data warehouses, or by extensions to those warehouses themselves. Data mining was seen as a discrete activity in which a skilled analyst would define complex search parameters, then sic the tools onto the data warehouse and wait for results.

When those results came, they were usually helpful: two or more products might be selling well together, or a particular type of customer might have filed more lost property claims in the past month. Whatever the data, however, data mining's biggest shortcoming was that it could take some time for business practices to act upon the relationships that it had found.
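The "products selling well together" finding is the classic market basket result. As a rough sketch of the idea, and not of any vendor's algorithm, the Python below counts how often each pair of items turns up in the same basket. The basket contents and the `pair_support` name are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets in which each pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so ("chips", "cola") and ("cola", "chips") count as one pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical transactions, one list per checkout.
baskets = [
    ["chips", "cola", "salsa"],
    ["chips", "cola"],
    ["toothpaste", "cola"],
    ["chips", "salsa"],
]

support = pair_support(baskets)
for pair, share in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, f"{share:.0%}")
```

Real tools go further, weighing each pair against how often the items sell on their own, but the counting step is the heart of it.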

In many ways, this problem was a result of the architectural and technical issues that data mining presented. Thus was born OLAP (OnLine Analytical Processing), a category of tool that allows data analysts to pull out several dimensions of data from a larger data set, then explore relationships between those sets. In the years since it was introduced, OLAP has gained a steady following as a way of finding patterns in large masses of data, yet it is still human-directed, only revealing patterns as it's instructed to. OLAP is also, by its nature, extremely limiting, since it is predicated on ignoring many other data sets.
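For readers who haven't met OLAP, the core operation is a roll-up: pick a few dimensions, then total a measure across every combination of them. The Python sketch below shows the idea on a handful of made-up sales rows. The field names and the `roll_up` helper are assumptions for illustration, not any product's API.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum `measure` over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical fact rows; a real warehouse would hold millions of these.
sales = [
    {"region": "NSW", "product": "chips", "quarter": "Q1", "revenue": 120.0},
    {"region": "NSW", "product": "cola", "quarter": "Q1", "revenue": 80.0},
    {"region": "VIC", "product": "chips", "quarter": "Q1", "revenue": 60.0},
    {"region": "NSW", "product": "chips", "quarter": "Q2", "revenue": 150.0},
]

by_region = roll_up(sales, ["region"], "revenue")
by_region_product = roll_up(sales, ["region", "product"], "revenue")
print(by_region)
print(by_region_product)
```

Note that the analyst chooses the dimensions up front. Anything left out of the cube, such as the quarter in the second query, simply never shows up in the answer, which is the limitation described above.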

Data mining, in its purest sense, is about establishing a data analysis methodology that can be a superset of the narrow scope of OLAP. Yet making this happen has historically been difficult, if only because of the massive volume of data, and the attendant time required to process it, that most companies faced.

Recognising that improving this situation presents a significant opportunity for product differentiation, database and enterprise application vendors have recently committed themselves to improving the utility of data mining by positioning it as a real-time function built into business information systems.

"We've pushed data mining algorithms into DB2 to be able to do things like real-time segmentation as transactions are coming in," says Janet Perna, worldwide general manager of data management with IBM. "The big problem with data mining technology has been that there's been a sense that it requires a PhD to be able to utilise the technology. But these technologies are becoming part and parcel of the data infrastructure, and it's going to become more and more mainstream as more capability is pushed into the data engine."

To this end, IBM recently added a host of data mining capabilities into its DB2 OLAP Server and offers more features through its DB2 Intelligent Miner, which conducts real-time scoring on DB2 data and PMML (Predictive Model Markup Language) data. PMML is an XML-based markup language, now in version 2.1 and supported in a number of data mining and enterprise applications, that describes inputs to data mining models, the transformations used in preparing data for data mining, and the parameters defining the data mining models.

For its part, Microsoft plans to add seven additional data mining algorithms to Yukon, the next major release of SQL Server, which gained OLAP capabilities and support for clustering and "decision tree" data mining algorithms in its SQL Server 2000 version.

Microsoft data access products marketing manager Terry Clancy concedes the company's focus on OLAP has diluted its brand strength in data mining, but the company is planning a major partner and marketing push to remedy this situation, and to commoditise the overall data mining market, when Yukon appears. Yet Microsoft won't be alone: SAP, PeopleSoft, i2, and other enterprise application vendors are adding analytics to their core applications to improve their ability to process data as it's generated.

Because it's so data-intensive, real-time data mining requires a significant investment in hardware in order to allow analytics to run alongside the everyday transaction processing that keeps the company running. All operating system and database combinations have benefited from the ongoing acceleration of server processor speed, which has provided the grunt to make real-time mining possible.

For its part, Oracle has addressed this problem by promoting its Oracle-on-Linux solution for data mining, largely because of the recognised scalability and robustness of the Linux platform and Oracle's own RAC (Real Application Clusters) clustering technology.

"It can be deployed ... on clusters of Linux servers so you can have enterprise data management, but at high reliability," says Roland Slee, director of business and technology solutions with Oracle Australia. "We're seeing customers deploying databases on Intel-based servers with very high clock speeds. Because it supports transparent clustering, you can get scalability and affordability, then take advantage of the data mining features. In Oracle's experience, those [computing] cycles are available more affordably and with better performance using Linux clusters than other environments."

That's just one perspective, however; with both RISC and Intel processors continuing their ascent up the exponential curve of Moore's Law, the servers necessary to make real-time data mining work are quickly dropping in price. That means today's data mining has become both cheaper and more accessible to a broader range of companies than ever before.

Applied data mining 101

Now that real-time data mining is becoming a practical reality, it's time to consider how it might improve existing business processes. Rather than forcing companies to run regular reports about customer preferences, real-time data mining allows, for example, a call centre application to query the data warehouse the instant a customer calls on the phone. This decision tree approach allows the call centre agent's system to adapt on-screen prompts so as to better guide the conversation to a positive and more profitable end.

Consider the case of a mobile phone company, which is constantly struggling to retain customers in an environment of high churn, intense competition and very short customer tolerance. A customer may have had several support calls in the past in which she complained about poor reception, and is making yet another one.

Through conventional methods, the call centre agent is unlikely to know much about the customer's past contacts with the company — unless they've already had a heated argument. That leaves both sides unaware of the trap they're about to walk into, which in turn creates the potential for a confrontational exchange.

If a data mining engine is working in the background while the two talk through the problem, it might notice not only that the caller has had several complaints — and should thus be handled with kid gloves — but also that the caller fits the demographics of people that have historically been high churners. Recognising that the customer isn't likely to stick around much longer if service doesn't improve, the call centre agent could offer her sweeteners such as free calls, on-site replacement, or a handset upgrade if she renews or extends her current contract.

This type of selling would be impossible without having a way to identify those significant opportunities that only arise during the time a company is in live contact with its customers. Since those contacts are most likely to be phone calls associated with billing or service problems, real-time data mining becomes invaluable in allowing companies to seize an up-selling opportunity that might not come again.
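What might that background scoring look like? The Python sketch below is a deliberately crude stand-in: a hand-written set of rules rather than a model mined from real data. The `Caller` fields, the weights and the prompt wording are all hypothetical. A production system would derive them from historical churn records.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    complaints_last_year: int
    months_left_on_contract: int
    in_high_churn_segment: bool


def churn_risk(caller):
    """Toy additive score; the weights are illustrative, not learned from data."""
    score = 0
    if caller.complaints_last_year >= 3:
        score += 2
    if caller.in_high_churn_segment:
        score += 2
    if caller.months_left_on_contract <= 3:
        score += 1
    return score


def agent_prompt(caller):
    """Pick the on-screen prompt for the call centre agent."""
    risk = churn_risk(caller)
    if risk >= 4:
        return "High churn risk: offer handset upgrade on contract renewal"
    if risk >= 2:
        return "Elevated risk: review contact history before proceeding"
    return "Standard support script"


# The customer from the example: several complaints, in a high-churn demographic.
unhappy_caller = Caller(complaints_last_year=4, months_left_on_contract=2,
                        in_high_churn_segment=True)
print(agent_prompt(unhappy_caller))
```

The point is less the rules themselves than where they run: the lookup happens while the customer is still on the line, so the agent sees the prompt in time to act on it.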

"The data mining model is able to go into relationships in a more granular way and find out more unusual combinations that a human mind wouldn't be able to cover," says Richard Lees, principal consultant with Microsoft Australia.

"These have tended to be head-office applications, where a [technical] team built a data mine to find out more about customers. But someone technical will tend to look for things they didn't know about, and if those things are no use to the organisation they're of no use to anybody. We can now create tools for coalface workers to exploit the value of the data mining model--without realising they're doing it."

In something of a confession that earlier data mining tools have been too obscure and esoteric for many customers to use, many vendors are working to package their applications into bundles with specific purposes. SPSS, which claims a number of blue-chip banks and telcos amongst its local user base, recently took this approach with the launch of SPSS Predictive Marketing, which wraps data mining techniques around best-practice marketing templates. Using a standard Web browser interface, users can pick out historical trends and explore what-if models; results are framed in terms that non-mathematical employees can relate to and act upon.

Similar offerings are appearing from virtually every data warehousing vendor. Teradata, which has its core user base in high-volume transactional processing systems, recently released an analysis system targeted at financial services companies, while SAS Institute has developed a credit scoring data mining application that's used at organisations including the ANZ Bank.

SPSS also offers Text Mining for Clementine, which tackles the very real problem of mining unstructured data such as comments entered by call centre operators into customer records. Whereas OLAP cubes are well-architected for crunching large volumes of numerical data, they fall short when it comes to textual analysis because unstructured content cannot be neatly categorised for crunching by an OLAP engine. This is where text mining engines, offered as standalone products from a number of key vendors, really come into their own.
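At its very simplest, text mining starts by turning free-form notes into something countable. The Python sketch below tallies how many call centre comments mention each term, after discarding a small list of filler words. It is a minimal illustration and makes no claim about how Clementine or any other commercial engine works; the comments and the stopword list are invented.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "a", "and", "to", "is", "in", "of", "about", "again",
             "customer", "says"}


def term_counts(comments):
    """Count the number of comments in which each non-stopword term appears."""
    counts = Counter()
    for comment in comments:
        words = set(re.findall(r"[a-z']+", comment.lower()))
        counts.update(words - STOPWORDS)
    return counts


# Hypothetical operator notes.
comments = [
    "Customer says reception is poor in Hurstville",
    "Poor reception again, wants to cancel",
    "Billing query about roaming charges",
]

counts = term_counts(comments)
print(counts["reception"], counts["poor"], counts["billing"])
```

Once terms are counted this way they become ordinary structured data, which is what allows them to be fed into the same analysis as sales or billing figures.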

Getting data mining to the users

Increasing delineation of the functionality of data mining solutions has had another beneficial effect: it's made development kits easier to bundle into discrete functional units. Vendors have been improving the accessibility of data mining solutions: latest toolkit revisions, many of them Java-based, allow enterprise developers to easily integrate each platform's data mining features into in-house applications.

This flexibility is a major improvement over the esoteric and complex interfaces of previous systems, a change that should ease development of analytics portals serving the needs of specific user communities within the business. Rather than existing as complex applications used by a few technical analysts, integrating analytics into a general-purpose portal can easily put powerful analysis tools at the hands of far more employees than ever.

Just remember: although employees will no doubt benefit from better information, they can also drown if they're getting too much of it. "The interface is easy, but we have to decide the right level of exposure to give to users," says Colin Shearer, vice president of analytics with SPSS. "They don't want to see detailed statistics about the predictive accuracy of the model; they want things they can immediately interpret within their own sphere of knowledge."

Making this happen, of course, requires close collaboration between technical and business units so that the developed applications reflect the idiosyncrasies of each company's business. It's also important to develop consistent business rules so that all is not left to chance: rules, enforceable through systems such as CA CleverPath's Business Rules Engine, ensure consistent results and provide auditability if there are ever questions about the mechanisms by which numbers are derived.

"The way in which data mining technology is being incarnated is changing quite dramatically," says Oracle's Slee. "Instead of being a specialist activity performed by a small number of users on a subset of data, mining can now be a mainstream activity performed by all users in the context of mainstream applications."

Cleaning up the warehouse

Despite its benefits, there is considerable risk in the process of implementing data mining. That risk lies not so much in the solutions themselves, but in the fact that properly utilising the technology is an all-or-nothing proposition. Without all of your data in the right place, and in the right order, even the most intelligent data mining algorithm is going to throw up furphies that cloud the insights it might otherwise provide.

If data entry problems mean one of your customers' surnames is spelt different ways, for example, any analysis of your customer data is going to treat that customer as different people, each with different buying habits. Desmond McGillevray might love to buy Pringles potato chips at your Hurstville store, but Desmond MacGillevray could buy lots of toothpaste in Kogarah while Desmond McGillevry also likes to buy Doritos in Rockdale.

Of course, you're unlikely to be writing down customer names for each grocery purchase, but that's just a practical issue. The point remains: feed this data into a data mining system, and it's going to tell you something different than if you build a long-term profile of Desmond MacGillevry's overall buying habits. Compound this sort of problem to hundreds of thousands of customers, and it's easy to see why many companies have treated data mining more as a goal — to be reached after careful due diligence and data amelioration — than as a single project in itself.
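Spotting that the three Desmonds are probably one person is the kind of job automated cleansing tools take on. As a rough sketch, the Python below uses the standard library's `difflib.SequenceMatcher` to group names that are nearly identical. The 0.9 threshold and the greedy grouping are assumptions chosen for this example; commercial tools use far more sophisticated matching.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.9):
    """Greedily group each name with the first cluster whose lead name is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


names = [
    "Desmond McGillevray",
    "Desmond MacGillevray",
    "Desmond McGillevry",
    "Diana Smith",  # hypothetical unrelated customer
]

clusters = cluster_names(names)
for cluster in clusters:
    print(cluster)
```

A match like this is only a suggestion: two genuinely different customers can have near-identical names, so flagged groups still need a human, or a second identifier such as an address, to confirm them.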

The ability to tie these purchases together is reason enough to implement a loyalty program, where each customer has a unique identifier that reduces the risk of data entry problems and that coalesces later customer support and marketing around a single historical record.

In many cases, data consistency problems are compounded when data mining tools are applied to data culled from several enterprise systems--for example, sales, loyalty program, customer care and marketing databases. Unless enough work on consistency has been done beforehand, it's likely that each of those databases will represent many customers in different ways within the data warehouse.

Feed this data into a data mining system, and you've got the preparations for an informational disaster — which can become even more problematic if you mistakenly act on bad data that you believe is accurate. The solution to this problem lies in careful data checking (automated tools can help this process) and a concerted effort to improve data entry procedures so problem data gets fixed and stays that way.
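One simple form of automated checking is to compare how each source system records the same customer before anything reaches the warehouse. The Python sketch below assumes, purely for illustration, that every system shares a customer ID and exposes a `surname` field. The system names and records are made up.

```python
def find_conflicts(systems, field):
    """Map customer id -> {system: value} wherever the systems disagree on `field`."""
    seen = {}
    for system_name, records in systems.items():
        for customer_id, record in records.items():
            seen.setdefault(customer_id, {})[system_name] = record.get(field)
    # Keep only customers for whom more than one distinct value was recorded.
    return {cid: values for cid, values in seen.items()
            if len(set(values.values())) > 1}


# Hypothetical extracts from three enterprise systems, keyed by customer ID.
systems = {
    "sales": {1001: {"surname": "McGillevray"}, 1002: {"surname": "Nguyen"}},
    "loyalty": {1001: {"surname": "MacGillevray"}, 1002: {"surname": "Nguyen"}},
    "support": {1001: {"surname": "McGillevray"}},
}

conflicts = find_conflicts(systems, "surname")
print(conflicts)
```

A report like this won't say which spelling is right, but it does tell the data team exactly where to look before the mining tools are switched on.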

By bringing complex analytics to large communities of users, today's data mining platforms have broken down many of the barriers that prevented their adoption in the past. Given the demonstrated benefits of data mining to organisations that have pursued it in the past, playing the ease-of-use card has created a clear and compelling business case that, with just a little imagination, can deliver far more relevant, data-enabled applications than ever.

Data intelligence strengthens OneSteel

OneSteel, the recently divested steel manufacturing division of BHP Billiton, manages extraordinarily complex supply chains emerging from the co-ordination of raw materials, region-wide logistics, process manufacturing and marketing in a fiercely competitive global market.

Given the wealth of information its nearly 600 knowledge workers need to process, business analytics have long been an everyday part of life at the $3 billion OneSteel, which has most of Cognos' analytics applications in production in one way or another. Data mining is among the latest of these tools, riding a growing crest of recognition that data is good for far more than simply filling up transactional databases.

That recognition has come as IT and business staff work together to cull the most interesting details from mountains of data generated every day. Every day or so, automated tools pull out fresh data about manufacturing, sales and other parts of the business from OneSteel's JD Edwards, BPCS, and other systems. This data is then loaded into a number of dedicated data marts running on Microsoft SQL Server (these will soon be replaced when OneSteel completes its current migration to SAP on Oracle), and made available to employees using a variety of analytical tools.

This approach ensures that data is always current, but getting to this point has taken a significant effort in ensuring data consistency, says Will Rigby-Jones, manager of OneSteel's knowledge systems, whose role involves finding new ways to utilise data analysis to help managers improve the business.

"To use data mining in the way it was intended requires a need from the business, and requires quality data captured in the right way, then delivered and presented in the right way," he says. "There's nothing more catastrophic than having one report that says the same thing [and another] that says something different."

Making the jump from reports that ran to hundreds of 132-column fanfold pages, which were for years the only way for managers to get business information, to onscreen analytics has required careful attention to users' needs, Rigby-Jones explains. Managers, of course, quickly warm to the ability to prepare reports that might have previously taken days to compile, in just seconds.

Given the size of the business, however, data mining can also swamp them with data, perpetuating an age-old problem. To avoid this issue, the OneSteel team has expended considerable effort to provide interfaces that provide easy access to the most important information for each group of users.

Stoplight-style indicators, built using Cognos Metrics Manager and fed with data from proactive data mining and OLAP analysis, allow managers to easily spot which metrics need attention, then drill down into more detail as necessary. Furthermore, growing utilisation of Web-based interfaces tells Rigby-Jones that increasingly senior managers are recognising the value of the data mining environment.
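The logic behind a stoplight indicator is easy to picture. The Python sketch below classifies a metric as green, amber or red according to how far it falls short of its target. It is a generic illustration, not a description of how Cognos Metrics Manager works, and the five per cent tolerance is an arbitrary assumption.

```python
def stoplight(value, target, tolerance=0.05, higher_is_better=True):
    """Classify a metric as green, amber or red relative to a positive target."""
    if higher_is_better:
        shortfall = (target - value) / target
    else:
        # For costs and similar metrics, overshooting the target is the problem.
        shortfall = (value - target) / target
    if shortfall <= 0:
        return "green"
    if shortfall <= tolerance:
        return "amber"
    return "red"


# Hypothetical metrics: (name, actual, target, higher_is_better).
metrics = [
    ("orders booked", 102, 100, True),
    ("on-time deliveries", 97, 100, True),
    ("freight cost", 110, 100, False),
]

for name, actual, target, higher in metrics:
    print(name, stoplight(actual, target, higher_is_better=higher))
```

The value of such a display lies in what it hides: a manager sees three colours rather than three tables, and drills down only where something is amber or red.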

The ability to spot important multi-factor trends and analyse business data to the nth degree, in near real-time, has sped up workers' ability to use information. At a broader scale, it's also revolutionised management philosophy by putting a visible face on the theory that even several small business changes can compound into significant business improvement. With the right tools now providing a way to weigh up the relative merits of such changes, OneSteel knows more about its business than ever.

"In the past, making money was all about getting the biggest price at the lowest cost," Rigby-Jones says. "But it's actually a whole lot of things. We might find that if you increase the booking rate by x percent, decrease the amount of overtime, save x percent on freight — if you add all these together it might be enough to affect the money in a large way. But if you try to do one of them, you're doomed to failure. It's all about relationships: business is the puppeteer and we provide the strings to pull."

Executive summary: dig for gold, toss the pyrite

Real-time data mining significantly improves both the user community's access to the technology, and the role that data mining can play within everyday business processes. Here are a few tips to make the most of your data:

  • Think hardware. Data mining is extremely computing-intensive, particularly if it's being done continuously in the background as in a real-time environment. Clustered servers will provide the scalability you need to go real-time without affecting overall performance.
  • Clean up your data. Business analytics are impossible to use effectively if your data isn't clean and consistent, yet many companies still haven't resolved chronic discrepancies between the data held in different types of databases. Before going into data mining, get your data under control and figure out how to keep it that way. This often involves people training as much as proper systems.
  • Think outside the cube. OLAP may be great for crunching numbers, but it's inherently limiting because it excludes many types of information that may well be relevant. Use OLAP data marts for users needing to run consistent, regular reports — but when it comes to spotting new trends, point your data mining tools at your full data set.
  • Customers respect knowledgeable staff. But they'll go running if your staff don't have the right data to resolve their issues quickly. Real-time analytics can be an important tool in improving customer care by putting the right information at your customer service representatives' fingertips. That way, they can make informed decisions when it's important to — not later on, after they're off the phone.
  • It's not what's in the data that counts. It's how you use it. Just implementing data mining is only one part of the challenge; the real value of that analysis lies in the ability to turn that information into real business decisions. Make sure managers are trained to think laterally by using data mining to find new and interesting patterns in company databases.
  • Text isn't the same as data. Comments from customers may be hastily typed notes from call centre operators, but they're extremely useful in determining customers' opinions. Yet while numbers are usually contained in well-structured databases, textual information is rarely so ordered — so it's hard to analyse using conventional data mining tools. If you want to pull out trends from textual information, consider companion text mining tools that complement conventional data mining.
  • Think of data mining as a feature. It's usually been made possible by standalone products in the past, but enterprise application vendors are increasingly building it into their databases and applications. This may be a particularly effective approach as it allows data mining environments to leverage the strength of the underlying database or application — and to follow those applications' growth curve by using capabilities such as built-in clustering support.
  • Share data intelligently. It's one thing to use data mining to spot new patterns in your data, but it becomes even more effective when you can feed relevant portions of that data to your suppliers. Noticed customers tend to buy loads of Coke when it's discounted along with Doritos? Make sure your systems can automatically tell the distributors to send you extra volumes so you can keep up with expected demand.
  • Don't overbuy BI. Anecdotal evidence suggests many companies buy large numbers of licences for analytical tools, then end up using just a few of them as power users warm to the tools and other users reject them. Start low, gauge user demand and increase your licences from there. A web client may be an easy way to do this without headaches, since it can be easily offered to new users as demand dictates.
  • The interface is everything. Analysts love data mining since it lets them explore complex data sets. Most users hate it because it lets them explore complex data sets. Since accessibility is the key to data mining success, either get analysts to work closely with business teams, or integrate the mining into other, more user-friendly applications so users are always viewing results in a context that's meaningful to them.
