Getting your corporate data ready for prescriptive analytics: data quantity and quality in equal measures

Good news: there's nothing special about getting your data ready for prescriptive analytics. Bad news: you need to do everything required to get your data ready for any type of analytics - and that's hard work.
Written by George Anadiotis, Contributor

Prescriptive analytics is nothing short of automating your business. That was the takeaway as we explored the complexities of prescriptive analytics in our guide. While much of that complexity is something line-of-business experts and data scientists will have to deal with, IT is not out of the equation either.

Prescriptive analytics is hard, and there's no silver bullet that can get you there without having gone through the evolutionary chain of analytics. You have to get the data collection and storage infrastructure right, the data modeling right, and the state classification and prediction right. 

This is the prescriptive analytics bottom line, and IT has to make sure the data collection and storage infrastructure parts are in place for business and data science to do their parts. The data cleaning and organization necessary for success with prescriptive analytics can be thought of along two dimensions: quantity and quality of the data that will be used to feed the analytics.

Data quantity

To begin with, IT needs to make sure all the data pertinent to the organization are accounted for and accessible. This really is a sine qua non of any analytics effort, but it may be more complicated than it sounds.

Consider all the applications an organization may be using: custom-built, off-the-shelf, on-premises, in the cloud, legacy. Each of those may have its own format, storage, and API. IT needs to make sure they are all accessible without disrupting the operation of the applications. A data lake approach may be useful in that respect.

And it gets worse. Data may also live beyond applications. Consider all the internal documents and emails, for example. More often than not, a wealth of data lives in unstructured formats and undocumented sources. And many applications are also undocumented, inaccessible, or lack APIs to export data. For those, you will have to either get resourceful, or fail fast.

Even where you succeed, however, this is not a one-off exercise. Applications evolve, and with them so do their data. APIs change, schemas change, new data is added. New applications get thrown in the mix, and old ones become deprecated. Staying on top of data collection requires constant effort, and this is a cost you need to factor in when embarking on your prescriptive analytics journey. Adding semantics to your data lake may help.
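One way to stay on top of evolving sources is to keep a record of what each source looked like the last time you ingested it, and flag anything that changed. Here is a minimal sketch of that idea in Python; the registry, source name, and field names are all hypothetical, and a real setup would track types and versions, not just field names.

```python
# Hypothetical registry of the fields each source exposed at last ingestion.
registered_schemas = {
    "crm_contacts": {"id", "name", "email"},
}

def detect_schema_drift(source: str, current_fields: set) -> dict:
    """Compare a source's current fields against its registered schema.

    Returns which fields were added or removed, so ingestion pipelines
    can be updated before they silently drop or misread data.
    """
    known = registered_schemas.get(source, set())
    return {
        "added": sorted(current_fields - known),
        "removed": sorted(known - current_fields),
    }

# The CRM started exposing a "phone" field since the last ingestion run.
drift = detect_schema_drift("crm_contacts", {"id", "name", "email", "phone"})
```

Running such a check on every ingestion run turns "APIs change, schemas change" from a surprise into a routine alert.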


How much data is enough? As much as possible.

Photo by William Warby on Unsplash

Speaking of cost: of course, the usual IT provisioning discourse applies here, too. Do you plan ahead, make this a project with predetermined budget for infrastructure and personnel costs, and get it through the organizational budget approval process? Or do you take a more agile, pay-as-you-go approach?

The former is theoretically safer, and more in line with organizational processes. Here's the problem: Unless your data sources are relatively limited and well understood, and you are very thorough in keeping track and provisioning for them, this approach may be impossible in practice.

The latter is more flexible, but can also lead to budget overruns and shadow IT issues. Without some method to the madness, you may end up spending beyond control and having your data stored all over the place. Although this is not a 100 percent strict rule, the budgeting-ahead approach makes more sense when going for on-premises storage, while cloud storage and development lend themselves well to the pay-as-you-go approach.

Finally, data freshness is one more consideration to take into account. If you want your analytics to reflect the real world in real time, the data that feeds it should come in real time, too. This means you should consider streaming data infrastructure. While there are benefits in adopting streaming, it's a new paradigm that comes with its own learning curve and software/hardware/people investment.
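A simple way to make freshness measurable, regardless of which streaming platform you eventually adopt, is to stamp every event with its arrival time as it enters your pipeline. The sketch below is a toy Python generator, not any particular streaming framework; the event shape and field names are assumptions for illustration.

```python
from datetime import datetime, timezone

def ingest(events):
    """Tag each incoming event with an arrival timestamp, so downstream
    analytics can compare event time to ingestion time and measure how
    fresh the data actually is."""
    for event in events:
        yield {**event, "ingested_at": datetime.now(timezone.utc)}

# Simulate a small batch arriving from some upstream source.
fresh_events = list(ingest([{"order_id": 1}, {"order_id": 2}]))
```

The same pattern carries over to real streaming infrastructure: the freshness metric (arrival time minus event time) tells you whether your analytics reflect the real world in real time, or yesterday's world.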

Data quality

Garbage in, garbage out is a golden rule when it comes to using data to get analytics insights. So while getting all the data you can get a hold of is an absolute must, it won't help much if you just dump it in a data lake and consider your work done. That's precisely the reason data lakes have gotten a bad name - data swamp, anyone?

To pick up on the "data evolves" theme: data governance really should be your number one priority. Yes, it sounds abstract, but it's just as important as building a pipeline to channel your data into your data lake. Each dataset should come with metadata on its lineage (where the data comes from), its acquisition date, its access rights, and its processing history.
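To make the governance metadata concrete, here is a minimal sketch of what a per-dataset record could look like. The class and field names are hypothetical; a real catalog would sit in a metadata store, but the four pieces of information are the ones named above: lineage, acquisition date, access rights, processing history.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetRecord:
    """Governance metadata carried alongside a dataset in the data lake."""
    name: str
    lineage: str                 # where the data comes from
    acquired: date               # acquisition date
    access_roles: set            # who is allowed to read it
    processing_history: list = field(default_factory=list)

    def log_step(self, step: str) -> None:
        """Append a processing step, so transformations stay auditable."""
        self.processing_history.append(step)

rec = DatasetRecord(
    name="invoices_2018",
    lineage="ERP export",
    acquired=date(2018, 11, 1),
    access_roles={"finance"},
)
rec.log_step("deduplicated")
```

Keeping a record like this per dataset is also what makes the GDPR questions discussed next answerable at all: you cannot honor access or deletion rights for data whose lineage and whereabouts you don't track.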

This latter aspect has become increasingly important in today's GDPR world. Of course, not all organizations deal with user data, and even for the ones who do, not all data will be related to users. Still, most organizations at least touch upon some personal data. For those, GDPR provisions need to apply.

So the question becomes: what's more efficient - dividing and conquering, or giving all data the GDPR-ready treatment? In many cases, if the infrastructure to apply full-circle data governance is there anyway, it makes sense to apply it to all data. This may make dataset processing a less lightweight process, but the benefits of metadata for downstream applications can very well make up for it.


Only quality data can lead to better analytics. And you have to learn to listen to the data, too.

Photo by Franki Chamaki on Unsplash

To boot, master data management is something that can benefit from metadata. When collecting data from many sources, the same entity may well exist more than once. For example, references to customer X will probably exist in the CRM system, in a number of emails and documents, and in the ERP system.

Master data management is the art of keeping references to customer X consistent in your data, and it's an essential component of data quality. Without it, your analytics will not be able to identify customer journey patterns in your data.
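At its simplest, keeping references to customer X consistent means deriving one canonical key for every variant of the name that shows up across sources. The sketch below is a deliberately crude Python illustration of that matching step; the normalization rules, source names, and suffix list are assumptions, and production master data management uses far more robust matching.

```python
import re

def normalize_name(name: str) -> str:
    """Derive a crude canonical key: lowercase, strip punctuation,
    and drop common legal suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\b(inc|ltd|gmbh|llc)\b", "", key)
    return " ".join(key.split())

# The same customer appears under different spellings in different systems.
records = [
    ("crm", "Acme Inc."),
    ("erp", "ACME, Ltd"),
    ("email", "Globex LLC"),
]

# Group source references under one canonical ("golden") key per entity.
golden = {}
for source, raw in records:
    golden.setdefault(normalize_name(raw), []).append(source)
```

Once every system's references resolve to the same golden key, customer journeys can actually be traced across the CRM, the ERP, and everything in between.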

Often, the gist of data quality comes down to truly mundane issues: unit conversions, data formats, and the infamous "address issue". To quote Mark Bishop, Tungsten TCIDA director:

"My team and myself were hired to work with Tungsten to add more intelligence in their SaaS offering. The idea was that our expertise would help get the most out of data collected from Tungsten's invoicing solution. We would help them with transaction analysis, fraud detection, customer churn, and all sorts of advanced applications.

But we were dumbfounded to realize there was an array of real-world problems we had to address before embarking on such endeavors, like matching addresses. We never bothered with such things before -- it's mundane, somebody must have addressed the address issue already, right? Well, no. It's actually a thorny issue that was not solved, so we had to address it."
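To give a flavor of why the address issue is thornier than it looks, here is a toy normalization step in Python. This is not Tungsten's approach, just a hedged illustration: the abbreviation table is a tiny hypothetical sample, and real-world address matching has to cope with misspellings, reorderings, and country-specific formats that a lookup table cannot.

```python
import re

# Hypothetical sample of street-type abbreviations; real tables are far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def normalize_address(addr: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations,
    so superficially different spellings compare equal."""
    tokens = re.sub(r"[.,]", " ", addr.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

# "12 Main St." and "12 MAIN STREET" should resolve to the same address.
canonical = normalize_address("12 Main St.")
```

Even this toy version shows the pattern: the mundane work is in accumulating the rules, and every rule you miss becomes a duplicate record downstream.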


Addresses are a good example of how data quantity and quality need to coalesce if you want a dataset that can feed prescriptive analytics efforts. Your data scientists won't be able to do the feature engineering that capitalizes on business expertise if your data is not abundant and clean.

But that already implies the most important thing, which is often left out of the equation: culture. No organization can benefit from prescriptive analytics without a change in attitude to become data-driven. And that is perhaps the most important by-product of going through the evolutionary path of analytics.

If you do that, what you'll find is that you will no longer be thinking in terms of IT and business: data is business, and it's everyone's job to produce and consume it. IT is just the facilitator.
