Olly Downs is a data scientist with an academic pedigree as long as your arm and a hatful of degrees from Princeton in the US and Cambridge in the UK. His record in business isn't bad either.
As well as spells as a data scientist at Barnes & Noble, broadcasting firm SiriusXM and MSN, he worked at Microsoft Research spin-off Inrix, harvesting data from a fleet of vehicles fitted with GPS locator devices to deliver real-time traffic information.
His present role is senior vice president of data science at analytics applications company Globys, which uses big data to provide mobile operators with targeted marketing. The company employs large-scale data mining, predictive modelling, and machine learning to spot the right time to interact with customers.
Here are the challenges he picked out for businesses implementing big data projects:
There's definitely a skills issue. It's probably part of why the first wave of big-data initiatives hasn't been terribly successful. Essentially, there are the same fundamental challenges as with small-data initiatives — in the sense that getting real knowledge out of data is not really an IT capability.
It's an analytics and data-science capability and that skillset isn't there. It's compounded when you add big data to it because the big-data technologies in play require much more of a software development bent, rather than an IT systems management skillset.
I've been involved in the Seattle area in establishing some professional development programmes with the universities. These focus on taking software engineers and IT skillsets and teaching them that there's a garbage-in, garbage-out problem that comes with managing data at scale, and on giving them some appreciation of the contents of the system and how you then successfully set it up.
The challenge is that it all goes back to the beginning and how you structure data to make it accessible for ad-hoc analysis and make it flexible enough that you can get some things out.
What companies like Tableau Software and QlikTech, and advanced users of Microsoft Excel too, have shown the world is that you don't have to be a database expert to start manipulating data in an ad-hoc way and coming up with interesting views and insights, provided the data warehouse was appropriately structured from the start. It's very hard to fix after the fact.
The challenge today is that most enterprise data warehouses view a customer or an entity that the business works with as a row of data rather than a column. That row is populated and updated perhaps on a daily basis with snapshot or aggregate views of the current state of the customer.
But you've collapsed away all the data that tells you about what the individual entity has actually done and the things that have accumulated about them over the course of their relationship.
That makes it much harder to go back as a BI and analytics insights team and recover and start building models that are predictive or actionable in shaping behaviour or changing the relationship you have as an enterprise with your customers.
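The distinction between a snapshot row and the underlying event history can be sketched like this (the customer and event fields are invented for illustration): a snapshot can always be derived from events, but events can never be recovered from the snapshot.

```python
from collections import defaultdict

# Hypothetical event log: each record captures something a customer actually did.
events = [
    {"customer": "c1", "type": "topup", "amount": 10, "day": 1},
    {"customer": "c1", "type": "call",  "amount": 2,  "day": 2},
    {"customer": "c1", "type": "topup", "amount": 5,  "day": 3},
]

# A snapshot row can always be derived from the event history ...
balance = defaultdict(int)
for e in events:
    balance[e["customer"]] += e["amount"] if e["type"] == "topup" else -e["amount"]

snapshot = {"customer": "c1", "balance": balance["c1"]}  # one aggregated row

# ... but the reverse is impossible: the snapshot alone cannot tell you
# when, how often, or in what pattern the balance changed.
print(snapshot)  # {'customer': 'c1', 'balance': 13}
```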
That's a real problem and part of the very unhealthy dialogue that tends to occur:
Data scientist: Just give me the data and I'll work out what it is we'll need.
Response: Well, if you can tell me just exactly what you need, we'll get it for you.
Data scientist: I'm not going to know what I need until I see it all.
Response: You really want all the data?
Data scientist: Yes, ideally we'd have all the data in its most basic form.
Response: We've got that on tape drive somewhere.
And so the story goes. The challenge you often see is the data is collected and persistently stored, frequently for the purpose of disaster recovery.
But that kind of long-term or expanded storage perpetuates the same schema that exists live, rather than perpetuating data in a more native form that you could go back to and then change how you subsequently process it and bring it live. It burns in whatever the first thinking was about how that data should be used.
So the most successful enterprises I've seen have been ones with an archival process that stores the most basic data in very cold or slow storage, doesn't treat that archive as a disaster-recovery system, and then has a separate disaster-recovery system for the current live system.
That's the most flexible way to maintain your data and make it have future value that you couldn't previously have expected.
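A minimal sketch of that idea, assuming a hypothetical archive of gzip-compressed JSON lines kept alongside, not instead of, the live system's backups: records go in their most basic, native form, so they can be reprocessed later with logic nobody anticipated at write time.

```python
import gzip
import json
import os
import tempfile

# Hypothetical raw records in their most basic, native form.
raw_records = [
    {"ts": 1, "msisdn": "555-0001", "event": "sms",  "bytes": 0},
    {"ts": 2, "msisdn": "555-0001", "event": "data", "bytes": 2048},
]

# Archive: append-only JSON lines in cold storage, untouched by the live schema.
archive_path = os.path.join(tempfile.mkdtemp(), "archive.jsonl.gz")
with gzip.open(archive_path, "wt") as f:
    for rec in raw_records:
        f.write(json.dumps(rec) + "\n")

# Later, reprocess with a question nobody asked when the data was written,
# e.g. total data volume per subscriber.
with gzip.open(archive_path, "rt") as f:
    total_bytes = sum(json.loads(line)["bytes"] for line in f)
print(total_bytes)  # 2048
```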
Part of the key is that data science-driven projects tend to require multiple cycles of history. You need that to understand what's going to happen next from a seasonal perspective. You need examples of other macro-economic events to allow you to model them correctly in the future.
For gathering insights — and this is something that hasn't changed with the big-data revolution essentially — you only need to sample the data in a representative way. That's how you do prototype analytics and things like that, and it's also how you can generate some good reporting.
But when you want to apply that knowledge back to the entirety of your business, you need to be able to have that representation for every individual. If you had distilled away all that data, you'd just never be able to do that. You'd be treating people in broad-brush segments or groups.
So from an insights perspective, sampling the data in an elegant and representative way will make insights accessible to you. Then if you haven't kept the data around for every individual, it isn't going to be actionable in ways that are individualised.
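A toy illustration of that split, using a made-up customer table: a representative random sample is enough to estimate population-level insights, but acting on each individual requires keeping every row.

```python
import random

random.seed(0)

# Hypothetical customer base with a per-customer spend figure.
customers = [{"id": i, "spend": (i % 7) * 10} for i in range(10_000)]

# Insights: a representative sample estimates population statistics well.
sample = random.sample(customers, 500)
est_avg = sum(c["spend"] for c in sample) / len(sample)
true_avg = sum(c["spend"] for c in customers) / len(customers)

# Action: scoring *every* individual needs the full, un-distilled data;
# a sample could never tell you which specific customers to target.
high_value = [c["id"] for c in customers if c["spend"] >= 50]
```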
The interesting thing is that Hadoop is great for batch-mode processing at large scale, for operations like aggregation or counting. The problem is that Hadoop is not a real-time or very dynamic technology at all.
Running queries on a Hadoop cluster tends to have quite a large latency because you have to distribute out each individual query run, then you do your reduction step, which is bringing all that data back together. So it's a high-throughput but high-latency technology.
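A minimal in-process sketch of the map, shuffle and reduce phases that Hadoop runs at cluster scale (the record format is invented for illustration); the shuffle step, regrouping all intermediate data by key across the network, is a large part of the per-query latency described above.

```python
from collections import defaultdict
from itertools import chain

# Toy input records: "key,value" strings.
records = ["a,3", "b,1", "a,2", "b,4"]

def map_phase(record):
    """Map: process each record independently (distributed across workers)."""
    key, value = record.split(",")
    yield key, int(value)

mapped = chain.from_iterable(map_phase(r) for r in records)

# Shuffle: group intermediate results by key. On a real cluster this
# network-wide regrouping dominates the latency of each query run.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group, e.g. summing the values.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # {'a': 5, 'b': 5}
```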
The complement to technology like that is Twitter's Storm, which is more of an in-datastream distributed-processing capability. IBM had an early technology in this space, which is now very nicely productised as InfoSphere Streams.
But that follows the same type of idea: sometimes I have a very high-volume datastream and I need to be able to make decisions on it before I put that data into a repository and begin to compute aggregates.
I want to be able to do some manipulations and event detections in-stream, so Storm, InfoSphere Streams and some of those technologies are the nice complement to high-throughput but high-latency systems.
These guys are high throughput, very low latency but lower complexity in terms of the things that you can do. You can't very easily do machine learning on a datastream, for example.
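A toy sketch of the in-stream idea, assuming a hypothetical spike detector: each record is decided on as it arrives, with only a small rolling window of state, rather than being stored and batch-aggregated later.

```python
from collections import deque

def detect_spikes(stream, window=3, factor=2.0):
    """Yield values that exceed `factor` times the rolling-window average.

    Bounded state (one small deque) is what keeps latency low, and also
    what limits the complexity of what can be computed in-stream.
    """
    recent = deque(maxlen=window)
    for value in stream:
        if len(recent) == window and value > factor * (sum(recent) / window):
            yield value  # event detected in-stream, before any storage
        recent.append(value)

spikes = list(detect_spikes([10, 11, 9, 50, 10, 12]))
print(spikes)  # [50]
```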
That's the experience of a lot of enterprises: there's lots of investment but it's very hard to get real actionable knowledge. One tempting way to solve these problems is to outsource your BI or marketing intelligence and hope you can get the scale you wish you had in-house.
But the problem is that this just hides the fact that you need a lot of external people, just as you would need a lot of internal people, to solve these problems. It doesn't necessarily scale with technology.
So we focused on how the technology can help us scale these actionable insights and individualised actions: partially take it out of the hands of the enterprise, but provide a technology-enabled service that delivers that value without needing lots of human hands.