Two key facts about data scientists: they're in short supply and most people don't actually get what they do. That unfortunate combination of being in demand and yet misunderstood is complicated by marketing's cavalier use of the job title.
Paul Schaack, former CERN physicist and now senior data scientist at predictive analytics SaaS firm Blue Yonder, says the problem is that the term 'data scientist' is vague and many people in the business world misconstrue the role.
"It's interesting to see what nowadays is defined as a data scientist, because so many people call themselves that. Some people from marketing call themselves data scientists. Other people actually do proper machine learning and work with data programming. They also call themselves data scientists. But if you compare the two groups, it's like two different worlds," he said.
"Marketing people use it because they look at customer insight data. It's business insights that they're doing, pure analysis. It's nothing to do with data science. They don't use statistical methods to reason or to quantify their results."
According to Schaack, a data scientist should be a fusion of software engineer, data analyst and statistician.
"It's a bit of engineering, machine learning and statistics. So a typical statistician is more of a data scientist than somebody in marketing. It matters when people have a very specific idea of what they want from their data scientist and they don't get it and are surprised why," he said.
Two weeks ago, scientists at CERN's LHCb experiment announced the discovery of a new particle called the pentaquark. It was the project that Schaack worked on until the Large Hadron Collider shut down for two years in 2013 for upgrades and maintenance.
Inside the collider, two beams travel in opposite directions around the 27km ring at close to the speed of light. Each beam contains up to 476 bunches of 100 billion protons, creating collisions every 50 nanoseconds. The Higgs boson particle was found by the Atlas and CMS experiments in 2012. LHCb is designed to look at the asymmetry between matter and anti-matter.
Schaack today works with businesses ranging from retailers to manufacturers, applying the same NeuroBayes algorithm he used at CERN to study a rare decay of a B meson into other sub-atomic particles.
"For my purposes, it was very useful because I had a problem. My signal was very small. I only had a very small number of events, maybe 70, and the background was huge, thousands of millions of events. So I needed an algorithm that was very accurate in predicting the correct signal," he said.
The NeuroBayes algorithm was developed by Blue Yonder founder and chief scientific adviser Professor Michael Feindt when he too was working at CERN. The algorithm uses historical data to forecast events, producing probability distributions for individual occurrences.
"What you put into [NeuroBayes] is usually a quantity and whether that's a physical quantity or a business value, it doesn't matter to the algorithm as such," Schaack said.
"The output is the same - usually it's a probability. In science it's a probability of it being the particle decay you're looking for or not and in business it's the probability of whether the customer is likely to spend or not."
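NeuroBayes itself is proprietary, but the idea Schaack describes — feed in a quantity, get out a probability, regardless of whether the input is a physical measurement or a business value — can be sketched with any probabilistic classifier. The following is a minimal logistic model in pure Python; the data, learning rate and "signal versus background" framing are purely illustrative, not a description of NeuroBayes:

```python
import math
import random

def sigmoid(z):
    # Squash a real-valued score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit a one-feature logistic model by batch gradient descent."""
    b, w = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        gb = gw = 0.0
        for x, y in zip(samples, labels):
            g = sigmoid(b + w * x) - y  # gradient of the log-loss
            gb += g
            gw += g * x
        b -= lr * gb / n
        w -= lr * gw / n
    return b, w

# Toy data: "signal" events cluster at higher values of the input quantity,
# "background" events at lower values (entirely made-up numbers).
random.seed(0)
background = [random.gauss(0.0, 1.0) for _ in range(200)]
signal = [random.gauss(3.0, 1.0) for _ in range(200)]
xs = background + signal
ys = [0] * 200 + [1] * 200

b, w = train_logistic(xs, ys)
p_low = sigmoid(b + w * 0.0)   # probability for a background-like event
p_high = sigmoid(b + w * 3.0)  # probability for a signal-like event
```

Swap the labels — particle decay or not, customer purchase or not — and the mechanics are unchanged, which is the point Schaack is making.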
While the underlying algorithm may be the same, Schaack finds the use of real-life data, instead of theoretical models, and the pace of the business environment, where projects can go from proof of concept to production in months, strikingly different from peer-reviewed academia.
Yet one area where people still come unstuck when attempting predictive analytics in business is with data quantity and quality, something that is less of an issue at the Large Hadron Collider, which generates about 30 petabytes of data annually.
"We always need historical data to try to train our model on. If they just plan on taking certain data in the future or have data for one month, or maybe they started taking data properly some time ago but they don't have enough history, then we have a problem because we usually need to capture seasonal effects. So we need at least two years of data," he said.
"The way we work is we cross-validate our model with historical data where we know the truth. If they don't have enough data points, we can't train our data model accurately and therefore cannot predict future events."
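The cross-validation Schaack describes amounts to backtesting: hold out a stretch of history where the outcome is already known, forecast it, and measure the error. A minimal sketch, using hypothetical monthly sales figures for two years (not real customer data) and the simplest possible seasonal model:

```python
# Hypothetical monthly sales for two consecutive years.
year1 = [120, 110, 130, 150, 170, 200, 210, 205, 180, 160, 140, 260]
year2 = [125, 112, 128, 155, 175, 198, 215, 210, 185, 158, 145, 270]

# "Train" a seasonal model on the first year: take each month's value as the
# seasonal expectation. With more history you would average across years,
# which is why a single month of data is not enough.
forecast = year1

# Validate against the held-out second year, where the truth is known.
errors = [abs(f, ) if False else abs(f - actual) for f, actual in zip(forecast, year2)]
mae = sum(errors) / len(errors)  # mean absolute error of the backtest
```

If the backtest error is too large, or there is too little history to run one at all, the model cannot be trusted on genuinely unseen months — which is the problem Schaack describes with customers who have only a month or two of data.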
Another issue is people wanting predictions on an impossibly detailed level.
"Maybe they want a prediction per day but then they only have a few sales every week. So it's a bit pointless because the fit is not really there and the number of data points is too few," Schaack said.
The second big factor is data quality: ideally, as many of the data points as possible should be quantitative or cleanly categorised.
"But of course as soon as you have human text or any sort of sentiment analysis required, the accuracy of that is not as good as numeric values," he said.
A common misconception about predictive analytics is the potential for making accurate predictions about far-off events.
"People think it must be possible to make those really long-term forecasts - predicting nine months or 12 months into the future. Obviously you can make those sorts of predictions but they come with an uncertainty and the further into the future you go, the greater that becomes," he said.
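The widening uncertainty Schaack mentions can be made concrete with a textbook example. For a random-walk-style forecast, the standard error grows with the square root of the horizon, so a 12-month-ahead interval is roughly 3.5 times wider than a one-month-ahead one. The volatility figure and sales level below are assumptions for illustration only:

```python
import math

sigma = 10.0        # assumed month-to-month volatility, in sales units
last_value = 500.0  # assumed most recent observed sales level

def forecast_interval(horizon_months, z=1.96):
    """95% interval for a random-walk forecast; width grows as sqrt(horizon)."""
    half_width = z * sigma * math.sqrt(horizon_months)
    return last_value - half_width, last_value + half_width

one_month = forecast_interval(1)   # a fairly tight band
one_year = forecast_interval(12)   # the same model, much less certain
```

A point forecast a year out is easy to produce; an honest interval around it, as the sketch shows, may be too wide to act on.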
"Often it's, 'I want to know the Christmas sales now'. We tell them if we have three Christmas seasons recorded, maybe a month before Christmas we can talk about making reliable predictions."
Those types of misconceptions make the one- or two-day workshops conducted at the beginning of a new business project particularly important.
"They know the data better than anyone else because they've been working for a particular company for maybe a few years so they have that domain knowledge, which they might not even realise they have," he said.
"That interaction is very important. At the beginning I'm trying to identify a good use case because not every data project is necessarily suited to us. Maybe people are more interested in standard analytics than predictive."
Once the potential value of a predictive analytics project has been identified, the work shifts to a proof-of-concept stage and ultimately to blind testing against the customer's existing methods to demonstrate the accuracy of Blue Yonder's technology and create confidence.
"In most cases people know they need to do something and they don't know quite what it is yet. But they have quite a lot of data and they feel under pressure to do something because everyone is talking about big data."