
Democratising machine learning; ask the right questions

Written by Simon Bisson, Contributor, and Mary Branscombe, Contributor

Big data - turning masses of data into useful information - is by definition too big to handle one record at a time. You don't care what speed five drivers are doing around the M25; you care what speed 5,000 drivers are doing, and whether the 5,000 who drove the same route yesterday, on the same day last week and on the same day last month went faster or slower, so you know whether there's something unusual about the traffic. Get the average speeds on weekdays, at weekends, in school holidays and on days when there's a big football match, and you can do some interesting things, like giving a much more accurate prediction of how long a journey is going to take on a specific day at a specific time - when you're driving to the airport, say. Putting all that data together and making rules from it is what machine learning is ideal for.
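
To make the traffic comparison concrete, here is a minimal sketch in Python of the simplest version of the idea: compare today's average speed with the averages from comparable days in the past and flag it when it sits unusually far from them. The function name, the two-standard-deviation threshold and the speed figures are all assumptions made up for illustration; a real system would learn from far more data and more factors (holidays, football matches and so on).

```python
from statistics import mean, stdev

def is_unusual(todays_speeds, historical_daily_means, threshold=2.0):
    """Flag today's traffic as unusual when its mean speed lies more than
    `threshold` standard deviations from the historical daily means."""
    today = mean(todays_speeds)
    baseline = mean(historical_daily_means)
    spread = stdev(historical_daily_means)
    if spread == 0:
        # No variation in the history: any difference at all is unusual.
        return today != baseline
    return abs(today - baseline) / spread > threshold

# Made-up figures in mph, for illustration only.
history = [52.0, 55.0, 50.0, 53.0, 54.0, 51.0]   # same route, earlier days
slow_day = [30.0, 35.0, 28.0, 33.0]              # today's drivers, jammed
normal_day = [52.0, 54.0, 51.0, 53.0]            # today's drivers, typical

print(is_unusual(slow_day, history))
print(is_unusual(normal_day, history))
```

With these invented numbers the jammed day is flagged and the typical day is not; the point is only that the comparison is against aggregates, never against individual drivers.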

But while it's getting easier to get large amounts of data - any company can buy streams of Twitter data, for example - knowing how to build the model that turns that data into information (like 'are people upset with our company today?') is more difficult. The expensive part of modelling with traditional tools like SAS isn't buying the software; it's employing the PhD in model theory who can choose the right machine learning system, train it and derive a model that actually matches the data.

That's where Google's Prediction API comes in; the service graduates from Google Labs today and is now a live service, with an SLA. What it gives you is cloud-hosted machine learning using some of the same techniques Google uses itself for things like matching ads to your Gmail messages. As product manager Travis Green explains it, "We take examples from the past, say pieces of text and how positive or negative they are and we apply many machine learning algorithms to find the best model that finds the patterns in the data, so that when you give it a new piece of data - a new piece of text - the system will tell you how positive or negative it is."
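
The learn-from-labelled-examples pattern Green describes can be sketched locally in a few lines of Python. This is not the Prediction API and makes no claim about how Google builds its models: it is a small naive Bayes classifier, swapped in purely as a stand-in, and the `train` and `predict` functions and the four training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words don't zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented examples "from the past": text plus how it was judged.
training = [
    ("love this great service", "positive"),
    ("great support very happy", "positive"),
    ("terrible service very angry", "negative"),
    ("hate the awful delays", "negative"),
]
model = train(training)
print(predict(model, "happy with great support"))
print(predict(model, "awful delays angry"))
```

The hosted service's selling point is that it tries many algorithms and picks the best-fitting model for you; the sketch fixes on one simple algorithm just to show the shape of the workflow: examples in, model out, new text scored.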

Making something as specialised as machine learning this widely available could be really exciting; it could open the technique up to developers who would never have been able to use it before. The API doesn't mean that every application is going to become instantly smart, of course; this is still hard programming. For some of the simpler common problems, like categorising and ranking information (and analysing the sentiment of messages), Google will have a gallery of predictive models you can choose from (and there's going to be a marketplace for buying and selling more models from modelling experts). It's important to pick the right model; get it wrong and it's worse than not modelling at all. And you have to keep the model up to date as circumstances change (the API now lets you stream in data so it can update the model). Not knowing the details of the models you're using means you have to be extra careful that you aren't getting too many false positives or plain wrong answers; if the model is critical to your business, you need to pay for an expert to work on it.
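
One way to keep an eye on false positives without understanding a model's internals is to hold back some examples whose answers you already know and compare them with what the model says. The Python sketch below assumes you have two parallel lists, the known labels and the model's predictions; the function name and the six sample labels are made up for illustration.

```python
def false_positive_report(actual, predicted, positive="positive"):
    """Summarise how often a model's 'positive' answers are wrong."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1          # correctly flagged
        elif guess == positive:
            fp += 1          # flagged, but should not have been
        elif truth == positive:
            fn += 1          # missed
        else:
            tn += 1          # correctly left alone
    total = tp + fp + tn + fn
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Invented held-back labels and model answers, for illustration only.
actual    = ["positive", "positive", "negative", "negative", "negative", "positive"]
predicted = ["positive", "negative", "positive", "negative", "negative", "positive"]
print(false_positive_report(actual, predicted))
```

Rerunning a check like this whenever the model is updated with new data is a cheap way to notice when it has drifted away from reality.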

But most importantly, you have to be asking the right questions. The way to do that is to aggregate your data, normalise it - and then dig into it, according to Monica Rogati, a senior research scientist at LinkedIn. LinkedIn has a lot of information about careers and promotions, so she aggregated that by country - say the US and India - and then looked at when people tend to get promoted, counting month by month the changes in job title that LinkedIn's machine learning system classes as a move to a better job at the same company. It turns out that in India there are several different months with higher percentages of promotions than usual; in the US there's a definite spike in January. Normalising the data to find out what stands out - what's happening less or more often in that specific data than in the standard population - showed that there are more promotions in the US in January but that the figure is going down (from 22% to 16% over a decade). So she dug into the data, slicing it up in different ways to look for patterns, and found that the change tracks the age of the employee. Older employees wait for the annual appraisal and the new year; younger workers are more impatient. "They don't care what month of the year it is; they want their promotion."
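
The aggregate-then-normalise step can be sketched in Python as follows. The idea is to count events per month for one group, turn the counts into shares, and divide each share by the same month's share in the wider population, so that anything well above 1.0 stands out. The function names and every number here are invented for illustration; they are not LinkedIn's data or method.

```python
from collections import Counter

def monthly_share(months):
    """Aggregate: fraction of all events that fall in each month."""
    counts = Counter(months)
    total = sum(counts.values())
    return {month: counts[month] / total for month in counts}

def lift(group_months, population_months):
    """Normalise: a group's monthly share relative to the population's.
    Values above 1.0 mean the month is over-represented in the group."""
    group = monthly_share(group_months)
    population = monthly_share(population_months)
    return {m: group.get(m, 0.0) / population[m] for m in population}

# Made-up promotion months, one entry per promotion.
population = ["Jan"] * 20 + ["Apr"] * 40 + ["Jul"] * 40
one_group = ["Jan"] * 4 + ["Apr"] * 3 + ["Jul"] * 3

for month, value in lift(one_group, population).items():
    print(month, round(value, 2))
```

In this toy data January is twice as common in the group as in the population, which is the kind of signal that prompts the next step: slicing the group further (by age, say) to see what explains it.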

If you're in HR, or you want to hire someone away from their current company, knowing that people expect to be promoted more often could be really valuable. Most of the patterns machine learning finds aren't that obvious; if they were easy to find, you wouldn't need machine learning to go looking for them. But the fact that you can start asking these kinds of questions, using the same machine learning that powers Google services like translation and voice transcription - or the smart playlists in the new Google Music service - means smaller companies can try out some of the techniques that have made Google and Amazon hugely successful.

Mary Branscombe
