Machine learning is the next frontier in Big Data innovation. And the cloud is the next frontier within that frontier.
Almost five years ago, Google launched its Prediction API cloud-based machine learning service. This past July, Microsoft launched its Azure Machine Learning (Azure ML) service as a preview, and brought it into general availability in February. That service had (and has) surprisingly good integration with code written in the open source R programming language.
These were interesting opening moves, but the plot thickened when Microsoft announced in January its intention to acquire Revolution Analytics, the dominant commercial entity contributing to the open source R project and the developer of a compelling distributed computing version of R (called Revolution R) that integrates especially well with Hadoop.
But this past week, the gloves really came off. On Monday, Microsoft announced it had completed its acquisition of Revolution Analytics. And on Thursday, Amazon announced, at long last, the release of its own cloud machine learning service: Amazon Machine Learning (Amazon ML). Now the three-horse race has begun.
Three of a perfect pair
Which one should you use? They are all good offerings; using any of them will be highly advantageous if you aren't implementing predictive analytics at all yet, and odds are you're not.
All three cloud providers claim their machine learning services are based on the same technologies that they have used internally. Amazon, of course, has used predictive analytics in its ecommerce businesses almost since the beginning. Google uses predictive logic in its core search services. Microsoft does likewise with Bing, Xbox and other services.
In terms of data sources, all three major public cloud providers have connected their core storage services, and one or more of their database products, to the machine learning offerings:
- Amazon Machine Learning connects to S3, Redshift and the MySQL flavor of its Relational Database Service (RDS).
- Google Prediction API can read data from Google Cloud Storage and BigQuery.
- Azure ML supports Microsoft's Table and Blob storage services as data sources, as well as SQL Database, Hive tables in Hadoop, and OData feeds or flat files reachable at a valid Internet URL.
Once data is read in, all three providers support the building of predictive models on it. They also provide APIs for developers to send input variable values and receive a predicted value for the target variable. The attraction of putting this all in the cloud is that any client application can run a prediction by making a single web service call.
For example, you might build a model correlating demographic data points like gender, income, age, and profession to the likelihood of purchasing a specific item. With any of the public cloud providers' machine learning services, you could make a web service call that supplies the demographic data as input parameters and returns a prediction (a yes or no, indicating whether a purchase is likely).
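To make the single-call pattern concrete, here is a minimal Python sketch of the demographic example above. It is not any provider's actual API: the endpoint URL, the `inputs` and `prediction` JSON field names, and the function names are all hypothetical, and real services add their own authentication and request formats.

```python
import json
import urllib.request

# Hypothetical endpoint: each provider has its own URL scheme and authentication.
ENDPOINT = "https://ml.example.com/models/purchase-likelihood/predict"


def build_request(features: dict, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Package the input variables as a JSON POST request."""
    # The {"inputs": ...} envelope is an assumed request shape.
    body = json.dumps({"inputs": features}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw: bytes) -> bool:
    """Read the assumed {"prediction": "yes" | "no"} response shape."""
    payload = json.loads(raw.decode("utf-8"))
    return payload["prediction"] == "yes"


def predict(features: dict, endpoint: str = ENDPOINT) -> bool:
    """Make the single web service call and return the yes/no prediction."""
    with urllib.request.urlopen(build_request(features, endpoint)) as response:
        return parse_prediction(response.read())
```

A client would call something like `predict({"gender": "F", "income": 72000, "age": 34, "profession": "engineer"})` and branch on the boolean result. Keeping request building and response parsing separate from the network call means those two pieces can be checked without a live service.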
The services do differ, though. For example:
- Google Prediction API, true to its name, is developer-oriented and provides no user interface (UI).
- Amazon ML provides only a single (and rather opaque) algorithm with which models can be built. Microsoft and Google provide a selection of algorithms, and Microsoft allows R and Python code, and packages, to be used as well.
- Microsoft lets you build a full-fledged, flowchart-style data flow in its UI. Amazon only allows for the specification of an input data set, and the selection of input variables and the target variable from the data set's schema.
- Amazon ML, though it may impose algorithmic and UI restrictions, is wizard-driven and very easy to use.
So which of the Big Three will emerge victorious?
Amazon has an advantage of incumbency given its connectivity to S3 and the amount of data stored there by many companies. Google, meanwhile, has the reputational advantage of using machine learning most innovatively in its own core businesses.
Microsoft's service is probably the most sophisticated, flexible and well connected: it's the only one to offer integration with Hadoop and with Microsoft's own data integration service, Azure Data Factory. And if Microsoft can do a good job meshing Revolution Analytics' technology into Azure ML (something which shouldn't be taken for granted), it could be a real juggernaut.
In the end, customers will likely use the machine learning service native to the cloud they use most often for other data processing tasks. If any service wishes to transcend that lock-in dynamic, it will need to integrate machine learning within a broader analytics offering.
For Microsoft, that means tying Azure ML into Power BI. Google would do well to extend its Prediction API-BigQuery integration to be reciprocal, so that queries against BigQuery could reference Prediction API models through SQL JOINs. Amazon would have a trickier time, needing to work with third-party BI providers (like Jaspersoft, Logi Analytics and Tableau) that can run on EC2. But the payoff would be big.
This will take a while to sort out. Once it is sorted out, though, machine learning will finally have some mainstream chops, and business competitiveness could change dramatically.
Special thanks to Jen Stirrup for her logistical support and research contributions