Machine Learning: The cloud is the new battlefield

There's a reason why every major cloud provider is backing its own machine learning framework. What is that reason, and how do you choose?

More than a convergence of buzzwords, machine learning in the cloud is very much real. Why is this happening, who are the players and what are their strategies, and how do you choose the right technology and vendor for utilizing machine learning (ML)?

ZDNet discussed these questions with Chris Nicholson, a deep learning (DL) expert with a deep knowledge of and interest in the field. Nicholson is one of the masterminds of Deeplearning4J and the CEO of Skymind. Deeplearning4J is an open source library for DL using Java, and Skymind is a company set up to expand and commercialize Deeplearning4J.

Challenging AWS dominance via Machine Learning

It's interesting to note that the DL framework getting the most traction at the moment is backed by a vendor throwing its weight behind its cloud: Google's TensorFlow. With Amazon still having the lion's share in public cloud use, Nicholson believes TensorFlow is part of Google's strategy to get a bigger share of the pie.

Nicholson says ML is a relatively new workload for the data center, but one that will be increasingly important: "It's computationally intensive and if you're charging per time of use, it's a cash cow." Hiring VMware's Diane Greene signaled Google's intention to turn its cloud from immaterial to a source of revenue, and Nicholson believes TensorFlow is a Trojan horse for Google's cloud:

Google did a great job with TensorFlow, and it's very popular. They are also trying to remove their dependency on Nvidia's GPUs by introducing their own TPU chips. However, there are also flaws in their strategy -- most notably, lack of experience with enterprise clients.

Their bottom-up approach seems to be working, because people are using TensorFlow. But TensorFlow is maybe 5 percent of what you need to have a full AI solution. There's also a big data stack that you need, and Google has its own proprietary stack there with BigTable and BigQuery and so on. Not everyone will want to use those.

Furthermore, TensorFlow is in fact tightly coupled to Google's cloud -- it works better there than anywhere else. Sure, if AWS decided to make TensorFlow work better there they could throw resources at it and do it, but they also have their own horse in the race with MXNet. Plus, Google is used to dealing with consumers; they are not good at the kind of hand-holding enterprises are used to.

Nicholson says he does not see Google signing SLAs for on-premise deployments, as this is not part of its strategy: Google is just trying to pull people into its cloud, which not everyone will want. He believes Microsoft is getting more enterprise traction with CNTK, partly due to its footprint and tradition of working in this space, but the end goal of luring users to Azure is the same.

Then there are a few odd ones out, most notably Deeplearning4J, PyTorch and Caffe. The latter two have come out of Facebook, and Nicholson sees two possible explanations for why a vendor that is not in the cloud race would want to invest so heavily in DL.

One, as a talent acquisition tool. Nicholson says that having an open source framework out there brings talent directly to vendors -- "People essentially onboard themselves, you just need to make them an offer. It's worked like that for us."

Two, as a way to bundle services on top of an API: "I bet anyone $100 Google will do that with apps on top of TensorFlow just as they did with Maps on top of Android for example, and I guess Facebook wants to go there too," he says.

What's a deep learning framework good for, and what's a good deep learning framework?

The DL framework space is rather fragmented at this time, and efforts for interoperability are hampered by each vendor looking out for its own interest. Case in point -- Keras. Keras is a sort of meta-API for DL, created with the goal of being a bridge among different frameworks.

Nicholson says there was a time about a year ago when it looked like Keras was going to be the de facto standard high-level API for DL, and Skymind was on the verge of building its offering around it, but that did not happen. Why? Because other vendors won't let that happen:

I wish Keras would work as a high level portable meta-API, but I don't see it happening. Amazon now has Gluon, its own version of Keras. My interpretation is that AWS is doing that to keep users locked in -- you can think of it in similar terms to what Microsoft did with C# as a response to Java.

Then there is also ONNX, which is an interoperability layer between PyTorch, Caffe2 and CNTK -- Facebook and Microsoft trying to create competition for Google. Keras was created by a Google engineer, and although I do believe his stated intention of keeping it neutral, it is bundled with TensorFlow. TensorFlow is low-level and needs something like Keras for ease of use.


With so many deep learning frameworks out there, how do you choose one? Image: Skymind

So, how does one evaluate DL frameworks?

For organizations that are heavily invested in a cloud vendor and do not mind the lock-in, going with that vendor makes sense. There is a sort of MLaaS (Machine Learning as a Service) category evolving. Early versions of this such as BigML had the data gravity issue to deal with -- moving data in and out of the cloud is hard. But when your data is already in the cloud, that's not a problem, and you're likely to move to IPaaS.

Performance is an obvious way to compare DL frameworks, but the way to do this is not so obvious. DL has both a training phase and an operational phase, and although Caffe2's creators, for example, say performance does not vary much across frameworks, Nicholson objects: "There are differences. For example, TensorFlow works great on Google cloud, but is notably slow on GPUs. CNTK was originally built for NLP sequential tasks and is much faster in that area."

Still, for Nicholson this is not the number one criterion. What is? Unsurprisingly, real-world applicability -- but what does that translate to?

We go to the same events with the people that built for example TensorFlow or PyTorch, and I've asked them point blank what their goal was. Their answer was that they built these frameworks to facilitate the work of ML researchers. In that respect they are great, they help you train better models, but they often fail in other aspects.

Deeplearning4J and Skymind

This is where Deeplearning4J comes in, of course. Nicholson says their goal in building Deeplearning4J was to address the large number of developers, enterprise and otherwise, who are not well-versed in the languages used in the data science world, typically R and Python:

Java has a huge user base and is the backbone of enterprise IT. Organizations are trying to use ML to know more about their data and to transform. There is just not enough R and Python expertise out there to cover the demand, and many are turning to existing Java expertise.

Much of the data engineering used to build pipelines or deploy applications is done in Java anyway, and adding, for example, an ML model to do predictions should not be so hard. But it gets complex because it's linguistically fragmented -- what we need is a cross-team solution that works from exploration to production.

Although building a framework and a company centered around a programming language may seem a strange way to aim for mass adoption, there seems to be some basis to that reasoning. Nicholson says they have built a tensor library like NumPy for Java, aiming to create a parallel ecosystem utilizing specialized native processing libraries: "We felt there were things in Python that work better and the JVM needed to have."


Skymind aims to offer an end-to-end solution for running machine learning applications. Image: Skymind

Nicholson says that many tools will produce a blob of Python and C code, and throwing this over the wall to deploy in production is not easy. He also says he is not a fan of using Hadoop or even Spark for ML, despite the fact that data resides there and that Skymind's stack integrates with these platforms: "It's not simple to be a ML company, but as opposed to these vendors this is all we do."

Nicholson says Skymind wants to be an open core company that is the Red Hat of ML. In addition to services, Skymind has its own execution environment -- a scale-out, high throughput server for AI as he calls it, which can run both in the cloud and on-premise.

Skymind offers solutions for collaboration and production deployment as well as SLAs and custom closed source models, and boasts users from the likes of IBM and Apple.

In terms of creating model marketplaces, Nicholson notes that by applying something called transfer learning, a successful model that has been trained on ImageNet to identify dogs, for example, can be retrained relatively easily to identify other things.
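To make the idea of transfer learning concrete, here is a minimal sketch in plain Python. It is not how Skymind or any particular framework implements it: the "pretrained" weights, the toy data and all function names are made up for illustration. The point it shows is that the existing layers stay frozen while only a small new output layer is trained for the new task.

```python
import math

# Hypothetical stand-in for layers already trained on a large dataset.
# These weights are frozen: nothing below ever modifies them.
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen feature extractor: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, epochs=500, lr=0.5):
    """Train only a new logistic-regression head on top of the frozen features."""
    head = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, label in samples:
            feats = extract_features(x)
            pred = sigmoid(sum(h * f for h, f in zip(head, feats)) + bias)
            err = pred - label  # gradient of the log loss w.r.t. the logit
            head = [h - lr * err * f for h, f in zip(head, feats)]
            bias -= lr * err
    return head, bias


def predict(x, head, bias):
    feats = extract_features(x)
    return sigmoid(sum(h * f for h, f in zip(head, feats)) + bias) > 0.5


# Toy "new task": a handful of labeled examples the original model never saw.
NEW_TASK = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]

head, bias = train_head(NEW_TASK)
print([predict(x, head, bias) for x, _ in NEW_TASK])
```

In a real setting the frozen part would be a deep network with millions of parameters, which is why retraining only the head is so much cheaper than training from scratch.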

The other avenue to explore to speed up the application of ML is automating the training part. Models need the right hyper-parameters to learn well, and Nicholson notes that finding them can be automated, as it is essentially a search problem. Google, for example, is already doing that with AutoML. While that may require lots of resources, "in five years from now it will be as cheap as storing data is today."
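Treating hyper-parameter tuning as a search problem can be sketched in a few lines of plain Python. The example below uses random search, one of the simplest strategies; it is an illustration only and not a description of how AutoML works. The toy model, the search ranges and the function names are all assumptions made for the sketch.

```python
import random


def train(learning_rate, steps, data):
    """Fit y = w * x by gradient descent and return the final mean squared error."""
    w = 0.0
    n = len(data)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in data) / n


def random_search(data, trials=20, seed=0):
    """Sample hyper-parameters at random and keep the setting with the lowest loss."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        params = {
            # Learning rate sampled on a log scale between 1e-4 and 1e-1 (assumed range).
            "learning_rate": 10 ** rng.uniform(-4, -1),
            # Number of training steps between 10 and 200 (assumed range).
            "steps": rng.randint(10, 200),
        }
        loss = train(params["learning_rate"], params["steps"], data)
        results.append((loss, params))
    best_loss, best_params = min(results, key=lambda r: r[0])
    return best_loss, best_params, results


# Toy data generated from y = 3x, so a good run should drive the loss toward zero.
DATA = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

best_loss, best_params, results = random_search(DATA)
print(best_params, best_loss)
```

Each trial here takes a fraction of a second. With a real deep learning model every trial is a full training run, which is where the resource cost Nicholson mentions comes from.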
