AWS's Amazon SageMaker, a set of tools for building and deploying machine learning models, is not only spreading through many companies; it is becoming a key tool for some of the most demanding machine learning practitioners, says one of the executives in charge of it.
"We are seeing very, very sophisticated practitioners moving to SageMaker because we take care of the infrastructure, and so it makes them an order of magnitude more productive," said Bratin Saha, AWS's vice president in charge of machine learning and engines.
Saha spoke with ZDNet during the third week of AWS's annual re:Invent conference, which this year was held virtually because of the pandemic.
The benefits of SageMaker have to do with all the details of how to stage training tasks and deploy inference tasks across a variety of infrastructure.
SageMaker, introduced in 2017, can automate a lot of the grunt work that goes into setting up and running such tasks.
While SageMaker might seem like something that automates machine learning for people who don't know how to do the basics, Saha told ZDNet that even experienced machine learning scientists find value in speeding up the routine tasks in a program's development.
"What they had to do up till now is spin up a cluster, make sure that the cluster was well utilized, spend a lot of time checking, as the model is deployed, am I getting traffic spikes," said Saha, describing the deployment tasks a machine learning data scientist traditionally had to carry out. That workflow extends from initially gathering data, to labeling it (in the case of supervised training), to refining the model architecture, to deploying trained models for inference, and then to monitoring and maintaining those models for as long as they run live.
"You don't have to do any of that now," said Saha. "SageMaker gives you training that is serverless, in the sense that your billing starts when your model starts training, and stops when your model stops training."
Added Saha, "In addition, it works with spot instances in a very transparent way; you don't have to say, Hey, have my spot instances been pre-empted, is my job getting killed, SageMaker takes care of all of that." Such effective staging of jobs can reduce costs by ninety percent, Saha contends.
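The managed spot training Saha describes is exposed through SageMaker's Python SDK. As an illustrative sketch only (the image URI, IAM role, and S3 paths below are placeholders, not real resources), a spot-backed training job might be configured like this:

```python
# Sketch: configuring SageMaker managed spot training.
# The image URI, role ARN, and S3 paths are placeholders -- substitute
# your own account's values before running this against AWS.

training_config = {
    "image_uri": "<your-training-image-uri>",          # placeholder
    "role": "arn:aws:iam::123456789012:role/Example",  # placeholder
    "instance_count": 1,
    "instance_type": "ml.p3.2xlarge",
    # Managed spot training: you are billed only for the seconds the job
    # actually trains, and SageMaker transparently handles spot
    # interruption and retry -- the "is my job getting killed" checks
    # Saha mentions.
    "use_spot_instances": True,
    "max_run": 3600,    # cap on actual training seconds
    "max_wait": 7200,   # cap on training time plus waiting for spot capacity
    "output_path": "s3://<your-bucket>/model-artifacts",  # placeholder
}

# With the SageMaker Python SDK installed and AWS credentials configured,
# the job would be launched roughly like so:
#
#   from sagemaker.estimator import Estimator
#   estimator = Estimator(**training_config)
#   estimator.fit({"train": "s3://<your-bucket>/train-data"})
```

The `max_wait` ceiling must be at least `max_run`; the gap between them is how long SageMaker may wait for spot capacity before the job times out.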
Saha said that customers such as Lyft and Intuit, despite having machine learning capabilities of their own, are increasingly adopting the software to streamline their production systems.
"We have some of the most sophisticated customers working on SageMaker," said Saha.
"Look at Lyft, they are standardizing their training on SageMaker, their training times have come down from several days to a few hours," said Saha. "Mobileye is using SageMaker training," he said, referring to the autonomous vehicle chip unit within Intel. "Intuit has been able to reduce their training time from six months to a few days." Other customers include the NFL, JPMorgan Chase, and Georgia Pacific, Saha noted.
Amazon itself has moved its internal AI work to SageMaker, he said. "Amazon.com has invested in machine learning for more than twenty years, and they are moving on to SageMaker, and we have very sophisticated machine learning going on at Amazon.com." As one example, Amazon's Alexa voice assistant uses SageMaker Neo, an optimization tool that compiles trained models into a binary program with settings that make the model run most efficiently for inference.
There are numerous other parts of SageMaker, such as pre-built containers with select machine learning algorithms; a "Feature Store," where one can store and retrieve the attributes used in training; and Data Wrangler, which creates model features from raw training data.
AWS has been steadily adding to the tool set.
During his AWS re:Invent keynote two weeks ago, Amazon's vice president of machine learning, Swami Sivasubramanian, announced that SageMaker can now automatically break up a large neural net and distribute its parts across multiple computers. This form of parallel computing, known as model parallelism, usually takes substantial engineering effort.
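SageMaker's model-parallel library is configured through its SDK, but the underlying idea can be sketched in plain Python. The toy below is not the actual library: it just shows the core pattern, where each "device" holds only its own layer's weights and passes activations to the next device, so no single machine needs the whole network in memory.

```python
# Toy illustration of model parallelism (not SageMaker's library itself):
# a 2-layer network is partitioned layer-by-layer across two simulated
# devices, and activations flow from one device to the next.

def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

class Device:
    """Simulated accelerator holding one partition (layer) of the model."""
    def __init__(self, name, weights):
        self.name = name
        self.weights = weights  # only this shard lives on this device

    def forward(self, activations):
        return matvec(self.weights, activations)

# The network's two layers live on two different devices.
device_0 = Device("gpu:0", [[1.0, 0.0], [0.0, 1.0]])  # layer 1 (identity)
device_1 = Device("gpu:1", [[2.0, 0.0], [0.0, 2.0]])  # layer 2 (doubling)

x = [3.0, 4.0]
h = device_0.forward(x)  # computed on device 0
y = device_1.forward(h)  # activations cross to device 1
print(y)  # → [6.0, 8.0]
```

The manual work SageMaker's announcement targets is exactly this partitioning: deciding which layers go on which device and orchestrating the hand-off of activations between them.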
Amazon was able to reduce neural network training time by forty percent, said Sivasubramanian, for very large deep learning networks such as "T5," a Transformer-based natural language processing model from Google.