Google announces Cloud TPU virtual machines for AI workloads

The general availability of Cloud TPU VMs means users no longer have to remotely access Cloud TPU.
Written by Aimee Chanthadavong, Contributor

Google Cloud has announced the general availability of TPU virtual machines (VMs) for artificial intelligence workloads.

The general availability release includes a new TPU embedding API, which Google Cloud claims can accelerate large-scale, ML-based ranking and recommendation workloads.

Google Cloud said embedding acceleration with Cloud TPU can help businesses lower the costs associated with ranking and recommendation use cases, which commonly rely on deep neural network-based algorithms that can be costly to run.

"They tend to use large amounts of data and can be difficult and expensive to train and deploy with traditional ML infrastructure," Google Cloud said in a blog post.

"Embedding acceleration with Cloud TPU can solve this problem at a lower cost. Embedding APIs can efficiently handle large amounts of data, such as embedding tables, by automatically sharding across hundreds of Cloud TPU chips in a pod, all connected to one another via the custom-built interconnect."
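To make the idea of sharding an embedding table more concrete, here is a minimal sketch in plain Python. It is not the Cloud TPU embedding API and does not run on TPU hardware. The function names (build_table, shard_table, lookup) and the modulo placement rule are assumptions chosen for illustration, and the article does not say how Google actually places rows on chips.

```python
# Illustrative only: row-wise sharding of an embedding table across "devices".
# This is NOT the Cloud TPU embedding API; all names here are hypothetical.
import random


def build_table(num_rows, dim, seed=0):
    """Create a toy embedding table: one vector of length `dim` per row."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def shard_table(table, num_shards):
    """Assign row i to shard i % num_shards (assumed modulo placement)."""
    shards = [dict() for _ in range(num_shards)]
    for row_id, vector in enumerate(table):
        shards[row_id % num_shards][row_id] = vector
    return shards


def lookup(shards, row_ids):
    """Fetch each requested row from the shard that owns it."""
    num_shards = len(shards)
    return [shards[r % num_shards][r] for r in row_ids]


table = build_table(num_rows=10, dim=4)
shards = shard_table(table, num_shards=4)
vectors = lookup(shards, [0, 5, 9])
```

Each shard holds only a fraction of the rows, so no single device has to store the whole table. A lookup is routed to whichever shard owns the requested row. In a real pod the shards would live on separate chips and the routing would travel over the interconnect.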

At the same time, the TPU VMs have been designed to support three major frameworks -- TensorFlow, PyTorch, and JAX -- each offered through its own environment for ease of setup.

Google Cloud added that the TPU VMs enable input data pipelines to be executed directly on the TPU hosts. With this capability, users can build their own custom ops, such as TensorFlow Text, so they are no longer bound to TensorFlow runtime release versions.

Local execution on the host with the accelerator also enables use cases such as distributed reinforcement learning.

"With Cloud TPU VMs you can work interactively on the same hosts where the physical TPU hardware is attached," Google Cloud said.

"Our rapidly growing TPU user community has enthusiastically adopted this access mechanism, because it not only makes it possible to have a better debugging experience, but it also enables certain training setups such as distributed reinforcement learning which were not feasible with TPU Node (networks accessed) architecture."
