Deep neural networks (DNNs) are powering the machine learning revolution behind autonomous vehicles and many other real-time data analysis tasks. The two most popular DNN types are convolutional networks - for feature recognition - and recurrent networks - for time-series analysis.
DNNs need to be trained on massive tagged datasets to develop a model - essentially matrices of feature weights - that can then run on local hardware. When a trained network classifies inputs or estimates values, the process is called inference.
As the needed analytics become more complex, the computational requirements skyrocket. Simply recognizing a handwritten digit takes almost 700,000 arithmetic operations per digit, while recognizing a single object in an image takes billions. In a self-driving car navigating among other cars, street signs, pedestrians, and bicycles, the computational requirements are immense.
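To see where numbers like that come from, here is a sketch of how operation counts are tallied for a small fully connected digit classifier. The layer widths below are illustrative assumptions, not the exact network behind the ~700,000 figure:

```python
# Count arithmetic operations in a small fully connected classifier.
# Layer widths are hypothetical: a 28x28 input image -> 300 -> 100 -> 10 digits.
def op_count(layer_widths):
    """Each connection performs one multiply and one add (2 ops)."""
    macs = sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))
    return 2 * macs  # one multiply-accumulate = 2 arithmetic operations

widths = [28 * 28, 300, 100, 10]
print(op_count(widths))  # 532400 -- over half a million ops for one digit
```

Even this toy network lands in the same ballpark, and every extra layer or wider image multiplies the count.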
For the last decade the go-to hardware for DNNs has been multi-core CPUs and GPUs, but as we've gotten smarter about DNN models, it's become clear that new architectures are needed. For example, matrix multiplications and convolutions are the central operations in DNNs. Unlike standard workloads, the data movements and operations are well understood in advance, so the hardware CPUs need to queue up data and handle unpredictable instruction streams isn't needed.
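That regularity is easy to see in the core operation itself. In this minimal matrix-multiply sketch, every memory access follows a fixed pattern determined entirely by the matrix shapes, known before execution begins - there is no data-dependent branching for hardware to predict:

```python
# Naive matrix multiply: the access pattern (rows of A, columns of B)
# is fully determined by the shapes, with no data-dependent branches.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```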
Standard multi-level caches aren't needed either, so on-chip memory can be simpler. DNNs also simulate neurons, so specialized hardware to support neuron activation functions is helpful. And since real-time inference is a common requirement, using 8-bit quantities - saving chip real estate - allows many more arithmetic units for maximum parallel processing.
Google's Tensor Processing Unit (TPU), for example, is optimized for power and area efficiency for matrix math. Its Matrix Multiplication Unit packs 65,536 arithmetic logic units and, clocked at 700MHz, can process 92 trillion 8-bit operations per second.
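That 92 trillion figure follows directly from the unit count: the 256x256 array holds 65,536 multiply-accumulate units, each performing two arithmetic operations (one multiply, one add) per clock cycle. A quick back-of-the-envelope check:

```python
# Peak throughput of the TPU's Matrix Multiplication Unit.
alus = 256 * 256       # 65,536 multiply-accumulate units
ops_per_cycle = 2      # each MAC = one multiply + one add
clock_hz = 700e6       # 700 MHz
peak = alus * ops_per_cycle * clock_hz
print(f"{peak / 1e12:.2f} trillion ops/sec")  # 91.75 -- rounds to 92
```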
As Google states in their TPU paper, their design does not have:
. . . caches, branch prediction, out-of-order execution, multiprocessing, speculative prefetching, address coalescing, multithreading, context switching and so forth. Minimalism is a virtue of domain-specific processors.
Those are all the things that make CPUs perform well on general workloads, but slow down a neural network. Not all neural processors follow the TPU architecture, but the essential design center - a highly efficient, stripped-down, matrix-optimized processor - is common to them all.
The Storage Bits take
The acceleration of AI technology in the last decade is breathtaking, especially for those old enough to remember the disappointments of the 70s and 80s. The pace of research, spurred by the potential for commercial profit, is accelerating every year.
Combined with the relative simplicity of neural processor design, we can expect rapid improvements in neural processing hardware and software. As is normal in a new area, many ideas are being tried, and a winnowing process has already begun.
The important thing is that these processors are key to making machines intelligent. That is the next big computer revolution.
Courteous comments welcome, of course. Learn more about Intel's Nervana here.