In my previous post on the recent Linley Processor Conference, I wrote about the ways that semiconductor companies are developing heterogeneous systems to reach higher levels of performance and efficiency than with traditional hardware. One of the areas where this is most urgently needed is vision processing, a challenge that got a lot of attention at this year's conference.
The obvious application here is autonomous vehicles. One of the dirty secrets of self-driving cars is that today's test vehicles rely on a trunk full of electronics (Ford's latest Fusion Hybrid autonomous development vehicle is one example). Sensors and software tend to be the big focus, but the job still requires a powerful CPU and multiple GPUs burning hundreds of watts to process all this data and make decisions in real time. Earlier this month, when Nvidia announced a future Drive PX Pegasus board, the company conceded that current hardware doesn't have the chops for fully autonomous driving. "The reality is we need more horsepower to get to Level 5," Danny Shapiro, Nvidia's senior director of automotive, reportedly told journalists.
But it's not just automotive. Embedded vision processors will play a big role in robotics, drones, smart surveillance cameras, virtual reality and augmented reality, and human-machine interfaces. In a keynote, Chris Rowen, the CEO of Cognite Ventures, said this has led to a silicon design renaissance with established IP vendors such as Cadence (Tensilica), Ceva, Intel (Mobileye), Nvidia, and Synopsys competing with 95 start-ups working on embedded vision in these areas, including some 17 chip start-ups building neural engines.
In embedded vision, Pulin Desai, a marketing director at Cadence, said, there are three separate systems for inference: sensing (cameras, radar and lidar, microphones), pre- and post-processing (noise reduction, image stabilization, HDR, etc.), and analysis with neural networks for face and object recognition and gesture detection. The sensing is handled by sensors and ISPs (image signal processors), and the pre- and post-processing can be done on a Tensilica Vision DSP, but Cadence has a separate Tensilica Vision C5 DSP that is specifically designed to run neural networks.
Desai talked about the challenges of creating an SoC with an embedded neural engine for a product that won't reach the market until 2019 or 2020. The computational requirements for neural network algorithms for image recognition have grown 16X in less than four years, he said. At the same time, neural network architectures are changing rapidly and new applications are emerging, so the hardware needs to be flexible. And it needs to handle all of this within a tight power budget.
The Vision C5 is a neural network DSP (NNDSP) designed to handle all neural network layers with 1,024 8-bit or 512 16-bit MACs in a single core delivering one trillion MACs per second in one square millimeter of die area. It can scale to any number of cores for higher performance and it is programmable. Manufactured on TSMC's 16nm process, a Vision C5 running at 690MHz can run AlexNet six times faster, Inception V3 up to nine times faster, and ResNet50 up to 4.5 times faster than "commercially available GPUs," according to Cadence.
The Kirin 970 in Huawei's new Mate 10 and Mate 10 Pro is the first smartphone SoC with a dedicated neural processing unit capable of 1.92 teraflops at half-precision (Cadence noted this several times but did not specifically state that it uses the Vision C5). Apple's A11 Bionic also has a neural engine and others are sure to follow. The Vision C5 is also targeted at SoCs for surveillance, automotive, drones, and wearables.
The competing Ceva-XM Vision DSPs are already used in camera modules, embedded in ISPs such as Rockchip's RK1608 or as separate companion chips for image processing. Ceva's solution for neural networks is to pair the Ceva-XM with a separate CNN Hardware Accelerator with up to 512 MAC units. Yair Siegel, Ceva's marketing director, talked about the growth of neural networks and said that state-of-the-art CNNs are extremely demanding in terms of computation and memory bandwidth. The Ceva Network Generator converts these models (in Caffe or TensorFlow) to a fixed-point graph and partitions it to run efficiently across the Vision DSP and Hardware Accelerator. Ceva says that the Hardware Accelerator delivers a 10X speedup compared with using the DSP alone on TinyYolo, a real-time object recognition algorithm.
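To make the "fixed-point" step concrete, here is a minimal sketch of symmetric 8-bit quantization, one common way to turn floating-point weights into integers that MAC hardware can process. It is a generic illustration, not a description of how the Ceva Network Generator works internally; the function names and the sample weights are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers that share one scale factor.

    Generic illustration only; not Ceva's actual conversion method.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.0, 0.3, 0.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

# Worst-case rounding error introduced by the conversion.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The largest-magnitude weight maps to the end of the integer range, and every other value is rounded to the nearest step, so the error per weight stays within about half a step. That small loss of precision is the price of running on narrow integer MAC units instead of floating-point hardware.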
Synopsys is taking a similar approach with its EV6x Embedded Vision Processor, which can combine up to four CPUs (each with a scalar unit and wide vector DSP) with an optional, programmable CNN Engine to accelerate convolutions. The CNN Engine is scalable from 880 to 1760 to 3520 MACs, delivering up to 4.5 trillion MACs per second (or a total of 9 teraflops) on TSMC's 16nm process at 1.28GHz. A single EV61 vector DSP with CNN engine uses less than one square millimeter of die area, and Synopsys said the tandem is capable of 2 trillion MACs per watt. Gordon Cooper, a product marketing manager at Synopsys, emphasized the tight integration between the vector DSPs and the CNN accelerator and said that the solution delivered the performance per watt to handle challenging applications such as ADAS (advanced driver assistance systems) for pedestrian detection.
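The throughput figures quoted for the CNN Engine follow from simple arithmetic. The sketch below reproduces them under two assumptions that the article does not state explicitly: each MAC unit completes one multiply-accumulate per clock cycle, and each MAC is counted as two operations (one multiply plus one add). The function names are illustrative.

```python
def peak_macs_per_second(mac_units, clock_hz):
    """Peak rate, assuming every unit finishes one MAC per cycle."""
    return mac_units * clock_hz


def macs_to_ops(macs_per_second):
    """Count each MAC as two operations: a multiply and an add."""
    return 2 * macs_per_second


# Largest EV6x CNN Engine configuration at the quoted clock speed.
ev6x_macs = peak_macs_per_second(3520, 1.28e9)
ev6x_ops = macs_to_ops(ev6x_macs)

print(f"{ev6x_macs / 1e12:.2f} trillion MACs per second")
print(f"{ev6x_ops / 1e12:.2f} trillion operations per second")
```

These are peak numbers. Sustained throughput on a real network depends on how well the MAC array is kept busy, which is why vendors also stress utilization and memory bandwidth.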
Qualcomm's solution to this problem has been to add new instructions, called Hexagon Vector eXtensions or HVX, to the Hexagon DSPs in its Snapdragon SoCs. First introduced two years ago, these are already used to power the HDR photography features on Pixel phones (despite Google's recent development of its own Pixel Visual Core), and Google has previously demonstrated how offloading a TensorFlow image-recognition network from a quad-core CPU to a Hexagon DSP can boost performance by 13x.
But Rick Maule, a senior director of product management at Qualcomm, said that over the past couple of years the company has learned that customers need more processor cycles and faster memory access. Qualcomm's solution is to double the number of compute elements, boost the frequency 50 percent, and embed low-latency memory in those compute elements. These "proposed changes" would increase performance from 99 billion MACs per second on the Snapdragon 820 to 288 billion MACs per second, resulting in a 3X speed-up on the Inception V3 image-recognition model. In addition to performance improvements, Qualcomm is working to make neural networks easier to program with its Snapdragon Neural Processing Engine, an abstraction layer, and Halide, a domain-specific language for image processing and computational photography.
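A quick first-order check shows how the two changes combine. The sketch assumes throughput scales linearly with both the number of compute elements and the clock frequency, which is a simplification. The function name is illustrative, and the article does not explain the small gap between this estimate and the quoted figure.

```python
def scaled_throughput(baseline, unit_multiplier, clock_multiplier):
    """First-order estimate: throughput scales with units and clock."""
    return baseline * unit_multiplier * clock_multiplier


baseline = 99e9   # MACs per second on the Snapdragon 820, as quoted
quoted = 288e9    # MACs per second after the proposed changes, as quoted

# Twice the compute elements at 1.5 times the frequency.
estimate = scaled_throughput(baseline, 2.0, 1.5)

print(f"estimate: {estimate / 1e9:.0f} billion MACs/s, "
      f"{estimate / baseline:.2f}x the baseline")
print(f"quoted:   {quoted / 1e9:.0f} billion MACs/s, "
      f"{quoted / baseline:.2f}x the baseline")
```

Both figures work out to roughly a threefold increase, in line with the 3X speed-up cited for Inception V3.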
While these are all notable advances, AImotive, a startup based in Budapest, is betting that only purpose-built hardware will be able to deliver a complete Level 5 autonomous system in under 50 watts. "None of today's hardware can solve the challenges we are facing," said Márton Fehér, the head of the company's aiWare hardware IP, citing large inputs (streaming images and video), very deep networks, and the need for safe, real-time processing.
Fehér said that flexible, general-purpose DNN solutions for embedded, real-time inference are inefficient because the programmability isn't worth the trade-off in performance per watt. The aiWare architecture covers 96 percent to 100 percent of the DNN operations, maximizes MAC utilization, and minimizes the use of external memory.
The company currently has an FPGA-based development kit and public benchmark suite, and it is developing a test chip, manufactured on GlobalFoundries' 22nm FD-SOI process, that will be available in the first quarter of 2018. Partners include Intel (Altera), Nvidia, NXP Semiconductors, and Qualcomm. AImotive has also developed an aiDrive software suite for autonomous driving and a driving simulator, and is working with Bosch, PSA Group (Peugeot, Citroën, DS Automobiles, Opel and Vauxhall), and Volvo, among others.
While there are many different approaches to solving the challenges with vision processing, the one thing that everyone at the Linley Processor Conference agreed on is that it is going to take much more powerful hardware. The amount of data coming off sensors is enormous, the models are growing larger, and it all needs to be processed in real time using less power than current solutions. We are likely to see a lot more innovation in this area over the next couple of years as the industry grapples with these challenges.