Thirty years ago, Yann LeCun pioneered the use of a particular form of machine learning, called the convolutional neural network, or CNN, while at the University of Toronto. That approach, moving a filter over a set of pixels to detect patterns in images, showed promise in cracking problems such as getting the computer to recognize hand-written digits with minimal human guidance.
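To make "moving a filter over a set of pixels" concrete, here is a minimal sketch in plain Python. It is an illustration only, not LeCun's code or any library's API: the function name `convolve2d`, the tiny image, and the edge-detecting filter are all made up for this example. As in most deep learning software, the filter is applied without flipping it, which is strictly speaking a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of lists) and return the response map.

    Only positions where the kernel fits entirely inside the image are computed
    ("valid" mode). Hypothetical helper for illustration.
    """
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    output = []
    for r in range(image_h - kernel_h + 1):
        row = []
        for c in range(image_w - kernel_w + 1):
            # Multiply the filter against the patch of pixels under it and sum.
            acc = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Toy image: dark on the left, bright on the right (a vertical edge).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Toy filter that responds where brightness increases from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

response = convolve2d(image, kernel)
print(response)  # [[0, 2, 0], [0, 2, 0]]
```

The response is largest in the middle column, exactly where the filter sits on top of the edge. In a CNN the filter values are learned from data rather than written by hand.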
Years later, LeCun, then at NYU, launched a "conspiracy," as he has termed it, to bring machine learning back into the limelight after a long winter for the discipline. The key was LeCun's CNN, which had continued to develop in sophistication to the point where it could produce results in computer vision that stunned the field.
The new breakthroughs with CNNs, along with innovations by peers such as Yoshua Bengio, of Montreal's MILA group for machine learning, and Geoffrey Hinton of Google Brain, succeeded in creating a new springtime for AI research, in the form of deep learning.
Today, the appeal of convolutions shows no sign of cooling, as the technique spreads like a weed to every corner of machine learning. Recent research has shown the prevalence of CNNs among neural network models in use. A study by Microsoft published in November, which looked at the kinds of deep networks running on Android mobile devices, found that almost 90% of the networks recovered from mobile apps were some kind of CNN.
At the International Solid State Circuits Conference in San Francisco this week, LeCun reflected on why this simple technique has thrived, and why its influence will only grow, in his view.
"A lot of the signals that systems will have to deal with are natural signals that come from array sensors," said LeCun in an interview with ZDNet following his keynote talk on Monday.
"Anything coming from a camera, including panoramic cameras and etc. Audio, either in the form of raw audio or in the form of time-frequency representations." Video also needs to be processed via convolutions, LeCun said, and especially 3-D imaging from depth sensors such as LIDAR.
Indeed, many of the demonstrations of different chips at the show this week featured the familiar "segmentation map," an image on the computer screen of things seen through a camera, whether a car's view of the road or a view of people in the room. Each object is highlighted by a colored outline with a label indicating the type of thing it is: a person, a dog, a street lamp, and so on. Convolutions are the zeitgeist of the current age, the clearest expression of AI's ability to in some sense understand the world around it.
"So, for all of this, most of your cycles are going to be spent processing the low-level signal," said LeCun, meaning the cycles of compute will be spent picking out what each of the billions of pixels in an image or video is displaying. "And what else but convolutions are you going to use? Right now, there's no alternative. So, most of the cycles are going to be spent doing convolutions, there's no question."
At the conference, there were plenty of chip innovations devoted to better processing of convolutions. For instance, researchers at the University of Michigan showed off a design for automotive applications that can displace some of the functions of LIDAR by instead analyzing many frames of video from a car's on-board camera. The researchers said they had made a breakthrough in how to develop a chip dedicated to CNNs that will run the network model at high speed while consuming less power.
There are some serious technical challenges as CNNs spread to applications beyond image recognition, said LeCun. In the case of 3-D imaging, with depth sensors such as LIDAR, where data comes in the form of "point clouds," noted LeCun, "most of the 3-D domains that your neural network will have to process will be empty — it's very sparse in terms of activations."
What one doesn't want are convolutions that waste time multiplying the zeros of a mostly empty vector or matrix. "So, you would like to know where the data is, and then be smart about how you follow this," said LeCun. He noted there has been work on the matter at Facebook, using "sparse convnets" running on GPUs. "But it might require some more low-level support" in hardware, observed LeCun.
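The idea of knowing where the data is can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Facebook's sparse convnets and not any chip's design: the function name `sparse_convolve2d`, the dictionary-of-points input format, and the 8-by-8 grid are all hypothetical. Instead of sliding the filter over every position, the loop visits only the non-zero inputs and scatters each one's contribution to the outputs it affects.

```python
def sparse_convolve2d(points, shape, kernel):
    """Convolve a sparse 2-D input, touching only its non-zero entries.

    points: dict mapping (row, col) -> value, holding non-zero inputs only.
    shape:  (height, width) of the full, mostly empty grid.
    Returns (outputs, multiplications), where outputs maps (row, col) -> value
    for every output position that received a contribution.
    Hypothetical helper for illustration.
    """
    height, width = shape
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    outputs = {}
    multiplications = 0
    for (r, c), value in points.items():
        for i in range(kernel_h):
            for j in range(kernel_w):
                # Output position whose window covers (r, c) at kernel offset (i, j).
                out_r, out_c = r - i, c - j
                if 0 <= out_r <= height - kernel_h and 0 <= out_c <= width - kernel_w:
                    outputs[(out_r, out_c)] = (
                        outputs.get((out_r, out_c), 0) + value * kernel[i][j]
                    )
                    multiplications += 1
    return outputs, multiplications


# An 8x8 grid with just two occupied cells, standing in for a sparse point cloud.
points = {(2, 3): 1, (5, 5): 2}
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

outputs, sparse_mults = sparse_convolve2d(points, (8, 8), kernel)

# A dense pass would do one multiply per kernel weight at every output position.
dense_mults = (8 - 3 + 1) * (8 - 3 + 1) * 3 * 3

print(sparse_mults, dense_mults)  # 18 324
```

In this toy case the sparse loop does 18 multiplications where a dense pass would do 324, almost all of them by zero. Real systems face the harder problem LeCun alludes to: keeping track of the occupied positions efficiently as data moves through many layers.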
That's where hardware innovators have to get to work, he suggested.
Samsung researchers on Tuesday presented their design for a mobile system-on-a-chip that can "prune" inputs precisely to avoid processing zeros in sparse input samples. LeCun voiced his approval after hearing the talk, tweeting out a few specs of the Samsung work.
"The most frequently uttered word at ISSCC this morning: convolution," tweeted the godfather of CNNs.