How do neural networks see depth?

Binocular vision is what enables humans to estimate depth. But neural networks can create 3D images from a single monocular picture. How do they do it? Until now, nobody knew.

One of the continuing problems with Deep Neural Networks (DNNs) is that humans typically do not understand how they achieve their amazing results. DNN models are trained on massive amounts of data until they surpass human accuracy, but the details of the models themselves are hidden inside thousands of equations - in short, not human readable.

If there is implicit bias - or other flaws - in the training data, those problems will be encoded in the model, with humans none the wiser. That's why I found the paper How do neural networks see depth in single images? worth a read.

Researchers from the Delft University of Technology, in the Netherlands, analyzed the MonoDepth model, which estimates depth from a single image. With only one image to work from, a DNN must rely on pictorial cues - an inherently assumption-laden process, since interpreting those cues requires understanding the environment.

Current analysis of monocular depth estimation has focused on feature visualization and attribution. This is valuable, but the Delft researchers took a different approach:

We treat the neural network as a black box, only measuring the responses (in this case depth maps) to certain inputs. . . . [W]e modify or disturb the images, for instance by adding conflicting visual cues, and look for a correlation in the resulting depth maps.

In other words, they mess with images to see how the model changes the depth map. This lets them infer which features the model relies on to estimate depth.
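The probing idea can be sketched as a small black-box experiment: shift an object vertically in the frame and record how the predicted depth for that object changes. This is an illustrative sketch, not the authors' actual code; `predict_depth` stands in for any single-image depth model's forward pass.

```python
import numpy as np

def probe_vertical_position(predict_depth, image, obj_mask, shifts):
    """Black-box probe: shift image content vertically and record how
    the model's predicted depth for a masked object changes.

    predict_depth: callable, image -> depth map (hypothetical stand-in
                   for the model under test)
    image:         H x W x 3 array
    obj_mask:      boolean H x W mask of the object's pixels
    shifts:        vertical pixel offsets to try
    """
    results = []
    for dy in shifts:
        # Simplification: np.roll shifts the whole frame (with wrap-around)
        # rather than compositing just the object onto a fixed background.
        shifted = np.roll(image, dy, axis=0)
        mask = np.roll(obj_mask, dy, axis=0)
        depth = predict_depth(shifted)
        results.append((dy, float(depth[mask].mean())))
    return results
```

If the probe's output correlates strongly with the shift - as the researchers found for MonoDepth - the model is leaning on vertical position as a depth cue.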

The extensive research into human vision provides a long list of possible features that DNNs might use, including:

  • Vertical position relative to the horizon.
  • Depth order: which objects block others.
  • Texture detail: closer objects have clearer textures.
  • Apparent size.
  • Shading and illumination.

Human visual acuity generally exceeds the resolution of even the best photographs, so not all of these features survive in a DNN's input. Compressed photos in particular tend to smear fine textures.

In their experiments, the researchers found that MonoDepth primarily uses the vertical position of objects to estimate their depth, rather than their apparent size. Because vertical position depends on camera pose, changes in roll and pitch lead the model to mis-estimate distance. Furthermore, MonoDepth is unreliable when faced with objects that weren't in its training set.
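The finding makes geometric sense: for a camera at a fixed height over a flat road, an object's ground-contact row in the image determines its distance. A minimal sketch of that classical pinhole relationship - not MonoDepth's actual computation; the focal length and camera height below are illustrative values:

```python
def ground_plane_depth(y_pixel, y_horizon, focal_px, cam_height_m):
    """Depth (meters) to a point on a flat ground plane, from its image row.

    Pinhole-camera geometry: rows closer to the horizon correspond to
    points farther away, which is exactly why vertical position works
    as a depth cue - and why it breaks when camera pitch or roll moves
    the horizon.
    """
    dy = y_pixel - y_horizon  # rows below the horizon line
    if dy <= 0:
        raise ValueError("point must be below the horizon")
    return focal_px * cam_height_m / dy
```

A model that has implicitly learned this relationship for one camera setup will be fooled whenever the true horizon shifts, which is consistent with the pitch-and-roll failures the researchers observed.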

The Storage Bits take

While this study is limited to a single DNN - MonoDepth - trained on a single dataset - KITTI - it points up the need to profile these machine learning models. Given that tens of millions of machine-vision-enabled vehicles will be cruising around in the next decade, we don't want them mowing down costumed trick-or-treaters just because they don't look like the people the models were trained to see.

What this human sees is that if we don't understand how DNNs achieve their results, we are bound to discover their limitations in practice, rather than in tests. Sufficient tragedies - think Boeing 737 Max - could cripple public acceptance of machine learning. And that, given the shrinking workforces in the developed world, would be an even greater tragedy if we are to keep our economies growing.

Comments welcome!