Artificial intelligence and the future of smartphone photography

3-D sensors coming to this year's smartphones are just the leading edge of a wave of machine learning-driven photography that will both correct the shortcomings of smartphone pictures and provide some stunning new aspects of photography.
Written by Tiernan Ray, Senior Contributing Writer

Photography has been transformed in the age of the smartphone. Not only is the pose different, as in the case of the selfie, but the entire nature of the process of light being captured by phone cameras is something else altogether. 

Cameras are no longer just a lens and a sensor; they are also the collection of algorithms that instantly manipulate images to achieve photographic results that would otherwise require hours of manipulation via desktop software. Photography has become computational photography.

Continued advances in machine learning, a form of artificial intelligence, will bring still more capabilities that will make today's smartphone pictures look passé.

Recent examples of the state of the art on phones are pictures from the Pixel 3 smartphone made by Alphabet's Google, and photos from Apple's iPhone X. In the former case, Google has used machine learning to capture more detail under low-light conditions, so that night scenes look like daylight. These are simply not shots that ever existed in nature. They are super-resolution pictures.

And Apple, starting with iPhone X in 2017, added "bokeh," the artful blurring of elements outside of the focal point. This was not achieved via aspects of the lens itself, as is the case in traditional photography, but rather by a computational adjustment of the pixels after the image is captured.
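
To make the idea of a computational blur concrete, here is a minimal sketch of the general principle, not Apple's actual pipeline, which is not described here. It assumes a grayscale image and a per-pixel depth map, both given as nested lists; the function name `synthetic_bokeh` and its parameters are hypothetical, chosen only for illustration. Pixels near the chosen focus depth are left alone, and pixels farther from it are averaged with their neighbors.

```python
def synthetic_bokeh(image, depth, focus_depth, tolerance=0.5, max_radius=2):
    """Blur pixels whose depth differs from focus_depth (illustrative only).

    image and depth are equally sized lists of rows of numbers.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            gap = abs(depth[y][x] - focus_depth)
            if gap <= tolerance:
                radius = 0  # in focus: keep the pixel as captured
            else:
                # Assumption: blur radius grows with distance from the focal plane.
                radius = max(1, min(max_radius, int(round(gap))))
            total, count = 0.0, 0
            # Box blur: average over the neighborhood, clipped at the image edges.
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[yy][xx]
                    count += 1
            output[y][x] = total / count
    return output
```

A real phone would use a far more sophisticated blur kernel and depth estimate; the point is only that the blur is decided per pixel, in software, after the shutter has closed.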

It's quite possible that the breakthrough development of 2019 and 2020 will be manipulating the perspective of an image to improve it. Hopefully, that will lead to a correction of the distortions inherent in smartphone photography that make its pictures come up short next to those from digital single-lens-reflex (DSLR) cameras.


How a convolutional neural network, or CNN, attempts to reconstruct reality from a picture. From "Understanding the Limitations of CNN-based Absolute Camera Pose Regression," by Torsten Sattler of Chalmers University of Technology, Qunjie Zhou and Laura Leal-Taixe of TU Munich, and Marc Pollefeys of ETH Zürich and Microsoft.

Phones could, in fact, achieve results akin to those of what are known as "tilt-shift" cameras. In a tilt-shift camera, the lens is angled to make up for the angle at which a person is standing with the camera, and thereby correct the distortions that would be created in the image because of the angle between the individual and the scene. Tilt-shift capabilities can be had by DSLR owners in a variety of removable lenses from various vendors.

The average phone camera has a lens barrel so tiny that everything it captures is distorted. Nothing is ever quite the right shape as it is in the real world. Most people may not notice or care, as they've become used to selfies on Instagram. But it would be nice if these aberrations could be ameliorated. And if they can, it would be a selling point for the next round of smartphones from Google, Apple, etc.

Increasingly, the iPhone and other phones will carry rear cameras with 3-D sensors. These sensors, made by the likes of Lumentum Holdings and other chip vendors, measure the depth of the surroundings of the phone by sending out beams of light and measuring how they return to the phone after bouncing off objects. Techniques such as "time-of-flight" allow the phone to measure in detail the three-dimensional structure of the surrounding environment.
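
The arithmetic behind time-of-flight is simple: light travels out and back, so the distance to a surface is the speed of light multiplied by the round-trip time, divided by two. The following is a minimal sketch of that calculation only; the function names are hypothetical, and real sensors typically infer the timing indirectly rather than with a stopwatch per pixel.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_distance_m(round_trip_seconds):
    """Distance in meters to a surface, given a light pulse's round-trip time."""
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse covers the distance twice (out and back), hence the division by 2.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


def depth_map_m(round_trip_times):
    """Convert a grid (list of rows) of round-trip times into a grid of depths."""
    return [[tof_distance_m(t) for t in row] for row in round_trip_times]
```

A surface one meter away returns the pulse in under seven billionths of a second, which gives a sense of how precise the timing hardware has to be.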

Those sensors can take advantage of a vast body of statistical work that has been done in recent years to understand the relationship between 2-D images and the real world. 


Google's "Night Sight" feature on its Pixel 3 smartphones: scenes that never existed in nature. 


A great deal of statistical work has gone into reproducing the kind of optical correction that tilt-shift lenses perform, both with and without special camera gear. For example, a technique called "RANSAC," or "random sample consensus," goes back to 1981 and is specifically designed to find landmarks in the 3-D world that can be mapped to points in a 2-D image plane, to know how the 2-D image correlates to three-dimensional reality. Using that technique, it's possible to gain a greater understanding of how a two-dimensional representation corresponds to the real world.
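
The core idea of RANSAC can be shown on a much simpler problem than camera geometry. The sketch below is a toy: it fits a straight line to 2-D points that include some bad measurements, rather than mapping 3-D landmarks to an image. The function name `ransac_line` and its parameters are hypothetical. The loop repeatedly picks a random minimal sample (two points), counts how many other points agree with the line through them, and keeps the largest agreeing set, the "consensus."

```python
import random


def ransac_line(points, iterations=200, threshold=0.1, seed=0):
    """Fit y = slope * x + intercept to (x, y) points while ignoring outliers."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best_inliers = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        # Line through the two samples in the form a*x + b*y + c = 0.
        a, b = y2 - y1, x1 - x2
        norm = (a * a + b * b) ** 0.5
        if norm == 0:
            continue  # the two samples coincide; no line is defined
        c = -(a * x1 + b * y1)
        # Consensus set: points within `threshold` of the candidate line.
        inliers = [(x, y) for x, y in points
                   if abs(a * x + b * y + c) / norm <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus set found")
    # Refine with ordinary least squares on the inliers only.
    n = len(best_inliers)
    mean_x = sum(x for x, _ in best_inliers) / n
    mean_y = sum(y for _, y in best_inliers) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in best_inliers)
    if sxx == 0:
        raise ValueError("inliers form a vertical line; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in best_inliers)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, best_inliers
```

In camera work the same loop is run with a different model: instead of two points defining a line, a handful of 2-D to 3-D correspondences define a candidate camera position, and the consensus set is the correspondences that candidate explains.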

A team of researchers at the University of Florence in 2015 built on RANSAC to infer the setup of a pan-tilt-zoom camera by reasoning backward from pictures it took. They were able to tune the actuators, the motors that control the camera, to a fine degree by using software to analyze how much distortion is introduced into pictures with different placements of the camera. And they were able to do it for video, not just still images.

From that time, there's been a steady stream of work to estimate the position and orientation of objects in pictures, referred to as pose estimation, and on a related task, simultaneous localization and mapping, or SLAM, which constructs in software a "cloud" of points in a 3-D scene that can be used to understand how much distortion is in a digital photo.

Researchers at the University of Erlangen-Nürnberg in Germany and the Woods Hole Oceanographic Institution in 2017 showed off a Python library, called CameraTransform, which lets one reckon the real dimensions of an object in the world by working backward from the image taken. 
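
The simplest version of working backward from an image to a real-world size uses the idealized pinhole camera model. The sketch below is not the CameraTransform API, which is not shown here; it is a hand-rolled illustration with a hypothetical function name, and it assumes the distance to the object is already known and that the object is parallel to the sensor.

```python
def object_height_m(height_px, image_height_px, sensor_height_mm,
                    focal_length_mm, distance_m):
    """Estimate an object's real height from its height in pixels.

    Pinhole model: real_height / distance = height_on_sensor / focal_length.
    """
    if image_height_px <= 0 or focal_length_mm <= 0:
        raise ValueError("image height and focal length must be positive")
    # Convert the object's height from pixels to millimeters on the sensor.
    height_on_sensor_mm = height_px * sensor_height_mm / image_height_px
    # Similar triangles scale the sensor measurement up to the real world.
    return height_on_sensor_mm * distance_m / focal_length_mm
```

For example, an object spanning 1,000 of 4,000 pixel rows on a 4 mm-tall sensor behind a 4 mm lens, photographed from 8 meters away, works out to 2 meters tall. Libraries of this kind also account for the tilt and elevation of the camera, which this sketch ignores.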


Seeing around corners: a neural network created by researchers to infer objects occluded in a picture, consisting of an encoder-decoder combined with a generative adversarial network. Courtesy of Helisa Dhamo, Keisuke Tateno, Iro Laina, Nassir Navab, and Federico Tombari of the Technical University of Munich, with support from Canon, Inc.

Last year, researchers at the Technical University of Munich, Germany and Canon, Inc. showed it's possible to take a single image and infer what's in the scene that's occluded by another object. The result, called a "layered depth image," can be used to create new scenes by removing an object from a photo, revealing the background that the camera never saw, but that was computed from the image. The method uses the familiar encoder-decoder approach found in many neural network applications to estimate the depth of a scene, and a "generative adversarial network," or GAN, to construct the parts of the scene that were never actually in view when the picture was taken.

All that research is bubbling up and is going to culminate in some fantastic abilities for the next crop of smartphone cameras, equipped with 3-D sensors. At the very least, one can imagine portraits taken on smartphones that no longer have strange distortions of people's faces. It will be possible to take super-resolution pictures of architecture in which parallel lines stay parallel, because the distortions of the lens have been evened out. The smartphone industry will be able to claim another victory over the DSLR market as phones churn out pictures with stunning levels of accuracy and realism.

But, of course, the long-term trend for smartphone photography is away from realism, toward more striking effects that were not possible before computational photography. And so we may see uses of 3-D sensing that tend toward the surreal. 

For example, tilt-shift cameras can be used to create some strangely beautiful effects, such as narrowing the depth of field of the shot to an extreme degree. That has the effect of making landscapes look as if they're toy models, in an oddly satisfying way. There are apps for phones that will do something similar, but the effect of having 3-D sensors coupled to AI techniques will go well beyond what those apps achieve. There are techniques for achieving tilt-shift in Photoshop, but it will be much more satisfying to have the same effects come right out of the camera with each press of the shutter button.

Down the road, there'll be another stage that will mean a lot in terms of advancing machine learning techniques. It's possible to forgo the use of 3-D sensors and just use a convolutional neural network, or CNN, to infer the coordinates in space of objects. That would save on the expense of building the sensors into phones.

However, currently, such software-only approaches produce poor results, as discussed in a report out this week by researchers at Microsoft and academic collaborators. Known as "absolute pose regression," the software-only approach failed to generalize, they write, after training, meaning that whatever techniques the CNN acquired didn't correctly estimate geometry when tested with novel images.

The authors consider their work "an important sanity check" for software-only efforts, and they conclude that "there is still a significant amount of research to be done before pose regression approaches become practically relevant."

How will that work get done? Not by researchers alone. It will be done by lots of smartphone owners. With the newest models, containing 3-D sensors, they will snap away their impressive 3-D sensing-enhanced pictures. While they do so, their device, or the cloud, will be keeping track of how real-world geometry correlates to 2-D images. It will be using all that activity, in other words, to keep learning. Some day, with enough 3-D shots, the CNN, or whatever algorithm is used, will be smart enough to look at the world and know exactly what it's like even without help from 3-D depth perception. 

Are you looking forward to the next smartphone camera innovations? Tell me what you think in the comments section.
