3D capture with normal smartphone cameras

New 4D visualization techniques use smartphone video capture to democratize virtualization.
Written by Greg Nichols, Contributing Writer

Fans of professional sports will be familiar with so-called 4D visualizations, which offer multiple vantages of a single scene. A receiver catches a pass, but was he inbounds? The live producer and announcers will run a replay, shifting the POV so viewers have the sensation of zooming around to the front of the action to have a look.

There's a whole lot of technology behind that seemingly simple maneuver, including a careful orchestration of fixed-point cameras that are positioned to enable near real-time video stitching. 

What if video from smartphone cameras aimed at a scene could be employed the same way?

That's the question driving researchers at Carnegie Mellon University's Robotics Institute, who have now developed a way to combine iPhone videos so that viewers can watch an event from various angles. The virtualization technology has clear applications in mixed reality, including editing out objects that obscure line of sight or adding or deleting people from a scene.

"We are only limited by the number of cameras," explains Aayush Bansal, a Ph.D. student in CMU's Robotics Institute. The CMU technique has been demonstrated with as many as 15 camera feeds at a time.

The demonstration points to a democratization of virtualized reality, which currently is the purview of expensive studios and live events that employ dozens of cameras coordinated to capture every angle. But just as the venerable disposable camera made everyone a wedding photographer, smartphones may soon be employed to crowdsource so-called 4D visualizations of gatherings. Given that pulling out your phone and taking video at events like weddings and parties is already commonplace, the technology has the benefit of piggybacking on conditioned behavior.

"The point of using iPhones was to show that anyone can use this system," Bansal said. "The world is our studio."

The challenge for the CMU researchers was to stitch together 3D scenes from unpredictably aimed videos, something that had not been done before. The team used convolutional neural nets, a type of deep learning model well suited to analyzing visual data. The networks identify common visual data across multiple feeds and work backward to stitch the video together.
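For readers who want a feel for what "identifying common visual data across multiple feeds" can look like, here is a minimal sketch in Python. It is not the CMU team's method. It assumes each feed has already been reduced to a handful of feature vectors (in a real system these might come from a neural network), and it pairs up vectors from two feeds when each is the other's closest match. The function names, the sample numbers, and the 0.9 similarity threshold are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mutual_matches(feed_a, feed_b, threshold=0.9):
    """Return (i, j) pairs where feature i of feed_a and feature j of feed_b
    are each other's most similar feature and clear the threshold.

    Illustrative only: the threshold and the mutual-nearest-neighbor rule
    are assumptions, not details taken from the CMU research.
    """
    if not feed_a or not feed_b:
        return []
    sims = [[cosine(a, b) for b in feed_b] for a in feed_a]
    # Best partner in feed_b for every feature in feed_a.
    best_for_a = [max(range(len(feed_b)), key=lambda j: row[j]) for row in sims]
    # Best partner in feed_a for every feature in feed_b.
    best_for_b = [
        max(range(len(feed_a)), key=lambda i: sims[i][j])
        for j in range(len(feed_b))
    ]
    return [
        (i, j)
        for i, j in enumerate(best_for_a)
        if best_for_b[j] == i and sims[i][j] >= threshold
    ]


# Made-up feature vectors standing in for two phones filming the same scene.
phone_one = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
phone_two = [[0.0, 0.9, 0.1], [0.0, 0.0, 1.0], [0.95, 0.05, 0.0]]
print(mutual_matches(phone_one, phone_two))
```

Requiring the match to hold in both directions is a common way to discard weak or ambiguous pairings before any geometry is estimated. Going from matched features to a stitched 3D scene is a much larger step that this sketch does not attempt.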

The National Science Foundation, Office of Naval Research, and Qualcomm supported the research, which was conducted by Bansal and faculty members of the CMU Robotics Institute and was presented last month at the Computer Vision and Pattern Recognition conference.

Fittingly, that conference was held virtually.
