One thread of the recent artificial intelligence revival is crafting fakes that look and sound convincingly real, such as reproductions of paintings by known artists.
Researchers at Facebook's Oculus research lab and colleagues at Carnegie Mellon have developed neural networks that create fake videos showing what it would look like if one person spoke in the manner of another, or showing a cloudy day in a place where the skies were actually clear.
The results are videos that could be unsettling or thrilling, depending on your perspective: An original monologue by comedian John Oliver can be used to generate a new, fake video sequence of fellow comedian Stephen Colbert, transferring Oliver's expressions and mannerisms onto Colbert's likeness.
The phenomenon, known as "retargeting," has been explored for years, mainly with still images. The new research promises to refine visual fakes by employing more of the clues provided by the moment-to-moment shifts of frames in a video.
The paper, Recycle-GAN: Unsupervised Video Retargeting, is posted on the arXiv pre-print server and was presented at the 15th European Conference on Computer Vision last month. It is authored by Aayush Bansal and Deva Ramanan of Carnegie Mellon, and Shugao Ma and Yaser Sheikh of Facebook's Oculus Research in Pittsburgh.
A webpage for the work has many examples of videos that have been transformed into new versions, including the Oliver-Colbert mash-up and clips in which Barack Obama's speeches drive frames of video bearing President Donald Trump's likeness, and vice versa. Each politician is made to ape the other's eye movements, mouth movements, gaze, and expressions. There are also landscapes in which a breeze has been artificially added to a scene of a calm day by copying the wind patterns in an original video.
The authors point out that such retargeting of movements could be of benefit in avatars for virtual reality: Existing methods have struggled to build avatars when features of faces are occluded, they note, and the extra information in the temporal sequence of frames of video could get around such obstacles. That would seem to at least partially explain the connection to Facebook's Oculus unit.
The deep learning network makes use of so-called generative adversarial networks, or GANs, where one system of equations, called a generator, must transform video frames in order to synthesize some new frames, and another system of equations, the discriminator, has to try to tell the fake from the original. The two compete with forgery and investigation, like counterfeiters and cops, until the generator gets so good at faking video frames that the discriminator can't tell them from the real thing.
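The adversarial contest described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the discriminator outputs a probability that a frame is real, scored with the standard binary cross-entropy loss.

```python
import math

def discriminator_loss(p_real, p_fake):
    """Loss for the 'cop' (hypothetical helper).

    p_real: discriminator's probability that a genuine frame is real.
    p_fake: discriminator's probability that a generated frame is real.
    The loss is low when p_real is near 1 and p_fake is near 0.
    """
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake):
    """Loss for the 'counterfeiter' (hypothetical helper).

    The loss is low when the discriminator is fooled, i.e. when it
    assigns a high 'real' probability to a generated frame.
    """
    return -math.log(p_fake)

# A sharp discriminator scores better than one that is guessing.
print(discriminator_loss(0.9, 0.1) < discriminator_loss(0.5, 0.5))
# A generator that fools the discriminator scores better than one that does not.
print(generator_loss(0.9) < generator_loss(0.1))
```

Training alternates between lowering each loss in turn. In the idealized end state the discriminator outputs 0.5 for every frame, meaning it can no longer tell generated frames from real ones.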
The researchers built upon an August research paper from UC Berkeley researchers called Cycle-GAN. That work transformed only still images. Its key notion, that of a "cycle," means that the original image can be recovered from the fake, in the same way that a sentence translated by a computer from English to French should yield the original sentence when translated back from French to English. The new paper adds knowledge of how a picture of a face, a landscape, or a flower changes from one frame to the next.
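The "cycle" idea, and the temporal twist added to it, can be illustrated with a toy example. The sketch below is a simplified reading of the approach, not the paper's implementation: frames are reduced to single numbers, the two translators and the frame predictor are hand-written stand-ins for learned networks, and every name is hypothetical.

```python
def squared_error(a, b):
    return (a - b) ** 2

def cycle_loss(frames, to_y, to_x):
    """Translate each frame to the other domain and back;
    penalize any difference from the original frame."""
    return sum(squared_error(x, to_x(to_y(x))) for x in frames)

def recycle_loss(frames, to_y, predict_y, to_x):
    """Temporal variant: translate two consecutive frames, predict the
    next frame in the other domain, translate that prediction back,
    and compare it with the true next frame of the original video."""
    total = 0.0
    for t in range(1, len(frames) - 1):
        y_prev, y_curr = to_y(frames[t - 1]), to_y(frames[t])
        y_next = predict_y(y_prev, y_curr)
        total += squared_error(frames[t + 1], to_x(y_next))
    return total

# Toy stand-ins (assumptions for illustration only).
to_y = lambda x: 2.0 * x + 1.0                    # domain X -> domain Y
to_x = lambda y: (y - 1.0) / 2.0                  # domain Y -> domain X
predict_y = lambda y_prev, y_curr: 2.0 * y_curr - y_prev  # linear extrapolation

frames = [0.0, 1.0, 2.0, 3.0]  # a "video" whose content changes steadily

print(cycle_loss(frames, to_y, to_x))
print(recycle_loss(frames, to_y, predict_y, to_x))
```

With these perfectly matched stand-ins both losses are zero. A mismatched back-translation, such as `lambda y: y / 2.0`, makes them positive, which is the signal a training procedure would use to correct the networks.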
The authors believe they're improving translation from one thing to another by adding more "constraints" to the problem. As they write, "many natural visual signals are inherently spatiotemporal in nature, which provides strong temporal constraints for free. This results in significantly better mappings."
The results of the study suggest to the authors that adding temporal data improves the fakery. They had 15 human subjects look at the videos they created and say whether they were real or fake. More than a quarter of the time, 28 percent, the subjects mistakenly judged a fake video created with the new approach to be genuine, whereas videos created with the previous Cycle-GAN technology fooled them only about 7.3 percent of the time.
One outcome of such deception could be lots more fake stuff. As they refine the neural nets, the authors note they can better approximate the "style" of one video in a new fake video -- things like the cadence of one speaker imprinted onto another. "Using spatiotemporal generative models," they write, "would allow to even learn the speed of generated output. E.g. Two people may have different ways of content delivery and that one person can take longer than other to say the same thing."
The authors, happily, present one of their failures, always a refreshing ingredient. They took a video of a bird and mapped it to a video of an animated bird in origami format. The fake video fails when the real bird flies off its perch and out of the scene. The fake origami bird initially went away, but then it inappropriately flew back. The authors write that their neural net is having trouble dealing with the complete absence of the bird. "Our algorithm is not able to make transition of association when the real bird is completely invisible, and so it generated a random flying origami," they conclude.