Facebook Oculus research crafts strange mashup of John Oliver and Stephen Colbert
Researchers at Carnegie Mellon and at Facebook's Oculus research lab created more convincing fake videos by training a deep neural network to transfer the style of comedian John Oliver onto the likeness of Stephen Colbert in a synthesized video. The results could be creepy or thrilling, depending on your point of view.
One thread of the recent artificial intelligence revival is crafting fakes that look and sound convincingly real, such as reproductions of paintings by known artists.
Researchers at Facebook's Oculus research lab, and colleagues at Carnegie Mellon, have developed neural networks that create fake videos showing what it would be like if one person spoke in the manner of another person, or videos showing a cloudy day in a place where the skies were actually clear.
The product of all this is video that could be unsettling or thrilling, depending on your perspective: an original monologue by comedian John Oliver can be used to craft a new, fake video sequence of fellow comedian Stephen Colbert, translating Oliver's expressions and mannerisms onto Colbert's likeness.
The phenomenon, known as "retargeting," has been explored for years, mainly with still images. The new research promises to refine visual fakes by employing more of the clues provided by the moment-to-moment shifts of frames in a video.
The paper, Recycle-GAN: Unsupervised Video Retargeting, is posted on the arXiv pre-print server and was presented at the 15th European Conference on Computer Vision last month. It is authored by Aayush Bansal and Deva Ramanan of Carnegie Mellon, and Shugao Ma and Yaser Sheikh of Facebook's Oculus Research in Pittsburgh.
A webpage for the work has many examples of videos that have been transformed into new versions, including the Oliver-Colbert mashup, as well as videos in which Barack Obama's speeches control frames bearing President Donald Trump's likeness, and vice versa. Each politician is made to ape the other's eye movements, mouth movements, gaze, and expressions. There are also landscapes where a breeze has been artificially created in a scene of a calm day by copying the wind patterns from an original video.
The authors point out that such retargeting of movements could be of benefit in avatars for virtual reality: Existing methods have struggled to build avatars when features of faces are occluded, they note, and the extra information in the temporal sequence of frames of video could get around such obstacles. That would seem to at least partially explain the connection to Facebook's Oculus unit.
The deep learning network makes use of so-called generative adversarial networks, or GANs, in which one system of equations, called a generator, must transform video frames in order to synthesize new frames, while another system of equations, the discriminator, tries to tell the fakes from the originals. The two compete in a game of forgery and detection, like counterfeiters and cops, until the generator gets so good at faking video frames that the discriminator can't tell them from the real thing.
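The adversarial tug-of-war described above can be sketched in a few lines. This is a toy illustration only, not the paper's architecture: the "generator" and "discriminator" below are simple linear models with made-up shapes, standing in for the deep networks the researchers actually train.

```python
# A minimal sketch of a GAN's two competing losses, using NumPy.
# All model names and shapes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w):
    # Logistic score: estimated probability that x is a real sample.
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def generator(z, w):
    # Maps random noise z to a synthetic sample.
    return z @ w

# Toy data: "real" frames are vectors clustered near a fixed point.
real = rng.normal(loc=2.0, scale=0.1, size=(64, 8))
noise = rng.normal(size=(64, 8))

w_gen = rng.normal(size=(8, 8)) * 0.1
w_disc = rng.normal(size=(8,)) * 0.1

fake = generator(noise, w_gen)

# Discriminator loss: push real scores toward 1, fake scores toward 0.
d_loss = (-np.mean(np.log(discriminator(real, w_disc) + 1e-8))
          - np.mean(np.log(1.0 - discriminator(fake, w_disc) + 1e-8)))

# Generator loss: fool the discriminator (push fake scores toward 1).
g_loss = -np.mean(np.log(discriminator(fake, w_disc) + 1e-8))
```

Training alternates gradient steps on `d_loss` and `g_loss`; each model's improvement raises the bar for the other, which is the "counterfeiters and cops" dynamic the researchers exploit.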
The researchers built upon a 2017 research paper from UC Berkeley researchers called Cycle-GAN. That work transformed only still images. The key notion, that of a "cycle," means the original image can be recovered from the fake, in the same way a computer translation from English to French should yield the original sentence when translated back from French to English. This paper adds knowledge of how a picture of a face, a landscape, or a flower changes from one frame to the next.
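The "cycle" idea can be made concrete with a toy round trip. In this sketch the two translators are an invertible matrix and its inverse, purely for illustration; in Cycle-GAN both directions are learned neural networks, and the loss below is what pushes them toward closing the cycle.

```python
# Cycle consistency, sketched with NumPy: translating X -> Y -> X
# should recover the original. The linear maps are illustrative
# stand-ins, not the paper's learned translators.
import numpy as np

rng = np.random.default_rng(1)

A = rng.normal(size=(4, 4))        # stand-in for the X -> Y translator
A_inv = np.linalg.inv(A)           # stand-in for the Y -> X translator

x = rng.normal(size=(10, 4))       # a batch of "images" as vectors
x_round_trip = (x @ A) @ A_inv     # X -> Y -> X

# Cycle-consistency loss: mean absolute reconstruction error,
# near zero when the cycle closes.
cycle_loss = np.mean(np.abs(x - x_round_trip))
```

The English-to-French-and-back analogy in the text is exactly this check: a good pair of translators should leave `cycle_loss` close to zero.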
The authors believe they're improving translation from one thing to another by adding more "constraints" to the problem. As they write, "many natural visual signals are inherently spatiotemporal in nature, which provides strong temporal constraints for free. This results in significantly better mappings."
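The temporal constraint the authors describe can be sketched by adding a frame predictor to the cycle. This is a hypothetical, linear-algebra caricature of the idea: translate a frame into the other domain, predict the next frame there, translate back, and penalize disagreement with the true next frame. The map names (`G`, `G_back`, `P`) are illustrative, not the paper's notation.

```python
# A toy sketch of a temporal ("recycle"-style) constraint using NumPy.
# All maps are stand-ins: real models are learned neural networks.
import numpy as np

rng = np.random.default_rng(2)

G = rng.normal(size=(4, 4))        # X -> Y translator (toy)
G_back = np.linalg.inv(G)          # Y -> X translator (toy)
P = np.eye(4)                      # temporal predictor: next frame ~= current (toy)

x_t = rng.normal(size=(4,))                            # frame at time t
x_next = x_t + rng.normal(scale=0.01, size=(4,))       # nearly-static next frame

# Translate to the Y domain, predict forward in time, translate back,
# and compare with the true next frame in the X domain.
y_t = x_t @ G
y_next_pred = y_t @ P
x_next_pred = y_next_pred @ G_back

temporal_loss = np.mean(np.abs(x_next - x_next_pred))
```

This is the "constraint for free" the authors mention: consecutive frames of natural video change smoothly, so a translation that ignores that smoothness gets penalized.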
The results of the study suggest to the authors that adding temporal data improves the fakery. They had 15 human subjects look at the videos they created and say whether each was real or fake. More than a quarter of the time, 28 percent, the subjects mistakenly judged a fake video created with the new approach as genuine, whereas videos created with the earlier Cycle-GAN technology fooled them only about 7.3 percent of the time.
One outcome of such deception could be lots more fake stuff. As they refine the neural nets, the authors note they can better approximate the "style" of one video in a new fake video -- things like the cadence of one speaker imprinted onto another. "Using spatiotemporal generative models," they write, "would allow to even learn the speed of generated output. E.g. Two people may have different ways of content delivery and that one person can take longer than other to say the same thing."
The authors, happily, present one of their failures, always a refreshing ingredient. They took a video of a real bird and mapped it to a video of an animated origami bird. The fake video breaks down when the real bird flies off its perch and out of the scene: the fake origami bird initially leaves too, but then inappropriately flies back in. The authors write that their neural net has trouble dealing with the bird's complete absence. "Our algorithm is not able to make transition of association when the real bird is completely invisible, and so it generated a random flying origami," they conclude.