On September 29, the Emmy for interactive documentary went to 'In Event of Moon Disaster', a film that uses artificial intelligence (AI) to create a fake video featuring former US President Richard Nixon. The film shows him delivering a speech that was prepared in case the Apollo 11 mission failed, leaving astronauts Neil Armstrong and Buzz Aldrin to die on the moon.
The multimedia project was created by the Massachusetts Institute of Technology's Center for Advanced Virtuality, with a bit of help from a Ukrainian voice-cloning startup, Respeecher, which worked on Nixon's voice.
Alex Serdiuk, the CEO of Respeecher, says the idea behind this seven-minute film was to show what online misinformation will look like in the future. The project was "not just an opportunity to do cool stuff with our technology, but also to showcase what these technologies are capable of," he said.
In the years to come, deepfake videos might become more common on social media and more difficult to spot, with serious consequences for society. It's already known that fake news tends to travel faster. An MIT study showed, for instance, that false claims are 70% more likely to be shared than true ones.
This danger is why Serdiuk says it's his duty to help raise awareness about the misuse of deepfakes. "That's quite an important part of the work we do, educating society about synthetic media technologies," he tells ZDNet.
How to do a deepfake
In Event of Moon Disaster was an ambitious multimedia project that benefited from the expertise of professionals across different fields. The film was co-directed by Francesca Panetta and Halsey Burgund at the MIT Center for Advanced Virtuality, who worked closely with two startups that handled the tech part of the project. The altered image of Richard Nixon was created by Tel Aviv-based Canny AI, while the voice of the President was generated by Respeecher's engineers in their small Kyiv offices.
The Emmy win, over Oculus TV's 'Micro Monsters with David Attenborough' and RT's 'Lessons of Auschwitz VR' project, came as a surprise for Respeecher, a startup that launched less than four years ago. Back then, Serdiuk and his friend Dmytro Bielievtsov participated in a hackathon in an attempt to do something interesting to complement the tedious data analytics jobs they did for banks and insurance companies.
At that hackathon, most teams focused on using AI for image processing, so Serdiuk and Bielievtsov decided to do something different and focused on sound. They started to build software that allowed someone to speak using another person's voice – in short, enabling speech-to-speech conversion. They liked the project and decided to continue developing it.
Soon, they met Grant Reaber, a Carnegie Mellon alumnus interested in accent conversion, a somewhat similar field. The three decided to start a company, and Respeecher was born.
When MIT knocked on their door, their voice-conversion technology was still in the making, but they thought they were up to the task. They needed two things: old recordings of Richard Nixon and a recording of the script the President never delivered. MIT hired an actor to impersonate Nixon's speaking style, pronouncing certain words longer than others and making strategic pauses to add solemnity.
Then, using a deep neural net, Respeecher's engineers joined the two, adding Nixon's vocal timbre on top of the actor's performance, thus creating a deepfake audio recording. To anyone listening, the synthetic voice sounds natural, and it's indistinguishable from the original.
To achieve this level of quality, Serdiuk's team needed several hours of recording from both Nixon and the actor. Now, they've improved their technology, and the process is more straightforward.
"We usually ask for about 60 minutes of speech recordings for target and source voices," he says. "In many projects, we had less data or worse data, so we know how to work with all data."
Unlike text-to-speech conversions, which often sound artificial, Respeecher's technology helps preserve emotions. "Our goal was to make the quality on that level where it would be satisfactory for high-demanded sound professionals in Hollywood," says Serdiuk.
Respeecher currently employs about 20 experts and has high-profile clients such as Lucasfilm on its books. The startup has worked on several cutting-edge projects in the past few years. For example, it has recreated Michael York's voice, allowing him to talk about his rare disease, amyloidosis.
"It was a very cool project in terms of using the technology for someone whose voice is gone, who cannot use this voice anymore," says Serdiuk. His team brought back another iconic voice, that of late American football coach Vince Lombardi, who sent an encouraging message during the Super Bowl to those struggling with the pandemic. In addition to that, Respeecher also synthesized the voice of the young Luke Skywalker for the last episode of season two of The Mandalorian.
Serdiuk is optimistic, saying that his small Kyiv-based studio will continue to contribute to blockbusters: "It takes time to build credibility and reputation in Hollywood. But now, we are in a position where some cool projects are coming to us from word of mouth because some people in Hollywood use our technology, and they share this experience with their friends and coworkers."
Speech-to-speech conversions can be useful in a wide range of projects, from video games to films, from audiobooks to call centre assistants. Respeecher can perform male-to-female and female-to-male conversions, and in the future, it might even work for voice dubbing in foreign languages.
Voice cloning raises a number of ethical questions, and some find the technology disturbing. The documentary 'Roadrunner: A Film About Anthony Bourdain', which appeared in cinemas during the summer, faced criticism after it was revealed that some of the late chef's lines in the film were created using voice-cloning technology. Bourdain did indeed write those sentences, but there was no recording of him reading them.
The use of AI was not signaled to the audience. It was only revealed when director Morgan Neville mentioned it. Also, it's not clear whether the crew got permission from Bourdain's family to create his voice synthetically.
Serdiuk says he and the other two co-founders created a set of rules both they and their clients should follow. Respeecher does not provide a public API, and the company has plans to introduce an audio watermark that can be detected by specialized software. Additionally, when a client wants to clone someone's voice, they need written consent from that person or their family.
"In my opinion, there is nothing new about this technology that our society has never seen before," says Serdiuk. "It's not different from Photoshop, right?"
The entertainment industry has yet to regulate deepfakes, but Serdiuk believes the set of rules his team developed should be mandatory, given that online misinformation might become more prevalent. The recent Emmy his team contributed to might be a small step in raising awareness of the dangers of deepfakes.
"We do spend a lot of time educating, telling about what's possible, showing what's possible," he said. "And this MIT project with President Nixon is a good example for that."