Meta unveils Voicebox AI to replicate the voices of your friends and loved ones

The technology breakthrough was announced via a published paper. Though it's not widely available, you can listen to demos.
Written by Maria Diaz, Staff Writer

As AI chatbots and art generators seem to gain more popularity by the minute, some of the most prominent players in the business are trying to stay in the game with their own tools. Meta just presented Voicebox, a text-guided, artificially intelligent speech generator so powerful that the company claims it outperforms all existing models.

Voicebox is powerful enough to generate voices as easily as ChatGPT can generate text and Bing or Dall-E 2 can create images. Though the system isn't yet widely available for public use, Meta has made demos accessible to anyone interested in learning more about Voicebox. 

The system could be used in audio editing by content creators and editors, for example, as its voice generation makes for natural-sounding audio clips. It's also versatile enough to edit noise, like a dog barking, out of voice clips and regenerate the affected speech without missing a beat.

One of Voicebox's abilities is matching the audio style of a sample when it generates text-to-speech clips. For example, a visually impaired user could give Voicebox an audio clip of a friend as short as two seconds, and it would be able to read that friend's written messages aloud in their voice.

The new generative AI tool can solve tasks via in-context learning, so it can process text it has never been given before and generate fitting context and inflections, much as a person reading aloud draws on existing knowledge to tackle new material.

The ethical and legal implications of this groundbreaking tool are not easily dismissed. Anyone could use recordings of a person's voice, without permission, to generate audio clips that make it sound as though that person said anything the creator wants.

In the published paper, Meta claims that a binary classification model can distinguish between real-world speech and that which Voicebox generates. Either way, since the system is not publicly available, Meta's metaphorical feet are yet to be held to the fire.

Meta trained Voicebox on 60,000 hours of English audiobooks and 50,000 hours of multilingual audiobooks in six languages. That training enables it to perform multilingual text-to-speech without task-specific training, as well as speech denoising, styling, editing, and generation of diverse speech samples.

In the paper, published by Meta AI, the company claims Voicebox can generate diverse audio samples 20 times faster than Microsoft's VALL-E and that its output is more intelligible.

Meta also claims that, aside from being faster and making fewer errors than competitors, Voicebox can convert written text into spoken words in one or multiple languages without being trained for each language separately.

Compared to the previous state-of-the-art model, YourTTS, Voicebox was found to reduce the average word error rate from 10.9% to 5.2%, as well as increase the audio similarity from 0.335 to 0.481.
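
To put those figures in perspective, the relative change can be worked out from the numbers reported above. The short Python sketch below is only an illustration of that arithmetic; the helper name relative_change is a made-up convenience and does not come from Meta's paper.

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


# Figures quoted in the article: YourTTS (old) versus Voicebox (new).
wer_change = relative_change(10.9, 5.2)      # word error rate, in percent
sim_change = relative_change(0.335, 0.481)   # audio similarity score

print(f"Word error rate:  {wer_change:+.1%}")   # prints -52.3%
print(f"Audio similarity: {sim_change:+.1%}")   # prints +43.6%
```

In other words, going by the reported numbers, Voicebox roughly halves the word error rate and raises the similarity score by a little over 40% relative to YourTTS.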
