As AI chatbots and art generators seem to gain more popularity by the minute, some of the most prominent players in the business are trying to stay in the game with their own tools. Meta just presented Voicebox, a text-guided, artificially intelligent speech generator so powerful that the company claims it outperforms all existing models.
Voicebox can generate voices as easily as ChatGPT can generate text and Bing or DALL-E 2 can create images. Though the system isn't yet widely available for public use, Meta has made demos accessible to anyone interested in learning more about Voicebox.
Content creators and editors could use the system for audio editing, for example, as its voice generation makes for natural-sounding audio clips. It's also versatile enough to intelligently edit noise, like a dog barking, out of a voice clip and regenerate the affected speech without missing a beat.
Voicebox can also match the audio style of a sample and use it to generate text-to-speech clips. For example, a visually impaired user could give Voicebox an audio clip of a friend as short as two seconds, and it would be able to read that friend's written messages aloud in the friend's voice.
The new generative AI tool can solve tasks via in-context learning. It can process text it has never been given before and generate fitting context and inflections, much as a person reading aloud draws on existing knowledge to tackle new material.
The ethical and legal implications of this groundbreaking tool are not easily dismissed. Anyone could use recordings of a person's voice, without permission, to generate audio clips that make that person appear to say anything the creator wants.
In the published paper, Meta claims that a binary classification model can distinguish between real-world speech and that which Voicebox generates. Either way, since the system is not publicly available, Meta's metaphorical feet are yet to be held to the fire.
Meta trained Voicebox on 60,000 hours of English audiobooks and 50,000 hours of multilingual audiobooks in six languages. That training enables it to perform multilingual text-to-speech, speech denoising, styling, and editing, and to generate diverse speech samples, without being trained specifically for each task.
In a paper published by Meta AI, the company claims Voicebox can generate diverse audio samples 20 times faster than Microsoft's VALL-E and that its output is more intelligible.
Aside from being faster and making fewer errors than competitors, Meta claims Voicebox can convert written text into spoken words in one or multiple languages without being trained separately for each language.
Compared to the previous state-of-the-art model, YourTTS, Voicebox was found to reduce the average word error rate from 10.9% to 5.2%, as well as increase the audio similarity from 0.335 to 0.481.
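For readers unfamiliar with the metric, word error rate is conventionally computed by transcribing the generated audio and counting the word-level substitutions, deletions, and insertions needed to turn that transcript into the reference text, divided by the number of reference words. The Python sketch below illustrates that standard definition only; it is not Meta's evaluation code, and the function names and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fraction by which a metric dropped going from old to new."""
    return (old - new) / old


if __name__ == "__main__":
    # Hypothetical transcript with one substitution and one deletion.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Relative drop implied by the reported figures (10.9% to 5.2%).
    print(relative_reduction(10.9, 5.2))
```

By that measure, going from 10.9% to 5.2% means Voicebox's output was transcribed with roughly half as many word errors as YourTTS's.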