The researchers' abstract begins benignly enough. It uses lots of words, phrases, and acronyms that aren't familiar to, say, many lay human language models. It explains that the neural codec language model is called VALL-E.
Surely this name is supposed to soften you up. What could be scary about a technology that almost sounds like that cute little robot from a heartwarming movie?
Well, this perhaps: "VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt."
I've often wanted to emerge learning capabilities. Instead, I've had to resort to waiting for them to emerge.
And what emerges from the researchers' last sentence is shivering. Microsoft's big brains now only need 3 seconds of you saying something in order to fake longer sentences and perhaps large speeches that weren't made by you, but sound pretty much like you.
I won't descend into the science too much, as neither of us would benefit from that.
I'll merely mention that VALL-E uses an audio library put together by one of the world's most admired, trustworthy companies -- Meta. Called LibriLight, it's a repository of 7,000 people talking for a total of 60,000 hours.
I listened to a male speaking for 3 seconds. Then I listened to the 8 seconds his VALL-E version had been prompted to say: "They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission."
I defy you to notice much difference, if any.
It's true that many of the prompts sounded like very bad snippets of 18th century literature. Sample: "Thus did this humane and right-minded father comfort his unhappy daughter, and her mother, embracing her again, did all she could to soothe her feelings."
But what could I do other than listen to more examples presented by the researchers? Some VALL-E versions were a touch more suspicious than others. The diction didn't feel right. They felt spliced.
The overall effect, however, is pertinently scary.
You've been warned already, of course. You know that when scammers call you, you shouldn't speak to them in case they record you and then recreate your diction to make your abstracted voice nefariously order expensive products.
This, though, seems another level of sophistication. Perhaps I've already watched too many episodes of Peacock's "The Capture," where deepfakes are presented as a natural part of government. Perhaps I really shouldn't be worried because Microsoft is such a nice, inoffensive company these days.
However, the idea that someone, anyone, can be easily fooled into believing I'm saying something that I didn't -- and never would -- doesn't garland me with comfort. Especially as the researchers claim they can replicate the "emotion and acoustic environment" of one's initial 3 seconds of speech too.
You'll be relieved, then, that the researchers may have spotted this potential for discomfort. They offer: "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker."
The solution? Building a detection system, say the researchers.
Which may leave one or two people wondering: "Why did you do this at all, then?"
Quite often in technology, the answer is: "Because we could."