Amazon, just say no: The looming horror of AI voice replication

If an AI like Alexa really can convert less than a minute of recorded voice into real-time speech, it opens the door to dystopian gaslighting at a whole new level. It's frightening, creepy, and disturbing.
Written by David Gewirtz, Senior Contributing Editor

Last week, we ran a news article entitled, "Amazon's Alexa reads a story in the voice of a child's deceased grandma." In it, ZDNet's Stephanie Condon discussed an Amazon presentation at its re:MARS conference (Amazon's annual confab on topics like machine learning, automation, robotics, and space).

In the presentation, Amazon's Alexa AI Senior VP Rohit Prasad showed a clip of a young boy asking an Echo device, "Alexa, can grandma finish reading me 'The Wizard of Oz'?" The video then showed the Echo reading the book using what Prasad said was the voice of the child's dead grandmother.

Hard stop. Did the hairs on the back of your neck just stand up? 'Cause that's not creepy at all. Not at all.

Prasad, though, characterized it as beneficial, saying "Human attributes of empathy and affect are key for building trust. They have become even more important in these times of the ongoing pandemic, when so many of us have lost someone we love. While AI can't eliminate that pain of loss, it can definitely make their memories last."

Hmm. Okay. So let's deconstruct this, shall we?

I hear dead people

There is a psychological sensory experience clinically described as SED, for "sensory and quasi-sensory experiences of the deceased." This is a more modern clinical term for what used to be described as hallucinations.

According to a November 2020 clinical study, Sensory and Quasi-Sensory Experiences of the Deceased in Bereavement: An Interdisciplinary and Integrative Review, SED experiences aren't necessarily a psychological disorder. Instead, somewhere between 47% and 82% of people who have experienced life events like the death of a loved one have experienced some sort of SED.

The study I'm citing is interesting in particular because it is both integrative and interdisciplinary, meaning it aggregates results of other studies across a variety of research areas. As such, it's a good summing up of the general clinical perception of SED.

According to the study, SED experiences cross boundaries: they are reported by all age groups, by members of many religions, across all types of relationship loss, and across differing circumstances of death.

But whether an SED experience is considered comforting or disturbing depends both on the individual and that individual's belief system. SED also manifests in all sorts of ways, from hearing footsteps, to experiences of presence, to sightings. It doesn't always have to be voice reproduction.

Overall, the report stops short of making a clinical value judgement about whether SED experiences are psychologically beneficial or detrimental, stating that further study is needed.

But -- bringing it back to Amazon's attempt to bottle dead Grandma's voice in a can -- it's absolutely unclear whether providing a child with a lost relative's voice will be comforting, or developmentally such a problem that it will provide continuing employment to therapists for years to come.

It is odd that Amazon chose to show voice replication from a deceased relative, rather than, say, a live and healthy grandmother who could record her voice for her cherished grandchild. But hey, if Amazon's researchers wanted to go for the macabre, who are we to judge?

That brings us to the discussion of voice replication overall. Apart from a few limited constructive applications, I'm not sure releasing voice replication AI technology into the wild is a good idea. Amazon says it can take a short voice sample and construct an entire dialog from it. There's something about this that seems terribly, horribly wrong.

What could possibly go wrong?

It almost sounds like how you'd describe a superpower in a show like The Umbrella Academy: upon hearing less than a minute of a person's voice, someone is able to say anything and make it sound exactly like that person had been the one to say it.

How could this possibly not be a force for good? Oh, boy. Buckle up.

We're not talking just accidentally creepy here, like when Alexa suddenly started screaming or breaking out into evil-sounding laughter. Weird things happen by accident when you're innovating in a new area. Those behaviors, once discovered, are fixed.

No, what we're talking about is what could happen if bad actors get their hands on this technology and decide to use it for profit... or worse.

The American Psychological Association Dictionary of Psychology defines "gaslighting" as:

To manipulate another person into doubting his or her perceptions, experiences, or understanding of events. The term once referred to manipulation so extreme as to induce mental illness or to justify commitment of the gaslighted person to a psychiatric institution but is now used more generally. It is usually considered a colloquialism, though occasionally it is seen in clinical literature.

The term originated in a 1938 stage play, which was later adapted as the 1944 movie "Gaslight."

Unfortunately, gaslighting has entered the digital realm. In 2018, The New York Times ran an article describing how digital thermostats, locks, and lights were becoming tools of domestic abuse.

The Times described how these devices are "being used as a means for harassment, monitoring, revenge and control." Examples included turning thermostats to 100 degrees or suddenly blasting music.

The American Public University Edge also talks about digital gaslighting. The article explains, "This type of activity allows an abuser to easily demonstrate control over the victim, no matter where the abuser may be. It is another method that the abuser uses to slowly chip away at a victim's self-esteem and further exacerbate the victim's stress."

Now, let's take it up a notch. Just how easy would it be to send someone over the edge if they kept hearing the voice of their dead father or mother? If an abuser can convince someone they're being haunted or are losing their ability to discern reality, that abuser could then substitute in a malevolent subjective reality.

The whole idea sounds like bad fiction, but gaslighting is so prevalent in domestic abuse that the National Domestic Violence Hotline has an entire page dedicated to the gaslighting techniques an abusive partner might use. If you find yourself in this situation, you can reach the hotline at 1-800-799-7233.

Let's take it up another notch: add stalkers to the mix. Let's say you're at home and you get a call from your mom. It's your mom's number on caller ID. You answer and it sounds like your mom. She's been in an accident, or is in some kind of trouble. She begs you to come out and get her. And you do. Because your mom called, and of course you know what she sounds like.

But it's not your mom. There are methods for spoofing caller ID, and with AI voice replication, the potential for luring a victim increases considerably. Pair that with the ability to purchase personally identifiable information (PII) with a shocking level of detail from many shady online purveyors, and you have a frightening scenario.

Don't dismiss these scenarios as low probability. The CDC reports that 1 in 6 women and 1 in 19 men have been stalked in their lifetime. The US Justice Department reports that "81 percent of women who were stalked by a current or former husband or cohabiting partner were also physically assaulted by that partner and 31 percent were also sexually assaulted by that partner." More than a million women and about 370,000 men are stalked annually.

So, do we really want to put the power of perfectly simulating a voice in the hands of stalkers and abusers?

Even if it's not for stalking or abuse, such a tool could also assist scammers. Like the previous example where Mom calls and asks you to pick her up (but, of course, it's not Mom), imagine a scenario where Dad gets a call at work from his daughter in college. She's had an emergency. Can he please send her a few thousand dollars?

Obviously, the quality of the grift will determine some of the believability of the call, but with enough available PII and a good script, someone will fall for the scam and give out credit card digits or wire money -- especially since the voice was the daughter's voice.

In combination with deepfake video technology, the potential for creating fake videos of individuals increases considerably. Whether that video is used by teenagers to bully a schoolmate, or by a disinformation campaign to convince a populace that a leader is up to no good, the idea of deepfakes with accurate voice representation is very troubling.

Constructive applications

It's only fair to say that this sort of technology has some positive potential as well. There are some entertainment industry applications where voice replication can add value.

For example, we've recently seen a young Luke Skywalker in the Disney+ series The Mandalorian and The Book of Boba Fett.

Luke's image was digitally created over actor Graham Hamilton, but Mark Hamill was credited in Episode 6: From the Desert Comes a Stranger, even though he didn't provide Luke's voice. Instead, the producers used a tool called Respeecher, which used a sound bank of old Mark Hamill recordings that were pieced together for the episode.

Another possible application might be in smart assistants (and smart assistance) for dementia sufferers. While it might be a very fine line between gaslighting someone with diminished mental capacity and helping them cope, under proper psychiatric care, voice recreation might have positive applications.

My wife and I had a great time traveling across the country with the wise guidance of Yoda. We had added his voice to our old GPS. During our travels, we had Yoda's voice guiding us, turn by turn. It was comforting, especially during those long open and empty stretches, to have Yoda's calming voice and statements like "left, you must turn" to keep us on track.

Sadly, the Yoda Positioning System is no longer available, which may (or may not) say something about the market viability for celebrity character voices in personal electronics.

Stopping to think

There is a line in Jurassic Park that comes to mind at times like this: "Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should." In the movie's case, it was about recreating dinosaurs from dino DNA. But it applies just as well to using AI to recreate people's voices.

Entertainment AI software Respeecher needs one to two hours of voice samples to recreate a voice. Amazon's new technology requires less than a minute of recordings. That makes a great many more recordings usable as source material, including messages captured from voice mail and even commands given to Alexa and Siri.

Given the prevalence of sales of PII, and even medical information, on the so-called Dark Web, it's logical to expect that hackers will also traffic in short voice recordings of potential victims, especially when those recordings only have to be a minute or so long.

That means that if Amazon does release its dead grandma skill to the Alexa platform, it will be accessible to a broad audience. It's even possible to use Alexa and Alexa skills on home-grown non-Alexa devices. So even if this technology is limited to Alexa, it has the potential to be very problematic.

I have to ask: does Amazon really need to release this technology into the wild as a skill?

To be fair, Amazon isn't going to be the only company exploring voice replication. Fortune Business Insights predicts the global speech and voice recognition market will reach $28.3 billion by 2026 at an annual growth rate of almost 20%. With those kinds of numbers, you can be sure there will be other participants in this arena.

Protecting users from digital gaslighting, stalking, and scams will get progressively more difficult, and voice replication only makes it worse.

Writing in the Lawfare Blog, Dr. Irving Lachow, deputy director, cyber strategy and execution at the MITRE Corporation and a visiting fellow at the Hoover Institution, describes this situation as "PsyOps in the home."

He states that although there are anti-stalking laws on the books, "Many of these measures cannot be applied directly to cyber gaslighting because, unlike the stalkerware situation, abusers are not adding software to home-based smart devices in order to harass their victims. Instead, they are using the devices as they were intended to be used."

He also says that legal challenges are difficult because the executives of companies producing these technologies are somewhat untouchable. He states, "The technologies in question have legitimate and positive uses, and one cannot realistically target the executives of smart device companies just because their equipment has been used to cause someone harm."

Clearly, this is an issue that needs more consideration. Companies like Amazon need to evaluate carefully whether the features they're adding do more harm than good. Cybersecurity experts need to continue to harden IoT devices against outside hacking. Therapists and psychologists need to increase their awareness of digital gaslighting and other 21st-century threats.

But individuals can also protect themselves. Malware intelligence analyst Christopher Boyd, writing for the Malwarebytes blog, recommends keeping detailed records of incidents and logging any data produced by devices. We add that it's important to manage your own passwords, use strong passwords, and, if you're expecting trouble, learn how to lock down your devices.

Lachow reports, "Smart devices are ripe for exploitation in domestic abuse scenarios because often one person, usually a man, controls the information technology (IT) for the house. If the IT manager moves out but retains access to home-based smart devices via mobile apps or online interfaces, he or she could control the household environment."

Keep that in mind and learn all you can. As for AI-based systems learning to replicate the voices of deceased relatives, I have to say "Just say no." No. No. No-no-no. Baaaad things could happen.

I'm sure you can think of plenty of horrifying scenarios. Share those with us in the comments below. The more we're thinking about this and the more we're aware of it, the better we can prepare for it.

You can follow my day-to-day project updates on social media. Be sure to follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
