Sci-fi sitcom Red Dwarf envisaged a future where computers appeared as disembodied heads that chatted to humans, albeit in a bored and offhand manner.
A similar vision of what next-generation user interfaces might look like went on show today when the virtual assistant Zoe was revealed by researchers at Toshiba's Cambridge Research Lab and the University of Cambridge's Department of Engineering.
Zoe is a 2D photorealistic digital avatar that can recite speech and display a range of emotions, courtesy of a text-to-speech engine and face-modelling program. It appears to the user as a head floating in space, as you can see in the demo video below.
The idea is that interfaces like Zoe could one day be the face of smartphone assistants like Siri, of audio books, or of automated kiosks in, say, the reception of a doctor's surgery.
"In the short term I can imagine people using it with something like Siri on their phone," said Bjorn Stenger, head of the computer vision group at Toshiba Research Europe.
"Longer term, you could have it as an interactive assistant or someone who could look up things for you, teach you a language or chat with you about the news, but that's probably a little bit off."
Another possibility is that smartphone users may one day be able to create their own virtual assistants (VAs) using training systems similar to those that generated the data for Zoe, researchers believe. These custom VAs could allow people to send face messages, where a virtual version of themselves reads out their message while looking and sounding happy, sad or whatever emotion is desired.
Talking avatars are nothing new: the digital newsreader Ananova dates back to the turn of the century, but Zoe is able to reflect a more believable range of human emotions on its face and through its voice, said Stenger.
"Obviously there have been talking heads before but this approach is more flexible and realistic than before," he said.
The flexibility in what Zoe can say and the emotions it can express comes from a large store of English phonemes (the units of sound that make up a spoken language) and captured facial expressions, which Zoe's text-to-speech and facial modelling engines can draw upon.
This store was gathered from high-definition video of actress Zoe Lister, who played Zoe Carpenter in Hollyoaks, reading thousands of lines of text from a wide variety of sources, from newspapers to phone directories.
Visual recognition software analysed the video to capture the shape and position of the face when uttering different phonemes, as well as when expressing different moods.
Meanwhile speech analysis software captured the phonemes that made up the words, and how these same sounds varied according to mood.
By combining these different data points, Zoe can recreate myriad emotions and read the majority of sentences it is given convincingly, Stenger said. For instance, combining happiness with tenderness and slightly increasing the speed and depth of the voice makes it sound friendly and welcoming. A combination of speed, anger and fear makes Zoe sound as if it is panicking.
Zoe currently exists as a test system where the user types in the words they want it to say and selects one of six preset moods - happy, sad, tender, angry, afraid and neutral - as well as setting the intensity of that emotion and the depth, pitch and speed of the voice. These settings are used to generate just under 50 parameters that dictate how to animate Zoe's face.
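As a rough illustration of the controls described above, the test system's inputs could be modelled along the following lines. This is a minimal sketch and not the researchers' code: the class name `VoiceSettings`, its fields, the value ranges and the idea of expressing a blend of moods as weights are all assumptions made for illustration. Only the six mood names and the depth, pitch and speed controls come from the description of the system.

```python
from dataclasses import dataclass, field
from typing import Dict

# The six preset moods named in the article.
MOODS = ("happy", "sad", "tender", "angry", "afraid", "neutral")


@dataclass
class VoiceSettings:
    """Hypothetical container for the controls the test system exposes."""

    text: str
    # Assumed representation: mood name -> intensity weight.
    mood_weights: Dict[str, float] = field(
        default_factory=lambda: {"neutral": 1.0}
    )
    # Assumed to be offsets from a neutral voice, 0.0 meaning "unchanged".
    depth: float = 0.0
    pitch: float = 0.0
    speed: float = 0.0

    def __post_init__(self) -> None:
        unknown = set(self.mood_weights) - set(MOODS)
        if unknown:
            raise ValueError(f"unknown mood(s): {sorted(unknown)}")
        if any(w < 0 for w in self.mood_weights.values()):
            raise ValueError("mood weights must be non-negative")
        if sum(self.mood_weights.values()) <= 0:
            raise ValueError("at least one mood needs a positive weight")

    def normalised_moods(self) -> Dict[str, float]:
        """Return a weight for every mood, scaled so the weights sum to 1."""
        total = sum(self.mood_weights.values())
        return {m: self.mood_weights.get(m, 0.0) / total for m in MOODS}


# A "friendly and welcoming" request: happiness plus tenderness,
# with slightly increased speed and depth.
friendly = VoiceSettings(
    text="Welcome, please take a seat.",
    mood_weights={"happy": 0.6, "tender": 0.4},
    depth=0.1,
    speed=0.1,
)
print(friendly.normalised_moods())
```

How settings like these would be turned into the parameters that animate the face is not described, so the sketch stops at validating and normalising the inputs.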
The virtual assistant doesn't exist outside of the lab at present and Stenger says the group will continue to focus on improving Zoe's believability. For Zoe to function as a virtual assistant that can field human queries, it would have to be combined with a speech recognition engine and a branching dialogue system, but this is not something the researchers are currently looking at.
The team who created Zoe are working with a school for autistic and deaf children, where the technology could be used to help pupils to "read" emotions and lip-read.
The researchers built the text-to-speech engine, face capture and modelling software and system training algorithms. A variety of programming languages were used but where performance was important they chose C++. There are no plans to open source the code at present.
Zoe's calibration is being carried out on a Linux cluster and the text-to-speech and face modelling engine runs on a Linux server. The end-user interface showing Zoe is a Java client, so it is multiplatform, and at only tens of MB in size it would sit happily on a smartphone or tablet.
The prospect of real-life receptionists being replaced with automated systems might fill certain people with dread rather than excitement, and Stenger says he shares that apprehension about such interfaces being misused.
"I think one has to be really careful not to annoy people with bad systems," he said.
"But eventually interfaces that are more natural to interact with will come, I'm sure. It's more intelligible to hear a voice and see a face."