The future of UIs: A computer you can look in the eye

Summary: A photorealistic digital avatar has been unveiled, offering a glimpse of what researchers say could one day be the face of tech from smartphones to digital receptionists.

Sci-fi sitcom Red Dwarf envisaged a future where computers appeared as disembodied heads that chatted to humans, albeit in a bored and offhand manner.

A similar vision of what next-generation user interfaces might look like went on show today when the virtual assistant Zoe was revealed by researchers at Toshiba's Cambridge Research Lab and the University of Cambridge's Department of Engineering.

Zoe is a 2D photorealistic digital avatar that can recite speech and display a range of emotions, courtesy of a text-to-speech engine and face-modelling program. It appears to the user as a head floating in space, as you can see in the demo video below.

The idea is that interfaces like Zoe could one day be the face of smartphone assistants like Siri, of audio books or on automated kiosks in, say, a doctor's surgery reception.

"In the short term I can imagine people using it with something like Siri on their phone, said Bjorn Stenger, head of the computer vision group at Toshiba Research Europe.

"Longer term, you could have it as an interactive assistant or someone who could look up things for you, teach you a language or chat with you about the news, but that's probably a little bit off."

Another possibility is that smartphone users may one day be able to create their own virtual assistants (VAs) using training systems similar to those that generated the data for Zoe, researchers believe. These custom VAs could allow people to send face messages, in which a virtual version of themselves reads out their message while looking and sounding happy, sad or whatever emotion is desired.

Talking avatars are nothing new: the digital newsreader Ananova dates back to the turn of the century, but Zoe is able to reflect a more believable range of human emotions on its face and through its voice, said Stenger.

"Obviously there have been talking heads before but this approach is more flexible and realistic than before," he said.

The flexibility in what Zoe can say and the emotions it can express comes from a large store of English phonemes (the units of sound that make up a spoken language) and captured facial expressions, which Zoe's text-to-speech and face-modelling engines can draw upon.

This store was gathered from high-definition video of Hollyoaks actress Zoe Carpenter reading thousands of lines of text from a wide variety of sources, from newspapers to phone directories.

Visual recognition software analysed the video to capture the shape and position of the face when uttering different phonemes, as well as when expressing different moods.

Meanwhile speech analysis software captured the phonemes that made up the words, and how these same sounds varied according to mood.

By combining these different data points, Zoe can recreate myriad emotions and read the majority of sentences it is given convincingly, Stenger said. For instance, combining happiness with tenderness and slightly increasing the speed and depth of the voice makes it sound friendly and welcoming. A combination of speed, anger and fear makes Zoe sound as if it is panicking.

Zoe currently exists as a test system where the user types in the words they want it to say and selects one of six preset moods - happy, sad, tender, angry, afraid and neutral - as well as setting the intensity of that emotion and the depth, pitch and speed of the voice. These settings are used to generate just under 50 parameters that dictate how to animate Zoe's face.
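
To make the shape of that test system concrete, here is a minimal sketch in Java, the language used for Zoe's end-user client (see below); the class name, slot layout and values are assumptions for illustration, not the researchers' actual code:

    // A minimal sketch only - the real Toshiba/Cambridge code is not public,
    // and the slot layout, class and field names here are all assumptions.
    public class ZoeParameterSketch {

        enum Mood { HAPPY, SAD, TENDER, ANGRY, AFRAID, NEUTRAL }

        // "Just under 50" animation parameters, per the article; 48 is a guess.
        static final int PARAMETER_COUNT = 48;

        // Bundle the test system's inputs (one preset mood, its intensity and
        // the three voice controls) into the vector the animation engine reads.
        static float[] buildParameters(Mood mood, float intensity,
                                       float depth, float pitch, float speed) {
            float[] p = new float[PARAMETER_COUNT];
            p[mood.ordinal()] = intensity; // one weight per preset emotion
            p[6] = depth;
            p[7] = pitch;
            p[8] = speed;
            // The remaining slots would be filled from the text itself, e.g.
            // per-phoneme mouth shapes chosen by the text-to-speech and face models.
            return p;
        }

        public static void main(String[] args) {
            // A fast, fearful request - loosely echoing the article's example of
            // speed, anger and fear making Zoe sound as if it is panicking.
            float[] panic = buildParameters(Mood.AFRAID, 0.9f, 0.4f, 0.6f, 1.5f);
            System.out.println(panic.length + " animation parameters generated");
        }
    }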

The virtual assistant doesn't exist outside the lab at present, and Stenger says the group will continue to focus on improving Zoe's believability. For Zoe to function as a virtual assistant that can field human queries, it would have to be combined with a speech recognition engine and a branching dialogue system, but this is not something the researchers are looking at for now.

The team who created Zoe are working with a school for autistic and deaf children, where the technology could be used to help pupils to "read" emotions and lip-read.

The researchers built the text-to-speech engine, the face capture and modelling software, and the system training algorithms. A variety of programming languages were used, but where performance was important they chose C++. There are no plans to open-source the code at present.

Zoe's calibration is being carried out on a Linux cluster, and the text-to-speech and face-modelling engines run on a Linux server. The end-user interface showing Zoe is a Java client only tens of MB in size, so it is multiplatform and would sit happily on a smartphone or tablet.
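
As a similarly hedged sketch of that client/server split (the host name, endpoint, JSON fields and response format are all invented for illustration), the thin Java client would do little more than post the text and settings to the server and play back what comes back:

    // Hypothetical client for the split described above; URL, fields and
    // response format are assumptions, not the actual system's interface.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ZoeClientSketch {
        public static void main(String[] args) throws Exception {
            // The heavy text-to-speech and face-modelling work stays on the
            // Linux server; the client only posts a request and renders the
            // reply, which is why it can stay at tens of MB.
            String body = "{\"text\":\"Welcome to the surgery\"," +
                    "\"mood\":\"tender\",\"intensity\":0.7," +
                    "\"depth\":0.5,\"pitch\":0.5,\"speed\":1.0}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://zoe-server.example/render")) // placeholder host
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<byte[]> reply = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofByteArray());

            // A real client would decode this payload (audio plus animation
            // frames) and play it back against the avatar model.
            System.out.println("Received " + reply.body().length + " bytes");
        }
    }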

The prospect of real-life receptionists being replaced with automated systems might fill certain people with dread rather than excitement, and Stenger says he shares that apprehension about such interfaces being misused.

"I think one has to be really careful not to annoy people with bad systems," he said.

"But eventually interfaces that are more natural to interact with will come, I'm sure. It's more intelligible to hear a voice and see a face."

Topics: Enterprise Software, Emerging Tech

About

Nick Heath is chief reporter for TechRepublic UK. He writes about the technology that IT decision-makers need to know about, and the latest happenings in the European tech scene.

Talkback

12 comments
  • The friendly face of your computer is in a lot of science fiction

    Just find it a bit sad that the expected usage is to chat to you... Like people used to before we invented it.
    MarknWill
  • Artificial Intelligence Human Like Robot 2012

    Hi Nick, have you seen this? Thought you might want to do a write-up on it as well.

    http://www.youtube.com/watch?v=_ySljCcnq4o
    silhouett1
  • And what if we neither need nor want them?

    If I'm going to be talking to a face, I want it to be the face of a *live* person, not an AI or a program avatar. I reserve my time for interacting with a virtual "talking head" for when I'm watching sci-fi movies or playing video games.

    If I need information from the computer, I can read it *much* faster than it can render it into audio. That saves me time, and limits any translation issues.
    spdragoo
  • Has anyone ever self-checked in at a doctor's office?

    "The idea is that interfaces like Zoe could one day be the face of smartphone assistants like Siri, of audio books or on automated kiosks in, say, a doctor's surgery reception."

    I never did. Forget the floating heads - they still have paper lists for you to sign into, and you still have to talk to the receptionist. Audio books are audio for a purpose - people listen to them while doing something else. As to the kiosks - imagine a floating head on the screen at the self-check-in at an airport.
    ForeverSPb
  • I don't see the need.

    I don't see the need.

    As indicated, this has been done before. And guess what? It never took off. And while being "more expressive" may look good on a senior project - I don't see it becoming mainstream. Nobody I know of would benefit from this.

    I seriously doubt a single receptionist will have his/her job replaced with this.
    CobraA1
    • It's called saving money!

      More and more jobs are being automated; those factories full of robots used to be full of workers. When they get the tech right and the price down, receptionists will no doubt be targeted. If the tech keeps evolving, robots will be able to fix themselves.
      Check out this link, good book, hard to find but a very good read. It's futuristic rather than hard core SciFi.

      http://en.wikipedia.org/wiki/User:Ylee/The_Two_Faces_Of_Tomorrow
      martin_js
      • Cheaper?

        So, the time/energy to program an artificial intelligence program, not to mention the additional time/energy to run the "talking head", is *cheaper* than a normal text/menu-selection interface like we already have available?

        That's like saying building a stealth fighter is cheaper than building a non-stealthy civilian helicopter...
        spdragoo
      • Been available for a while.

        As indicated, such things have been available for a while.

        That would seem to suggest that it's either not cheaper as claimed - or that businesses are unconvinced that being cheaper is worth it, considering how cold and impersonal it makes the business appear if it's not using humans for receptionists.

        "Check out this link, good book, hard to find but a very good read."

        Checking out the wikipedia page - yeah, I've read similar stories. Whether it will really turn out in such a way is speculation, not fact. As far as we know, it could turn out very differently.

        It all hinges on the idea that the ultimate destiny of technology is unavoidable - not something I agree with. I believe that we can and should control where technology brings us.
        CobraA1
  • Why am I thinking

    The ship's computer, Holly, from Red Dwarf.
    Richardbz
    • Maybe because Nick Heath mentioned it

      "Sci-fi sitcom Red Dwarf envisaged a future where computers appeared as disembodied heads that chatted to humans, albeit in a bored and offhand manner."

      However, if you go back to the 70s, a little-known and, quite frankly, not very good SF series called Star Lost had the same thing. So Red Dwarf wasn't first with it.
      mheartwood
  • Naaa, old stuff

    Japan is already experimenting with robot to great you at the reception desk. Why would you want a dodgy picture on a screen when a Robot can do it better. lol
    martin_js
    • should read

      Robots to greet you at the reception desk.
      martin_js