Talking PCs? Talk to the hand

Voice recognition and speech synthesis technologies may not have developed to the degree some science fiction writers hoped, but have nevertheless seen some startling successes
Written by Nick Hampshire, Contributor

Being able to chat with a computer in plain English has been the standard fare of science fiction for decades, and yet, despite many promises from forecasters and other experts, we're still a long way from turning fantasy into fact.

Voice synthesis has been around for a long time. Bell Labs demonstrated a computer-based speech synthesis system running on an IBM 704 in 1961. The author Arthur C. Clarke saw the demonstration, which gave him the inspiration for the talking computer HAL 9000 in his book and film "2001: A Space Odyssey".

Forty-five years later, voice synthesis technology can be found in products as diverse as talking dolls, car information systems and various text-to-speech conversion services such as the one recently launched by BT. Many of these modern systems can convert text into a computer synthesised voice of quite respectable quality.

However, the problems faced by voice technology developers primarily lie not in getting a computer to talk, but in getting it to listen. Voice recognition has turned out to be a much harder task than researchers realised when work began on the problem over forty years ago. Even so, limited voice recognition applications are starting to creep into everyday use: voice input telephone menu systems are now commonplace, speech-to-text dictaphones are increasingly used for note-taking by doctors and lawyers, and voice input has started to appear in computer games systems.

The success of some of these limited-application voice recognition systems has recently prompted the big software heavyweights, Microsoft and IBM, to make further investments. IBM has hired more than a hundred extra speech technology researchers, with the aim of developing a system capable of matching the human level of speech recognition by 2010. And Bill Gates recently said that "we [Microsoft] aim to have computer systems capable of matching a human level of speech recognition by 2011".

If these predictions are true, then it means that within five years we could see the science fiction writers' vision of speech interaction with computers become a reality. However, there are still a lot of technological hurdles to overcome; to understand what these are, we need to delve further into the technology.

Speech synthesis
Speech synthesis, or Text to Speech (TTS), systems all consist of two parts: the front end, which converts the text file into a "symbolic linguistic representation", and the back end, which takes this symbolic representation and converts it into a speech waveform.

The front end first converts things like numbers and abbreviations into their written word equivalents to produce a normalised text. The next step is to phonetically transcribe each word, and divide the text into prosodic units such as phrases, clauses and sentences. The trouble is that text is full of words that are pronounced differently depending upon the context in which they are used, and this has required the development of sophisticated heuristic techniques that look at neighbouring words and statistics of frequency of occurrence in order to guess the proper pronunciation. The sequence of phonemes is then produced using either a dictionary or a rule-based approach.
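The normalisation step described above can be illustrated with a minimal Python sketch. The abbreviation table and the function names `number_to_words` and `normalise` are hypothetical and chosen only for illustration; a production front end would use far larger tables and the context-sensitive heuristics mentioned above.

```python
import re

# Hypothetical, tiny abbreviation table. Note that "St." is itself ambiguous
# ("Street" or "Saint"), which is exactly the kind of case that needs context.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in British English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def normalise(text: str) -> str:
    """Expand known abbreviations and stand-alone numbers of up to three digits."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Longer digit runs (years, phone numbers) are left untouched in this sketch.
    return re.sub(r"\b\d{1,3}\b",
                  lambda match: number_to_words(int(match.group())), text)


print(normalise("Dr. Smith lives at 42 Elm St."))
```

The output of this stage is plain text containing only words, which is what the later phonetic transcription step expects.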

The development of the front end speech synthesis system has been the subject of a lot of work over the years, and has been complicated by the fact that the conversion requirements for every language are different. Thus the requirements for Spanish, which has a regular writing system, differ from those of English, which has a very irregular spelling system.

The back end speech synthesis system is where the biggest advances have taken place over the last few years. It is this system that dictates the naturalness and intelligibility of the synthesised speech, and is why we have moved from the very mechanical robotic-sounding synthesised speech of a decade or so ago to a naturalness and intelligibility that is often barely distinguishable from the voice of a human.

This naturalness and intelligibility has been particularly important where synthesised speech is used in automated telephone response systems. These are now often extremely sophisticated, and have started to be used to replace human operators in some call centre applications. Such applications are also driving the development of speech synthesis mark-up languages such as the XML-compliant SSML proposed by the W3C.

The two main technologies used in new generation back end systems are concatenative synthesis and formant synthesis. Concatenative synthesis is based upon the stringing together of a lot of small segments of pre-recorded speech, and the output can often be indistinguishable from real human voices. However, this comes at the cost of very large speech databases, often involving gigabytes of data and as much as a hundred hours of recorded speech.
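As a rough illustration of the stringing-together idea, the Python sketch below joins short waveform segments with a linear crossfade at each boundary. The function name `crossfade_concat` and the toy sample values are assumptions made for the example; real systems select units from a large recorded database and use far more careful smoothing at the joins.

```python
def crossfade_concat(segments, overlap):
    """Join waveform segments, blending `overlap` samples at each boundary.

    Each segment is a list of float samples and must contain at least
    `overlap` samples. A linear crossfade is one simple way to soften the
    audible glitches that can occur where two recordings meet.
    """
    output = list(segments[0])
    for segment in segments[1:]:
        for i in range(overlap):
            weight = (i + 1) / (overlap + 1)  # ramps towards the new segment
            position = len(output) - overlap + i
            output[position] = (1 - weight) * output[position] + weight * segment[i]
        output.extend(segment[overlap:])
    return output


# Toy "recordings": constant-valued segments standing in for speech units.
unit_a = [1.0, 1.0, 1.0, 1.0]
unit_b = [0.0, 0.0, 0.0, 0.0]
print(crossfade_concat([unit_a, unit_b], overlap=2))
```

With an overlap of two samples, the joined waveform is two samples shorter than the segments laid end to end, and the values step down gradually across the join rather than jumping.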

Formant synthesis, on the other hand, uses rule-based techniques to generate the different voice waveforms, and as such does not suffer from the acoustic glitches that often appear in concatenative systems. This technique also offers better control over vocal intonation, tone and emotion, but has proved far more complex and has until quite recently produced synthesised speech that is very robotic. However, recent developments, particularly from Japanese developers of humanoid robots, have seen the development of more sophisticated electromechanical analogues of the human vocal tract that promise much more natural-sounding formant speech synthesis.
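To give a flavour of the rule-based approach, here is a minimal Python sketch of a software formant synthesiser: a train of pulses, standing in for the vocal cords, is passed through a chain of two-pole resonators, one per formant. The resonator formula is the standard digital form used in Klatt-style synthesisers; the formant frequencies and bandwidths are approximate, illustrative values for an "ah" vowel, and the function names are hypothetical.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1 - b - c  # normalises the gain at 0 Hz to one
    y1 = y2 = 0.0
    output = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        output.append(y)
        y2, y1 = y1, y
    return output


def synthesise_vowel(formants, pitch_hz=120, duration_s=0.3, sample_rate=16000):
    """Excite a cascade of formant resonators with a simple pulse train."""
    n_samples = round(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Approximate (frequency, bandwidth) pairs in Hz for an "ah" sound; illustrative only.
AH_FORMANTS = [(730, 90), (1090, 110), (2440, 170)]
samples = synthesise_vowel(AH_FORMANTS)
print(len(samples), "samples generated")
```

Because every parameter is a number rather than a recording, pitch and vowel quality can be changed simply by passing different values, which is the control advantage described above. The unvarying pulse train also shows why such output tends to sound robotic.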

The relative sophistication of current speech synthesis technology means that it is not surprising that voice response systems are now incorporated into a very wide range of products, ranging from toys and computer games, to aircraft and automobile alert and warning systems, whilst text to speech systems are used for applications ranging from the generation of complex scripted phone messages, to reading aids for the blind.

Speech recognition
Speech recognition, on the other hand, is a much harder task, and commercial off-the-shelf systems have only been available since the 1990s. Because every person's voice is different, and words can be spoken in a range of different nuances, tones and emotions, the computational task of successfully recognising spoken words is considerable, and has been the subject of many years of continuing research work around the world.

A variety of different approaches are used, including dynamic algorithms, neural networks, and knowledge bases, with the most widely used underlying technology being the Hidden Markov Model. These techniques all attempt to search for the most likely word sequence given the fact that the acoustic signal will also contain a lot of background noise. The task is made easier if the system can be trained to recognise one person's voice pattern rather than that of many people, and it is also easier if isolated words are to be recognised rather than continuous speech. Similarly, the task is easier if the vocabulary is small, the grammar constrained and the context well-defined.
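The search for the most likely sequence in a Hidden Markov Model is commonly carried out with the Viterbi algorithm. The Python sketch below applies it to a deliberately tiny, made-up model in which the hidden states are sound classes and the observations are crude "loud" or "quiet" frames. All of the probabilities are invented for illustration; a real recogniser works with thousands of states and continuous acoustic features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model; every number here is an assumption.
STATES = ["vowel", "consonant"]
START = {"vowel": 0.5, "consonant": 0.5}
TRANS = {"vowel": {"vowel": 0.7, "consonant": 0.3},
         "consonant": {"vowel": 0.4, "consonant": 0.6}}
EMIT = {"vowel": {"loud": 0.8, "quiet": 0.2},
        "consonant": {"loud": 0.3, "quiet": 0.7}}

path, probability = viterbi(["loud", "loud", "quiet"], STATES, START, TRANS, EMIT)
print(path, probability)
```

For the three frames shown, the most likely explanation under this toy model is two vowel frames followed by a consonant frame. Multiplying raw probabilities is fine for a short example, but practical systems sum log probabilities to avoid numerical underflow on long utterances.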

Grammar and context are especially important elements in speech recognition, particularly in a highly complex language like English, and this has taken speech recognition system developers into areas like natural language analysis and comprehension.

The complexity of these problems has meant that most of the voice recognition systems developed to date are either small-vocabulary isolated-word recognition systems or large-vocabulary single-speaker recognition systems. Researchers are still a few years away from being able to produce a general purpose automatic speech recognition system that can recognise continuous speech from a wide variety of people and with a wide vocabulary as successfully as any human listener.

Although speaker-dependent large-vocabulary dictation systems now work quite well on a PC, they have not proved as popular as many predicted. This is because in most situations it is quicker and easier to edit a document using a conventional keyboard and mouse. Furthermore, the high background noise levels found in the average office make recognition hard, and recognition rates can fall as low as 50 percent, compared with up to 98 percent in a quiet office.

The application of speech recognition has been more successful in telephony, in applications that are not automatable using conventional push-button interactive voice response systems, such as directory assistance. Speech recognition technology is today widely used in automated phone-based information systems, such as travel booking and information, financial account information, and customer service call routing.

In such applications accuracy of recognition is very high, despite high noise levels, because such systems use constrained grammar recognition: at each step the system listens only for a small set of expected answers. It also means that a highly optimised telephone application can prompt the user to repeat the previous answer whenever the system's confidence in its recognition of that input is low.
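The accept-or-ask-again logic can be sketched in a few lines of Python. Here a simple string similarity score from the standard library's `difflib` stands in for the confidence value that a real recogniser would report, and the grammar, threshold and function name `interpret` are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical constrained grammar: the only phrases the application expects.
GRAMMAR = ["check balance", "transfer funds", "speak to an agent"]


def interpret(heard, grammar=GRAMMAR, threshold=0.75):
    """Match a recognised utterance against the grammar.

    Returns ("accept", phrase) when the best match is confident enough,
    otherwise ("reprompt", None) so the caller can ask the user to repeat.
    String similarity is only a stand-in for true acoustic confidence.
    """
    scored = [(SequenceMatcher(None, heard.lower(), phrase).ratio(), phrase)
              for phrase in grammar]
    confidence, phrase = max(scored)
    if confidence >= threshold:
        return ("accept", phrase)
    return ("reprompt", None)


print(interpret("chek balance"))
print(interpret("what is the weather"))
```

A slightly garbled input still maps onto the nearest expected phrase, while an out-of-grammar request falls below the threshold and triggers a reprompt. Raising the threshold trades fewer wrong actions for more requests to repeat.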

Speech recognition software is now increasingly used in mobile phones as a faster way to input SMS messages. Nuance Communications, one of the biggest producers of voice recognition products, claims that more than 50 million phones are now equipped with such software. Here, although the background noise levels can be very high, vocabulary size is much smaller and the grammar constrained, so once again recognition rates are high.

In such applications voice input is becoming popular because with multiple menus, options and sub-menu paths to access each application even a simple task on a modern mobile phone is becoming time-consuming. Just writing and sending a five-word SMS message requires about fifty keypad clicks, while voice input allows that message to be keyed in five times faster than with a keypad.
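The keypad figure is easy to check with a short Python sketch that counts key presses for multi-tap text entry. It assumes the common letter layout on a twelve-key phone with the space character on the 0 key, which varies between handsets, and it ignores the extra presses needed to navigate menus and send the message. The sample message is made up.

```python
# Assumed multi-tap layout; the key used for a space differs between handsets.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz", "0": " ",
}

# Number of presses needed for each character: its position on its key.
PRESSES = {char: index + 1
           for letters in KEYPAD.values()
           for index, char in enumerate(letters)}


def multitap_presses(message: str) -> int:
    """Count the key presses needed to type a message using multi-tap entry."""
    return sum(PRESSES[char] for char in message.lower())


message = "call me when you arrive"
print(multitap_presses(message))  # 47 presses for this five-word message
```

The five-word sample comes to 47 presses for the text alone, in line with the figure of about fifty quoted above.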

Another reason why voice input for a mobile phone's menus and contact lists is proving popular is that it allows someone to use a phone and dial a number without looking at the keypad. This is of particular value to drivers, who are now able to make full use of their phone in a completely hands-free mode.

In fact, automotive applications look set to be a big growth area for speech technology. At the 2006 Geneva Motor Show, Fiat Auto in conjunction with Microsoft and Nuance Communications launched Blue&Me, a voice activated in-vehicle communications and entertainment system. It is based upon the Windows Mobile for Automotive operating system and allows drivers to integrate mobile phones, digital media players and other personal electronics devices with in-vehicle systems.

Blue&Me is compatible with most mobile phones with Bluetooth hands-free technology, and is currently available in nine languages: Italian, Spanish, Brazilian Portuguese, German, Polish, UK English, Dutch and French. "Drivers can keep their eyes on the road and their hands on the wheel, and consumers will enjoy their personal mobile devices in a safer and more integrated environment while on the road," says Craig Peddie, vice president and general manager for embedded speech solutions at Nuance.

Another area where speech technology is finding growing popularity is in speech-to-text dictation systems for use by professionals, such as doctors and lawyers. This is potentially a huge market, and in the healthcare area alone it is estimated that more than $15bn (£8bn) is spent annually on the manual transcription of doctors' notes. The advent of specialist voice input dictation systems from companies like Philips and Nuance looks set to generate big savings in this area of health expenditure.

"Improvements in speech technology and pressures on the healthcare industry create a compelling opportunity to transform manual transcription through speech-enabled solutions," said Paul Ricci, chairman and chief executive at Nuance. "The adoption of speech recognition [will] eliminate most manual transcription for healthcare in North America this decade, delivering over $5bn in savings to care facilities and transcription service organisations."

This view was backed up in a recent Frost and Sullivan Report on the "Market for European Health Care Voice Recognition Systems". The report noted: "The combined benefits of voice recognition and healthcare information systems, such as EMR/PACS/HIS, are important in extending the productive range of healthcare organisations and individuals alike, particularly in today's demanding healthcare environment."

Traditionally viewed as simply a means of dictating text into a personal computer, today's voice recognition software can play a far more significant role in the healthcare environment. In addition to pure dictation, speech-recognition software can be used to manage email, streamline repetitive tasks on the PC, reduce transcription and charting costs, speed up information turnaround and protect employees from repetitive stress injury (RSI).

Voice recognition can be integrated with most electronic medical record (EMR) applications to make those programs more effective and easier to use. Searches, queries and form-filling are all faster to perform by voice than using a keyboard. Charting, prescription writing, aftercare instructions, order entry, database searches, document assembly/automation and patient record management software programs are all highly conducive to control by speech. Rapid hardware advancements and improvements in the technology itself have increased the software's utility, accuracy, speed and ease of use.

The market leader in this area is Philips, which in 2001 launched its Intelligent Speech Interpretation technology with automated punctuation, hesitation filtering and formatting. Intelligent Speech Interpretation allows the production of high-quality documents with the minimum of human intervention by using sophisticated analysis of meaning. To ensure accuracy Philips also uses a synchronous playback technology, which allows the recognised text and the audio file to be played back simultaneously.

Philips has just enhanced SpeechMagic to enable adequate speech recognition in Citrix environments. The deployment of speech recognition and digital dictation applications from Citrix servers is a key competitive factor and is important in the centralisation of IT administration, applications and the delivery of data. A further advantage is that it will provide an extremely high level of security, since no files are stored locally, and will thus dramatically improve the protection of personal data, a key factor in all medical data systems.

However, to get a real glimpse of where voice technology will take us within a few years, take a look at the Sony TalkMan for the PSP. Launched at the Tokyo Game Show in September of last year, this software uses a microphone that clips onto the top of a PSP. It is basically a language translation system, but one that uses both voice input and voice output. Speak a word or phrase into the microphone and the system will produce spoken output in the chosen language. Initially the available languages are confined to Japanese, Chinese, and Korean, but later this year a European version will be launched with six languages: French, English, German, Spanish, Italian and Japanese.

This may only be a games machine program and, much like any phrase book, it is limited in its vocabulary to a small selection of scenarios and a few languages. It is thus a long way from being a universal translator. However, it demonstrates the possibility, and within a few years that possibility could become a reality. Not only could we be talking and listening to machines, but they could be acting as translators, thus finally breaking down one of the last great barriers to universal human communication.
