Inside the SpinVox Brain

The power of human hardware rarely fails to impress...

How much human interaction powers SpinVox's voicemail-to-text conversion system? Natasha Lomas was invited to the company's HQ to see a demo of the system. Did it impress?

A trip to the HQ of SpinVox - the voicemail-to-text conversion company I wrote about last week - has given me a newfound respect for human hardware. By which I mean the ear, the brain and above all the brain's ability to grub and process a grain of meaning from the polluted and chaotic environments humans create.

Listening to a friend explain the implications of the subplot of Moon from across a Tube carriage tortured by the sound of screeching brakes and screaming children? No problem. Filtering out the omnipresent swoosh of lorries and vans on the walk to work to eavesdrop on the conversation of the man on his mobile behind you? It can be done.

Yep, the brain and its tools are impressive alright. But what about SpinVox's Brain and SpinVox's tools?

Along with several other journalists who have been following 'SpinGate' by publicly wondering how much human intervention is required in SpinVox's Voice Message Conversion System (aka The Brain), I was invited to the corporate headquarters in Marlow-on-Thames for a demo of the system - led by company CIO Rob Wheatley.


The reception desk at SpinVox HQ (Photo credit: Natasha Lomas/

It was also billed as a chance to ask some of the questions not cleared up by last week's flutter of press releases - for me the biggest lure. I was expecting the tech demo to be interesting and competent but, as it would obviously be operating in test conditions, a mere taster of a business that can surely only be understood in the daily grind and grit of real-world operation. After all, three journalists in a room can only make so much noise.

So what does SpinVox's technology look like? Although we were shown a diagram of the workflow process - with both its automated and human components - we were forbidden from taking photos or filming. Wheatley also gave us an impassioned plea to "please be sensitive" with what they were telling us - although we were not asked to sign an NDA. A somewhat contradictory message that.

So in the interests of a) brevity and b) sensitivity here's my rough translation of how SpinVox's system works:

After cleaning up and rating the audio quality - and doing some fundamental checks such as 'what language is this?' and 'is there a message at all?' - the system uses the words it can pluck out of the mire to hazard a guess on the identity of the words it can't. Think 'Spears' coming after 'Britney'.
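To make the 'Spears after Britney' idea concrete, here is a minimal sketch of that kind of next-word guess using a toy bigram count. It is an illustration only and not SpinVox's method: the corpus, the function names (`build_bigrams`, `guess_next`) and the numbers are all invented for this example.

```python
from collections import Counter, defaultdict

# Toy training text, invented for illustration; not SpinVox data.
CORPUS = [
    "britney spears released a new single",
    "i saw britney spears on tv",
    "britney called me back",
    "please call me back",
]


def build_bigrams(sentences):
    """Count how often each word follows each other word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def guess_next(following, prev_word):
    """Return (best_guess, probability) for the word after prev_word.

    Returns (None, 0.0) when prev_word was never seen in training.
    """
    counts = following.get(prev_word)
    if not counts:
        return None, 0.0
    total = sum(counts.values())
    word, count = counts.most_common(1)[0]
    return word, count / total


MODEL = build_bigrams(CORPUS)

if __name__ == "__main__":
    print(guess_next(MODEL, "britney"))  # most likely word after "britney"
    print(guess_next(MODEL, "tesco"))    # a word the toy model has never seen
```

In this sketch a word the model has never encountered produces no guess at all, which is the kind of gap a real system would have to fill some other way.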

Wheatley talked about the system building "a lattice" of probabilities of what might be being said - and this is where the terminology starts to sound a tad over-engineered to my ear. A 'lattice of probabilities' is surely kith and kin to the predictive text you get on your phone - i.e. sometimes kind of useful but all too frequently annoyingly misguided as to what it is you're actually trying to say despite the fact you've stacked it with your favourite swearwords by adding them to the user dictionary.

(Does predictive text get better over time? For what it's worth I actually find my phone gets worse at helping me write text messages as more and more once-favoured words accumulate in the dictionary and then plonk themselves into phrases where they're no longer wanted. But I digress.)

Wheatley talked up the 'statistical analysis, acoustic modelling and user learning' that the system apparently uses to get better at predicting the next word each user might have said. And if humans had the vocabulary of sheep this might be an easy task but there's surely no escaping the fact the spoken word does anything but conform to type - even if CEO and co-founder Christina Domecq reckons many speakers can be described as 'average Joes'...

"Obviously a doctor who uses more unique terminology and a more expansive dictionary will require more human intervention than the kind of average Joe on the street," Domecq told us at the demo - a qualifying appendix to her assertion that fewer "human interventions" are required the longer SpinVox has been in a market.

Maybe fewer then, depending on the type of professionals who are signing up to the service and the things they are discussing.

The phrase "it varies" came up a lot in conversation at the demo.

What proportion of messages are fully automated? What proportion need a human touch? What proportion are totally unintelligible and not touched by machine or human? It varies and it depends, we were told time and again - with variables including product type, market, language, time in market, 'newness' of SpinVox user etc etc.

SpinVox was not able to show off the system's learning capability as the demo platform was a shrunken splinter of its real-world system set aside for testing purposes and therefore disconnected from all the data the real Brain presumably accrues about its users. So demo or no demo, people will still have to take the company's word for it that its technology gets better over time.

But what of the demo then? How did SpinVox perform?

Out of a total of four voicemails left in the quiet conditions of the meeting room only the shortest and most basic message (left by Wheatley) passed through entirely unaided - "Hi Rob, can you give me a call back when you get this? Thanks. Bye". It would be a pretty dumb mechanical eardrum that fell at that hurdle in test conditions.

The other three short voicemails all required the assistance of Ellie, an unfailingly enthusiastic human agent who was in the room to apply her dextrous ears and fingers to Tenzing - SpinVox's call centre software where machine predictions go for help.

The first message that ended up with Ellie contained the word 'SpinVox' which seemed to have foxed The Brain. Another was either spoken too quickly or rejected because of the word 'Tesco'. A third drove the system to distraction - admittedly it was a voicemail purposefully encoded in a Texan drawl so see-sawingly folksy it had most people in the room scratching their heads.

But still: the serious point is that accents vary as much as, perhaps even more than, vocabulary - so welcome to real world conditions.

The upshot of the day is I came away enormously impressed with Ellie's enthusiasm, her lightning-quick fingers, sensitive ears and very human brain. But far less impressed with what a software Brain can hear, perceive and intuit.

One thing's for sure: technology has an awful long way to go to get anywhere near the hardware inside our heads.

But considering the human brain is the product of millions of years of evolution - and not wanting to be too hard on the fledgling upstarts - let's not forget SpinVox's six-year-old Brain is not even a toddler beside it.