AI transcription sucks (here's the workaround)

A $60M bet that automation with human oversight is a recipe for near-perfect speech-to-text.
Written by Greg Nichols, Contributing Writer

I've searched for a reliable way to autonomously transcribe natural speech for years. I'm a journalist, and I often have hours of taped interviews with sources around the globe to transcribe. For now, I'm still paying for people-powered transcription services.

Speech to text has been a huge challenge for AI developers, and it's a puzzle that's being closely watched in a variety of industries. The technology has implications far beyond quoting sources; human-machine interfaces in fields like robotics, autonomous vehicles, and personal computing will benefit from computers that can accurately interpret natural speech. 

Transcription, then, is a kind of technological entry point, a straightforward market need that can help spur development of a technology that will have broad resonance and incalculable implications for how we interact with machines.

"Like nearly every market segment, the education, legal, and media and entertainment industries have had to quickly move to a remote environment," says Jai Das, Managing Director and President at Sapphire Ventures. "As a result, the need for AI-driven, real-time and accurate transcription services has skyrocketed." 

The problem is that natural, contextual speech, along with accents and dialects, has made the quest for AI-driven transcription quixotic to date. So what do you do when there's a ripe market for a technology but the capability just isn't there yet? 

Well, you improvise and use the tools at your disposal while pouring money into technology development.

That's the strategy of an innovative transcription and captioning solution called Verbit, which utilizes an in-house, AI-based technology, along with an army of human overseers, to transform live and recorded video and audio into nearly perfect captions and transcripts for the higher education, legal, media, and enterprise industries. 

"Verbit combines the speed and low cost of Automatic Speech Recognition technology with the accuracy of human transcription to solve this massive problem for companies and organizations in these markets," says Das, whose venture firm recently led Verbit's $60 million Series C. Total funding for Verbit now tops $100 million.

Verbit's model uses cutting-edge transcription technology, which filters out background noises and echoes and recognizes things like domain-specific terms. The acoustic, linguistic, and contextual data is then thoroughly checked by Verbit's human transcribers, who maintain quality assurance by editing and reviewing the material and incorporating customer-supplied notes, guidelines, and more. I've often been delighted when human transcribers I work with include little contextual notes about spellings and usage in their transcriptions.

I like this strategy a lot. Verbit can tap into a huge need among major enterprise players -- namely, the need for real-time transcription -- with a core technology that's good but not yet perfect. The hybrid human-machine model enables the company to go to market with a high-quality product while continuing to invest in development. Despite dystopian nightmares of robots stealing jobs, that's the way automation is going to infiltrate the enterprise in the foreseeable future: by joining forces with humans rather than displacing them outright. 

According to a company statement, Verbit will use this latest investment round to further fuel its significant growth by continuing to innovate its data-driven product capabilities and increase the number of languages it supports. 
