Zoom calls will now come with the option of live captions, in a move that's likely to make life easier for remote workers whose attention spans suffer steep declines during online meetings.
To put an end to the unfortunate miscommunications frequently caused by remote collaboration tools, speech-to-text transcription company Otter.ai is expanding its technology to enable speakers on a Zoom call to see their words turned into accurate captions in real time.
So there should be no more excuses for misreporting the numbers presented by your sales team, or missing the list of targets put forward by your manager.
Captions will appear directly within the call, with a couple of seconds of lag, and presumably will be accurate enough for key information to consistently come out in the form of plain text.
The new feature will be particularly helpful to users with accessibility needs, as well as non-native English speakers struggling to make out the meaning of a sentence. Otter.ai currently supports only English, but can handle a variety of accents, including southern American, Indian, British (Scottish among them), Chinese, and various European accents.
Otter.ai is not exactly new to the increasingly popular speech-to-text scene. The company started making a name for itself two years ago, when it launched the technology as a tool to capture and transcribe live speech, acting as a smart note-taking assistant for speeches, meetings or interviews.
Available as a mobile app or as a web-based tool, the technology soon started supporting online conferences, offering users the option to turn Zoom cloud recordings into written conversations to keep a record of their virtual meetings.
Earlier this year, Otter.ai launched Live Notes – a new feature that enables users to open a live transcript of the call during a video conference, in a separate shared file, which transcribes what is being said in real time.
Based on a sophisticated algorithm, Live Notes can separate human voices to identify different speakers, and includes each speaker's name in the transcript to indicate when a given participant has started talking. Users can then go back to the file to check a detail if they have missed a sentence or joined the call late.
The new announcement, therefore, builds on top of Live Notes, integrating the transcribed quotes directly into Zoom's platform during a virtual meeting. In a demo call showcasing the technology, Otter.ai's founder Sam Liang told ZDNet: "Now, you will have Live Notes still going on in the background, but then you will also have the captions put down in the call. And there's a pretty broad range of people that this will be helpful to.
"It's definitely a great help for people with a hearing disability, but also for international, distributed workforces who don't speak English as their native language. And education as well: online classes could benefit from captions, on top of the Live Notes that they can go back to, to facilitate learning."
The transcription is not exactly pitch-perfect: some sentences don't make sense and words occasionally come out garbled. Overall, however, Otter.ai's algorithm, especially given the tool's ease of use and accessibility, appears to be pretty accurate – an assessment echoed by most online reviews and user experiences.
Liang is confident that the technology's accuracy is only improving as more users get on board, providing more training data for the speech-to-text algorithm and helping the AI work its way through background noise and strong accents.
In fact, the company's algorithm has now transcribed over one billion minutes of audio from more than 30 million meetings – a number that was largely boosted by the surge in Zoom calls caused by remote working during the past few months, which has resulted in a five-fold increase in usage for Otter.ai's services.
"We have been working on this for over four years now," says Liang. "And the number of users and meetings has been growing exponentially. All the data from our transcriptions is anonymously used by the machine-learning algorithm – so the algorithm is constantly learning new words and improving its accuracy."
Liang has a PhD in electrical engineering from Stanford University and is also named on the patent for Google Maps' blue dot, having led the location platform team for the search and advertising giant.
The field of speech-to-text technology has been notoriously difficult and is littered with examples of poorly performing tools.
A few years ago, for example, Google launched a highly anticipated new pair of wireless earbuds, complete with a real-time translation service that, in theory, could recognize speech in one language, translate the words into the target language on the user's phone, and then read out the new sentence.
It quickly became obvious that the technology was struggling to recognize speakers' words if they used complicated sentences, or if they had an accent. The reason is pretty straightforward: no matter how sophisticated the artificial intelligence, recognizing human speech is tricky.
There is a reason why typing 'Why is speech to text' in Google's search bar results in recommendations such as 'Why is speech to text not working' or 'Why is speech to text so bad'.
"There are many different challenges when it comes to language," says Liang. "Spoken language has tremendous amounts of variation.
"There are so many different accents, even within a single country like the US, and at the same time a lot of words have a similar pronunciation. And then new words are being invented every day, as well as acronyms, company names and other new terminology."
Another issue is noise: the loud AC in Liang's conference room makes it harder for the algorithm to accurately pick up on his words during the call, broken as they are by the sound of fans spinning. Dodgy internet connections also mean speakers' voices can cut off, fade away, or break up – all of which can get in the way of the technology's accuracy.
A mix of long-trained deep-learning models and big data explains Otter.ai's encouraging capabilities, argues Liang. The algorithm is capable of considering the sentence as a whole and predicting what the correct output might be, based on previous datasets of speech.
By considering the context of an entire sentence, rather than working on a word-by-word basis, the AI can make more accurate decisions.
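To see why sentence-level context helps, consider a toy illustration. The sketch below is not Otter.ai's algorithm, which the company has not published in detail; it is a minimal example of the general idea, using a classic Viterbi search over made-up numbers. The `LATTICE` acoustic scores, the `BIGRAM` word-pair probabilities, and both function names are assumptions invented for this example. A word-by-word decoder picks the best-sounding candidate at each position, while the sentence-level decoder also weighs how likely each word is to follow the previous one.

```python
import math

# Hypothetical acoustic scores: for each position in the utterance,
# how well each candidate word matches the audio (invented numbers).
LATTICE = [
    {"the": 0.9, "a": 0.1},
    {"sail": 0.55, "sale": 0.45},  # homophones: the audio alone slightly prefers "sail"
    {"ends": 0.8, "and": 0.2},
    {"friday": 1.0},
]

# Hypothetical bigram probabilities: how likely a word is to follow another.
BIGRAM = {
    ("the", "sale"): 0.4,
    ("the", "sail"): 0.4,
    ("sale", "ends"): 0.5,
    ("sail", "ends"): 0.02,
    ("ends", "friday"): 0.3,
}
DEFAULT_BIGRAM = 0.01  # small fallback probability for unseen word pairs


def greedy_decode(lattice):
    """Word-by-word decoding: keep the best acoustic match at each position."""
    return [max(slot, key=slot.get) for slot in lattice]


def viterbi_decode(lattice, bigram, default=DEFAULT_BIGRAM):
    """Sentence-level decoding: find the word sequence with the best
    combined acoustic and bigram score, using log-probabilities."""
    # best[word] = (log score of the best path ending in word, that path)
    best = {word: (math.log(p), [word]) for word, p in lattice[0].items()}
    for slot in lattice[1:]:
        new_best = {}
        for word, acoustic in slot.items():
            candidates = [
                (
                    score
                    + math.log(bigram.get((prev, word), default))
                    + math.log(acoustic),
                    path + [word],
                )
                for prev, (score, path) in best.items()
            ]
            new_best[word] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


if __name__ == "__main__":
    print("word by word:  ", " ".join(greedy_decode(LATTICE)))
    print("whole sentence:", " ".join(viterbi_decode(LATTICE, BIGRAM)))
```

With these invented numbers, the word-by-word decoder settles on "the sail ends friday", while the sentence-level decoder recovers "the sale ends friday", because "sale ends" is a far more plausible pairing than "sail ends". Production systems rely on neural models rather than a hand-written table, but the trade-off between local evidence and whole-sentence plausibility is the same.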
Similar methods have sparked the interest of the industry's biggest players, with IBM now offering a cloud-based, highly accurate speech-to-text platform as part of Watson's services, while Amazon Transcribe offers an API for automatic speech recognition.
However, Otter.ai is arguably the most consumer-facing technology out there. Liang confirmed that the company is now working on a smoother integration with platforms like Microsoft Teams, Google Meet or Cisco Webex, to open up access to the transcription and live-captions features.
In Zoom, live captions are available now for Otter customers paying for a Business plan, as well as for Zoom Pro customers.