An especially hot area of enterprise software in recent years has been the front office, functions such as sales and support. There is a rush to transform the entire "go-to-market" process, and more recently, that raging market has collided with another very hot industry, the Internet voice services that replace phones with cloud-based communications.
Throw in a little artificial intelligence, and what do you have? A San Francisco startup called Dialpad, founded in 2011, is trying to transform customer support, sales, and marketing with a bit of machine learning and a lot of data. The company has received $120 million in financing in four rounds from Google, Andreessen Horowitz, and others.
"This market is massive," says Craig Walker, founder and chief executive of Dialpad, in an interview with ZDNet. "It's the biggest TAM there is," using the Wall Street jargon for a "total addressable market."
What Walker is referring to are four functions that can be transformed, he believes, through voice-calling software: voice conferencing; the replacement of the traditional "private branch exchange," or "PBX"; the "sales center" of companies; and the traditional call center. Together, they make up a $140 billion market known as "enterprise voice." Walker has seen the market evolve: he was previously product manager for Google Voice and related products, after he sold a previous startup, GrandCentral Communications, to Google in 2007.
There are lots of players in that $140 billion TAM, such as Zoom Video Communications, RingCentral, Five9, and Cisco Systems. And there has been a lot of mergers and acquisitions activity in recent years, at big price tags, such as Cisco's $1.6-billion purchase in February last year of IP-voice vendor Broadsoft.
But Walker believes his company is at the next frontier in enterprise voice. The company is using machine learning techniques to mine call data for meaningful insights about support and sales.
"We are going to create a new market, a sales-calling AI-enhanced market," says Walker. "We think that voice intelligence is the last offline data set," he explains. "The most important conversations with customers are by phone, not over email, and today, all that is in the vapor."
To build this new AI market, Dialpad a year ago paid $50 million for a data mining startup called TalkIQ. TalkIQ can process calls in real time using a unique automatic speech recognition (ASR) system and a natural language processing (NLP) program. It can yield patterns in calls even while a sales rep or customer support rep is on the call.
"Out of an hour-long conversation, say, we can break out things such as, What was the sentiment in the conversation? And we can do that not only in real-time, while the call is happening, we can also show the call agents information about the call on their screen while they're talking."
"It's like those VH1 Pop Up Videos," he says, making a somewhat dated reference to the music channel's use of word balloons for commentary about music videos back in the 1990s.
Before Walker latched onto TalkIQ, a story was already taking shape at that company five years ago about how voice communications could revolutionize the front office.
Specifically, enterprises want to see call centers as less of a cost center and more of a way to cross-sell product, a single point at which to unify conversations with both existing customers and prospects.
To do so would entail moving from a script-based call function to a technology that would derive insights in an ongoing fashion.
"The best telecom product is the one that bakes speech and natural language processing into the line," says Dan O'Connell, chief strategist at Dialpad, who was CEO of TalkIQ at the time of the deal.
"That should be seamless, I just make a call, and magic happens."
To make magic happen requires a lot of hard work by the company's head of AI and machine learning, Etienne Manderscheid, who was the head of data science at TalkIQ. What the company refers to as "customization at scale" is the ability to tune the details of speech and language algorithms for customers and industries beyond what conventional machine learning offers.
In a way, Manderscheid and his team are at the forefront of applied AI, doing all the dirty work that makes the stuff work in practice.
It begins with processing the sound of a call to extract the text, where Manderscheid and the team have built a custom speech recognition system. Many leading ASR systems these days, such as the kind built into smartphones, process the audio signal of a conversation at a sampling rate of 44 kilohertz. That won't work for a Polycom-type room conferencing system, which uses an 8-kilohertz encoding. So part of the technical challenge for TalkIQ was to perform speech recognition within an 8-kilohertz domain.
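One way to see why the 8-kilohertz domain is harder is the standard sampling rule that a signal sampled at a given rate can only represent frequencies up to half that rate. The short Python sketch below is purely illustrative and is not Dialpad's or TalkIQ's code; the function names and the choice of test tones are assumptions made for the example. It shows that at 8 kilohertz a 5-kilohertz tone produces the same samples as a 3-kilohertz tone, so anything above 4 kilohertz cannot be told apart from lower frequencies.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent unambiguously."""
    return sample_rate_hz / 2


def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Sample a pure cosine tone (hypothetical helper for this illustration)."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


NARROWBAND_HZ = 8_000  # the telephony-style rate mentioned in the article

# A 5 kHz tone is above the 4 kHz limit, so it aliases onto 3 kHz.
tone_3k = sample_tone(3_000, NARROWBAND_HZ, 64)
tone_5k = sample_tone(5_000, NARROWBAND_HZ, 64)

# Largest sample-by-sample difference between the two tones.
max_diff = max(abs(a - b) for a, b in zip(tone_3k, tone_5k))

print("Limit at 8 kHz sampling:", nyquist_hz(NARROWBAND_HZ), "Hz")
print("3 kHz and 5 kHz tones indistinguishable:", max_diff < 1e-9)
```

In practice that means a recognizer working on such audio has less acoustic detail to draw on than one fed higher-rate audio, which is one plausible reason a custom system was needed.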
The software works on both the acoustic component of the sound feed and the language component. "We are refining things at the language level to boost certain words that are important for customers, such as the names of companies and products," explains Manderscheid.
To the initial transcription of text, Manderscheid and the team may add processing with a large language model such as Google's "BERT," to re-score and improve the transcripts.
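The idea of boosting customer-specific words and then re-scoring transcripts can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the company's method: it assumes the recognizer emits several candidate transcripts with scores (higher is better), and it uses a flat bonus per boosted word in place of a language model such as BERT. The function name, the bonus value, and the sample scores are all hypothetical.

```python
def pick_best_transcript(hypotheses, boosted_terms, bonus=2.0):
    """Re-score candidate transcripts, rewarding customer-specific terms.

    hypotheses: list of (text, score) pairs, where a higher score is better.
    boosted_terms: set of lowercase words that matter to this customer.
    bonus: amount added to the score per boosted word found (assumed value).
    """
    rescored = []
    for text, score in hypotheses:
        words = text.lower().split()
        hits = sum(1 for word in words if word in boosted_terms)
        rescored.append((text, score + bonus * hits))
    # Return the text of the highest-scoring candidate after boosting.
    return max(rescored, key=lambda pair: pair[1])[0]


# Made-up candidates: the recognizer slightly prefers the wrong spelling.
n_best = [
    ("i use dial pad every day", -4.1),
    ("i use dialpad every day", -4.6),
]

best = pick_best_transcript(n_best, boosted_terms={"dialpad"})
print(best)
```

With the bonus applied, the candidate containing the product name overtakes the one the recognizer originally ranked first.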
To produce natural language processing, and, ultimately, language understanding, the team first creates a "minimum viable product," or MVP, using methods as simple as "regular expressions" before getting into any machine learning development.
That step matters because a labeled training set built for machine learning is hard to change and can lead to a dead end.
"For a lot of natural language processing, we learned this the hard way," says Manderscheid. "Initially, we would build a labeled training set, which is hard to change."
A typical NLP model such as BERT "doesn't have a steering wheel," as he likes to put it. "At the beginning of the work, with a highly nimble team, you need to have a steering wheel to iterate fast."
The process of iteration refines the labeled training set to fit a given customer's needs. That can take several months once the MVP has been established.
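A regular-expression MVP of the kind described might look something like the Python sketch below. It is a hypothetical example, not TalkIQ's actual rules: the pattern, the phrases it covers, and the function name are assumptions chosen to show how quickly such a detector can be written and edited.

```python
import re

# Hypothetical phrases that might signal a price objection on a sales call.
PRICE_OBJECTION = re.compile(
    r"\b(too (expensive|pricey)"
    r"|(can't|cannot) afford"
    r"|over (our|my) budget"
    r"|costs? too much)\b",
    re.IGNORECASE,
)


def find_price_objections(utterances):
    """Return the utterances that match the hand-written pattern."""
    return [u for u in utterances if PRICE_OBJECTION.search(u)]


call = [
    "Honestly, that is too expensive for us.",
    "Can you send the contract?",
    "It's over our budget this quarter.",
]

flagged = find_price_objections(call)
for utterance in flagged:
    print(utterance)
```

Changing the behavior means editing one pattern rather than relabeling a data set, which is the "steering wheel" quality Manderscheid describes.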
"We start with a seeding model, which gives a good first pass, and then a model such as BERT can help us generalize from there," says Manderscheid. "It's like active learning: you use a seed model to generate the first set of positives. If you were sampling at random, you just wouldn't get enough instances [of a given token] to build a good classifier."
"A lot of the most meaningful events for us are rare events," explains Manderscheid. "Consider how often a price objection comes up" in a phone call between a sales rep and a prospect. "It happens rarely, but it's meaningful in terms of the impact on the sales opportunity.
"We've had to find solutions to how to collect enough positive examples to build a labeled training set with such rare occurrences."
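The problem of rare events can be made concrete with a small Python sketch. It is an illustration under assumptions, not the team's pipeline: a simple keyword pattern stands in for the seed model, the corpus is synthetic, and the function name and batch sizes are invented. It compares a purely random labeling batch with one that starts from the seed model's hits.

```python
import random
import re

# A keyword pattern standing in for the "seed model" (an assumption).
SEED = re.compile(r"\b(price|pricing|expensive|budget|cost)\b", re.IGNORECASE)


def select_for_labeling(utterances, seed_pattern, n_random, rng):
    """Build a labeling batch: every seed hit plus a few random non-hits."""
    seeded = [u for u in utterances if seed_pattern.search(u)]
    rest = [u for u in utterances if not seed_pattern.search(u)]
    random_part = rng.sample(rest, min(n_random, len(rest)))
    return seeded, random_part


# Synthetic corpus: 10 price objections hidden among 1,000 utterances.
rng = random.Random(0)
corpus = ["Let's schedule a follow-up call."] * 990
corpus += ["That seems expensive for our team."] * 10
rng.shuffle(corpus)

seeded, random_part = select_for_labeling(corpus, SEED, n_random=10, rng=rng)
batch = seeded + random_part

# For comparison: how many objections a random batch of equal size finds.
random_batch = rng.sample(corpus, len(batch))
random_hits = sum(1 for u in random_batch if SEED.search(u))

print("Seeded batch size:", len(batch), "with", len(seeded), "likely positives")
print("Random batch of the same size found", random_hits, "likely positives")
```

A human would still label the seeded batch, since a seed model produces false positives as well as true ones; the point is only that seeding gets enough candidate positives in front of labelers to train a classifier.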
The careful work of building a voice and language combo extends to the word "embedding" that machine learning uses to represent the sounds and text it is processing. Manderscheid and the team have developed domain-specific custom embeddings. Terms such as "opportunity" in the sales domain have a special meaning, he points out, as does "close," meaning, to seal a deal.
Manderscheid adds that mainstream language models in machine learning, including BERT and a more recent approach, "XLNet," don't take into account the audio cues that come with speech processing. But those cues "give extra understanding to what's happening in a call, which we plan to add to our embeddings in the future."
Put all those parts together, and the result is something unique. "From a geeky perspective, we control the entire tech stack from telephony, through ASR to NLP, end to end, and that brings a big accuracy advantage; I don't know anyone else in our space doing this today," says Manderscheid.
There is a vision of a next step, he explains, a synthesis of all the parts in one giant model. "We would like ultimately to see one big model that is a seamless model for words, moments, sales outcomes, and all the rest, so you avoid cascading errors over many levels."
It seems in retrospect that it was kismet that TalkIQ's burgeoning machine learning approach merged with Dialpad's cloud telephony approach.
"We had 20 employees and $20 million in the bank, but it was clear this was going to be the future of telecom," as O'Connell described the thinking at TalkIQ when Dialpad's offer came along.
With 470 employees, 62,000 customers, and approaching $100 million in sales annually, the new Dialpad is making "voice cool again," O'Connell quips.
It is also solving, he believes, the customer problem of retention and expansion, which is key in an era of subscription-based product selling.
"Everyone loves the shiny new object," he says, meaning, winning new logos.
But, "as I acquire more of the market, my new logos flatten out, and then I want to retain, to avoid churn, and to cross-sell and up-sell."
The best comparison for Dialpad, says CEO Walker, is the bigger cloud companies, such as Salesforce, ServiceNow, and Workday. Younger cloud vendors such as Twilio would like to compete in voice but don't have what it takes, he insists.
"The thing is, voice is really hard to do," he says. "There are a lot of legacy parts to it, and lots of worldwide regimes to navigate. Take a Slack guy and say, Be good at telephony, and he can build a cool UI, but tying that into 911 in Malaysia is really difficult."
Walker sees a prospect for Dialpad to go public in a couple of years, once all the pieces are in place to meet public-market standards.
The market for cloud is hot, and as the elevated valuation of Zoom's IPO, earlier this year, shows, for the moment there is tremendous interest in a new generation of infrastructure that does something novel with the humble phone call.