ZoomInfo launched what the company calls the first "market-ready semantic search engine." It aggregates and organizes data on companies, product categories, industries and other vectors, crawling Web sites, press releases, blogs, news services, financial filings and other public sources.
ZoomInfo has the rudiments of semantic search, analyzing the data it crawls and extracting names, titles, companies, products and other business information. The search engine applies data pivoting--search on a company and get employees or find out about open jobs at the company via a partnership with indeed.com. It associates companies with news and people or people with SEC filings and co-workers. The problem is that ZoomInfo's semantic search engine isn't quite smart enough to be trustworthy, which could be said of most search engines. They are immature, and often return flawed or unfathomable mountains of results.
A search by company for IBM turns up some basic information, and lists Ramon Demper as the company CEO and CTO. As far as I know, Sam Palmisano is the IBM CEO, and Demper left IBM in 1993. Searches for ZDNet in both basic and powersearch (requires registration), by company and by people, turned up outdated and grossly incorrect information. Similarly, a search on CNET turned up a lot of erroneous information.
According to Russ Glass, vice president of products at ZoomInfo, the error on the IBM page is a result of having an automated system. "A data source that talks about Demper and looks like a recent piece can fool our algorithms," Glass said. "Over time it gets scrubbed out. We attach all the Web sources to a piece of information, so users can see where it comes from. Currently, we are only updating every four weeks, but we plan to go nightly."
The fact that Demper is the CEO of ICM, and not IBM, may have played into the error, but it's not credible that there isn't enough information available for a search engine to determine that Demper is not the CEO of IBM. "It's largely accurate, but you can look at sources for each profile we compile to confirm each person is where we said they are," Glass added.
"Largely accurate" is the current state of the art, and not good enough. Some human scrubbing would be a good way to improve search results.
Searching for security software companies in California with $50 million or less in revenue and fewer than 100 employees turned up Network Associates, which merged with McAfee in 2002, as the first entry.
For some searches, such as one for “Web services management” companies located in the California area, ZoomInfo produced what appeared to be a good result, although given the results from other searches, it is difficult to trust the search engine implicitly. ZoomInfo provides a summary of each company based on public information it gathers.
According to the company, a proprietary natural language extraction technology is applied to the data, analyzing sentence structure, relationships between words, aliases (Federal Express = FedEx) and verb meanings to interpret each sentence.
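To make the alias idea concrete, here is a minimal sketch of alias resolution, assuming a simple lookup table. ZoomInfo's extraction technology is proprietary, so this is not its method; the table and the function name `canonical_name` are hypothetical, and the only alias pair taken from the company's description is Federal Express = FedEx.

```python
import re

# Hypothetical alias table. Only the Federal Express / FedEx pair comes
# from the company's description; a real table would be far larger.
ALIASES = {
    "federal express": "FedEx",
    "fedex": "FedEx",
}


def canonical_name(mention: str) -> str:
    """Map a company mention to one canonical name.

    Whitespace is collapsed and case is ignored for the lookup.
    Unknown mentions are returned with whitespace cleaned up only.
    """
    cleaned = re.sub(r"\s+", " ", mention).strip()
    return ALIASES.get(cleaned.lower(), cleaned)
```

With a table like this, mentions of "Federal Express" and "FedEx" in different sources end up attached to the same company record instead of two.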
"We have a three stage process. We crawl the pages and use extraction algorithms to pull data out from what we read and apply semantics for tagging purposes, creating explicit and implicit ties associated with an entity," said Glass. "Once we build the tags, we go through a recombination process and fold the information in on itself. We correct tags based on errors, looking at multiple sources that contradict each other, determined by time frames and authority from link structure and authority we set for certain sites."
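Glass's description of correcting tags by time frame and source authority can be illustrated with a small sketch. It assumes each source's claim carries a publication date and an authority score between 0 and 1, and it combines the two with an exponential recency decay. The `Claim` class, the `resolve` function, the scoring formula and the one-year half-life are all assumptions made for illustration, not ZoomInfo's actual algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    value: str        # e.g. the person a source names as CEO
    published: date   # publication date of the source
    authority: float  # assumed trust score for the source, 0..1


def resolve(claims, today, half_life_days=365.0):
    """Return the value with the highest authority-times-recency total.

    Each claim's weight halves every `half_life_days`, so an old article
    counts for little even if it comes from an authoritative site.
    """
    if not claims:
        raise ValueError("no claims to resolve")
    scores = defaultdict(float)
    for claim in claims:
        age_days = max((today - claim.published).days, 0)
        recency = 0.5 ** (age_days / half_life_days)
        scores[claim.value] += claim.authority * recency
    return max(scores, key=scores.get)


# Placeholder data: one old, authoritative source against two recent ones.
claims = [
    Claim("Executive A", date(1993, 1, 1), 0.9),
    Claim("Executive B", date(2005, 1, 1), 0.6),
    Claim("Executive B", date(2005, 2, 1), 0.5),
]
winner = resolve(claims, today=date(2005, 3, 1))
```

Under a scheme like this, a stale source wins only if the system misjudges its date, which is the kind of failure Glass describes when a piece "looks like a recent piece."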
Currently, ZoomInfo crawls only English language sites and resources that contain business information. "We crawl a billion pages, but we will expand that with additional crawlers. We want to get to .edu and .gov, which we are not covering much now, and to English language sites outside of the U.S., Canada, the UK and Australia," Glass said.
Additionally, ZoomInfo announced ZoomExec, which provides fee-based ($99 per month) access to information on executive-level prospects, including work history, education and contact information gathered from public sources.