Powerset: The natural language search mashup platform

Powerset finally came out of stealth mode tonight at its offices in San Francisco. The technology will not be ready until September, but the company unveiled what it is doing to revolutionize search. "The patents and technology are locked down," said Steve Newcomb, a Powerset co-founder and COO. "We are after a pretty big goal--replacing the core of the search engine." Powerset has attracted top talent among its 70-person crew, and plans to use tens of thousands of developers to build a search ecosystem around its platform and compete with Google.

Powerset is using natural language technology from Xerox PARC and focusing its efforts on indexing. It is building a search destination site and a platform that leverages the wisdom of crowds for development.

At the event, Newcomb introduced Powerlabs, which won't be open to the community until September.

"Imagine a mashup between Facebook, Digg and Google Apps, but you get to participate in the building of the products that sit on top of our platform. You log into a social network, like you would Facebook, and you get certified to be a Powerlabber. Once certified you can join different interest groups, such as travel, and participate in idea and mashup competitions. QA is embedded and it's all bloggable."

"Instead of being in stealth mode, we are being more open than any other company has been in the launch process. If we screw up and it's not going well, we will take the hit, and if it goes well we take that. It's a wisdom of the crowd idea of competitions. We will build the winners and widgets for Facebook, blogs, and others," said Newcomb.

"We want as many people in Powerlabs to help us build and test the product. Powerlabs tells us when we are ready to go. We could have 50,000 people QAing our product," he added. So far Powerset has 10,000 Powerlabs users. "Imagine how many widgets that could sit inside of Facebook, MySpace and even Second Life. It gives us the ability to launch with an extremely passionate set of people," Newcomb said.

In addition, Newcomb suggested that the traditional search box could be replaced. "Imagine an instant messenger interface or voice interface instead of the search box to interact with a search engine," Newcomb said.

Mark Johnson, who heads up Powerlabs, described it as a way of combining product design, marketing (launching) and community. It will educate users on natural language search and help attract and retain users, he said. Powerlabs users get a sense of participation in building the product, he added.

Community participants will have to be certified for Powerlabs, meaning that individuals have to go through some training to be of help to Powerset, Newcomb said. Within Powerlabs, leaders who contribute the most will be highlighted. Ideas in Powerlabs will be voted upon and discussed.


Powerset will also allow users to give search results a thumbs up or down to help improve the database. Powerlabs users will also vote on whether Google or Powerset delivered the better set of results for a query. "This is where we get to have fun, and where we think we are a lot better. We want you to score us head to head," Newcomb said. "Then imagine the QA that helps us make it better." Powerlabs users will earn points and levels in their accounts, as in an online game, and gain more access based on their voting.

"We have people who were interviewing at Google, and now they would work for nothing at Powerset to be part of the company," Newcomb said. "The passion is unbelievable. Powerset is ripping the core out again, like Google did, and they just want to be a part of it."

Regarding intellectual property for those contributing to Powerset, there may or may not be an agreement with those participating in Powerlabs, Newcomb said.

Powerset hopes that building a platform and attracting developers will save it from the fate of other search startups. "We are trying to challenge everything out there," Newcomb said. He said search companies like Kosmix had a 90 percent attrition in attention not long after their announcements. "After a few blogstorms they die--it's the dead cat bounce model. We want to launch with super passionate people--David versus Goliath people and with the people who can build products off of our platform."

Google is Goliath in this battle. "Screw alpha, beta and blogs to launch--[Powerlabs] is the way to do it. The power of the blogosphere and the people active in our community is a big force, and it is a big deal to take on Google," Newcomb said. "We are not anti-Google, we just believe the next decade is about computational linguistics." However, Newcomb said he does "want to own the space."

"Powerlabs will tell us when we are ready to go, and ready to build out the index," Newcomb said. "We are in a financial position to wait until it is ready to go, and not a crap product."

Parsing pages poses a scaling and potential performance problem. Newcomb said the barrier is getting parsing time down to seconds per sentence--Wikipedia pages average 25 sentences. Powerset has made a big investment in a datacenter. "It takes us one second right now, and we are driving it down every month," Newcomb said. At this point, Powerset has 720 Intel cores in its datacenter, Newcomb said, and is riding Moore's Law to improve economics and performance.
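To put those figures in perspective, here is a back-of-the-envelope sketch in Python. The three constants come from Newcomb's remarks; the assumptions that "one second" means one second per sentence per core, and that the work parallelizes perfectly across all 720 cores, are ours and were not stated at the event.

```python
SECONDS_PER_SENTENCE = 1.0   # from Newcomb: "It takes us one second right now"
SENTENCES_PER_PAGE = 25      # from Newcomb: Wikipedia average
CORES = 720                  # from Newcomb: Intel cores in the datacenter
SECONDS_PER_DAY = 86_400


def pages_per_day(seconds_per_sentence=SECONDS_PER_SENTENCE,
                  sentences_per_page=SENTENCES_PER_PAGE,
                  cores=CORES):
    """Pages parsed per day, ASSUMING one sentence per core at a time
    and perfect parallelism (both are illustrative assumptions)."""
    seconds_per_page = seconds_per_sentence * sentences_per_page
    pages_per_core = SECONDS_PER_DAY / seconds_per_page
    return pages_per_core * cores


def days_to_parse(total_pages):
    """Days needed to parse a corpus of total_pages at the rate above."""
    return total_pages / pages_per_day()


# 86,400 / 25 = 3,456 pages per core per day; times 720 cores = 2,488,320.
print(pages_per_day())
```

Under these assumptions, halving the per-sentence time doubles daily throughput, which is why driving that number down each month matters so much to the economics.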

He said Powerset can refresh its index at the sentence level, not just at the page level.
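Powerset did not explain how sentence-level refresh works. One plausible reading, sketched below in Python, is that parsed sentences are cached by content so that re-crawling a page only re-parses the sentences that changed. The `SentenceIndex` class, the hashing scheme and the naive sentence splitter are all hypothetical illustrations, not Powerset's design.

```python
import hashlib


def sentence_key(sentence):
    """Content hash used to recognize a sentence that has not changed."""
    return hashlib.sha1(sentence.strip().encode("utf-8")).hexdigest()


def split_sentences(text):
    """Naive splitter on periods; a real system would use a proper tokenizer."""
    return [s.strip() for s in text.split(".") if s.strip()]


class SentenceIndex:
    """Hypothetical index that stores one parsed entry per sentence."""

    def __init__(self, parse):
        self.parse = parse        # expensive per-sentence analysis function
        self.pages = {}           # url -> {sentence_key: parsed sentence}
        self.parse_calls = 0      # counts how often the parser actually ran

    def refresh(self, url, text):
        old = self.pages.get(url, {})
        new = {}
        for sentence in split_sentences(text):
            key = sentence_key(sentence)
            if key in old:
                new[key] = old[key]              # unchanged: reuse old parse
            else:
                self.parse_calls += 1            # new or edited: parse again
                new[key] = self.parse(sentence)
        self.pages[url] = new                    # removed sentences drop out


# str.upper stands in for the expensive linguistic parse.
index = SentenceIndex(parse=str.upper)
index.refresh("page-1", "Heath died. He was a politician.")
index.refresh("page-1", "Heath died. He was a prime minister.")
print(index.parse_calls)  # 3: two for the first crawl, one for the edit
```

With parsing measured in seconds per sentence, skipping unchanged sentences would cut the cost of keeping frequently edited pages current.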

Newcomb also said that the company would open source the datacenter schema.

Powerset is also using Amazon's EC2 compute utility to help build its index.

Other factoids and claims:

The Powerset front end will be built in Ruby on Rails.

No one has applied the level of NLP (natural language processing) to index the Web.

Scott Prevost, Powerset director of product, showed demos of the technology, matching meaning to meaning rather than keywords. "We enable you to pool information from the Web that Google can't do," he said. "We are linguistically reading pages and indexing that information. It's different from Google. In other ways we are similar, but it's different in the kind of information and the way we can use the information to match meaning to meaning, not just using things like Google PageRank, but using semantic data as well," Prevost said.

The index at this point includes Wikipedia and The New York Times, and Powerset will be adding data from blogs in the near future. Powerset is also using the ontology from Freebase, the semantic Web database built by startup Metaweb. "We haven't decided if we are doing porn or not," Newcomb said.

Regarding voice recognition, Newcomb said it would require a grammarless system. IBM has solved that problem but only on a client, not on a distributed basis, which Powerset would need. Voice does match up well with natural language search. "When people speak they are not going to want to speak 'keywordese,' " Prevost said.

Powerset may license its search engine and will display ads for revenue generation. "When we launch we will have a partner for advertising and keywords," Newcomb said. He did not reveal who the partner would be.

Prevost gave an example of how the semantics and natural language processing work.

  • Take the phrase: "Sir Edward Heath died from pneumonia"
  • Powerset parses the phrase and tags parts of speech with the Xerox NLP technology.
  • It extracts entities and semantic relationships, such as the subject of "died."
  • It can expand by looking at similar entities and extractions.
  • It knows that Heath was a UK prime minister, that a prime minister is a politician, that pneumonia is a kind of disease and that dying means death.

As a result, the query can be asked in a lot of different ways, such as "What killed Edward Heath?"
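The idea can be illustrated with a toy Python sketch. It skips the parsing step entirely (that is the Xerox technology) and starts from a hand-written fact. The `Fact` structure, the `IS_A` and `EVENT_OF` tables and the `answer` function are hypothetical stand-ins for what a real parser and ontology would supply, not Powerset's actual data model.

```python
from dataclasses import dataclass

# Hand-written stand-ins for ontology knowledge (illustrative only).
IS_A = {
    "Edward Heath": "prime minister",
    "prime minister": "politician",
    "pneumonia": "disease",
}
# Different surface verbs that map to the same underlying event.
EVENT_OF = {"died": "death", "killed": "death"}


@dataclass(frozen=True)
class Fact:
    event: str     # normalized event, e.g. "death"
    patient: str   # who it happened to
    cause: str     # what caused it


def is_a(entity, category):
    """Walk up the IS_A chain from entity, looking for category."""
    current = entity
    while current is not None:
        if current == category:
            return True
        current = IS_A.get(current)
    return False


def answer(index, event_word, patient=None, cause=None, ask="cause"):
    """Return the requested slot of every fact matching the query."""
    event = EVENT_OF[event_word]
    results = []
    for fact in index:
        if fact.event != event:
            continue
        if patient is not None and not is_a(fact.patient, patient):
            continue
        if cause is not None and not is_a(fact.cause, cause):
            continue
        results.append(getattr(fact, ask))
    return results


# "Sir Edward Heath died from pneumonia", already reduced to a fact.
INDEX = [Fact(event=EVENT_OF["died"], patient="Edward Heath", cause="pneumonia")]

# "What killed Edward Heath?"
print(answer(INDEX, "killed", patient="Edward Heath"))  # ['pneumonia']
# "Which politician died from a disease?"
print(answer(INDEX, "died", patient="politician", cause="disease", ask="patient"))  # ['Edward Heath']
```

Because "died" and "killed" normalize to the same event and entities are matched through the is-a chain, both questions reach the same stored fact even though neither shares its keywords.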

"When Powerset reads a page we semantically analyze it and look at other knowledge resources and index the facts. When we see a query, we do a similar analysis and look at how the meaning of query matches the meaning in the index. For example, we know that in the query 'acquisitions in 2001,' 2001 is linked to acquisitions," Prevost said. "In Google, you don't necessarily get that kind of tight linguistic relationship to match the meaning of the query to the meaning in the index." Nor does Google have the entity relationships to augment search results.


Scott Prevost, Powerset director of product, demos Powerset results using Wikipedia versus Google results for the same query

For the query "who mocked Blair," Powerset uses lexicons and ranking functions to match related terms such as "caricatures," "lampoons," "impersonator" and "taunted" when determining what results to show.
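A minimal Python sketch of that kind of lexicon-based expansion follows. The related terms are the ones shown in the demo; the `LEXICON` table layout, the scoring weights (1.0 for the exact verb, 0.5 for a related term) and the sample sentences are made up for illustration and do not reflect Powerset's actual ranking functions.

```python
# Related terms from the demo; the table structure itself is an assumption.
LEXICON = {
    "mocked": {"caricatures", "lampoons", "impersonator", "taunted"},
}

# Made-up sample sentences, not real index content.
SAMPLE = [
    "The cartoonist lampoons Blair weekly.",
    "The crowd taunted Blair.",
    "Blair visited Paris.",
    "The columnist mocked Blair.",
]


def score(sentence, verb, entity):
    """1.0 for the exact verb, 0.5 for a lexicon relative, 0.0 otherwise.
    The weights are arbitrary illustrative choices."""
    words = {w.strip('.,"').lower() for w in sentence.split()}
    if entity.lower() not in words:
        return 0.0
    if verb in words:
        return 1.0
    if words & LEXICON.get(verb, set()):
        return 0.5
    return 0.0


def search(sentences, verb, entity):
    """Return matching sentences, best score first (ties keep input order)."""
    scored = [(score(s, verb, entity), s) for s in sentences]
    hits = [pair for pair in scored if pair[0] > 0]
    return [s for _, s in sorted(hits, key=lambda pair: pair[0], reverse=True)]


for sentence in search(SAMPLE, "mocked", "Blair"):
    print(sentence)
```

The sentence about visiting Paris mentions Blair but is dropped because nothing in it relates to mocking, while the "lampoons" and "taunted" sentences are returned even though they never use the query's verb.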

Powerset can also pull structured data from databases like Metaweb's Freebase, and create a mashup to query it.

In another example, Prevost explained that if you wanted to find every sentence in Wikipedia in which Jesus said or told something, you could not do it easily with Google. "In Powerset you type in 'what did Jesus say.' It just can't get simpler than that," Prevost said.

"In the query, 'what did Steve Jobs say about the iPod,' you get a mishmash of people saying things about Steve Jobs in Google, but we respect tight linguistic connections encoded in the index," Prevost explained.

On VentureBeat in February, Peter Norvig, director of research at Google, said regarding semantic search:

"I have always believed (well, at least for the past 15 years) that the way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons. The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective."

"We are betting that we win, but we don't know, but we do know that search is going to get better because of it," Newcomb said. "We are betting on our index."

If Powerset is successful, it will definitely get Google's engineers focused on NLP and semantic search and on opening up its search platform. But Newcomb believes that even putting 500 engineers against the problem would not give Google a fast way to catch Powerset on the technology front.