Summary: The arms race for scientists with expertise in various areas of search, data mining and data analysis is in full flower, as in the tug of war between Google and Microsoft over the services of Kai-Fu Lee. Meanwhile, Yahoo is making a major investment from its nearly $500 million annual engineering spend to build out its own world-class research group.

In fact, based on my conversation yesterday with Prabhakar Raghavan, the new head of Yahoo’s research group, Yahoo has its sights set on Nobel prizes and making breakthroughs to ensure the future of the company. I don’t think he was exaggerating. Search and the creation of more personalized user experiences that take advantage of underlying data and relationships are still in their infancy. Yahoo, Google, Microsoft, Amazon and other major players understand that the spoils will go to those who provide answers, rather than links, and develop ways in which billions of consumers and creators of content can participate in an economic and social value chain.

Raghavan, who spent 14 years doing search and data mining-related research at IBM and was lured from his stint as CTO of enterprise search vendor Verity, told me that he intends "to go after the best in the world and to get them." He said that Yahoo will be able to attract top talent because of its stable and profitable business and the opportunity to impact Yahoo’s audience, which accounts for 12 to 15 percent of all Web activity worldwide (Yahoo’s numbers). "We have an amazing outreach," Raghavan said. "Ten terabytes of data, which for a scientist is pretty appealing." Raghavan is also well connected in the research community: he is editor in chief of the prestigious Journal of the ACM.

For scientists with expertise in information retrieval, computational linguistics, machine learning, matrix and graph algorithms, unsupervised clustering, data mining and related areas, it’s like the U.S. housing market. They'll have multiple bidders and command a premium. Raghavan said that Yahoo would stay away from high-profile hires and focus more on university researchers, as well as hiring college student interns and grooming them for jobs. Yahoo also recently formed a research center in association with the University of California, Berkeley. That said, he was able to recruit a well-regarded colleague, Andrew Tomkins, from IBM.

Raghavan noted that his group will be active and open to the research community. "We will publish our research and interact with peers--it's critical to the success of a research organization. There is an obvious aspect of marketing, PR and being visible contributors of ideas to the community. That said, we will not take every trade secret and publish it. It's a challenge other industry leaders have solved before. We will publish and be judicious about how we do it."

Raghavan has been in the job just over a month, but he has been impressed by what he called the "thirst for ideas that flow from research to the business." He acknowledged that moving research into products is a challenge. He listed improving search, building a better advertising platform, making better sense of social media, large-scale distributed computing, and developing incentive structures and tools as his goals.

Regarding search, Raghavan said, "We have two views of better search. Most people are not interested in search—they want to get things done. The future has to be more friendly to people getting tasks done. You don’t want to spend two weeks of evenings sitting at a keyboard and piecing together a vacation plan. You want a system to go out and find the answers, based on future technology that goes beyond crawling and indexing pages." 

That future technology, according to Raghavan, is diving into the “deep Web” and semi-structured queries. "I hesitate to use the buzzword of 'Semantic Web'--but it is about entity extraction, XML queries, unstructured queries, semantic ambiguity. We have to build a view of the world. When you issue a query, it has a richer view than a text index. We’ll start to see manifestations of this in five years."

On the back end, Raghavan wants to solve problems like spam and to "align the commercial incentives of a billion content providers with social good intent." He pointed to the field of mechanism design, a sub-field of microeconomics and game theory, as key to creating economic models that encourage people to participate in a clean, well-lighted digital marketplace with billions of content creators and consumers.

"We want to inspire the audience to give more data and more. If someone creates a snippet of music and others remix it and it finally becomes a hit, how do you divvy up the proceeds amongst all the constituents? That [economic incentive network] has to be figured out. There is a lot of microeconomics that is not fully understood, and it’s one of the areas we want to understand. There will be a Nobel Prize in economics awarded for this stuff, and I wouldn’t be unhappy if it came from our group."

Along those lines, Raghavan and Jon Kleinberg authored a paper recently entitled "Query Incentive Networks," which looks at networks of interacting agents as economic systems, in which "users seeking information or services can pose queries, together with incentives for answering them, that are propagated along paths in the network." 

Yahoo wants to turn its fragmented set of services, content and marketplaces into a cohesive whole and to aggregate, distribute and monetize the creative output of its users. "We have a plethora of opportunities looking at different social networks, such as blogs, instant messaging, My Web, Yahoo 360, and other services, across Yahoo properties," Raghavan said. Yahoo's social search engine My Web 2.0, for example, allows Yahoo users to archive, tag and annotate search results and share them with other people using the service. Users can also search their contacts' My Web and browse content that others on Yahoo's network have shared. 

But determining what data from the pools of Yahoo services and billions of inputs is useful to people and will create a breakthrough in the user experience is one of his team's challenges. "It’s a classic problem in statistical machine learning—you might have 200 data points, but how do you zero in on the three that make a difference?"

As part of Yahoo's Research initiative to harness the activity on its properties in ways that create new revenue streams and sticky user experiences, Raghavan’s team will be racing its competitors to come up with standards and methods for determining value, incentive systems, frictionless payments and rights management. “We will let the market determine what is interesting and those who contribute the interesting stuff will get rewarded,” Raghavan said.

However, without standards across user networks, every site will be a cul-de-sac. An incentive system on one site will not interoperate with another site. It’s like requiring users to have a different card for every kind of ATM. I asked Raghavan whether users should have access to and control over the data collected by Yahoo. “Users should have control of what data is collected or given up and knowledge of what is done with it,” he said. “Giving every person their clickstream doesn’t make a lot of sense--most don’t want it--but they should have knowledge and control.”

Raghavan did, however, support the concept of being able to exchange your data collection, such as your Amazon or Yahoo shopping clickstream and forms input, with another site. “The data belongs to the user because it’s about the user, but we are not at a point today where multiple shopping sites can exchange data. It’s a metadata challenge, but it’s more of a standards activity, not a research issue.”

In addition, his group is working on aspects of personalization. "Personalizing is a loaded word, and it sometimes gets trivialized. It’s not about customizing the colors on the MyYahoo page," Raghavan said. "It’s more of a social phenomenon that takes into account what others are doing, especially people like yourself. Content, context and community coming together is a long-standing dream in our business--we are all going after it. But the catch is when the user is not only a consumer but also a creator of content. It leads to interesting possibilities in tandem with data mining and the user experience. You have to decide what content to show that users will find valuable, and not irritate users with too much content."

Raghavan has also spent time looking at how to mine blogs for predicting the movement of products and developing new user experiences. "We are looking at sources of information--text, photos, podcasts--whatever we can mine from the back end. Then we look at what users want, and bring the two together to create an application from all the chatter going on," Raghavan said. "We can dream up cool experiences, but they have to be grounded in product reality. As we develop technology, markets start to react, so mining begets a reaction from the market and begets more mining, so we are constantly working on more scenarios."

Underpinning all of Yahoo's--as well as every other megasite's--dreams of growing to billions of active, transacting, content-creating and content-consuming users is the ability to build an efficient platform with millions of computers and data sets distributed around the globe. With 345 million unique users per month across 25 countries and in 13 languages, Yahoo, as well as its competitors--especially Google--has some experience in planetary-scale computing.

While the progress over the last ten years of the Web has been significant, we are still in the Stone Age of search, social networks, incentive models and personalization. With the competitive juices flowing in research labs, and wide open commercial opportunities, the next ten years will be more about answers than links, but not without some serious flailing...

  • Alfred Nobel would say "huh"?

    The Nobel Prize is an international award given yearly since 1901 for achievements in physics, chemistry, medicine, literature and for peace. Just what category would Yahoo be going for? I suppose that Mathematics would fall under Physics, and Computer Science would fall under math, but saying a semantic web search breakthrough is a Physics prize is sort of a stretch. I don't think source code could win the Literature prize either.

    Don't get me wrong, I like think tanks. Xerox PARC pioneered many of the things that you see right in front of you today. Bell Labs was ALWAYS a prestigious place - where many a Nobel laureate worked. I see good things coming from Yahoo (in the forefront of the WiMAX steamroller/Streaming Paradigm), but no Nobel Prizes . . .
    Roger Ramjet