Improving internal search and detecting cloaking

Written by Phil Windley, Contributor
One of the problems with using search inside the enterprise is that corporate users rarely link to internal data on Web pages like users on the greater Internet do. This means that using incoming links as a way of ranking search results for relevance (page rank) doesn't work nearly as well for internal search.

I attended a session at WWW2006 today by Pavel Dmitriev from Cornell University that discussed how well various solutions to this problem work. (See my notes or the paper.)

Tim Berners-Lee chatting with a bagpipe player at Edinburgh Castle

The primary strategy for improving relevance is to ask the users. Corporate users are much more likely to participate in making internal search better. There are both implicit and explicit methods for doing this. Explicit methods are the fairly obvious ones, such as pop-up surveys.

Implicit methods use query and click logs to infer which pages were the most relevant. You can choose to count any click as a vote of relevance or just the last one (on the assumption that the user will keep searching until they find what they want). You can also aggregate session data on the assumption that searches close together are related.
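
To make those three options concrete, here is a minimal sketch in Python. The log format, the sample data, and the function names are all assumptions made up for illustration; none of it is taken from the paper, and it assumes the log already carries a session id rather than inferring sessions from timestamps.

```python
from collections import Counter, defaultdict

# Hypothetical click log in time order: (session_id, query, clicked_url).
CLICK_LOG = [
    ("s1", "expense policy", "/hr/travel"),
    ("s1", "expense policy", "/finance/expenses"),
    ("s2", "expense policy", "/finance/expenses"),
    ("s3", "vpn setup", "/it/vpn"),
    ("s3", "remote access", "/it/remote"),
]


def click_votes(log, last_click_only=False):
    """Count clicks as relevance votes for (query, url) pairs."""
    if not last_click_only:
        return Counter((query, url) for _, query, url in log)
    # Keep only the final click per (session, query): later rows overwrite earlier ones.
    last = {}
    for session, query, url in log:
        last[(session, query)] = url
    return Counter((query, url) for (_, query), url in last.items())


def session_votes(log):
    """Credit each clicked page to every query issued in the same session."""
    queries = defaultdict(set)
    for session, query, _ in log:
        queries[session].add(query)
    return Counter(
        (query, url) for session, _, url in log for query in queries[session]
    )


every_click = click_votes(CLICK_LOG)
last_click = click_votes(CLICK_LOG, last_click_only=True)
by_session = session_votes(CLICK_LOG)
```

With this sample data, the last-click rule drops the vote for `/hr/travel` because the user in session `s1` kept going, while the session rule credits `/it/remote` to the earlier "vpn setup" query as well.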

The results of the experiments were surprising. It turns out that while explicit annotation methods result in a significant improvement in document ranking, none of the implicit methods they tested had any measurable impact. This could have some real lessons for corporate information strategies. I can also think of a few companies whose products rely on inference techniques to produce relevance results in their knowledge bases. I wonder whether any of their inference strategies match the ones tested in this research?

Another interesting talk today was by Baoning Wu from Lehigh University. Maybe I'm just out of it, but I'd never heard of cloaking before. If you're in the same boat, here's the summary: Cloaking is the process of returning different pages to a search engine crawler for a given URL than you return to other users. You can imagine why people intent on getting higher search engine rankings than they deserve might want to do this.

Wu's work (see my notes or the paper) is an attempt to detect cloaking. It turns out that you can't look at a single file. You have to compare the page the search engine crawler sees with the page that is returned to a browser. But because some sites serve up pages with changing content, you can't even just look at two; you need four to get good results.

Of course the problem is that if Google, Yahoo, and others had to grab each page four times, they'd choke and so would many Web sites. Wu created a technique that first filters pages by looking at two pages for comparison and then does the four-way test on the ones that aren't filtered out. The filter cuts out about 90% of the pages--a significant reduction. The overall technique is 96% accurate in detecting cloaked pages in their tests.
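
The shape of that two-stage approach can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Wu's method: the paper's second stage is a trained classifier, whereas the sketch substitutes a simple term-set comparison, assumes the four copies are two crawler fetches and two browser fetches, and uses made-up thresholds, function names, and fake fetchers.

```python
import itertools
import re


def terms(page):
    """Lower-cased set of word-like tokens in a page."""
    return set(re.findall(r"[a-z0-9]+", page.lower()))


def passes_filter(crawler_copy, browser_copy, min_diff=3):
    """Stage 1 (two fetches): keep a URL only if the two views differ noticeably."""
    return len(terms(crawler_copy) ^ terms(browser_copy)) >= min_diff


def looks_cloaked(crawler_copies, browser_copies, min_diff=3):
    """Stage 2 (four copies): count terms that are stable within one view
    but never show up in the other view. A stand-in for a real classifier."""
    c1, c2 = (terms(p) for p in crawler_copies)
    b1, b2 = (terms(p) for p in browser_copies)
    crawler_only = (c1 & c2) - (b1 | b2)
    browser_only = (b1 & b2) - (c1 | c2)
    return len(crawler_only) + len(browser_only) >= min_diff


def detect(url, fetch_as_crawler, fetch_as_browser):
    """Fetch twice, filter, and fetch twice more only for surviving URLs."""
    c1, b1 = fetch_as_crawler(url), fetch_as_browser(url)
    if not passes_filter(c1, b1):
        return False  # filtered out after only two fetches
    c2, b2 = fetch_as_crawler(url), fetch_as_browser(url)
    return looks_cloaked((c1, c2), (b1, b2))


# Fake fetchers for illustration only; a real crawler would vary the user agent.
_counter = itertools.count()


def rotating_page(url):
    """An honest page whose content changes on every request."""
    n = next(_counter)
    return f"news today item{n} item{n + 100} item{n + 200}"


static_result = detect(
    "http://static.example",
    lambda u: "welcome to our site",
    lambda u: "welcome to our site",
)
dynamic_result = detect("http://dynamic.example", rotating_page, rotating_page)
cloaked_result = detect(
    "http://cloaked.example",
    lambda u: "cheap pills casino loans best deals",
    lambda u: "welcome to my homepage",
)
```

The rotating page is the interesting case: its two views differ enough to get past the cheap filter, but nothing is consistently present in one view and absent from the other, so the four-copy check clears it. That is the reason two copies alone aren't enough.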

They crawled the dmoz Open Directory Project pages and reduced the 4.3 million candidate pages to just under 400,000 URLs that needed to be looked at further. The classifier found 46,000 URLs that used cloaking. The ODP test also showed that, not surprisingly, some categories, like Arts and Games, are more likely to contain cloaked pages than others, like News.

I also went to a panel discussion on freeing data and one on identity management. Probably the highlight of the conference so far, though, was the tour of Edinburgh Castle, even if the food was lousy.