The study, conducted by researchers at IBM, Compaq and AltaVista, is to be presented at scientific conferences next week. It builds on previous research into the structure of the World Wide Web and argues against the widely held impression that the entire Internet is highly interconnected.
The researchers used AltaVista's Web crawler to trace more than 200 million Web pages in May and October 1999, following the 1.5 billion links embedded in those pages. That sample is just a fraction of the estimated billion-plus pages on the Web, but it dwarfs the 40 million pages used for previous studies.
On the basis of their analysis, the researchers set out a "Bow Tie Theory" of Web structure:
The central core, the knot of the bow tie, represents Web pages that are interconnected so well that you can eventually get from any page in the core to any other page just by following Internet links. Examples of core pages would include the home pages for IBM.com and MSNBC.com, said Nam LaMore, an IBM spokesman. This "strongly connected core" makes up just 30 percent of the entire Web sample.
Another 24 percent represents "origination pages." These are pages with links that you can eventually follow into the core - but which cannot be accessed through links from the core. One example is a personal Web page about your pet that includes links to online pet stores.
"You point to them, but no one (in the strongly connected core) is pointing back at you," LaMore said.
Yet another 24 percent consists of "destination pages" that can be accessed from links in the connected core but do not link back to the core. One example is a research paper buried deep on a university or corporate Web site. Such a page "could be on IBM.com/research/projects/almaden and on and on - and finally here's where it dumps you," LaMore said.
The other 22 percent is completely disconnected from the central core: These pages are either "tendrils," connected by links only to pages in one of the other categories; "tubes," which link origination and destination pages without going through the core; or "islands" not linked to the rest of the Internet at all. An example of an "island" would be a group of student or family Web pages that link only to one another.
The only way to find such pages would be to know the address in advance. Even most search engines would not be able to find such an island, unless it had been linked to the rest of the Internet at some point in the past.
Moreover, the researchers found, the proportions for these four categories remained constant between the May and October surveys, even though the number of Web pages grew substantially.
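The four categories can be illustrated with a small graph search. The following Python sketch is only an illustration, not the researchers' method: the function names, the toy link table and the ".example" addresses are all invented here, and it assumes you already know one page that sits in the core. It finds every page the core can reach and every page that can reach the core. The overlap is the core, and whatever is left over is lumped together as "other" (tendrils, tubes and islands are not separated).

```python
from collections import deque

def reachable(start, adj):
    """Return every page reachable from start by following the links in adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in adj.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links, core_page):
    """Classify pages relative to the connected core that contains core_page.

    Assumption: core_page is known to lie in the core; this sketch does not
    search for the largest core on its own.
    """
    pages = set(links)
    reverse = {}  # the same links, pointing backward
    for src, targets in links.items():
        for dst in targets:
            pages.add(dst)
            reverse.setdefault(dst, set()).add(src)

    forward = reachable(core_page, links)     # pages the core can reach
    backward = reachable(core_page, reverse)  # pages that can reach the core
    core = forward & backward
    return {
        "core": core,
        "origination": backward - core,
        "destination": forward - core,
        "other": pages - forward - backward,
    }

# Hypothetical six-page Web, made up for this example.
toy_links = {
    "pets.example": ["store.example"],
    "store.example": ["portal.example"],
    "portal.example": ["store.example", "paper.example"],
    "paper.example": [],
    "family-a.example": ["family-b.example"],
    "family-b.example": ["family-a.example"],
}
parts = bow_tie(toy_links, "portal.example")
```

In this toy Web the pet page lands in the origination group, the research paper in the destination group, and the two family pages, which link only to each other, end up in "other" as an island.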
Previous studies on the Web's structure, or topology, seemed to suggest that most randomly selected pairs of pages would be separated by, say, just 19 clicks.
"Our experimental evidence reveals a rather more detailed and subtle picture: Significant portions of the Web cannot at all be reached from other significant portions of the Web," the researchers wrote. In many other cases, two Web pages can be bridged only by going through hundreds of clicks, they said.
If you picked two random pages and tried to click from one to the other, "there's a 75 percent chance that you will never get there," LaMore said.
If a path did exist, the average click separation would be 16, the researchers said. But if there were two-way links between the two sites - in other words, if a path existed not only from page A to page B, but also from page B back to page A - the average number of clicks would fall to about seven.
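Both kinds of figure, the chance that a path exists at all and the average number of clicks when it does, can be computed on a small example with a breadth-first search from every page. The Python sketch below is a hedged illustration under invented assumptions: the four-page link table and the function names are hypothetical, and the numbers it produces have nothing to do with the study's data.

```python
from collections import deque

def click_distances(start, links):
    """Return the fewest clicks needed to get from start to each reachable page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def path_stats(links):
    """Return (share of ordered page pairs joined by a path, average clicks on those paths)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    total_pairs = len(pages) * (len(pages) - 1)

    lengths = []
    for page in pages:
        dist = click_distances(page, links)
        lengths.extend(d for target, d in dist.items() if target != page)

    connected = len(lengths) / total_pairs if total_pairs else 0.0
    average = sum(lengths) / len(lengths) if lengths else None
    return connected, average

# Hypothetical four-page Web: d links to a, a links to b, b links to c.
small_web = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
connected, average = path_stats(small_web)
```

Because the links run one way only, just half of the 12 ordered pairs in this toy Web are joined by a path, even though every page is attached to the same chain.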
For e-commerce sites, the study underlined the importance of being on the Web's main thoroughfares, with links both to and from your site, rather than sitting at the end of the road, said Compaq spokeswoman Eileen Quinn.
LaMore said the findings might also promote new strategies for Web surfing. Most people now use search engines to find particular sites or topics. But search engines such as AltaVista and Google also have the capability to search for pages that link to a specified page. So if you were interested in pets, you could look for all the sites that link to a particular online pet store.
Such tools could, in effect, reverse the Web's one-way traffic.
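The idea behind such a "who links to this page" lookup can be sketched by turning a table of outgoing links inside out. The Python below is a minimal illustration and makes no claim about how AltaVista or Google implement the feature; the function name and the ".example" addresses are invented for the purpose.

```python
def backlinks(links):
    """Invert a table of outgoing links into a table of incoming links."""
    index = {}
    for src, targets in links.items():
        for dst in targets:
            index.setdefault(dst, set()).add(src)
    return index

# Hypothetical crawl results: each page mapped to the pages it links to.
crawl = {
    "pets.example/rex": ["petstore.example"],
    "pets.example/tabby": ["petstore.example", "vet.example"],
    "petstore.example": [],
}
incoming = backlinks(crawl)
pet_fans = incoming["petstore.example"]
```

Here `pet_fans` holds the two personal pet pages that point at the store, the pages a surfer could never find by following links outward from the store itself.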
"The essence of surfing right now is one-way. ... If a browser were to have reverse-surfing capability, then you actually have more resources available to you than you do now," LaMore said. He also raised the possibility of cashing in on such links.
"If you know who's linked to you, then perhaps you know your content is valuable. (You might say) 'Hey, let's throw up a royalty, a fee for pointing to me,' " he said.
One of the researchers behind the earlier studies on the "19 clicks of separation" said he didn't dispute the new findings. Although he had not yet seen the full details of the new study, Notre Dame physicist Albert-Laszlo Barabasi acknowledged that his own work "used a poor man's method" to survey the Web.
"It's really good news that someone finally took the energy to map out the World Wide Web and look at its structure," Barabasi told MSNBC.com via telephone from Budapest, where he is on sabbatical.
He emphasized that his 19-click figure was an average, involving wide variation.
"The distance between Yahoo and everybody is about two or three. But take, for example, my Web page and try to find it without a search engine," he said half-jokingly. He said he would look forward to the study's formal presentation at the 9th International World Wide Web Conference in Amsterdam.
"I sentimentally believe in it, because the World Wide Web is what we call a directed network, in the sense that you can get from A to B, but you can't necessarily get from B to A," he said.
Like Barabasi's research, the new study found that Web interconnectedness followed the same sort of distribution found in organic systems - and even in sociological networks. A classic example is Hollywood, where only a few at the top are part of the in-crowd and thousands at the bottom toil in obscurity.
"The Hollywood actor network has the same structure as the World Wide Web," Barabasi said.