The Web's bigger than you think

A new statistical survey estimates that the World Wide Web contains at least 320 million pages - far more than previously thought.

What's more, the researchers say their results suggest that any one Internet search engine covers "just a fraction" of the Web.

Gauging the World Wide Web is like trying to corral a mushroom cloud: Some say the Web could grow by 1,000 percent in just a few years. Even the researchers behind the new study, published in Friday's issue of the journal Science, acknowledge that their results are merely a snapshot of a fast-changing phenomenon.

But Steve Lawrence and C. Lee Giles of the NEC Research Institute contend that their effort ranks among the most scientifically sound surveys of Web size and coverage.

"We've put effort into making this as accurate as we can," Lawrence said.

Although the Web now serves as the world's greatest information resource, the biggest challenge is finding the precise information you need, a problem that has fueled a lucrative boom in Internet search engines. Since their inception, search-engine sites have ranked among the most frequently visited spots on the Web.

Lawrence and Giles analyzed the coverage of six full-text search engines - AltaVista, Excite, HotBot, Infoseek, Lycos and Northern Light - not only to see how they compared but also to derive their figures for total Web size.

Yahoo, a well-known search engine, did not figure directly in their calculations because its much smaller index is constructed manually rather than by using "crawler" software. However, Yahoo also returns results from AltaVista.

The researchers analyzed results from 575 search-engine queries made in December under a rigorous set of constraints. For example, only documents that could be downloaded and actually contained the query terms were counted.

Then Lawrence and Giles looked at the overlap between pairs of search engines. By comparing the proportions of overlap, they could derive an estimate for the total size of the "indexable Web" - which would not count documents hiding behind search forms, password-protected pages or other documents excluded from Web indexing.
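For readers who want the mechanics, here is a minimal sketch of that overlap ("capture-recapture") calculation. The function names and all numbers below are invented for illustration; the actual study aggregated overlap across its 575 filtered queries rather than plugging in single index sizes.

```python
# A minimal sketch of the capture-recapture style estimate the article
# describes. All figures here are hypothetical, chosen only to show the
# arithmetic; the real study filtered results first (pages had to be
# downloadable and actually contain the query terms).

def estimate_web_size(size_a: int, size_b: int, overlap: int) -> float:
    """Estimate total population size from two overlapping samples.

    If engine A indexes size_a pages and engine B independently indexes
    size_b pages, the fraction of A's pages also found in B estimates
    B's coverage of the whole Web:
        overlap / size_a ~= size_b / total
    Rearranging gives:
        total ~= size_a * size_b / overlap
    If the engines are NOT independent (both favoring popular pages),
    overlap is inflated and this estimate runs low.
    """
    return size_a * size_b / overlap

# Hypothetical: two indexes of 110M and 100M pages sharing ~34.4M pages
# would imply an indexable Web of roughly 320 million pages.
total = estimate_web_size(110_000_000, 100_000_000, 34_375_000)
print(f"estimated indexable Web: {total / 1e6:.0f}M pages")

# Each engine's coverage then follows by simple division.
for name, size in [("EngineA", 110_000_000), ("EngineB", 100_000_000)]:
    print(f"{name}: {size / total:.0%} of the indexable Web")
```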

The numbers game
Comparing the two biggest search engines - AltaVista and HotBot - yielded the estimate of 320 million pages. The actual number is probably even higher, Lawrence and Giles said, because of the limitations under which they conducted their analysis and because the engines probably do not index pages independently: if two engines tend to index the same popular pages, their overlap is inflated and the size estimate comes out low.

Previous estimates of the Web's size ranged from 100 million to 200 million pages.

Using the new figures, Lawrence and Giles estimated that HotBot covered 34 percent of the indexable Web, with AltaVista at 28 percent, Northern Light at 20 percent, Excite at 14 percent, Infoseek at 10 percent and Lycos at 3 percent.

Lawrence said he has not discussed the figures with search-engine companies because the study was being held for release Thursday. But he has shared the results with other computer scientists, and he said "most people do seem to be surprised that the coverage of the search engines is smaller than what they think."

Size doesn't matter

He also stressed that size wasn't everything.

"It's possible that some of the other engines do not have a technology that scales as well," Lawrence said. "Or perhaps they could index the Web more comprehensively, but they choose to devote resources towards other areas such as improving the order of the results."

The researchers said there may be a tradeoff between database size and the frequency of updates. Their figures showed that HotBot served up the highest percentage of invalid links (5.3 percent), while Lycos had the best percentage on that score (1.6 percent). The other figures for invalid links were 2 percent for Excite, 2.5 percent for AltaVista, 2.6 percent for Infoseek and 5 percent for Northern Light. The figures on invalid links vary widely over time, however.
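The dead-link percentages come down to a simple measurement: fetch a sample of an engine's result URLs and count how many no longer resolve. A rough sketch follows, with a placeholder URL list, since the article does not detail the study's exact procedure.

```python
# Rough sketch of measuring an invalid-link rate for a list of result
# URLs. The sample list is a placeholder; real measurements would use
# actual search-engine results and would vary over time, as noted above.
from urllib.request import Request, urlopen
from urllib.error import URLError

def invalid_link_rate(urls: list[str], timeout: float = 10.0) -> float:
    """Return the fraction of URLs that no longer resolve to a page."""
    if not urls:
        return 0.0
    dead = 0
    for url in urls:
        try:
            # HEAD avoids downloading the full page body.
            req = Request(url, method="HEAD")
            with urlopen(req, timeout=timeout):
                pass
        except (URLError, OSError):
            dead += 1
    return dead / len(urls)

sample = ["http://example.com/", "http://example.com/missing-page"]
print(f"invalid links: {invalid_link_rate(sample):.1%}")
```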

Lawrence said the research was conducted independently with no funding from any search-engine company. He said NEC was working on techniques for searching the Web but declined to provide further details.
