Here's how to speed up the Internet - create an open index

By | July 15, 2010, 5:55pm PDT

Summary: An easy way to speed up the Internet by having an open index that all search engines can use. A Google exec says it’s a good idea…

When I was in Brazil recently, I met with Berthier Ribeiro-Neto, head of engineering at Google Brazil. During our conversation I mentioned an idea I had about making the Google index into an open database that anyone could access. I said that this could dramatically speed up the Internet.

He said it was a good idea and that I “should write a position paper” on this subject.

(As a further thought, maybe it could also serve to take away some of the heat Google is feeling lately, in terms of its index rankings potentially favoring its own business interests.)

Here is my logic:

My server logs show that 20 different robots visit my site; one of the more frequent is the Googlebot. Each of these robots is trying to create an index of my site.

Each of these robots takes up a considerable amount of my resources. For June, the Googlebot ate up 4.9 gigabytes of bandwidth, Yahoo used 4.8 gigabytes, while an unknown robot used 11.27 gigabytes of bandwidth. Together, they used up 45% of my bandwidth just to create an index of my site.

These robots are all seeking the same information and they use nearly one-half of my bandwidth, slowing the site for all my readers. The same is true for tens of millions of web sites.

What if there was a single index that anyone could access?

You would get an immediate speed increase in the Internet for no additional investment in infrastructure.

Google and others could perform their own analysis of the index using their secret algorithms. After all, the value is not in the index; it is in the analysis of that index.

Mr. Ribeiro-Neto said, “That’s a good idea. You probably wouldn’t even need to spider the web sites.”

Each web site could update the central index automatically each time something changed. This would result in a massive savings in bandwidth used by dozens of robots scouring the Internet for new information.
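As a rough illustration of the "update the index only when something changes" idea, here is a minimal Python sketch. It assumes a site keeps a digest of each page it last reported and sends only the pages whose digest differs. The function names, the in-memory `site` dictionary, and the idea of a push payload are all hypothetical; no such central index or protocol exists in the article.

```python
import hashlib


def page_digest(content: str) -> str:
    """Return a SHA-256 hex digest of a page's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_pages(pages: dict, last_sent: dict) -> dict:
    """Return {url: content} for pages that differ from what was last pushed.

    `last_sent` maps url -> digest and is updated in place, so calling this
    again without editing the site yields an empty update.
    """
    updates = {}
    for url, content in pages.items():
        digest = page_digest(content)
        if last_sent.get(url) != digest:
            updates[url] = content
            last_sent[url] = digest
    return updates


# Hypothetical site content, keyed by path.
site = {"/": "Home page", "/about": "About us"}
sent = {}

first = changed_pages(site, sent)    # both pages are new to the index
site["/about"] = "About us, updated"
second = changed_pages(site, sent)   # only /about needs to be sent
third = changed_pages(site, sent)    # nothing changed, nothing to send
```

In this sketch the bandwidth cost is proportional to what changed rather than to the size of the site, which is the saving the paragraph above describes.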

What if Google opened up its index to the world as a goodwill gesture because it has the best index? It could still maintain the privacy of its algorithm but everyone would have the same information on which to perform their analysis.

It would show that there was nothing unusual or unethical in how Google collects information for its index. This might relieve some of the pressure it has come under this week to reveal more about how its search service is presented.

Also, Google founders were once strong advocates that the search index should be run as a non-profit.

From page 39 of “Inside Larry and Sergey’s Brain” by Richard Brandt:

Andrei Broder, who led the team that created the AltaVista search engine, the best of its time, talks about meeting Larry and Sergey. “When the discussion turned to the topic of making money from the technology, Broder found that Page had a profound difference of philosophy on the subject. ‘It was a very funny thing about Larry,’ Broder recalls. ‘He was very adamant about search engines not being owned by commercial entities. He said it should all be done by a nonprofit. I guess Larry has changed his mind about that.’”

Brian Lent, now CEO at Medio Systems:

“The problem with the Google search engine at the time, Lent recalls, is that Larry and Sergey didn’t want to commercialize it, and Lent was anxious to become an entrepreneur. Their mantra at the time was more socialistic than entrepreneurial. ‘Originally, “Don’t be evil” was “Don’t go commercial,”’ says Lent.”

- - -

Please see:

- The NYTimes: The Google Algorithm

- FT.com / Comment / Opinion - Do not neutralize the web’s endless search (Subscription required.)



Tom Foremski reports on the business and culture of Silicon Valley at the intersection of technology and media.

Disclosure

Tom Foremski

Tom Foremski is the editor and publisher of Silicon Valley Watcher and Silicon Valley Watch. Tibco Software is an advertiser.

Biography

Tom Foremski

In May 2004, Tom Foremski became the first journalist to leave a major newspaper, the Financial Times, to make a living as a full-time journalist blogger. He writes the popular news blog Silicon Valley Watcher--reporting on the business of Silicon Valley.

Tom arrived in San Francisco in 1984, and has covered US technology markets for leading computer journals around the world.

21
Comments

RE: Here's how to speed up the Internet - create an open index
yantangseo 17th Sep
@IamJayaTiwari
Amazing! 3 I download it replica watches happy
0 Votes
+ -
I am not a technical expert in these things but what you are saying is a nice concept and we need better techniques for fast internet access.
Best wishes .
0 Votes
+ -
Several reasons why it was never done
terry flores 16th Jul 2010
Search indexes and spiderbots are technological afterthoughts in the development of the World Wide Web. While there are now some conventions for them, there is no intrinsic mechanism to support search. It is still a relatively brute-force operation. Thus the massive consumption of bandwidth. And it happens without any active participation from the webmaster, who may neither know nor care what is happening, until the bandwidth charges start to show up.

The very nature of search has changed a lot. Google is not so much a search engine anymore as it is a private mirror of the entire web. It not only searches, it keeps copies of everything it can get its hands on. That makes a lot of people uncomfortable, and outrages a lot of content providers. Yes, there are things still not accessible like databases, but even those are now searched with increasing sophistication.

When you think about it, why not take a step farther: if Google keeps a copy of your website, just publish the stuff once, then let everybody access Google for it and leave your stuff alone? That would certainly reduce your bandwidth impact!
0 Votes
+ -
Contributr
Good idea...
foremski 19th Jul 2010
Eric Schmidt once said that Google's goal was to eventually run everyone's web site.
0 Votes
+ -
No index for you
iPad-awan 16th Jul 2010
"What if Google opened up its index to the world as a goodwill gesture because it has the best index?"

Google talks a great deal about doing what's right but what they really mean is doing what's right for Google and the public can eat cake.

btw, I like the originality of this article. Good job.
0 Votes
+ -
Amateur thoughts
Hameiri 16th Jul 2010
I really don't know much about this, but it seems to me that when you send out your bot to get the info, you have a lot of control over what you gather.

If, on the other hand, someone else gathers the info, or the websites supply it, they will have more control, and you have to wonder about the integrity of the index.

In other words, if you gather information, and the entity doesn't know exactly how you get it, then the info is more likely to be objective. If the sites send you the info, then they can manipulate what you get.
0 Votes
+ -
Contributr
How about just one bot?
foremski 19th Jul 2010
Yes, you make a good point about checking the veracity of the info provided. If there was one "openbot" checking the web sites and then possibly marking down the sites that try to cheat, then that should work.
0 Votes
+ -
Single index means single point of vulnerability.

As to an open index that all web content publishers could post to, we'd be back to the days of malicious sites that loaded up their META element with Britney Spears to game the search technologies of the day.

Plus, Google isn't really indexing the site. Indexing means finding relations, assigning a key to the relation, and then saving the keys in a tree or tree-algorithmic structure. While searches on the index are log n, the building of the index is n log n, which is tolerable. However, the index represents one hierarchical view of the data, and one cannot create and maintain all high-performance indexing schemes on such a loosely coupled data set as the world wide web.
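To make the complexity claim above concrete, here is a minimal Python sketch of an index that is built by sorting (n log n) and searched by binary search (log n). It uses a sorted list with the standard `bisect` module as a stand-in for the tree structure the comment mentions, and it is only an illustration of the cost argument, not a description of how any real search engine works. The document ids and texts are made up.

```python
import bisect


def build_index(docs: dict) -> list:
    """Build a sorted list of (term, doc_id) pairs.

    Sorting n pairs costs O(n log n), the build cost named above.
    """
    pairs = {
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    }
    return sorted(pairs)


def lookup(index: list, term: str) -> list:
    """Return the doc ids containing `term`.

    Binary search finds the first matching pair in O(log n); the loop
    then walks only the matching entries.
    """
    results = []
    i = bisect.bisect_left(index, (term,))
    while i < len(index) and index[i][0] == term:
        results.append(index[i][1])
        i += 1
    return results


# Hypothetical documents keyed by numeric id.
docs = {1: "open index for the web", 2: "the web is large", 3: "open data"}
index = build_index(docs)

open_hits = lookup(index, "open")
web_hits = lookup(index, "web")
no_hits = lookup(index, "missing")
```

The sorted list orders entries by term only, which illustrates the comment's point that one index gives one view of the data; a different kind of query would need a different structure.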

So Google doesn't. I think the content is not indexed, but the searches are. Content is harvested and given to a server which waits for a query. Upon getting that query the query and content are mapped to a data structure which is handed up to the controlling server for reduction. If the content host gets the same query again, it pulls the past results, runs the map on new content, and flags the content that has been removed from the web as "cached." Between instances of the same query, the content may change very little. Web content changes may be distributed across the servers in a balanced manner, so no one server has a higher burden of checking for new material.

There's sorting involved in delivering results; radix sorts are the fastest, with QuickSort next. Since QuickSort performs best on randomly ordered data, it would also follow that results from the server, at the reduce stage, should be as unordered as possible.

So, your idea would require a data center, an unbiased and comprehensive real-time way for content to get into that center, probably requiring that the center pull the content for integrity reasons. However, the algorithms that enable quick searching are necessarily on the servers in the center, and at this point, you need a content distribution algorithm and mapreduce implementation that all search vendors may consume. This gets dicey, especially as the search vendors rely on their implementations to provide competitive value. The whole thing would have to be run by an organization that is recognized, respected, and supported by all the governments of the world.

Meanwhile, Google and Bing give us fairly quick results at no charge to us users. Competition means there are more players sending out robots and an incentive to be more responsive to us customers. I'm not a free-market-in-all-things guy, but here it makes sense to me.
0 Votes
+ -
More
DannyO_0x98 Updated - 16th Jul 2010
Since your real issue is the bandwidth usage of robots, we should think about the reasons. One data pool for all, as I argued above, has no economic incentives and has real complications with the political structure of the world.

So, is there a way to have many robots, but lower consumption? Sure, they don't grab all of your content on each visit. Well, how do they know what changed unless they compare it to an image from the last visit?

Good question. One way would be to make the robots.txt a bit more sophisticated, perhaps pointing the robots to a diff file which would be maintained by the host and updated when the pages update.

This does bring us back to the integrity issue, that is, a site may say "My content is now...." in the diff-for-robots file when the changes weren't made and the actual content is malicious.

The response to that is perhaps the only thing I'll quote approvingly from Reagan: trust but verify. Perhaps at random visits, the robots pull full content and rely on the diff-for-robots file the rest of the time.

Now that protocol would seem to address your problem more effectively and immediately.
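The "trust but verify" protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site content and its diff file are modeled as plain dictionaries, cost is counted in characters rather than real network bytes, deletions are not modeled, and the `crawl` function and `audit_rate` parameter are hypothetical names, not part of robots.txt or any existing crawler.

```python
import random


def crawl(site_pages, site_diff, mirror, audit_rate=0.1, rng=random):
    """Update `mirror` (url -> content) from one visit to a site.

    site_pages: the site's real content; reading all of it is the costly path.
    site_diff:  what the site claims has changed since the last visit.
    Returns (chars_fetched, cheated).
    """
    if rng.random() < audit_rate:
        # Audit visit: pull everything and compare it with what the
        # diff file would have led the robot to believe.
        claimed = dict(mirror)
        claimed.update(site_diff)
        cheated = claimed != site_pages
        mirror.clear()
        mirror.update(site_pages)
        return sum(len(c) for c in site_pages.values()), cheated
    # Normal visit: trust the diff file and fetch only what it lists.
    mirror.update(site_diff)
    return sum(len(c) for c in site_diff.values()), False


# An honest site: the diff lists the one new page.
honest_mirror = {"/": "home"}
honest_pages = {"/": "home", "/new": "fresh post"}
honest_cost, honest_cheated = crawl(
    honest_pages, {"/new": "fresh post"}, honest_mirror, audit_rate=0.0
)

# A dishonest site: content changed but the diff claims nothing did.
lying_mirror = {"/": "home"}
lying_pages = {"/": "malware"}
lying_cost, lying_cheated = crawl(lying_pages, {}, lying_mirror, audit_rate=1.0)
```

The `audit_rate` values of 0.0 and 1.0 are used here only to make the two cases deterministic; a real robot would pick something in between, trading bandwidth against how quickly a cheating site is caught.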
0 Votes
+ -
Not realistic at all
Rick_R Updated - 16th Jul 2010
"These robots are all seeking the same information and they use nearly one-half of my bandwidth, slowing the site for all my readers."

Realistically, given the capacity of today's servers, if your site only provided about 21 GB the entire month, it's extremely doubtful the spiders caused a noticeable slowdown. Also, most of the major spiders deliberately limit how much data they pull in a given time specifically so they won't cause a noticeable slowdown (or be blacklisted by the hosting company).

There's a fundamental flaw in your reasoning, which is that this is a capitalist free-market economy. What you're talking about is fundamentally a monopoly, no matter who owns or controls it.

Also, imagine if "everyone got together" for a single index and then some country did what China did with Google. Every country that wanted a political whipping boy would find an excuse to censor the single source--in the U.S. it would be alleged terrorism or child porn, in Moslem countries it would be alleged insults to Islam, in Germany it would be alleged pro-Nazi or Holocaust denial, etc., etc. Real or imagined, governments would eventually control the only source of information.

Plus, in capitalist societies there will always be someone claiming "We do it cheaper, faster, more extensively, better," or whatever. There would always be someone trying to "reinvent the wheel" and people buying into it "not to miss out" on "the latest and greatest". Unless strong laws were passed and enforced making it a serious crime to spider websites, there would be no way to prevent competitors.

And even if "everyone got together", eventually different countries would want to have their own index out of national pride or distrust of whoever they considered to be in charge of "the" index.
0 Votes
+ -
I agree with Rick_R here
pgit 16th Jul 2010
@Rick_R

"single point of failure" isn't limited to the technology, I'd give the 'uptime' of human nature a solid 30 seconds here before this idea would blow apart.
0 Votes
+ -
Maybe do things in reverse...
hsdajr 16th Jul 2010
Create a mechanism whereby each website indexes itself when something has changed, and the bots just look at the index, and if the checksum has changed from the last time, it gets it, otherwise it moves on to the next website.
Oops sorry, I see that's already been suggested, didn't scroll enough!
0 Votes
+ -
I'm all for an open index, but with a catch...
adornoe@... Updated - 16th Jul 2010
the catch being that the website with the index doesn't get to own the indexed data.

Essentially, the application and the algorithms and database structure would belong to the indexing website, but the actual data would be owned by the creators or people that the data relates to.

In that context, if a columnist had an article being indexed and/or hosted by an "index"ing company, the company would have all rights to use the article/information to monetize it, but the columnist would still own the "data" within the article. If a columnist allowed a website to index his columns, that would be an implied authorization for the indexing site to use the column. However, the indexing site would have to be offering something of value to the columnist in return, such as part of the revenues brought in by traffic to the column, or links (traffic generator) to the columnist's own website where the column is also being hosted.

The columnist would have the right to "turn off" access to his columns via the indexing website. Indexing should only serve as a means to get at data/information, but it should never be a means towards ownership of someone else's data/information.
0 Votes
+ -
How about NOT ALLOWING these bots to
janitorman 16th Jul 2010
access your website? I know there must be a way to turn them away. If you want your site "indexed" do it yourself, and put a site search ON THE SITE ITSELF rather than letting a third party do it. Then, all a search bot would have access to, would be information you allowed them.. maybe a monthly update that says what's on your site SENT TO THEM, not letting them come to you!... makes sense to me, but then again I'm old-fashioned.
Why should Google share its storage, bandwidth, and processing power with its competitors?
I have to admit, that's a bit of good writing and at first blush seems like a great idea. But after a few minutes' thought back inside the box, I came up with:

Single-sourced Point Access? NEVER! No!
- Single server location only? NO!
- Apparently most people so far haven't heard of the robots.txt file, which Google accepts and all the big searchers do as it's a benefit to both Google and the site owner. And that's not the only way to avoid bots if you really don't want to be scanned, believe it or not.
- Even today Google would do well by just trimming down their presentations to minimize the amount of junk that comes up in a search.
- Human nature will eventually turn such an enterprise into a pay-for service and fail to produce the euphoria it is intended to.
- Trying to stop all the search engines now in existence would be a monumental and prohibitively expensive task that could never come about. I'd be one of the first to start mirroring such a site, and then working it together into a new, better operation, knowing the effort cannot be completed.

- Can you just IMAGINE the number of home-search-bots there would be if Google became the only home of search-data? I know I'd be working on one! Well, not really I guess; I'd wait for one to show up at a forge or similar and use those.
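For readers who have not met the robots.txt file mentioned in the list above, here is a small Python sketch of how a well-behaved robot might read one, using the standard library's `urllib.robotparser`. The rules, the bot name `SomeOtherBot`, and the paths are made up for illustration; the file only expresses the site owner's wishes, and compliance is up to each robot.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: Googlebot may crawl everything except
# /private/, and every other robot is asked to stay away entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite robot checks before fetching each URL.
googlebot_article = parser.can_fetch("Googlebot", "/articles/open-index.html")
googlebot_private = parser.can_fetch("Googlebot", "/private/notes.html")
other_article = parser.can_fetch("SomeOtherBot", "/articles/open-index.html")
```

A site owner worried about robot bandwidth could use rules like these to admit only the few robots they care about.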

What I'd really like to see a search engine come up with is a way for me to say I only want to see, say, 80% relevance ratings, or 90%, or 50%, etc..
0 Votes
+ -
Who is going to pay for it?
betelgeuse68 17th Jul 2010
All great and dandy but Google makes money to do what it does. I don't need more taxes to support your suggestion and unfortunately that's what it will likely take. I would rather leave it to private enterprise, e.g., Google.

-M
"These robots are all seeking the same information and they use nearly one-half of my bandwidth"

Not necessarily. Some robots are just interested in the text, some may be interested in the images (for those image searches), some may be interested in other content.

In any case - this isn't gonna happen, and I don't think we want it to happen, either. It would simply create a monopoly.

And frankly, there's plenty of bandwidth to go around.

It's not as if people are going to notice the fraction of a second difference - as far as I can tell, your pages load instantly.
This seems like a good idea, but I have noticed page errors in my site logs for content that was removed from the site six months ago. The reason, I think, is that Google caches most of the older material for faster searches, and there is a long delay before that cache is refreshed, especially when pages are removed two or three directories deep and the final HTML links remain. This is still the case even when site maps are frequently submitted to Google.
The search engines still have a big problem to overcome, and the person who finds a way will become very famous, if not rich: indexing JPG and other picture files that are not pointed to by HTML links and therefore cannot have any alt tags to describe their content or existence, the file names alone not being enough.
If an open source index is created, it should be the property of the W3C. That way no one can game the system. How the index is managed is a problem for the engineers to sort out. I am not even going to pretend I know about that!
