The googlebot can find pages not directly linked

Summary: Google can find pages and files that have no direct links to them. If it has a web address, Google can find it.

TOPICS: Google

Dell's confidential specs for future notebooks were discovered through Google and distributed. Elinor Mills writes a web primer on how to avoid the problem by using a robots.txt file and by not linking to sensitive materials.
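The robots.txt approach the primer mentions looks roughly like this; the directory names below are placeholder examples, and note that robots.txt is only a request honored by well-behaved crawlers, not access control:

```text
# Ask compliant crawlers not to index these areas.
# "/internal/" and "/drafts/" are placeholder paths.
User-agent: *
Disallow: /internal/
Disallow: /drafts/
```

Anything truly sensitive should be behind authentication, since a crawler (or a person) that ignores robots.txt can still fetch listed paths.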

However, did you know that the Googlebot can find pages that have no direct link from the home page of a web site? That's what a Google engineer said at a search conference a couple of years ago.

If I find the reference I will post it, but the gist was that Google has technology that can find and catalog information on a server without having to follow links.

And you would expect Google to have such technology since its mission is to index and copy all of the world's information. Not "all linked information" but "all information."

- - -

Here is an interesting discussion on this topic.


  • It's a little freaky

    Recently I've been getting form submissions from Googlebot. It sees a form with a select combo box and submits the form for every possible value of that combo. It seems to follow some JavaScript links too.
    • What are the forms for?

      Are they email newsletter signups?
  • Don't be daft

    Read what you just typed, and listen carefully to what the Google engineer said: GoogleBot can find pages not DIRECTLY linked to the HOMEPAGE. Pages *indirectly* linked-to, for example through an intermediary page deeper in your site, are all it can find.

    I.e., if you've got pages linked-to from your homepage, then more pages linked off those, GoogleBot will follow the chain of links and find pages buried one or more levels deep in your site.

    There is no known way for a web user-agent (browser, GoogleBot, whatever) to know the existence of a page unless it's linked-to somewhere, by someone. There's certainly no way to make a web-server produce a list of all files hosted on it - this would be a catastrophic security hole, and goes against the entire design of HTTP (where you only get what you ask for, and there's no mechanism for "discovery" of other resources other than links).

    Sure, there are lots of ways you can *accidentally* create a link to an orphaned page[1], but unless there's a link somewhere, it's flat-out impossible for GoogleBot to know there's a page there, short of submitting a speculative request for a page called every possible combination of alphanumeric characters in turn (called a brute-force attack in cryptography), and that would take so long it's utterly infeasible... as well as being completely pointless.

    [1] Eg, if you browse an orphaned page on your site (*you* know it's there, right?), and then follow a link away from it onto someone else's site. In this case, their web-server logs will contain the URL of the "orphan" page that linked to their site (called the "HTTP Referer" [sic]). If that site then publishes their logs (deliberately or accidentally, from a badly-configured web server), the GoogleBot can find the URL and index that page of your site. Either way, somebody publishes a link somewhere.

    GoogleBot is neither omniscient, nor magic.
    • Brute Force can be effective

      Brute force URL guessing can be quite productive in some searches, particularly those that follow an obvious format.

      When a series of interesting images appears in a web account with no publicly available index, and all of them are in the format DSCXXXXX.JPG, you can bet your ass I'm going to run a little loop and try a few thousand other possibilities, provided the image server is fast and doesn't shut you down after repeated failed requests.

      It's not too far-fetched to suppose that someone might create a bot capable of detecting the organization of page numbering schemes in data sets where at least a few known pages exist. It's also not too far-fetched to suppose that someone might preprogram a few obvious places to check, for example: when encountering DSCXXXXX.JPG, check neighboring numbers 100 out; when encountering &p=XX, check other values of &p; and so on.

      By the way: the number of sites publishing logs with links declined dramatically after someone started abusing this practice a couple of years ago in order to generate links to a website on several thousand servers, and made the code to do this available to the public. ...
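The brute-force guessing loop described in that comment can be sketched in a few lines of Python; the host name and file-name pattern are invented placeholders, and any real probing of this kind should respect robots.txt and the server's rate limits:

```python
import urllib.request


def candidate_urls(base="http://example.com/photos/DSC{:05d}.JPG",
                   start=1, count=5):
    """Generate sequentially numbered URLs to probe.

    The base pattern and range are hypothetical examples of the
    DSCXXXXX.JPG scheme mentioned in the discussion.
    """
    return [base.format(n) for n in range(start, start + count)]


def probe(urls, timeout=5):
    """Return the subset of urls that answer with HTTP 200."""
    found = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    found.append(url)
        except OSError:
            pass  # 404, connection refused, timeout: the guess missed
    return found
```

A 404 surfaces as `urllib.error.HTTPError` (a subclass of `OSError`), so each miss is simply skipped; this is exactly the "little loop" the commenter describes, not anything Googlebot is known to do.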