“Google currently does not allow outsiders to gain access to raw data because of privacy concerns. Searches are logged by time of day, originating I.P. address (information that can be used to link searches to a specific computer), and the sites on which the user clicked. People tell things to search engines that they would never talk about publicly—Viagra, pregnancy scares, fraud, face lifts. What is interesting in the aggregate can seem an invasion of privacy if narrowed to an individual.
“So, does Google ever get subpoenas for its information? 'Google does not comment on the details of legal matters involving Google,' Mr. Brin responded.”
New York Times, 28 November 2002
This is an interesting quote for them to use. Of course I would expect Google to keep this information private, but I have some news for the New York Times: I log the IP address along with not only the day, but the time! (Gasp! Shock! Horror!) Heck, nearly every webserver in existence logs this very information (Seminole is an exception here, but that's due to the environment it is meant to run in, where resources are very limited). But there is no way that Google can determine which link you clicked on, since that information doesn't go through their server. Granted, if you click on “Similar Pages” or the cached copy, then yes, they can see which pages you are interested in. And it's likely that if you use their toolbar they might see what page you selected, but not having seen it I can't say that they do.
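To show how ordinary this logging is, here is a minimal sketch that pulls the client address and timestamp out of one access log line. It assumes the Common Log Format that many webservers write by default. The sample line, the pattern name `CLF_PATTERN`, and the function name `parse_clf_line` are made up for illustration and are not taken from any particular server.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_clf_line(line):
    """Return host, timestamp, request, and status from one log line, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    # The timestamp carries the time of day, not just the date.
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "host": match.group("host"),
        "when": when,
        "request": match.group("request"),
        "status": int(match.group("status")),
    }


# Hypothetical sample line; 192.0.2.1 is a reserved documentation address.
SAMPLE = ('192.0.2.1 - - [28/Nov/2002:14:05:09 -0500] '
          '"GET /search?q=example HTTP/1.0" 200 5120')

entry = parse_clf_line(SAMPLE)
print(entry["host"], entry["when"].isoformat(), entry["request"])
```

The log only records requests made to this server. A click on a plain link to some other site is a request to that other site, so it never shows up here.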
I grant them that it is puzzling that Google sets such a long-lived cookie, but the site is still usable with cookies disabled, and modern browsers like Mozilla let you allow or disallow cookies on a per-site basis.
Thirdly, in order for Google to access the links to crawl a deep site of thousands of pages, a hierarchical system of doorway pages is needed so that crawler can start at the top and work its way down. A single site with thousands of pages typically has all external links coming into the home page, and few or none coming into deep pages. The home page PageRank therefore gets distributed to the deep pages by virtue of the hierarchical internal linking structure. But by the time the crawler gets to the real “meat” at the bottom of the tree, these pages frequently end up with a PageRank of zero. This zero is devastating for the ranking of that page, even assuming that Google's crawler gets to it, and it ends up in the index, and it has excellent on-page characteristics. The bottom line is that only big, popular sites can put their databases on the web and expect Google to cover their data adequately. And that's true even for websites that had their data on the web long before Google started up in 1999.
My experience is quite different. This weblog/journal, a database-driven site (more or less) with a god-awful number of potential links, has been crawled, and crawled deeply, by Google (and a few other search engines). I just checked: last month 11% of all hits came from Google search queries (as a share of actual visits it's probably closer to 50% or 60%). Even Mark's weblog/journal is getting spidered by Google, and while he doesn't have the traffic that I currently do, it's about par for what I had when this journal first went live.
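For the curious, the "PageRank trickles down the tree" claim can be played with in a few lines. This is only a sketch of the textbook power-iteration formula from the published PageRank paper, not whatever Google actually runs. The site shape (every page links to its children and back to the home page), the damping factor of 0.85, and the names `pagerank` and `build_tree` are all assumptions of mine.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank; the scores sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


def build_tree(branching, depth):
    """Hypothetical site: each page links to its children and back to home."""
    links = {"home": []}
    level = ["home"]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for i in range(branching):
                child = f"{parent}/{i}"
                links[child] = ["home"]      # link back to the home page
                links[parent].append(child)  # parent links down to the child
                next_level.append(child)
        level = next_level
    return links


ranks = pagerank(build_tree(branching=3, depth=3))
for page in ("home", "home/0", "home/0/0", "home/0/0/0"):
    print(f"{page:12s} {ranks[page]:.4f}")
```

In this toy model the score does shrink at each level, but the baseline term keeps it above zero for every page. It says nothing about how the toolbar rounds its display or about what the crawler chooses to fetch.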
Those who launch new websites in 2002 have a much more difficult time getting traffic to their sites than they did before Google became dominant.
People who launch a new website in 2002 (or 2003 or 2004 or for the foreseeable future) are always going to have more difficulty getting traffic to their site because the web is always growing! It's called “competition”; Google is irrelevant to this. In fact, the web is growing faster than the search engines can keep up, so in effect the web is an ever-expanding frontier, though the frontier may be a bit difficult to find at times. Methinks these people should read up on power laws and how they relate to the web (link found via Google).
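As a rough illustration of what a power law implies, the sketch below assumes traffic falls off as 1/rank (a Zipf distribution with exponent 1) and asks what share goes to the top few sites. The exponent, the site counts, and the function name `zipf_top_share` are assumptions for the example, not measurements of the real web.

```python
def zipf_top_share(total_sites, top_sites, exponent=1.0):
    """Share of traffic going to the top-ranked sites under a Zipf assumption."""
    # Assumed model: the site at rank r gets traffic proportional to 1 / r**exponent.
    weights = [1.0 / rank ** exponent for rank in range(1, total_sites + 1)]
    return sum(weights[:top_sites]) / sum(weights)


share = zipf_top_share(total_sites=1000, top_sites=10)
print(f"Top 10 of 1000 sites get {share:.0%} of the traffic")
```

Under these assumptions the top 1% of sites take a bit under 40% of the traffic, search engine or no search engine. A newcomer starts at the bottom of that list either way.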