The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, November 10, 2001

A google spiders

In checking the log files for this site I've noticed that Google has finally found it and has spent the past few days spidering through it.

There are a few thousand links for it to follow (out of what? A million potential URLs on this site? I know the Electric King James has over fifteen million URLs). For instance, there are three just for the years, plus 12 months for each year (okay, so there's only 11 for this year, but close enough), so that's 39 URLs. Each day that has an entry gets a URL of its own, and while I may have skipped a day or two here and there, let's say there's an average of 300 such days per year, so that's about 900 there. And if you assume an average of two entries per day (remember, you can retrieve the entire day, or just an entry) that's another 600 per year, or 1,800, so we're now up to nearly 3,000 URLs that Google has to crawl through (with lots of duplication).
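The back-of-the-envelope count above can be sketched in a few lines of Python. The function name and its parameters are made up for illustration, and the defaults are just the guesses from the paragraph (three years, 300 days with entries per year, two entries per day), not measured figures.

```python
def estimate_urls(years=3, days_with_entries=300, entries_per_day=2):
    """Return a per-category URL count and the overall total.

    All defaults are the rough guesses from the post, not real data.
    """
    counts = {
        "years": years,                       # one URL per year
        "months": years * 12,                 # rounds this year's 11 up to 12
        "days": years * days_with_entries,    # one URL per day with an entry
        "entries": years * days_with_entries * entries_per_day,
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    for name, count in estimate_urls().items():
        print(f"{name:8}{count:>6}")
```

With those defaults the total comes to 2,739, which is the "nearly 3,000" mentioned above.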

robots.txt for

# Go away---we don't want you 
# to endlessly spider this
# site.

User-agent: *
Disallow: /

There's a reason I don't allow web robots/spiders on the Electric King James: it would take way too long to index the site (if, indeed, the spider in question was even aware of all the possible URLs) and my machine isn't all that powerful to begin with (it being a 33MHz 486 and all). But I feel that there is a research problem lurking here that some enterprising Masters or Ph.D. candidate could tackle: how best to spider a site that allows multiple views per document.
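For the curious, here is a minimal sketch of how a well-behaved spider might interpret the robots.txt shown above, using the `urllib.robotparser` module from Python's standard library. The `allowed` helper and the example.org URL are hypothetical stand-ins; this is not how Google or any particular spider actually does it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted earlier in this entry.
ROBOTS_TXT = """\
# Go away---we don't want you
# to endlessly spider this
# site.

User-agent: *
Disallow: /
"""

# Parse the rules from memory rather than fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only to exercise the rules.
    print(allowed("Googlebot", "http://example.org/some/page"))
```

Since the rules disallow `/` for every user agent, any path on the site is off limits to a spider that honors them.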

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site:, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.