I've talked about this a bit, but if a dedicated search engine wants to successfully scan a weblog there are a few ways to go about it.
One, grab the RSS file for the weblog and index the links from that. That will allow you to populate the search engine with the permanent links for the entries. It will also allow you to properly index the appropriate entries. Google does a good job of indexing pages, but a rather poor one of indexing individual entries of a weblog, since it generally views a page as one entity and not as a possible collection of entities. So if I mention, say, “hot dogs” on the first of the month, “wet paper towels” on the fifteenth and “ugly gargoyles at Notre Dame” on the last day of the month, someone looking for “hot wet gargoyles” at Google is going to find the page that archives that month.
Which is probably not what I, nor the searcher in question, want.
Well, unless I'm looking for disturbing search request material, but I digress.
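To make the RSS idea concrete, here's a minimal sketch of the first step: pull the permanent link out of each entry in a weblog's RSS 2.0 feed. The feed text, the example.com URLs, and the function name are all made up for illustration; a real crawler would fetch the feed over HTTP and handle other feed flavors.

```python
# Sketch: extract the permanent link of every entry from an RSS 2.0 feed,
# so a crawler can index entries individually instead of whole archive pages.
# SAMPLE_RSS is a hypothetical feed; real code would fetch it over HTTP.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <item>
      <title>Hot dogs</title>
      <link>http://example.com/2003/06.html#31415926</link>
    </item>
    <item>
      <title>Wet paper towels</title>
      <link>http://example.com/2003/06.html#27182818</link>
    </item>
  </channel>
</rss>"""

def entry_links(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link") for item in root.iter("item")]

links = entry_links(SAMPLE_RSS)
```

With those links in hand, each entry can be indexed under its own permanent URL rather than under the monthly archive page.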
Even if the permanent links only point to a portion of a page — a URL ending in a fragment identifier — each one still points to a part of a larger archive page.
And somewhere on that page is an anchor tag with the ID of “31415926” which is most likely at the top of the entry in question. From there you index until you hit the next named anchor tag that matches another entry in the RSS file.
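That slicing step — index from one named anchor until you hit the next one that matches an entry in the feed — can be sketched like so. This is a deliberately naive, regex-based illustration with hypothetical function names; a real crawler would use a proper HTML parser.

```python
# Sketch: given an archive page's HTML and the fragment IDs pulled from
# the RSS feed, slice the page into per-entry chunks by splitting at the
# named anchor tags. Naive regex approach, for illustration only.
import re

def split_entries(html, entry_ids):
    # Locate the anchor tag whose id/name matches each entry's fragment.
    positions = []
    for eid in entry_ids:
        m = re.search(r'<a[^>]*(?:id|name)="%s"' % re.escape(eid), html)
        if m:
            positions.append((m.start(), eid))
    positions.sort()
    # Each entry runs from its anchor to the next matching anchor
    # (or to the end of the page for the last entry).
    chunks = {}
    for i, (start, eid) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(html)
        chunks[eid] = html[start:end]
    return chunks

page = ('<a id="31415926"></a><p>hot dogs</p>'
        '<a id="27182818"></a><p>wet paper towels</p>')
chunks = split_entries(page, ["31415926", "27182818"])
```

Each chunk can then be indexed under the corresponding permanent link from the feed.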
And if you hit a site like mine, the RSS file will have links that bring up individual pages for each entry.
Now, you might still have to contend with a weblog that doesn't have a Rich Site Summary file, but then, you could just fall back to indexing between named anchor points anyway and use heuristics to figure out what the permanent links to index under might be.
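One simple heuristic for that fallback: treat every named anchor on the page as a candidate entry, and guess its permanent link as the page URL plus the anchor's fragment. A hedged sketch, with made-up names and an example.com URL:

```python
# Sketch of the no-RSS fallback: find every named anchor on a page and
# guess page-URL + "#" + anchor-id as a candidate permanent link.
# A real crawler would add more heuristics (dates in IDs, entry markup).
import re

def guess_permalinks(page_url, html):
    anchors = re.findall(r'<a[^>]*(?:id|name)="([^"]+)"', html)
    return [page_url + "#" + a for a in anchors]

html = '<a name="31415926"></a>...<a name="27182818"></a>'
guesses = guess_permalinks("http://example.com/2003/06.html", html)
```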
I'm sure that people looking for “hot wet gargoyles” will thank you.