The Boston Diaries

The ongoing saga of Sean Conner, who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, November 18, 2025

“Thanks for calling that out. That was bad advice on my part.”

For reasons, I noticed that ClaudeBot (and I'd link to the information page for it, but, and this is my shocked face, they forgot to include a proper link!) was repeatedly requesting the same file, /sitemap.xml, over and over again. My bad! I neglected to configure my web server to return “410 Gone” for that page. I went to the main site of the bot and found a “call to sign up!” but sadly, no obvious link on how to stop their bot.

So I figured, I might as well try this LLM thing once more. The last few times I tried, it didn't go well. Maybe third time's a charm?

Maybe?

Sean

How do I stop your XXXXXXX webbot from crawling my website?

Claude

Thinks hard for a few minutes.

I can help you block Anthropic's web crawler from accessing your website.

You'll want to add rules to your robots.txt file. Here's what to add:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

Put this in your robots.txt file at the root of your domain (e.g., https://yourdomain.com/robots.txt). [Apparently Claude doesn't know that there's a domain set aside specifically for examples, namely, EXAMPLE.COM. Claude also doesn't realize that yourdomain.com actually exists! Way to go, Claude! slow clap]

The two user-agent strings cover Anthropic's crawlers:

  • anthropic-ai - the main crawler identifier
  • Claude-Web - an alternative identifier that may be used

If you want to block specific sections instead of your entire site, you can replace / with specific paths:

User-agent: anthropic-ai
Disallow: /private/
Disallow: /admin/

The crawler should respect these rules within a reasonable timeframe after you update your robots.txt file.

If you're still seeing crawling activity after adding these rules, or if you need immediate blocking, you could also block by IP address in your server configuration or firewall, though the specific IPs may change over time.

Sean

Finds the page that describes Anthropic's web bots, not from a link on the Anthropic web site, but a normal web search.

What about Claude-User? ClaudeBot? Claude-SearchBot? Did you XXXXXXX forget those? Or did your overlords tell you to conveniently “forget” those as well?

Claude

You're absolutely right - I should have given you the complete list. Here are all the Anthropic crawler user-agents you should block:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

I didn't intentionally leave those out - I gave you an incomplete answer, which wasn't helpful when you're trying to actually block the crawlers. Thanks for calling that out.

If you want to be thorough, you could also add a blanket rule that catches any variations:

User-agent: *Claude*
Disallow: /

User-agent: *anthropic*
Disallow: /

Though the specific user-agent blocking is more precise if you want to maintain control over what else can access your site.

Sean

Are you XXXXXXX sure that's legal syntax? That doesn't look kosher to me. Again, are you trying to get around any bans?

Claude

You're right to call that out - wildcards with asterisks like *Claude* are NOT valid robots.txt syntax. That was bad advice on my part.

The correct approach is to list each user-agent explicitly:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

That's the proper syntax. Each User-agent directive needs to be spelled out exactly, and wildcards don't work in the User-agent field in standard robots.txt files.

I'm not trying to get around anything - I just gave you incorrect technical information, which is worse than useless when you're trying to actually implement this. My apologies for the confusion.
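Since “please double-check responses” is the order of the day, here is a minimal sketch of one way to check both sets of rules, assuming Python's standard urllib.robotparser is a fair stand-in for how a crawler reads robots.txt (it may not be; every crawler ships its own parser). The AGENTS list comes from the chat above; the helper names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the chat above.
AGENTS = ["anthropic-ai", "Claude-Web", "ClaudeBot", "Claude-SearchBot", "Claude-User"]


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocked_agents(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Map each agent name to True if this parser would deny it `path`."""
    parser = build_parser(robots_txt)
    return {agent: not parser.can_fetch(agent, path) for agent in AGENTS}


# One explicit group per agent, as in the final answer above.
EXPLICIT = "\n\n".join(f"User-agent: {agent}\nDisallow: /" for agent in AGENTS)

# The "blanket" wildcard rules from the earlier answer.
WILDCARD = "User-agent: *Claude*\nDisallow: /\n\nUser-agent: *anthropic*\nDisallow: /"

if __name__ == "__main__":
    print("explicit:", blocked_agents(EXPLICIT))
    print("wildcard:", blocked_agents(WILDCARD))
```

With this particular parser, the explicit list should report every agent as blocked, while the *Claude* and *anthropic* groups match none of them, so nothing is blocked. That only says how Python's parser reads the file, not how any given bot actually behaves.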

Oh wait! I know! I'm holding it wrong, and I'm using an LLM that is older than 20 minutes. My mistake. And before you say I should have asked politely, no. This is a computer! It's not human! Stop anthropomorphizing it!

And of course, at the bottom of the chat page, grey-on-slightly-darker-grey, is the text “Claude can make mistakes. Please double-check responses.”

How anybody finds this garbage useful is beyond me.

Obligatory Picture

[Self-portrait with a Christmas Tree] Oh Christmas Tree! My Christmas Tree! Rise up and hear the bells!

Obligatory Miscellaneous

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2026 by Sean Conner. All Rights Reserved.