How to Block AI Bots from Training on Your Content
I’ve decided I don’t want the AI bots crawling my sites and training on my content. Here’s how to block all or some of them from your sites.
I’m going to give you the instructions first, rather than forcing you to scroll through a 5,000-word explanation of robots.txt before getting to them. I know unnecessarily long posts designed for ad revenue are obnoxious, even though it’s the only way bloggers can eke out a living these days.
That said, after the instructions, I’ll provide some info about that file for those unfamiliar with it, because it is powerful. You can accidentally block bots such as the Google Bot, which will de-index your site from search. I… know a friend who did that once. Yes, um, a friend.
So if you don’t know much about robots.txt, I strongly encourage you to read below the instructions.
Please also be aware there are different schools of thought on whether it’s a good idea to block these bots. Some people believe it will knock you out of search without giving you much benefit. They may be right.
But after reading former Google CEO Eric Schmidt’s fantasies about getting rid of links in search altogether, I think this is probably the direction we’re heading. So I’m willing to take the chance of losing out on search – especially since HCU killed most of my search traffic – for the benefit of not seeing my uncredited articles show up in search results and other people’s websites.
I’m not against AI in principle. I just don’t want it training on my sites.
How to Block AI Bots
As I mentioned, you’re going to block these bots using your robots.txt file. This is a file that sits on your web server, in your site’s top-level directory.
You can access it through cPanel or whatever your host uses for file management. You can also access it through some SEO plugins, like Yoast. And FTP is another option.
Here’s what it looks like in my host’s control panel:
You’re going to click and edit that robots.txt file. It’s just a plain text file, so editing it is as simple as editing a Word doc. If you want to block all of these bots, but no search bots, you can simply copy and paste this into your robots.txt file, changing the sitemap line at the top to show your website instead of mine.
Sitemap: https://blogaliving.com/sitemap.xml

User-agent: *
Disallow:

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Httrack
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Wget
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: SeznamBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /
And then click “save”, and you’re done. This has been tested (see below) and does not block any of the good bots.
Breaking It Down
Let’s look at what the above code means.
Sitemap: https://blogaliving.com/sitemap.xml
is simply telling bots where my sitemap is. This helps search bots crawl your site.
User-agent: *
Disallow:
This weird bit of text tells the bots they’re not disallowed from crawling through any part of your site. It’s basically telling bots like Google search to feel free to wander around, which is a good thing. Note that if you accidentally typed a slash “/” after “Disallow:” you would be telling the bots they’re not allowed anywhere on your site. That’s one of the many fun ways you can accidentally block good bots.
User-agent: CCBot
Disallow: /
So you can probably figure this one out yourself now: it’s telling this CCBot that it’s not allowed anywhere on your site. That’s exactly what we want to tell the AI bots.
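If you want to see those two rule groups in action without touching a live site, Python’s standard urllib.robotparser module can evaluate them locally. This is just a sketch; the URL is a placeholder, and the parser only approximates how well-behaved crawlers read the file:

```python
from urllib.robotparser import RobotFileParser

# The same two rule groups discussed above, as a list of lines
rules = [
    "User-agent: *",
    "Disallow:",        # empty path: everyone else may crawl everything
    "",
    "User-agent: CCBot",
    "Disallow: /",      # a slash means CCBot may crawl nothing
]

rp = RobotFileParser()
rp.parse(rules)

# Googlebot has no group of its own, so it falls under "*" and is allowed
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))  # True

# CCBot matches its own group, so it is blocked everywhere
print(rp.can_fetch("CCBot", "https://example.com/some-page"))      # False
```

This is also a handy way to catch the accidental-slash mistake described above: change `"Disallow:"` to `"Disallow: /"` in the `*` group and the Googlebot check flips to False.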
Is this Every AI Bot?
Nope! And even if it was now, wait 5 minutes – people will be developing new AI bots all the time.
This is a fairly conservative list. I’ve seen much longer lists of bots to block, but I’m being cautious. For example, one longer list blocks GoogleOther. From what I’m reading, it’s a secondary version of the search engine bot, designed to take some stress off the main GoogleBot, which has a ridiculous number of pages to crawl these days.
I don’t want to block any bot that’s related to search. So at this point, without knowing for certain that GoogleOther is scraping my content, I’m not going to block it.
You can add bots to this list for yourself if you have researched them and are sure you want to. All you would need to do is go to the bottom of the list and add:
User-agent: [bot you want to disallow]
Disallow: /
How to Test
Always test your robots.txt code! The slightest typo can completely change what this file will do.
To test my robots.txt code, I went to the Merkle robots.txt validator, pasted in my code (always copy directly from your live robots.txt file, not from the code on this page that you think you pasted into your file), filled in my site URL, and chose the Googlebot (the standard search engine bot). Then I clicked “test.”
See the green “Allowed” at the bottom right? That says the Google search bot can crawl my site, which I do want it to do.
I also tested it on the other search engine bots. Merkle has all of them, which is great. Here’s the menu section that shows the Bing bots:
I also tested the social media bots:
And finally I tested it on the AI crawlers:
And the green “Allowed” at the bottom turned into a red “Disallowed.” Which means they can’t crawl the site, which is what I want.
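If you’d rather script these checks than click through a web validator, here’s a sketch using Python’s standard urllib.robotparser. The ruleset below is a shortened copy of the block from earlier (paste in your full file), and the bot names and URL are just illustrations:

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the robots.txt rules shown earlier --
# paste your full file contents here instead
robots_txt = """\
User-agent: *
Disallow:

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

good_bots = ["Googlebot", "Bingbot", "DuckDuckBot"]
ai_bots = ["CCBot", "GPTBot", "Google-Extended"]

# Each search bot should print Allowed; each AI bot should print Disallowed
for bot in good_bots + ai_bots:
    verdict = "Allowed" if rp.can_fetch(bot, "https://example.com/") else "Disallowed"
    print(f"{bot}: {verdict}")
```

Note this only tells you how a rule-following crawler would interpret your file; it doesn’t replace checking that the file is actually live on your server.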
And that’s it! I’ve added the code to the file and tested it, and it’s blocking the AI bots but not the “good” bots for search and social media.
What the Robots.txt File Does
As I mentioned, the robots.txt file is a simple text file that lives in your site’s root directory. It tells web crawlers, search engines, and other bots that visit your site what they are and aren’t allowed to do.
It can get very complex, blocking them from some directories but not others, and so on. There used to be reasons to do this but not so much anymore. If you don’t know of a reason you need to do this, then you almost certainly don’t.
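As an illustration of that directory-level control, here’s a common WordPress-style fragment (the paths are examples only; don’t add anything like this unless you know you need it):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

This tells all bots to stay out of the admin directory while still permitting the one file some WordPress themes need crawlers to reach.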
Note that this file cannot stop so-called bad bots – by definition, bad bots are the ones that ignore robots.txt and do whatever they want on your site. They usually don’t cause any trouble, but they can throw off your analytics.