How to Block AI Bots from Training on Your Content

I’ve decided I don’t want the AI bots crawling my sites and training on my content. Here’s how to block all or some of them from your sites.

I’m going to give you the instructions first, rather than forcing you to scroll through a 5,000-screen explanation of robots.txt before you get them. I know unnecessarily long posts designed for ad revenue are obnoxious, even though they’re about the only way bloggers can eke out a living these days.

[Illustration of a bot standing on a motherboard]

That said, after the instructions I’ll provide some background on the file for those unfamiliar with it, because it is powerful. You can accidentally block bots such as Googlebot, which will de-index your site from search. I… know a friend who did that once. Yes, um, a friend.

So if you don’t know much about robots.txt, I strongly encourage you to read below the instructions.

Please also be aware there are different schools of thought on whether it’s a good idea to block these bots. Some people believe it will knock you out of search without giving you much benefit. They may be right.

But after reading former Google CEO Eric Schmidt’s fantasies about getting rid of links in search altogether, I think this is probably the direction we’re heading. So I’m willing to take the chance of losing out on search – especially since HCU killed most of my search traffic – for the benefit of not seeing my uncredited articles show up in search results and on other people’s websites.

I’m not against AI in principle. I just don’t want it training on my sites.

How to Block AI Bots

As I mentioned, you’re going to block these bots using your robots.txt file. This is a plain text file that sits on your web server, in your site’s top-level (root) directory.

You can access it through cPanel or whatever your host uses for file management. You can also edit it through some SEO plugins, such as Yoast. FTP is another option.

Here’s what it looks like in my host’s control panel:

[Screenshot of a robots.txt file in a web host’s file manager]

You’re going to click and edit that robots.txt file. It’s just a text file, so editing it is like editing a Word doc. If you want to block all of these bots but no search bots, you can simply copy and paste this into your robots.txt file, changing the sitemap line at the top to point to your website instead of mine:

Sitemap: https://blogaliving.com/sitemap.xml
User-agent: *
Disallow: 
User-agent: CCBot 
Disallow: / 
User-agent: GPTBot 
Disallow: / 
User-agent: ChatGPT-User 
Disallow: / 
User-agent: Google-Extended 
Disallow: / 
User-agent: anthropic-ai 
Disallow: / 
User-agent: Claude-Web 
Disallow: / 
User-agent: Httrack 
Disallow: / 
User-agent: HTTrack 
Disallow: / 
User-agent: Wget 
Disallow: / 
User-agent: MJ12bot 
Disallow: / 
User-agent: SeznamBot 
Disallow: / 
User-agent: DotBot 
Disallow: / 
User-agent: BLEXBot 
Disallow: /

And then click “save”, and you’re done. This has been tested (see below) and does not block any of the good bots.

Breaking It Down

Let’s look at what the above code means.

Sitemap: https://blogaliving.com/sitemap.xml

is simply telling bots where my sitemap is. This helps search bots crawl your site.

 
User-agent: *
Disallow:

This weird bit of text tells the bots they’re not disallowed from crawling through any part of your site. It’s basically telling bots like Google search to feel free to wander around, which is a good thing. Note that if you accidentally typed a slash “/” after “Disallow:” you would be telling the bots they’re not allowed anywhere on your site. That’s one of the many fun ways you can accidentally block good bots.
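That one-character difference is easy to verify for yourself. Here’s a minimal sketch using Python’s standard-library robots.txt parser (example.com stands in for your own site):

```python
# Demonstrate the difference between an empty "Disallow:" and "Disallow: /".
# Uses only the Python standard library.
from urllib.robotparser import RobotFileParser

def allowed(robots_text: str, agent: str) -> bool:
    """Parse a robots.txt body and ask whether `agent` may fetch a page."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, "https://example.com/some-page")

# Empty Disallow: nothing is off limits -- bots may crawl everywhere.
print(allowed("User-agent: *\nDisallow:", "Googlebot"))    # True

# One stray slash and every bot is banned from the whole site.
print(allowed("User-agent: *\nDisallow: /", "Googlebot"))  # False
```

Same file, one extra slash, and you’ve gone from “crawl anywhere” to “crawl nowhere.”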

 
User-agent: CCBot
Disallow: /

So you can probably figure this one out yourself now: it’s telling this CCBot that it’s not allowed anywhere on your site. That’s exactly what we want to tell the AI bots.

Is this Every AI Bot?

Nope! And even if it were, wait five minutes – people are developing new AI bots all the time.

This is a fairly conservative list. I’ve seen much longer lists of bots to block, but I’m being cautious. For example, one of those lists blocks GoogleOther. From what I’m reading, it’s a secondary version of the search engine bot, designed to take some stress off the main Googlebot, which has a ridiculous number of pages to crawl these days.

I don’t want to block any bot that’s related to search. So at this point, without knowing for certain that GoogleOther is scraping my content, I’m not going to block it.

You can add bots to this list for yourself if you have researched them and are sure you want to. All you would need to do is go to the bottom of the list and add:

User-agent: [bot you want to disallow]
Disallow: /
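If you want to double-check that a new entry does what you expect before uploading, you can run the amended rules through Python’s standard-library robots.txt parser. A quick sketch (NewAIBot is a made-up placeholder, not a real crawler – substitute the user-agent string you researched):

```python
# Verify that appending a User-agent/Disallow pair blocks that one bot
# without affecting bots covered by the wildcard rule.
from urllib.robotparser import RobotFileParser

base_rules = "User-agent: *\nDisallow:\n"
# "NewAIBot" is a hypothetical placeholder -- substitute the real
# user-agent string of the bot you want to block.
new_entry = "\nUser-agent: NewAIBot\nDisallow: /\n"

parser = RobotFileParser()
parser.parse((base_rules + new_entry).splitlines())

print(parser.can_fetch("NewAIBot", "https://example.com/"))   # False
print(parser.can_fetch("Googlebot", "https://example.com/"))  # True
```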

How to Test

Always test your robots.txt code! The slightest typo can completely change what this file will do.

To test my robots.txt code, I went to the Merkle robots.txt validator, pasted in my code (always copy directly from your actual robots.txt file, not from the code on this page that you think you pasted into your file), filled in my site URL, and chose Googlebot (the standard search engine bot). Then I clicked “test.”

[Screenshot of the validator test]

See the green “Allowed” at the bottom right? That says the Google search bot can crawl my site, which I do want it to do.


I also tested it on the other search engine bots. Merkle has all of them, which is great. Here’s the menu section that shows the Bing bots:

[Screenshot of dropdown menu showing the Bing bots]

I also tested the social media bots:

[Screenshot of dropdown menu showing social media bots]

And finally I tested it on the AI crawlers:

[Screenshot of dropdown menu showing AI bots]

And the green “Allowed” at the bottom turned into a red “Disallowed.” Which means they can’t crawl the site, which is what I want.

[Screenshot of red “Disallowed” test result]

And that’s it! I’ve added the code to the file and tested it, and it’s blocking the AI bots but not the “good” bots for search and social media.
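If you’d rather script this check than use an online validator, the same pass/fail table can be produced with Python’s standard library. A sketch, run against a trimmed copy of the rules (paste the full contents of your robots.txt into ROBOTS_TXT; the bot lists here are illustrative, not exhaustive):

```python
# A local, scripted version of the validator test: check a list of
# "good" bots and AI bots against your robots.txt rules.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""  # trimmed for brevity -- paste the full contents of your file here

GOOD_BOTS = ["Googlebot", "Bingbot", "DuckDuckBot", "Twitterbot"]
AI_BOTS = ["CCBot", "GPTBot", "Google-Extended"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

results = {}
for bot in GOOD_BOTS + AI_BOTS:
    results[bot] = parser.can_fetch(bot, "https://example.com/")
    print(f"{bot:16} {'Allowed' if results[bot] else 'Disallowed'}")
```

Every bot in GOOD_BOTS should print “Allowed” and every bot in AI_BOTS should print “Disallowed”; if not, go back and look for a typo in your file.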

What the Robots.txt File Does

As I mentioned, the robots.txt file is a simple text file that lives in your site’s root directory. It tells web crawlers, search engines, and other bots that visit your site what they are and aren’t allowed to do.

It can get very complex, blocking bots from some directories but not others, and so on. There used to be good reasons to do that, but not so much anymore. If you don’t know of a reason you need to, you almost certainly don’t.

Note that this file cannot stop so-called bad bots – by definition, bad bots are the ones that ignore robots.txt and do whatever they want on your site. They usually don’t cause any trouble, but they can throw off your analytics.

 

Last Updated:

April 28, 2025
