Cloudflare Now Blocks AI Crawlers by Default: Don't Vanish From AI Search

What Changed, and Why You Probably Missed It

Cloudflare now blocks AI crawlers by default. Not as an opt-in toggle you flip yourself, but as the baseline for newly onboarded sites and anyone following the recommended configuration. AI bots hit the wall before they reach your content. Alongside that, Cloudflare shipped Pay Per Crawl: when a crawler asks for your pages, the server can return an HTTP 402 with a crawler-price header that quotes a per-fetch rate, and the publisher sets that price.

For content platforms, the mechanism makes sense. Early reported numbers show roughly a 32% drop in unauthorized bot traffic, and participating publishers saw about a 27% lift in data-licensing revenue. The logic is clean: AI companies used to scrape content for free to train on, and now you can charge a toll.

The trouble lives in the word “default.” Plenty of sites had AI bot access switched off automatically, and the owners had no idea. You didn’t edit robots.txt. You didn’t touch a setting. Cloudflare pushed a config one day, or you onboarded a new CDN, and suddenly GPTBot and ClaudeBot started eating 402s and 403s.

For a store that leans on AI search for traffic, this is the slow-boil version of a disaster. No error shows up. No dashboard turns red. One day you notice that asking ChatGPT “what brands are good for X” no longer surfaces you, while the competitor down the street who allowed the crawlers is still getting cited. You’ve gone invisible in the answer, and you’re blaming it on a ranking wobble.

Not All Bots Are the Same: Training vs Retrieval

Sort the bots out first. The most common mistake is treating everything with “AI” in the name as one bucket and blocking the lot, which takes down the crawlers that actually send you traffic.

Training crawlers fetch your content to feed model training. GPTBot, Google-Extended, and CCBot fall here. Whatever they pull might faintly surface in some future model release months from now, with no direct line to this week’s orders. Blocking them is mostly a content-licensing and revenue question.

Retrieval crawlers are a different animal. OAI-SearchBot, ChatGPT-User, and PerplexityBot fetch live content right when a shopper asks a question in ChatGPT or Perplexity, and they assemble the answer that shopper sees in that moment. Someone is asking “best sunscreen for sensitive skin” right now, and whether a retrieval bot can reach your page decides whether your SKU appears in the recommendation they read. For AI shopping visibility, the retrieval bots are exactly the ones you want to get through.

One distinction trips up a lot of people: Google-Extended and Googlebot are not the same thing. Google-Extended governs training access for Gemini and Vertex. Block it and your normal Google Search ranking is untouched, because regular Search indexing runs through Googlebot, a separate crawler entirely. So “I don’t want my content training Gemini” and “I want to stay in Google Search” can both be true at once. Don’t paint with one brush just because the name says Google.

Going the other way, blocking a retrieval bot draws real blood. Block PerplexityBot and you disappear from Perplexity’s answers. Block OAI-SearchBot and ChatGPT-User and ChatGPT can’t fetch you when it retrieves live. Those surfaces are becoming front doors for shopping decisions.

Here’s a table that gathers the key bots so you can check them against robots.txt one by one:

| Bot | Owner | Type | What blocking it costs you |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Content stays out of OpenAI model training; little effect on current sales |
| OAI-SearchBot | OpenAI | Retrieval | ChatGPT search/answers can’t fetch you |
| ChatGPT-User | OpenAI | Retrieval | ChatGPT can’t pull your page during a live user query |
| ClaudeBot | Anthropic | Training/crawl | Anthropic’s crawl is blocked |
| PerplexityBot | Perplexity | Retrieval | You vanish from Perplexity’s answers |
| Google-Extended | Google | Training | Excluded from Gemini/Vertex training; Google Search unaffected |
| CCBot | Common Crawl | Training | Excluded from the Common Crawl public dataset |
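
If you want to apply this taxonomy in a script, for example to label requests in your own tooling, a lookup like the following is one way to do it. This is a minimal sketch: the bot names and types come from the table, while the function name and the substring-matching approach are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Bot tokens and types copied from the table above.
# Note: Google-Extended is a robots.txt control token; it does not appear
# as its own user-agent string in requests, so it is left out of this lookup.
BOT_TYPES = {
    "GPTBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "ClaudeBot": "training/crawl",
    "PerplexityBot": "retrieval",
    "CCBot": "training",
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (bot_token, bot_type) if a known token appears in the UA string."""
    ua = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        if token.lower() in ua:
            return token, bot_type
    return None  # not one of the bots in the table
```

Matching on a substring of the user-agent is deliberately loose. It is fine for sorting traffic into buckets, but it cannot tell a genuine crawler from someone spoofing the name.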

Revenue or Visibility: It’s a Tradeoff

Once the taxonomy is clear, the decision is a fork. Either you treat your content as an asset to license for money (Pay Per Crawl, block bots), or you stay discoverable in AI answers (allow the retrieval bots).

Both sides have a case. A publisher sitting on a deep catalog of original content, with stable traffic that doesn’t depend on AI referrals for new customers, can reasonably pick revenue. The content got scraped for free for years, and recouping some licensing money is fair. That 27% revenue lift is aimed squarely at those players.

An e-commerce store sits in the opposite spot. Your content isn’t a licensing product. It’s the thing that helps a buyer find you and place an order. Shoppers increasingly skip scrolling ten blue links on Google and just ask an AI “is there a product good for X,” and the AI hands back a few specific picks. Get shut out by the retrieval crawlers in that flow and you’ve quietly walked away from a growing acquisition channel without anyone telling you.

So for the vast majority of stores that depend on AI discovery, the answer is close to settled: allow the retrieval bots. The training bots are a separate call. Block them if you want to keep your content out of training sets, allow them if you don’t care, and either way it barely touches today’s revenue. The one mistake you cannot afford is accidentally blocking the retrieval bots along with the training ones.

There’s a sensible middle path too: allow all retrieval, block all training. You keep your content out of training corpora while preserving your shot at being cited in ChatGPT and Perplexity. For most stores that’s a solid default posture.

The Audit Checklist

First, find out whether your domain even sits behind Cloudflare and what its bot-management or AI-crawler default is set to. Lots of stores got Cloudflare added by a site builder or an ops person, and the owner never knew the layer existed. Open the Cloudflare dashboard, find Bot Management or the AI-crawler controls, and confirm whether the default allows or blocks. If it reads “block AI bots by default,” you can assume you’re affected.

Second, pull up robots.txt and walk through each user-agent in the table above. Confirm the retrieval crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot) are not under a Disallow. While you’re in there, watch for a blunt User-agent: * block followed by a wall of Disallow lines that shuts out everyone, AI bots included.
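
The walk-through in step two can be scripted with Python’s standard-library robots.txt parser. This is a rough sketch, not a definitive audit tool: the sample robots.txt and the `shop.example` URL are made up, the helper name is hypothetical, and Python’s parser may differ in edge cases from how each crawler reads the same file.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Retrieval crawlers named in this article.
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def blocked_retrieval_bots(robots_txt: str, url: str) -> List[str]:
    """Return the retrieval bots that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in RETRIEVAL_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical robots.txt: one retrieval bot was blocked along with a training bot.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /checkout/
"""

print(blocked_retrieval_bots(SAMPLE_ROBOTS_TXT, "https://shop.example/products/sunscreen"))
# → ['PerplexityBot']
```

To run it against a live site, paste the contents of your own robots.txt into the first argument and test a few representative product and category URLs.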

Third, add explicit Allow rules for the retrieval crawlers you want, rather than leaning on a default. Writing them out is clearer, and it makes them harder to delete by accident when someone edits the config later. For the training bots, make a conscious choice per the tradeoff above. Block them or allow them, but let it be a decision you made, not one a default value made for you.
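
One way to keep those rules explicit is to generate them from two short lists, so the decision lives in one reviewable place. The sketch below assumes the “allow retrieval, block training” posture described earlier. The lists and the function name are illustrative, and ClaudeBot is left out on purpose because the table marks it as a mixed training/crawl bot, so that call is yours to make.

```python
from typing import Iterable

# Illustrative posture: allow retrieval bots, block training bots.
ALLOW_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]
BLOCK_BOTS = ["GPTBot", "Google-Extended", "CCBot"]


def render_robots_txt(allow: Iterable[str], block: Iterable[str]) -> str:
    """Build one explicit robots.txt group per bot."""
    groups = [f"User-agent: {bot}\nAllow: /" for bot in allow]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in block]
    return "\n\n".join(groups) + "\n"


print(render_robots_txt(ALLOW_BOTS, BLOCK_BOTS))
```

The output is meant to be merged into your existing robots.txt, not to replace it; any rules you already have for other crawlers should stay.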

Fourth, editing robots.txt alone isn’t enough, because Cloudflare can reject a crawler’s request at the edge no matter what robots.txt says. Go into Cloudflare’s firewall, WAF, and bot rules and confirm the bots you allowed aren’t getting stopped there. robots.txt is a polite “please don’t crawl” agreement that a crawler chooses to honor; Cloudflare’s block is enforced, answering with a 402 or 403 before the request ever reaches your origin. Both layers have to agree.

Fifth, re-check after any Cloudflare or CDN configuration change. This isn’t a one-time job. Switching plans, enabling new features, or migrating CDNs can all overwrite your allow-rules with a default again. Build the habit: change a config, re-verify the bot allow-list.

How to Confirm It Actually Worked

Don’t assume you’re done the moment you save. Defaults are sneaky, and you want real evidence.

The hardest evidence is server logs. Filter for the OAI-SearchBot, ChatGPT-User, and PerplexityBot user-agents and check that their recent requests return 200, not 402 or 403. A 402 usually means Pay Per Crawl is asking for money; a 403 means the edge layer rejected the request outright, so head back to step four and check the Cloudflare layer. Logs tell you, plainly, whether the bots are actually getting in, and that’s more trustworthy than any toggle in a dashboard.
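
The log check can be sketched as a small script that tallies status codes per retrieval bot. It assumes your server writes the common “combined” access-log format (request line, status, size, referer, user-agent); if your format differs, the regular expression needs adjusting. The sample lines are invented for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable

RETRIEVAL_BOTS = ("OAI-SearchBot", "ChatGPT-User", "PerplexityBot")

# Assumes combined log format: "METHOD path proto" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def status_counts_by_bot(lines: Iterable[str]) -> Dict[str, Dict[int, int]]:
    """Count HTTP status codes per retrieval bot found in the log lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the assumed format
        ua = match.group("ua").lower()
        for bot in RETRIEVAL_BOTS:
            if bot.lower() in ua:
                counts[bot][int(match.group("status"))] += 1
    return {bot: dict(codes) for bot, codes in counts.items()}


# Made-up sample lines, using documentation-reserved IP addresses.
SAMPLE_LOG = """\
203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET /products/sunscreen HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"
203.0.113.8 - - [01/Jan/2025:10:00:05 +0000] "GET /products/sunscreen HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
203.0.113.9 - - [01/Jan/2025:10:00:09 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 Firefox/120.0"
"""

print(status_counts_by_bot(SAMPLE_LOG.splitlines()))
# → {'OAI-SearchBot': {200: 1}, 'PerplexityBot': {403: 1}}
```

Any 402 or 403 count in the result is your cue to go back to the Cloudflare layer. Keep in mind that a request blocked at the edge may never reach your origin logs at all, so a bot that is missing entirely is also worth investigating.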

If you don’t have full log access, use curl to spoof those crawler user-agents against your own pages and read the status code that comes back. It’s crude but effective, and it at least separates the 200, 402, and 403 outcomes. Some bots also validate by source IP, so a spoof won’t perfectly reproduce a real fetch, but status-layer blocking shows up fine this way.
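
The same spoofing check can be done from Python when you want to run it across several bots and pages at once. This is a sketch under stated assumptions: the user-agent values you pass in are simplified placeholders (the real published strings are longer), `shop.example` is a stand-in for your own domain, and the function names are hypothetical. The `crawler-price` header is the one mentioned at the top of this article; its value is reported as-is without interpreting its format.

```python
import urllib.error
import urllib.request
from typing import Mapping


def interpret_status(status: int, headers: Mapping[str, str]) -> str:
    """Map a response status to the outcomes this article distinguishes."""
    if status == 200:
        return "allowed"
    if status == 402:
        price = headers.get("crawler-price", "not provided")
        return f"payment required, quoted price: {price}"
    if status == 403:
        return "blocked at the edge"
    return f"unexpected status {status}"


def probe(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Fetch url while presenting the given user-agent and describe the result."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret_status(response.status, response.headers)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status and headers are on the exception.
        return interpret_status(err.code, err.headers)


# Example usage (performs a real network request, so it is left commented out):
# for bot in ("OAI-SearchBot", "ChatGPT-User", "PerplexityBot"):
#     print(bot, probe("https://shop.example/products/sunscreen", f"{bot}/1.0"))
```

As with curl, a 200 here only shows that user-agent-based blocking is off. It says nothing about checks that depend on the crawler’s real source IP.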

Then go ask. Every few days, search your core categories and the questions buyers most commonly ask inside ChatGPT and Perplexity, and see whether you show up. Perplexity prints source links directly, so confirming a citation is easy. ChatGPT sometimes lists sources and sometimes doesn’t, so run a few queries and look for the pattern. Expect a lag at the start, since the retrieval bots need to re-crawl and the AI needs time to fold you back into answers. Don’t panic if you see nothing the same day.

Treat log verification as the main signal and test queries as the backup. Logs tell you whether the door is open; the queries tell you whether guests actually walk in. When both line up, the fix has landed.
