Robots.txt for AI Crawlers: How to Control What AI Engines See and Index
Your robots.txt file is the first thing every AI crawler reads before it touches your site. It is a plain text file at your domain root that tells bots which URLs they can access and which they cannot. For decades, this file was a simple gatekeeper for Googlebot and Bingbot. Now it is the front line of a much larger battle — one that determines whether your content appears in ChatGPT, Perplexity, Claude, Gemini, and every other AI-powered search engine.
The stakes are real. An HTTP Archive analysis of 12.15 million websites found that 94.12% maintain robots.txt files with at least one directive. Over 560,000 sites now specifically reference GPTBot. Another 560,000 reference ClaudeBot. These numbers grew from near zero to hundreds of thousands in under two years. The decision you make about your robots.txt configuration directly controls whether AI engines can discover, crawl, and cite your pages — or whether your content remains invisible to the fastest-growing discovery channel in search history.
The AI Crawler Landscape in 2026
Before configuring your robots.txt, you need to understand which bots exist, who operates them, and what they do with your content. AI crawlers fall into two distinct categories: training crawlers that scrape content to build AI models, and retrieval crawlers that fetch content to generate real-time answers with citations.
Training Crawlers
Training crawlers download your content to include it in future model training datasets. They do not send traffic back to your site. They do not cite you. They extract value from your content without providing any visibility benefit in return.
The primary training crawlers you should know:
- GPTBot (OpenAI) — The most blocked AI crawler on the web. GPTBot scrapes pages to train OpenAI's foundation models. It has a crawl-to-refer ratio of 1,255:1, meaning it crawls 1,255 pages for every single referral it sends back. Nearly 21% of the top 1,000 websites block it specifically.
- ClaudeBot (Anthropic) — Anthropic's training crawler. Its crawl-to-refer ratio is even worse at 20,583:1. ClaudeBot blocking grew steadily through Q1 2026, with its share of disallow rules rising from 9.6% in January to 10.1% by March.
- Google-Extended — Google's opt-out mechanism for Gemini model training. Referenced by 262,000 sites as of September 2024. Blocking Google-Extended does not affect your Google Search visibility — it only prevents your content from training Gemini.
- Meta-ExternalAgent — Meta's crawler for training its Llama models. Provides no search visibility benefit.
- CCBot (Common Crawl) — The crawler behind the Common Crawl dataset, which feeds training pipelines for numerous AI companies. Widely blocked across publisher sites.
- Amazonbot — Amazon's crawler, which saw the largest blocking growth at +0.87 percentage points in early 2026.
Retrieval and Search Crawlers
Retrieval crawlers fetch your content in real time to generate answers that include citations linking back to your site. These are the crawlers that can send you traffic.
- ChatGPT-User — The crawler that fires when a ChatGPT user asks a question and the system retrieves live web content. As of December 2025, OpenAI changed ChatGPT-User so it no longer complies with robots.txt for user-initiated actions. This means blocking it in robots.txt may not prevent retrieval.
- OAI-SearchBot — OpenAI's dedicated search crawler for ChatGPT Search. It indexes content similarly to a traditional search engine crawler.
- PerplexityBot — Perplexity's crawler for indexing content it can cite in answers. Blocked by 67% of major news sites, yet it appears more often in allow rules than disallow rules on non-publisher sites because it returns meaningful traffic.
- Applebot-Extended — Apple's opt-out for Apple Intelligence training. Referenced by 262,000 sites alongside Google-Extended.
The Blocking Decision: What the Data Shows
The data from BuzzStream's analysis of news publishers reveals a clear pattern: 79% of top news sites block AI training bots, and 71% block AI retrieval bots. But only 14% block all AI bots entirely, while 18% do not block any.
This split reflects the core tension. Blocking training crawlers protects your intellectual property and prevents your content from being used without compensation. But blocking retrieval crawlers eliminates your visibility in the fastest-growing search channels.
The recommended strategy that has emerged across the industry is selective blocking. Block training-focused bots while explicitly allowing search and retrieval bots. According to Cloudflare's analysis, this approach blocks approximately 89.4% of extractive traffic while preserving the 10.2% that sends actual visitors to your site.
Recommended Robots.txt Configuration
Here is the configuration that balances content protection with AI search visibility:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Allow AI search and retrieval crawlers.
# A crawler that matches a named group ignores the catch-all (*) group
# below, so the sensitive-path rules are repeated for each group.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /admin
Disallow: /checkout
Disallow: /cart
Allow: /

# Standard search engine crawlers
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin
Disallow: /checkout
Disallow: /cart
Allow: /

# All other crawlers
User-agent: *
Disallow: /admin
Disallow: /checkout
Disallow: /cart
Sitemap: https://yourdomain.com/sitemap.xml
This configuration explicitly blocks eight training crawlers while explicitly allowing three retrieval crawlers. The explicit allow directives are important — they make your intent unambiguous to the crawler and to any intermediary systems that process robots.txt on behalf of AI platforms.
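You can sanity-check a configuration like this before deploying it. The sketch below uses Python's standard `urllib.robotparser` to parse a condensed version of the rules and confirm each crawler sees what you intend. The domain and the `SomeNewBot` agent are placeholders; note that `robotparser` applies rules in first-match order within a group, so sensitive-path disallows should precede the blanket allow.

```python
from urllib.robotparser import RobotFileParser

# Condensed version of the recommended rules above
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /admin
Allow: /

User-agent: *
Disallow: /admin
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

base = "https://yourdomain.com"  # placeholder domain

# Training crawler: blocked everywhere
print(rp.can_fetch("GPTBot", f"{base}/products/widget"))         # False
# Retrieval crawler: allowed on content, blocked on sensitive paths
print(rp.can_fetch("OAI-SearchBot", f"{base}/products/widget"))  # True
print(rp.can_fetch("OAI-SearchBot", f"{base}/admin"))            # False
# An unlisted crawler falls through to the * group
print(rp.can_fetch("SomeNewBot", f"{base}/products/widget"))     # True
```

Running a check like this against every bot you care about catches the most common mistake: assuming a named group inherits the catch-all group's disallows.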
Why Explicit Allow Rules Matter
Many sites only include disallow rules and rely on the implicit default that unlisted user agents are allowed. This works technically, but explicit allow rules serve two purposes. First, they make your configuration self-documenting. Anyone reviewing your robots.txt immediately understands your strategy. Second, some AI platforms have reported that explicit allow signals are weighted as a positive trust indicator during content selection.
Shopify-Specific Configuration
Shopify stores face a unique challenge. Shopify silently updated its default robots.txt template to block several AI crawlers without notifying store owners. This means your Shopify store may already be blocking GPTBot, ClaudeBot, and other AI crawlers without your knowledge.
To check, visit https://yourstore.com/robots.txt and review the current directives. If AI crawlers are blocked and you want visibility in AI search, you need to customize your robots.txt through Shopify's robots.txt.liquid template.
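A quick way to audit this (for Shopify or any site) is to scan the robots.txt text for known AI crawler tokens. A minimal sketch, assuming you have already fetched the file body, for example with `urllib.request`; the sample excerpt below is hypothetical:

```python
# AI crawler user-agent tokens discussed in this article
AI_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent",
    "CCBot", "Amazonbot", "Applebot-Extended", "Bytespider",
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
]

def referenced_ai_bots(robots_txt: str) -> list[str]:
    """Return the AI crawler tokens that appear anywhere in a robots.txt body."""
    lower = robots_txt.lower()
    return [bot for bot in AI_BOTS if bot.lower() in lower]

# Hypothetical excerpt of a store's current robots.txt
sample = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: *
Disallow: /checkout
"""

print(referenced_ai_bots(sample))  # ['GPTBot', 'ClaudeBot']
```

A substring scan only tells you a bot is mentioned, not whether it is allowed or disallowed; pair it with a `urllib.robotparser` check when you need per-path answers.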
In Shopify, navigate to Online Store, then Themes, then Edit Code, and find the robots.txt.liquid file in the Templates folder. Add your custom directives using Shopify's Liquid syntax:
{% comment %}
Custom rules for AI search crawlers, added after Shopify's
default groups so the platform defaults stay intact.
{% endcomment %}
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Allow: /collections/
Allow: /products/
Allow: /pages/
Allow: /blogs/
For Shopify stores, you want AI search crawlers to access your product pages, collection pages, blog content, and informational pages. You do not want them accessing checkout flows, customer account pages, or internal admin paths — which Shopify already blocks by default.
The GPTBot Explosion: A Timeline
Understanding how quickly this landscape changed helps contextualize why robots.txt configuration demands active management:
- August 2023 — OpenAI announces GPTBot. Blocking goes from 0 to 125,000 sites within one month.
- November 2023 — GPTBot blocking reaches 578,000 sites.
- April 2024 — PerplexityBot appears on 31,000 sites' robots.txt files and grows rapidly.
- September 2024 — Google-Extended and Applebot-Extended each reach 262,000 sites.
- 2025 — GPTBot and ClaudeBot each surpass 560,000 sites with specific directives.
- December 2025 — OpenAI changes ChatGPT-User to bypass robots.txt for user-initiated retrieval, fundamentally altering the enforcement model.
This timeline shows that robots.txt for AI is not a set-and-forget configuration. It requires quarterly review as new crawlers appear and existing crawlers change their compliance behavior.
Crawl-to-Refer Ratios: Measuring the Value Exchange
Not all AI crawlers provide equal value. The crawl-to-refer ratio measures how many pages a bot crawls for every referral visit it sends back to your site. Lower ratios indicate better value exchange.
The Cloudflare data reveals extreme differences:
- GPTBot — 1,255:1 ratio. It crawls 1,255 of your pages for every one visitor it sends back. This is an extractive relationship.
- ClaudeBot — 20,583:1 ratio. Even more extractive. For every 20,583 pages crawled, you get one referral.
- PerplexityBot — Significantly better ratio, which explains why it appears more often in allow rules. Perplexity's citation-heavy interface drives meaningful referral traffic.
- ChatGPT-User — Variable ratio depending on query type, but generally favorable because it only fires during active user sessions.
These ratios should inform your blocking strategy. Crawlers with extreme ratios are primarily extracting value. Crawlers with favorable ratios are providing a genuine traffic exchange.
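Measuring this on your own logs is simple arithmetic: divide crawl hits by referral visits over the same period. A sketch, using the ratios cited above as example counts:

```python
def crawl_to_refer(crawls: int, referrals: int) -> float:
    """Pages crawled per referral visit; lower means a fairer exchange."""
    if referrals == 0:
        return float("inf")  # purely extractive: no traffic returned at all
    return crawls / referrals

# Example counts mirroring the ratios cited above
print(crawl_to_refer(1_255, 1))   # 1255.0  (GPTBot-like)
print(crawl_to_refer(20_583, 1))  # 20583.0 (ClaudeBot-like)
print(crawl_to_refer(5_000, 0))   # inf     (no referrals observed)
```

Track the ratio per bot over a month or a quarter; a single day's sample is too noisy to act on.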
Beyond Robots.txt: Additional AI Crawler Controls
Robots.txt is the primary mechanism, but it is not the only one. Several complementary approaches exist:
HTTP Headers — The X-Robots-Tag HTTP header can specify AI-specific directives on a per-page or per-directory basis. This provides more granular control than robots.txt, which operates at the path level.
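As a sketch, a server could attach a header like the following to a page's or directory's responses. The `noai` and `noimageai` values are an emerging convention, not a formal standard, and are honored only by crawlers that recognize them:

```
X-Robots-Tag: noai, noimageai
```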
Meta Tags — The <meta name="robots"> tag in your HTML head can include AI-specific directives. Some publishers use noai or noimageai values, though standardization is still evolving.
Cloudflare AI Bot Management — If you use Cloudflare, their AI bot management tools provide real-time blocking and rate limiting that goes beyond robots.txt. This is particularly useful because some AI crawlers do not fully respect robots.txt directives.
TDMRep (Text and Data Mining Reservation Protocol) — An emerging standard that provides machine-readable licensing information. It allows you to specify whether your content can be used for training, and under what conditions.
Monitoring and Verification
Configuring robots.txt is only the first step. You need to verify that crawlers are actually respecting your directives and monitor for new crawlers that your configuration does not yet address.
Check your server logs monthly for AI crawler user agents. Look for user agents you have blocked that are still accessing your content — this indicates non-compliance. Review the list of known AI crawlers quarterly, as new ones appear regularly.
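A first pass over an access log can be as simple as counting hits per AI user agent and flagging any bot you have disallowed that is still crawling. A minimal sketch; the log lines and the blocked list are illustrative:

```python
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "Bytespider"]
BLOCKED = {"GPTBot", "ClaudeBot", "Bytespider"}  # bots your robots.txt disallows

def ai_bot_hits(log_lines):
    """Count access-log lines per AI crawler token in the user-agent field."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

# Illustrative excerpt in combined log format (user agent at the end)
log = [
    '1.2.3.4 - - [10/Jan/2026] "GET /products/a HTTP/1.1" 200 "-" "GPTBot/1.0"',
    '1.2.3.5 - - [10/Jan/2026] "GET /blogs/post HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '1.2.3.4 - - [11/Jan/2026] "GET /products/b HTTP/1.1" 200 "-" "GPTBot/1.0"',
]

hits = ai_bot_hits(log)
print(hits["GPTBot"])                           # 2
non_compliant = [b for b in hits if b in BLOCKED]
print(non_compliant)                            # ['GPTBot'] -- blocked but still crawling
```

Any bot in the blocked set that still shows up in your logs is ignoring robots.txt, which is the signal to escalate to server-level blocking.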
Google Search Console and Bing Webmaster Tools both provide crawl statistics that can help you identify AI bot activity. Third-party tools like Cloudflare's bot analytics dashboard provide more detailed breakdowns.
The robots.txt file is the simplest and most powerful lever you have for controlling AI crawler access. Configure it deliberately, review it regularly, and treat it as an active part of your AI search strategy rather than a static technical artifact.