Robots.txt Tester & Validator

Test and validate robots.txt rules against any URL path. Free robots.txt checker that parses directives, highlights syntax errors, and shows which rules apply.

A single misplaced line in robots.txt can block an entire section of your site from search engines. The file looks simple: a handful of User-agent and Disallow directives. But precedence rules, wildcard patterns, and the gap between what you think you wrote and what Googlebot actually reads make it easy to get wrong.

Google Search Console has a robots.txt tester, but it only checks against Googlebot. It tells you nothing about GPTBot, Bingbot, or any other crawler that reads the same file, and it does not validate the syntax itself: if you place a directive before a User-agent line, the tester silently ignores it.

This tool parses the full robots.txt content, tests any URL path against any user-agent, and flags syntax problems. Paste your file, enter a path, and see exactly which rule matches and why. Everything runs in your browser. No data leaves your machine.

What robots.txt does and how crawlers use it

The robots.txt file is a plain text file at the root of a domain (https://example.com/robots.txt) that tells web crawlers which paths they are allowed or not allowed to fetch. It follows the Robots Exclusion Protocol, formalized in RFC 9309 in 2022.

When a compliant crawler arrives at your domain, it fetches /robots.txt first. It finds the group whose User-agent matches its own name (or falls back to the wildcard * group), then evaluates the Allow and Disallow rules in that group. If a path is disallowed, the crawler skips it. If allowed or not mentioned, the crawler proceeds.
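That lookup can be sketched with Python's standard-library robots.txt parser. Note that urllib.robotparser implements the original prefix-matching spec, without Google's * and $ extensions, and the rules below are hypothetical:

```python
from urllib import robotparser

# Hypothetical rules for illustration only.
RULES = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Googlebot has no dedicated group, so it falls back to the * group.
print(rp.can_fetch("Googlebot", "/blog/post"))       # True
print(rp.can_fetch("Googlebot", "/admin/settings"))  # False: /admin/ is disallowed
print(rp.can_fetch("GPTBot", "/blog/post"))          # False: its group disallows everything
```

The same fallback logic applies to any crawler name that has no dedicated group: it inherits the wildcard group's rules in full, not a merge of both.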

Robots.txt is advisory, not enforceable. Legitimate crawlers from Google, Bing, OpenAI, and others respect it. Malicious scrapers do not. The file controls crawling, not indexing. A page that is disallowed in robots.txt can still appear in search results if other pages link to it. To prevent indexing, you need a noindex meta tag or an X-Robots-Tag header.

How to use this robots.txt tester

Using the tool takes three steps. Paste the contents of your robots.txt file into the text area. Enter the URL path you want to test (for example, /admin/settings or /api/v1/users). Select a user-agent from the dropdown (Googlebot, GPTBot, Bingbot, or any other bot) and click Test URL.

The results panel shows an immediate ALLOWED or BLOCKED verdict, the specific rule that matched (with line number), and the complete list of parsed directives grouped by user-agent. If you only want to check the syntax without testing a specific path, click Validate Only to see warnings and parsed structure.

The tester highlights the matching group and the matching rule in the parsed directives view, so you can see exactly which line produced the result. If a wildcard group served as the fallback, that is shown too.

Robots.txt directive reference

Every robots.txt file uses a small set of directives. Here is what each one does:

  • User-agent -- Specifies which crawler the following rules apply to. Use * for all bots.
  • Disallow -- Blocks the crawler from fetching the specified path prefix. An empty value (Disallow:) means allow everything.
  • Allow -- Explicitly permits a path within a broader Disallow. Useful for carving out exceptions.
  • Sitemap -- Points crawlers to one or more XML sitemaps. Not group-specific; can appear anywhere in the file.
  • Crawl-delay -- Requests a pause (in seconds) between successive fetches. Supported by Bing and Yandex, ignored by Googlebot.
  • Host -- Specifies the preferred domain for crawling. A Yandex-specific extension, not part of the standard.
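Put together, a small hypothetical robots.txt using most of these directives might look like this (example.com is a placeholder):

```
# Group for every crawler without a more specific group
User-agent: *
Disallow: /admin/
Allow: /admin/help/
Crawl-delay: 10

# Block OpenAI's training crawler entirely
User-agent: GPTBot
Disallow: /

# Not tied to any group; can appear anywhere in the file
Sitemap: https://example.com/sitemap.xml
```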

Common robots.txt mistakes and how to fix them

Most robots.txt problems come from a few recurring errors. Here are the ones this validator catches or that you should watch for:

Rules before a User-agent line. Every Allow or Disallow must belong to a group that starts with User-agent. If you put a Disallow at the top of the file without declaring a User-agent first, compliant crawlers ignore the rule entirely.

Blocking CSS and JavaScript. Blocking /css/ or /js/ directories prevents Googlebot from rendering your pages. Google has stated repeatedly that it needs access to these resources for proper indexing. If Search Console shows rendering errors, check robots.txt first.

Confusing Disallow with noindex. A Disallow rule stops crawling, not indexing. If other sites link to a disallowed page, Google may still index the URL with whatever information it can gather from link text. To remove a page from search results, use a noindex meta tag or X-Robots-Tag header instead.
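For illustration, the two deindexing mechanisms look like this -- and in both cases the URL must remain crawlable so the bot can actually read the directive:

```
<!-- Option 1: meta tag in the page's <head> (HTML pages only) -->
<meta name="robots" content="noindex">

Option 2: an HTTP response header (works for non-HTML resources like PDFs)
X-Robots-Tag: noindex
```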

Trailing whitespace and encoding issues. Invisible characters, BOM markers, or Windows-style line endings can cause unexpected parsing failures. If a rule looks correct but is not matching, copy the raw file content and paste it into a validator to check for hidden characters.

Blocking your own sitemap. If your Disallow covers the path where your sitemap lives (for example, Disallow: / without an Allow: /sitemap.xml), crawlers may not discover or fetch it even though you declared it with a Sitemap: directive.

Wildcard patterns and the $ anchor

While the original robots.txt spec only supported simple prefix matching, Google and Bing extended it with two pattern characters: * (match any sequence of characters) and $ (match the end of the URL).

The wildcard * matches zero or more characters anywhere in the path. For example, Disallow: /private/*/config blocks /private/user/config, /private/admin/config, and anything else that fits the pattern. Without the wildcard, you would need a separate rule for each intermediate path segment.

The $ anchor forces an exact-end match. Disallow: /*.pdf$ blocks all URLs ending in .pdf but still allows /docs/pdf-guide/ because the URL does not end with .pdf. Without the $, the pattern /*.pdf would also match /file.pdf?download=true since prefix matching continues past the match point.
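A compact way to implement this matching is to translate each pattern into a regular expression: * becomes .* and a trailing $ becomes a regex end anchor. This is a sketch of the technique, not Googlebot's actual code:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.
    '*' matches any character sequence; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(regex + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, giving prefix semantics.
    return robots_pattern_to_regex(pattern).match(path) is not None

print(matches("/*.pdf$", "/file.pdf"))                       # True
print(matches("/*.pdf$", "/file.pdf?download=true"))         # False: must end in .pdf
print(matches("/*.pdf", "/file.pdf?download=true"))          # True: prefix match continues
print(matches("/private/*/config", "/private/user/config"))  # True
```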

This tester supports both * wildcards and the $ end anchor, matching the behavior of Googlebot and Bingbot. The parsed directives view shows the exact pattern so you can verify it matches what you intended.

How Googlebot, Bingbot, and GPTBot read robots.txt differently

All three crawlers follow the same basic protocol, but there are important differences in how they handle edge cases and extensions.

Googlebot follows RFC 9309 strictly. It supports Allow, Disallow, wildcards (*), and the $ anchor. It uses the most specific matching rule (longest path wins). It ignores Crawl-delay entirely -- use Search Console's crawl rate settings instead. Googlebot uses multiple user-agent tokens: Googlebot for web search, Googlebot-Image for image search, and Google-Extended for Gemini and AI training data.

Bingbot also supports wildcards and the $ anchor. Unlike Googlebot, it respects Crawl-delay. Bing's implementation follows similar precedence rules to Google -- when multiple rules match, the most specific one wins.

GPTBot is OpenAI's crawler for training data and web browsing by ChatGPT. It reads robots.txt using the user-agent token GPTBot. Adding User-agent: GPTBot followed by Disallow: / blocks OpenAI from crawling your site for AI training purposes. A separate token, ChatGPT-User, controls web browsing within ChatGPT conversations. Blocking one does not block the other.

robots.txt vs meta robots vs X-Robots-Tag

These three mechanisms control different things. Understanding which one to use where prevents common mistakes:

  • robots.txt -- Controls crawling (whether a bot fetches a URL). Applies to entire paths or patterns, before fetching.
  • meta robots -- Controls indexing, link following, and snippet display. Applies to individual HTML pages, after fetching and parsing.
  • X-Robots-Tag -- Same controls as meta robots, delivered via HTTP header. Applies to any resource (PDFs, images, APIs), after fetching.

A common pitfall: if you block a URL with robots.txt, the crawler never fetches the page, so it never sees a noindex meta tag on it. That means robots.txt and noindex cannot work together on the same page. If you want to deindex a page, you must allow crawling so the bot can read the noindex directive.

Use robots.txt to manage crawl budget -- keeping bots away from low-value paths like admin panels, staging content, and duplicate parameter URLs. Use meta robots or X-Robots-Tag to control whether a page appears in search results.

Frequently Asked Questions

Does robots.txt block indexing?

No. Robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in search results if Google finds links to it from other pages. To prevent indexing, use a noindex meta tag or X-Robots-Tag HTTP header.

Where should robots.txt be located?

Robots.txt must be at the root of the domain: https://example.com/robots.txt. It cannot be in a subdirectory. Each subdomain needs its own file -- blog.example.com/robots.txt is separate from example.com/robots.txt.

How do I block AI crawlers like GPTBot?

Add a group for each AI crawler you want to block. For OpenAI: User-agent: GPTBot followed by Disallow: /. For ChatGPT web browsing: add User-agent: ChatGPT-User with Disallow: /. For Google's AI training: use User-agent: Google-Extended with Disallow: /. Each requires a separate group.
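Put together, a robots.txt that opts out of all three looks like this:

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block live web browsing from ChatGPT conversations
User-agent: ChatGPT-User
Disallow: /

# Opt out of Google AI training (does not affect Google Search crawling)
User-agent: Google-Extended
Disallow: /
```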

What happens when multiple rules match a URL?

When multiple Allow and Disallow rules match the same URL, the most specific rule wins. Specificity is determined by the length of the path pattern (ignoring wildcards). If an Allow: /admin/public/ and a Disallow: /admin/ both match /admin/public/page, the Allow wins because its path is longer.
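The precedence logic can be sketched in a few lines. This simplified version handles plain prefix patterns only (no wildcards) and applies the RFC 9309 tie-breaker, where equal-length patterns favor Allow:

```python
def verdict(rules, path):
    """Google-style precedence: the longest matching pattern wins.
    On equal length, RFC 9309 favors the least restrictive rule (Allow).
    rules: list of (directive, pattern) tuples, prefix patterns only."""
    matching = [(len(p), d == "allow") for d, p in rules if path.startswith(p)]
    if not matching:
        return "allowed"  # no rule matches: crawling is permitted
    # max() compares (length, is_allow): longest wins, ties favor Allow.
    _, is_allow = max(matching)
    return "allowed" if is_allow else "blocked"

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(verdict(rules, "/admin/public/page"))  # allowed: the Allow pattern is longer
print(verdict(rules, "/admin/settings"))     # blocked: only the Disallow matches
print(verdict(rules, "/blog/post"))          # allowed: no rule matches
```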

Does Googlebot respect Crawl-delay?

No. Googlebot ignores the Crawl-delay directive entirely. To control Google's crawl rate, use the crawl rate limiter in Google Search Console. Bingbot and Yandex do respect Crawl-delay.

Is my data sent to a server when I use this tool?

No. This robots.txt tester runs entirely in your browser. The robots.txt content, URL paths, and user-agent selections are parsed and tested using JavaScript on your device. Nothing is sent to any server.

Can I test wildcard patterns?

Yes. This tool supports both the * wildcard (matches any sequence of characters) and the $ end-of-URL anchor. These are the same extensions used by Googlebot and Bingbot for pattern matching in robots.txt rules.

Related Free Tools

If you are debugging crawl behavior or site accessibility, these tools help with adjacent tasks:

  • Bulk HTTP Status Checker -- test up to 100 URLs at once for status codes, redirects, and response times. Useful for verifying that pages blocked by robots.txt are not returning 200 when they should return 403 or 404.
  • HTML Formatter & Beautifier -- format and pretty-print HTML to inspect page structure, meta robots tags, and canonical links.
  • JSON Formatter & Validator -- format and validate API responses, useful when working with sitemap or indexing APIs that return JSON.
  • Curl Command Generator -- convert curl commands to Python, JavaScript, Ruby, or PHP code for automating HTTP requests and crawl checks.
