
2026-06-24
Robots.txt Explained: What It Can and Cannot Do
Robots.txt controls crawling, not indexing — the #1 SEO misconception. Learn the syntax, the noindex trap, real examples, and the mistakes that deindex entire sites.
robots.txt is a plain text file with outsized power: one wrong character can make Google abandon your entire site. Yet the most dangerous thing about it is a misconception — robots.txt controls crawling, not indexing. Understanding that distinction prevents the two most damaging SEO mistakes a site can make.
What robots.txt is
A text file at your domain root — example.com/robots.txt, no other location works — that tells crawlers which URL paths they may fetch. It is the first thing Googlebot requests when visiting your site.
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
- User-agent names the bot the rules address (* = all)
- Disallow lists path prefixes not to crawl
- Allow carves out exceptions within disallowed paths
- Sitemap points to where your page list lives (always include this)
Build a correct file in seconds with our Robots.txt Generator.
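How the Allow exception interacts with the broader Disallow is easier to see in code. The following is a minimal sketch of the precedence rule described in RFC 9309 and Google's documentation: the longest matching path wins, and Allow wins a tie. The names `is_allowed` and `RULES` are hypothetical, and the sketch handles plain path prefixes only (no `*` or `$` wildcards).

```python
# Rules for the "User-agent: *" group from the example file above.
RULES = [
    ("Disallow", "/admin/"),
    ("Disallow", "/cart/"),
    ("Allow", "/admin/public/"),
]


def is_allowed(rules, path):
    """Return True if `path` may be crawled under longest-match precedence."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix:
            continue  # an empty value matches nothing
        if path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # A longer prefix is more specific; on equal length, Allow wins.
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


print(is_allowed(RULES, "/admin/settings"))     # blocked by /admin/
print(is_allowed(RULES, "/admin/public/help"))  # rescued by the longer Allow
print(is_allowed(RULES, "/blog/post"))          # no rule matches
```

Not every parser uses this precedence. Python's standard `urllib.robotparser`, for instance, applies rules in file order, so it would treat /admin/public/ as blocked in the file above. Placing Allow lines before the Disallow they refine keeps both styles of parser in agreement.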
The crawling vs indexing trap
Here is the counterintuitive part: Disallow does not remove a page from Google. If other sites link to a blocked page, Google can index the bare URL anyway — shown with "No information is available for this page."
Worse, the two controls sabotage each other:
| Goal | Right tool | Common mistake |
|---|---|---|
| Stop crawling (save crawl budget) | robots.txt Disallow | — |
| Remove from search results | noindex meta tag, page crawlable | Blocking it in robots.txt too |
If you block a page in robots.txt and add noindex, Google never crawls the page, never sees the noindex, and the URL can stay indexed indefinitely. To deindex: allow crawling, add noindex, wait for recrawl, then optionally block.
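The conflict can be checked mechanically. The sketch below is illustrative only: `has_noindex` and `deindex_status` are hypothetical helper names, and the caller is assumed to already know whether robots.txt blocks the URL and to have the page's HTML and response headers in hand.

```python
from html.parser import HTMLParser


class _RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def has_noindex(html, headers):
    """True if noindex appears in a meta robots tag or an X-Robots-Tag header."""
    finder = _RobotsMetaFinder()
    finder.feed(html)
    header_hit = any(
        name.lower() == "x-robots-tag" and "noindex" in value.lower()
        for name, value in headers.items()
    )
    return finder.noindex or header_hit


def deindex_status(blocked_by_robots, html, headers):
    """Classify a page against the crawl-versus-index table above."""
    noindex = has_noindex(html, headers)
    if noindex and blocked_by_robots:
        return "conflict: crawlers cannot fetch the page, so noindex is never seen"
    if noindex:
        return "ok: page is crawlable and carries noindex"
    if blocked_by_robots:
        return "blocked: not crawled, but the bare URL can still be indexed"
    return "indexable"


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(deindex_status(True, page, {}))
print(deindex_status(False, page, {}))
```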
What robots.txt cannot do
- Security. It is a public file that merely requests good behaviour. Malicious bots ignore it — and read it as a map of interesting paths. Never list secret URLs; protect private content with authentication.
- Guaranteed compliance. Reputable crawlers (Google, Bing) obey; scrapers don't.
- Per-page nuance. For that, use meta robots tags or X-Robots-Tag headers.
The mistakes that hurt sites
- Disallow: / left over from staging. The classic launch-day disaster: the entire site becomes uncrawlable. Check your live file after every deploy.
- Blocking CSS/JS directories. Google renders pages; blocked assets mean it sees broken layouts, which can harm rankings.
- Using Disallow to deindex (the trap above).
- Wrong location or case. Must be /robots.txt at the root, lowercase; subdirectory copies are ignored.
- No Sitemap line. Free crawl-discovery help that many sites skip.
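Two of these mistakes, the leftover site-wide Disallow and the missing Sitemap line, are easy to catch automatically after a deploy. Here is a minimal sketch using Python's standard `urllib.robotparser`. The function name `check_robots` and the probe URL are assumptions, and the file's text is passed in as a string so the check runs without network access.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, probe_url="https://example.com/"):
    """Return a list of problems found in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    problems = []
    # A leftover "Disallow: /" makes even the homepage unfetchable.
    if not parser.can_fetch("Googlebot", probe_url):
        problems.append("homepage is disallowed for Googlebot")
    # site_maps() returns None when the file has no Sitemap line (Python 3.8+).
    if not parser.site_maps():
        problems.append("no Sitemap line")
    return problems


staging_leftover = "User-agent: *\nDisallow: /\n"
print(check_robots(staging_leftover))
```

In a deploy pipeline, the same function could be fed the body of the live /robots.txt and fail the build when the returned list is non-empty.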
Sensible defaults
Most sites need very little:
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Block only genuine crawl-waste: infinite calendar pages, faceted search combinations, cart/checkout flows. When in doubt, leave it open — an over-permissive robots.txt is harmless; an over-restrictive one is a slow-motion catastrophe.
Frequently asked questions
How do I check what my robots.txt is doing? Google Search Console → Settings → robots.txt report shows the fetched file, its fetch status and any parse errors or warnings. To see whether a specific URL is blocked, run it through URL Inspection.
Can I block AI crawlers like GPTBot? Yes. Major AI crawlers state that they respect robots.txt: add a User-agent: GPTBot group containing Disallow: /. Weigh the trade-off: blocking AI crawlers can also keep you out of AI-powered answer engines that might otherwise cite and link you.
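Does crawl-delay work? Google ignores it, and the crawl-rate limiter that Search Console once offered was retired in early 2024. Bing respects it; Yandex has said it no longer does and offers a crawl-rate setting in Yandex Webmaster instead. Rarely needed on modern hosting.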
How fast do robots.txt changes take effect? Google generally caches the file for up to 24 hours, so a change can take about a day to be picked up. Do not expect instant behaviour changes.
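For the per-crawler questions above, it can help to see how a compliant parser reads user-agent groups and Crawl-delay. The sketch below uses Python's standard `urllib.robotparser` on a hypothetical file. It shows how the rules are interpreted, not whether any particular crawler honours them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: block GPTBot, slow bingbot, leave everyone else open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: bingbot
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"
gptbot_allowed = parser.can_fetch("GPTBot", url)        # matched by its own group
googlebot_allowed = parser.can_fetch("Googlebot", url)  # falls back to the * group
bing_delay = parser.crawl_delay("bingbot")              # seconds, or None if unset

print(gptbot_allowed, googlebot_allowed, bing_delay)
```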