The common belief is that a broken robots.txt is a minor issue that Google handles gracefully. In reality, each error type triggers a fundamentally different crawl behavior — a 404 grants full access, a 5xx causes Googlebot to halt all crawling after an extended cache period, and a timeout falls somewhere in between, depending on its duration and frequency. Misunderstanding these distinctions has caused site-wide crawl shutdowns during server incidents and unintended full-site exposure during CDN migrations.
A 404 robots.txt grants Googlebot unrestricted crawl access
RFC 9309, the Robots Exclusion Protocol specification, defines the behavior clearly: if a robots.txt request returns a 404 or 410 status code, the crawler must treat the site as having no crawl restrictions. Google follows this specification exactly. A 404 on robots.txt means every URL on the site is crawlable, regardless of what directives existed in any previously cached version.
The design rationale makes sense from the protocol perspective. A site that has never published a robots.txt file (returning 404 by default) should be fully crawlable. The specification cannot distinguish between “never existed” and “accidentally deleted.” It treats both cases identically.
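The status-code handling described above can be sketched as a small decision function. This is an illustrative model of the RFC 9309 semantics, not Google's actual implementation; the string labels are invented for the sketch:

```python
def robots_fetch_decision(status: int) -> str:
    """Illustrative mapping of a robots.txt fetch status to crawl
    behavior, following the RFC 9309 semantics described above."""
    if status in (404, 410):
        return "allow-all"    # site treated as having no restrictions
    if 200 <= status < 300:
        return "apply-rules"  # parse the body and obey its directives
    if status >= 500:
        return "use-cache"    # fall back to previously cached robots.txt
    raise ValueError("status outside the cases discussed here")
```

The key asymmetry is visible in the first two branches: a 404 and a 200 with an empty body both end in unrestricted crawling, while only a 5xx preserves the old directives.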
The practical scenarios that cause accidental 404 responses are numerous:
- CDN misconfigurations, the most common case: when migrating to a CDN or updating CDN rules, robots.txt may not be included in the cached or proxied content, causing the CDN edge to return 404.
- Platform migrations where the new platform does not automatically serve robots.txt at the root path.
- Container deployment errors where the web server configuration does not map the /robots.txt path to the correct file.
- DNS failover where the failover server serves a minimal site that does not include robots.txt.
The consequences of a temporary 404 extend beyond the 404 duration. During the window where robots.txt returns 404, Googlebot discovers and queues URLs from previously blocked sections. Even after robots.txt is restored with correct directives, those discovered URLs remain in Googlebot’s URL inventory. They may continue to receive crawl requests and could be indexed before the restored directives take effect on the next cache refresh.
For sites that rely heavily on robots.txt to block internal tools, staging content, or parameter-heavy URL spaces, a 404 on robots.txt is a full exposure event. Treating robots.txt availability as a critical infrastructure requirement, not a convenience, is the appropriate operational posture.
A 5xx robots.txt triggers conservative crawl restriction after cache expiry
When robots.txt returns a 5xx error, Google’s behavior follows a multi-phase timeline.
Phase 1: Cache persistence (hours to days). Google continues using the previously cached robots.txt. Crawling proceeds normally based on the cached directives. No immediate change in crawl behavior is visible.
Phase 2: Extended cache (days to weeks). If 5xx errors persist and normal cache TTL expires, Google extends the cache lifetime rather than discarding it. Google’s documentation states that caching extends “in situations where refreshing the cached version isn’t possible.” During this phase, crawling continues but at a potentially reduced rate as Google recognizes the server instability.
Phase 3: Full restriction (approximately 30 days). If robots.txt remains unreachable for approximately 30 consecutive days, Google treats the site as fully restricted. All crawling stops. This is Google’s safety mechanism: it assumes the site owner intended to block crawling but cannot communicate directives due to server failure. The alternative (allowing unrestricted crawling of a potentially broken site) would be worse.
Phase 4: Recovery. Once robots.txt returns a valid 200 response, Google resumes crawling. The resumption is not instantaneous. Google ramps up crawl rate gradually over days, rebuilding confidence that the server is stable and the directives are reliable. A site that lost 30 days of crawling due to robots.txt 5xx errors may take an additional 1-2 weeks to return to pre-incident crawl volumes.
Google’s 30-day threshold is an implementation-specific extension of RFC 9309. The standard itself recommends treating unreachable robots.txt with a “reasonable cache period” and then reverting to the last known directives. Google’s interpretation of “reasonable” is approximately 30 days before escalating to full restriction.
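The multi-phase timeline can be modeled as a function of how long robots.txt has been unreachable. The ~30-day full-restriction threshold follows Google's documented behavior as described above; the one-day normal cache TTL is an assumption for illustration:

```python
def unreachable_phase(days_unreachable: float,
                      cache_ttl_days: float = 1.0) -> str:
    """Illustrative model of the phased 5xx behavior described above.
    The cache_ttl_days default is an assumed normal cache lifetime."""
    if days_unreachable <= cache_ttl_days:
        return "phase-1: normal cache, crawling unchanged"
    if days_unreachable < 30:
        return "phase-2: extended cache, crawl rate may be reduced"
    return "phase-3: full restriction, all crawling stops"
```

The model makes the operational risk concrete: nothing distinguishes day 2 from day 29 from the outside, yet crossing day 30 flips the site into a state that takes weeks to recover from.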
Timeout behavior and its effect on robots.txt cache state
Timeout handling occupies a gray zone between 404 and 5xx behavior. The crawl implications depend on whether the timeout is complete (no response received at all) or partial (connection established but response not completed).
Complete timeouts (connection refused, DNS failure, no response within Google’s timeout window) are treated similarly to 5xx errors. Google falls back to the cached robots.txt, extends cache lifetime, and follows the same multi-phase progression toward full restriction if timeouts persist.
Partial timeouts (connection established, headers partially received, response truncated) create ambiguity. Google may attempt to parse whatever data was received. If the partial response contains valid robots.txt directives, Google may use them. If the response is too truncated to parse, it falls back to cache.
Intermittent timeouts produce the most problematic crawl behavior. When robots.txt sometimes loads correctly and sometimes times out, Google receives inconsistent signals. On successful fetches, it updates the cache with current directives. On timeouts, it extends the cache from the last successful fetch. The result is that Googlebot may operate on directives that are sometimes current and sometimes stale, depending on the timing of its re-fetch attempts relative to the server’s availability windows.
This inconsistency is difficult to diagnose because it does not produce a clean error signal. Search Console may show no robots.txt errors if Google’s monitoring happens to hit the server during a healthy window. Server logs may show successful robots.txt fetches interleaved with timeouts, making it unclear which version Google is using at any given moment.
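One way to surface the intermittent pattern is to bucket robots.txt fetch outcomes from server logs by day and flag days that mix successes with timeouts. The `(timestamp, outcome)` record format here is a hypothetical example of what a log-extraction step might produce, not a standard:

```python
from collections import Counter
from datetime import datetime

def robots_fetch_summary(records):
    """Bucket robots.txt fetch outcomes per day and flag mixed days.

    `records` is a list of (iso_timestamp, outcome) pairs, where
    outcome is e.g. "200" or "timeout" — a hypothetical format
    extracted from server logs upstream of this function.
    """
    by_day = Counter()
    for ts, outcome in records:
        day = datetime.fromisoformat(ts).date().isoformat()
        by_day[(day, outcome)] += 1
    # Days with both successes and timeouts are the problem case:
    # Googlebot's view may be current or stale depending on when
    # its re-fetch attempts landed.
    days = {day for day, _ in by_day}
    mixed = sorted(
        day for day in days
        if by_day[(day, "200")] and by_day[(day, "timeout")]
    )
    return by_day, mixed
```

Running this over a few weeks of logs turns the "no clean error signal" problem into a concrete list of days on which Google's cached directives cannot be assumed current.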
The 30-day extended unavailability full crawl block
The 30-day full-restriction trigger deserves specific attention because it represents a severe, cascading failure that is difficult to reverse.
The progression is silent. Google does not send a notification in Search Console when robots.txt enters the extended cache phase. There is no alert when the 30-day threshold is reached. The first visible symptom is a dramatic drop in crawl stats, followed by pages disappearing from the index as Google stops recrawling them and existing indexed pages gradually become stale.
Recovery requires more than simply restoring robots.txt. After the 30-day full-restriction state activates, Google must re-establish trust in the site’s robots.txt reliability. The ramp-up period involves Google testing with low-frequency crawls, verifying that robots.txt continues to respond correctly, and gradually increasing crawl rate. Sites with millions of pages may take weeks to recover full crawl coverage after a 30-day outage.
The monitoring implication: robots.txt availability should be monitored with the same urgency as site uptime. A 5xx error on robots.txt that persists for even 24 hours should trigger an alert, because the cascading consequences are disproportionate to what appears to be a single failed URL.
Monitoring and alerting configuration for robots.txt availability
Robots.txt failure has outsized consequences compared to any other single URL. A dedicated monitoring configuration is the appropriate response.
Health check frequency. Monitor robots.txt availability every 5 minutes from multiple geographic locations. Googlebot crawls from data centers worldwide; a robots.txt that responds in one region but fails in another produces inconsistent behavior.
Alert thresholds. Alert on any 5xx response. Alert on response times exceeding 2 seconds (timeout risk). Alert on 404 responses (full exposure risk). Alert on response body changes that differ from the expected directives (potential hijacking or misconfiguration).
HTTP methods. Test with both GET and HEAD requests. Googlebot uses GET to fetch robots.txt. Some server configurations respond differently to HEAD requests, which can mask issues that affect the actual GET fetch.
Content verification. A 200 response does not guarantee correct content. Server misconfigurations can serve HTML error pages (custom 404 pages, maintenance pages) with a 200 status code. The monitoring should verify that the response body contains expected robots.txt directives, not just that the status code is 200. Google’s parser will attempt to extract valid rules from HTML content but will discard everything it cannot parse.
Integration with deployment pipelines. Add a robots.txt verification step to deployment pipelines. After each deployment, automatically fetch robots.txt and compare against the expected version. This catches deployment errors that overwrite or remove the file before they affect Googlebot.
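The alert conditions above can be consolidated into a single evaluation function. The probe itself (GET requests from each monitoring region) happens upstream; this sketch only scores the result, and the thresholds mirror the ones suggested in this section:

```python
def robots_alerts(status: int, latency_s: float, body: str,
                  expected_body: str,
                  latency_threshold_s: float = 2.0) -> list:
    """Evaluate one robots.txt probe against the alert conditions
    described above. Returns a list of alert strings (empty = healthy)."""
    alerts = []
    if status >= 500:
        alerts.append("5xx: cache-expiry countdown toward full restriction")
    elif status == 404:
        alerts.append("404: full crawl exposure")
    if latency_s > latency_threshold_s:
        alerts.append(f"slow response ({latency_s:.1f}s): timeout risk")
    if status == 200 and body.strip() != expected_body.strip():
        # Catches 200 responses that serve HTML error pages or a
        # deployment that silently overwrote the directives.
        alerts.append("content drift: body differs from expected directives")
    return alerts
```

Because the function is pure, the same logic can run in the 5-minute health check and as a post-deployment verification step, with only the fetch layer differing.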
Does a 301 redirect on robots.txt to a different domain cause Googlebot to follow it?
Googlebot follows redirects on robots.txt requests, including cross-domain redirects, up to a limit of five hops. If the final destination returns a valid robots.txt file, Google applies those rules to the original host. This creates a risk when a domain migration redirects /robots.txt to the new domain’s robots.txt, which may contain rules not intended for the old domain’s URL structure. Verifying that redirected robots.txt rules match the source domain’s needs prevents unintended crawl access or blocking.
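The hop-limited redirect chase can be sketched as follows. The `responses` dictionary stands in for real HTTP fetches in this illustration; what a crawler does when the limit is exceeded varies, so this sketch simply gives up:

```python
def resolve_robots(start_url: str, responses: dict, max_hops: int = 5):
    """Follow robots.txt redirects up to max_hops, as described above.

    `responses` maps a URL to either ("redirect", target_url) or
    ("body", robots_txt_text) — a stand-in for real HTTP fetches.
    Returns (final_url, body), or (None, None) past the hop limit.
    """
    url = start_url
    for _ in range(max_hops + 1):  # initial fetch + up to max_hops follows
        kind, value = responses[url]
        if kind == "body":
            # Note: these rules are applied to the ORIGINAL host,
            # which is the cross-domain risk noted above.
            return url, value
        url = value
    return None, None
```

A migration review can run this against the old domain's robots.txt URL and diff the resolved body against the rules the old URL structure actually needs.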
Does serving robots.txt from a subdomain affect how Googlebot applies the rules to the root domain?
Robots.txt is domain-specific. Each subdomain requires its own robots.txt file at its root path. Rules in www.example.com/robots.txt do not apply to blog.example.com, and vice versa. A missing robots.txt on any subdomain results in full crawl access to that subdomain. Sites using multiple subdomains must ensure each one has a correctly configured robots.txt file, or accept that unprotected subdomains will be crawled without restrictions.
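The per-host scoping means the governing robots.txt URL is derived purely from a page's scheme and host. A minimal helper, useful when auditing which file covers which subdomain:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs a given page.
    Rules are scoped per scheme + host (and port), so each
    subdomain is checked against its own file."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Mapping every hostname in a site inventory through this function yields the list of robots.txt files that each need their own availability monitoring.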
Does a robots.txt file that returns a 200 status with an empty body block all crawling or allow all crawling?
An empty robots.txt file returned with a 200 status is treated as having no rules, which grants Googlebot unrestricted crawl access to the entire domain. This is functionally identical to a 404 response. The absence of disallow directives means nothing is blocked. Sites intending to restrict crawling must ensure the file contains explicit rules, as both a missing file and an empty file result in full access.
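Python's standard-library `urllib.robotparser` demonstrates the allow-all semantics of an empty file — this shows the protocol's behavior generally, not Google's parser specifically:

```python
from urllib.robotparser import RobotFileParser

# An empty robots.txt body: no rules, so everything is crawlable.
empty = RobotFileParser()
empty.parse([])
print(empty.can_fetch("Googlebot", "https://example.com/private/"))  # True

# The same path with an explicit Disallow rule is blocked.
restricted = RobotFileParser()
restricted.parse("User-agent: *\nDisallow: /private/".splitlines())
print(restricted.can_fetch("Googlebot", "https://example.com/private/x"))  # False
```

The contrast between the two parsers is the whole point: only explicit directives restrict anything; an empty 200 body restricts nothing.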
Sources
- How Google Interprets the robots.txt Specification — Google’s documentation on robots.txt caching, 404 handling (full access), and 5xx handling (extended cache to full restriction)
- How HTTP Status Codes Affect Google’s Crawlers — Google’s error handling documentation for crawler requests including robots.txt
- Robots.txt Introduction and Guide — Google’s foundational robots.txt specification covering file requirements and error states
- Google’s robots.txt Parser (GitHub) — Google’s open-source robots.txt parser implementation showing how it handles parsing edge cases