Why crawl budget becomes a real problem on real sites
Crawl budget only really makes sense the first time you watch Googlebot burn through pages you’d never want in the index, while the pages that actually matter sit there untouched for weeks. On a small business site, that can be the difference between a new service page ranking this month versus “at some point”. On bigger ecommerce or multi-location sites, it’s the difference between Google treating your catalogue as current or quietly letting it go stale.
It’s also not a score you “boost” with a plugin. Crawl budget is the practical outcome of two constraints: how much Google is prepared to crawl, based on how valuable and healthy your site looks, and how much crawling your server can tolerate before performance drops. When either side tightens, Google has to prioritise. If your internal linking, redirects, parameters, and duplicate URLs are messy, Google’s priorities won’t line up with yours.
What Google actually means by crawl budget
Google defines crawl budget as the number of URLs Googlebot can and wants to crawl. In the real world, it behaves like a queue.
First, there’s the crawl rate limit: the ceiling set by your server’s responses. If pages start timing out, slowing down, or throwing errors, Googlebot backs off. Then there’s crawl demand, Google deciding how much attention your site deserves based on signals like popularity, freshness, and how often it finds meaningful changes.
So when you publish a new landing page and it doesn’t get crawled, it’s rarely because “Google hates my page”. More often, Googlebot is tied up re-crawling faceted URLs, old redirected paths, tag archives, thin internal search pages, or endless calendar pages, because the site keeps presenting them as crawlable and, by implication, important.
When crawl budget matters (and when it doesn’t)
If you’ve got a tidy 20-page brochure site, you can usually park the worry. Google will crawl it often enough unless your server is unreliable or you’ve accidentally created thousands of URLs via filters, tags, or parameters.
Crawl budget starts to bite when you’ve got scale or churn. Ecommerce with thousands of products, real estate listings, automotive inventory, multi-location service pages, big blog archives, or any site that produces lots of near-duplicate URLs will hit the wall sooner. It matters even more when changes need to be picked up quickly: product availability, pricing, promotions, or new location pages.
A very common “small business” crawl budget issue is a site that looks small on the surface but has a massive URL footprint underneath. WordPress is notorious for this via tag pages, author archives, date archives, internal search results, and parameterised URLs from plugins. Shopify and similar platforms do it with collection filters, sorting parameters, and multiple URL paths to the same product.
The crawl budget killers we see in the wild
Duplicate and near-duplicate URLs
Google will crawl duplicates all day if you keep publishing them. The catch is that your important pages end up competing for attention. Tracking parameters, sort orders, filter combinations, session IDs, and multiple category paths are the usual suspects.
Canonical tags help, but they’re a hint, not a promise. If you’re generating millions of combinations, canonicals won’t rescue you on their own. You either need to stop creating crawlable URLs in the first place, or block them properly.
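As a sketch of the “stop creating crawlable URLs” idea, the snippet below collapses parameterised URLs to a single canonical form. The parameter names here are placeholders; your own site’s tracking and layout parameters will differ.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative parameter lists -- swap in the ones your platform actually emits.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sessionid"}
LAYOUT_PARAMS = {"sort", "order", "view"}

def canonicalise(url: str) -> str:
    """Collapse a parameterised URL to one canonical form by dropping
    tracking and layout parameters and sorting whatever remains."""
    scheme, netloc, path, query, _ = urlsplit(url)
    kept = sorted(
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and k not in LAYOUT_PARAMS
    )
    # Normalise trailing slashes too, so /shoes and /shoes/ don't split signals
    return urlunsplit((scheme, netloc, path.rstrip("/") or "/", urlencode(kept), ""))
```

Run over a crawl export, a function like this makes it obvious how many “different” URLs are really the same page.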
Redirect chains and soft 404s
Every redirect hop costs crawl resources. A single clean 301 is fine. A chain of 301s, mixed 302s, or redirects that dump users onto irrelevant pages adds up quickly. Soft 404s are another quiet drain: pages that return a 200 status but effectively say “not found”. Google still has to fetch, render, and assess them.
Internal linking that promotes the wrong pages
Googlebot follows links. If your navigation, footer, and internal modules heavily promote low-value pages (tag archives, expired promos, thin location variations), you’re effectively telling Google, “these are important”.
This is where architecture and crawl efficiency overlap. If you want the deeper mechanics, our piece on how search engines crawl and understand website architecture explains how bots discover and prioritise URLs based on site structure.
Bloated XML sitemaps
Sitemaps aren’t a magic wand. They’re a list of suggestions. If your sitemap is packed with redirected URLs, non-canonical variants you don’t want indexed, parameter pages, or thin content, you’re handing Google a noisy, unhelpful to-do list. The best sitemaps are boring: only indexable, canonical URLs you genuinely want appearing in search.
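For reference, a “boring” sitemap entry looks like this; the domain and path are placeholders, and the point is what’s absent: no redirected, parameterised, or non-canonical URLs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only final, indexable, canonical URLs belong here -->
  <url>
    <loc>https://www.example.com/services/emergency-plumbing</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
</urlset>
```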
Server performance and caching gaps
If Googlebot runs into timeouts, 5xx errors, or consistently slow responses, it reduces crawl rate. This is one of the few crawl budget problems that can hurt even small sites. You can do everything “right” on page and still struggle to get timely crawling if hosting is underpowered, caching is misconfigured, or the site does heavy work on every request.
Core Web Vitals aren’t a direct “crawl budget metric”, but poor performance often travels with poor crawlability because both point to the same underlying issue: a site that’s expensive to fetch and render.
How to diagnose crawl budget issues properly
Start with Google Search Console crawl stats
Search Console’s crawl stats report is as close as you’ll get to a crawl budget dashboard. Look for trends, not one-off spikes. If crawl requests are high but your key pages aren’t being indexed or refreshed, Googlebot is probably spending its time on the wrong URLs. If crawl requests are low and you’re seeing server errors or timeouts, you’re likely dealing with a crawl rate limit problem.
Use server logs when the stakes are high
When it’s serious, nothing beats log analysis. Search Console aggregates and samples; logs show exactly what Googlebot requested, how often, and what status codes you returned. Waste becomes obvious fast: repeated hits to parameter URLs, old redirect targets, and thin templates that never should have been crawlable.
Logs also uncover the awkward problems, like Googlebot getting trapped in an infinite URL space created by calendar widgets, internal search, or faceted navigation.
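A minimal sketch of that kind of log triage, assuming a combined-format access log; the regex and bucketing are illustrative, and verifying genuine Googlebot traffic (via reverse DNS lookup) is left out for brevity.

```python
import re
from collections import Counter

# Matches the request path and a Googlebot user-agent on a combined-format log line.
# Note: UA strings can be spoofed -- real audits should verify via reverse DNS.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*Googlebot', re.I)

def crawl_waste_report(lines):
    """Tally Googlebot fetches by path, bucketing parameterised URLs
    together so crawl waste on filters and tracking stands out."""
    buckets = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m:
            continue
        path = m.group("path")
        buckets["parameterised" if "?" in path else path] += 1
    return buckets.most_common(10)
```

If “parameterised” tops the report, Googlebot is spending its budget on URL variants rather than your pages.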
Compare “discovered URLs” to “indexable URLs”
A healthy site keeps the gap between what exists and what you actually want indexed reasonably tight. If your crawler (Screaming Frog, Sitebulb, etc.) finds 50,000 URLs but you only have 2,000 pages worth indexing, you’ve got a crawl budget problem, whether it’s hurting you yet or not.
Fixes that actually move the needle
Reduce crawlable URL variants at the source
The biggest win is usually upstream: stop generating crawlable duplicates. That might mean changing how filters work so they don’t create indexable URLs, limiting which facets produce URLs, or handling sorting and tracking parameters in a way that doesn’t spawn crawl paths. On some platforms it’s configuration, on others it’s proper development work.
Get ruthless with indexation signals
Use robots.txt to prevent crawling of URL patterns you never want fetched, especially infinite spaces like internal search results or filter combinations. Use noindex for pages you still want accessible to users but don’t want indexed. Remember, robots.txt blocks crawling, which can stop Google from ever seeing a noindex tag on those pages, so be deliberate about which tool you use where.
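An illustrative robots.txt along those lines; the patterns are examples rather than a template to copy, since blocking the wrong pattern can hide pages (and any noindex tags on them) from Google entirely.

```
User-agent: *
# Block infinite spaces: internal search and filter/sort parameters
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
# Keep tracking-parameter URLs out of the crawl queue
Disallow: /*?*utm_

Sitemap: https://www.example.com/sitemap.xml
```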
Canonicals still have their place, particularly for unavoidable duplicates, but they work best as clean-up, not as your main defence.
Fix redirect hygiene
Collapse redirect chains, remove internal links pointing to redirected URLs, and keep your sitemap free of anything that isn’t a final destination. When we take over sites, it’s common to find internal links still pointing at URLs that were redirected years ago. Googlebot ends up relearning the same lesson, again and again.
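One way to audit this is to walk redirects hop by hop. The sketch below uses only the Python standard library; production use would want retries, a proper User-Agent, and politeness delays.

```python
import http.client
from urllib.parse import urlsplit, urljoin

def redirect_chain(url: str, max_hops: int = 10):
    """Follow redirects one hop at a time, recording each status code,
    so chains and mixed 301/302 hops are easy to spot."""
    chain = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=10)
        target = (parts.path or "/") + (f"?{parts.query}" if parts.query else "")
        conn.request("HEAD", target)
        resp = conn.getresponse()
        chain.append((resp.status, url))
        location = resp.getheader("Location")
        conn.close()
        if resp.status not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # resolve relative Location headers
    return chain
```

Any chain longer than two entries, or one mixing 301s and 302s, is a candidate for collapsing to a single hop.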
Make internal linking reflect business priorities
If a page is commercially important, it shouldn’t be buried three clicks deep behind thin category pages and tag archives. This is less about “link juice” and more about crawl pathing: Googlebot spends more time where your site consistently points it.
If you want a practical framework for structural checks, our technical SEO checklist for structurally sound websites pairs well with crawl budget work because it forces you to fix the boring issues that create crawl waste.
Improve server response under bot load
Make sure caching is actually doing its job, particularly on pages that don’t change often. On platforms where every request triggers heavy database work, Googlebot can accidentally turn into a load test. Better hosting, correct caching headers, a CDN, and trimming third party scripts can all lift the crawl rate limit ceiling.
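As an illustration, assuming an nginx front end, caching rules along these lines keep repeat fetches cheap; the paths and durations are examples, not recommendations.

```nginx
# Long-lived caching for static assets that rarely change
location ~* \.(css|js|png|jpg|svg|woff2)$ {
    expires 30d;
    add_header Cache-Control "public, max-age=2592000, immutable";
}

# Short, revalidatable caching for HTML so bots get fast, cheap responses
location / {
    add_header Cache-Control "public, max-age=300, stale-while-revalidate=60";
}
```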
A practical way to think about it
Googlebot is a cost centre. Every fetch, render, and evaluation consumes compute. Your job is to make the easiest URLs to discover the ones worth spending that compute on.
When crawl budget is healthy, new pages are discovered quickly, updated pages get refreshed, and low value URLs don’t multiply. When it’s unhealthy, the symptoms repeat: key pages stuck in “Discovered, currently not indexed”, stale snippets, slow rollout of changes, and a widening gap between what you publish and what Google actually processes.
Fixing crawl budget isn’t glamorous, but it’s one of the few technical SEO jobs where cleaning up the mess can deliver a measurable lift without writing a single new page.
Sources & Further Reading
- Google Search Central: Crawl budget overview
- Google Search Central: Manage crawling of your site (robots.txt and controls)
- Google Search Console Help: Crawl stats report
- Google Search Central: Sitemaps overview
- Google Search Central: Consolidate duplicate URLs (canonical)
- Moz: What Is Crawl Budget & How to Optimize It
- Google Search Central Blog: How Google Crawls the Web
- HubSpot: What Is Crawl Budget & How to Improve It
- Search Engine Journal: Crawl Budget Optimization Tips
Need help untangling crawl budget waste?
We can audit crawling and indexation issues and fix the technical causes holding your pages back.
Get in Touch