
AI Bot Blocking Landscape: Status Report May 2026

35%+ of top 1M websites block AI bots. Cloudflare, Akamai, and GoDaddy's partnership with Cloudflare are driving adoption. Impact on developers.


As of May 2026, over 35% of the top 1 million websites actively block AI bot traffic. The blocking landscape has shifted from a robots.txt honor system to aggressive technical enforcement via Cloudflare, Akamai, and custom WAF rules.

Current blocking by provider

  • Cloudflare: AI Bot blocking toggle available to all plans, default-on for Business+
  • Akamai: Bot Manager now includes AI crawler category with automatic blocking
  • Fastly: Signal Sciences integration detects and blocks LLM training crawlers
  • GoDaddy: partnered with Cloudflare for one-click AI blocking on 21M+ domains
  • Vercel: Edge Middleware templates include AI bot blocking rules
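When a scrape fails, the provider in front of the site often shows up in response headers, which helps attribute the block to one of the vendors above. A minimal sketch; the header names are commonly observed ones (`cf-ray`, `x-vercel-id` are documented, the Akamai and Fastly checks are looser assumptions) and real deployments vary:

```python
def identify_cdn(headers: dict) -> str:
    """Guess the CDN/WAF in front of a site from response headers.

    Header names here are commonly observed signals, not a complete list.
    """
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    server = h.get("server", "")
    if "cf-ray" in h or "cloudflare" in server:
        return "cloudflare"
    if "akamai" in server or "x-akamai-transformed" in h:
        return "akamai"
    if "x-fastly-request-id" in h or "fastly" in h.get("x-served-by", ""):
        return "fastly"
    if "x-vercel-id" in h:
        return "vercel"
    return "unknown"

print(identify_cdn({"Server": "cloudflare", "CF-RAY": "8abc-IAD"}))  # cloudflare
```

Logging this alongside failure counts makes it easy to see which provider rollouts are costing you the most requests.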

Who is blocking and why

The motivations vary by industry:

  • Publishers: protecting content from LLM training (NYT, Reddit, Stack Overflow)
  • E-commerce: preventing price scraping by competitors
  • SaaS: stopping feature scraping and competitive intelligence bots
  • Government: security compliance requirements
  • Small businesses: Cloudflare defaults they never turned off

Detection methods in use

Modern bot detection goes far beyond user-agent checking:

  • TLS fingerprinting (JA3/JA4 hashes) identifying headless browsers
  • JavaScript environment probing (navigator properties, WebGL, Canvas)
  • Behavioral analysis (request timing, mouse movement, scroll patterns)
  • IP reputation databases flagging known datacenter and proxy ranges
  • HTTP/2 fingerprinting detecting non-browser HTTP clients
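Before any of the heavier fingerprinting above, most stacks still apply a first layer of user-agent and IP checks. A hypothetical server-side sketch; the bot tokens are real crawler names (GPTBot, CCBot, ClaudeBot, Google-Extended) but the IP range is an illustrative placeholder, and production systems use large, frequently updated reputation feeds:

```python
import ipaddress

# Illustrative examples only; real deployments pull large, frequently
# updated lists from IP reputation feeds and verified-bot registries.
AI_BOT_TOKENS = ("gptbot", "ccbot", "claudebot", "google-extended")
DATACENTER_RANGES = [ipaddress.ip_network("143.110.0.0/16")]  # placeholder range

def classify_request(user_agent: str, ip: str) -> str:
    """First-pass triage: declared AI crawler, datacenter IP, or allow."""
    ua = user_agent.lower()
    if any(tok in ua for tok in AI_BOT_TOKENS):
        return "ai-bot"
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATACENTER_RANGES):
        return "datacenter"
    return "allow"

print(classify_request("Mozilla/5.0 (compatible; GPTBot/1.1)", "8.8.8.8"))  # ai-bot
```

Anything that passes this cheap layer then hits the TLS, JavaScript, and behavioral checks, which is why spoofing a browser user-agent alone no longer works.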

Impact on developers

If your application scrapes websites directly, expect failure rates of 30-60% across a random sample of domains. The failure rate is higher for popular sites and e-commerce. This makes direct scraping unreliable for any production workload.

Python
import requests

# Test blocking rate across 100 random domains
# (random_domains is assumed to be a pre-built list of domain strings)
blocked = 0
total = 100
for domain in random_domains[:total]:
    try:
        resp = requests.get(
            f"https://{domain}",
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=10,
        )
        # A 403 or an early challenge page counts as blocked
        if resp.status_code == 403 or "challenge" in resp.text[:500].lower():
            blocked += 1
    except requests.RequestException:
        # Connection resets and TLS failures are often blocks too
        blocked += 1

print(f"Blocked: {blocked}/{total} ({100 * blocked // total}%)")
# Typical result in May 2026: 35-45% blocked

The structured API alternative

Search APIs query search engine indexes, not target websites directly. Bot blocking on target sites is irrelevant because your application never touches them.

Python
import os, requests

# Always works regardless of target site bot blocking
resp = requests.post(
    "https://api.scavio.dev/api/v1/search",
    headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
    json={
        "query": "site:example.com pricing",
        "num_results": 10,
    },
)
# Returns indexed data even if example.com blocks all bots
results = resp.json().get("organic_results", [])
for r in results:
    print(f"{r['title']}: {r['snippet']}")

Predictions for the rest of 2026

  • Blocking rate will reach 50% of top 1M sites by December 2026
  • More CDN providers will add AI-specific blocking as a default
  • robots.txt will become less relevant as technical enforcement replaces it
  • Search APIs and MCP-based data access will become the standard integration pattern

What to do now

Audit your scraping dependencies. Identify which targets are already blocking you and which are likely to start. Migrate critical data flows to structured APIs now, before the next wave of blocking breaks your production system.
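A starting point for that audit is to classify each target's current response. A sketch; the status codes and body signals are heuristic assumptions, and the domains below are placeholders:

```python
def audit_status(status_code: int, body_head: str) -> str:
    """Heuristic per-response classification; signals are assumptions."""
    if status_code in (403, 429):
        return "blocked"
    if "challenge" in body_head.lower() or "captcha" in body_head.lower():
        return "challenged"
    return "ok"

# Placeholder targets mapped to (status_code, first bytes of body)
targets = {
    "example.com": (200, "<html>Example Domain"),
    "blocked.example": (403, "Access denied"),
}
report = {d: audit_status(code, head) for d, (code, head) in targets.items()}
print(report)  # {'example.com': 'ok', 'blocked.example': 'blocked'}
```

Run this against your real target list on a schedule; domains that flip from "ok" to "challenged" are the ones to migrate to a structured API first.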