
AI Bot Blocking Landscape: Status Report May 2026

35%+ of top 1M websites block AI bots. Cloudflare, Akamai, and GoDaddy's partnership with Cloudflare are driving adoption. Impact on developers.


As of May 2026, over 35% of the top 1 million websites actively block AI bot traffic. The blocking landscape has shifted from a robots.txt honor system to aggressive technical enforcement via Cloudflare, Akamai, and custom WAF rules.

Current blocking by provider

  • Cloudflare: AI Bot blocking toggle available to all plans, default-on for Business+
  • Akamai: Bot Manager now includes AI crawler category with automatic blocking
  • Fastly: Signal Sciences integration detects and blocks LLM training crawlers
  • GoDaddy: partnered with Cloudflare for one-click AI blocking on 21M+ domains
  • Vercel: Edge Middleware templates include AI bot blocking rules
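When a scrape fails, the provider in front of the site often shows up in response headers, which helps attribute the block to one of the vendors above. A minimal sketch; the header names are commonly observed ones (`cf-ray`, `x-vercel-id` are documented, the Akamai and Fastly checks are looser assumptions) and real deployments vary:

```python
def identify_cdn(headers: dict) -> str:
    """Guess the CDN/WAF in front of a site from response headers.

    Header names here are commonly observed signals, not a complete list.
    """
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    server = h.get("server", "")
    if "cf-ray" in h or "cloudflare" in server:
        return "cloudflare"
    if "akamai" in server or "x-akamai-transformed" in h:
        return "akamai"
    if "x-fastly-request-id" in h or "fastly" in h.get("x-served-by", ""):
        return "fastly"
    if "x-vercel-id" in h:
        return "vercel"
    return "unknown"

print(identify_cdn({"Server": "cloudflare", "CF-RAY": "8abc-IAD"}))  # cloudflare
```

Logging this alongside failure counts makes it easy to see which provider rollouts are costing you the most requests.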

Who is blocking and why

The motivations vary by industry:

  • Publishers: protecting content from LLM training (NYT, Reddit, Stack Overflow)
  • E-commerce: preventing price scraping by competitors
  • SaaS: stopping feature scraping and competitive intelligence bots
  • Government: security compliance requirements
  • Small businesses: Cloudflare defaults they never turned off

Detection methods in use

Modern bot detection goes far beyond user-agent checking:

  • TLS fingerprinting (JA3/JA4 hashes) identifying headless browsers
  • JavaScript environment probing (navigator properties, WebGL, Canvas)
  • Behavioral analysis (request timing, mouse movement, scroll patterns)
  • IP reputation databases flagging known datacenter and proxy ranges
  • HTTP/2 fingerprinting detecting non-browser HTTP clients
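Before any of the heavier fingerprinting above, most stacks still apply a first layer of user-agent and IP checks. A hypothetical server-side sketch; the bot tokens are real crawler names (GPTBot, CCBot, ClaudeBot, Google-Extended) but the IP range is an illustrative placeholder, and production systems use large, frequently updated reputation feeds:

```python
import ipaddress

# Illustrative examples only; real deployments pull large, frequently
# updated lists from IP reputation feeds and verified-bot registries.
AI_BOT_TOKENS = ("gptbot", "ccbot", "claudebot", "google-extended")
DATACENTER_RANGES = [ipaddress.ip_network("143.110.0.0/16")]  # placeholder range

def classify_request(user_agent: str, ip: str) -> str:
    """First-pass triage: declared AI crawler, datacenter IP, or allow."""
    ua = user_agent.lower()
    if any(tok in ua for tok in AI_BOT_TOKENS):
        return "ai-bot"
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATACENTER_RANGES):
        return "datacenter"
    return "allow"

print(classify_request("Mozilla/5.0 (compatible; GPTBot/1.1)", "8.8.8.8"))  # ai-bot
```

Anything that passes this cheap layer then hits the TLS, JavaScript, and behavioral checks, which is why spoofing a browser user-agent alone no longer works.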

Impact on developers

If your application scrapes websites directly, expect failure rates of 30-60% across a random sample of domains. The failure rate is higher for popular sites and e-commerce. This makes direct scraping unreliable for any production workload.

Python
import requests

# Test blocking rate across 100 random domains
# (random_domains is assumed to be a pre-built list of domain strings)
blocked = 0
total = 100
for domain in random_domains[:total]:
    try:
        resp = requests.get(
            f"https://{domain}",
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=10,
        )
        # A 403 or an early challenge page counts as blocked
        if resp.status_code == 403 or "challenge" in resp.text[:500].lower():
            blocked += 1
    except requests.RequestException:
        # Connection resets and TLS failures are often blocks too
        blocked += 1

print(f"Blocked: {blocked}/{total} ({100 * blocked // total}%)")
# Typical result in May 2026: 35-45% blocked

The structured API alternative

Search APIs query search engine indexes, not target websites directly. Bot blocking on target sites is irrelevant because your application never touches them.

Python
import os, requests

# Always works regardless of target site bot blocking
resp = requests.post(
    "https://api.scavio.dev/api/v1/search",
    headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
    json={
        "query": "site:example.com pricing",
        "num_results": 10,
    },
)
# Returns indexed data even if example.com blocks all bots
results = resp.json().get("organic_results", [])
for r in results:
    print(f"{r['title']}: {r['snippet']}")

Predictions for the rest of 2026

  • Blocking rate will reach 50% of top 1M sites by December 2026
  • More CDN providers will add AI-specific blocking as a default
  • robots.txt will become less relevant as technical enforcement replaces it
  • Search APIs and MCP-based data access will become the standard integration pattern

What to do now

Audit your scraping dependencies. Identify which targets are already blocking you and which are likely to start. Migrate critical data flows to structured APIs now, before the next wave of blocking breaks your production system.
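A starting point for that audit is to classify each target's current response. A sketch; the status codes and body signals are heuristic assumptions, and the domains below are placeholders:

```python
def audit_status(status_code: int, body_head: str) -> str:
    """Heuristic per-response classification; signals are assumptions."""
    if status_code in (403, 429):
        return "blocked"
    if "challenge" in body_head.lower() or "captcha" in body_head.lower():
        return "challenged"
    return "ok"

# Placeholder targets mapped to (status_code, first bytes of body)
targets = {
    "example.com": (200, "<html>Example Domain"),
    "blocked.example": (403, "Access denied"),
}
report = {d: audit_status(code, head) for d, (code, head) in targets.items()}
print(report)  # {'example.com': 'ok', 'blocked.example': 'blocked'}
```

Run this against your real target list on a schedule; domains that flip from "ok" to "challenged" are the ones to migrate to a structured API first.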