JS-Rendered Directory Data Without Headless Browsers
JS-rendered pages return empty HTML to scrapers. Three options: headless browser, Firecrawl, or search API. Cost and complexity comparison.
JavaScript-rendered directories return empty HTML to standard HTTP requests because their content loads client-side after the initial page shell. You have three options: run a headless browser (complex, resource-heavy), use a rendering service like Firecrawl (simpler but expensive at scale with extraction), or query a search API that already indexes the rendered content (structured, cheap, zero rendering).
Why requests + BeautifulSoup fails
```python
import requests
from bs4 import BeautifulSoup

# This returns an empty shell for JS-rendered pages
resp = requests.get("https://example-directory.com/listings")
soup = BeautifulSoup(resp.text, "html.parser")
listings = soup.select(".listing-card")
print(f"Found: {len(listings)}")  # 0 -- content not in HTML
```

The server sends a minimal HTML document with a JavaScript bundle. The browser executes the JS, fetches data from an internal API, and renders the listings. Your Python script never runs that JavaScript, so it sees nothing.
Option 1: Headless browser
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-directory.com/listings")
    page.wait_for_selector(".listing-card", timeout=10000)
    cards = page.query_selector_all(".listing-card")
    for card in cards:
        title = card.query_selector("h3").inner_text()
        print(title)
    browser.close()
```

This works but carries real costs: Playwright/Puppeteer needs Chromium installed (300MB+), each page load consumes 100-500MB RAM, anti-bot systems detect headless browsers through fingerprinting, and you need proxy rotation for any serious volume. Running this in production means managing browser instances, memory limits, and failure recovery.
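One way to contain those failure modes is to wrap each page load in a retry loop and close pages aggressively so memory is released between URLs. A sketch, not a hardened implementation; the retry count, timeouts, and selector are arbitrary placeholders:

```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def scrape_with_retry(urls, selector=".listing-card", retries=2):
    """Load each URL in a fresh page; retry on timeout; one browser per batch."""
    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for url in urls:
            for attempt in range(retries + 1):
                page = browser.new_page()
                try:
                    page.goto(url, timeout=15000)
                    page.wait_for_selector(selector, timeout=10000)
                    results[url] = [c.inner_text() for c in page.query_selector_all(selector)]
                    break  # success: stop retrying this URL
                except PlaywrightTimeout:
                    results[url] = None  # stays None if the final attempt also fails
                finally:
                    page.close()  # release per-page memory either way
        browser.close()
    return results
```

Even with this scaffolding you still own proxy rotation, fingerprint evasion, and the compute bill, which is what pushes many teams to the next two options.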
Option 2: Rendering service (Firecrawl)
```python
import requests

# Firecrawl renders the page and returns content
resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": "Bearer fc-YOUR_KEY"},
    json={
        "url": "https://example-directory.com/listings",
        "formats": ["markdown", "extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "listings": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "category": {"type": "string"}
                            }
                        }
                    }
                }
            }
        }
    }
)

# Firecrawl pricing: $0.01/page for scrape, extract costs extra
data = resp.json()
```

Firecrawl handles the rendering and returns clean markdown or structured extraction. The problem: extraction uses LLM calls internally, which adds cost and latency. At 10K pages/month, scrape alone is $100. With extraction, significantly more. For directories where you need structured fields, the cost adds up fast.
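Reading the extraction back out is a few lines. Note that the response shape below is an assumption about Firecrawl's v1 scrape format and may differ across API versions, so verify it against the current docs:

```python
# Assumed shape: {"success": bool, "data": {"markdown": ..., "extract": {...}}}
payload = resp.json()
extract = payload.get("data", {}).get("extract") or {}
for listing in extract.get("listings", []):
    print(f"{listing.get('name')} -- {listing.get('category')}")
```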
Option 3: Search API (skip rendering entirely)
If the directory content is indexed by search engines (most public directories are), you can query a search API instead of rendering the page yourself. Search engines already crawled and rendered the JavaScript. You get structured results without touching a browser.
```python
import requests, os

def search_directory(category, location, count=20):
    """Pull directory data via search API -- no rendering needed."""
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={
            "query": f"site:example-directory.com {category} {location}",
            "num_results": count
        }
    )
    results = resp.json()["results"]
    listings = []
    for r in results:
        listings.append({
            "title": r["title"],
            "url": r["url"],
            "snippet": r["description"]
        })
    return listings

# Pull restaurant listings in Chicago
restaurants = search_directory("restaurants", "Chicago")
for r in restaurants:
    print(f"{r['title']}")
    print(f"  {r['snippet'][:100]}")
# 1 credit per search query
```

Cost comparison at 10K pages/month
- Headless browser (self-hosted): $50-150 in compute (EC2/Cloud Run), plus engineering time for proxy rotation, anti-bot evasion, failure handling
- Firecrawl scrape only: ~$100/month at $0.01/page
- Firecrawl scrape + extraction: $200-400/month depending on schema complexity
- Scavio search API: ~$45/month ($30 plan covering 7K credits, plus $15 overage for the remaining 3K at $0.005/credit)
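As a sanity check, here is the arithmetic behind those last two figures. The plan price and credit allotment are taken from the list above, not from live pricing:

```python
PAGES = 10_000

firecrawl_scrape = PAGES * 0.01                  # $0.01/page        -> $100
scavio_plan, plan_credits = 30.00, 7_000         # assumed plan from the list above
scavio_overage = (PAGES - plan_credits) * 0.005  # 3,000 extra credits at $0.005 -> $15
scavio_total = scavio_plan + scavio_overage      # -> $45

print(f"Firecrawl scrape only: ${firecrawl_scrape:.0f}/mo")
print(f"Scavio search API:     ${scavio_total:.0f}/mo")
```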
When each approach makes sense
Headless browsers are necessary when you need to interact with the page: fill forms, click through pagination, handle authentication. If the directory requires login or has infinite scroll that search engines cannot index, you need a real browser.
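For the infinite-scroll case specifically, the browser has to drive the scrolling itself. A minimal Playwright sketch; the selector, scroll distance, and iteration cap are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-directory.com/listings")
    seen = 0
    for _ in range(20):  # hard cap so a broken page can't loop forever
        page.mouse.wheel(0, 4000)   # scroll down to trigger lazy loading
        page.wait_for_timeout(800)  # give the page time to fetch the next batch
        count = len(page.query_selector_all(".listing-card"))
        if count == seen:
            break  # no new cards appeared; we hit the end of the feed
        seen = count
    print(f"Loaded {seen} listings after scrolling")
    browser.close()
```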
Rendering services like Firecrawl work best for one-off deep extraction: pull the full content of specific pages where you need every field. They are the right tool for extracting detailed product specs from individual pages.
Search APIs work best for discovery and aggregation: finding what exists across a directory, monitoring new listings, pulling structured summaries. They cover 80% of directory data needs at a fraction of the cost because someone else already solved the rendering problem.
The practical pattern
```python
import requests, os

SCAVIO_KEY = os.environ["SCAVIO_API_KEY"]

def discover_then_extract(query, count=20):
    """Search API for discovery, targeted scraping for depth."""
    # Step 1: Find relevant pages via search (cheap, fast)
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": SCAVIO_KEY},
        json={"query": query, "num_results": count}
    )
    results = resp.json()["results"]
    # Step 2: Only render the pages that need deep extraction
    high_value = [r for r in results if "pricing" in r["title"].lower()]
    print(f"Found {len(results)} results, {len(high_value)} need deep extraction")
    # Save credits by only scraping the pages that matter
    return results, high_value

results, to_scrape = discover_then_extract(
    "site:saas-directory.com project management tools"
)
```

Use the search API as a filter. Discover broadly, then render only the pages where you genuinely need full content. This hybrid drops your rendering costs by 80-90% while still getting comprehensive coverage.
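Step 2 then hands the short list to whichever renderer you chose. A sketch continuing the snippet above (so `requests`, `os`, and `to_scrape` are already in scope) that wires the high-value pages into the Firecrawl endpoint from earlier; the `FIRECRAWL_API_KEY` environment variable is an assumed name, and you could swap in the Playwright loop instead if the pages need interaction:

```python
def deep_extract(pages, api_key):
    """Render only the high-value pages found by the search step."""
    docs = []
    for page in pages:
        resp = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"url": page["url"], "formats": ["markdown"]},
        )
        docs.append(resp.json())
    return docs

# Only the handful of pages that matter get the $0.01/page treatment
docs = deep_extract(to_scrape, os.environ["FIRECRAWL_API_KEY"])
```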