JS-Rendered Directory Data Without Headless Browsers
JS-rendered pages return empty HTML to scrapers. Three options: headless browser, Firecrawl, or search API. Cost and complexity comparison.
JavaScript-rendered directories return empty HTML to standard HTTP requests because their content loads client-side after the initial page shell. You have three options: run a headless browser (complex, resource-heavy), use a rendering service like Firecrawl (simpler but expensive at scale with extraction), or query a search API that already indexes the rendered content (structured, cheap, zero rendering).
Why requests + BeautifulSoup fails
```python
import requests
from bs4 import BeautifulSoup

# This returns an empty shell for JS-rendered pages
resp = requests.get("https://example-directory.com/listings")
soup = BeautifulSoup(resp.text, "html.parser")
listings = soup.select(".listing-card")
print(f"Found: {len(listings)}")  # 0 -- content not in HTML
```

The server sends a minimal HTML document with a JavaScript bundle. The browser executes the JS, fetches data from an internal API, and renders the listings. Your Python script never runs that JavaScript, so it sees nothing.
Option 1: Headless browser
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-directory.com/listings")
    page.wait_for_selector(".listing-card", timeout=10000)
    cards = page.query_selector_all(".listing-card")
    for card in cards:
        title = card.query_selector("h3").inner_text()
        print(title)
    browser.close()
```

This works but carries real costs: Playwright/Puppeteer needs Chromium installed (300MB+), each page load consumes 100-500MB RAM, anti-bot systems detect headless browsers through fingerprinting, and you need proxy rotation for any serious volume. Running this in production means managing browser instances, memory limits, and failure recovery.
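One way to contain those failure modes is to wrap each page load in a retry loop and close pages aggressively so memory is released between URLs. A sketch, not a hardened implementation; the retry count, timeouts, and selector are arbitrary placeholders:

```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def scrape_with_retry(urls, selector=".listing-card", retries=2):
    """Load each URL in a fresh page; retry on timeout; one browser per batch."""
    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for url in urls:
            for attempt in range(retries + 1):
                page = browser.new_page()
                try:
                    page.goto(url, timeout=15000)
                    page.wait_for_selector(selector, timeout=10000)
                    results[url] = [c.inner_text() for c in page.query_selector_all(selector)]
                    break  # success: stop retrying this URL
                except PlaywrightTimeout:
                    results[url] = None  # stays None if the final attempt also fails
                finally:
                    page.close()  # release per-page memory either way
        browser.close()
    return results
```

Even with this scaffolding you still own proxy rotation, fingerprint evasion, and the compute bill, which is what pushes many teams to the next two options.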
Option 2: Rendering service (Firecrawl)
```python
import requests

# Firecrawl renders the page and returns content
resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": "Bearer fc-YOUR_KEY"},
    json={
        "url": "https://example-directory.com/listings",
        "formats": ["markdown", "extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "listings": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "category": {"type": "string"}
                            }
                        }
                    }
                }
            }
        }
    }
)

# Firecrawl pricing: $0.01/page for scrape, extract costs extra
data = resp.json()
```

Firecrawl handles the rendering and returns clean markdown or structured extraction. The problem: extraction uses LLM calls internally, which adds cost and latency. At 10K pages/month, scrape alone is $100. With extraction, significantly more. For directories where you need structured fields, the cost adds up fast.
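Reading the extraction back out is a few lines. Note that the response shape below is an assumption about Firecrawl's v1 scrape format and may differ across API versions, so verify it against the current docs:

```python
# Assumed shape: {"success": bool, "data": {"markdown": ..., "extract": {...}}}
payload = resp.json()
extract = payload.get("data", {}).get("extract") or {}
for listing in extract.get("listings", []):
    print(f"{listing.get('name')} -- {listing.get('category')}")
```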
Option 3: Search API (skip rendering entirely)
If the directory content is indexed by search engines (most public directories are), you can query a search API instead of rendering the page yourself. Search engines already crawled and rendered the JavaScript. You get structured results without touching a browser.
```python
import requests, os

def search_directory(category, location, count=20):
    """Pull directory data via search API -- no rendering needed."""
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={
            "query": f"site:example-directory.com {category} {location}",
            "num_results": count
        }
    )
    results = resp.json()["results"]
    listings = []
    for r in results:
        listings.append({
            "title": r["title"],
            "url": r["url"],
            "snippet": r["description"]
        })
    return listings

# Pull restaurant listings in Chicago
restaurants = search_directory("restaurants", "Chicago")
for r in restaurants:
    print(f"{r['title']}")
    print(f"  {r['snippet'][:100]}")
# 1 credit per search query
```

Cost comparison at 10K pages/month
- Headless browser (self-hosted): $50-150 in compute (EC2/Cloud Run), plus engineering time for proxy rotation, anti-bot evasion, failure handling
- Firecrawl scrape only: ~$100/month at $0.01/page
- Firecrawl scrape + extraction: $200-400/month depending on schema complexity
- Scavio search API: ~$45/month ($30 plan covering 7K credits, plus $15 overage for the remaining 3K at $0.005/credit)
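As a sanity check, here is the arithmetic behind those last two figures. The plan price and credit allotment are taken from the list above, not from live pricing:

```python
PAGES = 10_000

firecrawl_scrape = PAGES * 0.01                  # $0.01/page        -> $100
scavio_plan, plan_credits = 30.00, 7_000         # assumed plan from the list above
scavio_overage = (PAGES - plan_credits) * 0.005  # 3,000 extra credits at $0.005 -> $15
scavio_total = scavio_plan + scavio_overage      # -> $45

print(f"Firecrawl scrape only: ${firecrawl_scrape:.0f}/mo")
print(f"Scavio search API:     ${scavio_total:.0f}/mo")
```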
When each approach makes sense
Headless browsers are necessary when you need to interact with the page: fill forms, click through pagination, handle authentication. If the directory requires login or has infinite scroll that search engines cannot index, you need a real browser.
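For the infinite-scroll case specifically, the browser has to drive the scrolling itself. A minimal Playwright sketch; the selector, scroll distance, and iteration cap are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-directory.com/listings")
    seen = 0
    for _ in range(20):  # hard cap so a broken page can't loop forever
        page.mouse.wheel(0, 4000)   # scroll down to trigger lazy loading
        page.wait_for_timeout(800)  # give the page time to fetch the next batch
        count = len(page.query_selector_all(".listing-card"))
        if count == seen:
            break  # no new cards appeared; we hit the end of the feed
        seen = count
    print(f"Loaded {seen} listings after scrolling")
    browser.close()
```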
Rendering services like Firecrawl work best for one-off deep extraction: pull the full content of specific pages where you need every field. They are the right tool for extracting detailed product specs from individual pages.
Search APIs work best for discovery and aggregation: finding what exists across a directory, monitoring new listings, pulling structured summaries. They cover 80% of directory data needs at a fraction of the cost because someone else already solved the rendering problem.
The practical pattern
```python
import requests, os

SCAVIO_KEY = os.environ["SCAVIO_API_KEY"]

def discover_then_extract(query, count=20):
    """Search API for discovery, targeted scraping for depth."""
    # Step 1: Find relevant pages via search (cheap, fast)
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": SCAVIO_KEY},
        json={"query": query, "num_results": count}
    )
    results = resp.json()["results"]
    # Step 2: Only render the pages that need deep extraction
    high_value = [r for r in results if "pricing" in r["title"].lower()]
    print(f"Found {len(results)} results, {len(high_value)} need deep extraction")
    # Save credits by only scraping the pages that matter
    return results, high_value

results, to_scrape = discover_then_extract(
    "site:saas-directory.com project management tools"
)
```

Use the search API as a filter. Discover broadly, then render only the pages where you genuinely need full content. This hybrid drops your rendering costs by 80-90% while still getting comprehensive coverage.
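Step 2 then hands the short list to whichever renderer you chose. A sketch continuing the snippet above (so `requests`, `os`, and `to_scrape` are already in scope) that wires the high-value pages into the Firecrawl endpoint from earlier; the `FIRECRAWL_API_KEY` environment variable is an assumed name, and you could swap in the Playwright loop instead if the pages need interaction:

```python
def deep_extract(pages, api_key):
    """Render only the high-value pages found by the search step."""
    docs = []
    for page in pages:
        resp = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"url": page["url"], "formats": ["markdown"]},
        )
        docs.append(resp.json())
    return docs

# Only the handful of pages that matter get the $0.01/page treatment
docs = deep_extract(to_scrape, os.environ["FIRECRAWL_API_KEY"])
```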