
Job Search Agents at 100K Scale: Architecture Guide

Building job search agents that handle 100K listings. Separate search from scoring. The LLM scores batches, not the full corpus.


Building a job search agent that handles 100K listings requires separating search from scoring. The search API fetches listings in bulk; the LLM scores and ranks them against your criteria. Asking the LLM to do both search and scoring at 100K scale blows through your token budget in hours.

Architecture for 100K Scale

Stage 1: Broad search across Google (job boards), Reddit (who's hiring threads), and YouTube (company culture videos). Stage 2: Deduplicate and normalize into a standard schema. Stage 3: LLM scores each listing against user preferences. Stage 4: Human reviews top 50. The LLM never sees all 100K listings -- it scores pre-filtered batches.
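
Glued together, the four stages fit in one function. A minimal sketch, assuming the search_jobs, dedup_listings, and score_batch functions defined in the stages below, plus an llm_generate callable for whatever model does the scoring:

Python
def run_pipeline(role, location, criteria, llm_generate, batch_size=20):
    """Stages 1-4: search, dedup, batch-score, surface the top 50."""
    raw = search_jobs(role, location)            # Stage 1: broad search
    unique = dedup_listings(raw)                 # Stage 2: dedup + normalize
    scored = []
    for i in range(0, len(unique), batch_size):  # Stage 3: LLM scoring
        scored.extend(score_batch(unique[i:i + batch_size], criteria, llm_generate))
    scored.sort(key=lambda pair: pair[1] or 0, reverse=True)
    return scored[:50]                           # Stage 4: human review set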

Stage 1: Multi-Platform Search

Python
import requests, os
from concurrent.futures import ThreadPoolExecutor

H = {"x-api-key": os.environ["SCAVIO_API_KEY"]}

def search_jobs(role, location, platforms=None):
    """Search multiple platforms for job listings."""
    if platforms is None:
        platforms = ["google", "reddit"]
    queries = {
        "google": f"{role} jobs {location} 2026 hiring",
        "reddit": f"{role} hiring {location}",
    }
    results = {}
    def fetch(platform):
        q = queries.get(platform, f"{role} jobs {location}")
        r = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers=H,
            json={"platform": platform, "query": q},
            timeout=10,
        )
        r.raise_for_status()  # surface HTTP errors instead of parsing an error body
        return platform, r.json().get("organic", [])

    with ThreadPoolExecutor(max_workers=5) as pool:
        for platform, data in pool.map(fetch, platforms):
            results[platform] = data
    return results

jobs = search_jobs("senior python developer", "remote")
total = sum(len(v) for v in jobs.values())
print(f"Found {total} listings across {len(jobs)} platforms")

Stage 2: Deduplication

Job listings appear on multiple boards. Deduplicate by normalizing URLs (strip tracking parameters) and comparing stable fields; the code below keys on domain + title, and adding company name tightens the match further. A simple hash-based dedup catches 60-70% of duplicates without fuzzy matching.

Python
from urllib.parse import urlparse
import hashlib

def dedup_listings(all_results):
    """Deduplicate job listings across platforms."""
    seen = set()
    unique = []
    for platform, listings in all_results.items():
        for item in listings:
            # Key on domain + lowercased title; keeping only netloc
            # drops paths, query strings, and tracking parameters
            url = urlparse(item.get("link", ""))
            key = hashlib.md5(
                f"{url.netloc}{item.get('title', '')}".lower().encode()
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append({
                    "title": item.get("title", ""),
                    "url": item.get("link", ""),
                    "snippet": item.get("snippet", ""),
                    "platform": platform,
                })
    return unique

unique_jobs = dedup_listings(jobs)
print(f"After dedup: {len(unique_jobs)} unique listings")

Stage 3: LLM Scoring

Send batches of 20 listings to the LLM with your scoring criteria. The LLM returns a 1-10 score for each. Process 100K listings in 5000 batches. At $0.01/batch with a fast model, that is $50 total for scoring.

Python
import re

def score_batch(listings, criteria, llm_generate=None):
    """Score a batch of listings against user criteria.
    Returns a list of (listing, score) tuples."""
    prompt = f"""Score each job listing 1-10 against these criteria:
{criteria}

Listings:
"""
    for i, job in enumerate(listings):
        prompt += f"{i+1}. {job['title']} - {job['snippet'][:100]}\n"
    prompt += "\nReturn scores as: 1:score, 2:score, ..."

    if llm_generate is None:
        return [(job, None) for job in listings]  # no model wired up yet
    # llm_generate is any prompt -> text callable for your preferred LLM
    response = llm_generate(prompt)
    # Parse "1:7, 2:4, ..." pairs out of the response
    scores = dict(re.findall(r"(\d+):(\d+)", response))
    return [(job, int(scores.get(str(i + 1), 0)))
            for i, job in enumerate(listings)]

criteria = """
- Remote-first company
- Python/ML focus
- Series B+ funding
- Salary range 150K-250K
"""
# scored = score_batch(unique_jobs[:20], criteria)
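
A usage sketch for the parsing path: wire score_batch to a stub model call (fake_llm here is a stand-in defined for illustration; swap in your real client):

Python
def fake_llm(prompt):
    """Stub model call: scores every listing 5, in the expected format."""
    return ", ".join(f"{n}:5" for n in range(1, 21))

scored = score_batch(unique_jobs[:20], criteria, llm_generate=fake_llm)
print(scored[0])  # (listing dict, 5)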

Scaling to 100K

At 100K listings, a single search pass will not get you there. You need to iterate across role variations, locations, and time windows. Run searches daily, accumulate listings in a database, and score new additions incrementally. The search API cost at 100K queries: roughly $500/month at $0.005/credit.
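
One way to run the incremental accumulation, sketched with SQLite (the jobs.db file, table name, and columns are assumptions, not part of any API): insert each day's deduped listings under the same hash key as Stage 2, then score only rows that have no score yet.

Python
import hashlib, sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect("jobs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS listings (
    key TEXT PRIMARY KEY,
    title TEXT, url TEXT, snippet TEXT, platform TEXT,
    score INTEGER  -- NULL until Stage 3 scores it
)""")

def accumulate(unique_jobs):
    """Insert a day's deduped listings; PRIMARY KEY drops repeats across days."""
    for job in unique_jobs:
        key = hashlib.md5(
            f"{urlparse(job['url']).netloc}{job['title']}".lower().encode()
        ).hexdigest()
        conn.execute(
            "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?, NULL)",
            (key, job["title"], job["url"], job["snippet"], job["platform"]),
        )
    conn.commit()

# Score only rows that arrived since the last scoring pass
new_rows = conn.execute(
    "SELECT title, snippet FROM listings WHERE score IS NULL"
).fetchall()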

Reddit as Signal Source

Reddit "Who's Hiring" threads surface roles that never appear on job boards. Companies post directly, salary ranges are often included, and comment threads reveal team culture. Search Reddit monthly hiring threads and extract listings alongside your Google job board results.