
AI Job Search Agent with Live Listings

Build a durable AI job search agent using Google SERP with site operators. Why direct scrapers break and the indirect pattern that lasts.

5 min read

An r/ItaliaCareerAdvice thread showed a Python script that scrapes job listings and filters them with an LLM. It works, but it is fragile: sites break, patterns change, and the LLM burns its context window on noisy HTML. This post is the durable version of that pattern, using Scavio for the listing discovery layer.

Why DIY Job Scrapers Break

Indeed, LinkedIn, and Glassdoor all fight scrapers aggressively in 2026. A script that works today breaks in two weeks. The usual failure modes: IP blocks, CAPTCHA walls, DOM structure changes, and rate limits. Maintenance eats more time than the agent saves.

The Indirect Pattern

Go through Google SERP with site operators instead of scraping the boards directly. Indeed posts are indexed on Google. LinkedIn job pages are indexed. Glassdoor listings are indexed. A SERP query returns the public part of each listing with title, company, location, and snippet, which is almost all the filtering signal you need.

Python
import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']
SEARCH_URL = 'https://api.scavio.dev/api/v1/search'

def search_jobs(role: str, location: str) -> list[dict]:
    # One site-scoped query per board; Google has already indexed the listings.
    sources = [
        f'site:indeed.com {role} {location}',
        f'site:linkedin.com/jobs {role} {location}',
        f'site:glassdoor.com {role} {location}',
    ]
    results = []
    for query in sources:
        r = requests.post(SEARCH_URL,
            headers={'x-api-key': API_KEY},
            json={'query': query, 'num_results': 20},
            timeout=30)
        r.raise_for_status()
        for x in r.json().get('organic_results', []):
            results.append({
                # 'site:indeed.com' -> 'indeed.com'
                'source': query.split()[0].removeprefix('site:'),
                'title': x.get('title', ''),
                'url': x.get('link', ''),
                'snippet': x.get('snippet', ''),
            })
    return results
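The same role often surfaces on two or three boards at once, and every duplicate costs an extra LLM call downstream. A dedup pass on the normalized title is cheap insurance (a sketch; the helper name is mine, not part of the original script):

```python
def dedupe_jobs(listings: list[dict]) -> list[dict]:
    """Keep only the first listing for each normalized title."""
    seen: set[str] = set()
    unique = []
    for job in listings:
        # Lowercase and collapse whitespace so 'Data  Engineer' == 'data engineer'.
        key = ' '.join(job['title'].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```

A title-only key will occasionally merge two distinct openings with identical titles; keying on title plus source is the stricter variant.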

The LLM Filter Step

Once listings are collected, an LLM classifies each against the user's preferences. Unlike the noisy HTML approach, the LLM now has clean structured input and can focus on matching.

Python
import anthropic

client = anthropic.Anthropic()

def filter_jobs(listings: list[dict], preferences: str) -> list[dict]:
    relevant = []
    for job in listings:
        # Clean structured input: title + snippet, no HTML noise.
        prompt = f'''Preferences: {preferences}

Job: {job['title']}
Snippet: {job['snippet']}

Does this job match the preferences? Respond YES or NO followed by
a one-sentence reason.'''

        msg = client.messages.create(
            model='claude-haiku-4-5-20251001',
            max_tokens=100,
            messages=[{'role': 'user', 'content': prompt}])

        answer = msg.content[0].text.strip()
        if answer.startswith('YES'):
            # Keep the model's reason, minus the leading verdict.
            job['why'] = answer.removeprefix('YES').lstrip(' :,.\n')
            relevant.append(job)
    return relevant
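The startswith('YES') check is strict about casing, and models sometimes answer "Yes," or "yes." A slightly more forgiving parser costs nothing (a sketch; parse_verdict is my name, not part of the post's pipeline):

```python
def parse_verdict(answer: str) -> tuple[bool, str]:
    """Split a 'YES/NO + reason' reply into (matched, reason)."""
    parts = answer.strip().split(None, 1)  # split on the first whitespace run
    if not parts:
        return False, ''
    verdict = parts[0].strip(':,.').upper()
    reason = parts[1].strip() if len(parts) > 1 else ''
    return verdict == 'YES', reason
```

Swapping this in for the startswith check keeps borderline-formatted YES replies from being silently dropped.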

The Company Research Step

For each matched job, the agent runs a second pass: Reddit mentions, recent news, Glassdoor reviews via SERP. This is the signal the candidate actually wants. "Is this company a dumpster fire?" is the question a cold job listing cannot answer.

Python
def research_company(domain: str) -> dict:
    # Reuses API_KEY and requests from the search step above.
    def serp(payload: dict) -> dict:
        r = requests.post('https://api.scavio.dev/api/v1/search',
            headers={'x-api-key': API_KEY},
            json=payload, timeout=30)
        r.raise_for_status()
        return r.json()

    reddit = serp({'query': domain, 'platform': 'reddit'})
    news = serp({'query': f'{domain} layoffs OR funding OR CEO',
                 'time_range': 'month'})
    reviews = serp({'query': f'site:glassdoor.com {domain} reviews'})

    return {
        'reddit_threads': reddit.get('posts', [])[:5],
        'recent_news': news.get('organic_results', [])[:5],
        'review_snippets': reviews.get('organic_results', [])[:3],
    }

The Daily Loop

Schedule the full flow to run daily at 7 AM. New listings land in the candidate's inbox with the company research attached. Two weeks of running this turns a passive job search into an informed pipeline.
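The loop itself can stay dependency-free: one script that chains the three steps, fired by cron (or any scheduler) at 7 AM. A minimal sketch, with the steps and the delivery function injected as callables so the wiring is testable; the domain extraction via urlparse is my assumption about how a listing URL maps to a company:

```python
from urllib.parse import urlparse

def run_daily(search, classify, research, deliver,
              role: str, location: str, preferences: str) -> int:
    """Chain search -> classify -> research -> deliver; return the match count."""
    listings = search(role, location)
    matches = classify(listings, preferences)
    for job in matches:
        # Derive a domain from the listing URL for the research pass.
        job['research'] = research(urlparse(job['url']).netloc)
    deliver(matches)
    return len(matches)
```

Invoked from cron as `0 7 * * * python agent.py`, with the real search_jobs, filter_jobs, research_company, and a mailer plugged in.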

Why This Beats Hand-Crafted Scrapers

Three wins. One, resilience: Google SERP does not break every week. Two, coverage: Scavio's multi-source SERP + Reddit + news in one API beats a stack of separate scrapers. Three, maintenance: a single schedule runs the entire pipeline, no per-site fixes required.

Where the Pattern Fails

Two places. One, jobs that never get indexed on Google, typically small-company roles on ATS platforms. Weigh that gap against the candidate's targets: for FAANG-style companies, Google coverage is fine; for niche early-stage roles, a separate Wellfound/Hacker News scraper helps. Two, same-day postings: Google indexes with a lag, and new listings sometimes take 24 hours to appear.
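For the non-indexed gap, Hacker News "Who is hiring" threads are one cheap supplement: the Algolia-powered HN Search API at hn.algolia.com is public and keyless. A sketch under that assumption; the helper names are mine:

```python
import requests

HN_SEARCH = 'https://hn.algolia.com/api/v1/search'

def parse_hn_hits(payload: dict) -> list[dict]:
    """Reduce an Algolia HN response to title/id pairs."""
    return [{'title': h.get('title', ''), 'id': h.get('objectID', '')}
            for h in payload.get('hits', [])]

def hn_hiring_threads(month: str) -> list[dict]:
    """Find 'Ask HN: Who is hiring?' threads, e.g. month='February 2026'."""
    r = requests.get(HN_SEARCH, params={
        'query': f'Ask HN: Who is hiring? ({month})',
        'tags': 'story',
    }, timeout=30)
    r.raise_for_status()
    return parse_hn_hits(r.json())
```

The thread's top-level comments are the actual listings; fetching them is a second call against the same API, keyed by the story id.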

Operational Cost

A daily run of 30 role/location queries across 3 sources = 90 SERP queries, plus 20 company enrichments = roughly 200 credits per day. At $30/mo for 7,000 credits, that is well within the plan with room for weekend deep-dives. Haiku classification runs a few hundred tokens per job, which is negligible at this volume. Total pipeline cost: under $35/mo all-in.
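Spelling the arithmetic out, assuming one credit per search and three lookups per company enrichment (the post's ~200/day figure leaves headroom for retries on top of this base):

```python
SERP_QUERIES = 30 * 3             # 30 role/location queries x 3 job boards
ENRICHMENTS = 20 * 3              # 20 companies x 3 lookups each
DAILY = SERP_QUERIES + ENRICHMENTS  # 150 credits/day before retries
MONTHLY = DAILY * 30                # 4,500 credits/month, under the 7,000 plan
```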

The use case page is at ai-career-agent-data-api and the solution architecture at ai-job-search-agent.