
The HiringCafe Pattern for Job Search Agents

An r/hiringcafe thread surfaced the pattern. Career-page extraction is the easy part. Ranking is the actual product.


An r/hiringcafe thread on the AI Job Search Agent (87 upvotes, 26 comments) surfaced the pattern that defines this product category in 2026: pull from real employer career pages, AI-summarize each role, surface salary upfront. The pattern is portable; the hard part is the ranking.

Why HiringCafe-style aggregators win

Two structural advantages over Indeed/LinkedIn:

  • Real employer postings only. No sponsored, no recruiter spam, no duplicates. The signal-to-noise ratio is the product.
  • Salary surfaced upfront. Most listings hide compensation behind a click; HiringCafe extracts and shows it on the card.

Both depend on extracting clean structured data from career-page HTML at scale.

The data layer: discovery + extract

Three sources for career pages:

  • Direct employer pages (company.com/careers)
  • Greenhouse boards (boards.greenhouse.io/company)
  • Lever boards (jobs.lever.co/company)

Discovery is dorked search; extract is markdown conversion. Both fit Scavio cleanly:

Python
import requests, os
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

# Direct career pages key on the full domain; ATS boards key on the
# company slug (e.g. "acme" from "acme.com").
DIRECT_DORKS = [
    'site:{domain}/careers',
    'site:{domain}/jobs',
]
ATS_DORKS = [
    'site:jobs.lever.co/{slug}',
    'site:boards.greenhouse.io/{slug}',
]

def find_career_urls(domain):
    slug = domain.split('.')[0]
    queries = [t.format(domain=domain) for t in DIRECT_DORKS]
    queries += [t.format(slug=slug) for t in ATS_DORKS]
    urls = []
    for q in queries:
        r = requests.post('https://api.scavio.dev/api/v1/search',
            headers=H, json={'query': q}).json()
        urls.extend(o['link'] for o in r.get('organic_results', [])[:5])
    return list(set(urls))

def extract(url):
    return requests.post('https://api.scavio.dev/api/v1/extract',
        headers=H, json={'url': url, 'format': 'markdown'}).json().get('markdown', '')
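One wrinkle worth handling at discovery time: `list(set(urls))` only collapses byte-identical strings, so `http://` vs `https://`, trailing slashes, and tracking params all survive as duplicates. A minimal canonicalizer, stdlib only (the normalization policy here is one illustrative choice, not part of the Scavio API):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    # Lowercase the host, force https, drop query string, fragment,
    # and trailing slash so near-duplicate URLs collapse to one key.
    parts = urlsplit(url)
    path = parts.path.rstrip('/')
    return urlunsplit(('https', parts.netloc.lower(), path, '', ''))
```

Then dedupe with `list({canonical(u) for u in urls})` instead of `list(set(urls))`.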

The LLM extraction step

Career pages have wildly varying structure. The LLM normalizes them into a consistent JSON shape:

Python
PROMPT = '''Extract job postings from this careers page.
For each, return JSON with:
- title, team, location, remote (bool), salary_min, salary_max (null if not shown), apply_url, summary (2 sentences).
Return a JSON list.

Page:
{md}'''

import json
def parse(md):
    # `llm.complete` stands in for whatever LLM client you use.
    raw = llm.complete(PROMPT.format(md=md))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []
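In practice models frequently wrap the answer in a markdown code fence or add a sentence of prose, so `json.loads` on the raw reply fails. A defensive variant of the parse step (the fence-stripping heuristic is an assumption about model behavior, not part of the prompt contract):

```python
import json
import re

def extract_json_list(raw):
    # Strip a markdown code fence (```json ... ```) if the model added one.
    fenced = re.search(r'```(?:json)?\s*(.*?)```', raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost [...] span to skim off surrounding prose.
    start, end = candidate.find('['), candidate.rfind(']')
    if start == -1 or end == -1:
        return []
    try:
        result = json.loads(candidate[start:end + 1])
        return result if isinstance(result, list) else []
    except json.JSONDecodeError:
        return []
```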

Dedupe across aggregators

The same role appears on the company's direct page, Greenhouse, and possibly LinkedIn. Dedupe on (employer, title, location) collapses these.

Python
def dedupe(roles):
    seen = set()
    out = []
    for r in roles:
        key = (r.get('employer'), r.get('title'), r.get('location'))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
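The exact-tuple key misses trivial variants ("Remote" vs "remote", stray whitespace), so it pays to normalize each field before keying. A sketch, using the same field names as the extraction schema above:

```python
def norm(s):
    # Fold casing and collapse whitespace so 'ML  Engineer ' and
    # 'ml engineer' produce the same dedupe key.
    return ' '.join((s or '').lower().split())

def dedupe_key(role):
    return (norm(role.get('employer')),
            norm(role.get('title')),
            norm(role.get('location')))
```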

Where the product effort actually goes

The data layer ships in a weekend. The product differentiation is downstream:

  • Ranking quality. Top-5 jobs that the user actually wants to apply to. This is taste, ML, and product iteration.
  • Salary normalization. Cross-currency, full-time-equivalent, base + bonus + equity disambiguation.
  • Notification UX. Push/email/Slack with exactly the right cadence to keep users engaged without spamming.
  • Filtering precision. "Remote ok", "mid-level", "ML focus" need to actually filter, not be approximations.
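Of the four, filtering precision falls most directly out of the structured extraction: once remote, salary_max, and title are real fields, filters become exact predicates rather than keyword guesses. A sketch (field names follow the prompt's schema; policy choices, like treating a missing salary as a non-match, are assumptions):

```python
def matches(role, *, remote_only=False, min_salary=None, keywords=()):
    # Hard filter over the structured fields from the extraction step.
    if remote_only and not role.get('remote'):
        return False
    if min_salary is not None:
        # Compare against the top of the posted band; a missing salary
        # is treated as a non-match rather than a maybe.
        if role.get('salary_max') is None or role['salary_max'] < min_salary:
            return False
    text = f"{role.get('title', '')} {role.get('summary', '')}".lower()
    return all(k.lower() in text for k in keywords)
```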

Cost math at MVP scale

500 target employers, daily refresh:

  • Discovery: 4 dorks * 500 employers = 2,000 search calls/day = 60K/mo
  • Extract: ~20% new pages/day = 100 extracts/day = 3K/mo
  • LLM parse: 100 LLM calls/day for parsing = 3K/mo
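The arithmetic is easy to reproduce; assuming a 30-day month and the stated rates:

```python
EMPLOYERS = 500
DORKS_PER_EMPLOYER = 4
NEW_PAGE_RATE = 0.20   # share of career pages that change per day
DAYS = 30

search_calls = DORKS_PER_EMPLOYER * EMPLOYERS * DAYS   # discovery calls/mo
extracts = int(NEW_PAGE_RATE * EMPLOYERS) * DAYS       # extract calls/mo
llm_calls = extracts                                   # one parse per new page
```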

Note that daily discovery at 60K calls/mo overshoots the Scavio Bootstrap tier ($100/mo for 28K credits) by 2x. It doesn't need to run daily: career-page URLs are stable, so cache them and re-run discovery weekly (~9K calls/mo), which leaves headroom for the ~3K monthly extracts. LLM tokens for parsing run roughly $10-30/mo depending on average page size.

Legal reality check

Scraping LinkedIn directly is legally hot — multiple cases have gone against scrapers in 2024-2025. A HiringCafe-style aggregator should stay on public career pages and ATS-public boards (Greenhouse, Lever). These are publicly indexed by Google and explicitly meant to be discoverable.

The honest take

The data layer is the easy part. The HiringCafe edge — the reason the r/hiringcafe thread is positive — is the relevance ranking. Build the data layer fast with Scavio + LLM; spend the saved time on the ranking, which is the actual product.