
The HiringCafe Pattern for Job Search Agents

An r/hiringcafe thread surfaced the pattern. Career-page extraction is the easy part. Ranking is the actual product.


An r/hiringcafe thread on the AI Job Search Agent (87 upvotes, 26 comments) surfaced the pattern that defines this product category in 2026: pull from real employer career pages, AI-summarize each role, surface salary upfront. The pattern is portable; the hard part is the ranking.

Why HiringCafe-style aggregators win

Two structural advantages over Indeed/LinkedIn:

  • Real employer postings only. No sponsored, no recruiter spam, no duplicates. The signal-to-noise ratio is the product.
  • Salary surfaced upfront. Most listings hide compensation behind a click; HiringCafe extracts and shows it on the card.

Both depend on extracting clean structured data from career-page HTML at scale.

The data layer: discovery + extract

Three sources for career pages:

  • Direct employer pages (company.com/careers)
  • Greenhouse boards (boards.greenhouse.io/company)
  • Lever boards (jobs.lever.co/company)

Discovery is dorked search; extract is markdown conversion. Both fit Scavio cleanly:

Python
import requests, os
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

# Direct career pages key on the full domain; ATS boards key on the
# company slug (e.g. "acme" from "acme.com").
DIRECT_DORKS = [
    'site:{domain}/careers',
    'site:{domain}/jobs',
]
ATS_DORKS = [
    'site:jobs.lever.co/{slug}',
    'site:boards.greenhouse.io/{slug}',
]

def find_career_urls(domain):
    slug = domain.split('.')[0]
    queries = [t.format(domain=domain) for t in DIRECT_DORKS]
    queries += [t.format(slug=slug) for t in ATS_DORKS]
    urls = []
    for q in queries:
        r = requests.post('https://api.scavio.dev/api/v1/search',
            headers=H, json={'query': q}).json()
        urls.extend(o['link'] for o in r.get('organic_results', [])[:5])
    return list(set(urls))

def extract(url):
    return requests.post('https://api.scavio.dev/api/v1/extract',
        headers=H, json={'url': url, 'format': 'markdown'}).json().get('markdown', '')
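One wrinkle worth handling at discovery time: `list(set(urls))` only collapses byte-identical strings, so `http://` vs `https://`, trailing slashes, and tracking params all survive as duplicates. A minimal canonicalizer, stdlib only (the normalization policy here is one illustrative choice, not part of the Scavio API):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    # Lowercase the host, force https, drop query string, fragment,
    # and trailing slash so near-duplicate URLs collapse to one key.
    parts = urlsplit(url)
    path = parts.path.rstrip('/')
    return urlunsplit(('https', parts.netloc.lower(), path, '', ''))
```

Then dedupe with `list({canonical(u) for u in urls})` instead of `list(set(urls))`.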

The LLM extraction step

Career pages have wildly varying structure. The LLM normalizes them into a consistent JSON shape:

Python
PROMPT = '''Extract job postings from this careers page.
For each, return JSON with:
- title, team, location, remote (bool), salary_min, salary_max (null if not shown), apply_url, summary (2 sentences).
Return a JSON list.

Page:
{md}'''

import json
def parse(md):
    # `llm.complete` stands in for whatever LLM client you use.
    raw = llm.complete(PROMPT.format(md=md))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []
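In practice models frequently wrap the answer in a markdown code fence or add a sentence of prose, so `json.loads` on the raw reply fails. A defensive variant of the parse step (the fence-stripping heuristic is an assumption about model behavior, not part of the prompt contract):

```python
import json
import re

def extract_json_list(raw):
    # Strip a markdown code fence (```json ... ```) if the model added one.
    fenced = re.search(r'```(?:json)?\s*(.*?)```', raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost [...] span to skim off surrounding prose.
    start, end = candidate.find('['), candidate.rfind(']')
    if start == -1 or end == -1:
        return []
    try:
        result = json.loads(candidate[start:end + 1])
        return result if isinstance(result, list) else []
    except json.JSONDecodeError:
        return []
```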

Dedupe across aggregators

The same role appears on the company's direct page, Greenhouse, and possibly LinkedIn. Dedupe on (employer, title, location) collapses these.

Python
def dedupe(roles):
    seen = set()
    out = []
    for r in roles:
        key = (r.get('employer'), r.get('title'), r.get('location'))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
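The exact-tuple key misses trivial variants ("Remote" vs "remote", stray whitespace), so it pays to normalize each field before keying. A sketch, using the same field names as the extraction schema above:

```python
def norm(s):
    # Fold casing and collapse whitespace so 'ML  Engineer ' and
    # 'ml engineer' produce the same dedupe key.
    return ' '.join((s or '').lower().split())

def dedupe_key(role):
    return (norm(role.get('employer')),
            norm(role.get('title')),
            norm(role.get('location')))
```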

Where the product effort actually goes

The data layer ships in a weekend. The product differentiation is downstream:

  • Ranking quality. Top-5 jobs that the user actually wants to apply to. This is taste, ML, and product iteration.
  • Salary normalization. Cross-currency, full-time-equivalent, base + bonus + equity disambiguation.
  • Notification UX. Push/email/Slack with exactly the right cadence to keep users engaged without spamming.
  • Filtering precision. "Remote ok", "mid-level", "ML focus" need to actually filter, not be approximations.
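Of the four, filtering precision falls most directly out of the structured extraction: once remote, salary_max, and title are real fields, filters become exact predicates rather than keyword guesses. A sketch (field names follow the prompt's schema; policy choices, like treating a missing salary as a non-match, are assumptions):

```python
def matches(role, *, remote_only=False, min_salary=None, keywords=()):
    # Hard filter over the structured fields from the extraction step.
    if remote_only and not role.get('remote'):
        return False
    if min_salary is not None:
        # Compare against the top of the posted band; a missing salary
        # is treated as a non-match rather than a maybe.
        if role.get('salary_max') is None or role['salary_max'] < min_salary:
            return False
    text = f"{role.get('title', '')} {role.get('summary', '')}".lower()
    return all(k.lower() in text for k in keywords)
```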

Cost math at MVP scale

500 target employers, daily refresh:

  • Discovery: 4 dorks * 500 employers = 2,000 search calls/day = 60K/mo
  • Extract: ~20% new pages/day = 100 extracts/day = 3K/mo
  • LLM parse: 100 LLM calls/day for parsing = 3K/mo
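The arithmetic is easy to reproduce; assuming a 30-day month and the stated rates:

```python
EMPLOYERS = 500
DORKS_PER_EMPLOYER = 4
NEW_PAGE_RATE = 0.20   # share of career pages that change per day
DAYS = 30

search_calls = DORKS_PER_EMPLOYER * EMPLOYERS * DAYS   # discovery calls/mo
extracts = int(NEW_PAGE_RATE * EMPLOYERS) * DAYS       # extract calls/mo
llm_calls = extracts                                   # one parse per new page
```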

Note that daily discovery at 60K calls/mo overshoots the Scavio Bootstrap tier ($100/mo for 28K credits) by 2x. It doesn't need to run daily: career-page URLs are stable, so cache them and re-run discovery weekly (~9K calls/mo), which leaves headroom for the ~3K monthly extracts. LLM tokens for parsing run roughly $10-30/mo depending on average page size.

Legal reality check

Scraping LinkedIn directly is legally hot — multiple cases have gone against scrapers in 2024-2025. A HiringCafe-style aggregator should stay on public career pages and ATS-public boards (Greenhouse, Lever). These are publicly indexed by Google and explicitly meant to be discoverable.

The honest take

The data layer is the easy part. The HiringCafe edge — the reason the r/hiringcafe thread is positive — is the relevance ranking. Build the data layer fast with Scavio + LLM; spend the saved time on the ranking, which is the actual product.