The HiringCafe Pattern for Job Search Agents
An r/hiringcafe thread surfaced the pattern. Career-page extraction is the easy part. Ranking is the actual product.
An r/hiringcafe thread on the AI Job Search Agent (87 upvotes, 26 comments) surfaced the pattern that defines this product category in 2026: pull from real employer career pages, AI-summarize each role, surface salary upfront. The pattern is portable; the hard part is the ranking.
Why HiringCafe-style aggregators win
Two structural advantages over Indeed/LinkedIn:
- Real employer postings only. No sponsored, no recruiter spam, no duplicates. The signal-to-noise ratio is the product.
- Salary surfaced upfront. Most listings hide compensation behind a click; HiringCafe extracts and shows it on the card.
Both depend on extracting clean structured data from career-page HTML at scale.
The data layer: discovery + extract
Three sources for career pages:
- Direct employer pages (`company.com/careers`)
- Greenhouse boards (`boards.greenhouse.io/company`)
- Lever boards (`jobs.lever.co/company`)
Discovery is dorked search; extract is markdown conversion. Both fit Scavio cleanly:
```python
import os
import requests

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

# Direct career pages use the full domain; ATS boards use the company slug.
DORKS = [
    'site:{domain}/careers',
    'site:{domain}/jobs',
    'site:jobs.lever.co/{slug}',
    'site:boards.greenhouse.io/{slug}',
]

def find_career_urls(domain):
    slug = domain.split('.')[0]  # 'acme.com' -> 'acme'
    urls = []
    for tpl in DORKS:
        q = tpl.format(domain=domain, slug=slug)
        r = requests.post('https://api.scavio.dev/api/v1/search',
                          headers=H, json={'query': q}).json()
        urls.extend(o['link'] for o in r.get('organic_results', [])[:5])
    return list(set(urls))

def extract(url):
    r = requests.post('https://api.scavio.dev/api/v1/extract',
                      headers=H, json={'url': url, 'format': 'markdown'})
    return r.json().get('markdown', '')
```

The LLM extraction step
Career pages have wildly varying structure. The LLM normalizes them into a consistent JSON shape:
```python
PROMPT = '''Extract job postings from this careers page.
For each, return JSON with:
- title, team, location, remote (bool), salary_min, salary_max (null if not shown), apply_url, summary (2 sentences).
Return a JSON list.

Page:
{md}'''
```
```python
import json

def parse(md):
    # `llm` stands in for whatever completion client you use.
    raw = llm.complete(PROMPT.format(md=md))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []
```

Dedupe across aggregators
The same role appears on the company's direct page, Greenhouse, and possibly LinkedIn. Dedupe on (employer, title, location) collapses these.
```python
def dedupe(roles):
    seen = set()
    out = []
    for r in roles:
        # Normalize so 'Acme' on Greenhouse matches 'acme' on the direct page
        key = tuple(str(r.get(k, '')).strip().lower()
                    for k in ('employer', 'title', 'location'))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```

Where the product effort actually goes
The data layer ships in a weekend. The product differentiation is downstream:
- Ranking quality. Top-5 jobs that the user actually wants to apply to. This is taste, ML, and product iteration.
- Salary normalization. Cross-currency, full-time-equivalent, base + bonus + equity disambiguation.
- Notification UX. Push/email/Slack with exactly the right cadence to keep users engaged without spamming.
- Filtering precision. "Remote ok", "mid-level", "ML focus" need to actually filter, not be approximations.
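Salary normalization, for instance, mostly reduces to converting every posting to annual USD base pay before comparing. A minimal sketch, where the FX rates and hours-per-year constant are illustrative assumptions (not live data) and `normalize_salary` is a hypothetical helper:

```python
FX_TO_USD = {'USD': 1.0, 'EUR': 1.08, 'GBP': 1.27}  # assumed spot rates
HOURS_PER_YEAR = 2080  # 40 h/week * 52 weeks

def normalize_salary(amount, currency='USD', period='year'):
    """Convert a base-salary figure to annual USD for comparison."""
    if amount is None:
        return None  # salary not shown; keep it null, don't guess
    annual = amount * HOURS_PER_YEAR if period == 'hour' else amount
    return round(annual * FX_TO_USD.get(currency, 1.0))
```

In production you'd pull live FX rates and handle bonus/equity separately; this only covers the base-pay comparison.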
Cost math at MVP scale
500 target employers, daily refresh:
- Discovery: 4 dorks * 500 employers = 2,000 search calls/day = 60K/mo
- Extract: ~20% new pages/day = 100 extracts/day = 3K/mo
- LLM parse: 100 LLM calls/day for parsing = 3K/mo
Scavio's Bootstrap tier ($100/mo for 28K credits) covers discovery in steady state: once an employer's career URLs are known you don't re-run all four dorks daily, so real search volume falls well below the 60K/mo worst case. Pair it with batched extraction. LLM tokens for parsing run roughly $10-30/mo depending on average page size.
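The arithmetic above is worth encoding so it updates as your employer list grows; a back-of-envelope sketch of the same numbers:

```python
# Back-of-envelope version of the cost math above.
EMPLOYERS = 500
DORKS_PER_EMPLOYER = 4
NEW_PAGE_RATE = 0.20  # ~20% of pages change per day
DAYS = 30

search_calls_day = DORKS_PER_EMPLOYER * EMPLOYERS  # 2,000
search_calls_mo = search_calls_day * DAYS          # 60,000
extracts_day = int(EMPLOYERS * NEW_PAGE_RATE)      # 100
extracts_mo = extracts_day * DAYS                  # 3,000
llm_calls_mo = extracts_mo                         # one parse per extract

print(search_calls_mo, extracts_mo, llm_calls_mo)  # 60000 3000 3000
```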
Legal reality check
Scraping LinkedIn directly is legally hot — multiple cases have gone against scrapers in 2024-2025. A HiringCafe-style aggregator should stay on public career pages and ATS-public boards (Greenhouse, Lever). These are publicly indexed by Google and explicitly meant to be discoverable.
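Beyond staying off LinkedIn, it's cheap to respect robots.txt before each extract. A sketch using Python's stdlib parser (the `JobAgentBot` user-agent string and the fail-open policy are assumptions, not a legal standard):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent='JobAgentBot'):
    """Check robots.txt before extracting a page."""
    root = urlparse(url)
    rp = RobotFileParser(f'{root.scheme}://{root.netloc}/robots.txt')
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable: treat as allowed
    return rp.can_fetch(user_agent, url)
```

Greenhouse and Lever boards are built to be crawled, so this rarely blocks anything; it mostly protects you on direct employer domains.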
The honest take
The data layer is the easy part. The HiringCafe edge — the reason the r/hiringcafe thread is positive — is the relevance ranking. Build the data layer fast with Scavio + LLM; spend the saved time on the ranking, which is the actual product.