Tutorial

How to Build a HiringCafe-Style Job Aggregator

An r/hiringcafe thread surfaced the pattern: pull from career pages, AI-summarize, surface salary. Walk-through with Scavio + LLM.

An r/hiringcafe thread shared the AI Job Search Agent pattern: pull from real employer career pages, AI-summarize each role, and surface salary upfront. This tutorial walks through building a HiringCafe-style aggregator with Scavio and an LLM.

Prerequisites

  • Scavio API key
  • An LLM API key
  • A list of target employers (or a way to discover them)

Walkthrough

Step 1: Discover career-page URLs via dorked search

Combine dorked queries: site:{domain}/careers on the company's own site, plus the hosted-ATS patterns jobs.lever.co/{slug} and boards.greenhouse.io/{slug}.

Python
import os
import requests

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
# {domain} is the full domain ('acme.com'); {slug} is the bare company
# name ('acme') that Lever and Greenhouse board URLs use
DORKS = [
    'site:{domain}/careers',
    'site:{domain}/jobs',
    'site:jobs.lever.co/{slug}',
    'site:boards.greenhouse.io/{slug}',
]

def find_career_urls(domain):
    slug = domain.rsplit('.', 1)[0]  # works for .com, .io, .dev, ...
    out = []
    for d in DORKS:
        q = d.format(domain=domain, slug=slug)
        r = requests.post('https://api.scavio.dev/api/v1/search',
                          headers=H, json={'query': q}).json()
        out.extend(o['link'] for o in r.get('organic_results', [])[:5])
    return list(dict.fromkeys(out))  # dedupe while keeping order
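The four dork patterns often return the same board twice, differing only in tracking parameters or a trailing slash. A small URL normalizer (an assumption layered on top, not part of the Scavio API) cuts down redundant extract calls:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    # lowercase the host, drop query string and fragment, strip a trailing slash
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip('/'), '', ''))
```

Run search results through canonicalize() before deduping, so `https://Jobs.Lever.co/acme/?utm_source=reddit` and `https://jobs.lever.co/acme` collapse into one URL.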

Step 2: Extract listing pages as markdown

Scavio /extract turns the careers page into clean markdown.

Python
def extract(url):
    r = requests.post('https://api.scavio.dev/api/v1/extract',
                      headers=H, json={'url': url, 'format': 'markdown'})
    return r.json().get('markdown', '')
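Large careers pages can blow past an LLM context window. A crude guard before the Step 3 call, trimming at the last line break under a character budget (the 20,000-character default is an arbitrary assumption, not a Scavio or model limit):

```python
def clip_markdown(md, max_chars=20000):
    # cut at the last line break under the budget so we don't split mid-line
    if len(md) <= max_chars:
        return md
    cut = md.rfind('\n', 0, max_chars)
    return md[:cut] if cut > 0 else md[:max_chars]
```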

Step 3: Parse roles with an LLM

Structured extraction: title, location, salary if shown, summary.

Python
import json

PROMPT = '''Extract job postings from this careers page. For each, return JSON with:
- title, team, location, remote (bool), salary_min, salary_max (null if not shown), apply_url, summary (2 sentences).
Return a JSON list only, with no surrounding prose or code fences.
Page:
{md}'''

def parse_roles(llm, markdown):
    # `llm.complete` stands in for whichever LLM client you use
    raw = llm.complete(PROMPT.format(md=markdown))
    return json.loads(raw)
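Even with the schema, models sometimes return salary as a string ("$120k - $150k") instead of numbers. A hedged normalizer for common US formats (an assumption about what the LLM emits, not a guarantee) keeps salary_min/salary_max numeric:

```python
import re

def parse_salary(text):
    # "$120k - $150k" -> (120000, 150000); "$95,000" -> (95000, 95000)
    # returns (None, None) when no figures are found
    nums = []
    for m in re.finditer(r'\$?(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)\s*([kK])?', text or ''):
        n = float(m.group(1).replace(',', ''))
        if m.group(2):
            n *= 1000
        nums.append(int(n))
    if not nums:
        return (None, None)
    return (min(nums), max(nums))
```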

Step 4: Dedupe by (employer, title, location)

Same role on multiple aggregators = one record.

Python
def dedupe(roles):
    # assumes each role carries an 'employer' key, attached when results
    # are collected per domain (the Step 3 schema does not include it)
    seen = set()
    out = []
    for r in roles:
        key = (r['employer'], r['title'].lower(), r['location'].lower())
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
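Exact string keys still miss near-duplicates like "Sr. Engineer" vs "Senior Engineer". A light title normalizer for the dedupe key, built on a small synonym map (the map below is a starter assumption, extend it for your corpus):

```python
import re

# hypothetical starter map; grow it as you see variants in real postings
SYNONYMS = {'sr': 'senior', 'sr.': 'senior', 'jr': 'junior', 'jr.': 'junior',
            'eng': 'engineer', 'swe': 'software engineer'}

def norm_title(title):
    # lowercase, strip punctuation (except dots), collapse whitespace,
    # expand common abbreviations
    words = re.sub(r'[^\w.\s]', ' ', title.lower()).split()
    return ' '.join(SYNONYMS.get(w, w) for w in words)
```

Use norm_title(r['title']) in the dedupe key instead of the raw title.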

Step 5: Rank by salary + recency + match score

User-supplied skills drive the match score; salary and recency round out the ranking.

Python
def rank(roles, user_skills):
    for r in roles:
        text = (r.get('summary', '') + ' ' + r['title']).lower()
        match = sum(1 for s in user_skills if s.lower() in text)
        # weights are a starting point; tune salary vs. skill match for your users
        r['score'] = (r.get('salary_max') or 0) * 0.3 + match * 100
    return sorted(roles, key=lambda x: -x['score'])
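The step title also promises recency, which the snippet above leaves out. A sketch of a recency term, assuming each role carries a hypothetical posted_at ISO-8601 date (not part of the Step 3 schema; you would need to ask the LLM to extract it):

```python
from datetime import datetime, timezone

def recency_bonus(posted_at, now=None, horizon_days=30, weight=10):
    # Linear decay: a posting from today earns horizon_days * weight,
    # anything older than horizon_days earns 0. posted_at is assumed
    # to be an ISO-8601 date string like '2024-05-01'.
    now = now or datetime.now(timezone.utc)
    posted = datetime.fromisoformat(posted_at)
    if posted.tzinfo is None:
        posted = posted.replace(tzinfo=timezone.utc)
    age = (now - posted).days
    return max(0, horizon_days - age) * weight
```

Add the bonus into r['score'] inside rank() when the field is present.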

Python Example

Python
# Per-employer cost: ~4 dorked searches + 1 extract per career URL + 1 LLM call = ~$0.02-0.05
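The per-employer loop can be sketched end to end. The search, extract, and parse steps are passed in as plain functions (the names here are illustrative, not part of any API), which keeps the pipeline testable with stubs instead of live HTTP calls:

```python
def build_feed(domains, find_urls, extract, parse, user_skills):
    # find_urls(domain) -> [url]; extract(url) -> markdown;
    # parse(markdown) -> [role dict]. Injected so tests can stub them.
    roles = []
    for domain in domains:
        for url in find_urls(domain):
            for role in parse(extract(url)):
                role['employer'] = domain
                roles.append(role)
    seen, unique = set(), []
    for r in roles:  # dedupe on (employer, title, location), as in Step 4
        key = (r['employer'], r['title'], r['location'])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    for r in unique:  # same scoring as Step 5
        text = (r.get('summary', '') + r['title']).lower()
        match = sum(1 for s in user_skills if s.lower() in text)
        r['score'] = (r.get('salary_max') or 0) * 0.3 + match * 100
    return sorted(unique, key=lambda x: -x['score'])
```

In production you would pass find_career_urls, extract, and a parse wrapper around the LLM call; in tests, lambdas returning canned data.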

JavaScript Example

JavaScript
// Same flow in JavaScript/TypeScript: fetch() the same Scavio endpoints with identical headers and JSON bodies.

Expected Output

A JSON list of jobs with title, salary, summary, and apply_url, deduped across aggregators and ranked by user skills plus salary. The hard part remains the relevance ranking; the data layer is the easy part.
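Illustrative only: the values below are invented, but this is the shape the Step 3 schema produces for each role.

```json
[
  {
    "title": "Senior Backend Engineer",
    "team": "Platform",
    "location": "New York, NY",
    "remote": true,
    "salary_min": 150000,
    "salary_max": 190000,
    "apply_url": "https://jobs.lever.co/acme/1234",
    "summary": "Owns the job-ingestion pipeline. Works across Python services."
  }
]
```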

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

What are the prerequisites?

A Scavio API key, an LLM API key, and a list of target employers (or a way to discover them). A Scavio API key gives you 500 free credits per month.

Can I complete this on the free tier?

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does Scavio integrate with agent frameworks?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.

Start Building

Grab a Scavio API key, point the pipeline at a few employer domains, and you can have a working prototype in 15 to 30 minutes.