An r/dataengineering post documented months spent trying to solve company-name-to-website resolution. This guide walks through the Scavio recipe: search, knowledge_graph, and /extract verification, with confidence scoring.
Prerequisites
- Scavio API key
- Python or Node + HTTP client
- A list of company names to resolve
Walkthrough
Step 1: Start with one company name
Build the pipeline on a single name before scaling to 10K.
name = 'Stripe'
Step 2: POST to Scavio search
Search for the official site signal.
import requests, os
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
r = requests.post('https://api.scavio.dev/api/v1/search', headers=H, json={'query': f'"{name}" official site'}).json()
Step 3: Pick candidate domain
Prefer knowledge_graph.website; fall back to organic_results[0].link.
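As a minimal sketch of that preference order, the helper below picks the URL and reduces it to the site root, so that a deep link from the organic results still points /extract at the home page. The name `pick_candidate` and the root-normalization step are illustrative assumptions; only the `knowledge_graph.website` and `organic_results[0].link` fields come from the recipe.

```python
from urllib.parse import urlparse


def pick_candidate(response):
    """Return the site root for the best candidate URL, or None if there is none."""
    kg = response.get('knowledge_graph') or {}
    organic = response.get('organic_results') or []
    # Prefer the knowledge graph; fall back to the first organic result.
    url = kg.get('website') or (organic[0].get('link') if organic else None)
    if not url:
        return None
    # Tolerate bare domains such as 'stripe.com' by assuming https.
    parsed = urlparse(url if '://' in url else 'https://' + url)
    if not parsed.netloc:
        return None
    # Keep only scheme and host: https://stripe.com/en-gb/payments -> https://stripe.com
    return f'{parsed.scheme}://{parsed.netloc.lower()}'
```

Passing the search response `r` to `pick_candidate(r)` yields either a home page URL or `None`, which makes the "no candidate at all" case explicit before the verification call.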
kg = r.get('knowledge_graph', {})
candidate = kg.get('website') or (r.get('organic_results') or [{}])[0].get('link')
Step 4: Verify via /extract
Fetch the candidate domain home page; check that the company name appears.
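A plain substring check in this step can miss a page when the CRM name carries a legal suffix ("Stripe, Inc."), and can fire on a longer word that happens to contain the name. The sketch below is one possible, more tolerant match; `normalize`, `name_in_text`, and the suffix list are hypothetical helpers, not part of the Scavio recipe.

```python
import re

# Illustrative suffix list (an assumption); extend it for your own data.
LEGAL_SUFFIXES = {'inc', 'llc', 'ltd', 'corp', 'co', 'gmbh', 'plc'}


def normalize(s):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return ' '.join(re.sub(r'[^a-z0-9]+', ' ', s.lower()).split())


def name_in_text(name, text):
    """True if the company name, minus trailing legal suffixes, appears in text."""
    tokens = normalize(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    core = ' '.join(tokens)
    # Pad with spaces so 'stripe' does not match inside 'pinstripes'.
    return bool(core) and f' {core} ' in f' {normalize(text)} '
```

With this helper, `verified = name_in_text(name, page.get('text') or '')` could stand in for the substring test.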
page = requests.post('https://api.scavio.dev/api/v1/extract', headers=H, json={'url': candidate}).json()
text = (page.get('text') or '').lower()
verified = name.lower() in text
Step 5: Compute confidence score
knowledge_graph hit + name-in-text + non-generic host = high; name-in-text alone = medium; anything else = low.
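The rule can be sketched as a small function that also covers the host check. What counts as a "generic" host is an assumption here: directories and social networks that mention many companies, listed in a hypothetical `GENERIC_HOSTS` set. Scoring a generic host as low is likewise a choice made for this sketch, not something the recipe specifies.

```python
from urllib.parse import urlparse

# Hosts that mention many companies and so prove little (illustrative assumption).
GENERIC_HOSTS = {'linkedin.com', 'facebook.com', 'wikipedia.org', 'crunchbase.com'}


def is_generic_host(url):
    """True if the URL's host is a listed generic host or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == g or host.endswith('.' + g) for g in GENERIC_HOSTS)


def score(kg_website, verified, candidate):
    """Combine the three signals into 'high', 'medium', or 'low'."""
    if not candidate or not verified or is_generic_host(candidate):
        return 'low'
    return 'high' if kg_website else 'medium'
```

Called as `score(kg.get('website'), verified, candidate)`, it returns the same labels as the one-line expression, except that a name match on a generic host falls to low.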
confidence = 'high' if kg.get('website') and verified else ('medium' if verified else 'low')
Step 6: Route low-confidence to human review
Be honest about the 4-8% of edge cases: queue them for a person instead of guessing.
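This step calls `enqueue_for_review`, which the recipe leaves to the reader. One simple possibility, sketched here under the assumption that a CSV file is an acceptable review queue, is to append each unresolved row to a file; the file name and column names are illustrative.

```python
import csv
import os


def enqueue_for_review(name, candidate, path='review_queue.csv'):
    """Append one unresolved company to a CSV file for a person to check."""
    # Write the header only when the file is missing or empty.
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['name', 'candidate_url'])
        writer.writerow([name, candidate or ''])
```

A ticketing system or a database table would serve the same purpose; the point is that low-confidence rows leave the automated path with enough context to be checked by hand.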
if confidence == 'low': enqueue_for_review(name, candidate)
Step 7: Batch the pipeline
Parallelize across CRM rows.
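A minimal sketch of bounded concurrency, assuming `resolve` is an ordinary blocking function (for example one built on requests) that takes a name and returns a record. The helper name `resolve_all` and its parameters are hypothetical.

```python
import asyncio


async def resolve_all(names, resolve, concurrency=10):
    """Run a blocking resolve(name) for every name, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(name):
        async with sem:
            # resolve() blocks on HTTP, so run it in a worker thread.
            return await asyncio.to_thread(resolve, name)

    # gather keeps input order; with return_exceptions=True a failing row
    # comes back as an exception object instead of aborting the batch.
    return await asyncio.gather(*(worker(n) for n in names), return_exceptions=True)
```

Usage would look like `results = asyncio.run(resolve_all(names, resolve, concurrency=10))`. Note that `asyncio.to_thread` requires Python 3.9 or later and uses the default thread pool, whose size can cap the effective concurrency below the semaphore limit.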
# asyncio.gather(*[resolve(n) for n in names]), with concurrency capped at 10-20
Python Example
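The walkthrough steps can be assembled into one function, sketched below. The endpoints, the `x-api-key` header, and the response fields (`knowledge_graph.website`, `organic_results[0].link`, `text`) are taken from the steps in this guide; the `scavio_post` helper, the injectable `post` parameter, the 30-second timeout, and the `raise_for_status` call are additions made for this example.

```python
import os

BASE = 'https://api.scavio.dev/api/v1'


def scavio_post(path, payload):
    """POST JSON to a Scavio endpoint and return the decoded body."""
    # Imported lazily so resolve() can be exercised with a stub `post`
    # even where the requests package is not installed.
    import requests
    headers = {'x-api-key': os.environ['SCAVIO_API_KEY']}
    resp = requests.post(f'{BASE}/{path}', headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()


def resolve(name, post=scavio_post):
    """Resolve one company name to a record with a confidence label."""
    r = post('search', {'query': f'"{name}" official site'})
    kg = r.get('knowledge_graph') or {}
    organic = r.get('organic_results') or [{}]
    candidate = kg.get('website') or organic[0].get('link')
    verified = False
    if candidate:
        # Only spend an /extract call when there is something to verify.
        page = post('extract', {'url': candidate})
        verified = name.lower() in (page.get('text') or '').lower()
    if verified:
        confidence = 'high' if kg.get('website') else 'medium'
    else:
        confidence = 'low'
    return {'name': name, 'candidate_url': candidate,
            'verified': verified, 'confidence': confidence}
```

Taking `post` as a parameter keeps the pipeline logic testable offline: a stub that returns canned dictionaries can replace the real HTTP call.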
# Per record: ~2 Scavio calls, roughly $0.001-0.005 in credits at the Project tier (about $10-50 per 10K records).
JavaScript Example
// Same shape in TypeScript with fetch + Promise.all batching.
Expected Output
Per record: { name, candidate_url, verified: true|false, confidence: 'high'|'medium'|'low' }. ~92-96% accuracy on clean names.