Company Name to Website Enrichment: An Honest Guide (2026)
TL;DR: three steps (search, verify, score) built on Scavio search + knowledge_graph + /extract get roughly 92-96% accuracy; the residual 4-8% goes to human review.
An r/dataengineering post in May 2026 documented months of pain at one team trying to solve a deceptively simple problem: company name in, canonical website out. Existing solutions were "garbage", per the OP. This guide walks through an honest solution to the same problem in 2026.
Why this is harder than it looks
Naive approach: search for the company name plus "official site" and take the top result. This fails on roughly 5-15% of records, depending on the list. Common failure modes:
- Recent rebrand: legal name vs current marketing name.
- Acquired companies redirecting to a parent domain.
- Generic names colliding with more famous unrelated brands.
- Companies whose website is a subdomain or path of a parent (consulting groups, divisions).
- Stale registry data on company-info aggregators.
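For contrast, the naive baseline is a few lines over search results. A minimal sketch, assuming a response dict in the same shape as the Scavio search response used later in this guide (the company and domains below are made up to illustrate the generic-name collision):

```python
def naive_resolve(name, search_results):
    """Naive baseline: take the first organic result's link, no verification.

    `search_results` follows the shape {'organic_results': [{'link': ...}, ...]}.
    This is the approach that fails on roughly 5-15% of records.
    """
    organic = search_results.get('organic_results') or []
    return organic[0].get('link') if organic else None

# A generic-name collision: searching a small company named after a famous
# brand surfaces the unrelated famous site first, and the naive picker takes it.
results = {'organic_results': [
    {'link': 'https://www.apple.com'},          # famous, unrelated brand
    {'link': 'https://applemovingco.example'},  # the actual company
]}
```

The top-1 pick here is confidently wrong, which is exactly why the verify step exists.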
The pragmatic recipe
Three steps: search, verify, score. Skip any step and accuracy drops.
```python
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def resolve(name):
    # Step 1: search, with the knowledge-graph signal
    s = requests.post('https://api.scavio.dev/api/v1/search',
                      headers=H,
                      json={'query': f'"{name}" official site'}).json()
    kg = s.get('knowledge_graph', {})

    # Pick a candidate: prefer the KG website over the top organic result
    candidate = kg.get('website') or (s.get('organic_results') or [{}])[0].get('link')
    if not candidate:
        return {'match': False, 'reason': 'no_candidate'}

    # Step 2: verify via /extract -- does the name appear on the page?
    page = requests.post('https://api.scavio.dev/api/v1/extract',
                         headers=H,
                         json={'url': candidate}).json()
    text = (page.get('text') or '').lower()
    verified = name.lower() in text

    # Step 3: confidence score
    if kg.get('website') and verified:
        confidence = 'high'
    elif verified:
        confidence = 'medium'
    else:
        confidence = 'low'

    return {
        'name': name,
        'website': candidate,
        'verified': verified,
        'confidence': confidence,
        'kg_aliases': kg.get('aliases', []),
        'kg_parent': kg.get('parent_organization'),
    }
```

Why knowledge graph aliases matter
Google's knowledge graph often surfaces former names and parent-organization relationships. When a company rebranded, the KG entry frequently still includes the old name as an alias. That's the cheapest signal you can get for the rebrand pain point.
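A rebrand check can fall straight out of those aliases. A minimal sketch: the `aliases` field matches the resolver above; the `name` field and the example KG dict are assumptions for illustration (Facebook's rebrand to Meta is the canonical case):

```python
def matches_entity(name, kg):
    """True if `name` matches the KG entity's current name or any alias.

    Catches the rebrand case: the CRM record holds the old name while
    the KG entry lists it under `aliases`.
    """
    needle = name.strip().lower()
    candidates = [kg.get('name', '')] + list(kg.get('aliases', []))
    return any(needle == c.strip().lower() for c in candidates)

# Hypothetical KG payload for a rebranded company
kg = {'name': 'Meta', 'aliases': ['Facebook', 'Facebook Inc.'],
      'website': 'https://about.meta.com'}
```

A record still carrying "Facebook" matches via the alias list, so it resolves to the current domain instead of being flagged as a miss.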
Verification beats blind picking
Blindly picking the top search result is right roughly 80% of the time on clean B2B names. Verifying that the company name actually appears in the candidate domain's home-page text lifts that to roughly 92-96%. The /extract step is doing real work; don't skip it.
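One caveat on the exact-substring check in the resolver: "Acme, Inc." rarely appears verbatim on the homepage. A slightly more forgiving comparison that strips punctuation and trailing legal suffixes first (the suffix list is illustrative, not exhaustive):

```python
import re

# Trailing legal suffixes to ignore when matching -- extend for your market
LEGAL_SUFFIXES = {'inc', 'llc', 'ltd', 'corp', 'co', 'gmbh', 'plc'}

def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r'[^a-z0-9 ]', ' ', name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return ' '.join(tokens)

def verify(name, page_text):
    """True if the normalized name appears in the normalized page text."""
    return normalize(name) in normalize(page_text)
```

Swapping this in for the `name.lower() in text` line costs nothing and recovers matches the strict check would mark unverified.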
The 4-8% honest residual
No tool hits 100% on messy CRM exports. The residual edge cases:
- Companies sharing names with celebrities or fictional brands.
- Holding companies whose subsidiaries each have their own websites.
- Stealth-mode startups with a placeholder domain.
- Recently-acquired companies whose old domain redirects.
Route low-confidence records to human review. The honest answer is "this 4-8% needs a manual touch", not "trust me bro".
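The routing itself is a few lines. A sketch, assuming the output dicts produced by the resolver in the recipe above:

```python
def route(records):
    """Partition resolver outputs: high/medium auto-accept, low -> review queue."""
    auto, review = [], []
    for r in records:
        (auto if r['confidence'] in ('high', 'medium') else review).append(r)
    return auto, review

# Illustrative batch of resolver outputs
batch = [
    {'name': 'Stripe', 'confidence': 'high'},
    {'name': 'Acme Holdings', 'confidence': 'low'},
]
```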
Per-record cost economics
Each resolution is roughly two Scavio calls (search + extract) plus an optional LLM judge. At Project tier ($30/mo for 7K credits), that works out to about $0.001-0.005 per record. A 50K-row CRM enrichment is roughly $50-250 in Scavio cost — sustainable economics, where Apollo and ZoomInfo at $0.05-0.50/record price themselves out at scale.
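As a sanity check on those numbers, the per-record arithmetic under a flat one-credit-per-call assumption (credit accounting varies by plan and endpoint, so treat this as illustrative):

```python
def per_record_cost(monthly_price, monthly_credits, calls_per_record=2):
    """Dollar cost per enriched record at a flat price per credit."""
    return monthly_price / monthly_credits * calls_per_record

# Entry tier: $30/mo, 7K credits, two calls per record
cost = per_record_cost(30, 7_000)
# Lands just under a cent per record at this tier; cheaper per-credit
# pricing at volume (an assumption) is what pushes real batches toward
# the low end of the quoted range.
```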
What about Potarix and other vendors?
Potarix Enricher (the OP's tool) is one option. Apollo and ZoomInfo bundle this with their B2B contact data. Clearbit and People Data Labs offer enrichment APIs. Each has tradeoffs around price, coverage, and per-seat cost. The DIY shape via Scavio is the cheapest at scale and the most flexible for non-standard records.
Quarterly rebrand detection
The CRM you enrich today gets stale. Run the resolver as a quarterly cron over the full base; flag rebrands and update domains before outbound or enrichment pipelines break.
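The quarterly run only needs a diff against the stored domains. A sketch, assuming you keep last quarter's resolution alongside the fresh one (the Twitter/X example is the well-known real-world case):

```python
def detect_changes(previous, current):
    """Flag records whose resolved domain changed since the last run.

    `previous` and `current` map company name -> resolved website.
    A change is a rebrand/acquisition candidate for review, not an
    automatic overwrite.
    """
    return {
        name: (previous[name], site)
        for name, site in current.items()
        if name in previous and previous[name] != site
    }

old = {'Twitter': 'https://twitter.com', 'Stripe': 'https://stripe.com'}
new = {'Twitter': 'https://x.com', 'Stripe': 'https://stripe.com'}
```

Flagged pairs go through the same confidence-and-review path as fresh resolutions before the CRM is updated.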
Verified online in May 2026 against the Scavio API spec and the r/dataengineering source post.