Detect Content Theft with SERP API Monitoring

Find where your content has been scraped by searching for exact phrases via SERP API.

Content theft detection starts with automated search: extract unique phrases from your articles, search for them via API, and flag any domain that is not yours. Offshore scraper sites ignore DMCA takedowns, so the real bottleneck is discovering the copies exist at all.

Why Manual Checks Fail

Googling a sentence from your article once a week catches maybe 5% of scraped copies. Content thieves rehost hundreds of articles across dozens of domains. They rotate domain names, use different TLDs, and sometimes lightly rewrite intros while keeping the body verbatim. Automated monitoring at scale is the only approach that works.

The Detection Strategy

Pick 3-5 unique phrases from each article -- sentences specific enough that no one else would write them independently. Search for each phrase in quotes. Any result on a domain you do not control is a potential scraper. The more phrases that match, the higher confidence the page is a copy.

Extracting Fingerprint Phrases

Python
import requests, os, re
from urllib.parse import urlparse

API_KEY = os.environ["SCAVIO_API_KEY"]
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
SEARCH_URL = "https://api.scavio.dev/api/v1/search"

YOUR_DOMAINS = {"yourdomain.com", "www.yourdomain.com"}

def extract_fingerprints(text, count=5):
    """Pick unique mid-article sentences as fingerprints."""
    sentences = [s.strip() for s in re.split(r'[.!?]', text)
                 if len(s.strip()) > 60]
    # Skip the first two and last two -- intros/outros are generic
    middle = sentences[2:-2] if len(sentences) > 6 else sentences
    return [f'"{s}"' for s in middle[:count]]
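
A quick sanity check of the extractor -- the sample text below is invented purely for illustration:

Python
sample_text = (
    "This opening sentence is generic filler that many other articles could share. "
    "Combining edge caching with speculative prefetch cut our checkout latency nearly in half. "
    "That oddly specific second claim is exactly the kind of sentence worth fingerprinting."
)
# With six or fewer sentences the extractor keeps them all, so this
# prints the first two sentences, each wrapped in double quotes.
print(extract_fingerprints(sample_text, count=2))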

Searching for Copies

Python
def search_for_copies(phrase):
    """Search Google for an exact phrase and return non-owned results."""
    resp = requests.post(SEARCH_URL, headers=HEADERS, json={
        "platform": "google",
        "query": phrase,
        "country": "us"
    })
    resp.raise_for_status()
    results = resp.json().get("data", {}).get("organic", [])
    copies = []
    for r in results:
        # Normalize to a bare domain so the ownership check is consistent
        domain = urlparse(r["link"]).netloc.lower().removeprefix("www.")
        if domain not in YOUR_DOMAINS:
            copies.append({
                "title": r["title"],
                "url": r["link"],
                "domain": domain,
                "snippet": r.get("snippet", "")
            })
    return copies

Full Detection Pipeline

Python
def detect_theft(article_text, article_url):
    """Run full theft detection on one article."""
    fingerprints = extract_fingerprints(article_text)
    all_copies = {}

    for phrase in fingerprints:
        copies = search_for_copies(phrase)
        for copy in copies:
            domain = copy["domain"]
            if domain not in all_copies:
                all_copies[domain] = {
                    "domain": domain,
                    "url": copy["url"],
                    "title": copy["title"],
                    "phrase_matches": 0
                }
            all_copies[domain]["phrase_matches"] += 1

    # Sort by confidence (more phrase matches = more likely a copy)
    ranked = sorted(all_copies.values(),
                    key=lambda x: x["phrase_matches"], reverse=True)

    print(f"Checked: {article_url}")
    print(f"Fingerprints searched: {len(fingerprints)}")
    print(f"Suspect domains found: {len(ranked)}")
    for r in ranked:
        confidence = r["phrase_matches"] / len(fingerprints) * 100
        print(f"  [{confidence:.0f}%] {r['url']}")

    return ranked

# Cost: 5 fingerprints x 1 credit = 5 credits ($0.025) per article
with open("my-article.txt") as f:
    detect_theft(f.read(), "https://yourdomain.com/my-article")

Scheduling Regular Scans

Run the detection pipeline weekly on your top-performing content. A site with 50 articles costs 250 credits per scan (5 searches per article). Scavio's free tier (250 credits/month) covers one scan of those 50 articles per month; the $30/month plan (7,000 credits) covers 350 articles scanned weekly (350 articles x 5 searches x 4 scans = 7,000 credits).

Python
import json
from datetime import datetime

def weekly_scan(articles):
    """Scan all articles and save results."""
    report = {"date": datetime.now().isoformat(), "findings": []}
    for article in articles:
        copies = detect_theft(article["text"], article["url"])
        if copies:
            report["findings"].append({
                "article": article["url"],
                "copies": copies
            })

    with open(f"theft-report-{datetime.now().strftime('%Y-%m-%d')}.json", "w") as f:
        json.dump(report, f, indent=2)

    high_confidence = [f for f in report["findings"]
                       if any(c["phrase_matches"] >= 3 for c in f["copies"])]
    print(f"High-confidence theft detected on {len(high_confidence)} articles")
    return report
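
weekly_scan still needs a trigger. One minimal entry point -- assuming, purely for illustration, that each article lives in an articles/ folder as a .txt file named after its URL slug -- can be run from cron or any scheduler:

Python
import glob

def load_articles(folder="articles"):
    """Load articles from .txt files whose names map to URL slugs."""
    articles = []
    for path in glob.glob(f"{folder}/*.txt"):
        slug = path.split("/")[-1].removesuffix(".txt")
        with open(path) as f:
            articles.append({
                "url": f"https://yourdomain.com/{slug}",
                "text": f.read()
            })
    return articles

if __name__ == "__main__":
    # e.g. cron: 0 6 * * 1 python theft_scan.py (script name is a placeholder)
    weekly_scan(load_articles())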

What to Do With Detected Copies

For sites that respect DMCA: file a takedown with the hosting provider (find it via WHOIS). For Cloudflare-fronted sites: use Cloudflare's abuse form. For Google: submit a DMCA removal request to deindex the page. For offshore sites that ignore everything: the Google deindex route is often the only practical option -- they keep the content but lose the traffic.
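
The WHOIS step can be scripted as well. A minimal sketch using the third-party python-whois package -- an assumption, any WHOIS client works, and contact details are often redacted:

Python
import whois  # pip install python-whois

def takedown_contacts(domain):
    """Pull registrar and contact emails from WHOIS for a DMCA notice."""
    record = whois.whois(domain)
    return {
        "domain": domain,
        "registrar": record.registrar,
        "emails": record.emails  # may be None if the record is redacted
    }

# Assuming `report` is the dict returned by weekly_scan above:
for finding in report["findings"]:
    for copy in finding["copies"]:
        if copy["phrase_matches"] >= 3:
            print(takedown_contacts(copy["domain"]))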

Cost Breakdown

Each article scan uses 5 credits at $0.005/credit = $0.025. Monitoring 100 articles weekly = 500 credits/week = roughly 2,000 credits ($10) per month, which fits comfortably in the $30/month plan. Compare to Copyscape Premium at $0.03/search (a similar per-article cost) or Brand24 content monitoring starting at $119/month.
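
The arithmetic generalizes to any article count or scan cadence; a small helper using the prices quoted above:

Python
def monthly_cost(articles, phrases=5, scans_per_month=4, price_per_credit=0.005):
    """Estimate monthly credit usage and dollar cost for a scan schedule."""
    credits = articles * phrases * scans_per_month
    return credits, credits * price_per_credit

# 100 articles scanned weekly -> (2000, 10.0): 2,000 credits, $10/month
print(monthly_cost(100))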