Detect Content Theft with SERP API Monitoring
Find where your content has been scraped by searching for exact phrases via SERP API.
Content theft detection starts with automated search: extract unique phrases from your articles, search for them via API, and flag any domain that is not yours. Offshore scraper sites ignore DMCA takedowns, so the real bottleneck is discovering the copies exist at all.
Why Manual Checks Fail
Googling a sentence from your article once a week catches maybe 5% of scraped copies. Content thieves rehost hundreds of articles across dozens of domains. They rotate domain names, use different TLDs, and sometimes lightly rewrite intros while keeping the body verbatim. Automated monitoring at scale is the only approach that works.
The Detection Strategy
Pick 3-5 unique phrases from each article -- sentences specific enough that no one else would write them independently. Search for each phrase in quotes. Any result on a domain you do not control is a potential scraper. The more phrases that match, the higher confidence the page is a copy.
Extracting Fingerprint Phrases
import requests, os, re
from urllib.parse import urlparse
API_KEY = os.environ["SCAVIO_API_KEY"]
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
SEARCH_URL = "https://api.scavio.dev/api/v1/search"
YOUR_DOMAINS = {"yourdomain.com", "www.yourdomain.com"}
def extract_fingerprints(text, count=5):
    """Pick unique mid-article sentences as fingerprints."""
    sentences = [s.strip() for s in re.split(r'[.!?]', text)
                 if len(s.strip()) > 60]
    # Skip the first two and last two -- intros and outros are generic
    middle = sentences[2:-2] if len(sentences) > 6 else sentences
    return [f'"{s}"' for s in middle[:count]]

Searching for Copies
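Before spending any search credits, the extractor can be sanity-checked offline, since it is pure string processing. This sketch repeats the function so it runs standalone; the sample text is invented:

```python
import re

def extract_fingerprints(text, count=5):
    """Same logic as the extractor above, repeated so this snippet runs on its own."""
    sentences = [s.strip() for s in re.split(r'[.!?]', text)
                 if len(s.strip()) > 60]
    middle = sentences[2:-2] if len(sentences) > 6 else sentences
    return [f'"{s}"' for s in middle[:count]]

# Invented sample: ten long sentences, so the middle slice has room to work
sample = ". ".join(
    f"This is filler sentence number {i} padding the article body out to a realistic length"
    for i in range(10)
) + "."

phrases = extract_fingerprints(sample)
print(len(phrases))   # 5 -- sentences 0, 1, 8, and 9 were skipped as intro/outro
print(phrases[0])     # the third sentence, wrapped in quotes for exact-match search
```

With ten qualifying sentences, the `[2:-2]` slice leaves six candidates and the first five come back quoted, ready to paste into an exact-phrase query.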
def search_for_copies(phrase):
    """Search Google for an exact phrase and return non-owned results."""
    resp = requests.post(SEARCH_URL, headers=HEADERS, json={
        "platform": "google",
        "query": phrase,
        "country": "us"
    }, timeout=30)
    resp.raise_for_status()
    results = resp.json().get("data", {}).get("organic", [])
    copies = []
    for r in results:
        # Strip a leading "www." so both forms compare against YOUR_DOMAINS
        domain = urlparse(r["link"]).netloc.removeprefix("www.")
        if domain not in YOUR_DOMAINS:
            copies.append({
                "title": r["title"],
                "url": r["link"],
                "domain": domain,
                "snippet": r.get("snippet", "")
            })
    return copies

Full Detection Pipeline
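One practical hardening first: search_for_copies makes a single un-retried HTTP call per phrase, and SERP backends do return transient errors. A generic backoff wrapper (my own sketch, not part of the Scavio API) keeps one flaky request from aborting a whole scan:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on exception, wait base_delay * 2^i seconds and retry."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise          # out of attempts -- surface the error
            time.sleep(base_delay * 2 ** i)

# Quick self-test with a function that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = with_retries(flaky, base_delay=0.0)   # succeeds on the third attempt
```

In the pipeline below, each search would then be wrapped as `with_retries(lambda: search_for_copies(phrase))`.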
def detect_theft(article_text, article_url):
    """Run full theft detection on one article."""
    fingerprints = extract_fingerprints(article_text)
    all_copies = {}
    for phrase in fingerprints:
        copies = search_for_copies(phrase)
        for copy in copies:
            domain = copy["domain"]
            if domain not in all_copies:
                all_copies[domain] = {
                    "url": copy["url"],
                    "title": copy["title"],
                    "phrase_matches": 0
                }
            all_copies[domain]["phrase_matches"] += 1
    # Sort by confidence (more phrase matches = more likely a copy)
    ranked = sorted(all_copies.values(),
                    key=lambda x: x["phrase_matches"], reverse=True)
    print(f"Checked: {article_url}")
    print(f"Fingerprints searched: {len(fingerprints)}")
    print(f"Suspect domains found: {len(ranked)}")
    for r in ranked:
        confidence = r["phrase_matches"] / len(fingerprints) * 100
        print(f"  [{confidence:.0f}%] {r['url']}")
    return ranked

# Cost: 5 fingerprints x 1 credit = 5 credits ($0.025) per article
with open("my-article.txt") as f:
    detect_theft(f.read(), "https://yourdomain.com/my-article")

Scheduling Regular Scans
Run the detection pipeline weekly on your top-performing content. A site with 50 articles costs 250 credits per weekly scan (5 searches per article). On Scavio's free tier (250 credits/month) you can monitor 50 articles monthly. The $30/month plan (7,000 credits) covers 350 articles scanned weekly.
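The arithmetic above generalizes to any article count and scan cadence. A small helper (the names here are mine, not the API's) makes it easy to check whether a monitoring schedule fits a credit budget:

```python
def monthly_credits(articles, fingerprints_per_article=5, scans_per_month=4):
    """Credits consumed per month: one search credit per fingerprint per scan."""
    return articles * fingerprints_per_article * scans_per_month

# The figures from the text: 50 articles scanned once a month fit the free tier,
# and 350 articles scanned weekly fit the 7,000-credit plan.
print(monthly_credits(50, scans_per_month=1))   # 250
print(monthly_credits(350))                     # 7000
```

Assuming four scans per month for "weekly", any combination where `monthly_credits(...)` stays under your plan's allowance is safe to schedule.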
import json
from datetime import datetime

def weekly_scan(articles):
    """Scan all articles and save results."""
    report = {"date": datetime.now().isoformat(), "findings": []}
    for article in articles:
        copies = detect_theft(article["text"], article["url"])
        if copies:
            report["findings"].append({
                "article": article["url"],
                "copies": copies
            })
    with open(f"theft-report-{datetime.now().strftime('%Y-%m-%d')}.json", "w") as f:
        json.dump(report, f, indent=2)
    high_confidence = [finding for finding in report["findings"]
                       if any(c["phrase_matches"] >= 3
                              for c in finding["copies"])]
    print(f"High-confidence theft detected on {len(high_confidence)} articles")
    return report

What to Do With Detected Copies
For sites that respect DMCA: file a takedown with the hosting provider (find it via WHOIS). For Cloudflare-fronted sites: use Cloudflare's abuse form. For Google: submit a DMCA removal request to deindex the page. For offshore sites that ignore everything: the Google deindex route is often the only practical option -- they keep the content but lose the traffic.
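For the hosts that do respond, the takedown itself is mostly boilerplate. A helper like this turns one detection result into a ready-to-send notice -- a hypothetical template of my own, so check the recipient's actual DMCA filing requirements before sending:

```python
def dmca_notice(original_url, infringing_url, owner_name, owner_email):
    """Fill a minimal DMCA takedown template from one detection result."""
    return (
        f"To whom it may concern,\n\n"
        f"I am the copyright owner of the content published at {original_url}. "
        f"The page at {infringing_url} reproduces this content without "
        f"authorization. I request its removal under the DMCA.\n\n"
        f"I have a good-faith belief that the use is not authorized by the "
        f"copyright owner, and under penalty of perjury, the information in "
        f"this notice is accurate.\n\n"
        f"{owner_name}\n{owner_email}\n"
    )

# Plug in the "url" field from a detect_theft result (example values invented):
notice = dmca_notice(
    "https://yourdomain.com/my-article",
    "https://scraper-site.example/my-article",
    "Jane Doe", "jane@yourdomain.com",
)
print(notice)
```

Generating the notice from the scan report keeps the original and infringing URLs paired correctly, which is the part most often botched in hand-written filings.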
Cost Breakdown
Each article scan uses 5 credits at $0.005/credit = $0.025. Monitoring 100 articles weekly = 500 credits/week = 2,000/month. That fits comfortably in the $30/month plan. Compare to Copyscape Premium at $0.03/search (similar per-article cost but no API access for automation) or Brand24 content monitoring starting at $119/month.