n8n Directory Pagination and Data Extraction Workflow

Automate paginated directory scraping via n8n. Search business directories page by page, extract company data, and deduplicate into a master list.

Overview

Online directories like Clutch, G2, and Capterra have hundreds of pages of listings. Manually browsing is slow and incomplete. This n8n workflow automates paginated search queries to extract all listings in a category, deduplicates results, and builds a master prospect list. Each page of search results costs $0.005.

Trigger

Weekly cron on Monday at 3 AM UTC or on-demand for new categories.

Schedule

Weekly (Monday 3 AM UTC)
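In cron syntax (as used by n8n's Schedule Trigger node), this schedule corresponds to:

```
0 3 * * 1
```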

Workflow Steps

1. Configure Directory and Category Targets

Define which directories to search and which categories to extract. Each target includes the directory domain and category keywords.
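As a sketch, the targets from this step can live in a simple list of dictionaries; the field names and example entries below are illustrative, not part of the workflow:

```python
# Hypothetical target configuration: each entry pairs a directory domain
# with the category keywords to search on that directory.
TARGETS = [
    {"directory": "clutch.co", "category": "seo agencies"},
    {"directory": "g2.com", "category": "crm software"},
    {"directory": "capterra.com", "category": "project management tools"},
]

def build_queries(targets: list) -> list:
    """Turn each directory-category pair into a site-restricted search query."""
    return [f"site:{t['directory']} {t['category']}" for t in targets]
```

Each generated query (e.g. `site:clutch.co seo agencies`) is then fed into the paginated search step.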

2. Execute Paginated Search Queries

For each directory-category pair, run multiple search queries with page offsets to capture all listings. Continue until a page returns no results or the maximum page count is reached.

3. Extract Company Data from Results

Parse company names, descriptions, and URLs from organic results. Extract additional signals from snippets (ratings, review counts, specialties).
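One way to pull the extra signals out of a result snippet is a pair of regular expressions. The patterns below are a sketch and assume snippets in the common "4.8 stars" / "1,204 reviews" form; real directory snippets vary, so treat them as a starting point:

```python
import re

def extract_signals(snippet: str) -> dict:
    """Pull a star rating and review count out of a snippet, if present."""
    rating = re.search(r"(\d\.\d)\s*(?:stars?|/\s*5)", snippet)
    reviews = re.search(r"(\d[\d,]*)\s*reviews?", snippet)
    return {
        "rating": float(rating.group(1)) if rating else None,
        "review_count": int(reviews.group(1).replace(",", "")) if reviews else None,
    }
```

Fields that fail to match stay `None`, so downstream steps can treat missing signals explicitly rather than guessing.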

4. Deduplicate Against Master List

Compare new results against the existing master list. Add only new companies. Flag companies that appeared in previous runs but are now missing.
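A minimal sketch of the dedupe-and-flag logic, assuming the master list is a dict keyed by URL (the field names `first_seen` and `missing` are illustrative):

```python
def merge_into_master(master: dict, new_results: list, run_date: str) -> dict:
    """Merge freshly extracted listings into the master dict (keyed by URL).

    New companies are added with a first-seen date; companies absent from
    this run are flagged so later runs can see they dropped off the directory.
    """
    current_urls = {r["url"] for r in new_results}
    for r in new_results:
        if r["url"] not in master:
            master[r["url"]] = {**r, "first_seen": run_date, "missing": False}
        else:
            master[r["url"]]["missing"] = False  # still listed this run
    for url, entry in master.items():
        if url not in current_urls:
            entry["missing"] = True  # appeared in a previous run, gone now
    return master
```

Keying by URL rather than company name avoids false duplicates when two agencies share a name across directories.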

5. Export to Google Sheets or CRM

Append new companies to the master spreadsheet or create new CRM contacts. Tag with directory source, category, and extraction date.
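The export step reduces to shaping tagged rows. The sketch below only builds the rows (the column order is an assumption); appending them would then go through the n8n Google Sheets node or a client library such as gspread:

```python
from datetime import date

def to_sheet_rows(companies: list, source: str, category: str) -> list:
    """Shape extracted companies into spreadsheet rows tagged with
    directory source, category, and extraction date."""
    today = date.today().isoformat()
    return [
        [c["title"], c["url"], c.get("snippet", ""), source, category, today]
        for c in companies
    ]
```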

Python Implementation

Python
import os

import requests

API_KEY = os.environ["SCAVIO_API_KEY"]

def paginated_directory_search(directory: str, category: str, max_pages: int = 5) -> list:
    """Search a directory with pagination."""
    all_results = []
    for page in range(max_pages):
        resp = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
            json={"query": f"site:{directory} {category}", "country_code": "us", "start": page * 10},
            timeout=15,
        )
        resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
        data = resp.json()
        results = data.get("organic_results", [])
        if not results:
            break
        for r in results:
            all_results.append({"title": r.get("title", ""), "url": r.get("link", ""), "snippet": r.get("snippet", "")})
    # Deduplicate by URL
    seen = set()
    unique = []
    for r in all_results:
        if r["url"] not in seen:
            seen.add(r["url"])
            unique.append(r)
    return unique

listings = paginated_directory_search("clutch.co", "seo agencies", max_pages=5)
print(f"Extracted {len(listings)} unique listings from Clutch")

JavaScript Implementation

JavaScript
const HEADERS = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };

async function paginatedSearch(directory, category, maxPages = 5) {
  const all = [];
  for (let page = 0; page < maxPages; page++) {
    const resp = await fetch('https://api.scavio.dev/api/v1/search', {
      method: 'POST',
      headers: HEADERS,
      body: JSON.stringify({ query: `site:${directory} ${category}`, country_code: 'us', start: page * 10 }),
    });
    if (!resp.ok) throw new Error(`Search request failed: ${resp.status}`);
    const data = await resp.json();
    const results = data.organic_results || [];
    if (!results.length) break; // stop when a page comes back empty
    for (const item of results) {
      all.push({ title: item.title, url: item.link, snippet: item.snippet });
    }
  }
  // Deduplicate by URL, keeping the first occurrence
  const seen = new Set();
  return all.filter((item) => {
    if (seen.has(item.url)) return false;
    seen.add(item.url);
    return true;
  });
}

const listings = await paginatedSearch('clutch.co', 'seo agencies', 5);
console.log(`${listings.length} unique listings extracted`);

Platforms Used

Google

Web search with knowledge graph, People Also Ask (PAA), and AI overviews

Frequently Asked Questions

What does this workflow do?

It automates paginated search queries against directories like Clutch, G2, and Capterra, extracts every listing in a category, deduplicates the results, and builds a master prospect list. Each page of search results costs $0.005.

How often does it run?

The workflow runs on a weekly cron (Monday at 3 AM UTC) and can also be triggered on demand for new categories.

Which platforms does it use?

This workflow uses the Google platform. Each platform is called via the same unified Scavio API endpoint.

Can I try it for free?

Yes. Scavio's free tier includes 250 credits per month with no credit card required. That is enough to test and validate this workflow before scaling it.
