
LLM Wiki Ingestion Workflow

Daily ingestion of new sources for a Karpathy-style LLM Wiki: Scavio search across the web, Reddit, and YouTube, followed by extraction and embedding.

Overview

Once a day, for each topic, the workflow pulls new sources from Google, Reddit, and YouTube via Scavio, skips URLs that are already stored, extracts the remaining pages as markdown, and embeds them into Qdrant.

Trigger

Daily cron at 6am for the active topic list

Schedule

Daily at 6am
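With cron, this schedule is the expression `0 6 * * *`. Where cron is not available, a long-running process can work out the delay until the next 6am itself. The following is a minimal sketch of that calculation, assuming naive datetimes in the server's local time; `next_run` and `seconds_until_next_run` are hypothetical helper names, not part of Scavio or Qdrant.

```python
from datetime import datetime, time, timedelta

RUN_AT = time(hour=6)  # 6am, matching the schedule above


def next_run(now: datetime) -> datetime:
    """Return the next 6am strictly after `now` (naive local time assumed)."""
    candidate = datetime.combine(now.date(), RUN_AT)
    if candidate <= now:
        # Today's 6am has already passed (or is right now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def seconds_until_next_run(now: datetime) -> float:
    """Seconds a scheduler loop would sleep before starting the next run."""
    return (next_run(now) - now).total_seconds()
```

A scheduler loop could sleep for `seconds_until_next_run(datetime.now())` and then process the active topic list.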

Workflow Steps

1. Iterate active topics

Pull the topic list from a Postgres table or a YAML config.
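For the database option, a sketch of the topic lookup might look like the following. It assumes a hypothetical `topics` table with `name` and `active` columns, and it only relies on the standard DB-API cursor interface, so the same function should work with a Postgres driver connection. SQLite is used here purely as a stand-in so the example runs without a server.

```python
import sqlite3  # stand-in driver; a Postgres DB-API connection is used the same way


def load_active_topics(conn) -> list[str]:
    """Return the names of active topics in a stable order."""
    cur = conn.cursor()
    # `topics`, `name`, and `active` are assumed names, not defined by the workflow.
    cur.execute("SELECT name FROM topics WHERE active ORDER BY name")
    return [row[0] for row in cur.fetchall()]


# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (name TEXT PRIMARY KEY, active BOOLEAN NOT NULL)")
conn.executemany(
    "INSERT INTO topics VALUES (?, ?)",
    [("rlhf", True), ("tokenizers", True), ("moe", False)],
)
topics = load_active_topics(conn)
```

Keeping the `active` flag in the table lets a topic be paused without deleting its row.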

2. Per topic: Scavio search across 3 surfaces

Issue the search, reddit_search, and youtube_search calls in parallel.
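One way to run the three searches concurrently is a small thread pool, since the calls are I/O-bound. This is a sketch under assumptions: the endpoint paths are the ones used in the Python implementation on this page, and `fetch` is a hypothetical callable that takes `(endpoint, topic)` and returns a list of result dicts, so the HTTP call can be swapped or mocked.

```python
from concurrent.futures import ThreadPoolExecutor

# Endpoint paths as they appear in the Python implementation on this page.
ENDPOINTS = ("search", "reddit/search", "youtube/search")


def discover_parallel(topic, fetch, endpoints=ENDPOINTS):
    """Call fetch(endpoint, topic) for every endpoint concurrently.

    Results are merged in endpoint order, so the output is deterministic
    even though the requests finish in any order.
    """
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        batches = list(pool.map(lambda ep: fetch(ep, topic), endpoints))
    return [item for batch in batches for item in batch]
```

An exception raised inside `fetch` propagates to the caller when the results are collected, so one failing surface aborts the topic unless `fetch` handles the error itself.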

3. Dedupe candidate URLs against Qdrant payload index

Skip URLs already ingested.
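Exact string comparison treats trivially different forms of the same page as new URLs. The sketch below normalizes each candidate before the lookup and also dedupes within the batch. It assumes that `utm_*` query parameters, fragments, and trailing slashes never change the page, which does not hold for every site; `is_ingested` is a hypothetical callable standing in for the Qdrant lookup.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: these parameters never change the page


def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def new_urls(candidates, is_ingested):
    """Return canonical URLs that are neither repeated in the batch nor already stored."""
    seen, fresh = set(), []
    for url in candidates:
        key = canonical_url(url)
        if key in seen or is_ingested(key):
            continue
        seen.add(key)
        fresh.append(key)
    return fresh
```

For the lookup to match, the URL written to the Qdrant payload at upsert time would need to be the canonical form as well.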

4. Per new URL: Scavio /extract for markdown

Markdown is cleaner than raw HTML and saves embedding tokens.

5. Chunk + embed + upsert

Split each document into 500-token chunks, embed them with your embedding model, and upsert them to Qdrant with the source URL in the payload.
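The chunking and point-building part of this step can be sketched without committing to a particular embedding model. Two assumptions are worth flagging: whitespace-separated words stand in for tokens (a real tokenizer would give exact counts), and `embed` is a hypothetical callable that maps a string to a vector. Point IDs are derived from the URL and chunk index so that re-running the job overwrites the same points rather than creating duplicates.

```python
import uuid


def chunk_words(text: str, size: int = 500) -> list[str]:
    """Split text into blocks of at most `size` words (a rough proxy for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_points(url: str, topic: str, markdown: str, embed, size: int = 500) -> list[dict]:
    """Build one point per chunk, with the source URL carried in the payload."""
    points = []
    for i, chunk in enumerate(chunk_words(markdown, size)):
        points.append({
            # Deterministic UUID: the same URL and chunk index always map to the same ID.
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{url}#{i}")),
            "vector": embed(chunk),
            "payload": {"url": url, "topic": topic, "chunk": i, "text": chunk},
        })
    return points
```

Each dict uses the `id`, `vector`, and `payload` fields of a Qdrant point, so it can be converted to a `PointStruct` and passed to the client's `upsert` call for your collection.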

6. Log new-doc count + per-topic cost

Record how many new documents each topic added and what it cost, and enforce a per-topic cost budget as a guardrail.
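A per-topic guardrail can be as simple as a counter that is checked before every paid call. The sketch below is illustrative only: `TopicBudget`, `ingest_within_budget`, and the one-credit-per-call cost are assumptions, and real per-call pricing should come from your Scavio plan. `ingest_one` is a hypothetical callable that returns True when a URL produced a new document.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("wiki_ingest")


@dataclass
class TopicBudget:
    topic: str
    max_credits: int          # hypothetical per-topic ceiling for one run
    used_credits: int = 0
    new_docs: int = 0

    def can_spend(self, credits: int = 1) -> bool:
        return self.used_credits + credits <= self.max_credits

    def spend(self, credits: int = 1) -> None:
        self.used_credits += credits

    def report(self) -> str:
        line = (f"topic={self.topic} new_docs={self.new_docs} "
                f"credits={self.used_credits}/{self.max_credits}")
        log.info(line)
        return line


def ingest_within_budget(budget: TopicBudget, urls, ingest_one) -> str:
    """Ingest URLs until the topic's budget is used up, then log a summary line."""
    for url in urls:
        if not budget.can_spend():
            log.warning("budget exhausted for %s", budget.topic)
            break
        budget.spend()  # assumed cost: one credit per extract call
        if ingest_one(url):
            budget.new_docs += 1
    return budget.report()
```

Stopping a topic when its budget runs out keeps one noisy topic from consuming the credits of the whole run.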

Python Implementation

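import os
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

API = 'https://api.scavio.dev/api/v1'
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
COLLECTION = 'llm_wiki'  # assumed collection name; use your own
qdrant = QdrantClient(url=os.environ['QDRANT_URL'])

def discover(topic):
    # Query the web, Reddit, and YouTube search endpoints and merge the results.
    results = []
    for endpoint in ['search', 'reddit/search', 'youtube/search']:
        resp = requests.post(f'{API}/{endpoint}', headers=H, json={'query': topic}, timeout=30)
        resp.raise_for_status()
        r = resp.json()
        results.extend(r.get('organic_results', []) + r.get('posts', []) + r.get('videos', []))
    return results

def already_ingested(url):
    # True if any point in the collection already carries this URL in its payload.
    by_url = Filter(must=[FieldCondition(key='url', match=MatchValue(value=url))])
    return qdrant.count(COLLECTION, count_filter=by_url, exact=True).count > 0

def store(url, topic, md):
    # Step 5 (chunk + embed + upsert) depends on your embedding model, so it is left as a stub.
    raise NotImplementedError

def ingest_topic(topic):
    # Extract and store every candidate URL that is not in Qdrant yet.
    for c in discover(topic):
        url = c.get('link') or c.get('url')
        if not url or already_ingested(url):
            continue
        resp = requests.post(f'{API}/extract', headers=H,
                             json={'url': url, 'format': 'markdown'}, timeout=60)
        resp.raise_for_status()
        md = resp.json().get('markdown', '')
        if md:  # skip pages where extraction returned nothing
            store(url, topic, md)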

JavaScript Implementation

// Same flow in TS via Qdrant JS client + Scavio fetch calls.

Platforms Used

Google

Web search with knowledge graph, PAA, and AI overviews

Reddit

Community, posts & threaded comments from any subreddit

YouTube

Video search with transcripts and metadata

Frequently Asked Questions


This workflow runs from a daily cron job at 6am over the active topic list.

This workflow uses the following Scavio platforms: Google, Reddit, and YouTube. Each platform is called through the same unified API.

Scavio's free tier includes 500 credits per month with no credit card required, which is enough to test and validate this workflow before scaling it.
