Overview
Daily wiki-ingestion: per topic, pull new sources from Google + Reddit + YouTube via Scavio, dedupe by URL, extract markdown, and embed into Qdrant.
Trigger
Daily cron at 6am for the active topic list
Schedule
Daily at 6am
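The schedule above can be expressed as a crontab entry; the script path and entry-point name are assumptions for illustration:

```
# Run ingestion for all active topics every day at 06:00 (server time)
0 6 * * * /usr/bin/python3 /opt/wiki-ingest/ingest.py
```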
Workflow Steps
Iterate active topics
Pull topic list from a Postgres table or YAML config.
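If the topic list lives in a YAML config rather than Postgres, a minimal layout might look like this (file name, keys, and topics are assumptions):

```yaml
# topics.yaml — hypothetical layout
topics:
  - vector-databases
  - rust-async
  - llm-evals
```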
Per topic: Scavio search across 3 surfaces
search, reddit/search, and youtube/search calls in parallel.
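The three-surface fan-out can run concurrently with a thread pool. A sketch with an injectable `fetch` callable so it can be exercised without the network; `fetch(endpoint, query)` is an assumption standing in for the real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

ENDPOINTS = ['search', 'reddit/search', 'youtube/search']

def parallel_discover(topic, fetch, endpoints=ENDPOINTS):
    """Issue one search per surface in parallel.

    `fetch(endpoint, query)` must return a list of result dicts.
    """
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = [pool.submit(fetch, ep, topic) for ep in endpoints]
        results = []
        for f in futures:
            results.extend(f.result())
    return results
```

In production `fetch` would wrap the `requests.post` call shown in the Python implementation below.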
Dedupe candidate URLs against Qdrant payload index
Skip URLs already ingested.
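One way to implement the skip check, sketched against Qdrant's REST points-count endpoint with an exact-match payload filter. The collection name is an assumption; the same check can be done through the `qdrant_client` count call:

```python
import os
import requests

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")
COLLECTION = "wiki_ingest"  # assumption: collection name

def url_filter(url):
    """Qdrant payload filter matching an exact `url` payload field."""
    return {"must": [{"key": "url", "match": {"value": url}}]}

def already_ingested(url):
    """True if any point in the collection already carries this URL."""
    r = requests.post(
        f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
        json={"filter": url_filter(url), "exact": True},
    )
    return r.json()["result"]["count"] > 0
```

For this to stay fast at scale, create a payload index on the `url` field when the collection is set up.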
Per new URL: Scavio /extract for markdown
Cleaner than raw HTML; saves embedding tokens.
Chunk + embed + upsert
Chunk to 500-token blocks, embed via your embedding model, upsert to Qdrant with the URL in the payload.
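A sketch of this step. The 500-token budget is approximated by whitespace tokens, `embed` is a placeholder for whatever embedding model you use, and the upsert goes through Qdrant's REST points endpoint; the collection name is an assumption:

```python
import os
import uuid
import requests

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")
COLLECTION = "wiki_ingest"  # assumption: collection name

def chunk(text, max_tokens=500):
    """Greedy whitespace chunking; real counts depend on your tokenizer."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def embed(texts):
    """Placeholder: call your embedding model and return one vector per text."""
    raise NotImplementedError

def store(url, topic, md):
    """Chunk the markdown, embed each chunk, upsert with the URL in the payload."""
    chunks = chunk(md)
    vectors = embed(chunks)
    points = [
        {"id": str(uuid.uuid4()),
         "vector": v,
         "payload": {"url": url, "topic": topic, "text": c}}
        for c, v in zip(chunks, vectors)
    ]
    requests.put(f"{QDRANT_URL}/collections/{COLLECTION}/points", json={"points": points})
```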
Log new-doc count + per-topic cost
Cost-budget guardrail per topic.
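The guardrail can be a small per-topic budget tracker checked before each paid call; the cap and cost figures are assumptions for illustration:

```python
class TopicBudget:
    """Stop ingesting for a topic once its estimated daily spend hits the cap."""

    def __init__(self, daily_cap_usd=1.00):
        self.cap = daily_cap_usd
        self.spent = {}

    def charge(self, topic, cost_usd):
        """Record an estimated cost; return True while the topic is within budget."""
        self.spent[topic] = self.spent.get(topic, 0.0) + cost_usd
        return self.spent[topic] <= self.cap
```

The ingest loop would call `budget.charge(topic, est_cost)` before each extract and log `spent` alongside the new-doc count.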
Python Implementation
import os
import requests
from qdrant_client import QdrantClient

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
qdrant = QdrantClient(url=os.environ['QDRANT_URL'])

def discover(topic):
    """Fan one query out across Scavio's three search surfaces."""
    results = []
    for endpoint in ['search', 'reddit/search', 'youtube/search']:
        r = requests.post(f'https://api.scavio.dev/api/v1/{endpoint}',
                          headers=H, json={'query': topic}).json()
        results.extend(r.get('organic_results', []) + r.get('posts', []) + r.get('videos', []))
    return results

def ingest_topic(topic):
    candidates = discover(topic)
    for c in candidates:
        url = c.get('link') or c.get('url')
        if not url or already_ingested(url):  # already_ingested: Qdrant payload lookup
            continue
        md = requests.post('https://api.scavio.dev/api/v1/extract',
                           headers=H,
                           json={'url': url, 'format': 'markdown'}).json().get('markdown', '')
        store(url, topic, md)  # store: chunk + embed + upsert

JavaScript Implementation
// Same flow in TS via the Qdrant JS client + Scavio fetch calls.

Platforms Used
Google
Web search with knowledge graph, PAA, and AI overviews
Reddit
Community posts & threaded comments from any subreddit
YouTube
Video search with transcripts and metadata