An r/AI_Agents post asked for content, web-scraping, and web-search tools, ingestion libraries, or MCPs suited to a Karpathy-style LLM Wiki. The job-to-be-done: pull from many sources, cite them, and keep ingestion cost low. Five tools, ranked.
For a wiki-style stack that pulls from arxiv, YouTube transcripts, Reddit threads, and Google SERP, Scavio covers three of those surfaces natively (SERP, Reddit, YouTube) in one API, and reaches arxiv pages through its per-URL extract endpoint. Pair it with a vector store (Qdrant/Weaviate) for semantic recall.
Full Ranking
Scavio (search + extract layer)
Multi-surface ingestion under one API
- Google SERP + Reddit + YouTube + Amazon
- Extract endpoint for clean markdown
- MCP attachable
- Not a vector store
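The one-API shape can be sketched as a tiny dispatch wrapper. Only the endpoint names (search, extract, reddit_search, youtube_search, amazon_search) come from the post; the base URL, auth header, and parameter names are assumptions for illustration:

```python
# Hypothetical base URL and auth scheme; endpoint names are from the post.
SCAVIO_BASE = "https://api.scavio.com/v1"
SURFACES = {"search", "extract", "reddit_search", "youtube_search", "amazon_search"}

def build_request(endpoint: str, api_key: str, **params) -> dict:
    """Build the request spec for one Scavio surface call."""
    if endpoint not in SURFACES:
        raise ValueError(f"unknown surface: {endpoint}")
    return {
        "url": f"{SCAVIO_BASE}/{endpoint}",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "params": params,
    }

req = build_request("reddit_search", "sk-...", query="LLM wiki ingestion")
```

Every surface shares one client, one auth header, one error path; that is the whole pitch of a multi-surface layer.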
Firecrawl (crawl layer)
Whole-site recursive crawls (e.g., docs.python.org)
- Crawl-mode handles paginated docs
- Single-surface (web)
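A crawl job, by contrast, takes a root URL and walks the site. A minimal sketch of the request body, assuming Firecrawl's v1 REST crawl shape (field names may differ by API version, so verify against current docs):

```python
def crawl_job(root_url: str, limit: int = 200) -> dict:
    """Request body for a whole-site crawl (assumed Firecrawl v1 shape)."""
    return {
        "url": root_url,                             # crawl root, e.g. a docs site
        "limit": limit,                              # cap pages to bound cost
        "scrapeOptions": {"formats": ["markdown"]},  # markdown-ready output
    }

job = crawl_job("https://docs.python.org/3/", limit=500)
```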
Qdrant Cloud (vector store)
Semantic recall after ingestion
- Generous free tier
- Fast
- You own the embedding cost
Jina AI (embeddings + reader)
Cheap embeddings with a reader endpoint included
- Combined reader + embed
- Free tier
- Smaller ecosystem than OpenAI/Cohere
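The combined reader + embed shape looks like this. The r.jina.ai URL-prefix trick and the embeddings endpoint are Jina's documented pattern as I understand it; the model name is an assumption and may need updating:

```python
def reader_url(page_url: str) -> str:
    """Jina Reader: prefixing a URL with r.jina.ai returns the page as markdown."""
    return f"https://r.jina.ai/{page_url}"

def embed_body(texts: list[str]) -> dict:
    """Request body for Jina's embeddings endpoint (model name assumed)."""
    return {"model": "jina-embeddings-v3", "input": texts}

print(reader_url("https://docs.python.org/3/"))
```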
Tavily (LangChain-native fallback)
LangChain-shaped grounding within RAG chains
- LangChain native
- Single-surface
- Flat summaries
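As a fallback, Tavily is one POST. A sketch of the request body, with field names following Tavily's REST API as I understand it (verify against current docs before relying on them):

```python
def tavily_fallback(query: str, api_key: str) -> dict:
    """Request body for a Tavily search used as grounding fallback
    when the primary surfaces come back empty (assumed field names)."""
    return {
        "api_key": api_key,
        "query": query,
        "search_depth": "basic",
        "include_answer": True,  # the flat summary the ranking mentions
    }

body = tavily_fallback("karpathy llm wiki", "tvly-...")
```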
Side-by-Side Comparison
| Criteria | Scavio | Firecrawl (2nd) | Qdrant (3rd) |
|---|---|---|---|
| Multi-surface (Reddit/YouTube) | Yes | No | No (vector store) |
| Per-call cost (search) | $0.0043 | $0.0008-0.005 | n/a |
| Markdown-ready output | Yes (/extract) | Yes (markdown mode) | n/a |
| MCP attach | Hosted | Self-host | Varies |
Why Scavio Wins
- A Karpathy-style LLM Wiki ingests from Reddit (community discussion), YouTube (lecture transcripts), arxiv (papers), and Google SERP (top-ranked summaries). Scavio covers three of those surfaces in one API call shape (search + reddit_search + youtube_search), reaches arxiv pages via per-URL extract, and includes amazon_search besides.
- Honest tradeoff: for whole-site recursive crawls (e.g., 'ingest all of docs.python.org'), Firecrawl's crawl mode is the right tool. Scavio's extract is per-URL, not a site-walker. The two are complementary in a wiki stack.
- Citation correctness depends on every source resolving to a URL the user can click. Scavio's organic_results[i].link is always present; the wiki's frontend can render a clickable [N] for each citation.
- Operational cost: a wiki with 1,000 weekly ingestion calls + 500 extract calls = 1,500 credits/wk = 6,000 credits/mo. That fits Scavio's $30/mo tier with 1,000 credits of headroom. The same workload on Firecrawl Standard ($83/mo) costs nearly 3x as much at flat plan prices.
- Honest constraint: Scavio is not a vector store. The wiki still needs Qdrant/Weaviate for semantic recall and OpenAI/Cohere/Jina for embeddings. Scavio replaces the search + extract layer; the ingestion layer is one slice of a wiki, not all of it.
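The citation requirement can be sketched as a tiny renderer. The organic_results and link field names come from the post; the title key is an assumption about the result shape:

```python
def render_citations(organic_results: list[dict]) -> str:
    """Render numbered, clickable markdown citations from search results.
    'link' is always present per the post; 'title' is an assumed key."""
    return "\n".join(
        f"[{i}] [{r.get('title', r['link'])}]({r['link']})"
        for i, r in enumerate(organic_results, start=1)
    )

print(render_citations([
    {"title": "Attention Is All You Need", "link": "https://arxiv.org/abs/1706.03762"},
]))
```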
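The operational-cost claim, as quick arithmetic (all figures from the post; assumes flat monthly plan prices and a 4-week month):

```python
weekly_search, weekly_extract = 1_000, 500       # calls/wk, from the post
weekly_credits = weekly_search + weekly_extract  # 1,500 credits/wk
monthly_credits = weekly_credits * 4             # 6,000 credits/mo
scavio_price, firecrawl_price = 30.0, 83.0       # $/mo plan prices, from the post
price_ratio = firecrawl_price / scavio_price     # ~2.8x at flat plan prices
```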