The Problem
Valuable knowledge is locked inside YouTube videos with no structured way to search across transcripts. Developers want to build internal knowledge bases from tutorials, conference talks, and educational content but YouTube has no transcript search API.
The Scavio Solution
Three-step pipeline: (1) Use Scavio YouTube search to discover relevant videos by topic, (2) Extract transcripts using youtube-transcript-api, (3) Index transcripts in MongoDB with weighted text indexes. MongoDB text search with 10x weight on title and 1x on transcript ensures short title matches rank above incidental transcript keyword hits.
Before
Before the knowledge base, a developer education team manually searched YouTube for tutorials, watched parts of each video, and kept links in a spreadsheet. Finding 'which video covered React Server Components error handling' required checking 20 bookmarks.
After
After building the knowledge base, the team searches 'React Server Components error handling' and gets 8 matching transcripts ranked by relevance, with direct links. New videos are discovered daily via automated topic searches. The knowledge base contains 2,400 indexed transcripts after 3 months of daily ingestion.
Who It Is For
Developer education teams, knowledge management programs, and content researchers who want searchable access to YouTube video content.
Key Benefits
- YouTube search discovers relevant videos automatically
- MongoDB text indexes enable full-text search across transcripts
- Weighted indexes prioritize title matches over transcript noise
- Daily pipeline adds new videos without manual curation
- 50 daily discovery queries cost $0.25
Python Example
import requests, os, json
from pymongo import MongoClient
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
client = MongoClient(os.environ.get('MONGO_URI', 'mongodb://localhost:27017'))
db = client['youtube_kb']
collection = db['transcripts']
def discover_videos(topic: str) -> list:
"""Find relevant YouTube videos via search."""
r = requests.post('https://api.scavio.dev/api/v1/search', headers=H,
json={'platform': 'youtube', 'query': topic}, timeout=10).json()
return [{'title': o.get('title'), 'url': o.get('link'),
'channel': o.get('channel')}
for o in r.get('organic_results', [])[:10]]
def index_transcript(video: dict, transcript_text: str):
collection.update_one(
{'url': video['url']},
{'$set': {'title': video['title'], 'transcript': transcript_text,
'channel': video.get('channel')}},
upsert=True)
# Create weighted text index
collection.create_index([('title', 'text'), ('transcript', 'text')],
weights={'title': 10, 'transcript': 1})
videos = discover_videos('react server components tutorial')
for v in videos[:3]:
print(f"{v['title']}: {v['url']}")JavaScript Example
const H = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };
async function discoverVideos(topic) {
const r = await fetch('https://api.scavio.dev/api/v1/search', {
method: 'POST', headers: H,
body: JSON.stringify({ platform: 'youtube', query: topic })
}).then(r => r.json());
return (r.organic_results || []).slice(0, 10).map(o => ({
title: o.title, url: o.link, channel: o.channel
}));
}
const videos = await discoverVideos('react server components tutorial');
videos.slice(0, 3).forEach(v => console.log(`${v.title}: ${v.url}`));Platforms Used
YouTube
Video search with transcripts and metadata