Daily YouTube Transcript Search and Index Pipeline

Search YouTube daily for topic videos, extract transcripts, and index them in MongoDB for full-text search.

Overview

Teams building knowledge bases from video content need a repeatable pipeline to discover new YouTube videos on their topics, extract transcripts, and make them searchable. This workflow runs at 6 AM daily, searches YouTube via Scavio for your configured topics, extracts transcript data from the top results, indexes the content into MongoDB with full-text search enabled, and runs a verification query to confirm the new content is retrievable. No manual video hunting or copy-pasting transcripts.

Trigger

Cron schedule (daily at 6 AM UTC)

Schedule

Daily at 6 AM UTC
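The schedule above can be expressed as a standard crontab entry. The interpreter path, script path, and log path below are placeholders for wherever you deploy the pipeline, not part of this workflow's configuration:

```shell
# m h dom mon dow  command — run the pipeline daily at 06:00 UTC
0 6 * * * /usr/bin/python3 /opt/pipelines/youtube_transcripts.py >> /var/log/yt_pipeline.log 2>&1
```

Make sure the cron daemon's timezone is UTC (or adjust the hour field) so the run actually lands at 6 AM UTC.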

Workflow Steps

1. Search YouTube for topics

Query Scavio YouTube platform for each configured topic. Collect video IDs, titles, and descriptions from top results.

2. Extract transcripts

For each discovered video, fetch the transcript text. Skip videos already in the database to avoid duplicates.
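The duplicate check in this step can be sketched as a plain filter over the search results. The helper name and the way the known-ID set is loaded are illustrative, not part of the original code; in practice `known_ids` would come from something like `db.videos.distinct("video_id")`:

```python
def filter_new_videos(videos, known_ids):
    """Keep only videos with a non-empty video_id that is not already indexed."""
    return [v for v in videos if v["video_id"] and v["video_id"] not in known_ids]

# Example: three candidates — one already indexed, one with no parseable ID
candidates = [{"video_id": "abc123"}, {"video_id": "xyz789"}, {"video_id": ""}]
fresh = filter_new_videos(candidates, known_ids={"abc123"})
```

Filtering before the transcript fetch saves both API calls and database writes on videos you already have.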

3. Index into MongoDB

Insert transcript documents with metadata (video ID, title, channel, publish date, topic) into MongoDB with text index.
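A minimal sketch of the document shape this step inserts, merging the search metadata with the fetched transcript text. Channel and publish date are omitted because the search response in the sample code below does not include them; the helper name and field layout are assumptions:

```python
from datetime import datetime, timezone

def build_transcript_doc(video, transcript_text):
    """Combine search metadata and transcript text into one MongoDB document."""
    return {
        "video_id": video["video_id"],
        "title": video["title"],
        "description": video.get("description", ""),
        "topic": video["topic"],
        "transcript": transcript_text,
        "indexed_at": datetime.now(timezone.utc).isoformat(),
    }

doc = build_transcript_doc(
    {"video_id": "abc123", "title": "SERP API tutorial", "topic": "serp api tutorial"},
    "Welcome to this tutorial...",
)
```

Upserting on `video_id` (as in the pseudo-code further down) keeps re-runs idempotent.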

4. Run search verification

Execute a test query against the MongoDB text index to confirm new transcripts are retrievable and ranked correctly.
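The verification query can be sketched as a pymongo-style `$text` search sorted by relevance score. The filter, projection, and sort are plain dicts, shown here without a live connection; the collection and search term are illustrative:

```python
def build_verification_query(term):
    """Build the filter, projection, and sort for a MongoDB text-index search."""
    flt = {"$text": {"$search": term}}
    proj = {"score": {"$meta": "textScore"}, "title": 1, "video_id": 1}
    sort = [("score", {"$meta": "textScore"})]
    return flt, proj, sort

flt, proj, sort = build_verification_query("serp api tutorial")
# With a live connection this would run as:
# db.videos.find(flt, proj).sort(sort).limit(5)
```

If the query returns the freshly inserted documents near the top, the text index is working and the run can be marked healthy.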

Python Implementation

import os
import requests
from datetime import datetime, timezone

H = {"x-api-key": os.environ["SCAVIO_API_KEY"]}

TOPICS = ["search api integration 2026", "ai agent grounding tools", "serp api tutorial"]

def search_youtube_topics(topic):
    """Search YouTube for a topic and return video metadata."""
    r = requests.post("https://api.scavio.dev/api/v1/search", headers=H,
        json={"platform": "youtube", "query": topic}, timeout=10)
    r.raise_for_status()
    videos = []
    for item in r.json().get("organic", [])[:5]:
        link = item.get("link", "")
        # "v=" may be followed by other query params; keep only the ID itself
        video_id = link.split("v=")[-1].split("&")[0] if "v=" in link else ""
        videos.append({
            "video_id": video_id,
            "title": item.get("title", ""),
            "description": item.get("snippet", ""),
            "link": link,
            "topic": topic,
            "indexed_at": datetime.now(timezone.utc).isoformat()
        })
    return videos

# Collect all videos
all_videos = []
for topic in TOPICS:
    videos = search_youtube_topics(topic)
    all_videos.extend(videos)
    print(f"[YOUTUBE] {topic}: {len(videos)} videos found")

# MongoDB insert (pseudo-code - replace with your pymongo connection)
# from pymongo import MongoClient
# db = MongoClient(os.environ["MONGO_URI"]).transcripts
# db.videos.create_index([("title", "text"), ("description", "text")])
# for v in all_videos:
#     db.videos.update_one({"video_id": v["video_id"]}, {"$set": v}, upsert=True)

print(f"\nTotal videos to index: {len(all_videos)}")
for v in all_videos[:3]:
    print(f"  {v['title'][:80]} | {v['link']}")
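If you want more robust video-ID parsing than the inline `split("v=")`, a small helper using `urllib.parse` also handles `youtu.be` short links. Whether the search API ever returns short links is an assumption; the helper name is illustrative:

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(link):
    """Pull a YouTube video ID out of watch URLs or youtu.be short links."""
    parsed = urlparse(link)
    if parsed.hostname and parsed.hostname.endswith("youtu.be"):
        return parsed.path.lstrip("/")
    # Standard watch URL: the ID lives in the "v" query parameter
    return parse_qs(parsed.query).get("v", [""])[0]
```

This avoids edge cases where `v=` appears mid-URL or trailing parameters get glued onto the ID.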

JavaScript Implementation

const H = {"x-api-key": process.env.SCAVIO_API_KEY, "Content-Type": "application/json"};

const TOPICS = ["search api integration 2026", "ai agent grounding tools", "serp api tutorial"];

async function searchYoutubeTopics(topic) {
  const res = await fetch("https://api.scavio.dev/api/v1/search", {
    method: "POST", headers: H,
    body: JSON.stringify({platform: "youtube", query: topic})
  });
  if (!res.ok) throw new Error(`Search request failed: ${res.status}`);
  const r = await res.json();
  return (r.organic || []).slice(0, 5).map(item => {
    const link = item.link || "";
    // "v=" may be followed by other query params; keep only the ID itself
    const videoId = link.includes("v=") ? link.split("v=").pop().split("&")[0] : "";
    return {
      videoId,
      title: item.title || "",
      description: item.snippet || "",
      link,
      topic,
      indexedAt: new Date().toISOString()
    };
  });
}

(async () => {
  const allVideos = [];
  for (const topic of TOPICS) {
    const videos = await searchYoutubeTopics(topic);
    allVideos.push(...videos);
    console.log(`[YOUTUBE] ${topic}: ${videos.length} videos found`);
  }
  // MongoDB insert (pseudo-code - replace with your mongodb connection)
  // const { MongoClient } = require("mongodb");
  // const db = (await MongoClient.connect(process.env.MONGO_URI)).db("transcripts");
  // await db.collection("videos").createIndex({title: "text", description: "text"});
  // for (const v of allVideos) {
  //   await db.collection("videos").updateOne({videoId: v.videoId}, {$set: v}, {upsert: true});
  // }
  console.log(`\nTotal videos to index: ${allVideos.length}`);
  allVideos.slice(0, 3).forEach(v => console.log(`  ${v.title.slice(0, 80)} | ${v.link}`));
})();

Platforms Used

YouTube

Video search with transcripts and metadata

Frequently Asked Questions

What does this workflow do?

It searches YouTube via Scavio for your configured topics, extracts transcript data from the top results, indexes the content into MongoDB with full-text search enabled, and runs a verification query to confirm the new content is retrievable.

How is the workflow triggered?

It runs on a cron schedule, daily at 6 AM UTC.

Which Scavio platforms does it use?

YouTube. Each platform is called via the same unified API endpoint.

Can I try it for free?

Yes. Scavio's free tier includes 250 credits per month with no credit card required. That is enough to test and validate this workflow before scaling it.
