How This Works
youtube-to-docs automates the conversion of YouTube videos into structured documentation and multimodal assets. This document explains the technical workflow and core components of the system.
System Architecture
The tool operates as a pipeline that ingests YouTube content, processes it through various AI models, and outputs structured data and files.
1. Input Resolution
The entry point (youtube_to_docs/main.py) accepts flexible inputs:
- YouTube URLs: e.g. https://www.youtube.com/watch?v=atmGAHYpf_c
- Video IDs: Direct processing (e.g. atmGAHYpf_c).
- Playlists: Resolves all video IDs within a playlist (e.g. PL8ZxoInteClyHaiReuOHpv6Z4SPrXtYtW).
- Channel Handles: Resolves a channel's "Uploads" playlist via its handle (e.g. @mga-hgo1740).
- Comma-separated lists: Processes multiple videos at once (e.g. KuPc06JgI_A,GalhDyf3F8g).
Key Component: youtube_to_docs.transcript.resolve_video_ids uses the YouTube Data API to fetch lists of videos when a Playlist or Channel is provided.
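For example, each input form can be passed straight to the CLI (a sketch assuming the video reference is the first positional argument; run `youtube-to-docs --help` to confirm the exact syntax):

```bash
# Single video, by URL or by bare ID
youtube-to-docs "https://www.youtube.com/watch?v=atmGAHYpf_c"
youtube-to-docs atmGAHYpf_c

# Every video in a playlist, or a channel's uploads via its handle
youtube-to-docs PL8ZxoInteClyHaiReuOHpv6Z4SPrXtYtW
youtube-to-docs @mga-hgo1740

# Several videos in one run
youtube-to-docs KuPc06JgI_A,GalhDyf3F8g
```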
2. Transcript Fetching
For each video, the tool fetches or generates a transcript:
- YouTube Source (default): Fetches the existing transcript (auto-captions or manual) directly from YouTube.
- AI STT Source: If an AI model name is specified (e.g. `gemini-3-flash-preview`, `gcp-chirp3`, `aws-transcribe`), the tool extracts audio from the video via `yt-dlp` and passes it to a Speech-to-Text model for a fresh transcript.
  - For `gcp-` models (Cloud Speech-to-Text V2), `GOOGLE_CLOUD_PROJECT` is required and `YTD_GCS_BUCKET_NAME` is recommended.
  - For `aws-transcribe`, `YTD_S3_BUCKET_NAME` is required.
  - When `--no-youtube-summary` is set, the secondary summary pass from the YouTube transcript is skipped.
- SRT Generation: For both YouTube and AI sources, the system generates a standardized `.srt` file. This is crucial for accessibility and provides the raw timing data used for precision Q&A alignment.
Note on Auto-Captions: Automatic captions are generated by speech recognition and may have accuracy issues. They are not always immediately available.
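As a minimal sketch, the cloud credentials above are supplied through environment variables before a run (the variable names are the ones listed above; the values are placeholders):

```bash
# GCP Cloud Speech-to-Text V2 (gcp- STT models)
export GOOGLE_CLOUD_PROJECT=my-project        # required; placeholder value
export YTD_GCS_BUCKET_NAME=my-staging-bucket  # recommended; placeholder value

# AWS Transcribe
export YTD_S3_BUCKET_NAME=my-transcribe-bucket  # required; placeholder value
```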
3. The LLM Pipeline
Text processing is handled by Large Language Models (LLMs) defined in youtube_to_docs/llms.py. The pipeline is model-agnostic, supporting:
- Google Gemini / Gemma (direct API, models prefixed gemini- or gemma-)
- GCP Vertex AI (prefixed with vertex-, e.g. vertex-claude-haiku-4-5@20251001)
- AWS Bedrock (prefixed with bedrock-, e.g. bedrock-nova-2-lite-v1)
- Azure Foundry (prefixed with foundry-, e.g. foundry-gpt-5-mini)
You can also pass a comma-separated list to run multiple models simultaneously, or use the --all flag with a named suite (e.g. gemini-flash, gemini-pro, gemini-flash-pro-image, gcp-pro) to set the model, TTS, infographic model, and transcript source all at once.
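For example (`--all` is the flag named above; how the comma-separated model list is passed is not spelled out here, so the `--model` flag name below is a hypothetical placeholder):

```bash
# Named suite: sets model, TTS, infographic model, and transcript source at once
youtube-to-docs atmGAHYpf_c --all gemini-flash

# Hypothetical flag name for a comma-separated multi-model run
youtube-to-docs atmGAHYpf_c --model gemini-3-flash-preview,foundry-gpt-5-mini
```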
For each video, the specified model performs these tasks:
- Speaker Extraction:
  - Input: Full transcript.
  - Task: Identify speakers and their professional titles/roles.
  - Output: A structured list (e.g., "Speaker 1 (Host)").
- Q&A Generation:
  - Input: Full transcript + identified speakers + timing reference (SRT).
  - Task: Extract key questions and answers discussed in the video.
  - Precision Timing: If a YouTube SRT is available, it is passed to the LLM as a "Timing Reference". The model uses this to align the high-quality speech identification of the AI transcript with the pinpoint accuracy of YouTube's timing.
  - Output: A Markdown table with columns for Questioner, Question, Responder, Answer, Timestamp, and Timestamp URL (formatted as a Markdown hyperlink).
- Summarization:
  - Input: Full transcript + video metadata.
  - Task: Create a concise, comprehensive summary of the content.
  - Output: A Markdown-formatted summary.
- Tag Generation:
  - Input: Full transcript.
  - Task: Generate up to 5 comma-separated tags for the transcript.
  - Output: A comma-separated string of tags.
4. Translation Support
The --translate {model}-{language} argument enables multilingual output (e.g., --translate gemini-3-flash-preview-es). Use aws-translate-{language} (e.g., --translate aws-translate-es) to call the AWS Translate service directly, or gcp-translate-{language} (e.g., --translate gcp-translate-es) for the Google Cloud Translation API. Large texts are automatically chunked to respect per-request limits.
All content is generated in English first; the tool then iterates over each target language:
- Transcript: Tries to fetch a native YouTube transcript in the target language. Falls back to translating the English transcript using the specified model.
- SRT: The SRT file is also translated alongside the transcript.
- LLM outputs: Summaries, one-sentence summaries, Q&A, and tags are generated fresh from the translated transcript using the same model.
- Infographic & TTS: When `-i` or `--tts` are also set, assets are produced in both English and the target language.
- Video: When `--combine-infographic-audio` is also set, one video is produced per language.
File names use ({model}-{language}) (e.g., (gemini-3-flash-preview-es), (aws-translate-es), or (gcp-translate-es)) to identify the translation source. CSV column headers use the shorter (lang) suffix (e.g., (es)).
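Putting it together (the positional input remains an assumption, as above):

```bash
# Spanish output via the LLM itself
youtube-to-docs atmGAHYpf_c --translate gemini-3-flash-preview-es

# Spanish output via the dedicated translation services
youtube-to-docs atmGAHYpf_c --translate aws-translate-es
youtube-to-docs atmGAHYpf_c --translate gcp-translate-es
```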
5. Multimodal Generation
Beyond text, the tool creates audio and visual assets:
- Text-to-Speech (TTS):
  - Uses TTS models (e.g. `gemini-2.5-flash-preview-tts-Kore`, `gcp-chirp3`, `aws-polly-Ruth`) to convert the generated summary into an audio file.
  - This allows users to "listen" to the video summary.
- Infographics:
  - Uses image generation models to create a visual representation of the summary.
  - Supported providers:
    - Google: Gemini, Imagen.
    - AWS Bedrock: Titan Image Generator, Nova Canvas (requires `AWS_BEARER_TOKEN_BEDROCK`).
    - Azure Foundry: GPT Image models (requires `AZURE_FOUNDRY_ENDPOINT` and `AZURE_FOUNDRY_API_KEY`).
  - The prompt includes the video title and the generated summary text to ensure relevance.
- Multimodal Alt Text:
  - Once an infographic is generated, an AI model processes the image bytes directly to generate descriptive alt text.
  - Post-processing: The tool automatically strips common prefixes such as "Alt text: " so the output is clean and ready for accessibility use.
  - This ensures infographics are accessible and searchable.
- Video Generation:
  - Combines the generated infographic (visual) and TTS audio (sound) into a single MP4 video file.
  - Uses `static-ffmpeg` to perform the merge, so no external FFmpeg installation is required.
  - This provides a shareable "video summary" format.
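A sketch of a fully multimodal run (`-i`, `--tts`, and `--combine-infographic-audio` are the flags named in this document; whether `--tts` takes a model argument as shown is an assumption):

```bash
# Summary audio + infographic, merged into a single MP4.
# Whether --tts takes a model argument as shown is an assumption.
youtube-to-docs atmGAHYpf_c \
  --tts gemini-2.5-flash-preview-tts-Kore \
  -i \
  --combine-infographic-audio
```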
6. Post-Processing
The --post-process argument accepts a JSON string of operations to run against the transcript after it is fetched. Results are added as additional columns in the output CSV.
Currently supported operations:
- word count: Counts case-insensitive occurrences of one or more words.
* '{"word count": "apple"}' → adds a Post-process: word count(apple) column.
* '{"word count": ["apple", "banana"]}' → adds one column per word.
7. Suggested Corrected Captions
The --suggest-corrected-captions argument uses an LLM to review an SRT file and suggest WCAG 2.1 Level AA / Section 508 compliant corrections. Output is saved to suggested-corrected-caption-files/ as a diff-style file (changed segments only).
Format: {model} or {model}-{source} (e.g., gemini-3-flash-preview, gemini-3-flash-preview-youtube, gemini-3-flash-preview-gcp-chirp3). If source is omitted, the most recent AI-generated SRT is used automatically. If speaker extraction has been run, speaker labels are included on speaker changes.
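For example:

```bash
# Review the YouTube-sourced SRT and write a diff-style correction file
youtube-to-docs atmGAHYpf_c --suggest-corrected-captions gemini-3-flash-preview-youtube

# Model only: the most recent AI-generated SRT is picked automatically
youtube-to-docs atmGAHYpf_c --suggest-corrected-captions gemini-3-flash-preview
```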
8. Output & Storage
The --outfile argument controls where the output CSV is saved:
- Local path (default): youtube-to-docs-artifacts/youtube-docs.csv.
- workspace / w: Saves to Google Drive (requires google-auth-oauthlib).
- sharepoint / s: Saves to Microsoft SharePoint (requires msal and xlsxwriter).
- memory / m: Keeps artifacts in memory (no files on disk). Useful for the web app where users want results returned directly without local file side-effects.
- none / n: Skips saving to a file; results are available in the log output only.
Storage is abstracted via youtube_to_docs/storage.py (LocalStorage, GoogleDriveStorage, M365Storage, MemoryStorage, NullStorage), so the rest of the pipeline is storage-agnostic.
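For example:

```bash
# Default: local CSV under youtube-to-docs-artifacts/
youtube-to-docs atmGAHYpf_c

# Google Drive, SharePoint (short form), in-memory, or no file at all
youtube-to-docs atmGAHYpf_c --outfile workspace
youtube-to-docs atmGAHYpf_c --outfile s
youtube-to-docs atmGAHYpf_c --outfile memory
youtube-to-docs atmGAHYpf_c --outfile none
```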
9. Cost Tracking
The system includes a pricing engine (youtube_to_docs/prices.py) that tracks token usage for every API call.
- It calculates costs for input and output tokens based on the specific model used.
- Costs are aggregated for speaker extraction, Q&A, summarization, and infographic generation.
- These estimates are saved directly into the output CSV.
Interfaces
The tool exposes three interfaces that all share the same core main() function:
CLI
Run `youtube-to-docs --help` for a full list of options.
MCP Server
youtube_to_docs/mcp_server.py exposes a process_video tool via the Model Context Protocol. This allows AI assistants (like Claude) to call the tool directly. Start with:
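One plausible invocation, assuming the server module is directly runnable:

```bash
# Assumption: youtube_to_docs/mcp_server.py runs as a module entry point
python -m youtube_to_docs.mcp_server
```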
Web App
A FastAPI web application (youtube_to_docs/app.py) provides a browser-based UI for submitting videos and monitoring progress. Start with:
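A typical way to launch a FastAPI application of this shape (the `app` attribute name is an assumption):

```bash
# Assumption: the ASGI object in youtube_to_docs/app.py is named `app`
uvicorn youtube_to_docs.app:app --reload
```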
The app exposes:
- POST /api/process — submits a video processing job and returns a job_id.
- GET /api/jobs/{job_id} — polls job status, output log, and artifact list.
- GET /api/jobs/{job_id}/stream — streams job output in real time via Server-Sent Events (SSE).
- GET /api/artifacts/{path} — downloads a generated artifact file.
- GET /api/model-suites — returns available model suite definitions.
Jobs run in the background via asyncio.to_thread, and artifacts are scanned from all known output directories once a job completes.
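A hedged end-to-end sketch with curl (the endpoint paths are the ones listed above; the request body field name and the port are assumptions):

```bash
# Submit a job; the "url" field name is hypothetical
curl -X POST http://localhost:8000/api/process \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=atmGAHYpf_c"}'
# → returns {"job_id": "..."}

# Poll status, or stream live output over Server-Sent Events
curl http://localhost:8000/api/jobs/<job_id>
curl -N http://localhost:8000/api/jobs/<job_id>/stream
```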
Data Organization
The final output is a structured CSV file (managed via polars) containing metadata, file paths, and AI outputs. Corresponding files are organized into subdirectories within a central artifacts folder:
youtube-to-docs-artifacts/
├── youtube-docs.csv # The main data file
├── transcript-files/ # Raw text transcripts (single long strings)
├── srt-files/ # Standardized SRT transcript files
├── audio-files/ # Extracted audio files (for AI transcription)
├── speaker-extraction-files/ # Identified speakers lists
├── qa-files/ # Markdown Q&A tables with timestamps
├── summary-files/ # Markdown summaries
├── one-sentence-summary-files/ # One-sentence summaries
├── tag-files/ # AI-generated tags files
├── infographic-files/ # Generated infographic images
├── infographic-alt-text/ # Multimodal alt text for infographics
├── suggested-corrected-caption-files/ # LLM-suggested WCAG 2.1 corrected SRT files
└── video-files/ # Combined infographic + audio videos
This structure ensures that while the CSV provides a high-level data view, the actual content is easily accessible as standalone files.