Usage Guide

youtube-to-docs is a versatile tool that converts YouTube content into structured documentation, including transcripts, summaries, audio, and infographics. It is primarily a Command Line Interface (CLI) tool but can also be used as a Python library.

Installation

The recommended way to run youtube-to-docs is using uvx.

Basic Usage

For basic usage (YouTube data fetching, local CSV output, standard Gemini models):

uvx youtube-to-docs --help

Optional Features (Extras)

To keep the installation footprint small, many features are optional. You can enable them by installing specific "extras":

| Extra | Description | Dependencies |
|---|---|---|
| audio | Recommended. Required for downloading audio (for TTS/transcription). | yt-dlp |
| video | Required for generating video files (combining audio & infographic). | static-ffmpeg |
| workspace | Required for saving to Google Drive. | google-auth-oauthlib, pypandoc, pypandoc_binary |
| m365 | Required for saving to SharePoint/OneDrive. | msal, fastexcel, xlsxwriter, pypandoc, pypandoc_binary |
| aws | AWS Bedrock support. | boto3, botocore[crt] |
| azure | Required if using Azure OpenAI models. | openai |
| gcp | Required if using Google Gemini, Gemma, or Vertex AI models. | google-genai, google-cloud-speech, google-cloud-storage, google-cloud-texttospeech, google-cloud-translate |
| app | Web app with browser-based UI. | fastapi, uvicorn |
| all | Installs all of the above. | All optional dependencies. |

How to use extras with uvx:

Use the --with flag followed by the package name and extras in brackets.

Example 1: Audio and Video support (Common)

uvx --with "youtube-to-docs[audio,video]" youtube-to-docs ...

Example 2: Google Drive + Audio support

uvx --with "youtube-to-docs[workspace,audio]" youtube-to-docs ...

Example 3: Everything (Full feature set)

uvx --with "youtube-to-docs[all]" youtube-to-docs ...

Installing uv

If you don't have uv installed, you can install it using the following commands:

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Setup

Before running the tool, ensure your environment is correctly configured with the necessary API keys and authentication files.

1. Environment Variables

Set the following environment variables based on the AI providers you intend to use.

| Variable | Description | Required For |
|---|---|---|
| YOUTUBE_DATA_API_KEY | API key for the YouTube Data API v3. | Fetching video metadata. |
| GEMINI_API_KEY | API key for Google Gemini/Gemma models. | Gemini/Gemma models (-m gemini..., -m gemma...). |
| PROJECT_ID | Google Cloud Project ID. | GCP Vertex models (-m vertex...), GCP STT (-t gcp...), and GCP TTS (--tts gcp...). |
| YTD_GCS_BUCKET_NAME | Google Cloud Storage bucket name (write access). | GCP STT models (-t gcp...) for temporary audio storage. |
| AWS_BEARER_TOKEN_BEDROCK | AWS Bearer Token. | AWS Bedrock models (-m bedrock...). |
| AZURE_FOUNDRY_ENDPOINT | Azure Foundry Endpoint URL. | Azure Foundry models (-m foundry...). |
| AZURE_FOUNDRY_API_KEY | Azure Foundry API Key. | Azure Foundry models (-m foundry...). |

2. Storage Authentication (Optional)

If you plan to save outputs to Google Drive (the workspace extra) or Microsoft SharePoint/OneDrive (the m365 extra), you need to configure authentication files in your home directory.

Google Drive (Workspace)

Create a file at ~/.google_client_secret.json with your Google Cloud OAuth 2.0 Client Secret JSON.

{
  "installed": {
    "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",
    "project_id": "your-project-id",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "YOUR_CLIENT_SECRET",
    "redirect_uris": ["http://localhost"]
  }
}
  • First Run: The tool will open a browser window to authenticate and generate a ~/.google_client_token.json file for future non-interactive use.

Microsoft 365 (SharePoint/OneDrive)

Create a file at ~/.azure_client.json with your Azure App Registration details.

{
  "client_id": "YOUR_CLIENT_ID",
  "authority": "https://login.microsoftonline.com/consumers"
}
  • Authority: Use .../consumers for personal accounts or .../YOUR_TENANT_ID for organizational accounts.
  • First Run: The tool will attempt to authenticate (silently or interactively) and cache the token in ~/.msal_token_cache.json.

Command Line Interface (CLI)

The main command is youtube-to-docs.

Basic Usage

Running the command without arguments processes a default video:

youtube-to-docs

Arguments

| Argument | Description | Default | Example |
|---|---|---|---|
| video_id | The YouTube content to process. Can be a YouTube URL, Video ID, Playlist ID (starts with PL), Channel Handle (starts with @), or a comma-separated list of Video IDs. | atmGAHYpf_c | youtube-to-docs @mychannel |
| -o, --outfile | Path to save the output CSV file. Local path: my-data.csv. Google Workspace: workspace or w (saves to Drive folder youtube-to-docs-artifacts) or a specific Folder ID. SharePoint/OneDrive: sharepoint or s (saves to youtube-to-docs-artifacts). Memory: memory or m (keeps artifacts in memory, no files on disk). No-op: none or n (skips saving to a file; results are printed to the console). | youtube-to-docs-artifacts/youtube-docs.csv | -o n |
| -t, --transcript | The transcript source to use. Can be 'youtube' (default) to fetch existing YouTube transcripts, or an AI model name to perform STT on extracted audio (e.g. gemini... for the Gemini API, gcp-chirp3 for GCP Speech-to-Text V2). | youtube | -t gemini-2.0-flash-exp |
| -m, --model | The LLM(s) to use for speaker extraction, Q&A generation, tag generation, and summarization. Supports models from Google (Gemini), Vertex AI, AWS Bedrock, and Azure Foundry. Can be a comma-separated list. | None (transcript only) | -m gemini-3-flash-preview,vertex-claude-haiku-4-5@20251001 |
| --tts | The TTS model and voice to use for generating audio summaries. Format: {model}-{voice}. Supports Gemini models (e.g. gemini-3.1-flash-tts-preview-Kore, gemini-2.5-flash-preview-tts-Kore) and GCP Cloud TTS (e.g. gcp-chirp3-Kore). | None | --tts gemini-3.1-flash-tts-preview-Kore |
| -i, --infographic | The image model to use for generating a visual summary. Supports models from Google (Gemini, Imagen), AWS Bedrock (Titan, Nova Canvas), and Azure Foundry. | None | --infographic gemini-3.1-flash-image-preview |
| --alt-text-model | The LLM to use for generating multimodal alt text for the infographic. Defaults to the summary model. | None | --alt-text-model gemini-3-flash-preview |
| -nys, --no-youtube-summary | If set, skips generating a secondary summary from the YouTube transcript when using an AI model for the primary transcript. | False | --no-youtube-summary |
| -tr, --translate | Translate all outputs to a target language after generating in English. Format: {model}-{language} (e.g. gemini-3-flash-preview-es), or aws-translate-{language} / gcp-translate-{language} to use AWS Translate or Google Cloud Translation directly. The tool first tries to fetch a native YouTube transcript in the target language; if unavailable, it translates the English transcript. Summaries, Q&A, tags, one-sentence summaries, transcripts, and SRT files are all translated. When combined with --tts or --infographic, assets are produced in both English and the target language. | None | -tr gemini-3-flash-preview-es, -tr aws-translate-es, -tr gcp-translate-es |
| -cia, --combine-infographic-audio | Combine the infographic and audio summary into a video file (MP4). Requires both --tts and --infographic to be effective. When used with --translate, one video is created per language. | False | --combine-infographic-audio |
| --all | Shortcut to use a specific model suite for everything. Supported: 'gemini-flash', 'gemini-pro', 'gemini-flash-pro-image', 'gcp-pro', 'anthropic-opus'. Sets models for summary, TTS, and infographic, and enables --no-youtube-summary. | None | --all gemini-flash |
| -scc, --suggest-corrected-captions | Suggest WCAG 2.1 Level AA compliant caption corrections for an SRT file, per Section 508 guidance. Format: {model} or {model}-{source}. See Suggested Corrected Captions for full source rules. | None | -scc gemini-3-flash-preview-youtube |
| -pp, --post-process | Post-process the transcript with JSON operations. Supported operations: word count (case-insensitive, whole-word). Values can be a single string or a list. Results are added as new columns in the output CSV (e.g. Post-process: word count(apple)). | None | -pp '{"word count": ["apple", "banana"]}' |
| --verbose | Enable verbose output. | False | --verbose |

Examples

1. Summarize the default video using a single model:

youtube-to-docs

2. Generate a transcript using Gemini 2.0 Flash and summarize (using default model):

youtube-to-docs -t gemini-2.0-flash-exp

3. Process the default video and save to a custom CSV:

youtube-to-docs -o my-docs.csv

4. Summarize a Playlist using multiple models (Gemini and Vertex):

youtube-to-docs PLGKTTEqwhiHHWO-jdxM1KtzTbWo6h0Ycl -m gemini-3-flash-preview,vertex-claude-haiku-4-5@20251001

5. Process a Channel with Summaries, TTS, and Infographics:

youtube-to-docs @mga-othercommittees6625 -m vertex-claude-haiku-4-5@20251001 --tts gemini-2.5-flash-preview-tts-Kore --infographic gemini-3.1-flash-image-preview

6. Generate an Infographic using AWS Bedrock:

youtube-to-docs atmGAHYpf_c --infographic bedrock-titan-image-generator-v2:0

7. Create a Video (Infographic + Audio Summary):

youtube-to-docs atmGAHYpf_c -m gemini-3-flash-preview --tts gemini-2.5-flash-preview-tts-Kore --infographic gemini-3.1-flash-image-preview --combine-infographic-audio

8. Translate all outputs to Spanish:

youtube-to-docs atmGAHYpf_c -m gemini-3-flash-preview -tr gemini-3-flash-preview-es

9. Full multilingual pipeline — English + Spanish summaries, audio, infographics, and videos:

youtube-to-docs atmGAHYpf_c -m gemini-3-flash-preview -tr gemini-3-flash-preview-es --tts gemini-2.5-flash-preview-tts-Kore --infographic gemini-3.1-flash-image-preview --combine-infographic-audio

10. Suggest corrected captions for YouTube SRT:

youtube-to-docs atmGAHYpf_c -scc gemini-3-flash-preview-youtube

11. Run STT and correct the generated captions in one pass:

youtube-to-docs atmGAHYpf_c -t gcp-chirp3 -scc gemini-3-flash-preview

12. Correct captions and add speaker labels in one pass:

youtube-to-docs atmGAHYpf_c -m gemini-3-flash-preview -scc gemini-3-flash-preview-youtube

13. Post-process a transcript to count word occurrences:

youtube-to-docs atmGAHYpf_c -pp '{"word count": ["apple", "banana"]}'
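The whole-word, case-insensitive matching that the word-count post-processor performs can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not the tool's actual implementation:

```python
import re

def word_count(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b word boundaries mean "pineapple" does not match "apple".
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))

transcript = "Apple pie, apple tart, and pineapple juice."
print(word_count(transcript, "apple"))  # → 2
```

Each counted word becomes its own column in the output CSV, e.g. Post-process: word count(apple).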

Suggested Corrected Captions

The -scc / --suggest-corrected-captions flag uses an LLM to suggest WCAG 2.1 Level AA compliant corrections to an existing SRT file, following Section 508 guidance.

What it corrects

  • Punctuation — adds or fixes commas, periods, question marks, em dashes, and ellipses.
  • Capitalisation — fixes sentence-initial capitals and proper nouns.
  • Filler words — retains filler words (e.g. "uh", "um") where they aid accessible meaning.
  • Speaker labels — when speaker extraction has been run (via -m), inserts [Name] at the start of each segment where the speaker changes.
  • Segment merging — if a grammatically complete sentence naturally spans two consecutive segments, those segments are merged into one, using the start time of the first and the end time of the last. No new segments are ever added.
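The segment-merging rule above can be sketched as follows. This is a minimal illustration of the documented rule (keep the first segment's start and the last segment's end); the real tool operates on full SRT files:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: str  # SRT timestamp, e.g. "00:00:01,000"
    end: str
    text: str

def merge(first: Segment, second: Segment) -> Segment:
    # Merged segment spans from the start of the first to the end of the
    # last; no new segments are ever created.
    return Segment(first.start, second.end, f"{first.text} {second.text}")

a = Segment("00:00:01,000", "00:00:03,500", "This sentence continues")
b = Segment("00:00:03,500", "00:00:05,000", "across two segments.")
m = merge(a, b)
print(m.start, m.end)  # → 00:00:01,000 00:00:05,000
```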

Output

Corrections are saved to suggested-corrected-caption-files/ and contain only the changed segments — unchanged segments are omitted. If nothing needs fixing the file contains NO_CHANGES. The output path is recorded in the CSV under Suggested Corrected Captions File ({model}).

Source argument rules

The format is -scc {model} or -scc {model}-{source}.

| Source | Behaviour | Example |
|---|---|---|
| Omitted | Automatically uses the most recent non-YouTube AI SRT in the row. Useful when -t {stt-model} is set in the same run or a prior run. | -scc gemini-3-flash-preview |
| youtube | Uses the YouTube-generated SRT explicitly. | -scc gemini-3-flash-preview-youtube |
| STT model name | Uses the SRT produced by that specific STT run. | -scc gemini-3-flash-preview-gcp-chirp3 |

Tip: If you have more than one AI SRT in the row (e.g. from both gcp-chirp3 and aws-transcribe), the default picks the first one found. Use the explicit {model}-{source} form to target a specific SRT.

Common combinations

# Correct YouTube captions only
youtube-to-docs VIDEO_ID -scc gemini-3-flash-preview-youtube

# STT with GCP Chirp3, then correct the generated captions
youtube-to-docs VIDEO_ID -t gcp-chirp3 -scc gemini-3-flash-preview

# Explicitly target the gcp-chirp3 SRT from a previous run
youtube-to-docs VIDEO_ID -scc gemini-3-flash-preview-gcp-chirp3

# Speaker extraction + caption correction in one run
youtube-to-docs VIDEO_ID -m gemini-3-flash-preview -scc gemini-3-flash-preview-youtube

CSV Column Reference

The output CSV file contains a variety of columns depending on the arguments provided. Below is a reference of the possible columns:

Base Metadata

  • URL: The full YouTube video URL.
  • Title: The title of the video.
  • Description: The video description.
  • Date Published: The date the video was published.
  • Channel: The name of the YouTube channel.
  • Tags: Video tags from YouTube (comma-separated).
  • Tags {Transcript} {Model} model: Up to 5 AI-generated tags based on the transcript.
  • Duration: The duration of the video.
  • Transcript characters from youtube: The total number of characters in the YouTube transcript.
  • Transcript characters from {model}: The total number of characters in the AI-generated transcript (if applicable).
  • Audio File: Path to the extracted audio file (used for AI transcription).

Files

  • Transcript File {type}: Path to the saved transcript file. {type} is either youtube generated, human generated, or {model} generated. Suffix (lang) added for non-English.
  • SRT File {type}: Path to the saved SRT transcript file. Standardized timestamps for accessibility and timing.
  • Speakers File {model}: Path to the saved speaker extraction text file.
  • QA File {model}: Path to the saved Q&A Markdown file. Includes Timestamp and Timestamp URL (markdown hyperlink) columns. Suffix (lang) added for non-English.
  • Summary File {model}: Path to the Markdown summary file generated by a specific model. Suffix (lang) added for non-English.
  • One Sentence Summary File {model}: Path to the one-sentence summary file.
  • Tags File {model}: Path to the AI-generated tags file.
  • Summary Infographic File {model} {infographic_model}: Path to the generated infographic image.
  • Summary Infographic Alt Text File {model} {infographic_model}: Path to the generated alt text file.
  • Summary Audio File {model} {tts_model} File: Path to the generated TTS audio file. Suffix (lang) added for non-English.
  • Video File: Path to the generated MP4 video combining the English infographic and audio.
  • Video File (lang): Path to the translated language video (e.g., Video File (es)). Only present when --translate and --combine-infographic-audio are both used.
  • Suggested Corrected Captions File ({model}): Path to the suggested-corrections SRT file produced by -scc. Contains only the changed segments, or NO_CHANGES if no corrections were needed.
  • Post-process: word count({word}): Count of occurrences of {word} in the transcript (case-insensitive, whole-word match). One column per word when using -pp.

AI Outputs & Costs

  • Speakers {model}: The extracted list of speakers and their roles.
  • {normalized_model} Speaker extraction cost ($): The estimated API cost for speaker extraction.
  • QA Text {model}: The full text of the Q&A pairs (also saved to the Q&A file).
  • {normalized_model} QA cost ($): The estimated API cost for Q&A generation.
  • Summary Text {model}: The full text of the summary (also saved to the summary file).
  • One Sentence Summary {model}: The one-sentence summary text.
  • Summary Infographic Alt Text {model} {infographic_model}: The full multimodal alt text for the infographic.
  • {normalized_model} summary cost ($): The total estimated API cost for both speaker extraction and summarization.
  • Summary Infographic Cost {model} {infographic_model} ($): The estimated API cost for infographic generation.
  • Summary Infographic Alt Text Cost {model} {infographic_model} ($): The estimated API cost for alt text generation.
  • {normalized_model} tags cost from {transcript} ($): The estimated API cost for AI tag generation.
  • {normalized_model} STT cost ($): The estimated API cost for Speech-to-Text generation.

Note: {normalized_model} refers to the model name with prefixes (like vertex-) and date suffixes removed for cleaner column headers.
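Because everything lands in a flat CSV, downstream analysis is straightforward in most tools. A minimal sketch with the Python standard library, using made-up sample values that follow the column-naming scheme above:

```python
import csv
import io

# Hypothetical one-row CSV mimicking the documented column names.
sample = io.StringIO(
    "Title,Summary Text gemini-3-flash-preview,gemini-3-flash-preview summary cost ($)\n"
    "Demo video,A short summary.,0.0012\n"
)
row = next(csv.DictReader(sample))

# Collect every per-model cost column into one dict.
cost_columns = {k: float(v) for k, v in row.items() if k.endswith("cost ($)")}
print(cost_columns)
```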

Speaker Extraction

When a model is specified using the -m or --model argument, the tool automatically performs speaker extraction before generating a summary.

  • Model Matching: The extraction uses the same model as the summary. If multiple models are provided, each will perform its own extraction.
  • Structured Output: It identifies speakers and their professional titles or roles (e.g., "Speaker 1 (Senator Katie Fry Hester, Co-Chair)").
  • Cost Tracking: The cost of speaker extraction is tracked separately in the {model} Speaker extraction cost ($) column and included in the total {model} summary cost ($).
  • Unknowns: If a speaker or title cannot be identified, the tool uses the placeholder UNKNOWN. If no speakers are detected at all, the field is set to NaN.

Web App

youtube-to-docs includes a browser-based web interface powered by FastAPI. It provides the same functionality as the CLI in an interactive UI with real-time log streaming.

Running the Web App

uvx --with "youtube-to-docs[app]" youtube-to-docs-app

This starts a local server at http://localhost:8000.

Options

| Flag | Description | Default |
|---|---|---|
| --host | Host to bind to. | 0.0.0.0 |
| --port | Port to bind to. | 8000 |
| --reload | Enable auto-reload for development. | False |

Example: Run on a custom port with auto-reload

uvx --with "youtube-to-docs[app]" youtube-to-docs-app --port 3000 --reload

Features

  • Model suite dropdown — Select from pre-configured suites (gemini-flash, gemini-pro, etc.) or configure individual models.
  • All CLI parameters — Transcript source, TTS, infographic, translation, corrected captions, post-processing, and output file are all configurable.
  • Real-time log streaming — Processing output streams to the browser via Server-Sent Events (SSE) as the job runs.
  • Artifact browser — After processing completes, generated files (summaries, Q&A, SRT files, infographics, audio, video) are listed with download links.

How It Works

  1. Enter a YouTube URL or Video ID and select your options.
  2. Click Process Video — this starts a background job on the server.
  3. Logs stream to the browser in real time.
  4. When the job completes, any generated artifacts are listed with links to view or download them.

The web app builds CLI arguments internally and calls the same main() function as the CLI, so behavior is identical. Jobs run in background threads, allowing the server to handle multiple requests.
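The job model above can be pictured with a stdlib-only sketch: each request starts a background thread, and log lines stream out of a per-job queue. The real app wires that stream to a FastAPI Server-Sent Events response instead of draining it locally, and this sketch is an illustration, not the app's actual code:

```python
import queue
import threading
import uuid

# One log queue per job, keyed by a generated job ID.
jobs: dict[str, queue.Queue] = {}

def start_job(video_id: str) -> str:
    job_id = uuid.uuid4().hex
    log: queue.Queue = queue.Queue()
    jobs[job_id] = log

    def run() -> None:
        # A real job would invoke the same main() as the CLI here.
        log.put(f"Processing {video_id}...")
        log.put("Done.")
        log.put(None)  # sentinel: job finished

    threading.Thread(target=run, daemon=True).start()
    return job_id

job = start_job("atmGAHYpf_c")
lines = list(iter(jobs[job].get, None))  # drain the queue until the sentinel
print(lines)
```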

API Endpoints

The web app exposes a REST API that the frontend uses:

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Serves the HTML frontend. |
| GET | /api/model-suites | Returns available model suite configs. |
| POST | /api/process | Starts a processing job. Returns a job ID. |
| GET | /api/jobs/{job_id} | Returns job status, output, and artifacts. |
| GET | /api/jobs/{job_id}/stream | SSE stream of real-time log output. |
| GET | /api/artifacts/{path} | Serves a generated artifact file. |

MCP Server

This tool also functions as a Model Context Protocol (MCP) server, allowing it to be used as a tool by AI agents (like the Gemini CLI).

The server exposes a process_video tool that mirrors the CLI functionality.

Configuration

The repository includes a gemini-extension.json file at the root, which configures the MCP server for use with the Gemini CLI.

Usage

Once the extension is registered with your agent, you can ask it to process videos using natural language:

"Save a summary of https://www.youtube.com/watch?v=KuPc06JgI_A"

The agent will prompt you for any necessary details (like the model to use) and then execute the tool.

Install as a Gemini CLI extension

gemini extensions install https://github.com/DoIT-Artificial-Intelligence/youtube-to-docs.git

Library Usage

While primarily a CLI, you can import core functions for custom workflows.

from youtube_to_docs.transcript import fetch_transcript

video_id = "atmGAHYpf_c"
transcript, is_generated = fetch_transcript(video_id)

Why use youtube-to-docs?

You might find other tools that download YouTube transcripts, but youtube-to-docs distinguishes itself in several ways:

  1. Multimodal Output: It doesn't just stop at text.

    • Summaries: Uses state-of-the-art LLMs to create concise summaries.
    • Speaker Extraction: Automatically identifies speakers and their titles/roles from the transcript.
    • Audio (TTS): Converts summaries into audio files, perfect for listening on the go.
    • Tags: Automatically generates up to 5 relevant tags for the video content.
    • Visuals (Infographics): Generates AI-created infographics to visually represent the content.
    • Timestamps & SRT: Automatically generates .srt files and includes precision timestamps in Q&A tables, cross-referencing YouTube's own timing data for pinpoint accuracy when using AI transcripts.
    • Videos: Combines infographics and audio into a shareable video summary.
  2. Structured Data (CSV/Polars):

    • Instead of loose files, metadata and paths are organized into a robust CSV file using polars.
    • This makes it incredibly easy to import the data into Google Sheets, Excel, or a database for further analysis or publishing.
  3. Batch Processing:

    • Seamlessly handles individual videos, entire playlists, or full channels with a single command.
  4. Multi-Provider Support:

    • Agnostic to the LLM provider. Whether you use Google Gemini, Vertex AI, AWS Bedrock, or Azure Foundry, you can plug in your preferred model.
  5. Cost Awareness:

    • When using paid API models, it tracks and estimates the cost of summarization, saving it directly to your data file.