Agent Guide

tubebrain works best when an agent treats transcript retrieval as a staged decision instead of a single blind tool call.

This page is the canonical guide for in-session tool ordering, language handling, transcript-source interpretation, and error-handling patterns.

Use the tools in this order unless you have a reason not to:

  1. get_metadata
  2. list_languages
  3. get_transcript_section for timestamped prompts, or get_transcript for full-video work

Why this order works:

  • get_metadata is the lightest way to validate a URL and check title, duration, and caption availability
  • list_languages tells you whether strict language selection is realistic before you promise a specific output
  • get_transcript_section keeps timestamped prompts focused on the relevant window instead of flooding the calling model with the full video
  • get_transcript is the heaviest path because it returns the full transcript, so it should come after basic grounding
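
The ordering above can be sketched as a small harness helper. This is a minimal sketch, not tubebrain's own client code: call_tool is a hypothetical hook that your MCP client would provide, and the "url" argument name is an assumption. The lang and format arguments are the ones named in this guide.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical harness hook: call_tool(name, arguments) -> parsed JSON result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def fetch_staged(call_tool: ToolCaller, url: str, lang: Optional[str] = None,
                 timestamped: bool = False) -> Dict[str, Any]:
    """Run the staged order: metadata, then languages if needed, then transcript."""
    # "url" as the argument name is an assumption, not confirmed by this guide.
    metadata = call_tool("get_metadata", {"url": url})

    languages = None
    if lang is not None:
        # Only call list_languages when language choice matters.
        languages = call_tool("list_languages", {"url": url})

    tool = "get_transcript_section" if timestamped else "get_transcript"
    args: Dict[str, Any] = {"url": url, "format": "json"}
    if lang is not None:
        args["lang"] = lang
    transcript = call_tool(tool, args)
    return {"metadata": metadata, "languages": languages, "transcript": transcript}
```

A real harness would also stop early when get_metadata reports an error or no captions, which this sketch leaves out.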

When To Call Each Tool

get_metadata

Use get_metadata first when you need to:

  • verify that a URL resolves to a public video
  • inspect title, channel, and duration for routing or summarization
  • check whether captions exist before asking for a transcript
  • cheaply validate that the client and MCP server are wired correctly

list_languages

Use list_languages when language choice matters:

  • before promising the user a transcript in a specific language
  • before treating lang=en as guaranteed
  • before deciding whether to ask the user which caption track they want

get_transcript

Use get_transcript once you know you actually want the transcript payload.

For most agent workflows, format=json is the best default because it keeps the language, source, and timestamp fields available for downstream reasoning.

Use other formats when the output target is already known:

  • markdown for direct human review inside a chat
  • srt or vtt for subtitle export
  • text for lightweight plain-text summarization

get_transcript_section

Use get_transcript_section when the user gives a timestamped YouTube URL or asks about a specific part of a video. If at_s is omitted, the tool reads YouTube timestamps such as t=2449s from the URL. The default window is 120 seconds before and 600 seconds after the anchor, with original segment timestamps preserved.

For agent research workflows, prefer format=json so the calling harness can see anchor_ms, window_start_ms, window_end_ms, and the filtered segments.
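
To make the default window concrete, the arithmetic can be sketched as follows. This is an illustration only, not the tool's implementation: it handles only plain-seconds values such as t=2449s or t=2449, and clamping the window start at zero is an assumption. The field names match the JSON fields listed above.

```python
import re
from typing import Dict, Optional
from urllib.parse import parse_qs, urlparse

BEFORE_S = 120  # default seconds before the anchor (from this guide)
AFTER_S = 600   # default seconds after the anchor (from this guide)


def anchor_seconds(url: str) -> Optional[int]:
    """Read a plain-seconds t= value such as t=2449s or t=2449 from the query string."""
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    match = re.fullmatch(r"(\d+)s?", values[0])
    return int(match.group(1)) if match else None


def section_window(url: str, at_s: Optional[int] = None) -> Optional[Dict[str, int]]:
    """Compute the anchor and window bounds in milliseconds."""
    # An explicit at_s wins; otherwise fall back to the URL timestamp.
    anchor = at_s if at_s is not None else anchor_seconds(url)
    if anchor is None:
        return None
    return {
        "anchor_ms": anchor * 1000,
        # Assumption: the window start is clamped at the beginning of the video.
        "window_start_ms": max(0, anchor - BEFORE_S) * 1000,
        "window_end_ms": (anchor + AFTER_S) * 1000,
    }
```

For a URL ending in t=2449s, the sketch yields a window from 2329 s to 3049 s around the 2449 s anchor.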

Experimental Live Sessions

Use start_stream only for live or direct audio-stream URLs, not normal YouTube VODs. It resolves YouTube Live through the watch page HLS manifest and opens a session that can be polled with poll_stream.

Live sessions run a background HLS/direct-audio fetch task. Each fetched chunk is handed to the stream transcription boundary. Ingestion details:

  • HLS fMP4 init maps are handled before transcription
  • inherited HLS byte-range offsets are normalized before fetch
  • direct HTTP audio streams are buffered into bounded chunks instead of raw network packet boundaries
  • MP3 streams flush on complete frame boundaries when possible
  • content-type or URL-extension hints are passed to the decoder for MP3/AAC/MP4 probing
  • Whisper builds also demux MPEG-TS live chunks to ADTS AAC before Symphonia decode

Default builds use a no-op chunk transcriber and mark live STT unavailable with health: "degraded" plus a last_error on poll_stream; builds with the whisper feature attempt local Whisper transcription over 15-second live windows with 5 seconds of overlap and duplicate filtering. Whisper-backed sessions seed last_diagnostic at startup with model/window/overlap posture before the first transcript segment is ready.

Full live STT quality remains experimental, so a valid Whisper-backed stream session can still return empty chunks. Check health, last_diagnostic, and last_error on poll_stream responses before assuming silence means success.
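
The poll_stream check can be sketched as a small classifier. This is a hedged illustration: health, last_diagnostic, and last_error are the fields named in this guide, while the "chunks" key and the returned labels are assumptions made for the example.

```python
from typing import Any, Dict


def classify_poll(response: Dict[str, Any]) -> str:
    """Classify a poll_stream response without treating silence as success."""
    # A reported error or degraded health always wins over chunk contents.
    if response.get("last_error") or response.get("health") == "degraded":
        return "degraded"
    # "chunks" is an assumed field name for the transcript payload.
    if response.get("chunks"):
        return "transcribing"
    # Empty chunks on a healthy session: keep polling and surface last_diagnostic.
    return "waiting"
```

A "waiting" result is the case to handle with care: the session may be valid and still have produced no text yet.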

The local live Whisper default is base.en with a 15-second window and 5-second overlap. For validation or local tuning, set TUBEBRAIN_WHISPER_MODEL=tiny.en|base.en|small.en, TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS, and TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS. Use tiny.en for quick smoke tests, base.en as the default local posture, and small.en only when the local machine can absorb the extra model load.
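
To see how the window and overlap settings interact, the sketch below reads the three variables with the defaults stated above and lists the resulting window spans. The variable names and defaults come from this guide. The validation rule and the stride arithmetic (window minus overlap) are assumptions for illustration, not a description of tubebrain's internal scheduler.

```python
import os
from typing import List, Mapping, Tuple


def live_window_config(env: Mapping[str, str] = os.environ) -> Tuple[str, int, int]:
    """Read model, window, and overlap, falling back to the documented defaults."""
    model = env.get("TUBEBRAIN_WHISPER_MODEL", "base.en")
    window = int(env.get("TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS", "15"))
    overlap = int(env.get("TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS", "5"))
    # Assumed sanity check: an overlap at or above the window would never advance.
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    return model, window, overlap


def window_spans(total_secs: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """List the full (start, end) windows that fit in total_secs of audio."""
    stride = window - overlap
    return [(start, start + window) for start in range(0, total_secs - window + 1, stride)]
```

With the defaults, 35 seconds of audio gives three windows: 0 to 15, 10 to 25, and 20 to 35, so each adjacent pair shares 5 seconds that duplicate filtering has to reconcile.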

Whisper builds install the whisper.cpp tracing trampoline before local model initialization. Native inference logs therefore flow through the normal Rust tracing subscriber on stderr instead of writing directly to process stdout/stderr. Keep MCP stdout reserved for protocol traffic; use RUST_LOG=tubebrain=debug,whisper_rs=debug only when you need native Whisper diagnostics during local validation.

For YouTube Live HLS, the ingestion path solves player n challenges before fetching manifests or segments. It fetches the watch-page player JS, verifies pinned yt-dlp EJS solver scripts, and runs Node to rewrite challenged URLs. Set TUBEBRAIN_YOUTUBE_N_NODE to override the Node binary. The EJS scripts are fetched from the pinned yt-dlp/ejs release on first use and cached in-process; a missing Node runtime, unavailable EJS asset, or hash mismatch fails the stream with an explicit diagnostic.

In po-token builds, GVS/session PoToken attachment is available only when TUBEBRAIN_ATTACH_YOUTUBE_GVS_POT=1 is set; it is not attached by default because the current validated parity path does not require it. Stream diagnostics redact PoTokens and signed Googlevideo URL values.

Live Smoke Validation

Normal test runs stay hermetic. To validate against a real public YouTube Live source, provide a live URL explicitly and run the ignored smoke tests:

TUBEBRAIN_LIVE_SMOKE_URL='https://www.youtube.com/watch?v=VIDEO_ID' \
  cargo test --locked --test live_smoke -- --ignored

The resolver smoke fetches the watch page, resolves the HLS audio endpoint, and parses the selected media playlist. Optional settings extend the run:

  • TUBEBRAIN_LIVE_SMOKE_INGEST=1: also fetch at least one live HLS audio chunk through AudioIngestionEngine. This ingestion variant is the regression gate for real segment authorization.
  • --features whisper: the ignored decode smoke also verifies that a fetched YouTube Live chunk demuxes and decodes to PCM without downloading a Whisper model.
  • TUBEBRAIN_LIVE_SMOKE_WHISPER=1 (with --features whisper): run actual local Whisper inference. This defaults to tiny.en and can be overridden with TUBEBRAIN_LIVE_SMOKE_WHISPER_MODEL=base.en or small.en.
  • TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS and TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS: honored by the smoke transcriber for local timing experiments.
  • TUBEBRAIN_LIVE_SMOKE_WHISPER_MATRIX=1, with optional TUBEBRAIN_LIVE_SMOKE_WHISPER_MODELS=tiny.en,base.en,small.en: run model-matrix checks. The ignored test prints each model's elapsed time, first segment timestamp, segment count, and text snippet when run with --nocapture.

As of 2026-04-28, the public Lofi Girl and FRANCE 24 live sources pass the live-byte path by solving the HLS n challenge once on the manifest path, carrying the solved value into segment URLs, and fetching live Googlevideo audio segments. FRANCE 24 also passes the decode smoke and the opt-in live Whisper smoke. As of 2026-04-29, the FRANCE 24 matrix smoke has produced transcript segments with both tiny.en and base.en; small.en remains a heavier local quality check rather than a default gate.

Language Handling

The lang argument is a preference, not a hard contract.

What matters in practice:

  • the returned language field is authoritative
  • lang=en does not guarantee an English transcript
  • if the requested language is missing, tubebrain may fall back to another available language instead of hard-failing

If strict language choice matters:

  1. call list_languages first
  2. confirm the target language code is present
  3. inspect the language field in the transcript response before presenting the result as final
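
The three steps above can be sketched as a check that runs after the transcript arrives. This is a minimal sketch under stated assumptions: the list_languages result is assumed to have been reduced to a plain list of language codes, and codes are compared by exact string match (so "en" and "en-US" count as different). The language field is the one named in this guide.

```python
from typing import Any, Dict, Iterable, Optional


def language_note(requested: str, listed: Iterable[str],
                  transcript: Dict[str, Any]) -> Optional[str]:
    """Return a note for the user when the returned language differs from the request."""
    # The returned language field is authoritative, not the lang argument.
    returned = transcript.get("language")
    if returned == requested:
        return None
    if requested not in set(listed):
        return f"Requested '{requested}' is not listed; transcript returned in '{returned}'."
    return f"Requested '{requested}' but transcript returned in '{returned}'."
```

Returning None means the result can be presented as final; any other value should be shown to the user alongside the transcript.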

Transcript Source Handling

The source field tells the agent how the transcript was produced.

  • caption_manual: human-authored captions; usually the highest-confidence path
  • caption_auto_generated: machine-generated captions; usable but more error-prone
  • whisper_local: local speech-to-text fallback; slower and usually noisier than captions

Treat source as part of the answer quality, not just an implementation detail. If the source is not caption_manual, say so in the response when quality or attribution matters.
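
One way to apply this rule is a lookup that turns the source value into a caveat for the final answer. The three source values come from this guide; the caveat wording and the handling of unrecognized values are assumptions for illustration.

```python
from typing import Optional

# Caveat text per source value; None means no caveat is needed.
SOURCE_NOTES = {
    "caption_manual": None,
    "caption_auto_generated": "Transcript comes from machine-generated captions and may contain errors.",
    "whisper_local": "Transcript comes from local speech-to-text and may be noisier than captions.",
}


def source_caveat(source: str) -> Optional[str]:
    """Return a caveat to include in the answer, or None for manual captions."""
    if source in SOURCE_NOTES:
        return SOURCE_NOTES[source]
    # Assumed fallback: surface unknown values instead of hiding them.
    return f"Transcript source '{source}' is unrecognized; treat quality as unknown."
```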

Error-Handling Pattern

Use these decision rules in agent sessions:

  • invalid YouTube URL or video ID: stop and ask for a valid watch URL or video ID
  • video is private or requires authentication: stop; the default path does not support it
  • video is age-restricted: stop; treat it as unsupported for the default path
  • no captions available ... timedtext returned empty after fallback attempts: explain that metadata and language listing may still work, but the caption fetch path is blocked
  • PoToken required but botguard sidecar not available: suggest a po-token build
  • whisper feature not enabled: suggest a whisper build only if the user needs captionless fallback

If a transcript fetch fails, do not throw away the metadata path. get_metadata and list_languages can still provide useful grounding.
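
The decision rules above can be sketched as a substring router. This is a hedged example: the error phrases come from the list above, but matching them case-insensitively by substring and the action labels themselves are assumptions. The captions rule matches only the leading phrase, since the full message is elided in this guide.

```python
from typing import List, Tuple

# (substring to look for, action label) in priority order; labels are hypothetical.
RULES: List[Tuple[str, str]] = [
    ("invalid youtube url or video id", "ask_for_valid_url"),
    ("private or requires authentication", "stop_unsupported"),
    ("age-restricted", "stop_unsupported"),
    ("no captions available", "explain_caption_fetch_blocked"),
    ("potoken required", "suggest_po_token_build"),
    ("whisper feature not enabled", "suggest_whisper_build_if_needed"),
]


def next_action(error_message: str) -> str:
    """Map a tool error message to the next step for the agent."""
    lowered = error_message.lower()
    for needle, action in RULES:
        if needle in lowered:
            return action
    # Unknown errors are passed through rather than guessed at.
    return "report_error_verbatim"
```

Whatever the action, metadata and language results gathered earlier in the session remain usable.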

MCP Client Hygiene

When wiring tubebrain into an MCP client:

  • prefer pointing the client directly at the tubebrain binary
  • if the binary is not on PATH, use an absolute path
  • do not wrap tubebrain in a shell script that prints to stdout
  • restart the client after config changes unless the client explicitly hot-reloads MCP

stdout is reserved for MCP protocol traffic. Any banner text or wrapper noise can break the session.

Prompting Patterns

Useful instructions to give an agent:

  • "Run get_metadata first. If language choice matters, run list_languages. For a timestamped prompt, fetch get_transcript_section in JSON; otherwise fetch get_transcript in JSON."
  • "If the transcript source is not caption_manual, tell me that in the answer."
  • "If the requested language is unavailable, say which language was returned instead of assuming English."
  • "If transcript fetch fails, still tell me whether captions exist and which languages are listed."