Agent Guide¶
tubebrain works best when an agent treats transcript retrieval as a staged
decision instead of a single blind tool call.
This page is the canonical guide for in-session tool ordering, language handling, transcript-source interpretation, and error-handling patterns.
Recommended Tool Order¶
Use the tools in this order unless you have a reason not to:
1. get_metadata
2. list_languages
3. get_transcript_section for timestamped prompts, or get_transcript for full-video work
Why this order works:
- get_metadata is the lightest validation path for a URL, title, duration, and caption availability
- list_languages tells you whether strict language selection is realistic before you promise a specific output
- get_transcript_section keeps timestamped prompts focused on the relevant window instead of flooding the calling model with the full video
- get_transcript is the full heavy path, so it should come after basic grounding
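The ordering above can be sketched as a small planning helper. This is only an illustration: the function name plan_tool_calls and its two flags are hypothetical, and only the tool names come from this guide.

```python
from typing import List


def plan_tool_calls(has_timestamp: bool, language_matters: bool = True) -> List[str]:
    """Return tubebrain tool names in the recommended call order.

    Hypothetical helper: the flags are illustrative, not part of tubebrain.
    """
    plan = ["get_metadata"]  # lightest validation path, so it goes first
    if language_matters:
        # Check language options before promising a specific output.
        plan.append("list_languages")
    # Timestamped prompts get the focused window; everything else gets the full transcript.
    plan.append("get_transcript_section" if has_timestamp else "get_transcript")
    return plan
```

A harness could log the returned plan before executing it, which makes any deviation from the default order an explicit decision.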
When To Call Each Tool¶
get_metadata¶
Use get_metadata first when you need to:
- verify that a URL resolves to a public video
- inspect title, channel, and duration for routing or summarization
- check whether captions exist before asking for a transcript
- cheaply validate that the client and MCP server are wired correctly
list_languages¶
Use list_languages when language choice matters:
- before promising the user a transcript in a specific language
- before treating lang=en as guaranteed
- before deciding whether to ask the user which caption track they want
get_transcript¶
Use get_transcript once you know you actually want the transcript payload.
For most agent workflows, format=json is the best default because it keeps
the language, source, and timestamp fields available for downstream reasoning.
Use other formats when the output target is already known:
- markdown for direct human review inside a chat
- srt or vtt for subtitle export
- text for lightweight plain-text summarization
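One way to encode the format guidance is a lookup with json as the fallback. The target labels below (agent, chat, and so on) are hypothetical names chosen for this sketch; only the format values come from this guide.

```python
# Hypothetical output targets mapped to the formats named in this guide.
FORMAT_BY_TARGET = {
    "agent": "json",          # keeps language, source, and timestamp fields
    "chat": "markdown",       # direct human review
    "subtitles_srt": "srt",   # subtitle export
    "subtitles_vtt": "vtt",   # subtitle export
    "summary": "text",        # lightweight plain-text summarization
}


def choose_format(target: str) -> str:
    """Pick a transcript format, defaulting to json when the target is unknown."""
    return FORMAT_BY_TARGET.get(target, "json")
```

Defaulting to json means an unrecognized target still keeps the metadata fields available for downstream reasoning.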
get_transcript_section¶
Use get_transcript_section when the user gives a timestamped YouTube URL or
asks about a specific part of a video. If at_s is omitted, the tool reads
YouTube timestamps such as t=2449s from the URL. The default window is 120
seconds before and 600 seconds after the anchor, with original segment
timestamps preserved.
For agent research workflows, prefer format=json so the calling harness can
see anchor_ms, window_start_ms, window_end_ms, and the filtered segments.
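The default window arithmetic can be sketched as follows. This is an approximation for reasoning about expected output, not tubebrain's parser: it handles only plain-second t= values (with or without a trailing s), and clamping the window start at zero is an assumption. The helper names are hypothetical; the field names match the ones listed above.

```python
import re
from typing import Dict, Optional
from urllib.parse import parse_qs, urlparse

BEFORE_MS = 120 * 1000  # default: 120 seconds before the anchor
AFTER_MS = 600 * 1000   # default: 600 seconds after the anchor


def anchor_seconds_from_url(url: str) -> Optional[int]:
    """Read a plain-seconds t= value such as t=2449s or t=2449 from the query string."""
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    match = re.fullmatch(r"(\d+)s?", values[0])
    return int(match.group(1)) if match else None


def section_window(url: str, at_s: Optional[int] = None) -> Optional[Dict[str, int]]:
    """Compute the expected window; an explicit at_s wins over the URL timestamp."""
    anchor_s = at_s if at_s is not None else anchor_seconds_from_url(url)
    if anchor_s is None:
        return None
    anchor_ms = anchor_s * 1000
    return {
        "anchor_ms": anchor_ms,
        # Clamping at zero is an assumption for anchors near the start of the video.
        "window_start_ms": max(0, anchor_ms - BEFORE_MS),
        "window_end_ms": anchor_ms + AFTER_MS,
    }
```

Comparing this expected window against the anchor_ms, window_start_ms, and window_end_ms in a JSON response is a cheap sanity check that the tool anchored where the user intended.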
Experimental Live Sessions¶
Use start_stream only for live or direct audio-stream URLs, not normal
YouTube VODs. It resolves YouTube Live through the watch page HLS manifest and
opens a session that can be polled with poll_stream.
Live sessions run a background HLS/direct-audio fetch task. Each fetched chunk
is handed to the stream transcription boundary. HLS fMP4 init maps are handled
before transcription, inherited HLS byte-range offsets are normalized before
fetch, and direct HTTP audio streams are buffered into bounded chunks instead of
being split at raw network packet boundaries. MP3 streams flush on complete
frame boundaries when possible, and content-type or URL-extension hints are
passed to the decoder for MP3/AAC/MP4 probing. Whisper builds also demux MPEG-TS
live chunks to ADTS AAC before Symphonia decode.
Default builds use a no-op chunk transcriber and mark live STT unavailable with
health: "degraded" plus a last_error on poll_stream. Builds with the whisper
feature attempt local Whisper transcription over 15-second live windows with
5 seconds of overlap and duplicate filtering. Whisper-backed sessions seed
last_diagnostic at startup with model/window/overlap posture before the first
transcript segment is ready.
Full live STT quality remains experimental, so a valid Whisper-backed stream
session can still return empty chunks. Check health, last_diagnostic, and
last_error on poll_stream responses before assuming silence means success.
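A minimal sketch of that check, assuming the poll_stream response has been parsed into a dictionary. The health, last_diagnostic, and last_error fields are named in this guide; the chunks key, the non-degraded health values, and the returned labels are assumptions made for illustration.

```python
from typing import Any, Dict


def interpret_poll(response: Dict[str, Any]) -> str:
    """Classify a poll_stream response before trusting an empty result.

    Assumes a "chunks" list in the response; the real field name may differ.
    """
    if response.get("last_error"):
        return "error"       # surface the error instead of treating silence as success
    if response.get("health") == "degraded":
        return "degraded"    # e.g. default builds with the no-op chunk transcriber
    if not response.get("chunks"):
        return "empty"       # valid session, nothing transcribed yet; see last_diagnostic
    return "ok"
```

An agent would typically keep polling on "empty", report last_diagnostic if the state persists, and stop on "error" or "degraded".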
The local live Whisper default is base.en with a 15-second window and
5-second overlap. For validation or local tuning, set
TUBEBRAIN_WHISPER_MODEL=tiny.en|base.en|small.en,
TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS, and
TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS. Use tiny.en for quick smoke tests,
base.en as the default local posture, and small.en only when the local
machine can absorb the extra model load.
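To see how the window and overlap settings interact, the sketch below reads the three variables with the documented defaults and lists the start offsets of successive windows. The helper names are hypothetical, and the scheduling arithmetic (stride equals window minus overlap) is an assumption about how overlapping windows are usually laid out, not a description of tubebrain internals.

```python
import os
from typing import List, Mapping, Tuple


def live_window_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int, int]:
    """Read model, window, and overlap from the environment with documented defaults."""
    model = env.get("TUBEBRAIN_WHISPER_MODEL", "base.en")
    window = int(env.get("TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS", "15"))
    overlap = int(env.get("TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS", "5"))
    if not 0 <= overlap < window:
        # Validation rule assumed for this sketch.
        raise ValueError("overlap must be non-negative and smaller than the window")
    return model, window, overlap


def window_starts(total_secs: int, window: int, overlap: int) -> List[int]:
    """Start offsets (seconds) of overlapping windows covering total_secs of audio."""
    stride = window - overlap  # each new window advances by this many seconds
    return list(range(0, max(total_secs - overlap, 1), stride))
```

With the defaults, 35 seconds of audio gives windows starting at 0, 10, and 20 seconds, so each window shares its first 5 seconds with the previous one.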
Whisper builds install the whisper.cpp tracing trampoline before local model
initialization. Native inference logs therefore flow through the normal Rust
tracing subscriber on stderr instead of writing directly to process
stdout/stderr. Keep MCP stdout reserved for protocol traffic; use
RUST_LOG=tubebrain=debug,whisper_rs=debug only when you need native Whisper
diagnostics during local validation.
For YouTube Live HLS, the ingestion path solves player n challenges before
fetching manifests or segments. It fetches the watch-page player JS, verifies
pinned yt-dlp EJS solver scripts, and runs Node to rewrite challenged URLs. Set
TUBEBRAIN_YOUTUBE_N_NODE to override the Node binary. The EJS scripts are fetched
from the pinned yt-dlp/ejs release on first use and cached in-process; a
missing Node runtime, unavailable EJS asset, or hash mismatch fails the stream
with an explicit diagnostic. In po-token builds, GVS/session PoToken
attachment is available only when
TUBEBRAIN_ATTACH_YOUTUBE_GVS_POT=1 is set; it is not attached by default because
the current validated parity path does not require it. Stream diagnostics redact
PoTokens and signed Googlevideo URL values.
Live Smoke Validation¶
Normal test runs stay hermetic. To validate against a real public YouTube Live source, provide a live URL explicitly and run the ignored smoke tests:
TUBEBRAIN_LIVE_SMOKE_URL='https://www.youtube.com/watch?v=VIDEO_ID' \
cargo test --locked --test live_smoke -- --ignored
The resolver smoke fetches the watch page, resolves the HLS audio endpoint, and
parses the selected media playlist. To also fetch at least one live HLS audio
chunk through AudioIngestionEngine, add TUBEBRAIN_LIVE_SMOKE_INGEST=1. This
ingestion variant is the regression gate for real segment authorization. With
--features whisper, the ignored decode smoke also verifies that a fetched
YouTube Live chunk demuxes and decodes to PCM without downloading a Whisper
model. To run actual local Whisper inference, also set
TUBEBRAIN_LIVE_SMOKE_WHISPER=1; this defaults to tiny.en and can be overridden
with TUBEBRAIN_LIVE_SMOKE_WHISPER_MODEL=base.en or small.en. The smoke
transcriber honors TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS and
TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS for local timing experiments.
For model-matrix checks, set TUBEBRAIN_LIVE_SMOKE_WHISPER_MATRIX=1 and optional
TUBEBRAIN_LIVE_SMOKE_WHISPER_MODELS=tiny.en,base.en,small.en; the ignored test
prints each model's elapsed time, first segment timestamp, segment count, and
text snippet when run with --nocapture.
As of 2026-04-28, the public Lofi Girl and FRANCE 24 live sources pass the
live-byte path by solving the HLS n challenge once on the manifest path,
carrying the solved value into segment URLs, and fetching live Googlevideo audio
segments. FRANCE 24 also passes the decode smoke and the opt-in live Whisper
smoke. As of 2026-04-29, the FRANCE 24 matrix smoke has produced transcript
segments with both tiny.en and base.en; small.en remains a heavier local
quality check rather than a default gate.
Language Handling¶
The lang argument is a preference, not a hard contract.
What matters in practice:
- the returned language field is authoritative
- lang=en does not guarantee an English transcript
- if the requested language is missing, tubebrain may fall back to another available language instead of hard-failing
If strict language choice matters:
- call list_languages first
- confirm the target language code is present
- inspect the language field in the transcript response before presenting the result as final
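The strict-language steps can be sketched as a post-fetch check. It assumes list_languages has been reduced to a list of language codes and that the transcript response is a dictionary with a language key; the helper name and the exact-match comparison are illustrative.

```python
from typing import Any, Dict, List, Optional


def language_note(requested: str, available: List[str],
                  transcript: Dict[str, Any]) -> Optional[str]:
    """Return a caveat when the returned language differs from the request.

    Hypothetical helper; codes are compared by exact match, which is an assumption.
    """
    returned = transcript.get("language")  # the authoritative field
    if returned == requested:
        return None
    if requested not in available:
        return f"Requested '{requested}' is not listed; transcript returned in '{returned}'."
    return f"Requested '{requested}' but transcript returned in '{returned}'."
```

A None result means the transcript can be presented as requested; any other result should be shown to the user alongside the answer.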
Transcript Source Handling¶
The source field tells the agent how the transcript was produced.
- caption_manual: human-authored captions; usually the highest-confidence path
- caption_auto_generated: machine-generated captions; usable but more error-prone
- whisper_local: local speech-to-text fallback; slower and usually noisier than captions
Treat source as part of the answer quality, not just an implementation detail.
If the source is not caption_manual, say so in the response when quality or
attribution matters.
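A small lookup is one way to keep the source caveat attached to the answer. The three source values come from this guide; the caveat wording, the helper name, and the handling of unrecognized values are assumptions.

```python
from typing import Optional

# Caveat text per source value; None means no caveat is needed.
SOURCE_CAVEATS = {
    "caption_manual": None,
    "caption_auto_generated": "machine-generated captions; may contain errors",
    "whisper_local": "local speech-to-text fallback; usually noisier than captions",
}


def source_caveat(source: str) -> Optional[str]:
    """Return a caveat to include in the answer, or None for manual captions."""
    if source not in SOURCE_CAVEATS:
        # Unknown values are flagged rather than silently trusted (assumed policy).
        return f"unrecognized transcript source: {source}"
    return SOURCE_CAVEATS[source]
```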
Error-Handling Pattern¶
Use these decision rules in agent sessions:
- invalid YouTube URL or video ID: stop and ask for a valid watch URL or video ID
- video is private or requires authentication: stop; the default path does not support it
- video is age-restricted: stop; treat it as unsupported for the default path
- no captions available ... timedtext returned empty after fallback attempts: explain that metadata and language listing may still work, but the caption fetch path is blocked
- PoToken required but botguard sidecar not available: suggest a po-token build
- whisper feature not enabled: suggest a whisper build only if the user needs captionless fallback
If a transcript fetch fails, do not throw away the metadata path. get_metadata
and list_languages can still provide useful grounding.
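These rules can be sketched as a substring router. The message fragments are taken from the list above; matching by case-insensitive substring, the action labels, and the fallback label are assumptions, so check the Troubleshooting page for the exact failure strings before relying on this.

```python
from typing import List, Tuple

# (message fragment, hypothetical action label), checked in order.
RULES: List[Tuple[str, str]] = [
    ("invalid YouTube URL or video ID", "ask_for_valid_url"),
    ("private or requires authentication", "stop_unsupported"),
    ("age-restricted", "stop_unsupported"),
    ("no captions available", "explain_caption_path_blocked"),
    ("PoToken required", "suggest_po_token_build"),
    ("whisper feature not enabled", "suggest_whisper_build_if_needed"),
]


def route_error(message: str) -> str:
    """Map a tool error message to an action label using substring matching."""
    lowered = message.lower()
    for fragment, action in RULES:
        if fragment.lower() in lowered:
            return action
    return "report_raw_error"  # unknown failures are passed through unchanged
```

Whatever the route, the agent can still fall back to get_metadata and list_languages for grounding.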
MCP Client Hygiene¶
When wiring tubebrain into an MCP client:
- prefer pointing the client directly at the tubebrain binary
- if the binary is not on PATH, use an absolute path
- do not wrap tubebrain in a shell script that prints to stdout
- restart the client after config changes unless the client explicitly hot-reloads MCP
stdout is reserved for MCP protocol traffic. Any banner text or wrapper noise
can break the session.
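As an illustration of the first two points, the sketch below resolves the binary to an absolute path and emits a config entry that launches it directly, with no wrapper script. The mcpServers layout is an assumption borrowed from common MCP clients; check your client's documentation for the real schema. The helper names are hypothetical.

```python
import json
import os
import shutil
from typing import Callable, Optional


def server_entry(binary: str = "tubebrain",
                 which: Callable[[str], Optional[str]] = shutil.which) -> dict:
    """Build a config entry that points the client straight at the binary."""
    path = binary if os.path.isabs(binary) else which(binary)
    if path is None:
        raise FileNotFoundError(f"{binary} is not on PATH; pass an absolute path")
    # No shell wrapper: stdout stays reserved for MCP protocol traffic.
    return {"command": os.path.abspath(path), "args": []}


def client_config(binary: str = "tubebrain",
                  which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Render a JSON config; the "mcpServers" key is an assumed client convention."""
    return json.dumps({"mcpServers": {"tubebrain": server_entry(binary, which)}}, indent=2)
```

Failing loudly when the binary cannot be resolved avoids a client that starts but silently has no tubebrain tools.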
Prompting Patterns¶
Useful instructions to give an agent:
- "Run get_metadata first. If language choice matters, run list_languages. For a timestamped prompt, fetch get_transcript_section in JSON; otherwise fetch get_transcript in JSON."
- "If the transcript source is not caption_manual, tell me that in the answer."
- "If the requested language is unavailable, say which language was returned instead of assuming English."
- "If transcript fetch fails, still tell me whether captions exist and which languages are listed."
Related Docs¶
- Integrations for client configuration
- Quickstarts for end-to-end workflows
- Sample Outputs for response shape
- Troubleshooting for concrete failure strings
- Tools for the reference surface