Tools¶
tubebrain exposes four stable VOD MCP tools and an experimental live-stream
session surface.
For the canonical in-session usage pattern, see the Agent Guide. The short version is:
- use get_metadata first
- use list_languages when language choice matters
- use get_transcript_section for timestamp-scoped work, or get_transcript when you are ready to consume the full transcript output
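The ordering above can be sketched as a small call planner. This is an illustration only: plan_calls and its flags are hypothetical names, and the sketch only builds the list of tool names and arguments rather than talking to a real MCP client.

```python
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]


def plan_calls(
    url: str,
    timestamped: bool = False,
    language_matters: bool = False,
    lang: str = "en",
) -> List[ToolCall]:
    """Return the VOD tool calls in the recommended order (hypothetical helper)."""
    # Metadata always comes first.
    calls: List[ToolCall] = [("get_metadata", {"url": url})]

    # Only list caption languages when the choice of language matters.
    if language_matters:
        calls.append(("list_languages", {"url": url}))

    # Timestamp-scoped work uses the section tool; otherwise fetch everything.
    tool = "get_transcript_section" if timestamped else "get_transcript"
    calls.append((tool, {"url": url, "lang": lang, "format": "json"}))
    return calls
```

An agent would then pass each (name, arguments) pair to whatever MCP client call it uses to invoke tools.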
get_transcript¶
Fetch a transcript in one of the supported output formats.
Example:
get_transcript(url: "https://youtube.com/watch?v=dQw4w9WgXcQ", format: "markdown")
Parameters:
- url (required)
- lang (default: en)
- format (default: json)
Supported formats:
- json
- markdown
- srt
- vtt
- text
For auto-captioned videos, the VOD resolver starts from the public /watch
player response and may refresh stale timedtext URLs through the Android VR
player response using the same visitor data from the watch page. This keeps the
normal transcript tools working for videos whose web caption URL is present but
returns an empty body.
get_transcript_section¶
Fetch only the transcript window around a timestamp. This is the preferred tool for timestamped research prompts because it keeps the agent focused on the relevant section while preserving original segment timestamps.
Example:
get_transcript_section(url: "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s", format: "json")
Parameters:
- url (required)
- lang (default: en)
- at_s (optional; defaults to the YouTube timestamp in url)
- before_s (default: 120)
- after_s (default: 600)
- format (default: json)
When format=json, the response includes anchor_ms, window_start_ms, and
window_end_ms in addition to the filtered segments. Other formats reuse the
normal transcript renderers with the same filtered segment list.
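As a rough illustration of how the window fields relate to the parameters, the sketch below derives anchor_ms, window_start_ms, and window_end_ms from a URL and the documented defaults. The arithmetic (anchor minus before_s, anchor plus after_s, clamped at zero) is an assumption about the tool's behavior, and both function names are hypothetical.

```python
import re
from typing import Dict, Optional
from urllib.parse import parse_qs, urlparse


def timestamp_from_url(url: str) -> Optional[int]:
    """Return the `t` query parameter in whole seconds, or None if absent."""
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    raw = values[0]
    if raw.isdigit():
        return int(raw)
    # Accept forms such as "2449s" or "1h2m3s".
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", raw)
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def section_window(
    url: str,
    at_s: Optional[int] = None,
    before_s: int = 120,
    after_s: int = 600,
) -> Dict[str, int]:
    """Compute the window bounds in milliseconds (assumed arithmetic)."""
    anchor_s = at_s if at_s is not None else timestamp_from_url(url)
    if anchor_s is None:
        raise ValueError("no at_s given and no timestamp found in url")
    return {
        "anchor_ms": anchor_s * 1000,
        # Assumption: the window never starts before the beginning of the video.
        "window_start_ms": max(0, anchor_s - before_s) * 1000,
        "window_end_ms": (anchor_s + after_s) * 1000,
    }
```

Under these assumptions, the example URL with t=2449s and the default before_s and after_s gives a window from 2,329,000 ms to 3,049,000 ms around an anchor of 2,449,000 ms.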
list_languages¶
List available caption languages for a video.
Example:
list_languages(url: "https://youtube.com/watch?v=dQw4w9WgXcQ")
get_metadata¶
Fetch video title, channel, duration, and caption availability.
Example:
get_metadata(url: "https://youtube.com/watch?v=dQw4w9WgXcQ")
Experimental Stream Tools¶
The stream tools are the TubeBrain live-session surface:
- start_stream
- poll_stream
- stop_stream
- list_streams
start_stream resolves YouTube Live and direct HTTP audio stream URLs into a
live media session. YouTube Live resolution uses the watch page's
hlsManifestUrl and validates the HLS playlist before creating the session.
Live HLS/direct audio ingestion runs in a background session task, fetches newly
advertised audio bytes, and hands audio chunks to the stream transcription
boundary. HLS fMP4 init maps are resolved, cached, and prepended to media
segments before transcription, and inherited HLS byte-range offsets are
normalized before fetch. Direct HTTP audio streams are buffered into bounded
chunks before transcription instead of using raw network packet
boundaries; MP3 streams flush on complete frame boundaries when possible, and
content-type or URL-extension hints are passed to the decoder for MP3/AAC/MP4
probing. Whisper builds also demux MPEG-TS live chunks to ADTS AAC before
Symphonia decode. YouTube HLS ingestion primes the request cookie jar from the
watch page, solves player n challenges with Node and pinned yt-dlp EJS solver
scripts, and applies the solved value to manifest and segment URLs before fetch.
Missing Node, unavailable solver assets, or EJS hash mismatches fail the session
with explicit diagnostics. In po-token builds, GVS/session PoToken attachment
remains available behind TUBEBRAIN_ATTACH_YOUTUBE_GVS_POT=1; it is not attached
by default for the current validated parity path. PoTokens and signed
Googlevideo URL values are redacted from stream diagnostics.
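The byte-range normalization mentioned above reflects how the HLS specification (RFC 8216) defines EXT-X-BYTERANGE: when a segment omits the offset, its sub-range starts at the byte after the previous segment's sub-range in the same resource. The following is a minimal sketch of that rule, not the project's implementation; the function name and the (uri, spec) input shape are hypothetical.

```python
from typing import List, Optional, Tuple


def normalize_byteranges(entries: List[Tuple[str, str]]) -> List[Tuple[str, int, int]]:
    """Resolve "n[@o]" byte-range specs into explicit (uri, offset, length) tuples."""
    resolved: List[Tuple[str, int, int]] = []
    prev_uri: Optional[str] = None
    next_offset = 0
    for uri, spec in entries:
        length_text, _, offset_text = spec.partition("@")
        length = int(length_text)
        if offset_text:
            # Explicit offset: use it as written.
            offset = int(offset_text)
        elif uri == prev_uri:
            # Inherited offset: continue right after the previous sub-range.
            offset = next_offset
        else:
            # The spec requires a preceding sub-range of the same resource.
            raise ValueError(f"byte range for {uri!r} has no offset to inherit")
        resolved.append((uri, offset, length))
        prev_uri, next_offset = uri, offset + length
    return resolved
```

With explicit offsets in hand, each segment can be fetched with an independent HTTP Range request, which is why resolving them before fetch is convenient.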
Default builds use a no-op chunk transcriber and immediately report
health: "degraded" with a last_error explaining that live STT requires a
--features whisper build. Builds with the whisper feature attempt local
Whisper transcription over 15-second live windows with 5 seconds of overlap and
duplicate filtering. Whisper-backed sessions seed last_diagnostic at startup
with the active model, window, and overlap so agents can tell local STT is
enabled before the first audio window fills. Live STT quality remains
experimental, so poll_stream may still return an empty segment list for a
successfully started session.
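To make the window and overlap numbers concrete, the sketch below lists the audio spans that would be transcribed for a given amount of buffered audio. It assumes each new window starts one stride (window minus overlap) after the previous one; live_windows is a hypothetical name and the real scheduling may differ.

```python
from typing import Iterator, Tuple


def live_windows(
    total_secs: int,
    window_secs: int = 15,
    overlap_secs: int = 5,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) second offsets for each full window in the buffered audio."""
    if not 0 <= overlap_secs < window_secs:
        raise ValueError("overlap must be non-negative and less than the window")
    stride = window_secs - overlap_secs
    start = 0
    # Only emit windows that are completely filled.
    while start + window_secs <= total_secs:
        yield (start, start + window_secs)
        start += stride
```

With the defaults, consecutive windows share 5 seconds of audio, which is why text from the shared region can appear twice and needs duplicate filtering. It also shows why a fresh session has nothing to return until the first 15 seconds have been buffered.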
poll_stream includes stream health, last_diagnostic for the latest non-error
ingestion/transcription activity, and last_error for degraded or failed
sessions. Use these fields together to distinguish unavailable live STT, active
buffering, silence, and no-speech detection from actual failures.
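One way an agent might combine these fields is sketched below. The field names health, last_diagnostic, and last_error come from the description above; the segments key, the "failed" health value, and the returned labels are assumptions made for illustration.

```python
from typing import Any, Dict


def classify_poll(poll: Dict[str, Any]) -> str:
    """Map a poll_stream result to a coarse state (hypothetical labels)."""
    health = poll.get("health")
    if health == "failed":
        # Assumed value for sessions that stopped with an error.
        return "failed"
    if health == "degraded" or poll.get("last_error"):
        # For example, a default build without live STT; read last_error.
        return "degraded"
    if poll.get("segments"):
        return "transcribing"
    if poll.get("last_diagnostic"):
        # Buffering, silence, or no speech; read last_diagnostic for which.
        return "waiting"
    return "starting"
```

The useful property is that an empty segment list alone is never treated as a failure; only health and last_error decide that.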
Manual proof packets¶
The GTM proof harness emits compact, redacted JSON packets for the two MVP live source classes. These commands are manual network smokes, not default tests:
just stream-proof-radio
just stream-proof-youtube-live
just stream-proof-both
The default radio fixture is SomaFM Groove Salad. Override it with
TUBEBRAIN_PROOF_RADIO_URL or pass --url. The default YouTube Live fixture is
Lofi Girl's public livestream; override it with TUBEBRAIN_PROOF_YOUTUBE_URL or
pass --url.
Default builds prove the session and diagnostic path: the packet should report a
stable diagnostic_observed state explaining that live STT needs a Whisper
build. To prove actual transcript chunks, run:
just stream-proof-whisper --source radio
just stream-proof-whisper --source youtube-live
Proof output includes source kind, redacted source URL, resolved platform, session ID, first-signal latency, first-chunk latency when segments appear, final cursor, health, sample transcript segments, and redacted diagnostics. Proof packets must not expose cookies, signed media URLs, PoTokens, BotGuard internals, or raw audio.
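As an illustration of what a redacted source URL could look like, the sketch below keeps the scheme, host, path, and query parameter names but blanks every query value and drops the fragment. This is a conservative stand-in, not the harness's redaction logic; redact_url is a hypothetical name, and signed values carried in URL path segments would need separate handling.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Replace every query value with a placeholder and drop the fragment."""
    parts = urlsplit(url)
    # Keep parameter names so the packet stays diagnosable, hide all values.
    query = [(key, "REDACTED") for key, _ in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```

Blanking all values rather than a fixed list of sensitive names avoids leaking a parameter nobody thought to add to the list.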
Feature flags¶
Optional features are compile-time gated:
cargo build --features po-token
cargo build --features whisper
cargo build --features po-token,whisper
- po-token: BotGuard proof-of-origin token support
- whisper: local Whisper fallback for videos without captions
Whisper builds default to base.en. Set TUBEBRAIN_WHISPER_MODEL to tiny.en,
base.en, or small.en to pick the local model. For live sessions,
TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS and TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS
override the default 15-second window and 5-second overlap. The overlap must be
less than the window, and windows are capped at 120 seconds.
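The constraints above can be sketched as a small validation routine. The environment variable names, defaults, and limits come from this section; the function name, the choice to clamp oversized windows to 120 seconds rather than reject them, and the error handling are assumptions.

```python
from typing import Mapping, Tuple

ALLOWED_MODELS = ("tiny.en", "base.en", "small.en")
MAX_WINDOW_SECS = 120


def live_whisper_settings(env: Mapping[str, str]) -> Tuple[str, int, int]:
    """Return (model, window_secs, overlap_secs) from an environment mapping."""
    model = env.get("TUBEBRAIN_WHISPER_MODEL", "base.en")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unsupported model: {model}")

    window = int(env.get("TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS", "15"))
    overlap = int(env.get("TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS", "5"))
    if window <= 0:
        raise ValueError("window must be positive")

    # Assumption: oversized windows are clamped to the cap, not rejected.
    window = min(window, MAX_WINDOW_SECS)
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and less than the window")
    return model, window, overlap
```

Passing os.environ as env reads the real process environment; passing a plain dict keeps the routine easy to test.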
Related docs: