Tools

tubebrain exposes four stable VOD MCP tools and an experimental live-stream session surface.

For the canonical in-session usage pattern, see the Agent Guide. The short version is:

  1. use get_metadata first
  2. use list_languages when language choice matters
  3. use get_transcript_section for timestamp-scoped work, or get_transcript when you are ready to consume the full transcript output
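
The ordering above can be sketched as a small planning helper. This is only an illustration, not part of tubebrain: the function name plan_calls and its language_matters flag are hypothetical, and treating a t= query parameter as the signal for timestamp-scoped work is an assumption.

```python
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import parse_qs, urlparse

ToolCall = Tuple[str, Dict[str, Any]]


def plan_calls(url: str, language_matters: bool = False,
               at_s: Optional[int] = None) -> List[ToolCall]:
    """Return the tool calls to make, in the recommended order.

    Hypothetical helper: it only builds (tool name, params) pairs and
    does not talk to an MCP server.
    """
    # Step 1: always start with metadata.
    calls: List[ToolCall] = [("get_metadata", {"url": url})]

    # Step 2: list languages only when the choice matters.
    if language_matters:
        calls.append(("list_languages", {"url": url}))

    # Step 3: prefer the section tool when a timestamp is available
    # (assumption: an explicit at_s or a t= query parameter in the URL).
    has_timestamp = at_s is not None or "t" in parse_qs(urlparse(url).query)
    if has_timestamp:
        params: Dict[str, Any] = {"url": url}
        if at_s is not None:
            params["at_s"] = at_s
        calls.append(("get_transcript_section", params))
    else:
        calls.append(("get_transcript", {"url": url}))
    return calls
```

For a URL such as https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s, this plan is get_metadata followed by get_transcript_section.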

get_transcript

Fetch a transcript in one of the supported output formats.

Example:

get_transcript(url: "https://youtube.com/watch?v=dQw4w9WgXcQ", format: "markdown")

Parameters:

  • url (required)
  • lang (default: en)
  • format (default: json)

Supported formats:

  • json
  • markdown
  • srt
  • vtt
  • text

For auto-captioned videos, the VOD resolver starts from the public /watch player response and may refresh stale timedtext URLs through the Android VR player response using the same visitor data from the watch page. This keeps the normal transcript tools working for videos whose web caption URL is present but returns an empty body.

get_transcript_section

Fetch only the transcript window around a timestamp. This is the preferred tool for timestamped research prompts because it keeps the agent focused on the relevant section while preserving original segment timestamps.

Example:

get_transcript_section(url: "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s", format: "json")

Parameters:

  • url (required)
  • lang (default: en)
  • at_s (optional; defaults to the YouTube timestamp in url)
  • before_s (default: 120)
  • after_s (default: 600)
  • format (default: json)

When format=json, the response includes anchor_ms, window_start_ms, and window_end_ms in addition to the filtered segments. Other formats reuse the normal transcript renderers with the same filtered segment list.
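
To make the window fields concrete, here is a minimal sketch of how the three values could be derived from the parameters. It is not tubebrain's implementation: it assumes the window runs from at_s - before_s to at_s + after_s, that the start is clamped at zero, and that the URL timestamp is a plain seconds value such as t=2449s (forms like 1h2m3s are not handled).

```python
import re
from typing import Dict, Optional
from urllib.parse import parse_qs, urlparse


def timestamp_from_url(url: str) -> Optional[int]:
    """Read a plain-seconds t= parameter (e.g. t=2449s) from a URL."""
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    match = re.fullmatch(r"(\d+)s?", values[0])
    return int(match.group(1)) if match else None


def section_window(url: str, at_s: Optional[int] = None,
                   before_s: int = 120, after_s: int = 600) -> Dict[str, int]:
    """Compute anchor_ms, window_start_ms, and window_end_ms.

    Assumption: the real tool may treat edge cases differently.
    """
    anchor_s = at_s if at_s is not None else timestamp_from_url(url)
    if anchor_s is None:
        raise ValueError("no at_s given and no timestamp found in url")
    start_s = max(0, anchor_s - before_s)  # assumed clamp at video start
    end_s = anchor_s + after_s
    return {
        "anchor_ms": anchor_s * 1000,
        "window_start_ms": start_s * 1000,
        "window_end_ms": end_s * 1000,
    }
```

Under these assumptions, the example URL above with the default before_s and after_s gives an anchor of 2,449,000 ms and a window from 2,329,000 ms to 3,049,000 ms.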

list_languages

List available caption languages for a video.

Example:

list_languages(url: "https://youtube.com/watch?v=dQw4w9WgXcQ")

get_metadata

Fetch video title, channel, duration, and caption availability.

Example:

get_metadata(url: "https://youtube.com/watch?v=dQw4w9WgXcQ")

Experimental Stream Tools

The stream tools are the TubeBrain live-session surface:

  • start_stream
  • poll_stream
  • stop_stream
  • list_streams

start_stream resolves YouTube Live and direct HTTP audio stream URLs into a live media session. YouTube Live resolution uses the watch page's hlsManifestUrl and validates the HLS playlist before creating the session.

Live HLS/direct audio ingestion runs in a background session task, fetches newly advertised audio bytes, and hands audio chunks to the stream transcription boundary. HLS fMP4 init maps are resolved, cached, and prepended to media segments before transcription, and inherited HLS byte-range offsets are normalized before fetch.

Direct HTTP audio streams are buffered into bounded chunks before transcription instead of using raw network packet boundaries; MP3 streams flush on complete frame boundaries when possible, and content-type or URL-extension hints are passed to the decoder for MP3/AAC/MP4 probing. Whisper builds also demux MPEG-TS live chunks to ADTS AAC before Symphonia decode.

YouTube HLS ingestion primes the request cookie jar from the watch page, solves player n challenges with Node and pinned yt-dlp EJS solver scripts, and applies the solved value to manifest and segment URLs before fetch. Missing Node, unavailable solver assets, or EJS hash mismatches fail the session with explicit diagnostics. In po-token builds, GVS/session PoToken attachment remains available behind TUBEBRAIN_ATTACH_YOUTUBE_GVS_POT=1; it is not attached by default for the current validated parity path. PoTokens and signed Googlevideo URL values are redacted from stream diagnostics.

Default builds use a no-op chunk transcriber and immediately report health: "degraded" with a last_error explaining that live STT requires a --features whisper build. Builds with the whisper feature attempt local Whisper transcription over 15-second live windows with 5 seconds of overlap and duplicate filtering. Whisper-backed sessions seed last_diagnostic at startup with the active model, window, and overlap so agents can tell local STT is enabled before the first audio window fills. Live STT quality remains experimental, so poll_stream may still return an empty segment list for a successfully started session.

poll_stream includes stream health, last_diagnostic for the latest non-error ingestion/transcription activity, and last_error for degraded or failed sessions. Use these fields together to distinguish unavailable live STT, active buffering, silence, and no-speech detection from actual failures.
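
One way an agent might combine these fields is sketched below. The response shape is an assumption: the keys health, last_diagnostic, and last_error follow the names used above, the segments key is a guess at how the segment list is exposed, and "degraded" is the only health value this page documents. The labels returned by triage_poll are hypothetical.

```python
from typing import Any, Mapping


def triage_poll(response: Mapping[str, Any]) -> str:
    """Classify a poll_stream response into a coarse state.

    Hypothetical helper; key names other than those documented are
    assumptions about the response shape.
    """
    # Transcript segments arrived, so STT is producing output.
    if response.get("segments"):
        return "transcript"

    # An error is recorded. A "degraded" session is still alive (for
    # example a default build without live STT); anything else is
    # treated here as a failure.
    if response.get("last_error"):
        if response.get("health") == "degraded":
            return "degraded"
        return "failed"

    # No error and no segments: buffering, silence, or no speech.
    # last_diagnostic carries the detail to show the user.
    if response.get("last_diagnostic"):
        return "waiting"
    return "unknown"
```

An empty segment list alone is therefore not treated as a failure; the sketch only reports a problem when last_error is set.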

Manual proof packets

The GTM proof harness emits compact, redacted JSON packets for the two MVP live source classes. These commands are manual smoke tests that hit the network; they are not part of the default test suite:

just stream-proof-radio
just stream-proof-youtube-live
just stream-proof-both

The default radio fixture is SomaFM Groove Salad. Override it with TUBEBRAIN_PROOF_RADIO_URL or pass --url. The default YouTube Live fixture is Lofi Girl's public livestream; override it with TUBEBRAIN_PROOF_YOUTUBE_URL or pass --url.

Default builds prove the session and diagnostic path: the packet should report a stable diagnostic_observed state explaining that live STT needs a Whisper build. To prove actual transcript chunks, run:

just stream-proof-whisper --source radio
just stream-proof-whisper --source youtube-live

Proof output includes source kind, redacted source URL, resolved platform, session ID, first-signal latency, first-chunk latency when segments appear, final cursor, health, sample transcript segments, and redacted diagnostics. Proof packets must not expose cookies, signed media URLs, PoTokens, BotGuard internals, or raw audio.
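
As an illustration of the redaction requirement, the sketch below replaces every query-parameter value in a URL with a placeholder while keeping the parameter names. It is an assumption about what a reasonable redaction could look like, not the harness's actual behavior; the function name redact_url and the placeholder string are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_url(url: str, placeholder: str = "REDACTED") -> str:
    """Replace all query values with a placeholder and drop the fragment.

    Hypothetical helper: signed media URLs carry their secrets in query
    values, so the parameter names are kept and the values are hidden.
    """
    parts = urlsplit(url)
    redacted_pairs = [
        (key, placeholder)
        for key, _value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(redacted_pairs), "")
    )
```

Keeping the parameter names makes a redacted diagnostic still useful for debugging, since it shows which parameters were present without revealing their values.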

Feature flags

Optional features are compile-time gated:

cargo build --features po-token
cargo build --features whisper
cargo build --features po-token,whisper

  • po-token: BotGuard proof-of-origin token support
  • whisper: local Whisper fallback for videos without captions, plus local transcription for live stream sessions

Whisper builds default to base.en. Set TUBEBRAIN_WHISPER_MODEL to tiny.en, base.en, or small.en to pick the local model. For live sessions, TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS and TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS override the default 15-second window and 5-second overlap. The overlap must be less than the window, and windows are capped at 120 seconds.
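
The window and overlap rules can be sketched as follows. This is an illustration under stated assumptions, not tubebrain's code: it assumes an oversized window is clamped to 120 seconds rather than rejected, that an invalid overlap is an error, and that consecutive windows advance by window minus overlap. The function names are hypothetical.

```python
import os
from typing import List, Mapping, Optional, Tuple

DEFAULT_WINDOW_SECS = 15
DEFAULT_OVERLAP_SECS = 5
MAX_WINDOW_SECS = 120


def live_whisper_config(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Resolve (window_secs, overlap_secs) from the environment."""
    env = os.environ if env is None else env
    window = int(env.get("TUBEBRAIN_LIVE_WHISPER_WINDOW_SECS", DEFAULT_WINDOW_SECS))
    overlap = int(env.get("TUBEBRAIN_LIVE_WHISPER_OVERLAP_SECS", DEFAULT_OVERLAP_SECS))

    # Assumption: "capped" means clamped rather than rejected.
    window = min(window, MAX_WINDOW_SECS)

    # Documented rule: the overlap must be less than the window.
    if overlap >= window:
        raise ValueError("overlap must be less than the window")
    return window, overlap


def window_starts(window: int, overlap: int, count: int) -> List[int]:
    """Start offsets (seconds) of the first `count` windows.

    Assumption: each window begins `window - overlap` seconds after
    the previous one.
    """
    step = window - overlap
    return [index * step for index in range(count)]
```

With the defaults, each window shares its last 5 seconds with the start of the next one, so the first four windows would begin at 0, 10, 20, and 30 seconds under this stepping assumption.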

Related docs: