# RFC: Live Stream Agent Transcription

Date: 2026-04-16 · Status: exploratory · Owner: open
## Summary
Extend yt-text beyond static YouTube videos into a greenfield MCP feature set
for live audio and live video transcription.
The target outcome is an agent-friendly path from live media input to structured incremental text, suitable for personal AI, accessibility, and monitoring workflows.
Candidate sources include:
- YouTube Live
- internet radio streams
- Twitch
- Kick
- broadcast or generic audio streams
## Why This Matters
Current yt-text workflows are strongest for:
- public YouTube videos
- caption discovery
- transcript extraction after the fact
That leaves a large adjacent space open:
- live captioning for accessibility agents
- realtime summarization of streams
- background monitoring of audio channels
- alerting or note-taking agents that need structured text as events arrive
This is a materially different utility class than "fetch me the transcript for
an existing video." It turns yt-text into a live data source for agents.
## User Stories
- As a personal accessibility agent, I want to transcribe a YouTube Live stream into structured text while it is happening.
- As a monitoring agent, I want to watch a radio or broadcast stream and emit timestamped text chunks for summarization and alerting.
- As a research assistant, I want to follow a live Twitch or Kick stream and keep a structured rolling transcript for later search.
- As a workflow agent, I want transcript chunks to arrive incrementally instead of waiting for a complete recording.
## Proposed Capability Shape
This work likely needs more than a single `get_transcript` extension.
Possible MCP-facing shapes:
- session-based tools such as `start_live_transcript`, `poll_live_transcript`, and `stop_live_transcript`
- chunked transcript tools that return only new segments after a cursor
- future protocol exploration for push-style or event-style delivery, where a client can actually consume incremental updates
For near-term compatibility, a polling model is probably the safest first design: many current MCP clients handle request/response tools better than long-lived streams.
## Architecture Themes
Likely building blocks:
- source adapters for YouTube Live, Twitch, Kick, radio, and generic stream URLs
- audio normalization and chunking
- transcript segmentation with timestamps and source metadata
- buffering and cursoring so agents can ask for only the new material
- clear distinction between upstream captions and locally transcribed audio
- backpressure controls so long-lived sessions do not grow without bound
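The buffering, cursoring, and backpressure bullets could combine into one bounded buffer that evicts old chunks while still honoring absolute cursors. This is a hypothetical sketch, not existing yt-text code:

```python
from collections import deque

# Illustrative bounded chunk buffer: agents read by absolute cursor,
# and the oldest chunks are evicted so a long-lived session does not
# grow without bound. All names here are assumptions.

class ChunkBuffer:
    def __init__(self, max_chunks: int = 1000):
        self._chunks = deque()
        self._base = 0          # absolute index of the oldest retained chunk
        self._max = max_chunks

    def append(self, chunk: str) -> None:
        self._chunks.append(chunk)
        if len(self._chunks) > self._max:
            self._chunks.popleft()  # backpressure by eviction, not blocking
            self._base += 1

    def read_after(self, cursor: int) -> tuple[list, int]:
        """Return chunks newer than `cursor` and the updated cursor.

        A cursor older than the eviction horizon silently skips ahead;
        a real design would likely surface that gap to the agent.
        """
        start = max(cursor, self._base)
        new = list(self._chunks)[start - self._base:]
        return new, start + len(new)
```

Eviction rather than blocking keeps the transcription pipeline moving even when an agent polls slowly; the open question flagged in the sketch is whether a skipped-ahead cursor should be reported as data loss.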
The current VideoSource and TranscriptSource abstractions suggest a path,
but live sessions will add state and lifecycle concerns that static-video flows
do not have today.
## Output Expectations
The incremental output should stay structured.
Likely fields per chunk:
- session ID or stream ID
- source platform
- language
- transcript source type
- chunk timestamps
- text content
- confidence or provenance metadata where available
This should remain machine-friendly first, with human-readable formats layered on afterward.
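As a sketch of a machine-friendly chunk, the fields above might map onto a dataclass like the following; the field names and value conventions are assumptions, not a fixed schema:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical chunk shape; the RFC lists the fields but does not
# pin down names, types, or encodings.

@dataclass
class TranscriptChunk:
    session_id: str
    platform: str                        # e.g. "youtube_live", "radio"
    language: str                        # e.g. a BCP 47 tag such as "en"
    source_type: str                     # upstream captions vs. local STT
    start_time: float                    # seconds from stream start
    end_time: float
    text: str
    confidence: Optional[float] = None   # omitted when the backend gives none

chunk = TranscriptChunk(
    session_id="live-1", platform="youtube_live", language="en",
    source_type="local_stt", start_time=12.0, end_time=15.5,
    text="welcome back to the stream", confidence=0.91,
)
print(json.dumps(asdict(chunk)))
```

Serializing straight to JSON keeps the machine-friendly-first goal; a human-readable renderer can consume the same structure later.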
## Risks And Open Questions
- platform auth and access rules vary sharply across live sources
- long-lived sessions may not map cleanly onto every MCP client
- latency, buffering, and reconnection policy need explicit design
- diarization and speaker attribution may be expensive or unavailable
- local STT on continuous streams may have materially different CPU and model requirements than the current Whisper fallback
- stream moderation, abuse, and retention policies need thought if transcripts become persistent artifacts
## Non-Goals For A First Version
- hosted SaaS ingestion
- full meeting-assistant product scope
- speaker diarization guarantees
- broad media-library management
- replacing specialized broadcast-caption pipelines
## Suggested Next Step
Before implementation, break this space into a smaller design spike:
- define the MCP session model
- pick one initial live source, likely YouTube Live or generic audio streams
- prove incremental chunk delivery with a polling-based tool design
- measure latency, CPU cost, and transcript quality against the current static Whisper path
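For the measurement step of the spike, a minimal polling harness could record round-trip latency per poll. The `poll` callable's signature is assumed from the polling-based tool design discussed earlier; nothing here is existing yt-text code:

```python
import time

# Illustrative latency harness for the design spike: poll a
# hypothetical live-transcript tool and average the round-trip time.

def measure_poll_latency(poll, session_id: str,
                         rounds: int = 10, interval: float = 1.0) -> float:
    """Poll `rounds` times, sleeping `interval` seconds between polls,
    and return the mean round-trip latency in seconds."""
    cursor, latencies = 0, []
    for _ in range(rounds):
        t0 = time.monotonic()
        result = poll(session_id, cursor)
        latencies.append(time.monotonic() - t0)
        cursor = result["cursor"]       # advance past chunks already seen
        time.sleep(interval)
    return sum(latencies) / len(latencies)
```

Comparable numbers for the static Whisper path (total wall-clock time per minute of audio) would make the live-versus-static tradeoff concrete.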