RFC: Live Stream Agent Transcription

Date: 2026-04-16 Status: exploratory Owner: open

Summary

Extend yt-text beyond static YouTube videos into a greenfield MCP feature set for live audio and live video transcription.

The target outcome is an agent-friendly path from live media input to structured incremental text, suitable for personal AI, accessibility, and monitoring workflows.

Candidate sources include:

  • YouTube Live
  • internet radio streams
  • Twitch
  • Kick
  • broadcast or generic audio streams

Why This Matters

Current yt-text workflows are strongest for:

  • public YouTube videos
  • caption discovery
  • transcript extraction after the fact

That leaves a large adjacent space open:

  • live captioning for accessibility agents
  • realtime summarization of streams
  • background monitoring of audio channels
  • alerting or note-taking agents that need structured text as events arrive

This is a materially different utility class than "fetch me the transcript for an existing video." It turns yt-text into a live data source for agents.

User Stories

  • As a personal accessibility agent, I want to transcribe a YouTube Live stream into structured text while it is happening.
  • As a monitoring agent, I want to watch a radio or broadcast stream and emit timestamped text chunks for summarization and alerting.
  • As a research assistant, I want to follow a live Twitch or Kick stream and keep a structured rolling transcript for later search.
  • As a workflow agent, I want transcript chunks to arrive incrementally instead of waiting for a complete recording.

Proposed Capability Shape

This work likely needs more than a single get_transcript extension.

Possible MCP-facing shapes:

  • session-based tools such as start_live_transcript, poll_live_transcript, and stop_live_transcript
  • chunked transcript tools that return only new segments after a cursor
  • future protocol exploration for push-style or event-style delivery where a client can actually consume incremental updates

For near-term compatibility, a polling model is probably the safest first design. Many current MCP clients handle request/response tools better than long-lived streams.

Architecture Themes

Likely building blocks:

  • source adapters for YouTube Live, Twitch, Kick, radio, and generic stream URLs
  • audio normalization and chunking
  • transcript segmentation with timestamps and source metadata
  • buffering and cursoring so agents can ask for only the new material
  • clear distinction between upstream captions and locally transcribed audio
  • backpressure controls so long-lived sessions do not grow without bound

The current VideoSource and TranscriptSource abstractions suggest a path, but live sessions will add state and lifecycle concerns that static-video flows do not have today.
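The buffering, cursoring, and backpressure themes above can be sketched together as a bounded chunk buffer that hands out absolute sequence numbers, so an agent's cursor stays valid even after old chunks are evicted. The class and field names are illustrative, not part of any existing yt-text abstraction.

```python
# Illustrative sketch: a bounded buffer with absolute sequence numbers.
# deque(maxlen=...) gives the backpressure behavior: the oldest chunks
# drop silently once the session buffer is full.
from collections import deque

class ChunkBuffer:
    def __init__(self, max_chunks: int = 1000):
        self._chunks: deque = deque(maxlen=max_chunks)
        self._next_seq = 0  # absolute sequence number of the next chunk

    def append(self, chunk: dict) -> int:
        """Store a chunk and return its sequence number."""
        seq = self._next_seq
        self._chunks.append((seq, chunk))
        self._next_seq += 1
        return seq

    def read_after(self, cursor: int) -> tuple[list[dict], int]:
        """Return chunks with seq >= cursor and the new cursor position."""
        new = [chunk for seq, chunk in self._chunks if seq >= cursor]
        return new, self._next_seq
```

Because cursors are absolute rather than positional, an agent that polls slowly can detect loss: if the first returned sequence number is ahead of its cursor, chunks were evicted in between.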

Output Expectations

The incremental output should stay structured.

Likely fields per chunk:

  • session ID or stream ID
  • source platform
  • language
  • transcript source type
  • chunk timestamps
  • text content
  • confidence or provenance metadata where available

This should remain machine-friendly first, with human-readable formats layered on afterward.

Risks And Open Questions

  • platform auth and access rules vary sharply across live sources
  • long-lived sessions may not map cleanly onto every MCP client
  • latency, buffering, and reconnection policy need explicit design
  • diarization and speaker attribution may be expensive or unavailable
  • local STT on continuous streams may have materially different CPU and model requirements than the current Whisper fallback
  • stream moderation, abuse, and retention policies need thought if transcripts become persistent artifacts

Non-Goals For A First Version

  • hosted SaaS ingestion
  • full meeting-assistant product scope
  • speaker diarization guarantees
  • broad media-library management
  • replacing specialized broadcast-caption pipelines

Suggested Next Step

Before implementation, break this space into a smaller design spike:

  1. define the MCP session model
  2. pick one initial live source, likely YouTube Live or generic audio streams
  3. prove incremental chunk delivery with a polling-based tool design
  4. measure latency, CPU cost, and transcript quality against the current static Whisper path
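For step 3, the agent side of incremental chunk delivery can be proven with a small cursor-based polling loop. The tool names and payload shapes below are this RFC's hypotheses, not an implemented API; `call_tool` stands in for a generic MCP client call.

```python
# Illustrative agent-side loop: poll with a cursor until the session closes,
# then stop the session. Tool names and payloads are hypothetical.
import time

def follow_stream(call_tool, url: str, interval: float = 2.0):
    session = call_tool("start_live_transcript", {"url": url})
    cursor = 0
    try:
        while True:
            page = call_tool("poll_live_transcript",
                             {"session_id": session["session_id"], "cursor": cursor})
            for chunk in page["chunks"]:
                yield chunk  # hand each new chunk to the agent as it arrives
            cursor = page["next_cursor"]
            if page.get("closed"):
                break
            time.sleep(interval)
    finally:
        call_tool("stop_live_transcript", {"session_id": session["session_id"]})
```

Instrumenting this loop (time from chunk start_time to local receipt) would also give the latency measurement called for in step 4.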