RFC: Live Stream Agent Transcription

Date: 2026-04-16 Status: exploratory Owner: open

Summary

Extend yt-text beyond static YouTube videos into a greenfield MCP feature set for live audio and live video transcription.

The target outcome is an agent-friendly path from live media input to structured incremental text, suitable for personal AI, accessibility, and monitoring workflows.

Candidate sources include:

  • YouTube Live
  • internet radio streams
  • Twitch
  • Kick
  • broadcast or generic audio streams

Why This Matters

Current yt-text workflows are strongest for:

  • public YouTube videos
  • caption discovery
  • transcript extraction after the fact

That leaves a large adjacent space open:

  • live captioning for accessibility agents
  • realtime summarization of streams
  • background monitoring of audio channels
  • alerting or note-taking agents that need structured text as events arrive

This is a materially different utility class than "fetch me the transcript for an existing video." It turns yt-text into a live data source for agents.

User Stories

  • As a personal accessibility agent, I want to transcribe a YouTube Live stream into structured text while it is happening.
  • As a monitoring agent, I want to watch a radio or broadcast stream and emit timestamped text chunks for summarization and alerting.
  • As a research assistant, I want to follow a live Twitch or Kick stream and keep a structured rolling transcript for later search.
  • As a workflow agent, I want transcript chunks to arrive incrementally instead of waiting for a complete recording.

Proposed Capability Shape

This work likely needs more than a single get_transcript extension.

Possible MCP-facing shapes:

  • session-based tools such as start_live_transcript, poll_live_transcript, and stop_live_transcript
  • chunked transcript tools that return only new segments after a cursor
  • future protocol exploration for push-style or event-style delivery where a client can actually consume incremental updates

For near-term compatibility, a polling model is probably the safest first design. Many current MCP clients handle request/response tools better than long-lived streams.

Architecture Themes

Likely building blocks:

  • source adapters for YouTube Live, Twitch, Kick, radio, and generic stream URLs
  • audio normalization and chunking
  • transcript segmentation with timestamps and source metadata
  • buffering and cursoring so agents can ask for only the new material
  • clear distinction between upstream captions and locally transcribed audio
  • backpressure controls so long-lived sessions do not grow without bound

The current VideoSource and TranscriptSource abstractions suggest a path, but live sessions will add state and lifecycle concerns that static-video flows do not have today.
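The buffering, cursoring, and backpressure themes above can be sketched together as a bounded chunk buffer that hands out absolute sequence numbers, so an agent's cursor stays valid even after old chunks are evicted. The class and field names are illustrative, not part of any existing yt-text abstraction.

```python
# Illustrative sketch: a bounded buffer with absolute sequence numbers.
# deque(maxlen=...) gives the backpressure behavior: the oldest chunks
# drop silently once the session buffer is full.
from collections import deque

class ChunkBuffer:
    def __init__(self, max_chunks: int = 1000):
        self._chunks: deque = deque(maxlen=max_chunks)
        self._next_seq = 0  # absolute sequence number of the next chunk

    def append(self, chunk: dict) -> int:
        """Store a chunk and return its sequence number."""
        seq = self._next_seq
        self._chunks.append((seq, chunk))
        self._next_seq += 1
        return seq

    def read_after(self, cursor: int) -> tuple[list[dict], int]:
        """Return chunks with seq >= cursor and the new cursor position."""
        new = [chunk for seq, chunk in self._chunks if seq >= cursor]
        return new, self._next_seq
```

Because cursors are absolute rather than positional, an agent that polls slowly can detect loss: if the first returned sequence number is ahead of its cursor, chunks were evicted in between.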

Output Expectations

The incremental output should stay structured.

Likely fields per chunk:

  • session ID or stream ID
  • source platform
  • language
  • transcript source type
  • chunk timestamps
  • text content
  • confidence or provenance metadata where available

This should remain machine-friendly first, with human-readable formats layered on afterward.

Risks And Open Questions

  • platform auth and access rules vary sharply across live sources
  • long-lived sessions may not map cleanly onto every MCP client
  • latency, buffering, and reconnection policy need explicit design
  • diarization and speaker attribution may be expensive or unavailable
  • local STT on continuous streams may have materially different CPU and model requirements than the current Whisper fallback
  • stream moderation, abuse, and retention policies need thought if transcripts become persistent artifacts

Non-Goals For A First Version

  • hosted SaaS ingestion
  • full meeting-assistant product scope
  • speaker diarization guarantees
  • broad media-library management
  • replacing specialized broadcast-caption pipelines

Suggested Next Step

Before implementation, break this space into a smaller design spike:

  1. define the MCP session model
  2. pick one initial live source, likely YouTube Live or generic audio streams
  3. prove incremental chunk delivery with a polling-based tool design
  4. measure latency, CPU cost, and transcript quality against the current static Whisper path
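For step 3, the agent side of incremental chunk delivery can be proven with a small cursor-based polling loop. The tool names and payload shapes below are this RFC's hypotheses, not an implemented API; `call_tool` stands in for a generic MCP client call.

```python
# Illustrative agent-side loop: poll with a cursor until the session closes,
# then stop the session. Tool names and payloads are hypothetical.
import time

def follow_stream(call_tool, url: str, interval: float = 2.0):
    session = call_tool("start_live_transcript", {"url": url})
    cursor = 0
    try:
        while True:
            page = call_tool("poll_live_transcript",
                             {"session_id": session["session_id"], "cursor": cursor})
            for chunk in page["chunks"]:
                yield chunk  # hand each new chunk to the agent as it arrives
            cursor = page["next_cursor"]
            if page.get("closed"):
                break
            time.sleep(interval)
    finally:
        call_tool("stop_live_transcript", {"session_id": session["session_id"]})
```

Instrumenting this loop (time from chunk start_time to local receipt) would also give the latency measurement called for in step 4.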