RFC: TubeBrain Trait Evolution

Date: 2026-04-24
Status: draft
Owner: Jess Sullivan

Summary

Evolve tubebrain's two-trait architecture into a multi-platform, stream-capable system under the working name TubeBrain. The core design principle from the original spec — "Layer 1 carries all ToS sensitivity, Layer 2 is clean computation" — extends to cover YouTube Live, Twitch, Kick, HTTP audio streams, microphone input, and audio fingerprinting.

This RFC defines the trait hierarchy, type system, MCP surface expansion, session lifecycle, and PBT contract boundaries.

Motivation

tubebrain v0.1.9 handles YouTube VOD transcripts well. The adjacent space is larger and more valuable:

  • Live captioning for accessibility agents
  • Realtime stream summarization (YouTube Live, Twitch, Kick)
  • Background audio monitoring (radio, podcast feeds, ambient mic)
  • Audio fingerprinting / recognition ("what is playing right now?")
  • Hosted SaaS layer for instant agent bootstrap

All of these share the same computational core — turn audio into structured text — but differ in how media is resolved, how audio arrives, and how transcripts are delivered.

Design Principles

  1. Layer 1 is ToS-sensitive, Layer 2 is clean computation — preserved.
  2. VOD and Live are peers, not parent-child — neither is a special case of the other. They share output types but have different lifecycle contracts.
  3. Session-based streaming uses polling — MCP clients handle request/response better than push streams. Polling with cursors is the pragmatic first design.
  4. Platform adapters are independent crates — each platform adapter is a separate concern with its own auth, rate limiting, and compliance posture.
  5. The Segment is the atom — all paths produce Vec<Segment>. Format rendering happens after, same as today.
  6. PBT contracts at trait boundaries — every trait has testable properties that hold regardless of implementation.

Current Architecture (tubebrain v0.1.9)

VideoSource::resolve(video_id) -> VideoData
    impl: WatchPageSource (YouTube GET /watch)

TranscriptSource::transcribe(TranscriptInput) -> Transcript
    impl: CaptionParser (json3/xml)
    impl: WhisperTranscriber (feature-gated)

fetch_transcript() orchestrates the fallback chain:
    manual captions -> auto captions -> whisper -> error

Types are YouTube-specific: VideoData, CaptionTrack, AudioStream, video_id everywhere.

Proposed Architecture

Layer 0: Media Resolution

A new top-level trait that resolves any supported URI into a media descriptor. This absorbs the current VideoSource role and generalizes it.

/// What kind of media source the user pointed us at.
#[derive(Debug, Clone)]
pub enum MediaUri {
    YouTube(String),        // video ID, URL, or live URL
    Twitch(String),         // channel name or URL
    Kick(String),           // channel name or URL
    HttpAudio(Url),         // Icecast, HLS, podcast, generic audio URL
    AudioDevice(String),    // system audio device identifier
    Unknown(String),        // let resolvers try to claim it
}

/// Resolved media — what we know about the source after resolution.
#[derive(Debug, Clone)]
pub enum MediaDescriptor {
    Vod(VodMedia),
    Live(LiveMedia),
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VodMedia {
    pub id: String,
    pub platform: Platform,
    pub title: String,
    pub channel: String,
    pub duration_ms: u64,
    pub caption_tracks: Vec<CaptionTrack>,
    pub audio_streams: Vec<AudioStream>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LiveMedia {
    pub id: String,
    pub platform: Platform,
    pub title: String,
    pub channel: String,
    pub started_at: Option<u64>,
    pub audio_endpoint: AudioEndpoint,
    pub caption_endpoint: Option<CaptionEndpoint>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Platform {
    YouTube,
    Twitch,
    Kick,
    HttpStream,
    AudioDevice,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum AudioEndpoint {
    DirectUrl { url: String, content_type: String },
    HlsManifest { url: String },
    DashManifest { url: String },
    WebSocket { url: String },
    Device { device_id: String, sample_rate: u32, channels: u16 },
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum CaptionEndpoint {
    TimedText { base_url: String, needs_po_token: bool },
    IrcChat { channel: String },
    WebSocketChat { url: String },
}

/// Layer 0 trait: resolve a URI into a media descriptor.
/// This is the ToS-sensitive layer.
#[async_trait]
pub trait MediaResolver: Send + Sync {
    async fn resolve(&self, uri: &MediaUri) -> Result<MediaDescriptor, SourceError>;
    fn supported_platforms(&self) -> &[Platform];
    fn source_name(&self) -> &'static str;
    fn compliance_note(&self) -> &'static str;
}
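
As a rough illustration of how raw tool input might be mapped onto MediaUri before any resolver runs, the sketch below classifies a string by scheme and host. classify_uri is a hypothetical helper, not part of this RFC; the host list and the "device:" prefix are assumptions, and HttpAudio carries a String instead of Url so the example needs no external crates.

```rust
/// Simplified copy of the RFC's `MediaUri`; `HttpAudio` holds a `String`
/// here (the RFC uses `Url`) so the sketch stays dependency-free.
#[derive(Debug, Clone, PartialEq)]
pub enum MediaUri {
    YouTube(String),
    Twitch(String),
    Kick(String),
    HttpAudio(String),
    AudioDevice(String),
    Unknown(String),
}

/// Hypothetical helper: pick a `MediaUri` variant from raw user input.
/// The host list and the `device:` prefix are assumptions for illustration.
pub fn classify_uri(input: &str) -> MediaUri {
    let trimmed = input.trim();
    if let Some(device) = trimmed.strip_prefix("device:") {
        return MediaUri::AudioDevice(device.to_string());
    }
    // Split "scheme://host/path" by hand; anything that is not http(s)
    // is left as `Unknown` so resolvers can try to claim it.
    let rest = match trimmed.split_once("://") {
        Some((scheme, rest)) if scheme == "http" || scheme == "https" => rest,
        _ => return MediaUri::Unknown(trimmed.to_string()),
    };
    let host = rest.split('/').next().unwrap_or("");
    let host = host.strip_prefix("www.").unwrap_or(host);
    match host {
        "youtube.com" | "m.youtube.com" | "youtu.be" => MediaUri::YouTube(trimmed.to_string()),
        "twitch.tv" => MediaUri::Twitch(trimmed.to_string()),
        "kick.com" => MediaUri::Kick(trimmed.to_string()),
        // Any other http(s) URL is treated as a generic audio stream.
        _ => MediaUri::HttpAudio(trimmed.to_string()),
    }
}
```

A bare video ID such as "dQw4w9WgXcQ" has no scheme, so this sketch returns Unknown and leaves the decision to the resolvers.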

Backwards Compatibility Bridge

The existing VideoSource trait becomes a thin wrapper:

/// Bridge: wraps an existing VideoSource impl so it can be used as a MediaResolver.
#[async_trait]
impl<T: VideoSource> MediaResolver for VideoSourceBridge<T> {
    async fn resolve(&self, uri: &MediaUri) -> Result<MediaDescriptor, SourceError> {
        // The bridge only understands YouTube URIs; everything else is rejected.
        let MediaUri::YouTube(id_or_url) = uri else {
            return Err(SourceError::UnsupportedPlatform);
        };
        // Normalize a bare ID or a full URL down to the video ID.
        let video_id = extract_video_id(id_or_url)?;
        let video_data = self.inner.resolve(&video_id).await?;
        Ok(MediaDescriptor::Vod(VodMedia::from(video_data)))
    }
    // ...
}

This means WatchPageSource continues to work unchanged. New platform adapters implement MediaResolver directly.
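
The bridge relies on extract_video_id, which this RFC does not define. A minimal sketch of what such a helper could look like follows. It assumes IDs are 11 characters from [A-Za-z0-9_-], uses a hypothetical SourceError::InvalidVideoId variant, and matches URL shapes naively by substring, so it is an illustration, not the existing implementation.

```rust
/// Hypothetical error type; the RFC does not list `SourceError` variants,
/// so `InvalidVideoId` is an assumption.
#[derive(Debug, PartialEq)]
pub enum SourceError {
    InvalidVideoId(String),
}

/// Assumption: a video ID is 11 characters drawn from [A-Za-z0-9_-].
fn looks_like_video_id(s: &str) -> bool {
    s.len() == 11
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Accept either a bare ID or a URL and return the ID.
pub fn extract_video_id(id_or_url: &str) -> Result<String, SourceError> {
    let input = id_or_url.trim();
    if looks_like_video_id(input) {
        return Ok(input.to_string());
    }
    // Naive substring markers; the first one present wins.
    let tail = ["youtu.be/", "v=", "/live/", "/shorts/"]
        .iter()
        .find_map(|marker| input.split_once(*marker).map(|(_, after)| after))
        .unwrap_or("");
    // Cut at the first delimiter that can follow an ID in a URL.
    let candidate = tail
        .split(|c: char| matches!(c, '&' | '?' | '#' | '/'))
        .next()
        .unwrap_or("");
    if looks_like_video_id(candidate) {
        Ok(candidate.to_string())
    } else {
        Err(SourceError::InvalidVideoId(input.to_string()))
    }
}
```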

Layer 1: VOD Transcription (unchanged)

The existing TranscriptSource trait stays as-is for VOD work:

#[async_trait]
pub trait TranscriptSource: Send + Sync {
    async fn transcribe(&self, input: TranscriptInput) -> Result<Transcript, TranscribeError>;
    fn source_name(&self) -> &'static str;
}

CaptionParser and WhisperTranscriber continue implementing this.

Layer 2: Stream Transcription (new)

A new trait for session-based, incremental transcription of live sources:

/// A live transcription session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StreamSession {
    pub session_id: String,
    pub platform: Platform,
    pub title: String,
    pub channel: String,
    pub started_at: u64,        // epoch ms
    pub language: Option<String>,
    pub source: String,         // "live_caption", "live_whisper", etc.
}

/// A chunk of transcript from a live session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StreamChunk {
    pub session_id: String,
    pub segments: Vec<Segment>,
    pub cursor: u64,            // monotonic, opaque to clients
    pub is_final: bool,         // true when stream ended
    pub buffer_depth_ms: u64,   // how far behind live edge
    pub session_duration_ms: u64,
    pub health: StreamHealth,
    pub last_diagnostic: Option<StreamDiagnostic>,
    pub last_error: Option<StreamDiagnostic>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum StreamHealth {
    Active,
    Degraded,
    Failed,
    Stopped,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct StreamDiagnostic {
    pub at_ms: u64,
    pub source: String,
    pub message: String,
}

/// Errors specific to stream transcription.
#[derive(Debug, thiserror::Error)]
pub enum StreamError {
    #[error("session not found: {0}")]
    SessionNotFound(String),
    #[error("stream ended")]
    StreamEnded,
    #[error("connection lost: {0}")]
    ConnectionLost(String),
    #[error("platform error: {0}")]
    PlatformError(String),
    #[error("transcription error: {0}")]
    TranscriptionError(String),
    #[error("session limit reached")]
    SessionLimitReached,
    #[error("unsupported platform")]
    UnsupportedPlatform,
}

/// Layer 2 trait: session-based live transcription.
#[async_trait]
pub trait StreamTranscriber: Send + Sync {
    /// Start a new transcription session for a live source.
    async fn start(&self, media: &LiveMedia) -> Result<StreamSession, StreamError>;

    /// Poll for new segments after the given cursor.
    /// Returns immediately with empty segments if nothing new.
    async fn poll(
        &self,
        session_id: &str,
        after_cursor: u64,
    ) -> Result<StreamChunk, StreamError>;

    /// Stop a session and return any remaining buffered segments.
    async fn stop(&self, session_id: &str) -> Result<StreamChunk, StreamError>;

    /// List active sessions.
    async fn list_sessions(&self) -> Vec<StreamSession>;

    fn source_name(&self) -> &'static str;
}

last_diagnostic carries non-error status about the latest ingestion or transcription activity, such as a chunk that has not yet produced any transcript segments. last_error drives degraded/failed health and is the field agents should inspect for actual failures.
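
One way to read the health rule above is sketched below. HealthInputs, derive_health, and the retries_exhausted flag used to separate Degraded from Failed are assumptions for illustration; the RFC only states that last_error drives degraded/failed health and that last_diagnostic is informational.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StreamHealth {
    Active,
    Degraded,
    Failed,
    Stopped,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StreamDiagnostic {
    pub at_ms: u64,
    pub source: String,
    pub message: String,
}

/// Hypothetical per-session facts a session manager might track.
pub struct HealthInputs<'a> {
    pub stopped: bool,
    pub retries_exhausted: bool,
    pub last_diagnostic: Option<&'a StreamDiagnostic>,
    pub last_error: Option<&'a StreamDiagnostic>,
}

/// Assumed policy: only `last_error` moves health away from `Active`;
/// `last_diagnostic` is deliberately ignored.
pub fn derive_health(inputs: &HealthInputs<'_>) -> StreamHealth {
    if inputs.stopped {
        return StreamHealth::Stopped;
    }
    match inputs.last_error {
        None => StreamHealth::Active,
        Some(_) if inputs.retries_exhausted => StreamHealth::Failed,
        Some(_) => StreamHealth::Degraded,
    }
}
```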

Layer 3: Audio Fingerprinting (new, optional)

For "shazam"-style recognition — orthogonal to transcription:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AudioFingerprint {
    pub fingerprint: Vec<u8>,
    pub duration_ms: u64,
    pub sample_rate: u32,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RecognitionResult {
    pub title: Option<String>,
    pub artist: Option<String>,
    pub album: Option<String>,
    pub isrc: Option<String>,
    pub confidence: f64,
    pub source: String,         // "chromaprint", "acoustid", etc.
    pub external_ids: HashMap<String, String>,
}

#[async_trait]
pub trait AudioRecognizer: Send + Sync {
    async fn recognize(
        &self,
        audio: &[f32],
        sample_rate: u32,
    ) -> Result<Vec<RecognitionResult>, RecognitionError>;
    fn source_name(&self) -> &'static str;
}

MCP Tool Surface

Existing Tools (unchanged)

Tool                    Input                                       Output                         Mode
get_transcript          url, lang, format                           Formatted transcript           VOD
get_transcript_section  url, lang, at_s, before_s, after_s, format  Timestamp-windowed transcript  VOD
list_languages          url                                         Language list                  VOD
get_metadata            url                                         Video metadata                 VOD

New Stream Tools

Tool          Input               Output               Mode
start_stream  url                 StreamSession        Live
poll_stream   session_id, cursor  StreamChunk          Live
stop_stream   session_id          StreamChunk (final)  Live
list_streams  (none)              Vec<StreamSession>   Live

New Recognition Tools

Tool             Input              Output             Mode
recognize_audio  url or session_id  RecognitionResult  Any

Smart Routing

get_transcript and start_stream both accept a URL. The server resolves the URI first, then routes:

  • MediaDescriptor::Vod(_) -> VOD path (get_transcript)
  • MediaDescriptor::Live(_) -> Live path (start_stream)

If a user calls get_transcript on a live URL, the server returns a helpful error suggesting start_stream instead. If they call start_stream on a VOD URL, the server could either error or start a "replay" session.
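
The routing rule can be sketched as a single match. MediaKind stands in for MediaDescriptor, and Tool, Route, and the hint strings are hypothetical names. The RFC leaves the start_stream-on-VOD case open; this sketch arbitrarily picks the error option.

```rust
/// Stand-in for `MediaDescriptor`, reduced to its discriminant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MediaKind {
    Vod,
    Live,
}

/// The two URL-accepting tools that need routing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tool {
    GetTranscript,
    StartStream,
}

/// Hypothetical routing outcome.
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    VodPath,
    LivePath,
    Reject { hint: &'static str },
}

/// Decide what to do after the URI has been resolved.
pub fn route(tool: Tool, kind: MediaKind) -> Route {
    match (tool, kind) {
        (Tool::GetTranscript, MediaKind::Vod) => Route::VodPath,
        (Tool::StartStream, MediaKind::Live) => Route::LivePath,
        (Tool::GetTranscript, MediaKind::Live) => Route::Reject {
            hint: "this URL is a live stream; call start_stream instead",
        },
        // Open question in the RFC (error vs. replay session);
        // the sketch chooses to reject.
        (Tool::StartStream, MediaKind::Vod) => Route::Reject {
            hint: "this URL is a VOD; call get_transcript instead",
        },
    }
}
```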

Platform Adapter Matrix

Platform      MediaResolver             VOD        Live         Auth           Compliance
YouTube VOD   WatchPageSource (bridge)  yes        -            PoToken        robots.txt OK
YouTube Live  YouTubeLiveResolver       -          HLS+chat     PoToken        same
Twitch        TwitchResolver            VODs       HLS+IRC      Client-ID      API ToS
Kick          KickResolver              VODs       HLS+WS       none (public)  minimal
HTTP Audio    HttpStreamResolver        file URLs  Icecast/HLS  varies         clean
Microphone    AudioDeviceResolver       -          CPAL         none           clean

Each adapter is feature-gated:

[features]
default = []
po-token = [...]          # existing
whisper = [...]            # existing
youtube-live = ["dep:m3u8-rs"]
twitch = ["dep:twitch-irc"]
kick = ["dep:tokio-tungstenite"]
audio-device = ["dep:cpal"]
audio-fingerprint = ["dep:chromaprint"]
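
To show how these flags might surface in code, the sketch below reports which adapters a build contains. compiled_adapters and the adapter labels are hypothetical. It uses the cfg! macro for brevity; real adapter registration would more likely use #[cfg(feature = "...")] attributes so that gated dependencies are never referenced when a feature is off.

```rust
/// Hypothetical helper: names of the adapters compiled into this build.
/// `cfg!` evaluates to a compile-time `true`/`false` for each feature.
#[allow(unexpected_cfgs)]
pub fn compiled_adapters() -> Vec<&'static str> {
    // Assumption: YouTube VOD and HTTP audio need no optional feature.
    let mut adapters = vec!["youtube-vod", "http-audio"];
    if cfg!(feature = "youtube-live") {
        adapters.push("youtube-live");
    }
    if cfg!(feature = "twitch") {
        adapters.push("twitch");
    }
    if cfg!(feature = "kick") {
        adapters.push("kick");
    }
    if cfg!(feature = "audio-device") {
        adapters.push("audio-device");
    }
    if cfg!(feature = "audio-fingerprint") {
        adapters.push("audio-fingerprint");
    }
    adapters
}
```

With default = [] and no features selected, only the two ungated entries are returned.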

Session Lifecycle

                    ┌─────────┐
                    │  Idle   │
                    └────┬────┘
                         │ start_stream(url)
                         ▼
              ┌──────────────────┐
              │  Resolving URI   │  MediaResolver::resolve()
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │   Connecting     │  Open HLS/WS/device stream
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
         ┌───▶│    Streaming     │◀──── poll_stream(session, cursor)
         │    └────────┬─────────┘      returns StreamChunk
         │             │
         │    reconnect│ on transient failure
         │             │
         │    ┌────────▼─────────┐
         └────│   Reconnecting   │  exponential backoff, bounded retries
              └────────┬─────────┘
                       │ stop_stream() or stream ends naturally
                       ▼
              ┌──────────────────┐
              │    Draining      │  flush remaining buffer
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │    Stopped       │  is_final=true, session archived
              └──────────────────┘
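
The diagram can be read as a small state machine. The sketch below encodes one such reading; the event names, the choice to allow stop or stream end from both Streaming and Reconnecting, and the backoff parameters are assumptions, and the behavior once retries are exhausted is not specified above and is left out.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionState {
    Idle,
    Resolving,
    Connecting,
    Streaming,
    Reconnecting,
    Draining,
    Stopped,
}

/// Hypothetical event names for the arrows in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionEvent {
    StartRequested,
    Resolved,
    Connected,
    TransientFailure,
    Reconnected,
    StopRequested,
    StreamEnded,
    Drained,
}

/// Returns the next state, or `None` if the event is not valid in `state`.
pub fn transition(state: SessionState, event: SessionEvent) -> Option<SessionState> {
    use SessionEvent as E;
    use SessionState as S;
    match (state, event) {
        (S::Idle, E::StartRequested) => Some(S::Resolving),
        (S::Resolving, E::Resolved) => Some(S::Connecting),
        (S::Connecting, E::Connected) => Some(S::Streaming),
        (S::Streaming, E::TransientFailure) => Some(S::Reconnecting),
        (S::Reconnecting, E::Reconnected) => Some(S::Streaming),
        (S::Streaming | S::Reconnecting, E::StopRequested | E::StreamEnded) => Some(S::Draining),
        (S::Draining, E::Drained) => Some(S::Stopped),
        _ => None,
    }
}

/// Exponential backoff: `base * 2^attempt`, never more than `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}
```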

Backpressure and Limits

  • Maximum concurrent sessions: configurable (default: 4)
  • Per-session buffer depth: configurable (default: 300s / 5 minutes)
  • Segments older than buffer depth are dropped (agent must poll frequently enough)
  • Session timeout: configurable (default: 3600s / 1 hour with no polls)
  • Buffer memory budget: segment count is bounded by (max_sessions * buffer_depth * segment_rate); memory is that count times the average segment size
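
A minimal sketch of these limits as a config type, using the defaults listed above. StreamLimits and its method names are hypothetical, and the segment rate passed to max_buffered_segments is an input the caller has to estimate.

```rust
/// Hypothetical config type carrying the defaults listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StreamLimits {
    pub max_sessions: usize,
    pub buffer_depth_ms: u64,
    pub session_timeout_ms: u64,
}

impl Default for StreamLimits {
    fn default() -> Self {
        Self {
            max_sessions: 4,
            buffer_depth_ms: 300_000,      // 300s / 5 minutes
            session_timeout_ms: 3_600_000, // 3600s / 1 hour with no polls
        }
    }
}

impl StreamLimits {
    /// Upper bound on buffered segments across all sessions:
    /// max_sessions * buffer_depth * segment_rate.
    pub fn max_buffered_segments(&self, segments_per_sec: f64) -> u64 {
        let per_session = (self.buffer_depth_ms as f64 / 1000.0) * segments_per_sec;
        (per_session.ceil() as u64) * self.max_sessions as u64
    }

    /// Whether another session may start given `active` running sessions.
    pub fn admits_new_session(&self, active: usize) -> bool {
        active < self.max_sessions
    }

    /// Whether a session has gone unpolled for the full timeout.
    pub fn is_expired(&self, now_ms: u64, last_poll_ms: u64) -> bool {
        now_ms.saturating_sub(last_poll_ms) >= self.session_timeout_ms
    }
}
```

For example, at an assumed rate of one segment every two seconds, the defaults bound the buffers at 4 * 300 * 0.5 = 600 segments in total.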

Cursor Design

Cursors are monotonically increasing u64 values. They are opaque to clients — the server assigns them. A client polls with after_cursor=0 to get everything buffered, or with the last cursor it received to get only new segments.

Properties (testable via PBT):

  • cursor(chunk_n+1) > cursor(chunk_n): strictly monotonic
  • segments(poll(cursor=0)) is a prefix of the full session transcript
  • union(all polls) == full session transcript: no gaps, no duplicates
  • poll(cursor=latest) returns empty segments, not an error
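
A minimal sketch of a cursor-indexed buffer with these poll semantics follows. SegmentBuffer and its methods are hypothetical. Cursors start at 1 so that after_cursor=0 returns everything still buffered, and an empty poll hands back the same cursor, which reads strict monotonicity as applying only to chunks that carry new segments; that reading is an assumption.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub start_ms: u64,
    pub end_ms: u64,
}

/// Hypothetical per-session buffer: each segment gets the next cursor.
pub struct SegmentBuffer {
    next_cursor: u64,
    entries: VecDeque<(u64, Segment)>,
    depth_ms: u64,
}

impl SegmentBuffer {
    pub fn new(depth_ms: u64) -> Self {
        Self { next_cursor: 1, entries: VecDeque::new(), depth_ms }
    }

    /// Append a segment, assign its cursor, and evict anything that now
    /// lies further than `depth_ms` behind the newest segment.
    pub fn push(&mut self, segment: Segment) -> u64 {
        let cursor = self.next_cursor;
        self.next_cursor += 1;
        let newest_end = segment.end_ms;
        let depth = self.depth_ms;
        self.entries.push_back((cursor, segment));
        while self
            .entries
            .front()
            .map_or(false, |(_, s)| newest_end.saturating_sub(s.end_ms) > depth)
        {
            self.entries.pop_front();
        }
        cursor
    }

    /// Segments with cursor > `after_cursor`, plus the cursor to poll with next.
    pub fn poll(&self, after_cursor: u64) -> (Vec<Segment>, u64) {
        let segments: Vec<Segment> = self
            .entries
            .iter()
            .filter(|(c, _)| *c > after_cursor)
            .map(|(_, s)| s.clone())
            .collect();
        let latest = self.next_cursor - 1;
        (segments, latest.max(after_cursor))
    }
}
```

A client that always polls with the cursor from the previous chunk sees each segment exactly once, provided it polls before eviction removes anything.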

Audio Pipeline (Stream Mode)

Audio Source ──▶ Chunk Buffer ──▶ Normalizer ──▶ STT Engine ──▶ Segment Buffer
  (HLS/WS/      (ring buffer,     (16kHz mono    (Whisper      (cursor-indexed,
   device)        ~10s chunks)      f32 PCM)       streaming)    bounded depth)

For live Whisper, the audio is chunked into overlapping windows:

  • Window size: 30s (Whisper's sweet spot)
  • Overlap: 5s (prevents segment boundary artifacts)
  • Stride: 25s of new audio per inference
  • On M-series: ~3s inference for 30s audio = ~10x realtime = sustainable
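
The window arithmetic can be checked with a short sketch. complete_windows and worst_case_latency_ms are hypothetical helpers; only the 30s/5s/25s figures and the ~3s inference estimate come from this document.

```rust
pub const WINDOW_MS: u64 = 30_000;
pub const OVERLAP_MS: u64 = 5_000;
/// New audio consumed per inference: 30s window minus 5s overlap = 25s.
pub const STRIDE_MS: u64 = WINDOW_MS - OVERLAP_MS;

/// (start_ms, end_ms) of every window that is fully covered by
/// `available_ms` of buffered audio.
pub fn complete_windows(available_ms: u64) -> Vec<(u64, u64)> {
    let mut windows = Vec::new();
    let mut start = 0;
    while start + WINDOW_MS <= available_ms {
        windows.push((start, start + WINDOW_MS));
        start += STRIDE_MS;
    }
    windows
}

/// Audio arriving just after a window closes waits one full stride for
/// the next window, then for inference on that window.
pub fn worst_case_latency_ms(inference_ms: u64) -> u64 {
    STRIDE_MS + inference_ms
}
```

With 80s of buffered audio this yields the windows 0-30s, 25-55s, and 50-80s, and a 3s inference time gives a worst case of 25s + 3s = 28s behind the live edge.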

For platforms with native captions (YouTube Live chat captions, Twitch IRC):

  • Caption text arrives pre-segmented
  • No STT needed: direct segment construction
  • Lower latency, lower CPU, higher accuracy

PBT Contracts

MediaResolver Properties

∀ uri, platform:
  P1: resolve(uri) either succeeds or returns a typed error — never panics
  P2: resolve(uri).platform ∈ supported_platforms()
  P3: resolve is idempotent within a session (same URI → same descriptor shape)
  P4: compliance_note() is non-empty

StreamTranscriber Properties

∀ session:
  P5: start() → session_id is unique across all active sessions
  P6: poll(session, 0) returns all buffered segments
  P7: poll(session, cursor_n).cursor > cursor_n (strict monotonicity)
  P8: stop() returns is_final=true
  P9: poll after stop returns SessionNotFound or is_final=true
  P10: segments from sequential polls have no gaps and no duplicates
       (union of all poll results = complete transcript)
  P11: buffer_depth_ms ≤ configured maximum
  P12: concurrent sessions ≤ configured limit

Segment Properties (shared VOD + Live)

∀ segment:
  P13: start_ms ≤ end_ms
  P14: text is valid UTF-8 and non-empty after trim
  P15: serde roundtrip is identity: deserialize(serialize(s)) == s
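
P13 and P14 can be expressed as a plain checker that a property test then runs against generated segments. The sketch below uses a tiny deterministic generator in place of a PBT framework's input strategy (a crate such as proptest would normally fill that role); check_segment, SegmentViolation, Lcg, and arbitrary_segment are hypothetical names. P15 is omitted because it needs serde.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub start_ms: u64,
    pub end_ms: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SegmentViolation {
    EndBeforeStart, // violates P13
    EmptyText,      // violates P14
}

/// Checks P13 and P14. UTF-8 validity is already guaranteed by `String`.
pub fn check_segment(s: &Segment) -> Result<(), SegmentViolation> {
    if s.start_ms > s.end_ms {
        return Err(SegmentViolation::EndBeforeStart);
    }
    if s.text.trim().is_empty() {
        return Err(SegmentViolation::EmptyText);
    }
    Ok(())
}

/// Minimal linear congruential generator, standing in for a PBT strategy.
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Generates a segment that a correct producer could emit.
pub fn arbitrary_segment(rng: &mut Lcg) -> Segment {
    let start_ms = rng.next_u64() % 3_600_000;
    let len_ms = rng.next_u64() % 30_000;
    Segment {
        text: format!("word{}", rng.next_u64() % 100),
        start_ms,
        end_ms: start_ms + len_ms,
    }
}
```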

AudioRecognizer Properties

∀ audio, sample_rate:
  P16: recognize() either succeeds or returns typed error — never panics
  P17: confidence ∈ [0.0, 1.0]
  P18: empty audio → empty results (not an error)

Type Evolution Summary

Preserved (unchanged)

  • Segment { text, start_ms, end_ms } — the universal atom
  • CaptionTrack — YouTube caption metadata
  • AudioStream — YouTube audio stream metadata
  • Transcript — complete VOD transcript
  • TranscriptSource trait — VOD transcription
  • All output formatters (JSON, Markdown, SRT, VTT, text)
  • All parsers (HTML, JSON3, XML)
  • Cache layer (extended to cover stream session metadata)

Added

  • MediaUri — universal media identifier enum
  • MediaDescriptor — resolved media (Vod | Live)
  • VodMedia — generalized VOD metadata (superset of VideoData)
  • LiveMedia — live stream metadata
  • Platform — source platform enum
  • AudioEndpoint — how to get audio from a source
  • CaptionEndpoint — how to get captions from a source
  • StreamSession — live session handle
  • StreamChunk — incremental transcript delivery
  • StreamError — stream-specific errors
  • MediaResolver trait — universal URI resolution
  • StreamTranscriber trait — session-based live transcription
  • AudioRecognizer trait — audio fingerprinting (optional)

Deprecated (kept, bridged)

  • VideoSource trait — bridged to MediaResolver via VideoSourceBridge
  • VideoData: From<VodMedia> and Into<VodMedia> impls
  • TranscriptInput — extended with LiveAudio(AudioEndpoint) variant

Implementation Phases

Phase A: Type Foundation

  • Add new types to src/types.rs (MediaUri, Platform, etc.)
  • Add MediaResolver trait to src/traits.rs
  • Bridge WatchPageSource via VideoSourceBridge
  • All existing tests pass unchanged
  • PBT: P1-P4 for the bridge adapter

Phase B: Stream Trait + Session Manager

  • Add StreamTranscriber trait
  • Add StreamSession, StreamChunk, StreamError types
  • Implement SessionManager (in-memory session store, cursor assignment, buffer management, backpressure)
  • No platform adapters yet — test with a mock StreamTranscriber
  • PBT: P5-P12 against mock implementation
  • Fuzz: arbitrary cursor values, concurrent session operations

Phase C: YouTube Live Adapter

  • YouTubeLiveResolver implementing MediaResolver
  • YouTube Live HLS manifest parsing (m3u8-rs)
  • YouTube Live chat caption ingestion
  • Feature-gated behind youtube-live
  • Integration test with MCP Inspector

Phase D: MCP Surface Expansion

  • Add start_stream, poll_stream, stop_stream, list_streams tools
  • Smart routing in get_transcript (detect live vs VOD)
  • Updated get_metadata to report stream status
  • Integration tests for session lifecycle via MCP

Phase E: Platform Adapters

  • Twitch (HLS + IRC captions) — twitch feature
  • Kick (HLS + WebSocket) — kick feature
  • HTTP audio streams (Icecast/HLS) — no extra deps
  • Microphone input (CPAL) — audio-device feature

Phase F: Audio Fingerprinting

  • AudioRecognizer trait
  • Chromaprint integration — audio-fingerprint feature
  • AcoustID lookup service
  • recognize_audio MCP tool

Phase G: Hosted Layer

  • HTTP/SSE API wrapping the FOSS core
  • Auth, metering, billing layer
  • Managed PoToken service
  • Managed STT inference
  • Separate deployment artifact

Crate Structure (Future)

tubebrain/                      # workspace root
  tubebrain-core/               # traits, types, errors (the contract)
  tubebrain-youtube/            # YouTube VOD + Live resolver
  tubebrain-twitch/             # Twitch resolver
  tubebrain-kick/               # Kick resolver
  tubebrain-audio/              # HTTP stream + device resolver
  tubebrain-whisper/            # Whisper STT (VOD + streaming)
  tubebrain-fingerprint/        # audio recognition
  tubebrain-server/             # MCP stdio server (the binary)
  tubebrain-hosted/             # HTTP/SSE hosted service

For now, everything stays in a single crate with feature gates. The workspace split happens when any adapter gets heavy enough to justify it.

Security Considerations

  • Session isolation: each stream session runs in its own task with bounded resources. A misbehaving stream cannot starve other sessions.
  • Buffer limits: hard caps on buffer depth and segment count prevent memory exhaustion from long-running sessions.
  • Auth material: platform credentials (Twitch Client-ID, etc.) are environment variables, never in config files or MCP tool inputs.
  • Audio data: raw audio is processed in-memory and never persisted to disk unless explicitly configured.
  • Rate limiting: per-platform rate limits enforced in each adapter, not in the session manager.
  • Compliance posture: each MediaResolver carries its own compliance_note(). The MCP server can expose a compliance tool or --compliance flag aggregating all active adapters.
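
The compliance aggregation mentioned in the last point could be as small as the sketch below. ComplianceInfo is a hypothetical, synchronous slice of MediaResolver; the two example adapters and their source_name values are placeholders, and the notes reuse the wording from the Platform Adapter Matrix.

```rust
/// Hypothetical slice of `MediaResolver`: just the two descriptive methods.
pub trait ComplianceInfo {
    fn source_name(&self) -> &'static str;
    fn compliance_note(&self) -> &'static str;
}

/// Placeholder adapters for illustration only.
pub struct WatchPage;
pub struct Twitch;

impl ComplianceInfo for WatchPage {
    fn source_name(&self) -> &'static str {
        "watch_page"
    }
    fn compliance_note(&self) -> &'static str {
        "robots.txt OK"
    }
}

impl ComplianceInfo for Twitch {
    fn source_name(&self) -> &'static str {
        "twitch"
    }
    fn compliance_note(&self) -> &'static str {
        "API ToS"
    }
}

/// One "name: note" line per active adapter.
pub fn compliance_report(adapters: &[&dyn ComplianceInfo]) -> String {
    adapters
        .iter()
        .map(|a| format!("{}: {}", a.source_name(), a.compliance_note()))
        .collect::<Vec<_>>()
        .join("\n")
}
```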

Decisions (resolved 2026-04-24)

  1. Rename timing: internal rebrand to TubeBrain now (docs, Linear project, RFC titles). Binary/crate/npm rename deferred to Phase D alongside stream feature launch — one migration, one marketing moment. Reserve the name on crates.io and npm immediately.

  2. Push vs poll: poll transport first. Internal architecture uses tokio::broadcast channels so push (MCP Streamable HTTP, SSE) is a transport swap, not an architecture change. PBT contracts (cursor monotonicity, no gaps, no dupes) hold for both modes.

  3. Whisper streaming mode: chunked batch on overlapping windows (30s window, 5s overlap, 25s stride). ~28s latency from live edge on M-series. Do not use whisper.cpp's experimental streaming. Sub-second latency is the hosted layer's concern (Deepgram/AssemblyAI/GPU inference).

  4. Session persistence: in-memory for FOSS (Arc<DashMap>). The SessionManager is trait-parameterized so the hosted layer can swap in Redis/Postgres at Phase G. No disk persistence in the local binary.

  5. Multi-language live: one language per session. Multiple languages = multiple sessions sharing the same audio pipeline internally. lang: None = detect dominant language at start. lang: Some("en") = force English.
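
Decision 4 can be sketched as a storage trait with an in-memory default. SessionStore, InMemoryStore, SessionRecord, and the method names are hypothetical, and the sketch uses a std Mutex<HashMap> in place of DashMap so it has no dependencies. The count-then-insert check in start is not atomic; a real implementation would need to close that race.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionRecord {
    pub session_id: String,
    pub language: Option<String>, // None = detect dominant language at start
}

/// Hypothetical storage seam: the hosted layer could implement this
/// over Redis/Postgres without touching the manager.
pub trait SessionStore: Send + Sync {
    fn insert(&self, record: SessionRecord);
    fn get(&self, session_id: &str) -> Option<SessionRecord>;
    fn remove(&self, session_id: &str) -> Option<SessionRecord>;
    fn count(&self) -> usize;
}

/// In-memory store for the local binary (std stand-in for DashMap).
#[derive(Default)]
pub struct InMemoryStore {
    sessions: Mutex<HashMap<String, SessionRecord>>,
}

impl SessionStore for InMemoryStore {
    fn insert(&self, record: SessionRecord) {
        self.sessions.lock().unwrap().insert(record.session_id.clone(), record);
    }
    fn get(&self, session_id: &str) -> Option<SessionRecord> {
        self.sessions.lock().unwrap().get(session_id).cloned()
    }
    fn remove(&self, session_id: &str) -> Option<SessionRecord> {
        self.sessions.lock().unwrap().remove(session_id)
    }
    fn count(&self) -> usize {
        self.sessions.lock().unwrap().len()
    }
}

pub struct SessionManager<S: SessionStore> {
    store: S,
    max_sessions: usize,
}

impl<S: SessionStore> SessionManager<S> {
    pub fn new(store: S, max_sessions: usize) -> Self {
        Self { store, max_sessions }
    }

    /// Register a session unless the concurrent-session limit is reached.
    pub fn start(&self, session_id: &str, language: Option<&str>) -> Result<SessionRecord, String> {
        if self.store.count() >= self.max_sessions {
            return Err("session limit reached".to_string());
        }
        let record = SessionRecord {
            session_id: session_id.to_string(),
            language: language.map(str::to_string),
        };
        self.store.insert(record.clone());
        Ok(record)
    }

    pub fn stop(&self, session_id: &str) -> Option<SessionRecord> {
        self.store.remove(session_id)
    }
}
```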
