RFC: TubeBrain Trait Evolution¶

Date: 2026-04-24 Status: draft Owner: Jess Sullivan

Summary¶

Evolve tubebrain's two-trait architecture into a multi-platform, stream-capable system under the working name TubeBrain. The core design principle from the original spec — "Layer 1 carries all ToS sensitivity, Layer 2 is clean computation" — extends to cover YouTube Live, Twitch, Kick, HTTP audio streams, microphone input, and audio fingerprinting.

This RFC defines the trait hierarchy, type system, MCP surface expansion, session lifecycle, and PBT contract boundaries.

Motivation¶

tubebrain v0.1.9 handles YouTube VOD transcripts well. The adjacent space is larger and more valuable:

Live captioning for accessibility agents
Realtime stream summarization (YouTube Live, Twitch, Kick)
Background audio monitoring (radio, podcast feeds, ambient mic)
Audio fingerprinting / recognition ("what is playing right now?")
Hosted SaaS layer for instant agent bootstrap

All of these share the same computational core — turn audio into structured text — but differ in how media is resolved, how audio arrives, and how transcripts are delivered.

Design Principles¶

Layer 1 is ToS-sensitive, Layer 2 is clean computation — preserved.
VOD and Live are peers, not parent-child — neither is a special case of the other. They share output types but have different lifecycle contracts.
Session-based streaming uses polling — MCP clients handle request/response better than push streams. Polling with cursors is the pragmatic first design.
Platform adapters are independent crates — each platform adapter is a separate concern with its own auth, rate limiting, and compliance posture.
The Segment is the atom — all paths produce Vec<Segment>. Format rendering happens after, same as today.
PBT contracts at trait boundaries — every trait has testable properties that hold regardless of implementation.

Current Architecture (tubebrain v0.1.9)¶

VideoSource::resolve(video_id) -> VideoData
    impl: WatchPageSource (YouTube GET /watch)

TranscriptSource::transcribe(TranscriptInput) -> Transcript
    impl: CaptionParser (json3/xml)
    impl: WhisperTranscriber (feature-gated)

fetch_transcript() orchestrates the fallback chain:
    manual captions -> auto captions -> whisper -> error

Types are YouTube-specific: VideoData, CaptionTrack, AudioStream, video_id everywhere.

Proposed Architecture¶

Layer 0: Media Resolution¶

A new top-level trait that resolves any supported URI into a media descriptor. This absorbs the current VideoSource role and generalizes it.

/// What kind of media source the user pointed us at.
#[derive(Debug, Clone)]
pub enum MediaUri {
    YouTube(String),        // video ID, URL, or live URL
    Twitch(String),         // channel name or URL
    Kick(String),           // channel name or URL
    HttpAudio(Url),         // Icecast, HLS, podcast, generic audio URL
    AudioDevice(String),    // system audio device identifier
    Unknown(String),        // let resolvers try to claim it
}

/// Resolved media — what we know about the source after resolution.
#[derive(Debug, Clone)]
pub enum MediaDescriptor {
    Vod(VodMedia),
    Live(LiveMedia),
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VodMedia {
    pub id: String,
    pub platform: Platform,
    pub title: String,
    pub channel: String,
    pub duration_ms: u64,
    pub caption_tracks: Vec<CaptionTrack>,
    pub audio_streams: Vec<AudioStream>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LiveMedia {
    pub id: String,
    pub platform: Platform,
    pub title: String,
    pub channel: String,
    pub started_at: Option<u64>,
    pub audio_endpoint: AudioEndpoint,
    pub caption_endpoint: Option<CaptionEndpoint>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Platform {
    YouTube,
    Twitch,
    Kick,
    HttpStream,
    AudioDevice,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum AudioEndpoint {
    DirectUrl { url: String, content_type: String },
    HlsManifest { url: String },
    DashManifest { url: String },
    WebSocket { url: String },
    Device { device_id: String, sample_rate: u32, channels: u16 },
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum CaptionEndpoint {
    TimedText { base_url: String, needs_po_token: bool },
    IrcChat { channel: String },
    WebSocketChat { url: String },
}

/// Layer 0 trait: resolve a URI into a media descriptor.
/// This is the ToS-sensitive layer.
#[async_trait]
pub trait MediaResolver: Send + Sync {
    async fn resolve(&self, uri: &MediaUri) -> Result<MediaDescriptor, SourceError>;
    fn supported_platforms(&self) -> &[Platform];
    fn source_name(&self) -> &'static str;
    fn compliance_note(&self) -> &'static str;
}

Backwards Compatibility Bridge¶

The existing VideoSource trait becomes a thin wrapper:

/// Bridge: existing VideoSource impls automatically become MediaResolvers.
impl<T: VideoSource> MediaResolver for VideoSourceBridge<T> {
    async fn resolve(&self, uri: &MediaUri) -> Result<MediaDescriptor, SourceError> {
        let MediaUri::YouTube(ref id_or_url) = uri else {
            return Err(SourceError::UnsupportedPlatform);
        };
        let video_id = extract_video_id(id_or_url)?;
        let video_data = self.inner.resolve(&video_id).await?;
        Ok(MediaDescriptor::Vod(VodMedia::from(video_data)))
    }
    // ...
}

This means WatchPageSource continues to work unchanged. New platform adapters implement MediaResolver directly.

Layer 1: VOD Transcription (unchanged)¶

The existing TranscriptSource trait stays as-is for VOD work:

#[async_trait]
pub trait TranscriptSource: Send + Sync {
    async fn transcribe(&self, input: TranscriptInput) -> Result<Transcript, TranscribeError>;
    fn source_name(&self) -> &'static str;
}

CaptionParser and WhisperTranscriber continue implementing this.

Layer 2: Stream Transcription (new)¶

A new trait for session-based, incremental transcription of live sources:

/// A live transcription session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StreamSession {
    pub session_id: String,
    pub platform: Platform,
    pub title: String,
    pub channel: String,
    pub started_at: u64,        // epoch ms
    pub language: Option<String>,
    pub source: String,         // "live_caption", "live_whisper", etc.
}

/// A chunk of transcript from a live session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StreamChunk {
    pub session_id: String,
    pub segments: Vec<Segment>,
    pub cursor: u64,            // monotonic, opaque to clients
    pub is_final: bool,         // true when stream ended
    pub buffer_depth_ms: u64,   // how far behind live edge
    pub session_duration_ms: u64,
    pub health: StreamHealth,
    pub last_diagnostic: Option<StreamDiagnostic>,
    pub last_error: Option<StreamDiagnostic>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum StreamHealth {
    Active,
    Degraded,
    Failed,
    Stopped,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct StreamDiagnostic {
    pub at_ms: u64,
    pub source: String,
    pub message: String,
}

/// Errors specific to stream transcription.
#[derive(Debug, thiserror::Error)]
pub enum StreamError {
    #[error("session not found: {0}")]
    SessionNotFound(String),
    #[error("stream ended")]
    StreamEnded,
    #[error("connection lost: {0}")]
    ConnectionLost(String),
    #[error("platform error: {0}")]
    PlatformError(String),
    #[error("transcription error: {0}")]
    TranscriptionError(String),
    #[error("session limit reached")]
    SessionLimitReached,
    #[error("unsupported platform")]
    UnsupportedPlatform,
}

/// Layer 2 trait: session-based live transcription.
#[async_trait]
pub trait StreamTranscriber: Send + Sync {
    /// Start a new transcription session for a live source.
    async fn start(&self, media: &LiveMedia) -> Result<StreamSession, StreamError>;

    /// Poll for new segments after the given cursor.
    /// Returns immediately with empty segments if nothing new.
    async fn poll(
        &self,
        session_id: &str,
        after_cursor: u64,
    ) -> Result<StreamChunk, StreamError>;

    /// Stop a session and return any remaining buffered segments.
    async fn stop(&self, session_id: &str) -> Result<StreamChunk, StreamError>;

    /// List active sessions.
    async fn list_sessions(&self) -> Vec<StreamSession>;

    fn source_name(&self) -> &'static str;
}

last_diagnostic is non-error status about the latest ingestion or transcription activity, such as a chunk that produced no transcript segments yet. last_error drives degraded/failed health and remains the field agents should inspect for actual failures.

Layer 3: Audio Fingerprinting (new, optional)¶

For "shazam"-style recognition — orthogonal to transcription:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AudioFingerprint {
    pub fingerprint: Vec<u8>,
    pub duration_ms: u64,
    pub sample_rate: u32,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RecognitionResult {
    pub title: Option<String>,
    pub artist: Option<String>,
    pub album: Option<String>,
    pub isrc: Option<String>,
    pub confidence: f64,
    pub source: String,         // "chromaprint", "acoustid", etc.
    pub external_ids: HashMap<String, String>,
}

#[async_trait]
pub trait AudioRecognizer: Send + Sync {
    async fn recognize(
        &self,
        audio: &[f32],
        sample_rate: u32,
    ) -> Result<Vec<RecognitionResult>, RecognitionError>;
    fn source_name(&self) -> &'static str;
}

MCP Tool Surface¶

Existing Tools (unchanged)¶

Tool	Input	Output	Mode
`get_transcript`	url, lang, format	Formatted transcript	VOD
`get_transcript_section`	url, lang, at_s, before_s, after_s, format	Timestamp-windowed transcript	VOD
`list_languages`	url	Language list	VOD
`get_metadata`	url	Video metadata	VOD

New Stream Tools¶

Tool	Input	Output	Mode
`start_stream`	url	StreamSession	Live
`poll_stream`	session_id, cursor	StreamChunk	Live
`stop_stream`	session_id	StreamChunk (final)	Live
`list_streams`	(none)	Vec	Live

New Recognition Tools¶

Tool	Input	Output	Mode
`recognize_audio`	url or session_id	RecognitionResult	Any

Smart Routing¶

get_transcript and start_stream both accept a URL. The server resolves the URI first, then routes:

MediaDescriptor::Vod(_) -> VOD path (get_transcript)
MediaDescriptor::Live(_) -> Live path (start_stream)

If a user calls get_transcript on a live URL, the server returns a helpful error suggesting start_stream instead. If they call start_stream on a VOD URL, the server could either error or start a "replay" session.

Platform Adapter Matrix¶

Platform	MediaResolver	VOD	Live	Auth	Compliance
YouTube VOD	WatchPageSource (bridge)	yes	—	PoToken	robots.txt OK
YouTube Live	YouTubeLiveResolver	—	HLS+chat	PoToken	same
Twitch	TwitchResolver	VODs	HLS+IRC	Client-ID	API ToS
Kick	KickResolver	VODs	HLS+WS	none (public)	minimal
HTTP Audio	HttpStreamResolver	file URLs	Icecast/HLS	varies	clean
Microphone	AudioDeviceResolver	—	CPAL	none	clean

Each adapter is feature-gated:

[features]
default = []
po-token = [...]          # existing
whisper = [...]            # existing
youtube-live = ["dep:m3u8-rs"]
twitch = ["dep:twitch-irc"]
kick = ["dep:tokio-tungstenite"]
audio-device = ["dep:cpal"]
audio-fingerprint = ["dep:chromaprint"]

Session Lifecycle¶

                    ┌─────────┐
                    │  Idle   │
                    └────┬────┘
                         │ start_stream(url)
                         ▼
              ┌──────────────────┐
              │  Resolving URI   │  MediaResolver::resolve()
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │   Connecting     │  Open HLS/WS/device stream
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
         ┌───▶│    Streaming     │◀──── poll_stream(session, cursor)
         │    └────────┬─────────┘      returns StreamChunk
         │             │
         │    reconnect│ on transient failure
         │             │
         │    ┌────────▼─────────┐
         └────│   Reconnecting   │  exponential backoff, bounded retries
              └────────┬─────────┘
                       │ stop_stream() or stream ends naturally
                       ▼
              ┌──────────────────┐
              │    Draining      │  flush remaining buffer
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │    Stopped       │  is_final=true, session archived
              └──────────────────┘

Backpressure and Limits¶

Maximum concurrent sessions: configurable (default: 4)
Per-session buffer depth: configurable (default: 300s / 5 minutes)
Segments older than buffer depth are dropped (agent must poll frequently enough)
Session timeout: configurable (default: 3600s / 1 hour with no polls)
Buffer memory budget: bounded by (max_sessions * buffer_depth * segment_rate)

Cursor Design¶

Cursors are monotonically increasing u64 values. They are opaque to clients — the server assigns them. A client polls with after_cursor=0 to get everything buffered, or with the last cursor it received to get only new segments.

Properties (testable via PBT): - cursor(chunk_n+1) > cursor(chunk_n) — strictly monotonic - segments(poll(cursor=0)) is a prefix of the full session transcript - union(all polls) == full session transcript — no gaps, no duplicates - poll(cursor=latest) returns empty segments, not an error

Audio Pipeline (Stream Mode)¶

Audio Source ──▶ Chunk Buffer ──▶ Normalizer ──▶ STT Engine ──▶ Segment Buffer
  (HLS/WS/      (ring buffer,     (16kHz mono    (Whisper      (cursor-indexed,
   device)        ~10s chunks)      f32 PCM)       streaming)    bounded depth)

For live Whisper, the audio is chunked into overlapping windows: - Window size: 30s (Whisper's sweet spot) - Overlap: 5s (prevents segment boundary artifacts) - Stride: 25s of new audio per inference - On M-series: ~3s inference for 30s audio = ~10x realtime = sustainable

For platforms with native captions (YouTube Live chat captions, Twitch IRC): - Caption text arrives pre-segmented - No STT needed — direct segment construction - Lower latency, lower CPU, higher accuracy

PBT Contracts¶

MediaResolver Properties¶

∀ uri, platform:
  P1: resolve(uri) either succeeds or returns a typed error — never panics
  P2: resolve(uri).platform ∈ supported_platforms()
  P3: resolve is idempotent within a session (same URI → same descriptor shape)
  P4: compliance_note() is non-empty

StreamTranscriber Properties¶

∀ session:
  P5: start() → session_id is unique across all active sessions
  P6: poll(session, 0) returns all buffered segments
  P7: poll(session, cursor_n).cursor > cursor_n (strict monotonicity)
  P8: stop() returns is_final=true
  P9: poll after stop returns SessionNotFound or is_final=true
  P10: segments from sequential polls have no gaps and no duplicates
       (union of all poll results = complete transcript)
  P11: buffer_depth_ms ≤ configured maximum
  P12: concurrent sessions ≤ configured limit

Segment Properties (shared VOD + Live)¶

∀ segment:
  P13: start_ms ≤ end_ms
  P14: text is valid UTF-8 and non-empty after trim
  P15: serde roundtrip is identity: deserialize(serialize(s)) == s

AudioRecognizer Properties¶

∀ audio, sample_rate:
  P16: recognize() either succeeds or returns typed error — never panics
  P17: confidence ∈ [0.0, 1.0]
  P18: empty audio → empty results (not an error)

Type Evolution Summary¶

Preserved (unchanged)¶

Segment { text, start_ms, end_ms } — the universal atom
CaptionTrack — YouTube caption metadata
AudioStream — YouTube audio stream metadata
Transcript — complete VOD transcript
TranscriptSource trait — VOD transcription
All output formatters (JSON, Markdown, SRT, VTT, text)
All parsers (HTML, JSON3, XML)
Cache layer (extended to cover stream session metadata)

Added¶

MediaUri — universal media identifier enum
MediaDescriptor — resolved media (Vod | Live)
VodMedia — generalized VOD metadata (superset of VideoData)
LiveMedia — live stream metadata
Platform — source platform enum
AudioEndpoint — how to get audio from a source
CaptionEndpoint — how to get captions from a source
StreamSession — live session handle
StreamChunk — incremental transcript delivery
StreamError — stream-specific errors
MediaResolver trait — universal URI resolution
StreamTranscriber trait — session-based live transcription
AudioRecognizer trait — audio fingerprinting (optional)

Deprecated (kept, bridged)¶

VideoSource trait — bridged to MediaResolver via VideoSourceBridge
VideoData — From<VodMedia> and Into<VodMedia> impls
TranscriptInput — extended with LiveAudio(AudioEndpoint) variant

Implementation Phases¶

Phase A: Type Foundation¶

Add new types to src/types.rs (MediaUri, Platform, etc.)
Add MediaResolver trait to src/traits.rs
Bridge WatchPageSource via VideoSourceBridge
All existing tests pass unchanged
PBT: P1-P4 for the bridge adapter

Phase B: Stream Trait + Session Manager¶

Add StreamTranscriber trait
Add StreamSession, StreamChunk, StreamError types
Implement SessionManager (in-memory session store, cursor assignment, buffer management, backpressure)
No platform adapters yet — test with a mock StreamTranscriber
PBT: P5-P12 against mock implementation
Fuzz: arbitrary cursor values, concurrent session operations

Phase C: YouTube Live Adapter¶

YouTubeLiveResolver implementing MediaResolver
YouTube Live HLS manifest parsing (m3u8-rs)
YouTube Live chat caption ingestion
Feature-gated behind youtube-live
Integration test with MCP Inspector

Phase D: MCP Surface Expansion¶

Add start_stream, poll_stream, stop_stream, list_streams tools
Smart routing in get_transcript (detect live vs VOD)
Updated get_metadata to report stream status
Integration tests for session lifecycle via MCP

Phase E: Platform Adapters¶

Twitch (HLS + IRC captions) — twitch feature
Kick (HLS + WebSocket) — kick feature
HTTP audio streams (Icecast/HLS) — no extra deps
Microphone input (CPAL) — audio-device feature

Phase F: Audio Fingerprinting¶

AudioRecognizer trait
Chromaprint integration — audio-fingerprint feature
AcoustID lookup service
recognize_audio MCP tool

Phase G: Hosted Layer¶

HTTP/SSE API wrapping the FOSS core
Auth, metering, billing layer
Managed PoToken service
Managed STT inference
Separate deployment artifact

Crate Structure (Future)¶

tubebrain/                      # workspace root
  tubebrain-core/               # traits, types, errors (the contract)
  tubebrain-youtube/            # YouTube VOD + Live resolver
  tubebrain-twitch/             # Twitch resolver
  tubebrain-kick/               # Kick resolver
  tubebrain-audio/              # HTTP stream + device resolver
  tubebrain-whisper/            # Whisper STT (VOD + streaming)
  tubebrain-fingerprint/        # audio recognition
  tubebrain-server/             # MCP stdio server (the binary)
  tubebrain-hosted/             # HTTP/SSE hosted service

For now, everything stays in a single crate with feature gates. The workspace split happens when any adapter gets heavy enough to justify it.

Security Considerations¶

Session isolation: each stream session runs in its own task with bounded resources. A misbehaving stream cannot starve other sessions.
Buffer limits: hard caps on buffer depth and segment count prevent memory exhaustion from long-running sessions.
Auth material: platform credentials (Twitch Client-ID, etc.) are environment variables, never in config files or MCP tool inputs.
Audio data: raw audio is processed in-memory and never persisted to disk unless explicitly configured.
Rate limiting: per-platform rate limits enforced in each adapter, not in the session manager.
Compliance posture: each MediaResolver carries its own compliance_note(). The MCP server can expose a compliance tool or --compliance flag aggregating all active adapters.

Decisions (resolved 2026-04-24)¶

Rename timing: internal rebrand to TubeBrain now (docs, Linear project, RFC titles). Binary/crate/npm rename deferred to Phase D alongside stream feature launch — one migration, one marketing moment. Reserve the name on crates.io and npm immediately.
Push vs poll: poll transport first. Internal architecture uses tokio::broadcast channels so push (MCP Streamable HTTP, SSE) is a transport swap, not an architecture change. PBT contracts (cursor monotonicity, no gaps, no dupes) hold for both modes.
Whisper streaming mode: chunked batch on overlapping windows (30s window, 5s overlap, 25s stride). ~28s latency from live edge on M-series. Do not use whisper.cpp's experimental streaming. Sub-second latency is the hosted layer's concern (Deepgram/AssemblyAI/GPU inference).
Session persistence: in-memory for FOSS (Arc<DashMap>). The SessionManager is trait-parameterized so the hosted layer can swap in Redis/Postgres at Phase G. No disk persistence in the local binary.
Multi-language live: one language per session. Multiple languages = multiple sessions sharing the same audio pipeline internally. lang: None = detect dominant language at start. lang: Some("en") = force English.