RFC: TubeBrain Trait Evolution¶
Date: 2026-04-24 Status: draft Owner: Jess Sullivan
Summary¶
Evolve tubebrain's two-trait architecture into a multi-platform, stream-capable system under the working name TubeBrain. The core design principle from the original spec — "Layer 1 carries all ToS sensitivity, Layer 2 is clean computation" — extends to cover YouTube Live, Twitch, Kick, HTTP audio streams, microphone input, and audio fingerprinting.
This RFC defines the trait hierarchy, type system, MCP surface expansion, session lifecycle, and PBT contract boundaries.
Motivation¶
tubebrain v0.1.9 handles YouTube VOD transcripts well. The adjacent space is larger and more valuable:
- Live captioning for accessibility agents
- Realtime stream summarization (YouTube Live, Twitch, Kick)
- Background audio monitoring (radio, podcast feeds, ambient mic)
- Audio fingerprinting / recognition ("what is playing right now?")
- Hosted SaaS layer for instant agent bootstrap
All of these share the same computational core — turn audio into structured text — but differ in how media is resolved, how audio arrives, and how transcripts are delivered.
Design Principles¶
- Layer 1 is ToS-sensitive, Layer 2 is clean computation — preserved.
- VOD and Live are peers, not parent-child — neither is a special case of the other. They share output types but have different lifecycle contracts.
- Session-based streaming uses polling — MCP clients handle request/response better than push streams. Polling with cursors is the pragmatic first design.
- Platform adapters are independent crates — each platform adapter is a separate concern with its own auth, rate limiting, and compliance posture.
- The Segment is the atom — all paths produce
Vec<Segment>. Format rendering happens after, same as today. - PBT contracts at trait boundaries — every trait has testable properties that hold regardless of implementation.
Current Architecture (tubebrain v0.1.9)¶
VideoSource::resolve(video_id) -> VideoData
impl: WatchPageSource (YouTube GET /watch)
TranscriptSource::transcribe(TranscriptInput) -> Transcript
impl: CaptionParser (json3/xml)
impl: WhisperTranscriber (feature-gated)
fetch_transcript() orchestrates the fallback chain:
manual captions -> auto captions -> whisper -> error
Types are YouTube-specific: VideoData, CaptionTrack, AudioStream,
video_id everywhere.
Proposed Architecture¶
Layer 0: Media Resolution¶
A new top-level trait that resolves any supported URI into a media descriptor.
This absorbs the current VideoSource role and generalizes it.
/// What kind of media source the user pointed us at.
#[derive(Debug, Clone)]
pub enum MediaUri {
YouTube(String), // video ID, URL, or live URL
Twitch(String), // channel name or URL
Kick(String), // channel name or URL
HttpAudio(Url), // Icecast, HLS, podcast, generic audio URL
AudioDevice(String), // system audio device identifier
Unknown(String), // let resolvers try to claim it
}
/// Resolved media — what we know about the source after resolution.
#[derive(Debug, Clone)]
pub enum MediaDescriptor {
Vod(VodMedia),
Live(LiveMedia),
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VodMedia {
pub id: String,
pub platform: Platform,
pub title: String,
pub channel: String,
pub duration_ms: u64,
pub caption_tracks: Vec<CaptionTrack>,
pub audio_streams: Vec<AudioStream>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LiveMedia {
pub id: String,
pub platform: Platform,
pub title: String,
pub channel: String,
pub started_at: Option<u64>,
pub audio_endpoint: AudioEndpoint,
pub caption_endpoint: Option<CaptionEndpoint>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Platform {
YouTube,
Twitch,
Kick,
HttpStream,
AudioDevice,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum AudioEndpoint {
DirectUrl { url: String, content_type: String },
HlsManifest { url: String },
DashManifest { url: String },
WebSocket { url: String },
Device { device_id: String, sample_rate: u32, channels: u16 },
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum CaptionEndpoint {
TimedText { base_url: String, needs_po_token: bool },
IrcChat { channel: String },
WebSocketChat { url: String },
}
/// Layer 0 trait: resolve a URI into a media descriptor.
/// This is the ToS-sensitive layer.
#[async_trait]
pub trait MediaResolver: Send + Sync {
async fn resolve(&self, uri: &MediaUri) -> Result<MediaDescriptor, SourceError>;
fn supported_platforms(&self) -> &[Platform];
fn source_name(&self) -> &'static str;
fn compliance_note(&self) -> &'static str;
}
Backwards Compatibility Bridge¶
The existing VideoSource trait becomes a thin wrapper:
/// Bridge: existing VideoSource impls automatically become MediaResolvers.
impl<T: VideoSource> MediaResolver for VideoSourceBridge<T> {
async fn resolve(&self, uri: &MediaUri) -> Result<MediaDescriptor, SourceError> {
let MediaUri::YouTube(ref id_or_url) = uri else {
return Err(SourceError::UnsupportedPlatform);
};
let video_id = extract_video_id(id_or_url)?;
let video_data = self.inner.resolve(&video_id).await?;
Ok(MediaDescriptor::Vod(VodMedia::from(video_data)))
}
// ...
}
This means WatchPageSource continues to work unchanged. New platform
adapters implement MediaResolver directly.
Layer 1: VOD Transcription (unchanged)¶
The existing TranscriptSource trait stays as-is for VOD work:
#[async_trait]
pub trait TranscriptSource: Send + Sync {
async fn transcribe(&self, input: TranscriptInput) -> Result<Transcript, TranscribeError>;
fn source_name(&self) -> &'static str;
}
CaptionParser and WhisperTranscriber continue implementing this.
Layer 2: Stream Transcription (new)¶
A new trait for session-based, incremental transcription of live sources:
/// A live transcription session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StreamSession {
pub session_id: String,
pub platform: Platform,
pub title: String,
pub channel: String,
pub started_at: u64, // epoch ms
pub language: Option<String>,
pub source: String, // "live_caption", "live_whisper", etc.
}
/// A chunk of transcript from a live session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StreamChunk {
pub session_id: String,
pub segments: Vec<Segment>,
pub cursor: u64, // monotonic, opaque to clients
pub is_final: bool, // true when stream ended
pub buffer_depth_ms: u64, // how far behind live edge
pub session_duration_ms: u64,
pub health: StreamHealth,
pub last_diagnostic: Option<StreamDiagnostic>,
pub last_error: Option<StreamDiagnostic>,
}
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum StreamHealth {
Active,
Degraded,
Failed,
Stopped,
}
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct StreamDiagnostic {
pub at_ms: u64,
pub source: String,
pub message: String,
}
/// Errors specific to stream transcription.
#[derive(Debug, thiserror::Error)]
pub enum StreamError {
#[error("session not found: {0}")]
SessionNotFound(String),
#[error("stream ended")]
StreamEnded,
#[error("connection lost: {0}")]
ConnectionLost(String),
#[error("platform error: {0}")]
PlatformError(String),
#[error("transcription error: {0}")]
TranscriptionError(String),
#[error("session limit reached")]
SessionLimitReached,
#[error("unsupported platform")]
UnsupportedPlatform,
}
/// Layer 2 trait: session-based live transcription.
#[async_trait]
pub trait StreamTranscriber: Send + Sync {
/// Start a new transcription session for a live source.
async fn start(&self, media: &LiveMedia) -> Result<StreamSession, StreamError>;
/// Poll for new segments after the given cursor.
/// Returns immediately with empty segments if nothing new.
async fn poll(
&self,
session_id: &str,
after_cursor: u64,
) -> Result<StreamChunk, StreamError>;
/// Stop a session and return any remaining buffered segments.
async fn stop(&self, session_id: &str) -> Result<StreamChunk, StreamError>;
/// List active sessions.
async fn list_sessions(&self) -> Vec<StreamSession>;
fn source_name(&self) -> &'static str;
}
last_diagnostic is non-error status about the latest ingestion or
transcription activity, such as a chunk that produced no transcript segments
yet. last_error drives degraded/failed health and remains the field agents
should inspect for actual failures.
Layer 3: Audio Fingerprinting (new, optional)¶
For "shazam"-style recognition — orthogonal to transcription:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AudioFingerprint {
pub fingerprint: Vec<u8>,
pub duration_ms: u64,
pub sample_rate: u32,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RecognitionResult {
pub title: Option<String>,
pub artist: Option<String>,
pub album: Option<String>,
pub isrc: Option<String>,
pub confidence: f64,
pub source: String, // "chromaprint", "acoustid", etc.
pub external_ids: HashMap<String, String>,
}
#[async_trait]
pub trait AudioRecognizer: Send + Sync {
async fn recognize(
&self,
audio: &[f32],
sample_rate: u32,
) -> Result<Vec<RecognitionResult>, RecognitionError>;
fn source_name(&self) -> &'static str;
}
MCP Tool Surface¶
Existing Tools (unchanged)¶
| Tool | Input | Output | Mode |
|---|---|---|---|
get_transcript |
url, lang, format | Formatted transcript | VOD |
get_transcript_section |
url, lang, at_s, before_s, after_s, format | Timestamp-windowed transcript | VOD |
list_languages |
url | Language list | VOD |
get_metadata |
url | Video metadata | VOD |
New Stream Tools¶
| Tool | Input | Output | Mode |
|---|---|---|---|
start_stream |
url | StreamSession | Live |
poll_stream |
session_id, cursor | StreamChunk | Live |
stop_stream |
session_id | StreamChunk (final) | Live |
list_streams |
(none) | Vec |
Live |
New Recognition Tools¶
| Tool | Input | Output | Mode |
|---|---|---|---|
recognize_audio |
url or session_id | RecognitionResult | Any |
Smart Routing¶
get_transcript and start_stream both accept a URL. The server resolves
the URI first, then routes:
MediaDescriptor::Vod(_)-> VOD path (get_transcript)MediaDescriptor::Live(_)-> Live path (start_stream)
If a user calls get_transcript on a live URL, the server returns a helpful
error suggesting start_stream instead. If they call start_stream on a VOD
URL, the server could either error or start a "replay" session.
Platform Adapter Matrix¶
| Platform | MediaResolver | VOD | Live | Auth | Compliance |
|---|---|---|---|---|---|
| YouTube VOD | WatchPageSource (bridge) | yes | — | PoToken | robots.txt OK |
| YouTube Live | YouTubeLiveResolver | — | HLS+chat | PoToken | same |
| Twitch | TwitchResolver | VODs | HLS+IRC | Client-ID | API ToS |
| Kick | KickResolver | VODs | HLS+WS | none (public) | minimal |
| HTTP Audio | HttpStreamResolver | file URLs | Icecast/HLS | varies | clean |
| Microphone | AudioDeviceResolver | — | CPAL | none | clean |
Each adapter is feature-gated:
[features]
default = []
po-token = [...] # existing
whisper = [...] # existing
youtube-live = ["dep:m3u8-rs"]
twitch = ["dep:twitch-irc"]
kick = ["dep:tokio-tungstenite"]
audio-device = ["dep:cpal"]
audio-fingerprint = ["dep:chromaprint"]
Session Lifecycle¶
┌─────────┐
│ Idle │
└────┬────┘
│ start_stream(url)
▼
┌──────────────────┐
│ Resolving URI │ MediaResolver::resolve()
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Connecting │ Open HLS/WS/device stream
└────────┬─────────┘
│
▼
┌──────────────────┐
┌───▶│ Streaming │◀──── poll_stream(session, cursor)
│ └────────┬─────────┘ returns StreamChunk
│ │
│ reconnect│ on transient failure
│ │
│ ┌────────▼─────────┐
└────│ Reconnecting │ exponential backoff, bounded retries
└────────┬─────────┘
│ stop_stream() or stream ends naturally
▼
┌──────────────────┐
│ Draining │ flush remaining buffer
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Stopped │ is_final=true, session archived
└──────────────────┘
Backpressure and Limits¶
- Maximum concurrent sessions: configurable (default: 4)
- Per-session buffer depth: configurable (default: 300s / 5 minutes)
- Segments older than buffer depth are dropped (agent must poll frequently enough)
- Session timeout: configurable (default: 3600s / 1 hour with no polls)
- Buffer memory budget: bounded by (max_sessions * buffer_depth * segment_rate)
Cursor Design¶
Cursors are monotonically increasing u64 values. They are opaque to
clients — the server assigns them. A client polls with after_cursor=0 to
get everything buffered, or with the last cursor it received to get only
new segments.
Properties (testable via PBT):
- cursor(chunk_n+1) > cursor(chunk_n) — strictly monotonic
- segments(poll(cursor=0)) is a prefix of the full session transcript
- union(all polls) == full session transcript — no gaps, no duplicates
- poll(cursor=latest) returns empty segments, not an error
Audio Pipeline (Stream Mode)¶
Audio Source ──▶ Chunk Buffer ──▶ Normalizer ──▶ STT Engine ──▶ Segment Buffer
(HLS/WS/ (ring buffer, (16kHz mono (Whisper (cursor-indexed,
device) ~10s chunks) f32 PCM) streaming) bounded depth)
For live Whisper, the audio is chunked into overlapping windows: - Window size: 30s (Whisper's sweet spot) - Overlap: 5s (prevents segment boundary artifacts) - Stride: 25s of new audio per inference - On M-series: ~3s inference for 30s audio = ~10x realtime = sustainable
For platforms with native captions (YouTube Live chat captions, Twitch IRC): - Caption text arrives pre-segmented - No STT needed — direct segment construction - Lower latency, lower CPU, higher accuracy
PBT Contracts¶
MediaResolver Properties¶
∀ uri, platform:
P1: resolve(uri) either succeeds or returns a typed error — never panics
P2: resolve(uri).platform ∈ supported_platforms()
P3: resolve is idempotent within a session (same URI → same descriptor shape)
P4: compliance_note() is non-empty
StreamTranscriber Properties¶
∀ session:
P5: start() → session_id is unique across all active sessions
P6: poll(session, 0) returns all buffered segments
P7: poll(session, cursor_n).cursor > cursor_n (strict monotonicity)
P8: stop() returns is_final=true
P9: poll after stop returns SessionNotFound or is_final=true
P10: segments from sequential polls have no gaps and no duplicates
(union of all poll results = complete transcript)
P11: buffer_depth_ms ≤ configured maximum
P12: concurrent sessions ≤ configured limit
Segment Properties (shared VOD + Live)¶
∀ segment:
P13: start_ms ≤ end_ms
P14: text is valid UTF-8 and non-empty after trim
P15: serde roundtrip is identity: deserialize(serialize(s)) == s
AudioRecognizer Properties¶
∀ audio, sample_rate:
P16: recognize() either succeeds or returns typed error — never panics
P17: confidence ∈ [0.0, 1.0]
P18: empty audio → empty results (not an error)
Type Evolution Summary¶
Preserved (unchanged)¶
Segment { text, start_ms, end_ms }— the universal atomCaptionTrack— YouTube caption metadataAudioStream— YouTube audio stream metadataTranscript— complete VOD transcriptTranscriptSourcetrait — VOD transcription- All output formatters (JSON, Markdown, SRT, VTT, text)
- All parsers (HTML, JSON3, XML)
- Cache layer (extended to cover stream session metadata)
Added¶
MediaUri— universal media identifier enumMediaDescriptor— resolved media (Vod | Live)VodMedia— generalized VOD metadata (superset of VideoData)LiveMedia— live stream metadataPlatform— source platform enumAudioEndpoint— how to get audio from a sourceCaptionEndpoint— how to get captions from a sourceStreamSession— live session handleStreamChunk— incremental transcript deliveryStreamError— stream-specific errorsMediaResolvertrait — universal URI resolutionStreamTranscribertrait — session-based live transcriptionAudioRecognizertrait — audio fingerprinting (optional)
Deprecated (kept, bridged)¶
VideoSourcetrait — bridged toMediaResolverviaVideoSourceBridgeVideoData—From<VodMedia>andInto<VodMedia>implsTranscriptInput— extended withLiveAudio(AudioEndpoint)variant
Implementation Phases¶
Phase A: Type Foundation¶
- Add new types to
src/types.rs(MediaUri, Platform, etc.) - Add
MediaResolvertrait tosrc/traits.rs - Bridge
WatchPageSourceviaVideoSourceBridge - All existing tests pass unchanged
- PBT: P1-P4 for the bridge adapter
Phase B: Stream Trait + Session Manager¶
- Add
StreamTranscribertrait - Add
StreamSession,StreamChunk,StreamErrortypes - Implement
SessionManager(in-memory session store, cursor assignment, buffer management, backpressure) - No platform adapters yet — test with a mock
StreamTranscriber - PBT: P5-P12 against mock implementation
- Fuzz: arbitrary cursor values, concurrent session operations
Phase C: YouTube Live Adapter¶
YouTubeLiveResolverimplementingMediaResolver- YouTube Live HLS manifest parsing (m3u8-rs)
- YouTube Live chat caption ingestion
- Feature-gated behind
youtube-live - Integration test with MCP Inspector
Phase D: MCP Surface Expansion¶
- Add
start_stream,poll_stream,stop_stream,list_streamstools - Smart routing in
get_transcript(detect live vs VOD) - Updated
get_metadatato report stream status - Integration tests for session lifecycle via MCP
Phase E: Platform Adapters¶
- Twitch (HLS + IRC captions) —
twitchfeature - Kick (HLS + WebSocket) —
kickfeature - HTTP audio streams (Icecast/HLS) — no extra deps
- Microphone input (CPAL) —
audio-devicefeature
Phase F: Audio Fingerprinting¶
AudioRecognizertrait- Chromaprint integration —
audio-fingerprintfeature - AcoustID lookup service
recognize_audioMCP tool
Phase G: Hosted Layer¶
- HTTP/SSE API wrapping the FOSS core
- Auth, metering, billing layer
- Managed PoToken service
- Managed STT inference
- Separate deployment artifact
Crate Structure (Future)¶
tubebrain/ # workspace root
tubebrain-core/ # traits, types, errors (the contract)
tubebrain-youtube/ # YouTube VOD + Live resolver
tubebrain-twitch/ # Twitch resolver
tubebrain-kick/ # Kick resolver
tubebrain-audio/ # HTTP stream + device resolver
tubebrain-whisper/ # Whisper STT (VOD + streaming)
tubebrain-fingerprint/ # audio recognition
tubebrain-server/ # MCP stdio server (the binary)
tubebrain-hosted/ # HTTP/SSE hosted service
For now, everything stays in a single crate with feature gates. The workspace split happens when any adapter gets heavy enough to justify it.
Security Considerations¶
- Session isolation: each stream session runs in its own task with bounded resources. A misbehaving stream cannot starve other sessions.
- Buffer limits: hard caps on buffer depth and segment count prevent memory exhaustion from long-running sessions.
- Auth material: platform credentials (Twitch Client-ID, etc.) are environment variables, never in config files or MCP tool inputs.
- Audio data: raw audio is processed in-memory and never persisted to disk unless explicitly configured.
- Rate limiting: per-platform rate limits enforced in each adapter, not in the session manager.
- Compliance posture: each
MediaResolvercarries its owncompliance_note(). The MCP server can expose acompliancetool or--complianceflag aggregating all active adapters.
Decisions (resolved 2026-04-24)¶
-
Rename timing: internal rebrand to TubeBrain now (docs, Linear project, RFC titles). Binary/crate/npm rename deferred to Phase D alongside stream feature launch — one migration, one marketing moment. Reserve the name on crates.io and npm immediately.
-
Push vs poll: poll transport first. Internal architecture uses
tokio::broadcastchannels so push (MCP Streamable HTTP, SSE) is a transport swap, not an architecture change. PBT contracts (cursor monotonicity, no gaps, no dupes) hold for both modes. -
Whisper streaming mode: chunked batch on overlapping windows (30s window, 5s overlap, 25s stride). ~28s latency from live edge on M-series. Do not use whisper.cpp's experimental streaming. Sub-second latency is the hosted layer's concern (Deepgram/AssemblyAI/GPU inference).
-
Session persistence: in-memory for FOSS (
Arc<DashMap>). TheSessionManageris trait-parameterized so the hosted layer can swap in Redis/Postgres at Phase G. No disk persistence in the local binary. -
Multi-language live: one language per session. Multiple languages = multiple sessions sharing the same audio pipeline internally.
lang: None= detect dominant language at start.lang: Some("en")= force English.