RFC: Hosted HTTP/SSE API Contract

Date: 2026-05-14
Status: draft
Owner: Jess Sullivan
Linear: TIN-531

Implementation status: GET /v1/health, POST /v1/transcript/section, and the first single-process stream session endpoints are implemented in the tubebrain-hosted binary. Hosted auth supports account-scoped API key records, and protected-preview metering can persist JSONL usage events for restart-safe quota windows. Durable hosted account storage, database-backed billing, and multi-worker session routing remain follow-up work.

Summary

Define the first hosted tubebrain.ai API surface around the already-shipped local TubeBrain semantics.

The hosted layer is a convenience wrapper over the FOSS core. It must preserve the same data model and tool behavior as the local MCP server while adding:

  • HTTP request/response access for VOD transcripts and timestamped sections
  • polling and SSE delivery for live stream sessions
  • API-key auth, metering, rate limits, and abuse controls
  • explicit privacy and retention boundaries

This RFC is an implementation contract for the first hosted service slice. It does not make hosted execution the canonical path; the local MCP binary remains first-class.

Goals

  • Make the hosted API semantics match local MCP tools closely enough that an agent harness can switch between local and hosted modes with minimal logic.
  • Support the GStack research demo data flow: timestamped YouTube URL -> transcript section -> agent summary/link extraction -> browser actions.
  • Keep auth, rate limits, retention, and cost controls visible from day one.
  • Avoid exposing low-level PoToken or resolver internals as public API fields.

Non-Goals

  • Build billing UI or subscription management in this RFC.
  • Expose a public managed PoToken minting endpoint.
  • Store raw audio by default.
  • Replace local MCP stdio with hosted-only behavior.
  • Promise the audio fingerprinting endpoint in the first deployed MVP.

Versioning

All endpoints live under /v1.

Breaking changes require /v2. Additive fields are allowed in /v1; clients must ignore unknown response fields.
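The forward-compatibility rule can be illustrated from the client side. The following is a minimal sketch, not part of the contract: the `Segment` dataclass and `parse_segment` helper are hypothetical client-side names, and the `speaker` field stands in for any field that might be added to /v1 later.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Segment:
    """Hypothetical client-side view of a transcript segment."""
    text: str
    start_ms: int
    end_ms: int


def parse_segment(raw: dict) -> Segment:
    # Keep only the fields this client knows about; anything added
    # later under /v1 is dropped instead of raising an error.
    known = {f.name for f in fields(Segment)}
    return Segment(**{k: v for k, v in raw.items() if k in known})


# "speaker" is an invented field used only to simulate an additive change.
payload = json.loads(
    '{"text": "example text", "start_ms": 2449000, "end_ms": 2453000,'
    ' "speaker": "added-later"}'
)
segment = parse_segment(payload)
```

A client written this way keeps working when the server adds response fields without a version bump.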

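Absolute timestamps are Unix milliseconds unless the field name ends in _s, in which case they are Unix seconds. Durations and offsets follow the same suffix rule: _ms for milliseconds, _s for seconds.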

Auth

Request Authentication

Clients authenticate with an API key:

Authorization: Bearer tb_sk_live_...

API keys are opaque. The protected-preview implementation stores only a SHA-256 hash of each API key in memory; future durable key storage must store only hashes or deployment-secret references, never raw keys.
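A hash-only key store might look like the sketch below. It assumes a hypothetical in-memory `KeyStore` class that maps SHA-256 digests to key record IDs; the actual record layout in tubebrain-hosted is not described here and may differ.

```python
import hashlib
import hmac
from typing import Dict, Optional


def hash_api_key(raw_key: str) -> str:
    # Only this digest is kept; the raw key is never stored.
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


class KeyStore:
    """Hypothetical in-memory store mapping key hashes to key record IDs."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def add(self, raw_key: str, key_id: str) -> None:
        self._records[hash_api_key(raw_key)] = key_id

    def lookup(self, presented_key: str) -> Optional[str]:
        presented = hash_api_key(presented_key)
        for stored, key_id in self._records.items():
            # Compare digests in constant time.
            if hmac.compare_digest(stored, presented):
                return key_id
        return None
```

The lookup returns a stable key ID, which is the only key-related value that should appear in usage events or logs.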

Key Shape

Recommended key prefix:

  • tb_sk_test_ for non-billable development keys
  • tb_sk_live_ for billable production keys

The prefix is informational. Authorization must rely on server-side key records, not prefix parsing alone.

Scopes

Initial scopes:

  • transcript:read - VOD transcript, metadata, language, and section endpoints
  • stream:write - start and stop stream sessions
  • stream:read - poll or subscribe to stream sessions
  • recognize:write - future audio recognition endpoint
  • admin:read - account and usage inspection

The GStack demo requires only transcript:read for the hosted path.

Headers

Successful responses include:

X-Request-Id: req_...
X-RateLimit-Limit: 120
X-RateLimit-Remaining: 119
X-RateLimit-Reset: 1760000000

Clients may send:

Idempotency-Key: user-generated-key

Idempotency-Key is honored for mutating endpoints such as stream start and stop. It is ignored for pure reads.
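As an illustration of the request headers, the sketch below builds (but does not send) a stream start request. `BASE_URL` is a placeholder because this RFC does not name the production host, the `Content-Type` header is an assumption for a JSON body, and `build_stream_start` is a hypothetical helper name.

```python
import json
import urllib.request
import uuid

# Placeholder host; the RFC does not name the production base URL.
BASE_URL = "https://hosted.example"


def build_stream_start(api_key: str, url: str, lang: str = "en") -> urllib.request.Request:
    body = json.dumps({"url": url, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/stream/start",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: JSON bodies are sent with this content type.
            "Content-Type": "application/json",
            # One key per logical operation; a retry should reuse the same value.
            "Idempotency-Key": str(uuid.uuid4()),
        },
    )


request = build_stream_start(
    "tb_sk_test_example", "https://www.youtube.com/watch?v=jfKfPfyJRdk"
)
```

Generating the idempotency key once and reusing it across retries is what lets the server treat a repeated start as the same operation.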

Error Envelope

Errors use a stable JSON envelope:

{
  "error": {
    "code": "invalid_request",
    "message": "url is required",
    "request_id": "req_01h..."
  }
}

Initial error codes:

Code                       HTTP  Meaning
invalid_request            400   malformed JSON, missing fields, invalid cursor
unauthorized               401   missing or invalid API key
forbidden                  403   valid key lacks required scope
not_found                  404   unknown session, video, or route
conflict                   409   idempotency conflict or terminal session state
rate_limited               429   per-key or per-IP limit exceeded
source_unavailable         502   upstream media source failed
transcription_unavailable  503   STT backend unavailable
internal_error             500   unexpected service error
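The code-to-status mapping and the envelope can be sketched together. This is an illustrative helper, not the hosted implementation; `error_response` is a hypothetical name, and the status values are copied from the table above.

```python
from typing import Dict, Tuple

ERROR_STATUS: Dict[str, int] = {
    "invalid_request": 400,
    "unauthorized": 401,
    "forbidden": 403,
    "not_found": 404,
    "conflict": 409,
    "rate_limited": 429,
    "internal_error": 500,
    "source_unavailable": 502,
    "transcription_unavailable": 503,
}


def error_response(code: str, message: str, request_id: str) -> Tuple[int, dict]:
    # Looking up the status first means an unknown code raises KeyError
    # instead of producing an envelope outside the documented set.
    status = ERROR_STATUS[code]
    envelope = {
        "error": {"code": code, "message": message, "request_id": request_id}
    }
    return status, envelope
```

For example, `error_response("invalid_request", "url is required", "req_01h...")` yields status 400 and the envelope shown earlier in this section.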

Endpoint Summary

Endpoint                            Auth scope       Local MCP equivalent         MVP
GET /v1/health                      none             none                         yes
POST /v1/transcript                 transcript:read  get_transcript               reserved
POST /v1/transcript/section         transcript:read  get_transcript_section       yes
POST /v1/languages                  transcript:read  list_languages               reserved
POST /v1/metadata                   transcript:read  get_metadata                 reserved
POST /v1/stream/start               stream:write     start_stream                 yes
GET /v1/stream/{session_id}/poll    stream:read      poll_stream                  yes
POST /v1/stream/{session_id}/stop   stream:write     stop_stream                  yes
GET /v1/stream/{session_id}/events  stream:read      push form of poll_stream     yes
GET /v1/stream                      stream:read      list_streams                 yes
POST /v1/recognize                  recognize:write  recognize_audio future tool  deferred

The current hosted implementation intentionally ships transcript/section and stream sessions before the broader VOD metadata/language endpoints because the GTM wedge is live/radio/YouTube source monitoring. Stream sessions are single-process and in-memory until Redis/Postgres account storage and worker routing are added.

Common Request Fields

Transcript endpoints accept these common fields where relevant:

{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s",
  "lang": "en",
  "format": "json"
}

format may be json, markdown, srt, vtt, or text. Hosted responses default to structured JSON and return rendered text only when a non-JSON format is explicitly requested.

GET /v1/health

Readiness endpoint for load balancers and canaries.

Response:

{
  "status": "ok",
  "service": "tubebrain-hosted",
  "version": "0.1.0",
  "core_version": "0.1.9"
}

status values:

  • ok - service can accept traffic
  • degraded - service can answer some requests but one dependency is impaired
  • unavailable - service should not receive traffic

POST /v1/transcript

Fetch a full structured transcript for a supported VOD URL.

Request:

{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac",
  "lang": "en",
  "format": "json"
}

JSON response:

{
  "request_id": "req_01h...",
  "transcript": {
    "video_id": "Rzi7oFTzjac",
    "title": "Example title",
    "channel": "Example channel",
    "duration_ms": 4200000,
    "language": "en",
    "source": "caption_auto_generated",
    "segments": [
      {
        "text": "example text",
        "start_ms": 2449000,
        "end_ms": 2453000
      }
    ]
  },
  "cache": {
    "hit": false,
    "ttl_s": 3600
  }
}

The transcript object is the same shape as the local Transcript type.

POST /v1/transcript/section

Fetch a timestamp-windowed transcript section. This is the primary hosted MVP endpoint for agent workflows and the GStack research demo.

Request:

{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s",
  "lang": "en",
  "at_s": 2449,
  "before_s": 120,
  "after_s": 600
}

at_s may be omitted when the URL contains a parseable YouTube timestamp.

Response:

{
  "request_id": "req_01h...",
  "section": {
    "video_id": "Rzi7oFTzjac",
    "title": "Example title",
    "channel": "Example channel",
    "duration_ms": 4200000,
    "language": "en",
    "source": "caption_auto_generated",
    "anchor_ms": 2449000,
    "window_start_ms": 2329000,
    "window_end_ms": 3049000,
    "segments": [
      {
        "text": "example text",
        "start_ms": 2449000,
        "end_ms": 2453000
      }
    ]
  },
  "agent_contract": {
    "suggested_task": "summarize_section_and_extract_links",
    "source_url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s"
  }
}

The section object is the same shape as the local TranscriptSection type. Default windows match the local server: 120 seconds before and 600 seconds after the anchor.
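The anchor and window arithmetic can be sketched as follows. Several details are assumptions rather than contract: the accepted `t=` forms (`2449`, `2449s`, `40m49s`, `1h2m3s`), an explicit `at_s` taking precedence over the URL, and clamping the window start at zero. The function names are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumed forms of the YouTube "t" parameter: 2449, 2449s, 40m49s, 1h2m3s.
_T_PATTERN = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s?)?$")


def anchor_seconds(url: str, at_s: Optional[int] = None) -> Optional[int]:
    # Assumption: an explicit at_s wins over a timestamp in the URL.
    if at_s is not None:
        return at_s
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    match = _T_PATTERN.match(values[0])
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def section_window(anchor_s: int, before_s: int = 120, after_s: int = 600) -> dict:
    anchor_ms = anchor_s * 1000
    return {
        "anchor_ms": anchor_ms,
        # Assumption: the window never starts before the beginning of the video.
        "window_start_ms": max(0, anchor_ms - before_s * 1000),
        "window_end_ms": anchor_ms + after_s * 1000,
    }
```

With the example URL, the anchor is 2449 seconds and the default window reproduces the `window_start_ms` and `window_end_ms` values in the response above.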

POST /v1/languages

Request:

{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac"
}

Response:

{
  "request_id": "req_01h...",
  "video_id": "Rzi7oFTzjac",
  "languages": [
    {
      "code": "en",
      "name": "English",
      "is_auto_generated": true,
      "is_translatable": true
    }
  ]
}

POST /v1/metadata

Request:

{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac"
}

Response:

{
  "request_id": "req_01h...",
  "metadata": {
    "video_id": "Rzi7oFTzjac",
    "title": "Example title",
    "channel": "Example channel",
    "duration_ms": 4200000,
    "has_captions": true,
    "caption_languages": ["en"]
  }
}

POST /v1/stream/start

Start a live stream transcription session.

Request:

{
  "url": "https://www.youtube.com/watch?v=jfKfPfyJRdk",
  "lang": "en"
}

Response:

{
  "request_id": "req_01h...",
  "session": {
    "session_id": "sess-1",
    "platform": "youtube",
    "title": "Live stream title",
    "channel": "Live channel",
    "started_at": 1760000000000,
    "language": "en",
    "source": "youtube_live_hls"
  }
}

The session object is the same shape as the local StreamSession type.

GET /v1/stream/{session_id}/poll

Poll a live stream session for transcript segments after a cursor.

Request:

GET /v1/stream/sess-1/poll?cursor=42

Response:

{
  "request_id": "req_01h...",
  "chunk": {
    "session_id": "sess-1",
    "segments": [
      {
        "text": "live words",
        "start_ms": 15000,
        "end_ms": 18000
      }
    ],
    "cursor": 43,
    "is_final": false,
    "buffer_depth_ms": 3000,
    "session_duration_ms": 60000,
    "health": "active",
    "last_diagnostic": null,
    "last_error": null
  }
}

The chunk object is the same shape as the local StreamChunk type.
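A client-side polling loop over the cursor might look like this sketch. `fetch_chunk` is a hypothetical callable that performs the GET request for a given cursor and returns the decoded JSON body; `max_polls` is an illustrative safety bound, not a contract field.

```python
from typing import Callable, List


def drain_session(
    fetch_chunk: Callable[[int], dict], cursor: int = 0, max_polls: int = 100
) -> List[dict]:
    """Collect segments by following the cursor until a final chunk arrives."""
    segments: List[dict] = []
    for _ in range(max_polls):
        chunk = fetch_chunk(cursor)["chunk"]
        segments.extend(chunk["segments"])
        # The returned cursor is the starting point for the next poll.
        cursor = chunk["cursor"]
        if chunk["is_final"]:
            break
    return segments
```

A real client would pause between polls and respect the rate-limit headers; that delay is omitted here to keep the cursor handling visible.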

POST /v1/stream/{session_id}/stop

Stop a live stream session and return the final buffered chunk.

Request:

POST /v1/stream/sess-1/stop

Response:

{
  "request_id": "req_01h...",
  "chunk": {
    "session_id": "sess-1",
    "segments": [],
    "cursor": 43,
    "is_final": true,
    "buffer_depth_ms": 0,
    "session_duration_ms": 61000,
    "health": "stopped",
    "last_diagnostic": null,
    "last_error": null
  }
}

GET /v1/stream/{session_id}/events

SSE form of poll_stream.

Request:

GET /v1/stream/sess-1/events?cursor=42
Accept: text/event-stream

Events:

event: chunk
data: {"request_id":"req_01h...","chunk":{"session_id":"sess-1","segments":[{"text":"live words","start_ms":15000,"end_ms":18000}],"cursor":43,"is_final":false,"buffer_depth_ms":3000,"session_duration_ms":60000,"health":"active","last_diagnostic":null,"last_error":null}}

The first implementation emits one chunk event per request using the same cursor semantics as poll. Long-lived heartbeat/final event streams and Last-Event-ID reconnection are reserved for the durable session store work.
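On the client side, the event body can be decoded with a small parser. The sketch below handles only the `event:` and `data:` fields used in the example above and assumes each event's data is a JSON document; `parse_sse` is a hypothetical helper name.

```python
import json
from typing import Iterator, List, Tuple


def parse_sse(body: str) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, decoded_data) pairs from an SSE response body."""
    event = "message"  # default event name when no "event:" line is present
    data_lines: List[str] = []
    # A trailing blank line is appended so the last event is always flushed.
    for line in body.splitlines() + [""]:
        if line == "":
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and other fields are ignored.
```

Because the first implementation emits one chunk event per request, a client can feed the whole response body to this parser and then issue the next request with the returned cursor.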

GET /v1/stream

List active sessions for the account that owns the current API key.

Response:

{
  "request_id": "req_01h...",
  "sessions": [
    {
      "session_id": "sess-1",
      "platform": "youtube",
      "title": "Live stream title",
      "channel": "Live channel",
      "started_at": 1760000000000,
      "language": "en",
      "source": "youtube_live_hls"
    }
  ]
}

Hosted session IDs are scoped to the account that created them. Wrong-account access returns 404 not_found to avoid leaking whether a session exists.
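The account-scoping rule can be sketched with a hypothetical in-memory owner registry. The class and method names are illustrative only; the point is that an unknown session and a session owned by another account produce the same result.

```python
from typing import Dict, Optional, Tuple


class SessionRegistry:
    """Hypothetical in-memory map from session ID to owning account ID."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def register(self, session_id: str, account_id: str) -> None:
        self._owners[session_id] = account_id

    def authorize(self, session_id: str, account_id: str) -> Tuple[int, Optional[str]]:
        owner = self._owners.get(session_id)
        # Unknown session and wrong account are indistinguishable to the caller.
        if owner is None or owner != account_id:
            return 404, "not_found"
        return 200, None
```

Returning 403 for the wrong-account case would confirm that the session exists, which is exactly what this rule avoids.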

POST /v1/recognize

Reserved for the Phase F audio recognition surface. Do not implement in the first hosted MVP unless TIN-528/TIN-529/TIN-530 have landed.

Data Flow

VOD Section MVP

HTTP request
  -> auth and rate-limit check
  -> parse URL/timestamp
  -> local core get_transcript_section semantics
  -> transcript cache write-through
  -> structured JSON response
  -> usage event

The hosted service should call the same Rust library boundary that powers the MCP tool rather than maintaining a separate transcript implementation.

Live Stream

start request
  -> auth and session quota check
  -> MediaResolver
  -> SessionManager or hosted session store
  -> background ingestion worker
  -> poll/SSE delivery

For the protected-preview MVP, the accepted model is sticky routing to one active worker plus an in-memory session-owner registry. poll, events, stop, and list are account-scoped. A worker restart or wrong-worker route returns 404 not_found for the old session because raw audio and stream buffers are not durably stored. See Hosted Stream Session Routing.

Before broad paid or multi-replica traffic, sessions need the Redis-backed model reserved by the routing RFC:

  • Redis for session cursors, active-session indexes, and short-lived buffers
  • worker ownership metadata and leases so polls route to the right worker
  • explicit session timeout and cleanup jobs

Persistence Model

Current protected-preview implementation:

  • account/key records are loaded from environment configuration
  • usage events can be appended to TUBEBRAIN_USAGE_EVENT_LOG as JSONL
  • tubebrain-hosted rebuilds the current rolling quota window from that JSONL file on restart and ignores duplicate event_id records
  • stream session state remains in-memory and single-process
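The restart-time rebuild can be sketched as below. It assumes each JSONL record carries `event_id` and `created_at_unix_s` (names taken from the usage event fields in the Metering section), and that a truncated trailing line left by a crash is skipped; the actual tubebrain-hosted log schema and recovery behavior may differ.

```python
import json
from typing import Iterable, List


def rebuild_window(lines: Iterable[str], now_s: int, window_s: int) -> List[dict]:
    """Return the deduplicated events that still fall inside the rolling window."""
    seen = set()
    events: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: a partially written line is skipped, not fatal.
            continue
        if event["event_id"] in seen:
            continue  # duplicate record, already counted
        seen.add(event["event_id"])
        if now_s - event["created_at_unix_s"] < window_s:
            events.append(event)
    return events
```

Usage would be along the lines of `rebuild_window(open(path), int(time.time()), window_secs)`, with the path and window taken from configuration.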

Preferred paid-pilot database shape:

PostgreSQL tables:

  • accounts
  • api_keys
  • usage_events
  • billing_customers
  • idempotency_keys

Redis keys:

  • session:{account_id}:{session_id}:metadata
  • session:{account_id}:{session_id}:segments
  • session:{account_id}:{session_id}:diagnostics
  • rate:{account_id}:{window}
  • rate:ip:{ip}:{window}

Raw transcript segments may be cached for performance. Raw audio must not be persisted by default.
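The Redis key patterns above can be centralized in small builder functions so that no call site formats keys by hand. This is a sketch; the function names are hypothetical and the form of the `window` value (here an opaque string) is an assumption.

```python
SESSION_PARTS = {"metadata", "segments", "diagnostics"}


def session_key(account_id: str, session_id: str, part: str) -> str:
    # Restrict the suffix to the three documented session key kinds.
    if part not in SESSION_PARTS:
        raise ValueError(f"unknown session key part: {part}")
    return f"session:{account_id}:{session_id}:{part}"


def account_rate_key(account_id: str, window: str) -> str:
    return f"rate:{account_id}:{window}"


def ip_rate_key(ip: str, window: str) -> str:
    return f"rate:ip:{ip}:{window}"
```

Putting the account ID ahead of the session ID in every session key keeps lookups account-scoped by construction.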

Metering

Minimum usage dimensions:

  • transcript requests
  • transcript section requests
  • upstream media fetch attempts
  • live session starts
  • live session active seconds
  • live audio seconds decoded
  • live STT seconds processed
  • egress bytes
  • source failures and retry counts

Usage events should include request_id, account_id, endpoint, outcome, duration, and cost dimensions. They must not include API key material or raw audio bytes.

Minimum storage fields for the first paid-pilot implementation:

Field                     Type      Notes
event_id                  string    Unique usage event ID
request_id                string    Matches the public response header/body request ID
account_id                string    Customer/account owner
api_key_id                string    Stable key ID only, never the raw key
endpoint                  string    Hosted route or MCP-equivalent operation
source_kind               string    youtube_vod, youtube_live, http_audio, or future adapter
session_id                string?   Present for stream events
outcome                   string    ok, client_error, source_error, transcription_error, rate_limited, internal_error
status_code               integer?  Hosted HTTP status when applicable
duration_ms               integer   Server-side wall-clock duration
stream_active_ms          integer?  Active session time
audio_decoded_ms          integer?  Decoded media duration
stt_processed_ms          integer?  Audio duration submitted to STT
stt_backend               string?   Primary STT backend when available
stt_fallback_mode         string?   Managed fallback mode when available
stt_provider              string?   Managed provider name when available
estimated_cost_micro_usd  integer?  Optional cost estimate
egress_bytes              integer?  Response/SSE egress estimate
retry_count               integer   Source, network, or resolver retries
error_code                string?   Stable public error code only
created_at_unix_s         integer   Event timestamp

Forbidden storage fields:

  • raw API keys or bearer token strings
  • cookies
  • signed media URL path or query values
  • PoToken values
  • BotGuard worker internals
  • raw audio bytes
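One way to keep forbidden material out of storage is an allowlist: only the documented fields are accepted and everything else is rejected before a write. The sketch below is an assumption about how that could be enforced, with `validate_usage_event` as a hypothetical helper; the field names come from the storage table above.

```python
REQUIRED_FIELDS = {
    "event_id", "request_id", "account_id", "api_key_id", "endpoint",
    "source_kind", "outcome", "duration_ms", "retry_count", "created_at_unix_s",
}
OPTIONAL_FIELDS = {
    "session_id", "status_code", "stream_active_ms", "audio_decoded_ms",
    "stt_processed_ms", "stt_backend", "stt_fallback_mode", "stt_provider",
    "estimated_cost_micro_usd", "egress_bytes", "error_code",
}


def validate_usage_event(event: dict) -> dict:
    keys = set(event)
    missing = REQUIRED_FIELDS - keys
    # Anything outside the documented fields is rejected, which covers
    # raw keys, cookies, signed URLs, PoTokens, and audio by construction.
    unexpected = keys - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    return event
```

An allowlist checks field names only; values such as `endpoint` or `error_code` still need to be free of credential-bearing URLs.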

Rate Limits

Initial conservative defaults:

Limit                     Test     Free/design partner  Paid
VOD transcript requests   30/hour  120/hour             tiered
Section requests          60/hour  300/hour             tiered
Concurrent live sessions  1        2                    tiered
Live session duration     10 min   30 min               tiered
SSE connections           1/key    4/key                tiered

Rate limits should be enforced per API key and backed by a coarse per-IP abuse limit for unauthenticated or invalid-key traffic.

Current protected-preview quotas are per account and use TUBEBRAIN_USAGE_WINDOW_SECS as a rolling window. X-RateLimit-Reset is emitted as the number of seconds until the oldest counted event or in-flight reservation exits that window, that is, as a relative delay in seconds.
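The reset value for a rolling window can be computed as in the sketch below. It assumes counted events and in-flight reservations are both represented by their Unix-second timestamps and that fractional results are rounded up; `reset_seconds` is a hypothetical helper, not the hosted implementation.

```python
import math
from typing import List


def reset_seconds(counted_at_s: List[float], now_s: float, window_s: int) -> int:
    """Seconds until the oldest entry still in the window leaves it."""
    live = [t for t in counted_at_s if now_s - t < window_s]
    if not live:
        return 0  # nothing is counted, so there is nothing to wait for
    # The oldest live entry exits the window at (timestamp + window_s).
    return math.ceil(min(live) + window_s - now_s)
```

For entries at 1000 and 1030 with a 3600-second window, a request at 1060 would see a reset value of 3540, because the entry at 1000 leaves the window at 4600.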

Privacy And Retention

Default retention:

  • API request metadata: 30 days
  • usage events: billing/audit retention
  • transcript cache: short TTL, initially 1 hour
  • live segment buffers: session lifetime plus a short cleanup window
  • raw audio: not persisted
  • PoToken material: not exposed and not stored beyond operational need
  • cookies and signed media URLs: not stored as customer-visible records

The service should expose these boundaries in public docs before charging.

Compliance Boundaries

Hosted source resolution has a higher risk profile than local execution. Keep these boundaries explicit:

  • Layer 1 media resolution remains isolated from Layer 2 transcription.
  • Public API responses must not include resolved signed media URLs, cookies, PoTokens, or BotGuard internals.
  • Managed PoToken minting is not a public endpoint in v1.
  • Error messages should be useful but should not leak credential-bearing URLs.

Deployment Shape

Recommended first implementation:

crates or workspace members
  tubebrain-core     existing library boundary
  tubebrain          local MCP binary
  tubebrain-hosted   axum HTTP/SSE binary

Preferred stack:

  • axum for HTTP and SSE
  • tower middleware for request IDs, auth, tracing, compression, and limits
  • PostgreSQL for accounts, API keys, usage, and idempotency
  • Redis for rate limiting and live-session state
  • background workers in the same binary for the first MVP, split later when live stream load requires it

The hosted service writes logs to stdout or stderr as the deployment platform requires, while the local MCP binary continues to reserve stdout for protocol traffic.

GStack Demo Contract

The hosted demo should use:

POST /v1/transcript/section

with:

{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s",
  "lang": "en",
  "before_s": 120,
  "after_s": 600
}

The calling harness receives the section packet and runs:

summarize the section about gstack and open all the articles described in my browser to read.

Browser-opening actions are outside TubeBrain's API boundary. TubeBrain provides the timestamped transcript context; the harness extracts links and executes browser actions.

Acceptance Criteria

TIN-531 is complete when:

  • this API contract is published in the repo docs
  • the public hosted RFC points at this concrete contract
  • the roadmap describes /v1/transcript/section as the first hosted MVP slice
  • the GStack demo plan maps to the hosted endpoint and local MCP tool
  • Linear records that implementation should start from POST /v1/transcript/section

Implementation is a follow-up issue, not part of TIN-531.