RFC: Hosted HTTP/SSE API Contract
Date: 2026-05-14
Status: draft
Owner: Jess Sullivan
Linear: TIN-531
Implementation status: GET /v1/health, POST /v1/transcript/section, and
the first single-process stream session endpoints are implemented in the
tubebrain-hosted binary. Hosted auth supports account-scoped API key records,
and protected-preview metering can persist JSONL usage events for restart-safe
quota windows. Durable hosted account storage, database-backed billing, and
multi-worker session routing remain follow-up work.
Summary
Define the first hosted tubebrain.ai API surface around the already-shipped
local TubeBrain semantics.
The hosted layer is a convenience wrapper over the FOSS core. It must preserve the same data model and tool behavior as the local MCP server while adding:
- HTTP request/response access for VOD transcripts and timestamped sections
- polling and SSE delivery for live stream sessions
- API-key auth, metering, rate limits, and abuse controls
- explicit privacy and retention boundaries
This RFC is an implementation contract for the first hosted service slice. It does not make hosted execution the canonical path; the local MCP binary remains first-class.
Goals
- Make the hosted API semantics match local MCP tools closely enough that an agent harness can switch between local and hosted modes with minimal logic.
- Support the GStack research demo data flow: timestamped YouTube URL -> transcript section -> agent summary/link extraction -> browser actions.
- Keep auth, rate limits, retention, and cost controls visible from day one.
- Avoid exposing low-level PoToken or resolver internals as public API fields.
Non-Goals
- Build billing UI or subscription management in this RFC.
- Expose a public managed PoToken minting endpoint.
- Store raw audio by default.
- Replace local MCP stdio with hosted-only behavior.
- Promise the audio fingerprinting endpoint in the first deployed MVP.
Versioning
All endpoints live under /v1.
Breaking changes require /v2. Additive fields are allowed in /v1; clients
must ignore unknown response fields.
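One way for a client to honor the ignore-unknown-fields rule is to read only the keys it knows about. The sketch below assumes the segment shape shown in this RFC; `Segment` and `parse_segment` are hypothetical names for illustration, not part of any TubeBrain client library.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class Segment:
    """Client-side view of one transcript segment (hypothetical type)."""
    text: str
    start_ms: int
    end_ms: int


def parse_segment(raw: Mapping[str, Any]) -> Segment:
    """Build a Segment from a response object, silently dropping unknown keys."""
    known = {f.name for f in fields(Segment)}
    return Segment(**{k: v for k, v in raw.items() if k in known})
```

A client written this way keeps working when /v1 gains additive fields, which is the property the versioning rule depends on.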
Timestamps are Unix epoch milliseconds and durations are in milliseconds,
unless a field name explicitly ends in _s, in which case the unit is seconds.
Auth
Request Authentication
Clients authenticate with an API key:
Authorization: Bearer tb_sk_live_...
API keys are opaque. The protected-preview implementation stores only a SHA-256 hash of each API key in memory; future durable key storage must store only hashes or deployment-secret references, never raw keys.
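The hash-only storage rule can be illustrated with a small lookup sketch. It assumes key records are held in a dict keyed by the SHA-256 hex digest; the record shape, the function names, and the example key are hypothetical and do not describe the tubebrain-hosted internals.

```python
import hashlib
from typing import Dict, Optional


def hash_api_key(raw_key: str) -> str:
    """Return the SHA-256 hex digest that is stored in place of the raw key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def bearer_token(authorization_header: str) -> Optional[str]:
    """Extract the token from an `Authorization: Bearer <token>` header value."""
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def find_key_record(raw_key: str, records: Dict[str, dict]) -> Optional[dict]:
    """Look up a key record by digest; the raw key is never stored or compared."""
    return records.get(hash_api_key(raw_key))
```

Because only digests are kept, a leaked record store does not directly reveal usable bearer tokens.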
Key Shape
Recommended key prefix:
- tb_sk_test_ for non-billable development keys
- tb_sk_live_ for billable production keys
The prefix is informational. Authorization must rely on server-side key records, not prefix parsing alone.
Scopes
Initial scopes:
- transcript:read - VOD transcript, metadata, language, and section endpoints
- stream:write - start and stop stream sessions
- stream:read - poll or subscribe to stream sessions
- recognize:write - future audio recognition endpoint
- admin:read - account and usage inspection
The GStack demo requires only transcript:read for the hosted path.
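A scope check driven by server-side records might look like the following sketch. The route table is a partial, hand-written subset of the endpoints in this RFC, and `check_scope` is a hypothetical helper; the granted scopes are assumed to come from the key record rather than from the key prefix.

```python
from typing import Dict, Iterable, Optional, Tuple

# Partial route table for illustration; None means no scope is required.
REQUIRED_SCOPE: Dict[Tuple[str, str], Optional[str]] = {
    ("GET", "/v1/health"): None,
    ("POST", "/v1/transcript/section"): "transcript:read",
    ("POST", "/v1/stream/start"): "stream:write",
    ("GET", "/v1/stream"): "stream:read",
}


def check_scope(method: str, route: str, granted: Iterable[str]) -> Optional[str]:
    """Return None when allowed, otherwise a public error code from this RFC."""
    key = (method.upper(), route)
    if key not in REQUIRED_SCOPE:
        return "not_found"
    required = REQUIRED_SCOPE[key]
    if required is None or required in set(granted):
        return None
    return "forbidden"
```

For the GStack demo path, a key carrying only transcript:read passes the section check and is rejected with forbidden on stream routes.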
Headers
Successful responses include:
X-Request-Id: req_...
X-RateLimit-Limit: 120
X-RateLimit-Remaining: 119
X-RateLimit-Reset: 1760000000
Clients may send:
Idempotency-Key: user-generated-key
Idempotency-Key is honored for mutating endpoints such as stream start and
stop. It is ignored for pure reads.
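This RFC does not spell out replay semantics for Idempotency-Key, so the sketch below is only one plausible reading: the first response is replayed when the same account repeats a key with the same body, and a reused key with a different body maps to the 409 conflict code. `IdempotencyStore` and its in-memory dict are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory replay cache keyed by (account_id, idempotency_key)."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str], Tuple[str, Any]] = {}

    def run(self, account_id: str, key: str, body: dict,
            handler: Callable[[dict], Any]) -> Tuple[int, Any]:
        """Return (http_status, response); the handler runs at most once per key."""
        canonical = json.dumps(body, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(canonical).hexdigest()
        slot = (account_id, key)
        if slot in self._seen:
            stored_fingerprint, stored_response = self._seen[slot]
            if stored_fingerprint != fingerprint:
                return 409, None  # assumed mapping to the `conflict` code
            return 200, stored_response
        response = handler(body)
        self._seen[slot] = (fingerprint, response)
        return 200, response
```

A durable version would need expiry and storage shared across workers, which this sketch does not attempt.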
Error Envelope
Errors use a stable JSON envelope:
{
  "error": {
    "code": "invalid_request",
    "message": "url is required",
    "request_id": "req_01h..."
  }
}
Initial error codes:
| Code | HTTP | Meaning |
|---|---|---|
| invalid_request | 400 | malformed JSON, missing fields, invalid cursor |
| unauthorized | 401 | missing or invalid API key |
| forbidden | 403 | valid key lacks required scope |
| not_found | 404 | unknown session, video, or route |
| conflict | 409 | idempotency conflict or terminal session state |
| rate_limited | 429 | per-key or per-IP limit exceeded |
| source_unavailable | 502 | upstream media source failed |
| transcription_unavailable | 503 | STT backend unavailable |
| internal_error | 500 | unexpected service error |
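The code-to-status table can be kept in one place so handlers never choose an HTTP status by hand. This is a sketch under that assumption; `error_response` is a hypothetical helper, and falling back to internal_error for unrecognized codes is a design choice of the example, not something this RFC specifies.

```python
import json
from typing import Tuple

# Mirrors the initial error code table in this RFC.
ERROR_STATUS = {
    "invalid_request": 400,
    "unauthorized": 401,
    "forbidden": 403,
    "not_found": 404,
    "conflict": 409,
    "rate_limited": 429,
    "internal_error": 500,
    "source_unavailable": 502,
    "transcription_unavailable": 503,
}


def error_response(code: str, message: str, request_id: str) -> Tuple[int, str]:
    """Return (http_status, json_body) using the stable error envelope."""
    if code not in ERROR_STATUS:
        # Assumed fallback: never leak an unregistered code to clients.
        code, message = "internal_error", "unexpected service error"
    body = {"error": {"code": code, "message": message, "request_id": request_id}}
    return ERROR_STATUS[code], json.dumps(body)
```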
Endpoint Summary
| Endpoint | Auth scope | Local MCP equivalent | MVP |
|---|---|---|---|
| GET /v1/health | none | none | yes |
| POST /v1/transcript | transcript:read | get_transcript | reserved |
| POST /v1/transcript/section | transcript:read | get_transcript_section | yes |
| POST /v1/languages | transcript:read | list_languages | reserved |
| POST /v1/metadata | transcript:read | get_metadata | reserved |
| POST /v1/stream/start | stream:write | start_stream | yes |
| GET /v1/stream/{session_id}/poll | stream:read | poll_stream | yes |
| POST /v1/stream/{session_id}/stop | stream:write | stop_stream | yes |
| GET /v1/stream/{session_id}/events | stream:read | push form of poll_stream | yes |
| GET /v1/stream | stream:read | list_streams | yes |
| POST /v1/recognize | recognize:write | recognize_audio future tool | deferred |
The current hosted implementation intentionally ships transcript/section and
stream sessions before the broader VOD metadata/language endpoints because the
GTM wedge is live/radio/YouTube source monitoring. Stream sessions are
single-process and in-memory until Redis/Postgres account storage and worker
routing are added.
Common Request Fields
Transcript endpoints accept these common fields where relevant:
{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s",
  "lang": "en",
  "format": "json"
}
format may be json, markdown, srt, vtt, or text. Hosted responses should
default to structured JSON and return rendered text only when a non-JSON
format is explicitly requested.
GET /v1/health
Readiness endpoint for load balancers and canaries.
Response:
{
  "status": "ok",
  "service": "tubebrain-hosted",
  "version": "0.1.0",
  "core_version": "0.1.9"
}
status values:
- ok - service can accept traffic
- degraded - service can answer some requests but one dependency is impaired
- unavailable - service should not receive traffic
POST /v1/transcript
Fetch a full structured transcript for a supported VOD URL.
Request:
{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac",
  "lang": "en",
  "format": "json"
}
JSON response:
{
  "request_id": "req_01h...",
  "transcript": {
    "video_id": "Rzi7oFTzjac",
    "title": "Example title",
    "channel": "Example channel",
    "duration_ms": 4200000,
    "language": "en",
    "source": "caption_auto_generated",
    "segments": [
      {
        "text": "example text",
        "start_ms": 2449000,
        "end_ms": 2453000
      }
    ]
  },
  "cache": {
    "hit": false,
    "ttl_s": 3600
  }
}
The transcript object is the same shape as the local Transcript type.
POST /v1/transcript/section
Fetch a timestamp-windowed transcript section. This is the primary hosted MVP endpoint for agent workflows and the GStack research demo.
Request:
{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s",
  "lang": "en",
  "at_s": 2449,
  "before_s": 120,
  "after_s": 600
}
at_s may be omitted when the URL contains a parseable YouTube timestamp.
Response:
{
  "request_id": "req_01h...",
  "section": {
    "video_id": "Rzi7oFTzjac",
    "title": "Example title",
    "channel": "Example channel",
    "duration_ms": 4200000,
    "language": "en",
    "source": "caption_auto_generated",
    "anchor_ms": 2449000,
    "window_start_ms": 2329000,
    "window_end_ms": 3049000,
    "segments": [
      {
        "text": "example text",
        "start_ms": 2449000,
        "end_ms": 2453000
      }
    ]
  },
  "agent_contract": {
    "suggested_task": "summarize_section_and_extract_links",
    "source_url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s"
  }
}
The section object is the same shape as the local TranscriptSection type.
Default windows match the local server: 120 seconds before and 600 seconds after
the anchor.
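The window arithmetic behind anchor_ms, window_start_ms, and window_end_ms can be sketched as follows. The helper names are hypothetical, only the plain-seconds form of the YouTube `t` parameter is handled, and clamping to zero and to the video duration is an assumption rather than documented behavior of the local server.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def anchor_seconds_from_url(url: str) -> Optional[int]:
    """Read a plain-seconds `t=2449s` or `t=2449` query parameter, if present."""
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    raw = values[0][:-1] if values[0].endswith("s") else values[0]
    return int(raw) if raw.isdigit() else None


def section_window_ms(anchor_s: int, before_s: int = 120, after_s: int = 600,
                      duration_ms: Optional[int] = None) -> Tuple[int, int, int]:
    """Return (anchor_ms, window_start_ms, window_end_ms)."""
    anchor_ms = anchor_s * 1000
    start_ms = max(0, anchor_ms - before_s * 1000)  # assumed clamp at video start
    end_ms = anchor_ms + after_s * 1000
    if duration_ms is not None:
        end_ms = min(end_ms, duration_ms)  # assumed clamp at video end
    return anchor_ms, start_ms, end_ms
```

With the request shown above (anchor 2449 s and the default 120 s / 600 s window), this yields the same 2449000, 2329000, and 3049000 values as the example response.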
POST /v1/languages
Request:
{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac"
}
Response:
{
  "request_id": "req_01h...",
  "video_id": "Rzi7oFTzjac",
  "languages": [
    {
      "code": "en",
      "name": "English",
      "is_auto_generated": true,
      "is_translatable": true
    }
  ]
}
POST /v1/metadata
Request:
{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac"
}
Response:
{
  "request_id": "req_01h...",
  "metadata": {
    "video_id": "Rzi7oFTzjac",
    "title": "Example title",
    "channel": "Example channel",
    "duration_ms": 4200000,
    "has_captions": true,
    "caption_languages": ["en"]
  }
}
POST /v1/stream/start
Start a live stream transcription session.
Request:
{
  "url": "https://www.youtube.com/watch?v=jfKfPfyJRdk",
  "lang": "en"
}
Response:
{
  "request_id": "req_01h...",
  "session": {
    "session_id": "sess-1",
    "platform": "youtube",
    "title": "Live stream title",
    "channel": "Live channel",
    "started_at": 1760000000000,
    "language": "en",
    "source": "youtube_live_hls"
  }
}
The session object is the same shape as the local StreamSession type.
GET /v1/stream/{session_id}/poll
Poll a live stream session for transcript segments after a cursor.
Request:
GET /v1/stream/sess-1/poll?cursor=42
Response:
{
  "request_id": "req_01h...",
  "chunk": {
    "session_id": "sess-1",
    "segments": [
      {
        "text": "live words",
        "start_ms": 15000,
        "end_ms": 18000
      }
    ],
    "cursor": 43,
    "is_final": false,
    "buffer_depth_ms": 3000,
    "session_duration_ms": 60000,
    "health": "active",
    "last_diagnostic": null,
    "last_error": null
  }
}
The chunk object is the same shape as the local StreamChunk type.
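A client-side polling loop over the cursor might look like the sketch below. `fetch_chunk` is a hypothetical callable that performs GET /v1/stream/{session_id}/poll?cursor=N and returns the `chunk` object; the loop assumes a poll eventually reports is_final: true and otherwise gives up after `max_polls` requests.

```python
from typing import Callable, Dict, Iterator


def drain_session(fetch_chunk: Callable[[int], Dict], cursor: int = 0,
                  max_polls: int = 100) -> Iterator[Dict]:
    """Yield segments in order, advancing the cursor returned by each chunk."""
    for _ in range(max_polls):
        chunk = fetch_chunk(cursor)
        yield from chunk["segments"]
        cursor = chunk["cursor"]  # always resume from the server-provided cursor
        if chunk["is_final"]:
            return
```

A real harness would sleep between polls and watch the rate-limit headers; both are omitted here to keep the cursor handling visible.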
POST /v1/stream/{session_id}/stop
Stop a live stream session and return the final buffered chunk.
Request:
POST /v1/stream/sess-1/stop
Response:
{
  "request_id": "req_01h...",
  "chunk": {
    "session_id": "sess-1",
    "segments": [],
    "cursor": 43,
    "is_final": true,
    "buffer_depth_ms": 0,
    "session_duration_ms": 61000,
    "health": "stopped",
    "last_diagnostic": null,
    "last_error": null
  }
}
GET /v1/stream/{session_id}/events
SSE form of poll_stream.
Request:
GET /v1/stream/sess-1/events?cursor=42
Accept: text/event-stream
Events:
event: chunk
data: {"request_id":"req_01h...","chunk":{"session_id":"sess-1","segments":[{"text":"live words","start_ms":15000,"end_ms":18000}],"cursor":43,"is_final":false,"buffer_depth_ms":3000,"session_duration_ms":60000,"health":"active","last_diagnostic":null,"last_error":null}}
The first implementation emits one chunk event per request using the same
cursor semantics as poll. Long-lived heartbeat/final event streams and
Last-Event-ID reconnection are reserved for the durable session store work.
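On the client side, the event body above can be decoded with a small parser. This sketch handles only the `event:` and `data:` fields and comment lines from the SSE wire format and assumes each event's data is JSON; `parse_sse` is a hypothetical helper, not part of a TubeBrain SDK.

```python
import json
from typing import Dict, Iterator, List, Tuple


def parse_sse(stream_text: str) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, decoded_json) for each complete event in an SSE body."""
    event = "message"  # default event name when no `event:` field is sent
    data_lines: List[str] = []
    for line in stream_text.splitlines() + [""]:
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a heartbeat
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
```

Because the first implementation emits one chunk event per request, a caller can treat the single yielded pair much like a poll response.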
GET /v1/stream
List active sessions for the current API key.
Response:
{
  "request_id": "req_01h...",
  "sessions": [
    {
      "session_id": "sess-1",
      "platform": "youtube",
      "title": "Live stream title",
      "channel": "Live channel",
      "started_at": 1760000000000,
      "language": "en",
      "source": "youtube_live_hls"
    }
  ]
}
Hosted session IDs are scoped to the account that created them. Wrong-account
access returns 404 not_found to avoid leaking whether a session exists.
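The account-scoping rule can be sketched as an ownership check that deliberately collapses "unknown session" and "someone else's session" into the same answer. `SessionRegistry` is a hypothetical in-memory structure for illustration, not the session-owner registry used by tubebrain-hosted.

```python
from typing import Dict, List


class SessionRegistry:
    """In-memory map of session_id to the account that created it."""

    def __init__(self) -> None:
        self._owner_by_session: Dict[str, str] = {}

    def register(self, session_id: str, account_id: str) -> None:
        self._owner_by_session[session_id] = account_id

    def status_for(self, session_id: str, account_id: str) -> int:
        """Return 200 for the owner, 404 for everyone else."""
        # Missing and wrong-account sessions are indistinguishable to the caller.
        if self._owner_by_session.get(session_id) != account_id:
            return 404
        return 200

    def list_for(self, account_id: str) -> List[str]:
        """Return only the session IDs owned by this account."""
        return sorted(s for s, a in self._owner_by_session.items() if a == account_id)
```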
POST /v1/recognize
Reserved for the Phase F audio recognition surface. Do not implement in the first hosted MVP unless TIN-528/TIN-529/TIN-530 have landed.
Data Flow
VOD Section MVP
HTTP request
-> auth and rate-limit check
-> parse URL/timestamp
-> local core get_transcript_section semantics
-> transcript cache write-through
-> structured JSON response
-> usage event
The hosted service should call the same Rust library boundary that powers the MCP tool rather than maintaining a separate transcript implementation.
Live Stream
start request
-> auth and session quota check
-> MediaResolver
-> SessionManager or hosted session store
-> background ingestion worker
-> poll/SSE delivery
For the protected-preview MVP, the accepted model is sticky routing to one active
worker plus an in-memory session-owner registry. poll, events, stop, and
list are account-scoped. A worker restart or wrong-worker route returns
404 not_found for the old session because raw audio and stream buffers are not
durably stored. See
Hosted Stream Session Routing.
Before broad paid or multi-replica traffic, sessions need the Redis-backed model reserved by the routing RFC:
- Redis for session cursors, active-session indexes, and short-lived buffers
- worker ownership metadata and leases so polls route to the right worker
- explicit session timeout and cleanup jobs
Persistence Model
Current protected-preview implementation:
- account/key records are loaded from environment configuration
- usage events can be appended to TUBEBRAIN_USAGE_EVENT_LOG as JSONL
- tubebrain-hosted rebuilds the current rolling quota window from that JSONL file on restart and ignores duplicate event_id records
- stream session state remains in-memory and single-process
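The restart-time rebuild described in this list can be sketched as a single pass over the JSONL file. The sketch assumes each line carries `event_id` and `created_at_unix_s`, borrowing the field names from the Metering section; the actual on-disk format of the usage log may differ, and `rebuild_window` is a hypothetical name.

```python
import json
from typing import Iterable, List


def rebuild_window(jsonl_lines: Iterable[str], now_s: int, window_s: int) -> List[dict]:
    """Return de-duplicated usage events that are still inside the rolling window."""
    seen_ids = set()
    events: List[dict] = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        event = json.loads(line)
        if event["event_id"] in seen_ids:
            continue  # duplicate record, counted once
        seen_ids.add(event["event_id"])
        if event["created_at_unix_s"] > now_s - window_s:
            events.append(event)
    return events
```

Reading the file with `open(path)` and passing the file object as `jsonl_lines` works, since iterating a text file yields lines.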
Preferred paid-pilot database shape:
PostgreSQL tables:
- accounts
- api_keys
- usage_events
- billing_customers
- idempotency_keys
Redis keys:
- session:{account_id}:{session_id}:metadata
- session:{account_id}:{session_id}:segments
- session:{account_id}:{session_id}:diagnostics
- rate:{account_id}:{window}
- rate:ip:{ip}:{window}
Raw transcript segments may be cached for performance. Raw audio must not be persisted by default.
Metering
Minimum usage dimensions:
- transcript requests
- transcript section requests
- upstream media fetch attempts
- live session starts
- live session active seconds
- live audio seconds decoded
- live STT seconds processed
- egress bytes
- source failures and retry counts
Usage events should include request_id, account_id, endpoint, outcome,
duration, and cost dimensions. They must not include API key material or raw
audio bytes.
Minimum storage fields for the first paid-pilot implementation:
| Field | Type | Notes |
|---|---|---|
| event_id | string | Unique usage event ID |
| request_id | string | Matches the public response header/body request ID |
| account_id | string | Customer/account owner |
| api_key_id | string | Stable key ID only, never the raw key |
| endpoint | string | Hosted route or MCP-equivalent operation |
| source_kind | string | youtube_vod, youtube_live, http_audio, or future adapter |
| session_id | string? | Present for stream events |
| outcome | string | ok, client_error, source_error, transcription_error, rate_limited, internal_error |
| status_code | integer? | Hosted HTTP status when applicable |
| duration_ms | integer | Server-side wall-clock duration |
| stream_active_ms | integer? | Active session time |
| audio_decoded_ms | integer? | Decoded media duration |
| stt_processed_ms | integer? | Audio duration submitted to STT |
| stt_backend | string? | Primary STT backend when available |
| stt_fallback_mode | string? | Managed fallback mode when available |
| stt_provider | string? | Managed provider name when available |
| estimated_cost_micro_usd | integer? | Optional cost estimate |
| egress_bytes | integer? | Response/SSE egress estimate |
| retry_count | integer | Source, network, or resolver retries |
| error_code | string? | Stable public error code only |
| created_at_unix_s | integer | Event timestamp |
Forbidden storage fields:
- raw API keys or bearer token strings
- cookies
- signed media URL path or query values
- PoToken values
- BotGuard worker internals
- raw audio bytes
Rate Limits
Initial conservative defaults:
| Limit | Test | Free/design partner | Paid |
|---|---|---|---|
| VOD transcript requests | 30/hour | 120/hour | tiered |
| Section requests | 60/hour | 300/hour | tiered |
| Concurrent live sessions | 1 | 2 | tiered |
| Live session duration | 10 min | 30 min | tiered |
| SSE connections | 1/key | 4/key | tiered |
Rate limits should be enforced per API key and backed by a coarse per-IP abuse limit for unauthenticated or invalid-key traffic.
Current protected-preview quotas are per account, use
TUBEBRAIN_USAGE_WINDOW_SECS as a rolling window, and emit
x-ratelimit-reset as a relative value: the number of seconds until the oldest
counted event or in-flight reservation exits that window.
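The reset value for a rolling window can be computed from the oldest event that still counts. The sketch assumes event times are Unix seconds and treats in-flight reservations as ordinary entries in the list; returning 0 when nothing is counted is an assumption of the example, and `seconds_until_reset` is a hypothetical name.

```python
from typing import Iterable


def seconds_until_reset(event_times_s: Iterable[int], now_s: int, window_s: int) -> int:
    """Seconds until the oldest counted event leaves the rolling window."""
    counted = [t for t in event_times_s if t > now_s - window_s]
    if not counted:
        return 0  # assumed value when the window is empty
    return max(0, min(counted) + window_s - now_s)
```

For example, with a 60-second window, events at t=100 and t=120, and now=130, the oldest counted event exits at t=160, so the value is 30.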
Privacy And Retention
Default retention:
- API request metadata: 30 days
- usage events: billing/audit retention
- transcript cache: short TTL, initially 1 hour
- live segment buffers: session lifetime plus a short cleanup window
- raw audio: not persisted
- PoToken material: not exposed and not stored beyond operational need
- cookies and signed media URLs: not stored as customer-visible records
The service should expose these boundaries in public docs before charging.
Compliance Boundaries
Hosted source resolution has a higher risk profile than local execution. Keep these boundaries explicit:
- Layer 1 media resolution remains isolated from Layer 2 transcription.
- Public API responses must not include resolved signed media URLs, cookies, PoTokens, or BotGuard internals.
- Managed PoToken minting is not a public endpoint in v1.
- Error messages should be useful but should not leak credential-bearing URLs.
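One way to keep credential-bearing URLs out of error messages is to reduce every URL to its scheme and host before the message leaves the service. This is a sketch of that idea only; `redact_urls` and its regular expression are hypothetical, and a production filter would also need to cover cookies and token values that appear outside URLs.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches http(s) URLs up to the next whitespace, quote, or angle bracket.
_URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def redact_urls(message: str) -> str:
    """Replace each URL with scheme://host, dropping path, query, and userinfo."""
    def _strip(match: "re.Match[str]") -> str:
        parts = urlsplit(match.group(0))
        return urlunsplit((parts.scheme, parts.hostname or "", "", "", ""))
    return _URL_PATTERN.sub(_strip, message)
```

The caller still learns which upstream host failed, while signed path and query values never reach the public envelope.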
Deployment Shape
Recommended first implementation:
crates or workspace members
  tubebrain-core     existing library boundary
  tubebrain          local MCP binary
  tubebrain-hosted   axum HTTP/SSE binary
Preferred stack:
- axum for HTTP and SSE
- tower middleware for request IDs, auth, tracing, compression, and limits
- PostgreSQL for accounts, API keys, usage, and idempotency
- Redis for rate limiting and live-session state
- background workers in the same binary for the first MVP, split later when live stream load requires it
The hosted service must write logs to stdout or stderr as the deployment platform requires; the local MCP binary still reserves stdout for protocol traffic.
GStack Demo Contract
The hosted demo should use:
POST /v1/transcript/section
with:
{
  "url": "https://www.youtube.com/watch?v=Rzi7oFTzjac&t=2449s",
  "lang": "en",
  "before_s": 120,
  "after_s": 600
}
The calling harness receives the section packet and runs:
summarize the section about gstack and open all the articles described in my browser to read.
Browser-opening actions are outside TubeBrain's API boundary. TubeBrain provides the timestamped transcript context; the harness extracts links and executes browser actions.
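A harness could assemble the demo call with the standard library as in the sketch below. The base URL and API key in the usage note are placeholders, since this RFC does not fix a production hostname, and `build_section_request` is a hypothetical helper; the request is built but not sent.

```python
import json
import urllib.request


def build_section_request(base_url: str, api_key: str, video_url: str,
                          before_s: int = 120, after_s: int = 600,
                          lang: str = "en") -> urllib.request.Request:
    """Build (but do not send) a POST /v1/transcript/section request."""
    body = {"url": video_url, "lang": lang, "before_s": before_s, "after_s": after_s}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/transcript/section",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; the harness then reads `section.segments` from the JSON body and handles link extraction and browser actions itself.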
Acceptance Criteria
TIN-531 is complete when:
- this API contract is published in the repo docs
- the public hosted RFC points at this concrete contract
- the roadmap describes /v1/transcript/section as the first hosted MVP slice
- the GStack demo plan maps to the hosted endpoint and local MCP tool
- Linear records that implementation should start from POST /v1/transcript/section
Implementation is a follow-up issue, not part of TIN-531.