RFC: Hosted Agent Transcription And PoToken Service

Date: 2026-04-16
Status: exploratory
Owner: open

Summary

Explore an optional hosted service layer for yt-text that complements the local FOSS binary instead of replacing it.

The target outcome is simple:

keep the core tool fully open, local-first, and self-hostable
offer a paid convenience path for users who want instant agent bootstrap
support hosted HTTP and SSE delivery for fast transcript retrieval, live stream transcription, and managed PoToken handling

This is not an "open core" proposal. The local binary remains the primary product, and the hosted path exists only for convenience, speed, and reduced setup friction.

Why This Matters

The likely user distribution is uneven:

many users will just want a working transcript tool inside an agent as fast as possible
some users will fork or customize the FOSS project for specific workflows
some users will prefer to pay a few dollars for immediate access rather than manage model downloads, stream processing, or BotGuard sidecars themselves

That makes yt-text a good fit for a FOSS-plus-hosting model:

local install stays viable and first-class
hosted service monetizes time-to-value, not code access
the public codebase remains relevant even if a convenience service exists

Candidate Capabilities

Potential hosted surfaces:

HTTP API for static transcript extraction
SSE API for incremental stream transcription output
managed speech-to-text inference for live audio and captionless videos
managed PoToken minting for users who do not want to run BotGuard machinery
thin bootstrap clients that let agents switch between local and hosted modes

This could also include a remote MCP or MCP-adjacent gateway later, but plain HTTP and SSE are the lowest-friction starting points.
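As a rough illustration of what consuming the SSE surface might involve, the sketch below parses a `text/event-stream` body into events. This RFC does not define a wire format, so the event names `chunk` and `done` and the JSON fields `start` and `text` are assumptions made up for the example.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines.

    Handles `event:` and `data:` fields plus comment lines. The `id:` and
    `retry:` fields are ignored in this sketch.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


def collect_text(lines: Iterable[str]) -> str:
    """Join the text of all hypothetical `chunk` events into one string."""
    return " ".join(
        json.loads(e["data"])["text"]
        for e in parse_sse(lines)
        if e["event"] == "chunk"
    )


# Hypothetical stream body; a real client would read these lines from HTTP.
SAMPLE = [
    ": keep-alive\n",
    "event: chunk\n",
    'data: {"start": 0.0, "text": "hello"}\n',
    "\n",
    "event: chunk\n",
    'data: {"start": 1.5, "text": "world"}\n',
    "\n",
    "event: done\n",
    "data: {}\n",
    "\n",
]

if __name__ == "__main__":
    print(collect_text(SAMPLE))
```

Because the parser only needs an iterable of lines, the same function could sit behind any HTTP client that exposes a streaming response line by line.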

Product Shape

The clearest split today looks like this:

FOSS Local Path

stdio MCP server
local release installs
local Whisper fallback
local PoToken support via feature flag
maximum transparency and modifiability

Hosted Convenience Path

fast no-build onboarding
optional hosted model execution
optional hosted stream transcription
optional hosted PoToken handling
account, quota, and billing layer for reliability and abuse control

The hosted path should feel like a convenience layer for agent bootstrap, not a different product with incompatible semantics.
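One way to picture "same semantics, different execution" is a single interface with two interchangeable implementations. The sketch below is only that: `TranscriptBackend`, `LocalBackend`, `HostedBackend`, `Segment`, and `select_backend` are hypothetical names, and both backends return placeholder data instead of calling the real binary or a real service.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass(frozen=True)
class Segment:
    start: float  # offset in seconds
    text: str


class TranscriptBackend(Protocol):
    """The contract both paths would share (hypothetical)."""

    def transcribe(self, url: str) -> List[Segment]: ...


class LocalBackend:
    """Stand-in for the local binary; returns placeholder data."""

    def transcribe(self, url: str) -> List[Segment]:
        return [Segment(0.0, f"local transcript for {url}")]


class HostedBackend:
    """Stand-in for the hosted service; returns placeholder data."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def transcribe(self, url: str) -> List[Segment]:
        return [Segment(0.0, f"hosted transcript for {url}")]


def select_backend(mode: str, api_key: Optional[str] = None) -> TranscriptBackend:
    """Pick a backend by mode; callers see the same return type either way."""
    if mode == "local":
        return LocalBackend()
    if mode == "hosted":
        if not api_key:
            raise ValueError("hosted mode requires an API key")
        return HostedBackend(api_key)
    raise ValueError(f"unknown mode: {mode!r}")
```

An agent written against `TranscriptBackend` would not need to change when the mode flips, which is the property the hosted path is meant to preserve.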

Monetization Frame

The monetization logic is operational, not extractive.

Users would pay for:

hosted compute
managed model download and warm caches
always-on streaming infrastructure
reduced latency to first useful transcript
not having to think about BotGuard or model placement

Users would not pay to unlock the source code or to remove a deliberate limitation built into the local tool.

Architecture Themes

Likely building blocks:

request/response service for static YouTube transcript jobs
SSE delivery for incremental transcript chunks from long-lived sessions
worker pool for model inference and audio chunk processing
storage or cache layer for model files and hot session state
PoToken worker isolation so managed token minting stays separate from, and adds no dependencies to, the local architecture
thin auth, billing, and rate-limit layer

The main design pressure is keeping the hosted service compatible with the local tool's semantics so documentation and agents do not split into two unrelated worlds.
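A minimal sketch of that compatibility pressure, assuming a single chunk type that both paths serialize identically: the local tool could emit the JSON form directly, while the hosted service wraps the same JSON in an SSE frame. The `TranscriptChunk` fields and the `chunk` event name are assumptions for illustration, not a defined schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TranscriptChunk:
    seq: int      # monotonically increasing chunk number
    start: float  # seconds from stream start
    end: float
    text: str


def to_json(chunk: TranscriptChunk) -> str:
    """Canonical JSON form, usable by both local and hosted output."""
    return json.dumps(asdict(chunk), sort_keys=True)


def to_sse_frame(chunk: TranscriptChunk) -> str:
    """Wrap the canonical JSON in an SSE frame for hosted delivery."""
    # json.dumps escapes newlines, so the payload always fits one data line.
    return f"id: {chunk.seq}\nevent: chunk\ndata: {to_json(chunk)}\n\n"


def from_sse_frame(frame: str) -> TranscriptChunk:
    """Recover the chunk from a frame produced by to_sse_frame."""
    for line in frame.splitlines():
        if line.startswith("data: "):
            return TranscriptChunk(**json.loads(line[len("data: "):]))
    raise ValueError("frame has no data field")
```

Keeping the JSON payload identical across transports means documentation and agent prompts can describe one chunk shape, with SSE treated purely as framing.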

Risks And Open Questions

YouTube and stream-source ToS risk gets sharper in a hosted environment
hosted PoToken minting may carry materially different abuse and legal risk
streaming audio retention and privacy boundaries need explicit policy
model hosting changes cost structure and operational burden
a paid service can distort roadmap priorities if its needs diverge from those of the FOSS tool
agents may need a clean mode switch between local and hosted execution

Non-Goals For A First Version

locking core transcript functionality behind a paywall
replacing the local MCP binary as the main interface
hiding protocol details or service behavior from the open repo
pretending hosted STT and PoToken handling have the same risk profile as the purely local path

Suggested Next Step

Before implementation:

define the boundary between the local MCP server and a hosted HTTP/SSE layer
decide whether the first paid surface is static transcripts, streaming STT, PoToken handling, or a bundle of all three
define privacy, retention, and abuse policy for hosted audio and token work
estimate the cost of the hosted model path and compare it with what the likely "few dollars for convenience" buyer would pay
decide whether the hosted service is a separate repo/service or lives as an integration layer around this codebase
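
The costing step could start from arithmetic as simple as the sketch below. Every number in the usage example is a placeholder chosen to make the arithmetic easy to follow, not a measured or quoted figure; the function and parameter names are likewise hypothetical.

```python
def monthly_margin(
    price_usd: float,
    audio_hours: float,
    compute_usd_per_audio_hour: float,
    fixed_usd_per_user: float = 0.0,
) -> float:
    """Revenue minus cost for one user in one month."""
    cost = audio_hours * compute_usd_per_audio_hour + fixed_usd_per_user
    return price_usd - cost


def break_even_hours(
    price_usd: float,
    compute_usd_per_audio_hour: float,
    fixed_usd_per_user: float = 0.0,
) -> float:
    """Hours of transcribed audio at which a user's margin reaches zero."""
    if compute_usd_per_audio_hour <= 0:
        raise ValueError("compute cost per audio hour must be positive")
    return max(0.0, (price_usd - fixed_usd_per_user) / compute_usd_per_audio_hour)


if __name__ == "__main__":
    # Placeholder inputs only: a 5 USD plan, 0.25 USD of compute per audio
    # hour, and 1 USD of fixed per-user overhead.
    print(break_even_hours(5.0, 0.25, 1.0))
    print(monthly_margin(5.0, 10, 0.25, 1.0))
```

Replacing the placeholders with measured inference cost per audio hour would show how much usage a flat low-priced plan could absorb before it loses money.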