RFC: Hosted Agent Transcription And PoToken Service
Date: 2026-04-16
Status: exploratory
Owner: open
Summary
Explore an optional hosted service layer for yt-text that complements the
local FOSS binary instead of replacing it.
The target outcome is simple:
- keep the core tool fully open, local-first, and self-hostable
- offer a paid convenience path for users who want instant agent bootstrap
- support hosted HTTP and SSE delivery for fast transcript retrieval, live stream transcription, and managed PoToken handling
This is not an "open core" proposal. The local binary remains the primary product and the hosted path exists for convenience, speed, and reduced setup friction.
Why This Matters
Likely users fall into groups of uneven size:
- many users will just want a working transcript tool inside an agent as fast as possible
- some users will fork or customize the FOSS project for specific workflows
- some users will prefer to pay a few dollars for immediate access rather than manage model downloads, stream processing, or BotGuard sidecars themselves
That makes yt-text a good fit for a FOSS-plus-hosting model:
- local install stays viable and first-class
- hosted service monetizes time-to-value, not code access
- the public codebase remains relevant even if a convenience service exists
Candidate Capabilities
Potential hosted surfaces:
- HTTP API for static transcript extraction
- SSE API for incremental stream transcription output
- managed speech-to-text inference for live audio and captionless videos
- managed PoToken minting for users who do not want to run BotGuard machinery
- thin bootstrap clients that let agents switch between local and hosted modes
This could also include a remote MCP or MCP-adjacent gateway later, but plain HTTP and SSE are the lowest-friction starting points.
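To make the SSE surface concrete, here is a minimal sketch of how one incremental transcript chunk could be framed and read back. The event names (`transcript.chunk`, `transcript.done`) and payload fields (`start_ms`, `text`) are assumptions for illustration, not a defined yt-text wire format. The parser handles only frames produced by this formatter and is not a complete SSE implementation.

```python
import json
from typing import Any, Dict, List, Optional


def format_sse_event(event: str, data: Dict[str, Any], event_id: Optional[int] = None) -> str:
    """Serialize one payload as a Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload fits on a single data: line.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def parse_sse_frames(stream_text: str) -> List[Dict[str, Any]]:
    """Parse frames produced by format_sse_event (not a full SSE parser)."""
    frames = []
    for block in stream_text.split("\n\n"):
        if not block.strip():
            continue
        frame: Dict[str, Any] = {}
        for line in block.split("\n"):
            field, _, value = line.partition(":")
            # The SSE format allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            frame[field] = json.loads(value) if field == "data" else value
        frames.append(frame)
    return frames
```

Carrying an `id` on each frame would let a reconnecting client resume from the last chunk it saw, which matters for long-lived stream sessions.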
Product Shape
The clearest split today looks like this:
FOSS Local Path
- stdio MCP server
- local release installs
- local Whisper fallback
- local PoToken support via feature flag
- maximum transparency and modifiability
Hosted Convenience Path
- fast no-build onboarding
- optional hosted model execution
- optional hosted stream transcription
- optional hosted PoToken handling
- account, quota, and billing layer for reliability and abuse control
The hosted path should feel like a convenience layer for agent bootstrap, not a different product with incompatible semantics.
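One way to keep the two paths semantically aligned is to have both return the same result type behind a single selection point. The sketch below assumes a hypothetical `YT_TEXT_MODE` environment variable and a hypothetical `Segment` shape. Neither exists in yt-text today, and the two backends are stubs standing in for the real local and hosted implementations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping


@dataclass(frozen=True)
class Segment:
    """Hypothetical transcript segment shared by both modes."""
    start_ms: int
    end_ms: int
    text: str


# A backend maps a video ID to transcript segments, regardless of where it runs.
Backend = Callable[[str], List[Segment]]


def select_backend(env: Mapping[str, str], backends: Dict[str, Backend]) -> Backend:
    """Pick a backend by mode; local is the default, matching the local-first goal."""
    mode = env.get("YT_TEXT_MODE", "local")  # hypothetical variable name
    if mode not in backends:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(backends)}")
    return backends[mode]


# Stub backends for illustration only; real ones would call the local binary
# or the hosted HTTP API.
def local_stub(video_id: str) -> List[Segment]:
    return [Segment(0, 1000, f"local transcript for {video_id}")]


def hosted_stub(video_id: str) -> List[Segment]:
    return [Segment(0, 1000, f"hosted transcript for {video_id}")]


BACKENDS: Dict[str, Backend] = {"local": local_stub, "hosted": hosted_stub}
```

Because callers only see `List[Segment]`, an agent could switch modes without changing how it consumes results.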
Monetization Frame
The monetization logic is operational, not extractive.
Users would pay for:
- hosted compute
- managed model download and warm caches
- always-on streaming infrastructure
- reduced latency to first useful transcript
- not having to think about BotGuard or model placement
Users would not pay to unlock the source code or to remove a deliberate limitation in the local build.
Architecture Themes
Likely building blocks:
- request/response service for static YouTube transcript jobs
- SSE delivery for incremental transcript chunks from long-lived sessions
- worker pool for model inference and audio chunk processing
- storage or cache layer for model files and hot session state
- PoToken worker isolation so managed token minting does not contaminate the local architecture
- thin auth, billing, and rate-limit layer
The main design pressure is keeping the hosted service compatible with the local tool's semantics so documentation and agents do not split into two unrelated worlds.
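For the rate-limit layer, a token bucket is one common option. The sketch below is an assumption about how per-account limiting might work, not a decided design. The capacity and refill rate are placeholders, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Per-account token bucket: allows bursts up to capacity, refills at a steady rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so new accounts can burst immediately
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` parameter leaves room for heavier operations, such as stream transcription, to consume more of an account's budget than a static transcript request.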
Risks And Open Questions
- YouTube and stream-source ToS risk gets sharper in a hosted environment
- hosted PoToken minting may carry materially different abuse and legal risk
- streaming audio retention and privacy boundaries need explicit policy
- model hosting changes cost structure and operational burden
- a paid service can distort roadmap priorities if it diverges from the FOSS tool
- agents may need a clean mode switch between local and hosted execution
Non-Goals For A First Version
- locking core transcript functionality behind a paywall
- replacing the local MCP binary as the main interface
- hiding protocol details or service behavior from the open repo
- pretending hosted STT and PoToken handling have the same risk profile as the purely local path
Suggested Next Step
Before implementation:
- define the boundary between the local MCP server and a hosted HTTP/SSE layer
- decide whether the first paid surface is static transcripts, streaming STT, PoToken handling, or a bundle of all three
- define privacy, retention, and abuse policy for hosted audio and token work
- cost the hosted model path against the likely "few dollars for convenience" buyer profile
- decide whether the hosted service is a separate repo/service or lives as an integration layer around this codebase
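The costing step above can start as simple arithmetic: given a monthly price, a fixed per-account overhead, and a compute cost per audio minute, how many minutes can a plan include before it loses money? The sketch below shows only the shape of that calculation. All parameter names are assumptions, and any values plugged in would need to come from measured costs.

```python
def max_included_minutes(
    price_usd: float,
    fixed_cost_usd: float,
    cost_per_audio_minute_usd: float,
) -> float:
    """Audio minutes a plan can include before compute cost exceeds revenue."""
    if cost_per_audio_minute_usd <= 0:
        raise ValueError("cost_per_audio_minute_usd must be positive")
    # Revenue left after fixed overhead, divided by the marginal cost per minute.
    return max(0.0, (price_usd - fixed_cost_usd) / cost_per_audio_minute_usd)


def monthly_margin(
    price_usd: float,
    fixed_cost_usd: float,
    cost_per_audio_minute_usd: float,
    minutes_used: float,
) -> float:
    """Revenue minus fixed and usage-driven cost for one account in one month."""
    return price_usd - fixed_cost_usd - cost_per_audio_minute_usd * minutes_used
```

Running this across a few candidate price points would show whether the "few dollars" tier can cover streaming STT at all, or only static transcripts.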