Agent Guide
yt-text works best when an agent treats transcript retrieval as a staged
decision instead of a single blind tool call.
This page is the canonical guide for in-session tool ordering, language handling, transcript-source interpretation, and error-handling patterns.
Recommended Tool Order
Use the tools in this order unless you have a reason not to:
1. get_metadata
2. list_languages
3. get_transcript
Why this order works:
- get_metadata is the lightest validation path for a URL, title, duration, and caption availability
- list_languages tells you whether strict language selection is realistic before you promise a specific output
- get_transcript is the heavy path, so it should come after basic grounding
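The ordering above can be sketched as a small helper. This is a minimal sketch, not the project's client code: call_tool is a hypothetical callable standing in for whatever your MCP client exposes, and the url argument name is an assumption. Only the tool names and the lang and format arguments come from this guide.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical signature: call_tool(tool_name, arguments) -> decoded result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def staged_transcript(
    call_tool: ToolCaller, url: str, lang: Optional[str] = None
) -> Dict[str, Any]:
    """Run the three tools in the recommended order, cheapest first."""
    # 1. Light validation before anything expensive.
    #    The "url" argument name is an assumption.
    metadata = call_tool("get_metadata", {"url": url})

    # 2. List languages only when the caller cares about a specific one.
    languages = None
    if lang is not None:
        languages = call_tool("list_languages", {"url": url})

    # 3. Heavy path last, in JSON so language, source, and timestamps survive.
    args: Dict[str, Any] = {"url": url, "format": "json"}
    if lang is not None:
        args["lang"] = lang
    transcript = call_tool("get_transcript", args)

    return {"metadata": metadata, "languages": languages, "transcript": transcript}
```

Passing the tool caller in as a parameter keeps the ordering logic independent of any particular MCP client and makes it easy to exercise with a stub.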
When To Call Each Tool
get_metadata
Use get_metadata first when you need to:
- verify that a URL resolves to a public video
- inspect title, channel, and duration for routing or summarization
- check whether captions exist before asking for a transcript
- cheaply validate that the client and MCP server are wired correctly
list_languages
Use list_languages when language choice matters:
- before promising the user a transcript in a specific language
- before treating lang=en as guaranteed
- before deciding whether to ask the user which caption track they want
get_transcript
Use get_transcript once you know you actually want the transcript payload.
For most agent workflows, format=json is the best default because it keeps
the language, source, and timestamp fields available for downstream reasoning.
Use other formats when the output target is already known:
- markdown for direct human review inside a chat
- srt or vtt for subtitle export
- text for lightweight plain-text summarization
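One way to encode the format guidance is a lookup keyed by output target. The target labels below are hypothetical names invented for this sketch. Only the format values (json, markdown, srt, vtt, text) come from this guide.

```python
# Hypothetical target labels mapped to the format values named in this guide.
FORMAT_BY_TARGET = {
    "agent_reasoning": "json",
    "chat_review": "markdown",
    "subtitle_srt": "srt",
    "subtitle_vtt": "vtt",
    "plain_summary": "text",
}


def pick_format(target: str) -> str:
    """Return the transcript format for a target, defaulting to json."""
    # json is the fallback because it keeps language, source, and timestamps.
    return FORMAT_BY_TARGET.get(target, "json")
```

Defaulting to json means an unknown target never loses the fields an agent needs for later checks.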
Language Handling
The lang argument is a preference, not a hard contract.
What matters in practice:
- the returned language field is authoritative
- lang=en does not guarantee an English transcript
- if the requested language is missing, yt-text may fall back to another available language instead of hard-failing
If strict language choice matters:
- call list_languages first
- confirm the target language code is present
- inspect the language field in the transcript response before presenting the result as final
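The strict-language check can be sketched as follows. It assumes the list_languages result has already been reduced to a plain list of language codes and that the JSON transcript carries language as a top-level key. Both shapes are assumptions; see Sample Outputs for the real response structure.

```python
from typing import Any, Dict, List


def check_language(
    requested: str, available: List[str], transcript: Dict[str, Any]
) -> Dict[str, Any]:
    """Compare the requested language with what was listed and returned."""
    # Assumption: "language" is a top-level key of the JSON transcript.
    returned = transcript.get("language")
    return {
        "requested": requested,
        "returned": returned,
        # Was the requested code present in the listed languages?
        "was_listed": requested in available,
        # The returned field is authoritative, so compare against it.
        "matches": returned == requested,
    }
```

The comparison is an exact string match, so a regional variant such as en-US would not match en. Whether to normalize codes before comparing is a choice left to the caller.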
Transcript Source Handling
The source field tells the agent how the transcript was produced.
- caption_manual: human-authored captions; usually the highest-confidence path
- caption_auto_generated: machine-generated captions; usable but more error-prone
- whisper_local: local speech-to-text fallback; slower and usually noisier than captions
Treat source as part of the answer quality, not just an implementation detail.
If the source is not caption_manual, say so in the response when quality or
attribution matters.
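A small helper can turn the source value into a caveat for the final answer. This is an illustrative sketch: the three source values come from this guide, while the function name and the caveat wording are made up for the example.

```python
from typing import Optional

# Caveat wording is illustrative; only the keys come from this guide.
SOURCE_CAVEATS = {
    "caption_auto_generated": "Transcript is from machine-generated captions and may contain errors.",
    "whisper_local": "Transcript is from local speech-to-text and may be noisier than captions.",
}


def source_caveat(source: str) -> Optional[str]:
    """Return a caveat to surface to the user, or None for manual captions."""
    if source == "caption_manual":
        return None
    # Unknown values still get a caveat rather than silent trust.
    return SOURCE_CAVEATS.get(
        source, f"Transcript source is '{source}'; quality is unknown."
    )
```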
Error-Handling Pattern
Use these decision rules in agent sessions:
- invalid YouTube URL or video ID: stop and ask for a valid watch URL or video ID
- video is private or requires authentication: stop; the default path does not support it
- video is age-restricted: stop; treat it as unsupported for the default path
- no captions available ... try building with --features po-token: explain that metadata and language listing may still work, but the caption fetch path is blocked
- PoToken required but botguard sidecar not available: suggest a po-token build
- whisper feature not enabled: suggest a whisper build only if the user needs captionless fallback
If a transcript fetch fails, do not throw away the metadata path. get_metadata
and list_languages can still provide useful grounding.
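The decision rules above can be sketched as a substring lookup. The match fragments are taken from the error strings listed in this guide. The action labels are hypothetical, and real messages may be worded differently, so check Troubleshooting for the exact strings before relying on this.

```python
from typing import List, Tuple

# (fragment of the error message, hypothetical action label)
ERROR_RULES: List[Tuple[str, str]] = [
    ("invalid YouTube URL or video ID", "ask_for_valid_url"),
    ("private or requires authentication", "stop_unsupported"),
    ("age-restricted", "stop_unsupported"),
    ("no captions available", "explain_caption_fetch_blocked"),
    ("PoToken required", "suggest_po_token_build"),
    ("whisper feature not enabled", "suggest_whisper_build_if_needed"),
]


def classify_error(message: str) -> str:
    """Map an error message to an action label by case-insensitive match."""
    lowered = message.lower()
    for fragment, action in ERROR_RULES:
        if fragment.lower() in lowered:
            return action
    # Anything unrecognized is surfaced rather than guessed at.
    return "report_unrecognized_error"
```

Whatever the classification, the agent can still report the earlier get_metadata and list_languages results alongside the failure.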
MCP Client Hygiene
When wiring yt-text into an MCP client:
- prefer pointing the client directly at the yt-text binary
- if the binary is not on PATH, use an absolute path
- do not wrap yt-text in a shell script that prints to stdout
- restart the client after config changes unless the client explicitly hot-reloads MCP
stdout is reserved for MCP protocol traffic. Any banner text or wrapper noise
can break the session.
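As a rough illustration of the absolute-path advice, the helper below resolves the binary and builds a launch entry. The command/args shape is an assumption modeled on a convention some MCP clients use, and args is left empty because this guide does not say whether yt-text needs arguments to start as a server. Check Integrations for the configuration your client expects.

```python
import os
import shutil
from typing import Callable, Dict, Optional


def server_entry(
    binary: str = "yt-text",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, object]:
    """Build a launch entry that points directly at an absolute binary path."""
    resolved = which(binary)
    if resolved is None:
        raise FileNotFoundError(
            f"{binary} not found on PATH; supply an absolute path instead"
        )
    # No shell wrapper: the client launches the binary itself, so nothing
    # else can write to stdout ahead of MCP protocol traffic.
    return {"command": os.path.abspath(resolved), "args": []}
```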
Prompting Patterns
Useful instructions to give an agent:
- "Run get_metadata first. If language choice matters, run list_languages. Then fetch get_transcript in JSON."
- "If the transcript source is not caption_manual, tell me that in the answer."
- "If the requested language is unavailable, say which language was returned instead of assuming English."
- "If transcript fetch fails, still tell me whether captions exist and which languages are listed."
Related Docs
- Integrations for client configuration
- Quickstarts for end-to-end workflows
- Sample Outputs for response shape
- Troubleshooting for concrete failure strings
- Tools for the reference surface