Agent Guide

yt-text works best when an agent treats transcript retrieval as a staged decision instead of a single blind tool call.

This page is the canonical guide for in-session tool ordering, language handling, transcript-source interpretation, and error-handling patterns.

Recommended Tool Order

Use the tools in this order unless you have a reason not to:

get_metadata
list_languages
get_transcript

Why this order works:

get_metadata is the lightest way to validate a URL and read the title, duration, and caption availability
list_languages tells you whether strict language selection is realistic before you promise a specific output
get_transcript is the heavy path, so it should come after basic grounding
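
The order above can be sketched as a small helper. This is a minimal illustration, not yt-text's own client code: `call_tool` is a hypothetical callable standing in for whatever your MCP client uses to invoke a tool, and the `url` argument name is an assumption. Only the tool names and the `lang` and `format` arguments come from this guide.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def staged_fetch(call_tool: ToolCaller, url: str, lang: str = "en") -> Dict[str, Any]:
    """Ground the request cheaply before paying for the transcript fetch."""
    # Step 1: lightest call; validates the URL and reports caption availability.
    metadata = call_tool("get_metadata", {"url": url})

    # Step 2: see which caption languages are realistic before promising one.
    languages = call_tool("list_languages", {"url": url})

    # Step 3: the heavy path, run only after the two cheap calls returned.
    transcript = call_tool(
        "get_transcript", {"url": url, "lang": lang, "format": "json"}
    )
    return {"metadata": metadata, "languages": languages, "transcript": transcript}
```

A real agent would usually inspect the metadata and language results between steps and stop early when they rule out a transcript; the sketch only shows the call order.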

When To Call Each Tool

`get_metadata`

Use get_metadata first when you need to:

verify that a URL resolves to a public video
inspect title, channel, and duration for routing or summarization
check whether captions exist before asking for a transcript
cheaply validate that the client and MCP server are wired correctly

`list_languages`

Use list_languages when language choice matters:

before promising the user a transcript in a specific language
before treating lang=en as guaranteed
before deciding whether to ask the user which caption track they want

`get_transcript`

Use get_transcript once you know you actually want the transcript payload.

For most agent workflows, format=json is the best default because it keeps the language, source, and timestamp fields available for downstream reasoning.

Use other formats when the output target is already known:

markdown for direct human review inside a chat
srt or vtt for subtitle export
text for lightweight plain-text summarization
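
One way to encode that choice is a lookup with json as the fallback. The target labels below (`chat_review`, `subtitle_srt`, and so on) are illustrative names invented for this sketch, not yt-text identifiers; only the format values come from this guide.

```python
from typing import Optional

# Keys are hypothetical output-target labels; values are the formats named above.
FORMAT_BY_TARGET = {
    "chat_review": "markdown",
    "subtitle_srt": "srt",
    "subtitle_vtt": "vtt",
    "plain_summary": "text",
}


def pick_format(target: Optional[str] = None) -> str:
    """Return the format for a known output target, defaulting to json."""
    # json keeps language, source, and timestamp fields for later reasoning,
    # so it is the fallback whenever the target is unknown or unspecified.
    return FORMAT_BY_TARGET.get(target, "json")
```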

Language Handling

The lang argument is a preference, not a hard contract.

What matters in practice:

the returned language field is authoritative
lang=en does not guarantee an English transcript
if the requested language is missing, yt-text may fall back to another available language instead of hard-failing

If strict language choice matters:

call list_languages first
confirm the target language code is present
inspect the language field in the transcript response before presenting the result as final
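
A minimal sketch of that check, assuming the list_languages result has already been reduced to a plain list of language codes (the actual response shape is not described on this page) and that the transcript response is a dict with the `language` field mentioned above:

```python
from typing import Any, Dict, List


def language_report(
    requested: str, listed: List[str], transcript: Dict[str, Any]
) -> Dict[str, Any]:
    """Compare the requested language with what was listed and returned."""
    # The returned language field is authoritative, not the lang argument.
    returned = transcript.get("language")
    return {
        "requested_is_listed": requested in listed,
        "returned": returned,
        # Exact string comparison is a simplification; regional variants
        # of the same language would need extra handling.
        "matches_request": returned == requested,
    }
```

When `matches_request` is false, the agent should name the returned language in its answer rather than present the transcript as the requested one.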

Transcript Source Handling

The source field tells the agent how the transcript was produced.

caption_manual: human-authored captions; usually the highest-confidence path
caption_auto_generated: machine-generated captions; usable but more error-prone
whisper_local: local speech-to-text fallback; slower and usually noisier than captions

Treat source as part of answer quality, not just an implementation detail. If the source is not caption_manual, say so in the response when quality or attribution matters.
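
As a sketch, an agent could turn the source field into an optional note for the user. The wording of the notes and the handling of unrecognized values are assumptions made for this example; only the three source values come from this guide.

```python
from typing import Any, Dict, Optional

# Note text is illustrative; adjust it to the agent's own voice.
SOURCE_NOTES = {
    "caption_auto_generated": "Transcript comes from machine-generated captions and may contain errors.",
    "whisper_local": "Transcript comes from local speech-to-text and is likely noisier than captions.",
}


def source_note(transcript: Dict[str, Any]) -> Optional[str]:
    """Return a quality note for the user, or None for human-authored captions."""
    source = transcript.get("source")
    if source == "caption_manual":
        return None
    # Unknown or missing values are flagged rather than silently trusted.
    return SOURCE_NOTES.get(source, f"Transcript source is {source!r}; quality is unknown.")
```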

Error-Handling Pattern

Use these decision rules in agent sessions:

invalid YouTube URL or video ID: stop and ask for a valid watch URL or video ID
video is private or requires authentication: stop; the default path does not support it
video is age-restricted: stop; treat it as unsupported for the default path
no captions available ... try building with --features po-token: explain that metadata and language listing may still work, but the caption fetch path is blocked
PoToken required but botguard sidecar not available: suggest a po-token build
whisper feature not enabled: suggest a whisper build only if the user needs captionless fallback

If a transcript fetch fails, do not throw away the metadata path. get_metadata and list_languages can still provide useful grounding.
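
These rules can be sketched as a substring lookup. The match fragments are taken from the list above and may not equal the exact failure strings yt-text emits, so check them against the Troubleshooting page; the final fallback for unrecognized messages is an assumption added for this example.

```python
from typing import Tuple

# (lowercase fragment to look for, action, advice) in the order listed above.
ERROR_RULES = [
    ("invalid youtube url", "stop", "Ask for a valid watch URL or video ID."),
    ("private or requires authentication", "stop", "The default path does not support this video."),
    ("age-restricted", "stop", "Treat the video as unsupported for the default path."),
    ("no captions available", "explain", "Metadata and language listing may still work, but the caption fetch path is blocked."),
    ("potoken required", "suggest", "Suggest a po-token build."),
    ("whisper feature not enabled", "suggest", "Suggest a whisper build only if captionless fallback is needed."),
]


def route_error(message: str) -> Tuple[str, str]:
    """Map a failure message to an (action, advice) pair."""
    lowered = message.lower()
    for fragment, action, advice in ERROR_RULES:
        if fragment in lowered:
            return action, advice
    # Assumption: unknown failures are surfaced to the user unchanged.
    return "report", "Unrecognized failure; show the message as-is."
```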

MCP Client Hygiene

When wiring yt-text into an MCP client:

prefer pointing the client directly at the yt-text binary
if the binary is not on PATH, use an absolute path
do not wrap yt-text in a shell script that prints to stdout
restart the client after config changes unless the client explicitly hot-reloads its MCP configuration

stdout is reserved for MCP protocol traffic. Any banner text or wrapper noise can break the session.
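
The following sketch builds a client config entry that follows these rules. The `mcpServers` layout is the shape several MCP clients use, but it is an assumption here, so check the Integrations page for your client. Whether yt-text needs extra arguments to start as an MCP server is not stated on this page, so `args` is left empty as a placeholder.

```python
import shutil
from pathlib import Path
from typing import Any, Dict, Optional


def mcp_server_entry(
    binary_name: str = "yt-text", explicit_path: Optional[str] = None
) -> Dict[str, Any]:
    """Build a config entry that points the client directly at the binary."""
    if explicit_path is not None:
        path = Path(explicit_path)
        # A relative path depends on the client's working directory.
        if not path.is_absolute():
            raise ValueError("use an absolute path when the binary is not on PATH")
        command = str(path)
    elif shutil.which(binary_name):
        # The binary is on PATH, so the bare name is enough.
        command = binary_name
    else:
        raise FileNotFoundError(f"{binary_name} is not on PATH; pass an absolute path")
    # Assumed config shape; "args" is a placeholder, not a documented requirement.
    return {"mcpServers": {"yt-text": {"command": command, "args": []}}}
```

Serializing the result with `json.dumps` gives a fragment to paste into the client's config file. No wrapper script is involved, so nothing other than protocol traffic is written to stdout.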

Prompting Patterns

Useful instructions to give an agent:

"Run get_metadata first. If language choice matters, run list_languages. Then fetch get_transcript in JSON."
"If the transcript source is not caption_manual, tell me that in the answer."
"If the requested language is unavailable, say which language was returned instead of assuming English."
"If transcript fetch fails, still tell me whether captions exist and which languages are listed."

Related pages:

Integrations, for client configuration
Quickstarts, for end-to-end workflows
Sample Outputs, for response shape
Troubleshooting, for concrete failure strings
Tools, for the reference surface