Agent Guide

yt-text works best when an agent treats transcript retrieval as a staged decision instead of a single blind tool call.

This page is the canonical guide for in-session tool ordering, language handling, transcript-source interpretation, and error-handling patterns.

Call the tools in this order unless you have a specific reason to deviate:

  1. get_metadata
  2. list_languages
  3. get_transcript

Why this order works:

  • get_metadata is the lightest call: it validates the URL and returns title, duration, and caption availability
  • list_languages tells you whether strict language selection is realistic before you promise a specific output
  • get_transcript is the heavy path, so it should come after basic grounding
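
The ordering above can be sketched as a small staged function. This is a minimal sketch, not yt-text's own client API: call_tool is a hypothetical callable standing in for however your MCP client invokes a tool, and the url argument name is an assumption. The format and lang arguments are the ones this guide describes.

```python
from typing import Any, Callable, Optional

# Hypothetical signature: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, dict], dict]


def staged_fetch(call_tool: ToolCaller, url: str, lang: Optional[str] = None) -> dict:
    """Run the three tools in the recommended order and collect the results."""
    result: dict[str, Any] = {}

    # Stage 1: cheapest call; validates the URL and grounds later steps.
    # The "url" argument name is an assumption for illustration.
    result["metadata"] = call_tool("get_metadata", {"url": url})

    # Stage 2: only needed when the caller cares about a specific language.
    if lang is not None:
        result["languages"] = call_tool("list_languages", {"url": url})

    # Stage 3: the heavy call comes last, with JSON as the default format.
    args = {"url": url, "format": "json"}
    if lang is not None:
        args["lang"] = lang
    result["transcript"] = call_tool("get_transcript", args)
    return result
```

Skipping list_languages when no language is requested keeps the common case at two calls while preserving the order.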

When To Call Each Tool

get_metadata

Use get_metadata first when you need to:

  • verify that a URL resolves to a public video
  • inspect title, channel, and duration for routing or summarization
  • check whether captions exist before asking for a transcript
  • cheaply validate that the client and MCP server are wired correctly

list_languages

Use list_languages when language choice matters:

  • before promising the user a transcript in a specific language
  • before treating lang=en as guaranteed
  • before deciding whether to ask the user which caption track they want

get_transcript

Use get_transcript once you know you actually want the transcript payload.

For most agent workflows, format=json is the best default because it keeps the language, source, and timestamp fields available for downstream reasoning.

Use other formats when the output target is already known:

  • markdown for direct human review inside a chat
  • srt or vtt for subtitle export
  • text for lightweight plain-text summarization
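
One way to encode the format choice is a lookup that falls back to json. This is a sketch only: the output-target labels on the left are hypothetical names invented for illustration, while the format values are the ones listed above.

```python
from typing import Optional

# Keys are hypothetical labels for where the output is going;
# values are the format names described in this guide.
FORMAT_BY_TARGET = {
    "chat_review": "markdown",
    "subtitle_srt": "srt",
    "subtitle_vtt": "vtt",
    "plain_summary": "text",
}


def choose_format(target: Optional[str] = None) -> str:
    """Pick a transcript format, defaulting to json for agent reasoning."""
    return FORMAT_BY_TARGET.get(target, "json")
```

Defaulting to json means an unknown or unspecified target still keeps the language, source, and timestamp fields available.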

Language Handling

The lang argument is a preference, not a hard contract.

What matters in practice:

  • the returned language field is authoritative
  • lang=en does not guarantee an English transcript
  • if the requested language is missing, yt-text may fall back to another available language instead of hard-failing

If strict language choice matters:

  1. call list_languages first
  2. confirm the target language code is present
  3. inspect the language field in the transcript response before presenting the result as final
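
The strict-language steps can be sketched as two small checks. This assumes the list_languages result has already been reduced to a plain list of language codes and that the transcript response is a dict with a language key; both shapes are assumptions for illustration, not a documented schema.

```python
from typing import Iterable


def language_is_listed(requested: str, listed_codes: Iterable[str]) -> bool:
    """Step 2: confirm the target code appears in the listed languages."""
    # Exact string match; normalize codes first if your data mixes cases.
    return requested in set(listed_codes)


def describe_returned_language(requested: str, transcript: dict) -> str:
    """Step 3: report what actually came back instead of assuming."""
    # Assumes the transcript payload exposes a "language" key.
    returned = transcript.get("language")
    if returned == requested:
        return f"Transcript language: {returned}"
    return f"Requested {requested}, but the transcript came back in {returned}"
```

Running the first check before calling get_transcript lets the agent ask the user about alternatives instead of silently accepting a fallback.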

Transcript Source Handling

The source field tells the agent how the transcript was produced.

  • caption_manual: human-authored captions; usually the highest-confidence path
  • caption_auto_generated: machine-generated captions; usable but more error-prone
  • whisper_local: local speech-to-text fallback; slower and usually noisier than captions

Treat source as part of the answer quality, not just an implementation detail. If the source is not caption_manual, say so in the response when quality or attribution matters.
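
A sketch of how an agent might turn the source field into a caveat for the user. The source values are the three listed above; the note wording and the assumption that the transcript response is a dict with a source key are illustrative.

```python
from typing import Optional

# Caveats for the non-manual sources; the wording is illustrative.
SOURCE_NOTES = {
    "caption_auto_generated": (
        "Transcript comes from machine-generated captions and may contain errors."
    ),
    "whisper_local": (
        "Transcript comes from local speech-to-text and may be noisier than captions."
    ),
}


def source_note(transcript: dict) -> Optional[str]:
    """Return a caveat to show the user, or None for manual captions."""
    source = transcript.get("source")
    if source == "caption_manual":
        return None
    # Unknown or missing sources still get a caveat rather than silence.
    return SOURCE_NOTES.get(source, f"Transcript source is {source!r}; quality is unknown.")
```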

Error-Handling Pattern

Use these decision rules in agent sessions:

  • invalid YouTube URL or video ID: stop and ask for a valid watch URL or video ID
  • video is private or requires authentication: stop; the default path does not support it
  • video is age-restricted: stop; treat it as unsupported for the default path
  • no captions available ... try building with --features po-token: explain that metadata and language listing may still work, but the caption fetch path is blocked
  • PoToken required but botguard sidecar not available: suggest a po-token build
  • whisper feature not enabled: suggest a whisper build only if the user needs captionless fallback

If a transcript fetch fails, do not throw away the metadata path. get_metadata and list_languages can still provide useful grounding.
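
The decision rules above can be sketched as a substring lookup plus a fetch wrapper that keeps the metadata result. Several things here are assumptions: that errors reach the agent as text containing the phrases listed above, that a failed tool call raises an exception (ToolError is a hypothetical class), and the call_tool callable and url argument name. The action labels are invented for illustration.

```python
class ToolError(Exception):
    """Hypothetical error raised by the MCP client when a tool call fails."""


# (phrase from the error text, action label); first match wins.
RULES = [
    ("invalid YouTube URL or video ID", "ask_for_valid_url"),
    ("video is private or requires authentication", "stop_unsupported"),
    ("video is age-restricted", "stop_unsupported"),
    ("no captions available", "explain_caption_fetch_blocked"),
    ("PoToken required", "suggest_po_token_build"),
    ("whisper feature not enabled", "suggest_whisper_build_if_needed"),
]


def classify_error(message: str) -> str:
    """Map an error message to one of the decision rules."""
    lowered = message.lower()
    for phrase, action in RULES:
        if phrase.lower() in lowered:
            return action
    return "report_unknown_error"


def fetch_with_fallback(call_tool, url: str) -> dict:
    """Fetch the transcript but keep metadata if the heavy call fails."""
    metadata = call_tool("get_metadata", {"url": url})
    try:
        transcript = call_tool("get_transcript", {"url": url, "format": "json"})
    except ToolError as exc:
        # Metadata is still useful grounding even without a transcript.
        return {"metadata": metadata, "transcript": None,
                "action": classify_error(str(exc))}
    return {"metadata": metadata, "transcript": transcript, "action": None}
```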

MCP Client Hygiene

When wiring yt-text into an MCP client:

  • prefer pointing the client directly at the yt-text binary
  • if the binary is not on PATH, use an absolute path
  • do not wrap yt-text in a shell script that prints to stdout
  • restart the client after config changes unless the client explicitly hot-reloads its MCP server configuration

stdout is reserved for MCP protocol traffic. Any banner text or wrapper noise can break the session.
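
A sketch of resolving the binary to an absolute path before writing it into a client config. The command/args entry shape is an assumption based on a convention used by several MCP clients; check your client's documentation for the actual schema and file location.

```python
import os
import shutil


def server_entry(binary: str = "yt-text") -> dict:
    """Build a config entry that points directly at the binary."""
    # shutil.which searches PATH, or checks the path itself if one is given.
    resolved = shutil.which(binary)
    if resolved is None:
        raise FileNotFoundError(
            f"{binary} not found; pass an absolute path to the executable"
        )
    # Assumed entry shape: a command to launch plus its arguments.
    # No shell wrapper, so nothing but the server writes to stdout.
    return {"command": os.path.abspath(resolved), "args": []}
```

For example, json.dumps({"yt-text": server_entry()}, indent=2) produces a fragment that can be pasted into a client's server list, assuming that client uses this shape.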

Prompting Patterns

Useful instructions to give an agent:

  • "Run get_metadata first. If language choice matters, run list_languages. Then fetch get_transcript in JSON."
  • "If the transcript source is not caption_manual, tell me that in the answer."
  • "If the requested language is unavailable, say which language was returned instead of assuming English."
  • "If transcript fetch fails, still tell me whether captions exist and which languages are listed."