Summarize a Lecture or Talk

Use this workflow when an agent needs the content of a long video in a form it can summarize, outline, or quote with timestamps.

Goal

Get a transcript that is clean enough for downstream summarization without losing timing and source metadata.

Call the tools in this order:

  1. get_metadata
  2. list_languages
  3. get_transcript

Why This Order

  • get_metadata tells you duration and whether captions exist before you spend effort on transcript extraction.
  • list_languages lets you choose a language intentionally instead of relying on fallback.
  • get_transcript gives the full content in the format best suited to your summarizer.

Walkthrough

1. Check the video first

get_metadata(url: "https://youtube.com/watch?v=VIDEO_ID")

What to look for:

  • duration_ms to estimate summarization cost
  • has_captions to decide whether the normal caption path is available

If has_captions is false, the normal caption path is unavailable, but a binary built with --features whisper may still be able to produce a transcript.
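
The check above can be sketched as a small planning step. This is only an illustration: the input dict reuses the two field names listed above, while the plan_extraction helper, the speaking rate, and the tokens-per-word ratio are assumptions introduced here and are not part of the tool.

```python
def plan_extraction(meta: dict, words_per_minute: int = 150) -> dict:
    """Turn a get_metadata result into a rough plan (hypothetical helper)."""
    minutes = meta["duration_ms"] / 60_000
    # Rough guess only: spoken words times an assumed 1.3 tokens per word.
    est_tokens = round(minutes * words_per_minute * 1.3)
    # Without captions, the only option mentioned in this guide is a whisper build.
    path = "captions" if meta.get("has_captions") else "whisper-or-skip"
    return {"minutes": round(minutes, 1), "est_tokens": est_tokens, "path": path}


plan = plan_extraction({"duration_ms": 3_600_000, "has_captions": True})
print(plan)
```

The estimate is deliberately coarse; its only job is to flag very long videos before any transcript is fetched.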

2. Choose the best available language

list_languages(url: "https://youtube.com/watch?v=VIDEO_ID")

Prefer:

  • manual captions over auto-generated captions
  • the actual language you want the agent to reason over
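
These two preferences can be expressed as a short selection function. The shape of the list_languages result is not documented in this guide, so the code and auto_generated keys below are assumed names for illustration, as is the pick_language helper itself.

```python
from typing import Optional


def pick_language(tracks: list, wanted: str) -> Optional[dict]:
    """Pick the wanted language, preferring manual captions (assumed track shape)."""
    matches = [t for t in tracks if t["code"] == wanted]
    if not matches:
        # Return nothing so the caller makes any fallback choice explicitly.
        return None
    # False sorts before True, so a manual track wins over an auto-generated one.
    return min(matches, key=lambda t: t["auto_generated"])


tracks = [
    {"code": "en", "auto_generated": True},
    {"code": "en", "auto_generated": False},
    {"code": "de", "auto_generated": False},
]
print(pick_language(tracks, "en"))
```

Returning None instead of silently picking another language keeps the choice intentional, which is the point of this step.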

3. Extract the transcript

Good default for summarization:

get_transcript(
  url: "https://youtube.com/watch?v=VIDEO_ID",
  lang: "en",
  format: "markdown"
)

Use markdown when:

  • you want readable timestamp markers
  • the agent is going to summarize directly from the result

Use json when:

  • you want explicit segment objects
  • you plan to chunk, transform, or post-process the transcript before asking a model to summarize it
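
As one example of the chunking case, the sketch below groups segments into batches under a character budget while keeping each segment object intact, so timestamps survive. The per-segment keys start_ms and text are assumptions, since this guide does not spell out the segment schema, and chunk_segments is a hypothetical helper.

```python
def chunk_segments(segments: list, max_chars: int = 4000) -> list:
    """Group segments into chunks whose combined text stays within max_chars."""
    chunks, current, size = [], [], 0
    for seg in segments:
        text_len = len(seg["text"])
        # Start a new chunk when adding this segment would exceed the budget.
        if current and size + text_len > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(seg)
        size += text_len
    if current:
        chunks.append(current)
    return chunks


segments = [{"start_ms": i * 1000, "text": "abc"} for i in range(5)]
chunks = chunk_segments(segments, max_chars=6)
print([len(c) for c in chunks])
```

Each chunk can then be summarized on its own, and the first segment's start time gives a timestamp to cite in the summary.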

What To Expect

  • markdown gives a human-readable transcript with **[m:ss]** timestamps.
  • json gives video_id, title, channel, duration_ms, language, source, and segments.
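
If you start from JSON and want markers in the same style as the markdown output, a small formatter is enough. This sketch assumes segment start times are available in milliseconds and lets the minutes count run past 59; how the tool itself formats videos longer than an hour is not stated here.

```python
def format_timestamp(start_ms: int) -> str:
    """Render a millisecond offset as a **[m:ss]** marker (illustrative helper)."""
    total_seconds = start_ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    # Seconds are zero-padded to two digits; minutes are not padded.
    return f"**[{minutes}:{seconds:02d}]**"


print(format_timestamp(75_000))
```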

See Sample Outputs for concrete examples.

Tips

  • If strict language choice matters, do not skip list_languages.
  • Inspect the language field in the JSON output to confirm which language was actually returned after any fallback.
  • If the transcript fetch fails after metadata and languages succeed, that is often a sign you need the po-token feature.