How do some sites download YouTube captions?

Solution 1:

Send a GET request to:

http://video.google.com/timedtext?lang={LANG}&v={VIDEOID}

Example for the video in your comment: http://video.google.com/timedtext?lang=ko&v=0db1_qWZjRA

Let's look at another of your examples, https://www.youtube.com/watch?v=7068mw-6lmI (and I agree about the differentiation part of your comment).

Multiple subtitle tracks are available for this video:

  • English
  • Korean
  • Spanish
  • Korean (auto-generated), also called asr (automatic speech recognition)

These names are the values of the subtitle name parameter (e.g., name=English).

lang is the language code, optionally followed by a region (e.g., es-MX). In your example: https://www.youtube.com/api/timedtext?lang=es-MX&v=7068mw-6lmI&name=Spanish

If a subtitle track is available, you can request a translation of it with the tlang parameter:

https://www.youtube.com/api/timedtext?lang=en&v=7068mw-6lmI&name=English&tlang=lv
https://www.youtube.com/api/timedtext?lang=ko&v=7068mw-6lmI&name=Korean&tlang=lv
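
The query parameters above can also be assembled in code. This is a minimal sketch, assuming the endpoint still accepts these parameters (it is undocumented and may change or stop responding); build_timedtext_url is a hypothetical helper name, not part of any library:

```python
from urllib.parse import urlencode

# Undocumented endpoint taken from the URLs above; it may change without notice.
TIMEDTEXT_ENDPOINT = "https://www.youtube.com/api/timedtext"


def build_timedtext_url(video_id, lang, name=None, tlang=None):
    """Build a timedtext URL; name and tlang are added only when given."""
    params = {"lang": lang, "v": video_id}
    if name is not None:
        params["name"] = name    # track name, e.g. "English"
    if tlang is not None:
        params["tlang"] = tlang  # target language for translation
    return f"{TIMEDTEXT_ENDPOINT}?{urlencode(params)}"


print(build_timedtext_url("7068mw-6lmI", "en", name="English", tlang="lv"))
```

Leaving name as None produces the form needed for videos whose tracks are all asr.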

My guess is that this is what these sites use, i.e. translation of an available subtitle track (you can confirm this by giving one of these sites a video that has no subtitle track).

As for asr, a signature seems to always be needed, but as long as one of the subtitle tracks is available, you can use that track for translation. E.g., for the example in your original comment:

https://www.youtube.com/api/timedtext?lang=en&v=vx6NCUyg1NE&tlang=lv

The last example appears to be special: both of its subtitle tracks are asr (checked with Chrome -> Inspect -> Network), so you need to omit the name parameter. Unfortunately, this difference is not visible in the YouTube video's settings menu.

Solution 2:

There is an unofficial API used by YouTube:

https://www.youtube.com/api/timedtext?lang={LANG}&v={VIDEO_ID}

LANG here is the ISO 639-1 two-letter language code. For your example it would be:

https://www.youtube.com/api/timedtext?lang=ko&v=0db1_qWZjRA

You can check this in the browser's Network tab while toggling the closed-caption button.
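
Once a response comes back, it has to be parsed. The sketch below assumes the legacy XML layout (a transcript root with text elements carrying start and dur attributes); that layout is an assumption and the endpoint may return something else or nothing at all. parse_transcript is a hypothetical helper name, and the sample string is made up for illustration:

```python
import xml.etree.ElementTree as ET
from html import unescape


def parse_transcript(xml_text):
    """Return (start_seconds, duration_seconds, text) tuples from caption XML."""
    root = ET.fromstring(xml_text)
    cues = []
    for node in root.iter("text"):
        start = float(node.get("start", "0"))
        duration = float(node.get("dur", "0"))
        # Assumption: entities may be escaped twice, so unescape once more
        # after the XML parser has decoded the first level.
        cues.append((start, duration, unescape(node.text or "")))
    return cues


# Made-up sample in the assumed layout; a real response would be fetched
# with e.g. urllib.request.urlopen(url).read().decode("utf-8").
sample = '<transcript><text start="0.5" dur="2.1">Hello &amp;#39;world&amp;#39;</text></transcript>'
print(parse_transcript(sample))
```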

Solution 3:

A 2022 answer:

Option 1: Request the watch page with curl, e.g. curl -L "https://youtu.be/YbJOTdZBX1g", and search the result for timedtext; you will find a URL. Replace \u0026 with & in it and you get the link to the subtitles.
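
The search-and-replace step of Option 1 can be sketched in Python. This assumes the URL appears in the page source as a double-quoted string with & written as \u0026, as described above; extract_timedtext_urls is a hypothetical helper name and the sample page fragment is made up:

```python
import re


def extract_timedtext_urls(page_source):
    """Find URLs containing 'timedtext' and undo the JSON-style escaping."""
    # Match from "http(s):" up to the next double quote or whitespace.
    candidates = re.findall(r'https?:[^"\s]*timedtext[^"\s]*', page_source)
    # Assumption: \u0026 stands for & and \/ (if present) stands for /.
    return [c.replace("\\u0026", "&").replace("\\/", "/") for c in candidates]


# Made-up fragment; a real page would come from the curl command above.
page = r'{"baseUrl":"https://www.youtube.com/api/timedtext?v=YbJOTdZBX1g\u0026lang=en"}'
print(extract_timedtext_urls(page))
```

A real page may list several URLs, one per caption track, so the function returns a list.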

Option 2: Use the yt-dlp package:

# For installing see: https://github.com/yt-dlp/yt-dlp#with-pip
from yt_dlp import YoutubeDL

ydl_opts = {
    "skip_download": True,
    "writesubtitles": True,
    "subtitleslangs": ["all", "-live_chat"],
    # Available formats appear to be: vtt, ttml, srv3, srv2, srv1, json3
    "subtitlesformat": "json3",
    # Optional: seconds to sleep before each subtitle download
    "sleep_interval_subtitles": 1,
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["YbJOTdZBX1g"])
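
The downloaded json3 files still need to be turned into plain text. A minimal sketch, assuming the json3 layout is a top-level events list whose entries carry tStartMs and a segs list of utf8 fragments (this layout is an assumption based on observed files, not a documented format); json3_to_cues is a hypothetical helper name and the sample data is made up:

```python
import json


def json3_to_cues(data):
    """Return (start_seconds, text) tuples from a parsed json3 document."""
    cues = []
    for event in data.get("events", []):
        segs = event.get("segs")
        if not segs:
            continue  # skip events that carry no text
        text = "".join(seg.get("utf8", "") for seg in segs).strip()
        if text:
            cues.append((event.get("tStartMs", 0) / 1000, text))
    return cues


# Made-up sample; for a real file use json.load(open(path, encoding="utf-8")).
sample = json.loads(
    '{"events": [{"tStartMs": 0, "id": 1},'
    ' {"tStartMs": 1500, "dDurationMs": 2000,'
    ' "segs": [{"utf8": "Hello "}, {"utf8": "world"}]}]}'
)
print(json3_to_cues(sample))
```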