How do all of these "Save video from YouTube" services work?
There is a very popular open source command-line downloader called youtube-dl, which does exactly that. It grabs the actual video and audio file links from a given YouTube link, or from any other popular web video site such as Vimeo, Yahoo! Video, uStream, etc.
To see how that's done, look into the YouTube extractor. That's just too much to show here. Other extractors exist for simpler sites.
In order to find the video stream, you'd have to pretend to be the actual browser client trying to load the video. This means you first have to parse the HTML code, load the relevant JavaScript code, and initialize a player object, which plays video through an HTML <video> element.
This means that somewhere in the JavaScript execution, there is initialization code for the player, containing important parameters such as where to actually find the video.
In the simplest case, the video might be present as a URL to some MP4 file, directly in some configuration object. This is very easy to parse by looking at the src attribute of the <video> element. But the URL could also be generated on the fly, with specific download tokens negotiated between the client and some authentication server. The video might also play through a blob URL, in which case you cannot see the underlying source directly, because it is generated via the MediaSource APIs.
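As a minimal sketch of that simplest case, the following Python collects the src attributes of <video> elements (and of <source> children) using only the standard library. The page markup and the example.com URL are made up for illustration. A blob: URL found this way would not be directly downloadable.

```python
from html.parser import HTMLParser


class VideoSrcParser(HTMLParser):
    """Collects src attributes from <video> tags and their <source> children."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._in_video = True
            if attrs.get("src"):
                self.sources.append(attrs["src"])
        elif tag == "source" and self._in_video and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def find_video_sources(html):
    """Return every video source URL found in the given HTML text."""
    parser = VideoSrcParser()
    parser.feed(html)
    return parser.sources


# Hypothetical page, not taken from any real site.
page = (
    '<html><body>'
    '<video src="https://example.com/media/clip.mp4" controls></video>'
    '</body></html>'
)
print(find_video_sources(page))
```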
Often, the JavaScript code itself is obfuscated to make it harder to reverse-engineer, using variable names like xyz rather than player.
Most video websites these days use MPEG-DASH or Apple's HTTP Live Streaming (HLS) behind the scenes. These do not use direct URLs to a video file, but instead work with a so-called "manifest" file. The manifest provides meta-information to get the actual video stream. The manifest file (.mpd for DASH, for example, and .m3u8 for HLS) will contain links to segments of video and audio, which you'd later have to combine to get a playable file.
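To illustrate what "links to segments" means, here is a small sketch that lists the segment URLs from an HLS media playlist. In an .m3u8 file, lines starting with # are tags and the remaining non-empty lines are URIs, which may be relative to the playlist's own URL. The playlist text and the example.com URL are invented for this example. A real site may first serve a master playlist that points to several such media playlists.

```python
from urllib.parse import urljoin


def list_hls_segments(playlist_text, playlist_url):
    """Return absolute segment URLs from an HLS media playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; the rest are URIs.
        if line and not line.startswith("#"):
            segments.append(urljoin(playlist_url, line))
    return segments


# Hypothetical playlist, made up for illustration.
sample_playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:9.0,
seg0.ts
#EXTINF:9.0,
seg1.ts
#EXT-X-ENDLIST
"""

for url in list_hls_segments(sample_playlist, "https://example.com/video/index.m3u8"):
    print(url)
```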
Many websites transmit these manifests from the server to the client player, so if you can inspect the network requests made by the client, you might find a .mpd file, which you can then use to download the video segments from your own client.
However, the manifest could also be transmitted via other side channels, embedded into some JavaScript code, generated on the fly, etc. For youtube-dl, you can see how the code tries to extract the DASH manifest URL from the transmitted configuration information.
There's no general solution for this. It requires careful inspection and debugging of the target site.
Start with a typical video:
https://www.youtube.com/watch?v=XeojXq6ySs4
Using the same ID, construct a URL like this:
https://www.youtube.com/get_video_info?eurl=https://www.youtube.com&video_id=XeojXq6ySs4
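A small Python sketch of this step, using only the standard library. The function names are my own, and the endpoint and parameters are simply the ones shown in the URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def video_id_from_watch_url(watch_url):
    """Return the value of the v parameter of a watch URL."""
    return parse_qs(urlsplit(watch_url).query)["v"][0]


def video_info_url(video_id):
    """Build the get_video_info URL for a video ID, as described above."""
    # safe=":/" keeps the eurl value readable instead of percent-encoding it.
    query = urlencode(
        {"eurl": "https://www.youtube.com", "video_id": video_id},
        safe=":/",
    )
    return "https://www.youtube.com/get_video_info?" + query


video_id = video_id_from_watch_url("https://www.youtube.com/watch?v=XeojXq6ySs4")
print(video_info_url(video_id))
```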
The response will be a query string, like this (edited for readability):
innertube_api_version=v1&
innertube_context_client_version=2.20210504.09.00&
player_response=%7B%22responseContext%22%3A%7B%22serviceTrackingParams%22%3A...
ps=desktop-polymer&
root_ve_type=27240&
Extract the player_response value. This will be a JSON object, like this:
{
"streamingData": {
"adaptiveFormats": [
{
"itag": 137,
"mimeType": "video/mp4; codecs=\"avc1.640020\"",
"bitrate": 570464,
"height": 1080,
"signatureCipher": "s=VZVZOq0QJ8wRgIhANWm3sPF-2hbzQQGrErjQFMNmxTfALco..."
}
]
}
}
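In Python, these two steps (parse the query string, then decode the JSON) might look like the sketch below. The sample response is built locally from a cut-down object so the example runs without network access. Its signatureCipher value and the example.com URL are placeholders, not real data.

```python
import json
from urllib.parse import parse_qs, urlencode


def extract_player_response(body):
    """Parse the query-string body and decode its player_response field."""
    fields = parse_qs(body)
    return json.loads(fields["player_response"][0])


def pick_format(player_response, itag):
    """Return the adaptive format with the given itag, or None if absent."""
    for fmt in player_response["streamingData"]["adaptiveFormats"]:
        if fmt["itag"] == itag:
            return fmt
    return None


# Placeholder data standing in for a real response.
sample = {
    "streamingData": {
        "adaptiveFormats": [
            {
                "itag": 137,
                "height": 1080,
                "signatureCipher": "sp=sig&s=ABCDEF&url=https%3A%2F%2Fexample.com%2Fvideoplayback",
            }
        ]
    }
}
body = urlencode({"ps": "desktop-polymer", "player_response": json.dumps(sample)})

fmt = pick_format(extract_player_response(body), 137)
print(fmt["height"], fmt["signatureCipher"])
```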
Then extract the signatureCipher value. This is a query string, like this:
sp=sig&
s=VZVZOq0QJ8wRgIhANWm3sPF-2hbzQQGrErjQFMNmxTfALcoZkZ4IVR1djIpAiEA8HFKix6d4B3T...&
url=https://r3---sn-q4flrnek.googlevideo.com/videoplayback%3Fexpire%3D16201927...
The url value is the URL to the audio or video. However, before you can access the URL, you must add an entry to its query string. The new key is the value under sp above (sig in this case). The new value is the value under s above (VZVZOq0QJ8wRgIhANWm3sPF-2hbzQQGrErjQFMNmxTfALcoZkZ4IVR1djIpA... in this case).
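Splitting the signatureCipher into its three parts is a plain query-string parse. A sketch, with a placeholder cipher string (the s value and the example.com URL are made up):

```python
from urllib.parse import parse_qs


def parse_signature_cipher(cipher):
    """Split a signatureCipher string into its sp, s and url values."""
    fields = parse_qs(cipher)
    return {
        "param": fields["sp"][0],      # name of the query key to add
        "scrambled": fields["s"][0],   # signature, still to be decoded
        "url": fields["url"][0],       # media URL, percent-decoded
    }


# Placeholder value, not a real cipher.
cipher = "sp=sig&s=ABCDEF&url=https%3A%2F%2Fexample.com%2Fvideoplayback%3Fexpire%3D123"
print(parse_signature_cipher(cipher))
```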
However, before you can add the new entry, you must decode the s value. To decode the value, take the following steps. First, visit the original page:
https://www.youtube.com/watch?v=XeojXq6ySs4
The page source will contain some text like this:
/s/player/3e7e4b43/player_ias.vflset/en_US/base.js
which you can turn into:
https://www.youtube.com/s/player/3e7e4b43/player_ias.vflset/en_US/base.js
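Finding that path can be done with a regular expression. The pattern below is an assumption generalized from the single example path above, so it may need adjusting if the layout of the path differs.

```python
import re
from urllib.parse import urljoin

# Assumed shape: /s/player/<id>/<player name>/<locale>/base.js
PLAYER_PATH = re.compile(r"/s/player/[\w-]+/[\w.-]+/[\w-]+/base\.js")


def find_player_url(page_html, page_url="https://www.youtube.com/"):
    """Return the absolute base.js URL referenced in the page, or None."""
    match = PLAYER_PATH.search(page_html)
    if match is None:
        return None
    return urljoin(page_url, match.group(0))


# Stand-in for the real page source.
html = '<script src="/s/player/3e7e4b43/player_ias.vflset/en_US/base.js"></script>'
print(find_player_url(html))
```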
The file at that URL will contain some code like this:
var uy={an:function(a){a.reverse()},
gN:function(a,b){a.splice(0,b)},
J4:function(a,b){var c=a[0];a[0]=a[b%a.length];a[b%a.length]=c}};
vy=function(a){a=a.split("");uy.gN(a,2);uy.J4(a,47);uy.gN(a,1);uy.an(a,49);
uy.gN(a,2);uy.J4(a,4);uy.an(a,71);uy.J4(a,15);uy.J4(a,40);return a.join("")};
Take the original s value, and run it through this function:
vy('_l_lOq0QJ8wRAIgc-yNc9Z4lSO2CozG4B-W9uC5zeuTATDvqHlnQaHGNmkCICsZJGbEjKDmD...')
The result will look about the same, but scrambled:
AOq0QJ8wRAIgc-ylc9Z4lSO2CozG4B-W9uC5zeuTNTDvqH_nQaHGNmkCICsZJGbEjKDmDSnKg_atTR...
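If you would rather not run the JavaScript, the snippet above is simple enough to port by hand. Here is a Python sketch of that exact snippet. The readable function names are mine, and the steps and their arguments apply only to this particular base.js; a different player file would need its own port.

```python
import string


def reverse(chars):
    """uy.an: reverse the list in place (the JS call's second argument is unused)."""
    chars.reverse()


def drop_front(chars, n):
    """uy.gN: remove the first n items, like a.splice(0, n)."""
    del chars[:n]


def swap_first(chars, n):
    """uy.J4: swap the first item with the item at index n % len(chars)."""
    i = n % len(chars)
    chars[0], chars[i] = chars[i], chars[0]


def decipher(s):
    """Port of vy(): apply the same steps in the same order."""
    chars = list(s)
    drop_front(chars, 2)
    swap_first(chars, 47)
    drop_front(chars, 1)
    reverse(chars)
    drop_front(chars, 2)
    swap_first(chars, 4)
    reverse(chars)
    swap_first(chars, 15)
    swap_first(chars, 40)
    return "".join(chars)


# Made-up input with distinct characters, so each move is easy to follow.
print(decipher(string.ascii_letters + string.digits))
```

Five characters are removed in total, so the result is always five characters shorter than the input.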
Finally you can construct the resulting URL:
https://r3---sn-q4fl6nz7.googlevideo.com/videoplayback?vprv=1&
id=o-AHThxQXyxJ3jfw5EBUJeT0IJLrdQeYpMdCsCImMfbuac&
sig=AOq0QJ8wRAIgc-ylc9Z4lSO2CozG4B-W9uC5zeuTNTDvqH_nQaHGNmkCICsZJGbEjKDmDSnKg_...
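Appending the decoded signature can be sketched like this. The URL and signature are placeholders. Note that parsing and re-encoding the query may change how existing parameters are percent-encoded, which should be equivalent but is not byte-for-byte identical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_signature(url, param, signature):
    """Return url with param=signature appended to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, signature))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder values, not a real media URL or signature.
print(add_signature("https://example.com/videoplayback?expire=123&id=abc", "sig", "AOq0-x_y"))
```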
I have a library and program that perform these steps:
https://pkg.go.dev/github.com/89z/mech/youtube
My answer: since 22 January 2019, you can get caught using these methods, even if you try to bypass the restrictions without linking your user information.
Why? (Since I'm a new user on this platform, I cannot comment, per the rule specified by @Daniel-B.) According to the new YouTube ToS (the German version, as I am in Germany; translated), §6.1 G:
You agree not to use any automated system (including, but not limited to, any robot, spider or offline reader) that accesses the website in such a way that more requests are directed to the YouTube servers within a given period of time than a human can reasonably produce in the same period using a publicly available, unmodified standard web browser;
They can now measure the timing of each request and track whether you are violating this clause. So how can these methods still work, given that your external IP address will be known even if you use a VPN to protect yourself and do not link any user details to the service?