Python selenium get "Developer Tools" →Network→Media logs
Finally I have done it, all by myself, without anybody's help.
The trick is simple: once you know what to do, it isn't hard to achieve.
The responses are in JSON format, so we need the json module.
The structure of the JSON varies, but the first-level keys are fixed. There are always three: level, message, and timestamp.
We need the message key. Its value is a JSON object packed in a string, so we need json.loads to unpack it.
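To make the unpacking step concrete, here is a minimal sketch. The entry below is a hypothetical, heavily trimmed stand-in for one item of the performance log; the URL, the webview value, and the timestamp are made up for illustration, and real entries carry many more fields.

```python
import json

# Hypothetical performance-log entry, trimmed to the fields discussed here.
entry = {
    "level": "INFO",
    "message": json.dumps({
        "message": {
            "method": "Network.responseReceived",
            "params": {
                "response": {
                    "mimeType": "audio/mp4",
                    "url": "https://example.com/track.m4a",
                }
            },
        },
        "webview": "hypothetical-id",
    }),
    "timestamp": 1600000000000,
}

# The outer entry is already a dict, but its "message" value is still a string.
inner = json.loads(entry["message"])

print(sorted(entry))                # ['level', 'message', 'timestamp']
print(inner["message"]["method"])   # Network.responseReceived
```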
The structure of these packed JSON objects varies a lot, but there is always a message key, and a method key inside that message key.
Here we are trying to scrape the addresses of received media files, and long story short, the message→message→method key should equal 'Network.responseReceived'.
If the message→message→method key equals 'Network.responseReceived', then there will always be a message→message→params→response→mimeType key.
That key stores the MIME type of the resource. I will spare you the details: I know .mp4 stands for MPEG-4 (Moving Picture Experts Group) and is usually a video format, but here the media type should be 'audio/mp4'.
If all the above criteria are satisfied, then the address of the media file is the value of the message→message→params→response→url key.
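The criteria above can be sketched as a small filter that works on any list of log entries, without a browser. This is only an illustration: media_urls and fake_entry are hypothetical helper names, and the sample entries and URLs are invented.

```python
import json


def media_urls(entries, mime_type="audio/mp4"):
    """Return the response URLs whose MIME type matches, in log order."""
    urls = []
    for entry in entries:
        # Unpack the JSON string, then step into the inner "message" object.
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        response = event["params"]["response"]
        if response.get("mimeType") == mime_type:
            urls.append(response["url"])
    return urls


def fake_entry(method, mime_type, url):
    """Build a hypothetical log entry shaped like the ones described above."""
    inner = {"message": {"method": method,
                         "params": {"response": {"mimeType": mime_type, "url": url}}}}
    return {"level": "INFO", "message": json.dumps(inner), "timestamp": 0}


entries = [
    fake_entry("Network.responseReceived", "text/html", "https://example.com/page"),
    fake_entry("Network.requestWillBeSent", "audio/mp4", "https://example.com/ignored.m4a"),
    fake_entry("Network.responseReceived", "audio/mp4", "https://example.com/a/track.m4a"),
]

print(media_urls(entries))  # ['https://example.com/a/track.m4a']
```

Using .get for method and mimeType keeps the filter from raising on events that lack those keys, while params→response is only accessed once the method has matched.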
This is the final code:
import json
import os
import random
import sys
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Reuse the local Chrome profile (Windows path).
path = os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data'

options = webdriver.ChromeOptions()
options.add_argument('--disable-gpu')
options.add_argument('--headless')
options.add_argument('--log-level=3')
options.add_argument('--mute-audio')
options.add_argument(f'--user-data-dir={path}')
# Enable the performance log, which records the DevTools Network events.
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

Chrome = webdriver.Chrome(options=options)
wait = WebDriverWait(Chrome, 5)


def getlink(addr):
    Chrome.get(addr)
    # The player lives inside an iframe, so switch into it first.
    iframe = Chrome.find_element(By.XPATH, '//iframe[@id="g_iframe"]')
    Chrome.switch_to.frame(iframe)
    # Wait for the play button to become visible, then click it.
    play = wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play.click()
    # Give the page time to request the media file.
    time.sleep(5)
    logs = Chrome.get_log('performance')
    addresses = []
    for entry in logs:
        log = json.loads(entry['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'audio/mp4':
                addresses.append(log['message']['params']['response']['url'])
    # Only return an address if every match points to the same file name;
    # otherwise the function returns None.
    check = {url.split('/')[-1] for url in addresses}
    if len(check) == 1:
        return random.choice(addresses)


if __name__ == '__main__':
    print(getlink(sys.argv[1]))
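The last step of getlink, which only returns an address when every match shares the same file name, can be looked at in isolation. This is a rough sketch under a few assumptions: single_media_url is a hypothetical name, the URLs are invented, and it returns the first address instead of a random one so that the result is reproducible.

```python
def single_media_url(addresses):
    """Return one address if all of them end in the same file name, else None."""
    # Compare only the last path segment, as the code above does.
    names = {address.split("/")[-1] for address in addresses}
    if len(names) == 1:
        return addresses[0]
    return None


# Same file served from two hypothetical hosts: treated as one media file.
same = ["https://cdn1.example.com/a/track.m4a",
        "https://cdn2.example.com/b/track.m4a"]
# Two different file names: ambiguous, so nothing is returned.
mixed = ["https://cdn1.example.com/a/track.m4a",
         "https://cdn1.example.com/a/other.m4a"]

print(single_media_url(same))   # https://cdn1.example.com/a/track.m4a
print(single_media_url(mixed))  # None
print(single_media_url([]))     # None
```

Note that any query string is part of the last path segment here, so two addresses for the same file with different query parameters would count as different names.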