How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?
I want to get the content from this website.
If I use a browser like Firefox or Chrome, I can get the real website page I want, but if I use the Python requests package (or the wget
command) to get it, it returns a totally different HTML page.
I think the developer of the website has put some blocks in place for this.
Question
How do I fake a browser visit using Python requests or the wget command?
Solution 1:
Provide a User-Agent
header:
import requests
url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
FYI, here is a list of User-Agent strings for different browsers:
- List of all Browsers
As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:
fake-useragent
Up to date simple useragent faker with real world database
Demo:
>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
Solution 2:
I used fake-useragent.
How to use:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)
Output:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
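Note that printing the Response object only shows the status line; use htmlContent.text (or htmlContent.content) to see the HTML that was actually returned.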
Solution 3:
Try this, using Firefox as the fake user agent (moreover, it's a good starter script for web scraping with cookies):
#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4

import cookielib, urllib2, sys

def doIt(uri):
    # Keep cookies across requests
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # Set the fake User-Agent on the opener before making the request
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    page = opener.open(uri)
    print page.read()

for i in sys.argv[1:]:
    doIt(i)
USAGE:
python script.py "http://www.ichangtou.com/#company:data_000008.html"
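A rough Python 3 sketch of the same idea, using a requests.Session so cookies persist between calls and the fake User-Agent is sent with every request (the minimal 'Mozilla/5.0' string is kept as a placeholder):
#!/usr/bin/env python3
# Rough Python 3 sketch: requests.Session keeps cookies between calls,
# and the fake User-Agent is sent with every request.
import sys
import requests

def do_it(uri):
    session = requests.Session()
    # Placeholder User-Agent, same as in the urllib2 version above
    session.headers['User-Agent'] = 'Mozilla/5.0'
    response = session.get(uri)
    print(response.text)

for uri in sys.argv[1:]:
    do_it(uri)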
Solution 4:
The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found is that I can often get all of the information I want from a website as JSON before it is interpreted by JavaScript. This has saved me a ton of time compared with parsing HTML and hoping every web page is in the same format.
So when you get a response from a website using requests, really look at the HTML/text, because you might find the JavaScript's JSON in the footer, ready to be parsed.
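As a minimal sketch of that idea (the URL and the window.__DATA__ pattern below are hypothetical placeholders; inspect response.text to see how your target page actually embeds its data):
# Minimal sketch: pull JSON that a page embeds in a <script> tag.
# The URL and the "window.__DATA__ = {...};" pattern are hypothetical.
import json
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://example.com/some-page', headers=headers)

# Look for the embedded JSON blob in the raw HTML
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', response.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data)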
Solution 5:
Answer
You need to create a header with a properly formatted User-Agent string; it serves to identify the client to the server.
You can check your own user agent Here.
Example
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0
Third-party package: user_agent 0.1.9
I found this module very simple to use; it generates a random User-Agent string in one line of code.
from user_agent import generate_user_agent, generate_navigator, generate_navigator_js
from pprint import pprint
print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'
print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'
pprint(generate_navigator())
# {'app_code_name': 'Mozilla',
# 'app_name': 'Netscape',
# 'appversion': '5.0',
# 'name': 'firefox',
# 'os': 'linux',
# 'oscpu': 'Linux i686 on x86_64',
# 'platform': 'Linux i686 on x86_64',
# 'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
# 'version': '41.0'}
pprint(generate_navigator_js())
# {'appCodeName': 'Mozilla',
# 'appName': 'Netscape',
# 'appVersion': '38.0',
# 'platform': 'MacIntel',
# 'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}
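Plugging the generated string into requests is then a one-liner (a minimal sketch; the URL is just a placeholder):
# Minimal sketch: use a freshly generated User-Agent with requests.
# The URL is a placeholder.
import requests
from user_agent import generate_user_agent

headers = {'User-Agent': generate_user_agent()}
response = requests.get('http://example.com', headers=headers)
print(response.status_code)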