How to scrape a website that requires login using Python and BeautifulSoup?

Solution 1:

You can use mechanize:

import mechanize
import http.cookiejar  # this was cookielib in Python 2
from bs4 import BeautifulSoup

cj = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

# Fill in the first form on the page with your credentials
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

# The response after login can go straight into BeautifulSoup
soup = BeautifulSoup(br.response().read(), 'html.parser')
print(soup)

Or urllib - Login to website using urllib2
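With the standard library alone, the same idea looks roughly like this (a sketch: the login URL and the form field names are placeholders you'd take from the site's actual login form):

```python
import urllib.request
import urllib.parse
import http.cookiejar

# A cookie-aware opener keeps whatever session cookie the site sets at login.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Encode the login form fields as a POST body (placeholder names/values).
data = urllib.parse.urlencode({'username': 'username',
                               'password': 'password'}).encode()

# response = opener.open('http://example.com/login', data)
# html = response.read()  # the logged-in page, ready for BeautifulSoup
```

Subsequent requests made through the same `opener` will send the session cookie back automatically.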

Solution 2:

There is a simpler way, from my pov, that gets you there without selenium or mechanize or other third-party tools, albeit it is semi-automated.

Basically, when you log in to a site in the normal way, you identify yourself in a unique way using your credentials, and that same identity is then used for every subsequent interaction; it is stored in cookies and headers for a brief period of time.

What you need to do is send those same cookies and headers with your HTTP requests, and you'll be in.

To replicate that, follow these steps:

  1. In your browser, open the developer tools
  2. Go to the site, and login
  3. After the login, go to the network tab, and then refresh the page
    At this point, you should see a list of requests, the top one being the actual site - and that will be our focus, because it contains the data with the identity we can use for Python and BeautifulSoup to scrape it
  4. Right click the site request (the top one), hover over copy, and then copy as cURL

  5. Then go to this site which converts cURL into python requests: https://curl.trillworks.com/
  6. Take the python code and use the generated cookies and headers to proceed with the scraping
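The steps above boil down to something like this (the cookie and header values here are hypothetical; in practice you'd paste in whatever the cURL-to-requests converter produced from your copied request):

```python
import requests

# Hypothetical values -- replace with the cookies and headers the
# converter generated from your browser's copied cURL command.
cookies = {'sessionid': 'your-session-cookie-value'}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'http://example.com/',
}

# A Session attaches these to every request it makes.
s = requests.Session()
s.cookies.update(cookies)
s.headers.update(headers)

# Every request through this session now carries your logged-in identity:
# page = s.get('http://example.com/account')
# soup = BeautifulSoup(page.text, 'html.parser')
```

Keep in mind the session cookie eventually expires, so the copied values only work for as long as your browser login would.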

Solution 3:

If you go for selenium, then you can do something like below:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Use one of these, depending on the browser you want to drive:
driver = webdriver.Chrome()
# driver = webdriver.Firefox()

# Navigate to the login page first (placeholder URL)
driver.get("http://example.com/login")

username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element(By.ID, "submit_btn").click()

However, if you're adamant that you're only going to use BeautifulSoup, you can do that with a library like requests or urllib. Basically all you have to do is POST the login data as a payload to the login URL, then reuse the same session for the pages you want to scrape.

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # POST through the session so it keeps the login cookies
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)

Solution 4:

You can use selenium to log in and retrieve the page source, which you can then pass to Beautiful Soup to extract the data you want.
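A minimal sketch of that hand-off: after logging in with selenium as in Solution 3, `driver.page_source` holds the rendered HTML, and Beautiful Soup parses it like any other string (the HTML below is a stand-in for what `page_source` would return):

```python
from bs4 import BeautifulSoup

# Hypothetical: in practice `html` would be driver.page_source, captured
# after logging in with selenium as shown in Solution 3.
html = ('<html><head><title>Dashboard</title></head>'
        '<body><h1>Welcome back</h1></body></html>')

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)
print(soup.find('h1').text)
```

Remember to call `driver.quit()` once you have the page source, since Beautiful Soup no longer needs the browser.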