Shortcomings of Newspaper3k: How to Scrape ONLY Article HTML? Python

I didn't have much of a problem scraping wellness-spain.com with BeautifulSoup. The website doesn't rely heavily on JavaScript. Content rendered by JavaScript can cause problems for HTML parsers like BeautifulSoup, which only see the HTML the server sends. So before scraping a website, it is worth turning off JavaScript in your browser to see what output you get.
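The same check can be approximated in code by testing whether a selector matches anything in the raw, unrendered HTML. This is only a minimal sketch: the helper name selector_in_static_html and both HTML snippets are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup


def selector_in_static_html(html_text, css_selector):
    """Hypothetical helper: True if the selector matches the unrendered HTML."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.select_one(css_selector) is not None


# Made-up page whose content is already present in the server's HTML
static_page = '<h1 class="header-title"><span>Title</span></h1>'
# Made-up page whose content would only appear after JavaScript runs
js_page = '<div id="app"></div><script>/* rendered client-side */</script>'

found_static = selector_in_static_html(static_page, "h1.header-title > span")
found_js = selector_in_static_html(js_page, "h1.header-title > span")
```

If the selector is missing from the HTML that requests returns but visible in your browser, the content is probably being added by JavaScript.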

You didn't specify what data you needed from that website, so I took an educated guess.

Coding Example

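import requests
from bs4 import BeautifulSoup

url = 'http://www.wellness-spain.com/-/estres-produce-acidez-en-el-organismo-principal-causa-de-enfermedades#:~:text=Con%20respecto%20al%20factor%20emocional,produce%20acidez%20en%20el%20organismo'
# Fetch the page; despite its name, html holds the whole HTTP response
html = requests.get(url)
# Parse the decoded response body with Python's built-in HTML parser
soup = BeautifulSoup(html.text, 'html.parser')
# Title: the span that is a direct child of h1.header-title
title = soup.select_one('h1.header-title > span').get_text().strip()
# Subtitle: the h2 that is a direct child of div.journal-content-article
sub_title = soup.select_one('div.journal-content-article > h2').get_text()
# Author: the text reads "autor: NAME", so keep the part after the colon
author = soup.select_one('div.author > p').get_text().split(':')[1].strip()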

Explanation of Code

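We use the get method of requests to fetch the HTTP response. We pass the response body to BeautifulSoup as html.text, which is the body decoded to a string. You will often see html.content used instead; that is the raw bytes of the response. BeautifulSoup accepts bytes as well and will then work out the encoding itself, but the decoded text is what we use here. 'html.parser' names the parser BeautifulSoup uses to parse the HTML, in this case the one built into Python.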
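To make the difference between a decoded string and raw bytes concrete, here is a small sketch that skips the network entirely. The bytes below are a made-up stand-in for a response body, not the real page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the raw bytes of a response body (like html.content)
raw_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><h1>Estrés</h1></body></html>"
).encode("utf-8")

# The same body decoded to a string (like html.text)
decoded_text = raw_bytes.decode("utf-8")

# BeautifulSoup takes either form; with bytes it detects the encoding itself
soup_from_text = BeautifulSoup(decoded_text, "html.parser")
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")

heading_from_text = soup_from_text.h1.get_text()
heading_from_bytes = soup_from_bytes.h1.get_text()
```

Both headings come out the same here because the sample declares its encoding. On a page with a wrong or missing encoding declaration the two routes can disagree, so it is worth checking accented characters in your output.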

We then use CSS selectors to choose the data we want. In the title variable we use select_one, which returns only the first element that matches, because a CSS selector can match a whole list of HTML tags. If you don't know about CSS selectors, here are some resources.

Video
Article

Essentially, in the title variable we specify the HTML tag, and the . signifies a class name, so h1.header-title will grab the h1 tag with class header-title. The > restricts the match to direct children of the h1, and in this case we want the span element that is a child of the h1.
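A short sketch of how the > child combinator and select_one behave, using made-up markup that only mimics the structure described above (the nested em and second span are invented to show the difference):

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page's HTML is not reproduced here
sample_html = """
<h1 class="header-title">
  <span> Main title </span>
  <em><span>nested span</span></em>
</h1>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# "> span" matches only spans that are direct children of the h1
direct_children = soup.select("h1.header-title > span")
# A plain space matches spans at any depth inside the h1
all_descendants = soup.select("h1.header-title span")

# select_one returns the first match, or None when nothing matches
first_match = soup.select_one("h1.header-title > span")
title = first_match.get_text().strip()
no_match = soup.select_one("h2.header-title")
```

Because select_one returns None when nothing matches, calling get_text() on the result raises an AttributeError if the selector is wrong or the page layout changes.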

Also in the title variable, the get_text() method grabs the text from the HTML tag. We then use the string strip method to remove the surrounding whitespace.

Similarly, for the sub_title variable we grab the div element with class name journal-content-article, take its direct child h2 tag and grab its text.

For the author variable, we select the div with class name author and get its direct child p tag. We grab the text, but the underlying text reads autor: NAME, so we use the string split method to split it into a list of two elements, autor and NAME. We then select the second element of that list and use the string strip method to remove any whitespace from it.
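Indexing with [1] after split raises an IndexError if the text has no colon, and select_one returns None if the tag is missing. One possible way to guard against both is sketched below. The helper extract_author and the sample markup are hypothetical, and str.partition is swapped in for split here as an alternative, not as the approach used above.

```python
from bs4 import BeautifulSoup


def extract_author(soup):
    """Hypothetical helper: return the author name, or None if the tag is missing."""
    tag = soup.select_one("div.author > p")
    if tag is None:
        return None
    # partition splits on the first colon only and never raises
    label, separator, name = tag.get_text().partition(":")
    # Fall back to the whole text when there is no "autor:" prefix
    return name.strip() if separator else label.strip()


# Made-up markup that mimics the "autor: NAME" pattern
sample = BeautifulSoup(
    '<div class="author"><p>Autor: Jane Doe </p></div>', "html.parser"
)
author = extract_author(sample)
missing = extract_author(BeautifulSoup("<p>no author here</p>", "html.parser"))
```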

If you're having problems scraping specific websites, it is best to ask a new question and show us the code you've tried and what your specific data needs are. Try to be as explicit as possible. Including the URL helps us get your scraper working.
