Scraping with BeautifulSoup to CSV

Solution 1:

You can first work out the number of overarching "sections" (or listings, as I call them) by locating the manufacturer headings, which I do with section:has([data-widget_type="heading.default"]). Then loop over those listings and extract the manufacturer from each one. Use find_next to move to the sections that follow, which contain the model (in an h3) and the table. All of the data appears to be present on that single page if you scroll down to the bottom.
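As a rough illustration of that approach, here is a minimal sketch run against a small, made-up HTML fragment rather than the live page. The markup and the manufacturer and model names are assumptions chosen only to imitate the layout described above; the real page is more deeply nested.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading section followed by a section holding
# the model heading and the table, repeated per listing.
html = """
<section><div data-widget_type="heading.default"><h2>Cessna</h2></div></section>
<section><h3>Skyhawk</h3><table><tr><td>x</td></tr></table></section>
<section><div data-widget_type="heading.default"><h2>Piper</h2></div></section>
<section><h3>Archer</h3><table><tr><td>y</td></tr></table></section>
"""
soup = BeautifulSoup(html, "html.parser")

# Only sections that contain a heading widget count as listings
listings = soup.select('section:has([data-widget_type="heading.default"])')

pairs = []
for listing in listings:
    manufacturer = listing.select_one('[data-widget_type="heading.default"] h2').text
    # find_next walks forward in document order to the next matching tag
    model = listing.find_next("h3").text
    pairs.append((manufacturer, model))

print(pairs)
```

Because find_next searches forward through the document rather than only inside the listing, it reaches the h3 even though it lives in a different section.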

With respect to headers:

td:not([colspan]) strong

The :not([colspan]) excludes the last row ("Back to Top") of each listing's table. That row is a "merged cell" with a colspan attribute and doesn't contain data you want. You could also have used an nth-child range selector. The first (leftmost, as you view the page) and third table columns hold the headers, and I read these only from the first listing, having first checked that the same headers are present in all tables. The space before strong is a descendant combinator: it selects the strong elements inside the td cells, which are present in the 1st and 3rd td children of each table row.
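To see the effect of :not([colspan]), here is a small sketch against a made-up table. The cell contents are invented for illustration and are not taken from the page; only the row layout (label, value, label, value, then a merged last row) follows the description above.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same shape as described: two label/value
# pairs per row, and a merged "Back to Top" row at the end.
html = """
<table>
  <tr><td><strong>Price</strong></td><td>$400,000</td><td><strong>Seats</strong></td><td>4</td></tr>
  <tr><td><strong>Engine</strong></td><td>Lycoming</td><td><strong>Range</strong></td><td>640 nm</td></tr>
  <tr><td colspan="4"><strong>Back to Top</strong></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").select_one("table")

# With the colspan filter: only the real header labels
headers = [el.text for el in table.select("td:not([colspan]) strong")]

# Without the filter: the merged cell's text leaks in
unfiltered = [el.text for el in table.select("td strong")]

print(headers)
print(unfiltered)
```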

With respect to the row values in the CSV after the headers:

td:not([colspan]):nth-child(even)

The first part is as per the headers explanation. However, instead of adding a descendant combinator with a strong type selector, I simply used nth-child(even). This selects the 2nd and 4th columns, as desired, because these are the even-numbered children of each row.
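A minimal sketch of the value selector, again on a made-up table (the cell contents are assumptions for illustration only). It also pairs each value with its header to show that the two selectors return cells in the same order.

```python
from bs4 import BeautifulSoup

# Hypothetical table: labels in columns 1 and 3, values in columns 2 and 4
html = """
<table>
  <tr><td><strong>Price</strong></td><td>$400,000</td><td><strong>Seats</strong></td><td>4</td></tr>
  <tr><td><strong>Engine</strong></td><td>Lycoming</td><td><strong>Range</strong></td><td>640 nm</td></tr>
  <tr><td colspan="4"><strong>Back to Top</strong></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").select_one("table")

headers = [el.text for el in table.select("td:not([colspan]) strong")]

# nth-child(even) keeps the 2nd and 4th td of each row
values = [el.text for el in table.select("td:not([colspan]):nth-child(even)")]

# Both lists are in document order, so they line up one to one
record = dict(zip(headers, values))
print(record)
```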

import csv

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.flyingmag.com/2019-buyers-single-engine-piston')
soup = bs(r.content, 'lxml')

# One "listing" per section that contains a heading widget
listings = soup.select('section:has([data-widget_type="heading.default"])')

with open('flyingmag.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)

    for num, listing in enumerate(listings):
        manufacturer = listing.select_one('[data-widget_type="heading.default"] h2').text
        # The model heading and the table sit in the sections that follow
        model = listing.find_next('h3').text
        table = listing.find_next('table')

        if num == 0:
            # Write the header row once, using the first listing's table
            row = ['Manufacturer', 'Model']
            row.extend([i.text for i in table.select('td:not([colspan]) strong')])
            writer.writerow(row)

        # Even-numbered cells (2nd and 4th columns) hold the values
        values = [i.text for i in table.select('td:not([colspan]):nth-child(even)')]
        row = [manufacturer, model]
        row.extend(values)
        writer.writerow(row)
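
The header-once-then-rows pattern used above can be tried in isolation. This is only a sketch: it writes to an in-memory buffer instead of flyingmag.csv, and the records and headers are made-up stand-ins for the scraped data.

```python
import csv
import io

# Hypothetical stand-ins for what the scraper would collect
headers = ["Price", "Seats"]
records = [
    ("Cessna", "Skyhawk", ["$400,000", "4"]),
    ("Piper", "Archer", ["$350,000", "4"]),
]

# newline="" stops the buffer from translating the writer's line endings
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL)

for num, (manufacturer, model, values) in enumerate(records):
    if num == 0:
        # Header row is written once, before the first data row
        writer.writerow(["Manufacturer", "Model", *headers])
    writer.writerow([manufacturer, model, *values])

output = buffer.getvalue()
print(output)
```

With QUOTE_MINIMAL, only fields that need it (here, the prices containing a comma) are wrapped in quotes.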
