Beautiful Soup Can't Find Tags
I am currently trying to practice with the requests and BeautifulSoup Modules in Python 3.6 and have run into an issue that I can't seem to find any info on in other questions and answers.
It seems that at some point in the page, Beuatiful Soup stops recognizing tags and Ids. I am trying to pull Play-by-play data from a page like this:
http://www.pro-football-reference.com/boxscores/201609080den.htm
import requests, bs4
source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
raise Exception('No data found for this link: '+source_url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))
#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))
Using the inspector in Chrome, I can see that the table definitely exists. I have also tried to use it on 'div's and 'tr's in the later half of the HTML and it doesn't seem to work. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.
Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.
If it is something with the HTML, is there any better tool/library for helping to extract this info out there?
Thank you for your help, BF
BS4 won't be able to execute the javascript of a web page after doing the GET request for a URL. I think that the table of concern is loaded async from client-side javascript.
As a result, the client-side javascript will need to run first before scraping the HTML. This post describes how to do so!