Reading dynamically generated web pages using python

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes. What I see in the source code is:

<div id="cntnt"></div>

But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.

I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !

You need JavaScript Engine to parse and run JavaScript code inside the page. There are a bunch of headless browsers that can help you

http://code.google.com/p/spynner/

http://phantomjs.org/

http://zombie.labnotes.org/

http://github.com/ryanpetrello/python-zombie

http://jeanphix.me/Ghost.py/

http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer

Reading dynamically generated web pages using python

Related

Recent Posts