Web scraping dynamic content with Python

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

If you run the following query in the Chrome console, you'll see it returns everything you want.

document.getElementsByClassName('inline-text-org');

Returns

[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>,
 <div class="inline-text-org" title="University of California Irvine">University of California ...</div>
  etc...

You can run JavaScript from Python against a real DOM using ghost.py.

This is really cool:

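from ghost import Ghost

# Written against the early ghost.py API; later releases open pages
# through a session obtained from ghost.start().
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')

# Evaluate the same expression that worked in the browser console.
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")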
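
The console output above shows that the visible text can be truncated ("University of California ...") while the title attribute holds the full name, so the titles are probably what you want to keep. Here is a minimal sketch of pulling them out of rendered HTML with only the standard library. The class name OrgTitleParser is made up for illustration, and the sample markup is copied from the console output rather than fetched from the live page; in practice you would feed in whatever rendered page source your headless browser gives you.

```python
from html.parser import HTMLParser


class OrgTitleParser(HTMLParser):
    """Collect the title attribute of every element carrying a given class."""

    def __init__(self, class_name="inline-text-org"):
        super().__init__()
        self.class_name = class_name
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # An element can carry several space-separated classes.
        classes = (attr_map.get("class") or "").split()
        if self.class_name in classes and attr_map.get("title"):
            self.titles.append(attr_map["title"])


# Sample markup copied from the console output (an assumption for this sketch);
# a real run would use the rendered page source instead.
rendered_html = """
<div class="inline-text-org" title="University of Manchester">University of Manchester</div>
<div class="inline-text-org" title="University of California Irvine">University of California ...</div>
"""

parser = OrgTitleParser()
parser.feed(rendered_html)
print(parser.titles)
```

Reading the title attribute instead of the element text sidesteps the truncation in the second entry.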

A very similar question was asked earlier here. The answer quoted there uses Selenium, originally a testing tool for web apps.
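
For what it's worth, here is a rough sketch of how the Selenium route might look for this page. It assumes Selenium 4 and a local Chrome install; the helper names make_headless_chrome and org_titles are my own, not part of Selenium. The locator string "class name" is the value behind Selenium's By.CLASS_NAME, passed literally so the lookup helper has no import of its own.

```python
def make_headless_chrome():
    """Start a headless Chrome session (assumes Selenium 4 and Chrome are installed)."""
    # Imported here so org_titles below can be used with any driver-like object.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def org_titles(driver, url, class_name="inline-text-org", wait_seconds=10):
    """Load url and return the title attribute of each element with class_name."""
    # The implicit wait lets the lookup retry while JavaScript is still
    # filling in the page, up to wait_seconds.
    driver.implicitly_wait(wait_seconds)
    driver.get(url)
    elements = driver.find_elements("class name", class_name)
    return [element.get_attribute("title") for element in elements]


# Usage (needs Selenium and Chrome installed):
#   driver = make_headless_chrome()
#   try:
#       print(org_titles(driver, "http://academic.research.microsoft.com/Search?query=lander"))
#   finally:
#       driver.quit()
```

Passing the driver in as an argument keeps the lookup separate from browser start-up, so the same function works with Firefox or a remote driver.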

I usually use Chrome's Developer Tools, which IMHO already give even more detail than Firefox's.


For scraping dynamic content, you need not a simple scraper but a full-fledged headless browser.

dhamaniasad/HeadlessBrowsers ("A list of (almost) all headless web browsers in existence") is the fullest list of these that I've seen; it also shows which languages each browser has bindings for.

(Note that more than a few of the listed projects are abandoned!)