Batch download webpages as displayed by browser
Wget doesn't work because the page is dynamic: no matter what options I pass, it won't download some of the text that is shown in Firefox.
I have Googled quite a bit, but all the solutions I have found are quite cumbersome, like writing a script that sends Firefox-specific keystrokes. Some of those answers are old, though, so I wonder if something better exists now.
All I need from the pages is the text; I don't need any images.
Solution 1:
"I wonder if something better exists now."

Based on personal experience, I would say that is unlikely.
For pages whose content is rendered by JavaScript only when visible (which is what it sounds like you're describing), the best solution I have come across is Python running Selenium (installable via pip from PyPI), controlling e.g. Ungoogled Chromium (Windows builds are available).
Note that this still requires (at least some) scripting for Python/Selenium to control e.g. Ungoogled Chromium.
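As a rough illustration, here is a minimal sketch of that setup. The browser binary path, URL list, and output file names are assumptions you would replace with your own:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
# Assumption: path to the Ungoogled Chromium executable -- adjust for your install.
options.binary_location = r"C:\ungoogled-chromium\chrome.exe"

driver = webdriver.Chrome(options=options)

# Assumption: the list of pages you want to batch-download.
urls = ["https://example.com/page1", "https://example.com/page2"]

for i, url in enumerate(urls):
    driver.get(url)
    # .text returns only the rendered, visible text -- no markup, no images.
    text = driver.find_element(By.TAG_NAME, "body").text
    with open(f"page_{i}.txt", "w", encoding="utf-8") as f:
        f.write(text)

driver.quit()
```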
Also note that for JavaScript rendered only when visible, you will probably have to execute some JavaScript via Selenium to scroll the web page. It's also worth noting that such elements may not be rendered when a modern browser (Chrome/Firefox) runs in "headless" mode (i.e. without a GUI), so you may have to watch your web browser browse those pages, unfortunately.
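For the scrolling part, something along these lines usually works. This is only a sketch: the pause length and round limit are arbitrary assumptions you would tune per site.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing, so lazy content renders."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):  # max_rounds is an arbitrary safety cap
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we have reached the bottom
        last_height = new_height
```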
I would also suggest looking into Beautiful Soup with lxml for parsing HTML (both installable via pip from PyPI). You can get web page text directly via Selenium, but in some cases, saving the page for parsing later may be simpler.
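For instance, a sketch of that save-then-parse approach, assuming the `driver` from the earlier sketch is still open on a page:

```python
from bs4 import BeautifulSoup

# page_source is the rendered HTML, i.e. after JavaScript has run.
html = driver.page_source
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)  # save for later re-parsing without revisiting the page

soup = BeautifulSoup(html, "lxml")
# get_text() strips all tags, leaving just the text content.
print(soup.get_text(separator="\n", strip=True))
```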