How to convert a webpage to PDF while preserving its look (exactly as in the web browser) and its text/links?

I'm looking for a way to convert a webpage to PDF while preserving the webpage's look and keeping its text selectable and searchable. [Generating an image screenshot of the webpage would make the text neither selectable nor searchable.]

I want to print the webpage to PDF as is (as it appears in the web browser), without any change to style or alignment and without losing any of the webpage's static components.

This would help preserve offline copies of webpages that are easily readable, annotatable and searchable.


You don't need to read anything below to understand my question (the question is just the section above). The following section only lists what I've found through research or others' answers, in a nested way, on the path to an answer.

Research Outcomes (Suggestions that didn't solve my problem)

Outcomes so far in trying to find a solution (none of them solves this question yet)

I've tried these PDF web printing engines, but all of them alter the pages' look, some even damaging pages and making them hardly readable: (Example page screenshots are included in square brackets)

  • Chrome [Original, Print Styles (Disabled | not Disabled)]
  • Firefox [Original, Print Styles (Disabled p1,p2 | not Disabled p1,p2)]
  • Readability
    • It simplifies the webpage (which is good for focused reading; however, this isn't what I'm looking for). I want to keep all of the webpage's position/style properties as seen in the web browser, in PDF format, without any manipulation.
  • Foxit Reader
  • NovaPDF
  • CutyCapt [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
    • I'll add links after I solve the program's running issues on Windows
  • wkhtmltopdf [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
    • It doesn't support CSS3.

All webpage screenshot image capturing plugins (e.g. Abduction, Awesome Screenshot, Fireshot, Firefox Screenshot Developer Tool, Full Page Screen Capture, Page2Images, web-capture, ...) don't answer my question, because they don't preserve text and links.

Scrible is great at preserving webpages as is for further annotation and research, but unfortunately it works online only and doesn't convert to PDF format.

There are two other questions on the community that are somewhat similar to mine; however, this one differs in an important way:

  • How to get WYSIWYP (print what you see) in a web browser?
    • This question asks about a way to capture a webpage (as seen on screen) by any means, even as an image where text won't be preserved. I, on the other hand, need to capture text and links as well (preserving text and links is essential).

More Similar questions where preserving text and links isn't a requirement (pages are captured as image screenshots mostly):

  • How to Take Screenshots /Save a web page as PDF
  • Print From Browser Using Screen CSS?
    • It asks about disabling print styles, which, judging from the screenshots above, doesn't seem to help.

Notes

OS: Windows 10


We faced the same problem in a University project and were able to solve it using

wkhtmltopdf

We quite enjoyed the capabilities of this tool on the command line. We also called it from Python code to render the current state of webpages. It can deliver the webpage as a PDF, which usually does not preserve the website's view perfectly because of the page formatting (A4, for example), or as a PNG (which preserves the view of the page but not the links).

There is also the readability project (for Python: pypi.python.org/pypi/readability-lxml), which we used and which does ad removal and content detection quite well (e.g. for newspaper articles and the like). If you just want an add-on or extension for your browser, the following readability implementation might satisfy your need:
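Calling wkhtmltopdf from Python can be as simple as building the command line and handing it to subprocess. The following is only a minimal sketch, not the exact code from our project: the function names, the default page size and the two-second JavaScript delay are assumptions chosen for illustration, and it assumes the wkhtmltopdf binary is on your PATH.

```python
import shutil
import subprocess


def build_wkhtmltopdf_command(url, output_pdf, page_size="A4", js_delay_ms=2000):
    """Return the argument list for one wkhtmltopdf run.

    page_size and js_delay_ms are illustrative defaults (assumptions),
    not values recommended by the tool itself.
    """
    return [
        "wkhtmltopdf",
        "--page-size", page_size,                # paper format, e.g. A4 or Letter
        "--javascript-delay", str(js_delay_ms),  # wait for scripts, in milliseconds
        url,
        output_pdf,
    ]


def render_pdf(url, output_pdf, **options):
    """Run wkhtmltopdf; raises if the binary is missing or the run fails."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf not found on PATH")
    subprocess.run(build_wkhtmltopdf_command(url, output_pdf, **options), check=True)
```

A call such as render_pdf("https://example.com", "page.pdf") would then write the PDF next to your script. Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.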

Offline now: https://www.readability.com/addons/

WaybackMachine Link: https://web.archive.org/web/20160308192045/https://readability.com/addons


I really struggled with this and tried most of the tools mentioned so far. The best results I got were with Chrome's headless mode. The command on macOS would look like this:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --print-to-pdf=test.pdf http://127.0.0.1:8080

The best list of command line options I found was here.

However, there were problems with that. Specifically, my pages are very JavaScript-heavy and I couldn't make the print function wait for them to finish executing, so my output didn't have the images in it.
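For anyone who wants to script the plain headless command anyway, here is a rough Python sketch that wraps it. The function names are made up for illustration, and the Chrome path is the macOS one from the command above. The optional --virtual-time-budget flag is an assumption on my part: it is a headless Chrome switch that may give scripts more time before printing, but I can't promise it fixes the missing-images problem on every page.

```python
import subprocess

# macOS location from the command above; adjust for other systems (assumption).
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_chrome_pdf_command(url, output_pdf, chrome_path=CHROME,
                             virtual_time_budget_ms=None):
    """Return the argument list for a headless print-to-PDF run."""
    cmd = [chrome_path, "--headless", f"--print-to-pdf={output_pdf}"]
    if virtual_time_budget_ms is not None:
        # Optional: let page scripts run for this much virtual time first.
        cmd.append(f"--virtual-time-budget={virtual_time_budget_ms}")
    cmd.append(url)
    return cmd


def print_to_pdf(url, output_pdf, **options):
    """Run headless Chrome; raises CalledProcessError if Chrome exits non-zero."""
    subprocess.run(build_chrome_pdf_command(url, output_pdf, **options), check=True)
```

For example, print_to_pdf("http://127.0.0.1:8080", "test.pdf", virtual_time_budget_ms=10000) runs the same command as above with a ten-second budget added.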

The solution I found was a Node.js package: chrome-headless-render-pdf. Its scant documentation is here. It works and it is easily scriptable.
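To give an idea of what scripting it could look like, here is a small Python sketch that batches several pages into one call. It assumes the package is installed globally so that the chrome-headless-render-pdf command is on your PATH, and that the CLI accepts repeated --url/--pdf pairs as I remember from its usage text; please check its help output before relying on this. The function names are hypothetical.

```python
import subprocess


def build_render_pdf_command(pages):
    """Return the argument list for rendering (url, output_pdf) pairs.

    Assumes the CLI takes one --url/--pdf pair per page, in order.
    """
    cmd = ["chrome-headless-render-pdf"]
    for url, output_pdf in pages:
        cmd += [f"--url={url}", f"--pdf={output_pdf}"]
    return cmd


def render_pages(pages):
    """Run the CLI once for all pages; raises if it exits non-zero."""
    subprocess.run(build_render_pdf_command(list(pages)), check=True)
```

For instance, render_pages([("http://127.0.0.1:8080", "test.pdf")]) would render the same local page as in the Chrome command above.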