How can I compress my .PDF (two pages) to less than 100 kB?

I have a PDF file that I want to compress to less than 100 kB. It contains a scan of two pages made with my mobile scanner. After scanning it is 338 kB (at the minimum quality that keeps the pages readable). I want to upload this file to a government portal that only accepts a single file of at most 100 kB. This is my primary purpose. These are the methods I have tried so far; none of them did the job:

  • using a simple wrapper around Ghostscript to shrink PDF files "./shrinkpdf.sh in.pdf out.pdf xx". I set xx to 90 and it gives me 282 kB. Below the 90 value, the text in PDF document is not visible clearly and I am sure my application will be rejected then.

  • I also tried "gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=setting -sOutputFile=output.pdf input.pdf" with setting value as /screen which I think is minimum setting and it gives me 232 kB and the text is fairly visible.

  • I also tried converting it to JPEG using "pdftoppm" method like - "pdftoppm compressed.pdf jpeg -r 75 -jpeg" which gives me 141 kB for 1 page and 128 kB for page 2. I am not sure how I will get these two JPEG files to upload as one file, but I guess my primary aim should be now to get it less than 100 kB first?

I use Ubuntu 20.04.2.


Solution 1:

"Below 90 value the text in pdf is not visible clearly and I am sure my application will be rejected then."

It's a scanned document. That means it's not text, it's an image of the page. PDF supports multiple image compression schemes, including lossless, but quality and degradation suggest that you're using JPEG.

This is probably the most efficient way to store it.

You want to store two pages in <100kB. That's 50kB per page. That's a tall order - but probably possible.
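To check candidate files against that budget, a small helper can add up their sizes. This is a minimal sketch, assuming the portal counts 100 kB as 100,000 bytes (it may well use 102,400 instead); total_size and fits_limit are hypothetical helper names, not part of any tool mentioned here.

```shell
#!/bin/sh
# Assumption: the upload limit is 100 * 1000 bytes. Override LIMIT_BYTES if
# the portal turns out to count kilobytes differently.
LIMIT_BYTES=${LIMIT_BYTES:-100000}

# Print the combined size in bytes of all files given as arguments.
total_size() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Succeed (exit status 0) if the combined size is within LIMIT_BYTES.
fits_limit() {
    [ "$(total_size "$@")" -le "$LIMIT_BYTES" ]
}
```

For example, fits_limit file-0.jpg file-1.jpg && echo OK reports whether the two page images together are within the limit.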

I would export the pages as JPGs and play with the quality settings and resolution until you get the result you need:

convert file.pdf file.jpg

This will give you file-0.jpg and file-1.jpg for pages one and two respectively.

Now we can try to reduce the resolution of the pages:

mogrify -resize 600x700 -quality 45 file-0.jpg

By this measure, I managed to get an A4 page down to 28kB. It's legible, but not very clear.

To convert your files back to a PDF after playing with them to reduce size, run

convert file-?.jpg file.pdf

In addition to mogrify to modify files, you can use tools such as gimp.
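The trial and error with quality settings described above can be sketched as a loop. This is only a sketch: the function name shrink_jpeg, the starting quality of 90, the step of 5 and the floor of 20 are assumptions chosen for illustration, and it relies on ImageMagick's convert being installed. It leaves the resolution alone, so a -resize step may still be needed if no quality in that range fits.

```shell
#!/bin/sh
# Re-encode a JPEG at decreasing quality until the result fits max_bytes.
# Usage: shrink_jpeg input.jpg output.jpg max_bytes
# Prints the first quality that fits; returns non-zero if none does.
shrink_jpeg() {
    src=$1
    dst=$2
    max=$3
    q=90                      # assumed starting quality
    while [ "$q" -ge 20 ]; do # assumed lowest quality worth trying
        convert "$src" -quality "$q" "$dst" || return 1
        size=$(($(wc -c < "$dst")))
        if [ "$size" -le "$max" ]; then
            echo "quality=$q size=$size"
            return 0
        fi
        q=$((q - 5))
    done
    return 1
}
```

Calling shrink_jpeg file-0.jpg small-0.jpg 50000 would keep the original untouched and write the smaller copy to small-0.jpg, which you should still check by eye for legibility.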

Solution 2:

Since a filesize-optimized scanned document is probably going to be black and white, .pbm is a monochrome bitmap format that seems perfect for this, and you can convert that back into a png to embed in a pdf.

I used a sample document [1] (imgur permalink: https://i.imgur.com/Ak2kVGD.jpg). It's a 1751x2451 jpg scan of a document, size 1.71MB, black and white with some blue accents.

convert document_scanner_sample_scan_00_zoom.jpg -resize 1000 intermediate.pbm
convert intermediate.pbm page1.png # 1000x1436, 46kb

page1.png looks quite presentable for 46kb (https://i.imgur.com/gYwtipQ.png)


As pointed out in the comments, the png needs to be transcoded to be embedded in the pdf. By default, convert uses the /FlateDecode pdf compression filter (convert page1.png page1.pdf) and the resulting pdf is 67kb. Using the /CCITTFaxDecode filter instead, which is meant for monochrome images, gets that down to 57kb:

convert page1.png -alpha off -monochrome -compress fax page1.pdf
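Since the question involves two pages, the same options can be applied to several page images at once, with convert writing them into one multi-page pdf. This is a sketch under assumptions: mono_pdf is a hypothetical wrapper name, the inputs are assumed to be already-reduced images like page1.png above, and the resulting size will depend on the scans.

```shell
#!/bin/sh
# Combine page images into a single fax-compressed, monochrome PDF.
# Usage: mono_pdf output.pdf page1.png page2.png ...
mono_pdf() {
    out=$1
    shift
    # Same options as the single-page command, applied to every input image.
    convert "$@" -alpha off -monochrome -compress fax "$out"
}
```

For example, mono_pdf both.pdf page1.png page2.png would produce both.pdf with one page per input image.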

See the ImageMagick documentation for a mapping of command-line options to pdf compression formats: https://legacy.imagemagick.org/Usage/formats/#pdf_compression

For documentation of the pdf compression filters, see section 7.4 of the pdf reference (version 1.7). An introduction is provided in section 7.4.1 Table 6.

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf


1. Original sample document found here:

https://s1-www.scan2docx.com/img/samples/document_scanner_sample_scan_00_zoom.jpg

Solution 3:

Thank you all for helping me! Thanks @vidarlo! Your suggestion and ideas really helped me get through this and also a little bit of luck did the trick. I will mark your input as solution since it really helps in this task.

For me the lucky part was that the government site allows uploading two separate files of 100kb each. This was mentioned nowhere! A second dialog box only shows up on the site after you upload the first page. What!

So now the idea was to compress each page to less than 100kb. I decided to scan each page separately:

Page 1 144kb.pdf, and Page 2 165kb.pdf

I found that (for my document at least) convert file.pdf file.jpg performs worse than pdftoppm file.pdf jpeg -r 75 -jpeg. I am not sure why, but for page one (144kb.pdf) convert gave me a 258kb jpg, while pdftoppm gave me a 130.6kb jpg for the same page that also looked better! I decided to go ahead with pdftoppm.

pdftoppm 144kb.pdf jpeg -r 75 -jpeg  --> 130.6kb.jpg
pdftoppm 165kb.pdf jpeg -r 75 -jpeg  --> 134.4kb.jpg

Then, as @vidarlo suggested, I tried mogrify, but without the resizing option: running mogrify -quality 50 on page1.jpg and page2.jpg gave me 91kb and 96kb for the two files! A quality of 45 blurs things a little more, and anything above 50 pushes the file size over 100kb.

Just in case it helps: while trying convert file.pdf file.jpg I got the error below:

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408.
convert-im6.q16: no images defined `file.jpg' @ error/convert.c/ConvertImageCommand/3258.

To fix this I modified the policy.xml file located in /etc/ImageMagick-6. I added the line <policy domain="coder" rights="read | write" pattern="PDF" /> before </policymap>; the rights for PDF were previously set to none. This got rid of the error.
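Before editing, the rule currently in effect for PDF can be inspected without changing anything. A minimal sketch: show_pdf_policy is a hypothetical helper name, and the default path is the one mentioned above, which may differ on other ImageMagick versions.

```shell
#!/bin/sh
# Print the policy line(s) that mention the PDF coder.
# Usage: show_pdf_policy [path/to/policy.xml]
show_pdf_policy() {
    grep 'pattern="PDF"' "${1:-/etc/ImageMagick-6/policy.xml}"
}
```

If the printed line contains rights="none", conversions to and from pdf are blocked by the policy.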

I also had trouble modifying policy.xml: it opened in read-only mode even though I was logged in as an admin. To make the file writable I opened it from the terminal with gedit admin:///etc/ImageMagick-6/policy.xml.

For both of these, I have to thank:

  • https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion
  • How do I get permissions to edit system configuration files?

Again, thank you all very much!

Solution 4:

Most likely the best way would be to scan it in black and white (two colors, not grayscale). A form for a government agency is unlikely to need full color or grayscale.

A decent scanner will choose a compression option other than jpeg for black and white images which will lead to much smaller files.

If that's not enough, then manually compressing the bw images with jbig2 can lead to even smaller files, although the savings only really start adding up for longer documents with many pages.

Solution 5:

After a lot of experimentation with this, I find the easiest method is to load the PDF into LibreOffice Writer (this may take some time and consume a lot of memory with large PDFs, so close unnecessary apps). Once it is loaded, use 'Export to PDF...', setting JPEG compression to 50% and image resolution to 150 DPI; you can adjust both settings to suit. Mike