Remove JPEG artifacts from scanned text

I have a scanned PDF of a textbook, but the PDF is aggressively compressed, so lots of JPEG artifacts are present and they hurt its readability. I tried a variety of methods to fix it, but the outcome is not great:

waifu2x: Looks better but still has weird artifacts. Also very slow.

convert -threshold 70% in.jpg out.png

Is there a fast and effective way to get rid of these artifacts?


Solution 1:

PDF is not an image format; it's just a container that holds images. You have to extract those images and save them in a lossless format (or at least with lighter compression, otherwise you will add new artefacts). Afterwards you can try to get rid of the artefacts manually or use existing automatic filters. However, you need to configure those filters by hand for each specific image. The last step would be to reintegrate the cleaned images into a PDF.
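As a rough illustration of what such a filter step can look like, here is a minimal sketch in plain Python. It assumes the page has already been extracted and loaded as a grayscale grid (a list of rows with values 0 to 255). Extraction from the PDF and reassembly are not shown. The function names, the filter radius and the cutoff of 178 (roughly 70% of 255, mirroring the threshold tried in the question) are illustrative assumptions and would need tuning per scan. In practice an image library would do the same thing much faster.

```python
from statistics import median


def median_filter(pixels, radius=1):
    """Replace each pixel with the median of its neighbourhood.

    Isolated specks (typical JPEG ringing around glyphs) are outvoted
    by their neighbours, while solid strokes survive.
    """
    height, width = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the window, clipped at the image borders.
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


def threshold(pixels, cutoff=178):
    """Map every pixel to pure white (255) or pure black (0)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in pixels]


def clean_page(pixels, radius=1, cutoff=178):
    """Smooth first, then binarise, so specks do not survive as black dots."""
    return threshold(median_filter(pixels, radius), cutoff)
```

Running the median filter before the threshold is the design choice here: thresholding alone turns every dark speck into a hard black dot, whereas smoothing first lets the surrounding paper colour win. The trade-off is that thin strokes in small type can be eroded, so the radius should stay small.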

However, there is no "fast, universal" way to remove those artefacts. If there were, the artefacts, simply speaking, would not have been created in the first place: they are the price paid for the reduced file size, and the detail that the compression discarded is gone.

The only way to get rid of the artefacts would be to recognize the symbols (letters, numbers etc.) and discard everything else, which OCR software might be able to do. There is advanced OCR software that can work with low-resolution documents, but it is often not free. You don't have to buy the software, though; you can look for an online service instead (there are dozens out there). Keep in mind that this will essentially turn your image files into text files.