How do I convert a scanned PDF into a PDF with text?
I have scanned about 80 pages into a grayscale PDF (image format). The resulting file is about 70 MB, which is very large.
Now I am looking for a way to convert the grayscale, image-based PDF into a simple black-and-white, text-based PDF.
I have made many attempts with gs, but with no success (only a few percent recovery).
If any expert has some idea, kindly let me know.
gImageReader is a simple GTK+ front-end to tesseract-ocr.
sudo apt-get install gimagereader tesseract-ocr
Sorry for the German text.
You can try pdfocr:
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
To run it, the syntax is:
pdfocr -i input.pdf -o output.pdf
where input.pdf is the name of the input file and output.pdf is the output file.
By default it uses Tesseract. To install it:
sudo apt-get install tesseract-ocr
pdfocr creates an embedded text layer.
Have a look at OCRmyPDF, which works well.
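For what it's worth, a minimal sketch of how an OCRmyPDF run might look. The file names are placeholders, and I'm assuming the ocrmypdf package is available for your release and that the English tesseract language data is installed; swap the language code as needed.

```shell
# Hypothetical file names; adjust to your own scan.
in="input.pdf"
out="output.pdf"

# Install once on Ubuntu (assumes the package exists in your release):
#   sudo apt-get install ocrmypdf

if command -v ocrmypdf >/dev/null 2>&1 && [ -f "$in" ]; then
    # -l eng   : OCR language (needs the matching tesseract language data)
    # --deskew : straighten slightly rotated pages before OCR
    ocrmypdf -l eng --deskew "$in" "$out"
    status="done"
else
    # Nothing to do if the tool or the input file is missing.
    status="skipped"
    echo "ocrmypdf or $in not found; nothing done" >&2
fi
```

The output keeps the page images and adds a searchable text layer, so on its own this may not shrink the file much.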
pdfsandwich
It pulls in tesseract and other dependencies on install. It's an easy one-step solution and can be scripted. It can use hocr2pdf to create a plain-text PDF, but that's not ready for prime time yet. By default it uses tesseract and creates a "sandwiched" PDF: image + text underneath.
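A rough sketch of a default pdfsandwich run, in case it helps. The file name scan.pdf is just a placeholder, and I'm assuming English text; without -o, the result should be written next to the input with _ocr appended to the name.

```shell
# Hypothetical input file; adjust to your own scan.
in="scan.pdf"

# Default output name when -o is not given: <input>_ocr.pdf
expected_out="${in%.pdf}_ocr.pdf"

if command -v pdfsandwich >/dev/null 2>&1 && [ -f "$in" ]; then
    # -lang eng : OCR language handed to tesseract
    pdfsandwich -lang eng "$in"
    echo "Result should be in $expected_out"
else
    # Nothing to do if the tool or the input file is missing.
    echo "pdfsandwich or $in not found; nothing done" >&2
fi
```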
The embedded image can be removed with commands like:
gs -o ocr_noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE ocr_image.pdf
but the text is hidden, so it looks like a blank page.
Loading the PDF into LibreOffice Draw exposes the text, and the image can be deleted manually.
You could try shrinkpdf to reduce the file size and then ocr.sh to add the text layer.