Extracting text from a .PDF scanned book [closed]

I posted an answer earlier that explains how to use Cuneiform (open-source software) to OCR PDF files and produce a PDF with the recognized text in a hidden text layer "behind" the original image. As far as I know, Cuneiform supports Romanian as well.

That particular solution was written for Linux, but Cuneiform is also available for Windows.
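
For reference, the per-page step of that approach can be sketched roughly as below. This is only a minimal sketch, not the exact script from the other answer. It assumes each page has already been extracted from the PDF as an image (the file name `page-001.bmp` is hypothetical), that the `cuneiform` command-line tool and `hocr2pdf` (from the ExactImage package) are installed, and that `rum` is the language code your Cuneiform build uses for Romanian. Running `cuneiform -l` without a language should list the codes it actually supports.

```python
import subprocess
from pathlib import Path


def build_page_commands(image, lang="rum"):
    """Return the cuneiform and hocr2pdf argument lists for one page image.

    `lang` is an assumption: "rum" is the code Cuneiform normally uses for
    Romanian, but check the list printed by your own installation.
    """
    image = Path(image)
    hocr = image.with_suffix(".html")  # recognized text with positions (hOCR)
    pdf = image.with_suffix(".pdf")    # page image plus hidden text layer
    ocr_cmd = ["cuneiform", "-l", lang, "-f", "hocr", "-o", str(hocr), str(image)]
    pdf_cmd = ["hocr2pdf", "-i", str(image), "-o", str(pdf)]
    return ocr_cmd, pdf_cmd, hocr


def ocr_page(image, lang="rum"):
    """Run OCR on one page image and write a searchable one-page PDF."""
    ocr_cmd, pdf_cmd, hocr = build_page_commands(image, lang)
    subprocess.run(ocr_cmd, check=True)
    # hocr2pdf reads the hOCR markup from standard input
    with open(hocr, "rb") as hocr_file:
        subprocess.run(pdf_cmd, stdin=hocr_file, check=True)
```

Calling `ocr_page("page-001.bmp")` would then write `page-001.html` and `page-001.pdf` next to the image. Splitting the original PDF into page images and joining the resulting one-page PDFs back together are separate steps that are not shown here.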


Adobe Acrobat Professional can do that. I'm not sure if there is a Romanian version...


ABBYY FineReader is very strong OCR software. It handles very complex layouts and supports many formats (including PDF). Romanian is supported with a dictionary, i.e. the software uses a dictionary to prioritize hypotheses during recognition.

In any case, OCR-ing scientific literature that has poor scan quality is a difficult task. Be ready to spend a lot of time helping the software by checking results and fixing the layout. On your scan I see a lot of very poor-quality text :(. I don't think any OCR software could work normally with it.