Scan-to-PDF software for Linux?
I have a workflow whereby I scan paper documents into searchable PDFs using a Fujitsu ScanSnap S500 document scanner. I am not a great fan of the bundled software, but it's dead simple to use: place a stack of paper at the top, push the green button and a searchable PDF comes out.
Now, I'd like to do something similar on Linux (Ubuntu 10.10). The scanner is supported out of the box.
I've looked at gscan2pdf and XSane:
- XSane looks powerful, but not really suitable as a workflow solution;
- gscan2pdf is a bit closer to the "push the button, get the PDF" ideal, but still not 100% there.
Any other software you can recommend (free or otherwise)?
Here are some things that I found when researching this earlier this year. Sorry, I can't post more than one hyperlink due to my limited rating, so you'll have to Google for the links.
gscan2pdf
A really good GUI front end that can use various OCR engines as the backend. This will probably meet your one-touch requirement (and digitxp already mentioned it).
Tesseract OCR Engine
Can be used with gscan2pdf.
- http://www.linuxjournal.com/article/9676
Ocropus
I didn't get very far with Ocropus since it wouldn't recognize text without extensive training. It would probably be really good for books, but didn't work well for me with bills and such. YMMV.
Cuneiform
I had the best success with Cuneiform and was able to create searchable PDFs by scripting commands similar to the following workflow:
# extract images from scans
# (not shown)
# convert to black-and-white
optimize2bw -n -i nuance-test.png -o bw.bmp
# do the OCR process and generate an hOCR file
cuneiform -l eng -f hocr -o nuance-test.html bw.bmp
# reassemble the original image with the hOCR file to generate a new PDF
hocr2pdf -s -i nuance-test.png -o nuance-test.hocr.pdf < nuance-test.html
You will also need to install the exactimage package, which provides optimize2bw and hocr2pdf.
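To get closer to the "push the button" ideal, the three commands above could be wrapped in a small script that loops over page images. This is only a rough sketch: the function name ocr_page and the output naming scheme are made up for illustration, and it assumes optimize2bw, cuneiform and hocr2pdf are all on your PATH and that the pages have already been extracted as image files.

```shell
#!/bin/sh
# Sketch: OCR each scanned page image into its own searchable PDF.
# Assumes optimize2bw and hocr2pdf (exactimage) and cuneiform are installed.
# ocr_page and the file naming below are illustrative, not from any package.

ocr_page() {
    png=$1
    base=${png%.*}          # strip the extension, e.g. page1.png -> page1
    bw="$base.bw.bmp"       # temporary black-and-white copy for the OCR step
    html="$base.html"       # hOCR output written by Cuneiform
    pdf="$base.hocr.pdf"    # final searchable PDF

    # Same three steps as above; stop at the first one that fails.
    optimize2bw -n -i "$png" -o "$bw" &&
    cuneiform -l eng -f hocr -o "$html" "$bw" &&
    hocr2pdf -s -i "$png" -o "$pdf" < "$html"
    status=$?

    rm -f "$bw"             # the intermediate bitmap is no longer needed
    return $status
}

# Process every image given on the command line.
for page in "$@"; do
    ocr_page "$page" || echo "failed: $page" >&2
done
```

Saved under a name of your choosing (say ocr-pages.sh), it could be run as `sh ocr-pages.sh page*.png`. It produces one PDF per page and leaves the hOCR files in place; merging the pages into a single document is left out here.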
Various open-source projects for OCRing PDFs use Cuneiform and hocr2pdf as well:
- WatchOCR
- Archivista
Let me know what you find out!