Scan-to-PDF software for Linux?

I have a workflow whereby I scan paper documents into searchable PDFs using a Fujitsu ScanSnap S500 document scanner. I am not a great fan of the bundled software, but it's dead simple to use: place a stack of paper at the top, push the green button and a searchable PDF comes out.

Now, I'd like to do something similar on Linux (Ubuntu 10.10). The scanner is supported out of the box.

I've looked at gscan2pdf and XSane:

XSane looks powerful, but not really suitable as a workflow solution;
gscan2pdf is a bit closer to the "push the button, get the PDF" ideal, but still not 100% there.

Any other software you can recommend (free or otherwise)?

Here are some things that I found when researching this earlier this year. Sorry, I can't post more than one hyperlink due to my limited rating, so you'll have to Google for the links.

gscan2pdf

A really good GUI front end that can use various OCR engines as its backend. It will probably meet your one-touch requirement (digitxp already mentioned it).

Tesseract OCR Engine

Can be used with gscan2pdf.

http://www.linuxjournal.com/article/9676

Ocropus

I didn't get very far with Ocropus, since it wasn't recognizing text without extensive training. It would probably be really good for books, but it didn't work well for me with bills and such. YMMV.

Cuneiform

I had the best success with Cuneiform and was able to create searchable PDFs by scripting commands similar to the following workflow:

# extract images from scans
# (not shown)

# convert to black-and-white
optimize2bw -n -i nuance-test.png -o bw.bmp

# do the OCR process and generate an hOCR file
cuneiform -l eng -f hocr -o nuance-test.html bw.bmp

# reassemble the original image with the hOCR file to generate a new PDF
hocr2pdf -s -i nuance-test.png -o nuance-test.hocr.pdf < nuance-test.html

You will also need to install the exactimage package, which provides optimize2bw and hocr2pdf.
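The three steps above can be wrapped in a small script so that a whole stack of pages is processed in one go. This is only a rough sketch: the names run, ocr_page and DRY_RUN are made up for the example, the output file naming is my own assumption, and the command flags are copied unchanged from the workflow above. It assumes each page has already been extracted as a PNG file.

```shell
#!/bin/sh
# Sketch only: wraps the three commands shown above so they can be run on
# several scanned pages. The names run, ocr_page and DRY_RUN are made up
# for this example; they are not part of cuneiform or exactimage.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert one PNG page into a searchable PDF next to it.
ocr_page() {
    page=$1
    base=${page%.png}    # scan-001.png -> scan-001

    # convert to black-and-white
    run optimize2bw -n -i "$page" -o "$base.bmp" || return 1

    # do the OCR process and generate an hOCR file
    run cuneiform -l eng -f hocr -o "$base.html" "$base.bmp" || return 1

    # reassemble the original image with the hOCR file (read from stdin)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "hocr2pdf -s -i $page -o $base.pdf < $base.html"
    else
        hocr2pdf -s -i "$page" -o "$base.pdf" < "$base.html"
    fi
}

# Process every file named on the command line.
for page in "$@"; do
    ocr_page "$page" || echo "failed: $page" >&2
done
```

Saved as, say, ocr-pages.sh (a hypothetical name), running it with DRY_RUN=1 sh ocr-pages.sh scan-*.png prints the commands without executing them, which is a cheap way to check the file names before doing the real OCR run.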

Various open-source projects for OCR'ing PDFs use Cuneiform and hocr2pdf as well:

WatchOCR
Archivista

Let me know what you find out!
