Extract text from a scanned document
Is there any way to select the text from a scanned document? (The output is a JPG.) What kind of tools does Ubuntu offer for such a task? Are there any libraries I can use instead of prebuilt software binaries to do the same thing? I tried converting it to a .pdf using ImageMagick and then selecting the text, which obviously didn't work.
The name for this type of procedure is OCR (Optical Character Recognition). That link also offers a couple of choices:
gocr - A command line OCR
fuzzyocr - spamassassin plugin to check image attachments
libhocr0 - Hebrew OCR
ocrad - Optical Character Recognition program
ocrfeeder - Document layout analysis and optical character recognition system
ocropus - document analysis and OCR system
tesseract-ocr
cuneiform - multi-language OCR system
It also suggests that Tesseract (very old tutorial) is the best option of these, so give it a try.
A while ago I evaluated the various OCR packages in Ubuntu, found that Tesseract was the least bad of them (but still bad enough), and wrote a wrapper script for the OCRing (since Tesseract wants obscure input formats like TIFF). Here's my ~/bin/ocr:
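#!/bin/sh
# usage: ocr filename.jpg [...]
if test -z "$1"; then
    echo "usage: ocr filename.jpg [...]"
    echo "needs imagemagick and tesseract-ocr"
    echo "if tesseract fails, check if you've got tesseract-ocr-eng installed"
    # Stop here instead of running the loop with no input files.
    exit 1
fi
tmpdir="$(mktemp -d)"
for fn in "$@"; do
    # Convert to TIFF first, since that is what tesseract wants as input.
    convert "$fn" "$tmpdir/page.tif"
    # Run the OCR and filter the banner line out of the output.
    tesseract "$tmpdir/page.tif" "$tmpdir/page" 2>&1 | grep -v '^Tesseract Open Source OCR Engine$'
    cat "$tmpdir/page.txt"
    # Save the text next to the image; -i asks before overwriting.
    cp -i "$tmpdir/page.txt" "${fn%.jpg}.txt"
    rm "$tmpdir/page.tif" "$tmpdir/page.txt"
done
rm -r "$tmpdir"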
Preprocessing the images with GIMP (converting to B&W using the Threshold tool) seemed to help a lot.
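The same thresholding step can probably be scripted with ImageMagick instead of done by hand in GIMP. This is only a minimal sketch, assuming ImageMagick's convert is installed. The function names bw_name and threshold_for_ocr and the 50% default cutoff are made up for illustration and are not part of the script above:

```shell
#!/bin/sh
# Hypothetical helpers: the names and the 50% default cutoff are assumptions.

# bw_name FILE: print the file name used for the thresholded copy of FILE.
bw_name() {
    printf '%s\n' "${1%.*}-bw.tif"
}

# threshold_for_ocr FILE [CUTOFF]: write a black-and-white TIFF next to FILE
# and print its name on success.
threshold_for_ocr() {
    in="$1"
    cutoff="${2:-50%}"
    out="$(bw_name "$in")"
    # Convert to grayscale, then force every pixel to pure black or white.
    convert "$in" -colorspace Gray -threshold "$cutoff" "$out" &&
        printf '%s\n' "$out"
}
```

Something like threshold_for_ocr scan.jpg 60% would then give a file to feed to tesseract. The right cutoff depends on the scan, so expect to experiment with it.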
I hope things have improved since then. I've seen the name OCR Feeder in blog posts recently, so I'd give that a try.
The Tesseract-ocr package is command line. If you want a program with a GUI, I use "gscan2pdf" and you can find it in the Ubuntu Software Center.
In gscan2pdf all you need to do is click the little scan icon near the top. I think it gives you two or three OCR options, including GOCR, which isn't very good, and Tesseract, which works admirably. Pick Tesseract, then click the appropriate tab to find the resolution settings. Your best bet is 300 or even 600 DPI, and Tesseract will do well at that.
Poorly scanned, crooked, or old documents don't convert well. Good luck!
PS: I keep reading that Tesseract can only read TIFF images. This isn't the case for me. I can import JPG or PNG too.
PPS... sorry for the edits! You might try OCRFeeder in the software center too. I have yet to try it though.
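On the PS about image formats: if your Tesseract build accepts JPG and PNG directly, as it does for me, the TIFF conversion can be skipped on the command line as well. A rough sketch under that assumption; txt_base and ocr_direct are made-up names, and -l eng assumes the tesseract-ocr-eng language data is installed:

```shell
#!/bin/sh
# Hypothetical helpers: assumes a tesseract build that reads JPG/PNG input.

# txt_base FILE: print FILE without its extension. tesseract appends ".txt"
# to this output base by itself.
txt_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_direct FILE...: OCR each image without an intermediate TIFF.
ocr_direct() {
    for fn in "$@"; do
        base="$(txt_base "$fn")"
        # -l eng selects the English language data.
        tesseract "$fn" "$base" -l eng || return 1
        cat "$base.txt"
    done
}
```

For example, ocr_direct scan.jpg would leave the text in scan.txt. Unlike a cp -i step, this overwrites an existing scan.txt without asking.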
I found this. It's called Tesseract OCR, and hopefully it will be of use to you.
http://linuxappfinder.com/package/tesseract-ocr