Limit characters tesseract is looking for
Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.
Solution 1:
Create a config file (e.g. "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs or /usr/share/tesseract-ocr/tessdata/configs, and add this line to it:
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz
The characters are listed literally here; I have not checked whether a range such as [a-z] also works. Then call tesseract like this:
tesseract input.tif output nobatch letters
This restricts tesseract to recognizing only the listed characters.
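If you want to script the config-file step, here is a minimal Python sketch. It assumes the configs directory is one of the two locations mentioned above and that you have write permission there. The helper names `find_configs_dir` and `write_letters_config` are made up for illustration, and nothing here runs tesseract itself.

```python
from pathlib import Path

# The two locations mentioned above; other installs may use a different path.
CANDIDATE_DIRS = (
    "/usr/share/tesseract/tessdata/configs",
    "/usr/share/tesseract-ocr/tessdata/configs",
)


def find_configs_dir(candidates=CANDIDATE_DIRS):
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


def write_letters_config(configs_dir, name="letters"):
    """Write the one-line whitelist config into configs_dir and return its path."""
    target = Path(configs_dir) / name
    # Config files hold "<variable> <value>" pairs, one per line.
    target.write_text(
        "tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz\n",
        encoding="utf-8",
    )
    return target
```

Writing into a system directory usually needs root, so in practice you might run this once with elevated rights and then call tesseract as a normal user.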
Solution 2:
In version 4.0, a whitelist (whether set in a config file or with the -c tessedit_char_whitelist=... command-line switch) only takes effect if you set the OCR Engine mode to "Original Tesseract only" (--oem 0). This is because the new "Neural nets LSTM" mode doesn't respect the whitelist setting.
Example of a proper command line for version 4.0:
tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123
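If you call tesseract from a script, the same switches can be put together programmatically. This is only a sketch in Python, assuming `tesseract` is on your PATH; the function names `legacy_whitelist_args` and `run_tesseract` are hypothetical.

```python
import subprocess


def legacy_whitelist_args(image, output_base, allowed_chars):
    """Build the argv for the original engine (--oem 0) plus a whitelist."""
    return [
        "tesseract", image, output_base,
        "--oem", "0",
        "-c", f"tessedit_char_whitelist={allowed_chars}",
    ]


def run_tesseract(args):
    """Run tesseract; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(args, check=True, capture_output=True, text=True)
```

Passing the arguments as a list rather than a single shell string means whitelist characters such as `$` or quotes are not interpreted by a shell. For example, `run_tesseract(legacy_whitelist_args("input_file", "output_file", "abc123"))` issues the same command as the line above.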
UPDATE: In version 4.0, the eng.traineddata file installed by default by the Windows installer and some Linux installers is corrupted. A temporary workaround is to replace tessdata\eng.traineddata with the file from an older version, which should be about 30 MB. Otherwise you'll get an error such as "Tesseract couldn't load any languages!".
Update for tesseract 4.1.1:
In tesseract 4.1.1 the whitelist problem described above is fixed, and the following works:
tesseract my_image.jpg stdout -l mylang myconfig
Here "myconfig" is a plain-text file located in TESSDATA/configs with the following contents:
load_system_dawg false
load_freq_dawg false
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
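The config file can also be generated rather than typed by hand, which helps when the character set is built in code. A sketch in Python; `render_config` is a hypothetical helper, and the parameter names and values are exactly the ones in the file shown above.

```python
import string


def render_config(params):
    """Render a dict of tesseract parameters as "<name> <value>" lines."""
    lines = []
    for name, value in params.items():
        # Booleans are written as lowercase true/false, as in the file above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


MYCONFIG = render_config({
    "load_system_dawg": False,
    "load_freq_dawg": False,
    "tessedit_char_whitelist": string.ascii_uppercase + string.digits,
})
```

Saving `MYCONFIG` as TESSDATA/configs/myconfig reproduces the three-line file above.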