How can I make tesseract recognize only numbers when they are mixed with letters?
I want to use tesseract to recognize only numbers. The problem is that my input is a mixture of numbers and letters, and when I use SetVariable("tessedit_char_whitelist", "0123456789"), tesseract still returns a digit for every symbol, so each letter comes out as a wrong digit.
Can I set a threshold value so that tesseract omits the symbols with low resemblance?
NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.
Solution 1:
Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the command line:
tesseract image.tif outputbase nobatch digits
As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.
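If the threshold in the question means a confidence cutoff, one possible workaround is to filter after recognition instead of inside tesseract. The Python sketch below assumes a tesseract build that can write TSV output (the tsv config, which I believe is available from 3.05 on, used like the digits config above) and that the file has the usual conf and text columns. Note that these confidences are per word, not per symbol. The function name, the default cutoff of 60, and the sample data are all made up for illustration.

```python
import csv
import io

DIGITS = set("0123456789")


def digits_above_threshold(tsv_text, min_conf=60.0):
    """Return digit-only words from TSV text whose confidence
    (0-100 scale) is at least min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf"))
        except (TypeError, ValueError):
            # Malformed or short row: ignore it.
            continue
        # Keep only non-empty, all-digit words that clear the cutoff.
        if text and set(text) <= DIGITS and conf >= min_conf:
            kept.append(text)
    return kept


# Hypothetical sample in the TSV column layout; the values are invented.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t40\t20\t93\t2014\n"
    "5\t1\t1\t1\t1\t2\t60\t10\t40\t20\t31\t7781\n"
    "5\t1\t1\t1\t1\t3\t110\t10\t40\t20\t88\t42\n"
)

print(digits_above_threshold(SAMPLE_TSV))  # ['2014', '42']
```

The right cutoff depends on the font and image quality, so it is worth checking a few known images before settling on a value.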
Solution 2:
For tesseract 3, the command is simpler, according to the FAQ:
tesseract imagename outputbase digits
But it doesn't work very well for me.
I then tried different psm options and found that -psm 6 works best for my case. See man tesseract for details.
Solution 3:
For tesseract 3, I created a config file as described in the FAQ. The FAQ says to set the whitelist BEFORE calling an Init function, or to put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
Then it works with the command:
tesseract imagename outputbase digits
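If you would rather script the file creation, here is a minimal Python sketch. It only writes the one-line config in the format shown above; the function name is hypothetical, and the tessdata location is an assumption you would replace with wherever your installation keeps it. The demo writes into a temporary directory so it does not touch a real install.

```python
import os
import tempfile


def write_digits_config(tessdata_dir, whitelist="0123456789"):
    """Create <tessdata_dir>/configs/digits holding the whitelist line."""
    configs_dir = os.path.join(tessdata_dir, "configs")
    os.makedirs(configs_dir, exist_ok=True)
    path = os.path.join(configs_dir, "digits")
    with open(path, "w", encoding="ascii") as fh:
        # Config files use "name value" pairs, one per line.
        fh.write("tessedit_char_whitelist " + whitelist + "\n")
    return path


# Demo against a throwaway directory (stand-in for the real tessdata dir).
demo_dir = tempfile.mkdtemp()
config_path = write_digits_config(os.path.join(demo_dir, "tessdata"))
with open(config_path, encoding="ascii") as fh:
    print(fh.read(), end="")
```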
Solution 4:
To match only 0-9:
tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789
Or, to match a set that is mostly 0-9 but with one or more different characters:
tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE
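The same call can be wrapped from Python, which is handy when the whitelist changes per image. This is only a sketch: the helper names are hypothetical, it assumes the tesseract executable is on PATH, and it simply passes the -c tessedit_char_whitelist=... option shown above.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, whitelist="0123456789"):
    """Build the argument list for a whitelist-restricted run to stdout."""
    return ["tesseract", image_path, "stdout",
            "-c", "tessedit_char_whitelist=" + whitelist]


def recognize(image_path, whitelist="0123456789"):
    """Run tesseract and return its text output (needs tesseract on PATH)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(build_tesseract_command(image_path, whitelist),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Only print the command here; recognize() needs a real image and install.
print(build_tesseract_command("myimage.png", "01234ABCDE"))
```

Passing the arguments as a list avoids shell quoting problems if the whitelist ever contains characters the shell treats specially.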