Optimal font for Tesseract? (specifically the .NET wrapper)

Solution 1:

I ran an experiment to answer this question:

  • Generate a document of 6000 random characters drawn from the Base64 character set (basically all upper- and lower-case letters plus digits).
  • For each font on my system (a Linux box), render that same content to an image.
  • Feed the image to Tesseract.
  • Measure the error rate / accuracy of the recognized text.
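
The steps above could be sketched roughly as follows. This is a minimal illustration, not the exact script used for the experiment: it assumes the Pillow and pytesseract packages for rendering and OCR, and the function names, the font size, the line width and the choice of edit distance as the error metric are all my own assumptions.

```python
import random
import string
import textwrap

# Letters and digits only (62 symbols), matching the description above.
ALPHABET = string.ascii_letters + string.digits


def random_text(n_chars=6000, seed=0):
    """Return n_chars characters drawn uniformly from ALPHABET."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(n_chars))


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def char_error_rate(expected, recognized):
    """Edit distance divided by the expected length, ignoring whitespace."""
    # Whitespace is dropped so that line wrapping in the image is not an error.
    expected = "".join(expected.split())
    recognized = "".join(recognized.split())
    return edit_distance(expected, recognized) / max(len(expected), 1)


def ocr_error_rate(font_path, text, font_size=32, line_width=60, margin=20):
    """Render text with one font, run Tesseract on it, return the error rate."""
    # Imported here so the helpers above work without these packages installed.
    from PIL import Image, ImageDraw, ImageFont
    import pytesseract

    lines = textwrap.wrap(text, line_width)
    font = ImageFont.truetype(font_path, font_size)
    line_height = int(font_size * 1.5)
    # Generous canvas: one font_size of width per character, white background.
    size = (font_size * line_width + 2 * margin,
            line_height * len(lines) + 2 * margin)
    img = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, font=font, fill=0)
    return char_error_rate(text, pytesseract.image_to_string(img))
```

To compare fonts, call ocr_error_rate(path, random_text()) once per font file (on Linux, fc-list prints the installed font paths) and sort the results. The pure-Python edit distance is slow on 6000 characters, so a smaller sample is handy while trying things out.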

Here are the top-performing fonts for Tesseract v4.1.1:

  • mitra
  • TeX_Gyre_Bonum
  • DejaVu_Serif
  • Roboto
  • Cantarell

See also this wrap-up: https://www.monperrus.net/martin/perfect-ocr-digital-data