Scan A4 document > PDF > OCR > translate to English?

I've tried using a combination of

  • my home scanner to create a '300 dpi', 'document', 'pdf' (options on Canon all-in-one)
  • ZoHoViewer to create either an RTF or TXT file
  • Google Docs to translate

I'm not sure how good or bad a product ZoHoViewer is, but the following:

Als Arbeitsmarktbehörde haben wir den gesetzlichen Auftrag, die Vermittelbarkeit von

turns into:

AlsArbeitsmarktbeh6rde habenwirdengesetzlichenAuftrag,dieVermittelbarkeit vonSt...

Consequently, Google Docs makes a pig's breakfast of trying to translate it.

Does anyone have any better suggestions (preferably free online services)?


Several other questions on SuperUser cover OCR and may be worth checking for possible solutions.

Most notably, this answer by Molly looks promising:

I really like TopOCR, certainly a great addition to your scan tools:

  • Incredible OCR accuracy, up to 99.8% with a 3 MP camera
  • No page limits, and no extra downloads or components needed
  • Handles images with mixed text and graphics (Manual or Auto Zoning)
  • Tolerates skew and uneven lighting
  • Multiple text output formats, including searchable PDF and HTML
  • Able to read 11 different languages
  • Powerful, easy to use Image Processing with Image Dewarping
  • Supports Smartphones: See some Smartphone samples
  • Includes built-in, full featured Text and Image WYSIWYG Editors
  • Post-processing spell checker for all 11 languages
  • Built-in Text-To-Speech software. How about OCR to MP3?
  • Includes a built-in multi-lingual text translator
  • Supports a Command Line Interface and a GUI
  • Make a high performance document Search and Indexing system
  • Browser Helper Mode supports creating free audio eBooks
  • With TopOCR's Web Engine it's easy to add new features

It's very accurate and works excellently with low-quality images such as photographs of pages/documents.

TopOCR is freeware (can be made portable with Universal Extractor)

Further reading:

Which OCR software has the most options?

Practical OCR solution for converting a large book to a digital format?

How to extract text with OCR from a PDF on Linux?


Given that the OCR has converted:

Als Arbeitsmarktbehörde ...

to:

AlsArbeitsmarktbeh6rde ...

A couple of things spring to mind.

  1. Try scanning at a higher dpi. The OCR does not seem to recognise the spaces between the words, and a higher dpi might improve that.

  2. Can you set the language of your OCR program? I see that it's converted the "ö" to a "6". While this might be a problem caused by the resolution, it might also be that, as "ö" isn't an everyday part of English, the program is choosing the "next best" fit, in this case "6".
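
If you end up rescanning a few times with different settings, it can help to check the OCR output for those two symptoms before pasting it into a translator. Below is a rough sketch of such a check in Python. The function name `ocr_warning_signs` and the 25-character threshold are my own assumptions, not part of any OCR tool, and the threshold is only a guess (German compound words can be long, so it may need adjusting).

```python
import re

# Hypothetical helper, not part of any OCR product: flags the two symptoms
# seen in the sample output (fused words and a digit inside a word).
LETTER_DIGIT_LETTER = re.compile(r"[^\W\d_]\d[^\W\d_]")


def ocr_warning_signs(text, max_word_len=25):
    """Return suspicious tokens found in OCR output, grouped by symptom.

    max_word_len is an assumed cut-off; tokens longer than this are
    treated as words that probably lost the spaces between them.
    """
    tokens = text.split()
    return {
        # Very long tokens suggest the OCR missed the gaps between words.
        "fused": [t for t in tokens if len(t) > max_word_len],
        # A digit sandwiched between letters suggests a misread character,
        # such as "ö" being read as "6".
        "digit_in_word": [t for t in tokens if LETTER_DIGIT_LETTER.search(t)],
    }


if __name__ == "__main__":
    sample = ("AlsArbeitsmarktbeh6rde "
              "habenwirdengesetzlichenAuftrag,dieVermittelbarkeit vonSt...")
    for symptom, hits in ocr_warning_signs(sample).items():
        print(symptom, hits)
```

If both lists come back empty after a rescan or a language change, the text is at least free of these two particular problems, although that says nothing about other kinds of OCR errors.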