How can I extract text from images?

Solution 1:

Extracting text from images is called OCR (optical character recognition), and Ubuntu has a wiki page dedicated to it. From that page:

Available OCR tools

The Ubuntu Universe repositories contain the following OCR tools:

  1. gocr - A command line OCR
  2. fuzzyocr - spamassassin plugin to check image attachments
  3. libhocr0 - Hebrew OCR
  4. ocrad - Optical Character Recognition program
  5. ocrfeeder - Document layout analysis and optical character recognition system
  6. ocropus - document analysis and OCR system
  7. tesseract-ocr

The Ubuntu Multiverse repositories also contain:

  1. cuneiform - multi-language OCR system

Some of these packages are outdated, but unofficial, fresher builds can be found in the Alex_P PPA (PPA address: ppa:alex-p/notesalexp). If you have never used a PPA, first check how to add software from a PPA.
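
For reference, adding a PPA usually comes down to two commands. This is only a sketch: the function name add_ocr_ppa is made up for illustration, and it assumes the PPA still publishes packages for your Ubuntu release, which is worth checking on its Launchpad page first.

```shell
# add_ocr_ppa is a hypothetical wrapper, not an existing tool.
# It registers the PPA named above, then refreshes the package lists
# so that the newer OCR packages become visible to apt.
add_ocr_ppa() {
    sudo add-apt-repository ppa:alex-p/notesalexp &&
    sudo apt-get update
}
```

After calling add_ocr_ppa, the OCR packages can be installed or upgraded with apt-get as usual.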

edit: As noted in a comment, Clara OCR exists too, but its package got stuck at Hardy and its website was last updated in 2009.

Solution 2:

tesseract-ocr is the best of them all. To install it, run the command below:

sudo apt-get install tesseract-ocr

Usage is tesseract filename.jpg output, which generates the file output.txt. Pass the output name without an extension, because tesseract appends .txt itself.
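
To run the same command over a whole folder, a small loop may help. This is a minimal sketch that assumes the images are .jpg files in a single directory; ocr_dir is a made-up helper name, not something that ships with tesseract.

```shell
# ocr_dir is a hypothetical helper, not part of tesseract.
# It runs tesseract on every .jpg in the given directory
# (the current directory if no argument is given).
ocr_dir() {
    dir="${1:-.}"
    for img in "$dir"/*.jpg; do
        # Skip the unexpanded pattern when there are no .jpg files.
        [ -e "$img" ] || continue
        # Pass only the base name: tesseract adds the .txt extension itself.
        tesseract "$img" "${img%.jpg}"
    done
}
```

For example, ocr_dir ~/scans would write page1.txt next to page1.jpg, and so on for each image.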

You may also want to select the appropriate language. In that case, install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code (for example, tesseract-ocr-spa for Spanish). The 18.04 repository currently offers 123 languages. Then use, for example:

tesseract mySpanishText.jpg output -l spa
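
If you are not sure whether a language is already installed, tesseract --list-langs prints the available ones. The sketch below wraps that check; has_lang is a made-up helper name, and Spanish (spa) is used only because it matches the example above.

```shell
# has_lang is a hypothetical helper: it succeeds if the given language code
# is listed by `tesseract --list-langs`. Older releases print that list on
# stderr, so both streams are searched.
has_lang() {
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Example (left commented out, since it installs a package):
# has_lang spa || sudo apt-get install tesseract-ocr-spa
# tesseract mySpanishText.jpg output -l spa
```

The check matches whole lines only, so a code such as spa is not confused with a longer name that merely contains it.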