LogicalDOC Community could be used for this purpose. It lets you catalog and tag many file types and has a free built-in OCR engine.

One of the features I really like about this package is the full-text search engine, which natively supports language-specific searches.

There is good documentation for installing it on Ubuntu, and the installation doesn't involve any special difficulties.


There are several open source document management systems and scanning solutions that could help with your archiving needs. For document management there are:

  • OpenKM (seems to include OCR, though it's not immediately clear whether that's part of the community edition)
  • Mayan EDMS (includes OCR, pure open source project written in Python, so just pip install mayan-edms)
  • KnowledgeTree
  • LetoDMS (seems rather dead)
  • OpenDocMan
  • Nuxeo
  • Feng Office
  • Project Looking Glass

As for scanning software, there are a few open source options, but none that perform particularly well. Depending on what you are looking to archive (and how you plan on accessing it in the future), you might be able to just tag your documents accordingly inside your management software. Also, you are unlikely to find solid OCR in any freeware scanning application.

If you have the option, I strongly suggest outsourcing document conversion projects. Not only will you get it done faster, you will also have the option to OCR your files and can expect the finished result to be professional and easy to read.


There is a document management system that does pretty much exactly what you require, called Archivista. I've evaluated it for our museum's archive.

It can be downloaded as an installable ISO or purchased pre-installed on small business computers. I don't know of a way to install it on Ubuntu, however, which may be a dealbreaker for you. Here, we just run it as a virtual machine and interact with it via X forwarding and its HTML interface.

Archivista claims the software is designed for long data retention periods (approx. 20 years). It can make use of scanners, and for each scanned document it stores an image, a PDF and an OCR version. Documents can be assigned metatags, and their OCR'ed text is searchable.
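
To make the "metatags plus searchable OCR text" idea concrete, here is a minimal Python sketch of how such a lookup works in principle. It is not how Archivista (or any of the systems above) is implemented; the `Document` and `Archive` names are made up for illustration, and the OCR text is assumed to already exist as a plain string.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record: an ID, the text produced by OCR, and user tags."""
    doc_id: str
    ocr_text: str
    tags: set = field(default_factory=set)


class Archive:
    """Toy in-memory archive with a word index over the OCR text."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)  # lowercase word -> set of doc IDs

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        for word in re.findall(r"\w+", doc.ocr_text.lower()):
            self.index[word].add(doc.doc_id)

    def search(self, query, tag=None):
        """Return sorted IDs of documents containing every query word,
        optionally restricted to documents carrying the given tag."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        if tag is not None:
            hits = {d for d in hits if tag in self.docs[d].tags}
        return sorted(hits)
```

You would call `archive.add(Document("inv-001", "Invoice for scanner repair", {"invoice"}))` for each scanned item and then, for example, `archive.search("scanner", tag="invoice")`. A real system adds stemming, language handling and persistent storage on top of this.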