Indexing PDF files on Ubuntu
I'm looking for a solution on Ubuntu that indexes PDF (and possibly PostScript) files so they can be searched later.
The criteria would be:
- Compatibility: The quality of text extraction often varies depending on which software created the PDF. Some PDFs can also be "locked", which I guess one should respect.
- Search functionality: wildcards, regexes, "fuzzy" matching.
- Speed of search
In my case I want to index a folder of academic journal articles, hence the requirement that it works consistently regardless of what software created the PDF. I'm already using a reference manager so would rather not replace that.
For example, a good front end to Beagle, plus a plugin that allows it to index PDFs, would be perfect.
Tracker does the same thing as Beagle and Strigi, but unlike Beagle, it's written in pure C (Beagle is a Mono application). Allegedly, it is a lot faster than Beagle, though I haven't benchmarked it myself.
I can't find you a link to Tracker, but I'm sure it's in the default Ubuntu repositories.
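Lucene does full-text indexing and is commonly used for PDF, HTML, Microsoft Word, and OpenDocument files; the text is usually pulled out of those formats by a separate parser such as Apache PDFBox or Apache Tika before it is indexed. It's just a library, but several applications and CMSs use it, or you could use it as a base for your own solution.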
It is free software (Apache license).
Edit:
If you are looking for something with a frontend, you might consider Beagle or Strigi:
Beagle
Strigi
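If you do end up rolling your own solution, the basic idea behind these indexers can be sketched in a few lines. The following is only a toy illustration in plain Python, not Lucene's or Tracker's API: it assumes the `pdftotext` command from poppler-utils is installed for extraction, and the function names (`extract_text`, `build_index`, `search_wildcard`, `search_regex`, `search_fuzzy`) are made up for this example. It builds a word-to-file index and covers the three search styles asked for in the question.

```python
import difflib
import fnmatch
import re
import shutil
import subprocess
from collections import defaultdict


def extract_text(pdf_path):
    """Return the text of a PDF using the pdftotext CLI (assumed installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; try installing poppler-utils")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_index(documents):
    """Map each lowercased word to the set of document names containing it.

    `documents` is a dict of {name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def _docs_for(index, words):
    """Collect the sorted names of all documents containing any of `words`."""
    hits = set()
    for word in words:
        hits |= index[word]
    return sorted(hits)


def search_wildcard(index, pattern):
    """Shell-style wildcards against indexed words, e.g. 'regress*'."""
    return _docs_for(index, fnmatch.filter(list(index), pattern.lower()))


def search_regex(index, pattern):
    """Regular expression that must match a whole indexed word."""
    rx = re.compile(pattern)
    return _docs_for(index, [w for w in list(index) if rx.fullmatch(w)])


def search_fuzzy(index, word, cutoff=0.8):
    """Approximate match, tolerant of small typos in the query word."""
    close = difflib.get_close_matches(word.lower(), list(index), n=10, cutoff=cutoff)
    return _docs_for(index, close)
```

To use it on a folder, you would call `extract_text` on each file, collect the results in a dict keyed by file name, and pass that to `build_index`. A real indexer also stores word positions, handles phrases, and keeps the index on disk, which is exactly the work Lucene, Tracker, and the like do for you.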