How do I search a PDF file from command line?

Solution 1:

Just to add to the above answer, in particular you can use a command line tool from xpdf-utils called pdftotext and then search the text document created by this tool with grep.

This might look something like this:

pdftotext document.pdf - | grep -C5 -n -i "search term"

There is more information in the manual. The only drawback to pdftotext is that you can't use globbing to convert multiple files at the same time. This problem can be overcome with a small bash loop:

for f in pdf_directory/*.pdf; do echo "$f"; pdftotext "$f" - | grep -i "search_term"; done
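
If you only want to see the files that actually match, or your file names contain spaces, the loop can be wrapped in a small function. This is just a sketch: the name search_pdfs and its argument order are made up for illustration, and it assumes pdftotext and GNU grep (for --label) are on your PATH.

```shell
#!/bin/bash
# search_pdfs is a hypothetical helper, not part of xpdf-utils or poppler-utils.
# Usage: search_pdfs "search term" file1.pdf [file2.pdf ...]
search_pdfs() {
    local term=$1
    shift
    local f
    for f in "$@"; do
        # Convert the PDF to text on stdout ("-"), then search it
        # case-insensitively. --label (GNU grep) makes grep print the
        # PDF name instead of "(standard input)"; -n adds the line
        # number within the extracted text.
        pdftotext "$f" - 2>/dev/null |
            grep -i -n -H --label="$f" -- "$term"
    done
}
```

After sourcing it, something like search_pdfs "search term" pdf_directory/*.pdf prints one line per match, prefixed with the PDF name.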

If you are having problems creating a text document from a pdf due to, for example, an incompatible pdf file, then that is another problem.

I think that, in general, PDF editors don't include a command-line interface because they are graphical. If you want to use bash (or maybe zsh!), you will have to work in a terminal.

Good luck!

Solution 2:

poppler-utils

Note: xpdf-utils is a transitional package for poppler-utils.

You can use poppler-utils. poppler-utils is a suite of tools for Portable Document Format (PDF) files.

To install it, use the Ubuntu Software Center or run sudo apt install poppler-utils.

pdfgrep

pdfgrep can search for a string or a pattern in PDF files, recursively through directory trees, counting matches or printing some context for each match. For example, to recursively search for keyword in /some/directory, case-insensitively:

pdfgrep -Ri keyword /some/directory

pdfgrep is a tool to search text in PDF files. It works similarly to grep.

Features:

  • search for regular expressions.
  • support for some important grep options, including:
      + filename output.
      + page number output.
      + optional case insensitivity.
      + count occurrences.
  • and the most important feature: color output!
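
The grep-style options in this list map onto pdfgrep flags roughly as follows. This is a short sketch that assumes pdfgrep is installed; the function names pdf_find and pdf_count are invented for the example.

```shell
#!/bin/bash
# Hypothetical wrappers; the names are illustrative only.

# Print each match with its file name (-H) and page number (-n),
# ignoring case (-i).
pdf_find() {
    local pattern=$1
    shift
    pdfgrep -H -n -i "$pattern" "$@"
}

# Print only the number of matches per file (-c), ignoring case (-i).
pdf_count() {
    local pattern=$1
    shift
    pdfgrep -c -i "$pattern" "$@"
}
```

For example, pdf_find keyword report.pdf shows where the keyword occurs, while pdf_count keyword *.pdf gives a per-file tally.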

To install it, run sudo apt install pdfgrep.

Source: Ubuntu Apps Directory

Solution 3:

To search for a regular expression in multiple pdf files using pdfgrep:

find /path -iname '*.pdf' -exec pdfgrep -H 'pattern' {} \;

where /path is the location of your PDF files.

Solution 4:

pdftotext may have failed because the PDFs are scanned images that need OCR. I wrote a quick way to find all PDFs that cannot be grepped and OCR them.

I noticed that a PDF file with no fonts is usually not searchable. Knowing this, we can use pdffonts.

The first two lines of pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create a script:

gedit check_pdf_searchable.sh

and paste this into it:

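#!/bin/bash
#set -vx
# pdffonts prints a two-line header, so fewer than 3 lines means no fonts.
if (( $(pdffonts "$1" | wc -l) < 3 )); then
    # Report the file, then write an OCRed copy next to it.
    echo "$1"
    ocrmypdf "$1" "$1"_ocr.pdf
fi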

then make it executable

chmod +x check_pdf_searchable.sh

then list all non-searchable PDFs in the directory (each one is also run through OCR):

ls -1 ./*.pdf | xargs -I {} ./check_pdf_searchable.sh {}

or in the directory and its subdirectories:

tree -fai . | grep -P "\.pdf$" | xargs -I {} ./check_pdf_searchable.sh {}

You also need to install ocrmypdf:

sudo apt install ocrmypdf
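
File names containing spaces or quotes can trip up the ls and tree pipelines. One way around that is to let find pass the names straight to the check. This is a minimal sketch built on the same assumption as the script (a PDF for which pdffonts lists no fonts is treated as not searchable). The name list_unsearchable_pdfs is made up, and this version only lists the files; it does not run ocrmypdf.

```shell
#!/bin/bash
# Hypothetical helper: print every PDF under a directory (default: .)
# for which pdffonts shows only its header, i.e. fewer than 3 lines.
list_unsearchable_pdfs() {
    find "${1:-.}" -type f -iname '*.pdf' -exec sh -c '
        for f in "$@"; do
            # pdffonts prints a two-line header, then one line per font.
            if [ "$(pdffonts "$f" 2>/dev/null | wc -l)" -lt 3 ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh {} +
}
```

Running list_unsearchable_pdfs ~/Documents would print one path per line; each path could then be handed to ocrmypdf as in the script.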