Command-line tool to search for phrases in a large number of PDF files
Solution 1:
SEARCH_DIR="/some/dir/where/you/want/to/search/"; SEARCH_STRING="whatever-you-are-searching";
# extract text from a pdf
pdftotext "file.pdf" "file.txt"
# connect it with grep
pdftotext "file.pdf" /dev/stdout | grep -H --label="file.pdf" -- "$SEARCH_STRING"
# if you want grep to show only the list of matching pdf files, add --files-with-matches
pdftotext "file.pdf" /dev/stdout | grep -H --label="file.pdf" --files-with-matches -- "$SEARCH_STRING"
# build the list of pdf files to search
find "$SEARCH_DIR" -type f -name '*.pdf' > list-of-pdf.txt
# everything joined by awk as duct tape, sent to bash for processing
# the double quote is escaped as \x22 inside awk.
find "$SEARCH_DIR" -type f -name '*.pdf' |
  awk -v SEARCH_STRING="$SEARCH_STRING" '{ print "pdftotext \x22"$0"\x22 /dev/stdout | grep -H --label=\x22"$0"\x22 -- \x22"SEARCH_STRING"\x22" }' |
  bash
# Without bash. Process the matches further to suit your needs.
find "$SEARCH_DIR" -type f -name '*.pdf' |
  awk -v SEARCH_STRING="$SEARCH_STRING" '
  {
    EXEC="pdftotext \x22"$0"\x22 /dev/stdout | grep -H --label=\x22"$0"\x22 -- \x22"SEARCH_STRING"\x22";
    # "> 0" stops the loop on end of output and on errors (getline returns -1)
    while ((EXEC | getline ret) > 0) {
      print "For file ["$0"] we have match ["ret"]";
      # do whatever you like.
    };
    close(EXEC);
  }'
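The awk pipelines above paste each file name into a command string, so a name containing a double quote, `$`, or a backtick can break the command or be interpreted by the shell. A possible alternative is to let find hand the paths to a small script as arguments. This is only a sketch: the function name `search_pdfs` is made up for illustration, and it assumes a grep that supports `-H` and `--label`, as the commands above already do.

```shell
# Hypothetical helper: search_pdfs DIR PATTERN
# Prints "path/to/file.pdf:matching line" for every match.
search_pdfs() {
    dir=$1
    pattern=$2
    # find passes each path to the inline sh script as a positional
    # argument, so no file name is ever pasted into a command string.
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        pattern=$1
        shift
        for pdf in "$@"; do
            # "-" tells pdftotext to write the extracted text to stdout.
            pdftotext "$pdf" - 2>/dev/null |
                grep -H --label="$pdf" -- "$pattern"
        done
    ' sh "$pattern" {} +
}
```

It would be called as `search_pdfs "$SEARCH_DIR" "$SEARCH_STRING"`. Adding `--files-with-matches` to the grep call would list only the matching files, as in the single-file example.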
Solution 2:
Under both Linux and Windows, you can use Acrobat Reader, which has a command to search multiple files.
Under Linux, there is Recoll, which will build an index of your PDF files (and more) the first time you run it. After the index is built, word searches should be very fast; phrase searches should be reasonable. Make sure the pdftotext command is installed before you start Recoll; under Debian and Ubuntu, it's in the poppler-utils package. I don't know about SUSE.
Or you could directly convert the files to text and use grep on the text files with the commands below.
find . -name '*.pdf' -exec pdftotext {} \;
grep -r --include '*.txt' -l -F "exact phrase to search" .
grep -r --include '*.txt' -l -E "regular expression to search" .
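If you search the same directory repeatedly, converting every PDF each time is wasted work. The sketch below converts a PDF only when its text file is missing or older than the PDF. The function name `update_text_cache` is hypothetical, and the `-nt` file test is assumed to be available (it is in bash, dash, and ksh, but older POSIX specifications do not require it).

```shell
# Hypothetical helper: update_text_cache DIR
# Writes file.txt next to each file.pdf, skipping up-to-date ones.
update_text_cache() {
    find "$1" -type f -name '*.pdf' -exec sh -c '
        for pdf in "$@"; do
            txt="${pdf%.pdf}.txt"
            # Convert only when the text file is missing or older than the PDF.
            if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
                pdftotext "$pdf" "$txt"
            fi
        done
    ' sh {} +
}
```

After running `update_text_cache .`, the two grep commands above search the cached text files as before.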
Solution 3:
Adobe Reader X does the job: it can search a whole directory and its subdirectories, not only a single file. However, it is not a command-line program.