Count the number of words in a PDF file
Solution 1:
Quick Answer:
pdftotext myfile.pdf - | wc -w
Long Answer:
If you are on Unix, you can use pdftotext (http://linux.about.com/od/commands/l/blcmdl1_pdftote.htm) to convert the PDF to a plain-text file:
pdftotext myfile.pdf converted-pdf.txt
Then count the words in the generated file:
wc -w converted-pdf.txt
Also, see the comment by frabjous: you can do it in one step by giving - as the output file, which sends the text to stdout instead of a temporary file, and piping that into wc:
pdftotext myfile.pdf - | wc -w
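If you need the count for several files at once, the one-liner can be wrapped in a small shell function. This is only a sketch: pdf_word_count is a made-up name, and it assumes pdftotext is installed and on your PATH.

```shell
# pdf_word_count FILE...: print "<count> <file>" for each PDF given.
# (Hypothetical helper name; assumes pdftotext is available.)
pdf_word_count() {
    for f in "$@"; do
        # "-" makes pdftotext write the extracted text to stdout;
        # tr removes the padding some wc implementations add around the number
        n=$(pdftotext "$f" - | wc -w | tr -d '[:space:]')
        printf '%s %s\n' "$n" "$f"
    done
}

# Usage (file names are placeholders):
#   pdf_word_count chapter1.pdf chapter2.pdf
```

The function only repeats the pipe above once per file, so everything said elsewhere about formulas and headings inflating the count still applies.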
Solution 2:
This is a hard task that is not easy to solve exactly. If you really want an exact result, copy the text paragraph by paragraph from your PDF viewer into a text file and check it with the wc -w tool. The reason not to use pdftotext in that case is that mathematical formulas may also end up in the output and be counted as "words". (Alternatively, you could edit the output you get from pdftotext.) Another reason it may fail is headings: "4.3.2 Foo Bar" is counted as three words.
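To see the heading problem without a PDF at hand, you can feed the sample heading straight to wc. This is a small illustration; the variable name heading is mine, and the text is just the example from above, not real pdftotext output.

```shell
# Sample heading as it might appear in extracted text (illustrative only)
heading='4.3.2 Foo Bar'

# wc -w splits on whitespace only, so the section number counts as a word:
# prints 3 (some wc implementations pad the number with spaces)
printf '%s\n' "$heading" | wc -w

# Keeping only tokens that begin with a letter drops the section number:
# prints 2
printf '%s\n' "$heading" | tr -s ' ' '\n' | grep -c '^[A-Za-z]'
```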
A way around this is to count only words that start with a character from [A-Za-z]. So what I usually do is a two-step approach:
- Get the list of unique words and check whether it contains too many false positives:
pdftotext foo.pdf - | tr " " "\n" | sort | uniq | grep "^[A-Za-z]" > words
I don't use a dictionary here, because misspelled words would then not be counted as words.
- Take this word list and grep for it in the output of pdftotext:
pdftotext foo.pdf - | tr " " "\n" | grep -Ff words | wc -l
I know this could be done in a one-liner, but then I could not easily inspect the filter result from the first step. The -F option (treat the patterns as fixed strings rather than regular expressions) may help you, as pointed out in the comment by moi below (thanks).
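The two steps above can be sketched as a pair of shell functions that read the extracted text on stdin. The names list_words and count_listed_words are made up for this example. It also differs from the commands above in two ways, both my own assumptions about what is wanted: it splits on any whitespace rather than only spaces, and it adds -x to grep so that a list entry must match a whole token (with -F alone, a short entry such as "a" also matches every token that merely contains that letter).

```shell
# Step 1: print the unique tokens that start with a letter, one per line.
# (Hypothetical helper; reads extracted text on stdin.)
list_words() {
    tr -s '[:space:]' '\n' | grep '^[A-Za-z]' | sort -u
}

# Step 2: count the tokens on stdin that appear in the word list file $1.
# -F: patterns are fixed strings; -x: a pattern must match the whole token
# (-x is an addition, not part of the original command); -c: print the count.
count_listed_words() {
    tr -s '[:space:]' '\n' | grep -c -F -x -f "$1"
}

# Usage (foo.pdf and words are placeholders):
#   pdftotext foo.pdf - | list_words > words
#   ...inspect words and delete the false positives...
#   pdftotext foo.pdf - | count_listed_words words
```

Keeping the list in a file between the two steps preserves the point of the original approach: you can look at the list, and remove entries from it, before anything is counted.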
Solution 3:
I just tried out a free program, Translator's Abacus. You can drag and drop various file types (including PDF), and it pops up a browser with a printable report of the word count for each document. It worked fine for me. (It is specifically created for word counts and is only 435 KB, so it is not a "big application".) Note, however, that Translator's Abacus doesn't work on PDF 1.5 or later.
Alternatively, you can just press Ctrl+A to select all text in Acrobat Reader and then copy and paste it into a program like Microsoft Word (which shows a word count on the status bar at the bottom of the screen).
Solution 4:
A straightforward way to do this if you are using Acrobat Pro is to export the PDF to a Microsoft Word document and then do the word count in Word. Alternatively, you can export it to a plain text file and use the word count utility in the text editor of your choice. I just did a word count on a PDF article using the Word method, and it took all of 30 seconds to complete.
Hope this helps.