Word count for multiple .txt files in linux
I need to count the words in multiple .txt files using the Linux CLI. Currently I am using the following command:
cat *.txt|wc -w
I made a test directory to practice the command, and it works for each individual .txt file but gives the wrong total across all the .txt files.
The directory contains 5 files: 4 of them contain 5 words each and 1 is empty.
For an individual file, cat textfile.txt | wc -w
gives the right answer.
But for all files combined it gives 17, when it should be (4 times 5 + 1 times 0 =) 20.
Can someone tell me why the count given is 17 while the real count is 20?
Solution 1:
You can run
wc -w *.txt
This will give you the word count for each file and a total sum in the last row.
As it turned out, the OP's issue was a missing final newline in the files. This caused cat *.txt
to join the last word of one file with the first word of the next, resulting in a lower count.
The command above is more robust in this situation because it processes each file individually.
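If you do want a single combined count from cat, one common fix is to append the missing final newline first. A minimal sketch (the '$a\' idiom assumes GNU sed; BSD sed behaves differently, and the filenames here are just for the demo):

```shell
# Demo setup: four 5-word files without a final newline, plus an empty file.
cd "$(mktemp -d)"
for i in 1 2 3 4; do printf 'a b c d e' > "$i.txt"; done
printf '' > 5.txt

# GNU sed idiom: '$a\' appends a newline to the last line only if it is missing.
sed -i -e '$a\' *.txt

# With every file newline-terminated, cat no longer glues words together.
cat *.txt | wc -w
```

After the sed pass, the combined count matches the per-file sum (20 here).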
Solution 2:
The most likely explanation is that the final lines of your files are not newline-terminated, so that when you cat
them, the first word of each file gets appended to the last word of the previous file:
Ex. given
steeldriver@pc:~$ printf 'foo\nbar\nbaz\nbam\nboo' | tee {1..4}.txt
foo
bar
baz
bam
boosteeldriver@pc:~$ printf '' > 5.txt
then
steeldriver@pc:~$ wc -w {1..5}.txt
5 1.txt
5 2.txt
5 3.txt
5 4.txt
0 5.txt
20 total
but
steeldriver@pc:~$ cat {1..5}.txt | wc -w
17
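To see exactly where 17 comes from: at each of the three boundaries between consecutive non-empty files (1|2, 2|3, 3|4), the trailing "boo" and the leading "foo" fuse into the single word "boofoo", so wc sees 20 - 3 = 17 words. A self-contained reproduction of the transcript above:

```shell
# Recreate the answer's files in a scratch directory:
# four files of 5 words each, none newline-terminated, plus an empty fifth file.
cd "$(mktemp -d)"
for i in 1 2 3 4; do printf 'foo\nbar\nbaz\nbam\nboo' > "$i.txt"; done
printf '' > 5.txt

# The three boundaries between non-empty files each merge "boo"+"foo" into
# "boofoo", so the combined stream has 20 - 3 = 17 whitespace-separated words.
cat {1..5}.txt | wc -w    # prints 17
```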