How to grep for two words existing on the same line? [duplicate]
How do I grep for lines that contain both of two given words? I tried a pipe like this:
grep -c "word1" | grep -r "word2" logs
It just gets stuck after the first pipe command.
Why?
Solution 1:
Why do you pass -c? That will just show the number of matching lines. Similarly, there is no reason to use -r. I suggest you read man grep.
To grep for 2 words existing on the same line, simply do:
grep "word1" FILE | grep "word2"
grep "word1" FILE will print all lines that have word1 in them from FILE, and then grep "word2" will print the lines that have word2 in them. Hence, if you combine these using a pipe, it will show lines containing both word1 and word2.
If you just want a count of the lines that contain both words, do:
grep "word1" FILE | grep -c "word2"
Also, to address your question of why it gets stuck: in grep -c "word1", you did not specify a file. Therefore, grep expects input from stdin, which is why it seems to hang. You can press Ctrl+D to send an EOF (end-of-file) so that it quits.
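A small sketch of that stdin behaviour, assuming a POSIX shell; the sample text is made up for illustration. With no file operand, grep reads whatever arrives on standard input, so feeding it from a pipe or from /dev/null lets it finish instead of waiting on the keyboard:

```shell
# With no file operand, grep reads standard input; here the pipe supplies it.
piped_count=$(printf 'word1 here\nnothing here\n' | grep -c "word1")
echo "$piped_count"

# Redirecting from /dev/null gives grep an immediate EOF, so it cannot hang.
# grep exits with status 1 when nothing matches, hence the "|| true".
empty_count=$(grep -c "word1" </dev/null || true)
echo "$empty_count"
```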
Solution 2:
Prescription
One simple rewrite of the command in the question is:
grep "word1" logs | grep "word2"
The first grep finds lines with 'word1' from the file 'logs' and then feeds those into the second grep, which looks for lines containing 'word2'.
However, it isn't necessary to use two commands like that. You could use extended grep (grep -E or egrep):
grep -E 'word1.*word2|word2.*word1' logs
If you know that 'word1' will precede 'word2' on the line, you don't even need the alternatives and regular grep would do:
grep 'word1.*word2' logs
The 'one command' variants have the advantage that there is only one process running, and so the lines containing 'word1' do not have to be passed via a pipe to the second process. How much this matters depends on how big the data file is and how many lines match 'word1'. If the file is small, performance isn't likely to be an issue and running two commands is fine. If the file is big but only a few lines contain 'word1', there isn't going to be much data passed on the pipe and using two commands is fine. However, if the file is huge and 'word1' occurs frequently, then you may be passing significant data down the pipe, where a single command avoids that overhead. Against that, the regex is more complex; you might need to benchmark it to find out what's best, but only if performance really matters. If you run two commands, you should aim to select the less frequently occurring word in the first grep to minimize the amount of data processed by the second.
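To check that the two-command and one-command forms agree, here is a minimal sketch on a made-up four-line file. The temporary file and its contents are assumptions for illustration; the 'logs' file in the question would be real data.

```shell
# Hypothetical sample data standing in for the 'logs' file.
sample=$(mktemp)
printf '%s\n' \
  'word1 then word2' \
  'word2 then word1' \
  'only word1 here' \
  'only word2 here' > "$sample"

# Two processes joined by a pipe.
two_step=$(grep "word1" "$sample" | grep "word2")

# One process, with both orderings spelled out in an extended regex.
one_step=$(grep -E 'word1.*word2|word2.*word1' "$sample")

echo "$one_step"
rm -f "$sample"
```

One caveat: the two forms can disagree when the words overlap. For the words 'ab' and 'bc', the line 'abc' passes both greps in the pipeline but does not match 'ab.*bc|bc.*ab'.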
Diagnosis
The initial script is:
grep -c "word1" | grep -r "word2" logs
This is an odd command sequence. The first grep is going to count the number of lines containing 'word1' on its standard input, and print that number on its standard output. Until you indicate EOF (e.g. by typing Control-D), it will sit there, waiting for you to type something. The second grep does a recursive search for 'word2' in the files underneath directory logs (or, if it is a file, in the file logs). Or, in my case, it will fail since there's neither a file nor a directory called logs where I'm running the pipeline. Note that the second grep doesn't read its standard input at all, so the pipe is superfluous.
With Bash, the parent shell waits until all the processes in the pipeline have exited, so it sits around waiting for the grep -c to finish, which it won't do until you indicate EOF. Hence, your code seems to get stuck. With Heirloom Shell, the second grep completes and exits, and the shell prompts again. Now you have two processes running, the first grep and the shell, and they are both trying to read from the keyboard, and it is indeterminate which one gets any given line of input (or any given EOF indication).
Note that even if you typed data as input to the first grep, the only output you would see is the lines containing 'word2' that the second grep finds in logs.
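A minimal sketch of why the pipe is superfluous, using a hypothetical temporary file in place of logs and plain grep rather than -r. Because the second grep is given a file operand, it never reads what the pipe delivers:

```shell
# Hypothetical stand-in for the 'logs' file from the question.
logfile=$(mktemp)
printf 'word2 from the file\n' > "$logfile"

# The text on the pipe contains word2 as well, but grep was given a file
# operand, so it never reads standard input and that line is not printed.
from_pipeline=$(printf 'word1 and word2 from stdin\n' | grep "word2" "$logfile")
echo "$from_pipeline"
rm -f "$logfile"
```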
Footnote:
At one time, the answer used:
grep -E 'word1.*word2|word2.*word1' "$@"
grep 'word1.*word2' "$@"
This triggered the comments below.
Solution 3:
You could use awk, like this:
awk '/word1/ && /word2/' yourFile
Order is not important. So if you have a file named file1 that contains:
word1 is in this file as well as word2
word2 is in this file as well as word1
word4 is in this file as well as word1
word5 is in this file as well as word2
then,
/tmp$ awk '/word1/ && /word2/' file1
will result in,
word1 is in this file as well as word2
word2 is in this file as well as word1
Yes, awk is slower.
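If the two words come from shell variables or contain regex metacharacters, one possible variant passes them in with -v and matches them as literal substrings with index(). This is a sketch rather than part of the original answer; the variable names a and b and the temporary file are assumptions for illustration.

```shell
# Hypothetical sample file mirroring the four lines shown above.
sample=$(mktemp)
printf '%s\n' \
  'word1 is in this file as well as word2' \
  'word2 is in this file as well as word1' \
  'word4 is in this file as well as word1' \
  'word5 is in this file as well as word2' > "$sample"

# index() returns a non-zero position when the substring occurs in the line,
# so the words are treated as literal text rather than as regular expressions.
awk_matches=$(awk -v a="word1" -v b="word2" 'index($0, a) && index($0, b)' "$sample")
echo "$awk_matches"
rm -f "$sample"
```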
Solution 4:
The main issue is that you haven't supplied the first grep with any input. You will need to reorder your command to something like:
grep "word1" logs | grep "word2"
If you want to count the matching lines, then put a '-c' on the second grep.