How to grep for two words existing on the same line? [duplicate]

grep

How do I grep for lines that contain two input words on the line? I'm looking for lines that contain both words, how do I do that? I tried pipe like this:

grep -c "word1" | grep -r "word2" logs

It just stucks after the first pipe command.

Why?

Solution 1:

Why do you pass -c? That will just show the number of matches. Similarly, there is no reason to use -r. I suggest you read man grep.

To grep for 2 words existing on the same line, simply do:

grep "word1" FILE | grep "word2"

grep "word1" FILE will print all lines that have word1 in them from FILE, and then grep "word2" will print the lines that have word2 in them. Hence, if you combine these using a pipe, it will show lines containing both word1 and word2.

If you just want a count of how many lines had the 2 words on the same line, do:

grep "word1" FILE | grep -c "word2"

Also, to address your question why does it get stuck : in grep -c "word1", you did not specify a file. Therefore, grep expects input from stdin, which is why it seems to hang. You can press Ctrl+D to send an EOF (end-of-file) so that it quits.

Solution 2:

Prescription

One simple rewrite of the command in the question is:

grep "word1" logs | grep "word2"

The first grep finds lines with 'word1' from the file 'logs' and then feeds those into the second grep which looks for lines containing 'word2'.

However, it isn't necessary to use two commands like that. You could use extended grep (grep -E or egrep):

grep -E 'word1.*word2|word2.*word1' logs

If you know that 'word1' will precede 'word2' on the line, you don't even need the alternatives and regular grep would do:

grep 'word1.*word2' logs

The 'one command' variants have the advantage that there is only one process running, and so the lines containing 'word1' do not have to be passed via a pipe to the second process. How much this matters depends on how big the data file is and how many lines match 'word1'. If the file is small, performance isn't likely to be an issue and running two commands is fine. If the file is big but only a few lines contain 'word1', there isn't going to be much data passed on the pipe and using two command is fine. However, if the file is huge and 'word1' occurs frequently, then you may be passing significant data down the pipe where a single command avoids that overhead. Against that, the regex is more complex; you might need to benchmark it to find out what's best — but only if performance really matters. If you run two commands, you should aim to select the less frequently occurring word in the first grep to minimize the amount of data processed by the second.

Diagnosis

The initial script is:

grep -c "word1" | grep -r "word2" logs

This is an odd command sequence. The first grep is going to count the number of occurrences of 'word1' on its standard input, and print that number on its standard output. Until you indicate EOF (e.g. by typing Control-D), it will sit there, waiting for you to type something. The second grep does a recursive search for 'word2' in the files underneath directory logs (or, if it is a file, in the file logs). Or, in my case, it will fail since there's neither a file nor a directory called logs where I'm running the pipeline. Note that the second grep doesn't read its standard input at all, so the pipe is superfluous.

With Bash, the parent shell waits until all the processes in the pipeline have exited, so it sits around waiting for the grep -c to finish, which it won't do until you indicate EOF. Hence, your code seems to get stuck. With Heirloom Shell, the second grep completes and exits, and the shell prompts again. Now you have two processes running, the first grep and the shell, and they are both trying to read from the keyboard, and it is not determinate which one gets any given line of input (or any given EOF indication).

Note that even if you typed data as input to the first grep, you would only get any lines that contain 'word2' shown on the output.

Footnote:

At one time, the answer used:

grep -E 'word1.*word2|word2.*word1' "$@"
grep 'word1.*word2' "$@"

This triggered the comments below.

Solution 3:

you could use awk. like this...

cat <yourFile> | awk '/word1/ && /word2/'

Order is not important. So if you have a file and...

a file named , file1 contains:

word1 is in this file as well as word2
word2 is in this file as well as word1
word4 is in this file as well as word1
word5 is in this file as well as word2

then,

/tmp$ cat file1| awk '/word1/ && /word2/'

will result in,

word1 is in this file as well as word2
word2 is in this file as well as word1

yes, awk is slower.

Solution 4:

The main issue is that you haven't supplied the first grep with any input. You will need to reorder your command something like

grep "word1" logs | grep "word2"

If you want to count the occurences, then put a '-c' on the second grep.