Comparing two files and printing what didn't match
Hello, I have two files containing filenames that look like this:
File 1:
123.txt
456.txt
789.txt
101112.txt
File 2:
123.txt
789.txt
101112.txt
Is there any bash command that I can use to compare them and print only those lines or file names that didn't match? So I am expecting something like this:
456.txt
comm is your friend here:
If the files are sorted already:
comm -3 f1.txt f2.txt
If not sorted, sort them first and pass them as file descriptors using process substitution (so that we don't need any temporary files):
comm -3 <(sort f1.txt) <(sort f2.txt)
Example:
% cat f1.txt
123.txt
456.txt
789.txt
101112.txt
% cat f2.txt
123.txt
789.txt
101112.txt
% comm -3 <(sort f1.txt) <(sort f2.txt)
456.txt
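Note that comm prints lines unique to the second file in a second column, indented with a tab; that doesn't show up here because only f1.txt has an unmatched line. If both files have unmatched lines and you want a flat list, one option is to strip the tabs (a small sketch; it would also remove tabs embedded in the lines themselves, which is rarely an issue for filenames):
comm -3 <(sort f1.txt) <(sort f2.txt) | tr -d '\t'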
A simple approach would be to use two grep commands, each of which takes one of the files as a list of lines to search for in the other. Assuming your files are named f1.txt and f2.txt:
grep -Fxvf f1.txt f2.txt ; grep -Fxvf f2.txt f1.txt
The grep options used are as follows:
- -F: Use each line as a fixed string to match, rather than a regular expression
- -x: Only match whole lines
- -v: Invert the match to select non-matching lines
- -f: Use the file given as an argument as a list of patterns to match
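As a quick sanity check, here is what the two grep commands print for the sample files from the question (assuming they are saved as f1.txt and f2.txt):
% grep -Fxvf f1.txt f2.txt ; grep -Fxvf f2.txt f1.txt
456.txt
The first grep prints lines of f2.txt that are missing from f1.txt (none here); the second prints lines of f1.txt that are missing from f2.txt.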
I understand your question to mean that you want all lines which appear in only one of the files, not both, disregarding the line order. I also assume we compare the files f1.txt and f2.txt; insert your respective file names instead.
Using Bash, you could do it with two loops, each of which processes one file and checks whether each of its lines appears in the other. This approach is not very efficient, but it should work:
# This loops over f1.txt and searches for each line in f2.txt
while IFS= read -r line ; do grep -Fxqe "$line" f2.txt || echo "$line" ; done < f1.txt
# This loops over f2.txt and searches for each line in f1.txt
while IFS= read -r line ; do grep -Fxqe "$line" f1.txt || echo "$line" ; done < f2.txt
Both loops together produce the desired result; each on its own only finds the lines in one file that don't appear in the other.
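If you need this more than once, the two loops can be wrapped in a small function. This is just a sketch; the name sym_diff is my own choice, not a standard utility:
sym_diff() {
  # Print lines of "$1" missing from "$2", then lines of "$2" missing from "$1"
  while IFS= read -r line ; do grep -Fxqe "$line" "$2" || echo "$line" ; done < "$1"
  while IFS= read -r line ; do grep -Fxqe "$line" "$1" || echo "$line" ; done < "$2"
}
sym_diff f1.txt f2.txt   # with the question's sample files, prints: 456.txt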
A neater solution could be written e.g. with a short Python one-liner:
python3 -c 's1=set(open("f1.txt")); s2=set(open("f2.txt")); print(*s1.symmetric_difference(s2), sep="", end="")'
This uses a Set data structure, which only contains unique values and allows set operations like "symmetric difference".
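Running it on the question's sample files prints the expected line (keep in mind that sets are unordered, so with several differing lines the output order may vary):
% python3 -c 's1=set(open("f1.txt")); s2=set(open("f2.txt")); print(*s1.symmetric_difference(s2), sep="", end="")'
456.txt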
Note that with both solutions, if either file contains duplicate lines, they are ignored and treated as a single occurrence.
Assuming you don't need the results to remain in the original order, just use:
cat file1 file2 | sort | uniq -u
Explanation:
cat file1 file2
Outputs both files to standard output, one after the other.
sort
Sorts the combined contents of the two files. The useful side effect that we're interested in is that this puts identical lines from both files right next to each other.
uniq -u
Outputs only the lines that are "unique", i.e. that only occur once. Annoyingly enough, this only looks at pairs of adjacent lines, which is why the previous sort command is necessary.
You can also use uniq -d to output only the lines that occur twice. This will give you the lines that are common to both files.
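For example, with the question's sample contents saved as file1 and file2:
% cat file1 file2 | sort | uniq -d
101112.txt
123.txt
789.txt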
NOTE: this solution assumes that no line occurs more than once within the same file. For example, a line that appears twice in file1 but not at all in file2 will be suppressed by uniq -u, because it occurs twice in the combined output.
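If within-file duplicates are a concern, one way to guard against them is to deduplicate each file first, so that a line repeated in a single file can no longer cancel itself out. A sketch using sort -u and process substitution:
cat <(sort -u file1) <(sort -u file2) | sort | uniq -u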