Difference of two big files
Solution 1:
Sounds like a job for comm:
$ comm -3 <(sort test1.csv) <(sort test2.csv)
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
As explained in man comm:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
So, -3 means that only lines unique to one of the files will be printed. However, the output is still split into columns: lines unique to the first file have no indent, while lines unique to the second are prefixed with a tab. To remove the tab, use:
$ comm -3 <(sort test1.csv) <(sort test2.csv) | tr -d '\t'
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
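To see where that tab comes from, here is comm without -3 on two tiny files whose names and contents are invented purely for illustration. Column 1 (no indent) holds lines unique to the first file, column 2 (one tab) lines unique to the second, and column 3 (two tabs) lines common to both:
$ cat a.txt
apple
banana
cherry
$ cat b.txt
banana
cherry
date
$ comm a.txt b.txt
apple
		banana
		cherry
	date
$ comm -3 a.txt b.txt
apple
	date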
In this case, the files are already in sorted order (which comm expects), so you don't really even need to sort them and can simplify the above to:
comm -3 test1.csv test2.csv | tr -d '\t' > difference.csv
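If you're not sure whether your files are already in order, sort's -c option checks a file and reports the first out-of-order line, so a quick test before skipping the sort might look like:
$ sort -c test1.csv && echo test1.csv is sorted
$ sort -c test2.csv && echo test2.csv is sorted
If either check fails, stick with the <(sort ...) version above.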
Solution 2:
Using grep with bash process substitution:
$ cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv)
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
To save the output as results.csv:
cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv) >results.csv
<() is the bash process substitution pattern.
grep -vFf test2.csv test1.csv finds the lines that appear only in test1.csv.
grep -vFf test1.csv test2.csv finds the lines that appear only in test2.csv.
Finally, cat concatenates the two results.
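One caveat worth knowing about this approach: -F matches its patterns anywhere inside a line, so a short line that happens to occur as a substring of a longer line in the other file would be filtered out even though it isn't an exact duplicate. Adding -x restricts grep to whole-line matches; a safer variant of the command above might look like:
cat <(grep -vxFf test2.csv test1.csv) <(grep -vxFf test1.csv test2.csv) > results.csv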
Or, as Oli suggested, you can also use command grouping:
$ { grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv; }
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
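A nice side effect of the grouping is that a single redirection captures the combined output (results.csv is just an example name):
{ grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv; } > results.csv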
Or just run one after the other; since both commands write to STDOUT, their output simply ends up concatenated:
$ grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
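To capture that combined output in a file instead, make the second command append so it doesn't overwrite the first (again, results.csv is just an example name):
grep -vFf test2.csv test1.csv  > results.csv
grep -vFf test1.csv test2.csv >> results.csv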
Solution 3:
If the order of rows is not relevant, use awk or perl:
awk '{seen[$0]++} END {for (i in seen) {if (seen[i] == 1) {print i}}}' 1.csv 2.csv
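The awk script counts how many times each whole line occurs across both files and prints the lines seen exactly once (so it assumes neither file repeats a line internally). Since perl is mentioned as an alternative, a rough perl equivalent might look like this:
perl -ne '$seen{$_}++; END { print for grep { $seen{$_} == 1 } keys %seen }' 1.csv 2.csv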
Use grep to get the common lines and filter those out:
grep -hxvFf <(grep -Fxf 1.csv 2.csv) 1.csv 2.csv
The inner grep finds the common lines, then the outer grep prints the lines from both files that don't match any of those common lines.
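Broken into two explicit steps with a temporary file (common.csv is just an example name), the same approach looks like:
grep -Fxf 1.csv 2.csv > common.csv
grep -hxvFf common.csv 1.csv 2.csv
Here -F treats the patterns as fixed strings, -x requires whole-line matches, -v inverts the match, and -h suppresses the filename prefixes grep would otherwise add when searching more than one file.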