Difference of two big files

Solution 1:

Sounds like a job for comm:

$ comm -3 <(sort test1.csv) <(sort test2.csv)
100,300,500,700
    100,4,2,1,7
    21,22,23,24,25
    50,25,700,5

As explained in man comm:

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

So -3 means that only lines unique to one of the files are printed. Lines unique to the second file are indented with a leading tab, which shows which file each line came from. To remove the tab (note that tr deletes every tab, so this assumes the data itself contains none), use:

$ comm -3 <(sort test1.csv) <(sort test2.csv) | tr -d '\t'
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

comm expects sorted input. If both files are already sorted, you don't need the sort step and can simplify the above to:

comm -3 test1.csv test2.csv | tr -d '\t' > difference.csv
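
In a shell without process substitution, the same comm approach can be sketched with temporary files. The sample rows and the names a.sorted and b.sorted are made up for illustration, since the real contents of test1.csv and test2.csv are not shown here. LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real CSV files.
tmp=$(mktemp -d)
printf '%s\n' '1,2,3,4' '100,300,500,700' > "$tmp/test1.csv"
printf '%s\n' '50,25,700,5' '1,2,3,4' '21,22,23,24,25' > "$tmp/test2.csv"

# comm expects sorted input, so sort each file into a temporary copy.
LC_ALL=C sort "$tmp/test1.csv" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/test2.csv" > "$tmp/b.sorted"

# -3 drops lines common to both files; tr removes the tab that
# marks lines coming from the second file.
difference=$(LC_ALL=C comm -3 "$tmp/a.sorted" "$tmp/b.sorted" | tr -d '\t')
printf '%s\n' "$difference"

rm -rf "$tmp"
```

With this sample data, the row 1,2,3,4 is common to both files and is dropped; the remaining three rows are printed.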

Solution 2:

Using grep with bash process substitution:

$ cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv)
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

To save the output as results.csv:

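cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv) >results.csv

How it works:

  • <() is bash process substitution; it lets the output of a command be read as if it were a file

  • grep -vFf test2.csv test1.csv prints the lines of test1.csv that match none of the lines of test2.csv (-f reads the patterns from a file, -F treats them as fixed strings, -v inverts the match)

  • grep -vFf test1.csv test2.csv does the same for the lines found only in test2.csv

  • Finally, cat concatenates the two results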

Or, as Oli suggested, you can use command grouping:

$ { grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv; }
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

Or just run one after the other; both write to STDOUT, so the second command's output simply follows the first's:

$ grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
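
One caveat with the grep approach: without -x, each -F pattern matches as a substring, so a short row in one file can hide a longer row in the other. A small sketch with made-up rows (hypothetical data, not the files from the question) contrasts the two behaviours:

```shell
#!/bin/sh
# Hypothetical sample data chosen to trigger the substring problem.
tmp=$(mktemp -d)
printf '%s\n' '1,2,3' '7,8,9' > "$tmp/test1.csv"
printf '%s\n' '1,2' '7,8,9' > "$tmp/test2.csv"

# Without -x: the pattern "1,2" matches inside "1,2,3",
# so that row is filtered out even though it is unique to test1.csv.
# grep exits non-zero when nothing is printed, hence "|| true".
loose=$(grep -vFf "$tmp/test2.csv" "$tmp/test1.csv" || true)

# With -x: a pattern must match the whole line, so "1,2,3" survives.
strict=$(grep -vxFf "$tmp/test2.csv" "$tmp/test1.csv" || true)

printf 'without -x: [%s]\n' "$loose"
printf 'with -x:    [%s]\n' "$strict"

rm -rf "$tmp"
```

If rows in your files can be substrings of other rows, adding -x to each grep is the safer choice.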

Solution 3:

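If the order of rows is not relevant, use awk or perl. With awk, count how often each line occurs across both files and print the lines seen exactly once: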

awk '{seen[$0]++} END {for (i in seen) {if (seen[i] == 1) {print i}}}' 1.csv 2.csv
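
As a quick illustration, the awk one-liner can be tried on made-up data (the rows below are hypothetical). awk does not define the iteration order of for (... in ...), so this sketch pipes the result through sort to get stable output:

```shell
#!/bin/sh
# Hypothetical sample data standing in for 1.csv and 2.csv.
tmp=$(mktemp -d)
printf '%s\n' '1,2,3,4' '100,300,500,700' > "$tmp/1.csv"
printf '%s\n' '50,25,700,5' '1,2,3,4' > "$tmp/2.csv"

# Count every line across both files, then print those counted once.
# sort is only there to make the output order predictable.
unique=$(awk '{seen[$0]++} END {for (line in seen) if (seen[line] == 1) print line}' \
    "$tmp/1.csv" "$tmp/2.csv" | LC_ALL=C sort)
printf '%s\n' "$unique"

rm -rf "$tmp"
```

Note that a line repeated within a single file also gets a count above 1 and is dropped, even if the other file does not contain it.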

Alternatively, use grep to find the common lines and filter those out:

grep -hxvFf <(grep -Fxf 1.csv 2.csv) 1.csv 2.csv

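The inner grep collects the lines common to both files (-F for fixed strings, -x for whole-line matches, -f to read the patterns from a file). The outer grep then uses -v to print every line from either file that is not one of those common lines, and -h suppresses the file-name prefix that grep adds when given several files.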
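
The same two-step idea can be sketched with an intermediate file instead of process substitution, which also makes each step easy to inspect. The sample rows and the name common.txt are assumptions made for illustration:

```shell
#!/bin/sh
# Hypothetical sample data standing in for 1.csv and 2.csv.
tmp=$(mktemp -d)
printf '%s\n' '1,2,3,4' '100,300,500,700' > "$tmp/1.csv"
printf '%s\n' '50,25,700,5' '1,2,3,4' > "$tmp/2.csv"

# Step 1: lines of 2.csv that also occur, as whole lines, in 1.csv.
# grep exits non-zero if there are no common lines, hence "|| true".
grep -Fxf "$tmp/1.csv" "$tmp/2.csv" > "$tmp/common.txt" || true

# Step 2: every line from either file that is not a common line.
only_once=$(grep -hxvFf "$tmp/common.txt" "$tmp/1.csv" "$tmp/2.csv" || true)
printf '%s\n' "$only_once"

rm -rf "$tmp"
```

Unlike the awk version, this keeps the rows in their original order, first those from 1.csv and then those from 2.csv.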