How can I compare data in 2 files to identify common and unique data?

How can I compare the data in 2 files to identify what is common and what is unique? I can't do it line by line, because file 1 contains, say, 100 IDs/codes/number sets, and I want to compare file 2 against file 1.

The thing is that file 2 contains a subset of data in file 1 and also data unique to file 2, for example:

file 1      file 2
1            1
2            a
3            2
4            b
5            3 
6            c

How can I compare both files to identify the data that is common to both and the data that is unique to each file? diff can't seem to do the job.


Whether or not file1 and file2 are sorted, you can use awk as follows:

unique data in file1:

awk 'NR==FNR{a[$0];next}!($0 in a)' file2 file1
4
5
6

unique data in file2:

awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2
a
b
c

common data:

awk 'NR==FNR{a[$0];next} ($0 in a)' file1 file2
1
2
3

Explanation:

NR==FNR    - True only while the 1st file is being read, so the next block runs for the 1st file only
a[$0]      - Create an entry in the associative array `a` with '$0' (the whole line) as its key; no value is stored.
next       - Skip the rest of the program and move to the next line
($0 in a)  - For each line of the 2nd file, test whether it was saved as a key in the `a` array:
             print the common lines from 1st and 2nd file with '($0 in a)' file1 file2
             or unique lines in file1 only with '!($0 in a)' file2 file1
             or unique lines in file2 only with '!($0 in a)' file1 file2
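
As a rough sketch of the explanation above, the same awk program can be written out in long form with a comment on each step. The sample files are recreated from the question; the scratch directory and the helper names `common_lines` and `only_in_second` are made up for this illustration and are not part of awk.

```shell
#!/bin/sh
# Sample data from the question, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 > "$workdir/file1"
printf '%s\n' 1 a 2 b 3 c > "$workdir/file2"

# common_lines A B: print lines of B that also occur in A (hypothetical helper name).
common_lines() {
    awk '
        NR == FNR {   # true only while the first file is being read
            a[$0]     # referencing the element creates the key; the value stays empty
            next      # skip the rule below for lines of the first file
        }
        $0 in a       # second file: print the line if the first file had it
    ' "$1" "$2"
}

# only_in_second A B: print lines of B that do not occur in A (hypothetical helper name).
only_in_second() {
    awk 'NR == FNR { a[$0]; next } !($0 in a)' "$1" "$2"
}

common_lines   "$workdir/file1" "$workdir/file2"   # prints 1, 2, 3
only_in_second "$workdir/file1" "$workdir/file2"   # prints a, b, c
only_in_second "$workdir/file2" "$workdir/file1"   # prints 4, 5, 6
```

One caveat with this idiom: if the first file is empty, NR==FNR is also true for the second file, so nothing is printed.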

This is what comm is for:

$ comm <(sort file1) <(sort file2)
        1
        2
        3
4
5
6
    a
    b
    c

The first column is lines only appearing in file 1
The second column is lines only appearing in file 2
The third column is lines common to both files

comm requires the input files to be sorted, which is why each file is passed through sort above.

To suppress a column, pass its number as an option. For example, to see only the lines in common, use comm -12 ..., or to see the lines that are only in file2, use comm -13 ...
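
A minimal sketch of the three column selections, using the sample data from the question. It sorts into scratch files instead of using <(...) so that it also runs in a plain POSIX sh; the scratch directory and output file names are assumptions made for this example.

```shell
#!/bin/sh
# Use one collation order for both sort and comm so they agree on what "sorted" means.
LC_ALL=C
export LC_ALL

# Sample data from the question, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 > "$workdir/file1"
printf '%s\n' 1 a 2 b 3 c > "$workdir/file2"

# comm expects sorted input, so sort each file once into a scratch copy.
sort "$workdir/file1" > "$workdir/file1.sorted"
sort "$workdir/file2" > "$workdir/file2.sorted"

# Each option digit suppresses that column; what is left is the column we want.
comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/only1"    # only in file1
comm -13 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/only2"    # only in file2
comm -12 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/common"   # in both files

cat "$workdir/only1"    # prints 4, 5, 6
cat "$workdir/only2"    # prints a, b, c
cat "$workdir/common"   # prints 1, 2, 3
```

With a single column left, comm prints it without any leading tabs, so the result can be piped straight into other tools.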