Bash Script: count unique lines in file

Situation:

I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:

ip.ad.dre.ss[:port]

Desired result:

There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to run this through a shell script of some sort that reduces it to lines of the format

ip.ad.dre.ss[:port] count

where count is the number of occurrences of that specific address (and port). No special handling is needed; just treat different ports as different addresses.

So far, I'm using this command to scrape all of the ip addresses from the log file:

grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt

From that, I can use a fairly simple regex to filter out all of the IP addresses that were sent by my own address (which I don't care about), for example:
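(Sketch only: the address below is a placeholder standing in for mine, which isn't shown here; an inverted, anchored grep drops those lines.)

grep -v -E '^10\.0\.0\.5(:[0-9]+)?$' ips.txt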

I can then use the following to extract the unique entries:

sort -u ips.txt > intermediate.txt

What I don't know is how to aggregate the per-line counts with sort alone.


Solution 1:

You can use the uniq command to count repeated lines, as long as the input is sorted first:

sort ips.txt | uniq -c

To get the most frequent results at the top (thanks to Peter Jaric):

sort ips.txt | uniq -c | sort -bgr
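If you want the output in the exact ip.ad.dre.ss[:port] count order asked for in the question (address first, then count), one way is to swap the two fields of uniq -c's "count address" output with a small awk step:

sort ips.txt | uniq -c | sort -bgr | awk '{print $2, $1}'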

Solution 2:

To count just the total number of unique lines (i.e. counting each distinct line once, no matter how many times it appears), we can use uniq or Awk together with wc:

sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l

Awk's arrays are associative (hash-based), so it can count without sorting at all and may run a little faster.
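The same associative-array idea can also produce the per-address counts in the ip.ad.dre.ss[:port] count format directly, with no sort step (output order is unspecified); a minimal sketch:

awk '{count[$0]++} END {for (line in count) print line, count[line]}' ips.txt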

Generating a test file:

$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175

real    0m1.193s
user    0m0.701s
sys     0m0.388s

$ time awk '!seen[$0]++' random.txt | wc -l
31175

real    0m0.675s
user    0m0.108s
sys     0m0.171s