Determine number of unique lines with awk or similar in bash

I am using AWK to read through a custom log file I have. The format is something like this:

[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.4:8091 HTTP/1.0" 200

Right now I have AWK (called from bash) reading the whole log and grabbing every line that contains "CONNECT". That works, but it does not help me discover unique clients.

The way to do this would be to filter on just this part of each line: "CONNECT 192.168.2.4:8091 HTTP/1.0"

I want to grab all of those strings from the log file, compare them, and count duplicate lines only once. For example, given:

 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.6:8091 HTTP/2.0" 200
 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.9:8091 HTTP/2.0" 200
 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.2:8091 HTTP/2.0" 200
 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.9:8091 HTTP/2.0" 200

In this case the answer I need is 3, not 4: two of the lines are identical, so there are only 3 unique lines. What I need is an automated way to accomplish this with AWK.

If anybody can lend a hand, that would be great.


You could let awk count unique instances like this:

awk -F\" '/CONNECT/ && !seen[$2] { seen[$2]++ } END { print length(seen) }' logfile

Output:

3

This collects the first double-quoted string (field 2, with " as the field separator) from every line containing CONNECT into the seen hash array. When the end of input is reached, the number of elements in seen is printed.
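
Note that using length() on an array is a GNU awk extension, so it isn't available in every awk. If you need to stay portable, a minimal sketch of the same idea with an explicit counter (same logfile name assumed):

awk -F\" '/CONNECT/ && !seen[$2]++ { n++ } END { print n+0 }' logfile

The n+0 just forces a numeric 0 to be printed when no lines match.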


sed -re 's/.*"([^"]*)".*/\1/' logfile | sort | uniq

Awk variant: awk -F'"' '{print $2}' logfile | sort | uniq

Add -c to uniq to get a count of how many times each unique line appears, or pipe to wc -l to get a count of the number of unique lines.
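
For example, if logfile contains the four sample lines from the question, the uniq -c variant should produce something like:

sed -re 's/.*"([^"]*)".*/\1/' logfile | sort | uniq -c

      1 CONNECT 192.168.2.2:8091 HTTP/2.0
      1 CONNECT 192.168.2.6:8091 HTTP/2.0
      2 CONNECT 192.168.2.9:8091 HTTP/2.0

while piping to wc -l instead prints 3, the same count as the awk-only approach above.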