Determine number of unique lines with awk or similar in bash
I am using AWK to read through a custom log file I have. The format is something like this:
[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.4:8091 HTTP/1.0" 200
Right now, I have AWK (from bash) set up to read the whole log, analyze each line, and grab every line that contains "CONNECT". That works, but it does not help me discover unique clients.
The way to do this would be to somehow filter each line so that only this part is considered: "CONNECT 192.168.2.4:8091 HTTP/1.0"
What I need is a way to grab all those parts from the log file, compare them, and count identical ones only once. So, for example, given:
[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.6:8091 HTTP/2.0" 200
[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.9:8091 HTTP/2.0" 200
[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.2:8091 HTTP/2.0" 200
[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.9:8091 HTTP/2.0" 200
In this case, the answer I need would be 3, not 4, because two of the lines are the same, so there are only 3 unique lines. What I need is an automated way to accomplish this with AWK.
If anybody can lend a hand that would be great.
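For reference, what I have now is roughly something like this (the file name proxy.log is just a placeholder):

awk '/CONNECT/ { print }' proxy.log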
You could let awk count unique instances like this:
awk -F\" '/CONNECT/ && !seen[$2] { seen[$2]++ } END { print length(seen) }' logfile
Output:
3
This collects the first double-quoted string from lines containing CONNECT into the seen hash array. When the end of input is reached, the number of elements in seen is printed.
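Note that calling length() on an array is a GNU awk extension. If you need this to run on other awk implementations, a minimal portable sketch (assuming the same logfile name) keeps an explicit counter instead:

awk -F\" '/CONNECT/ && !seen[$2]++ { count++ } END { print count+0 }' logfile

The !seen[$2]++ idiom is true only the first time a given quoted request string appears, so count is incremented once per unique request.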
sed -re 's/.*"([^"]*)".*/\1/' <logfile> |sort |uniq
Awk variant: awk -F'"' '{print $2}' <logfile> |sort |uniq
Add -c to uniq to get a count of each matching line, or append |wc -l to get a count of the number of matching lines.
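For example, to reproduce the count of 3 from the sample lines above, the pipeline might look like this (adding a CONNECT filter and using sort -u as a shorthand for sort |uniq; the logfile name is assumed):

awk -F'"' '/CONNECT/ {print $2}' logfile |sort -u |wc -l

For the four sample lines shown in the question, this should print 3.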