Do you have any useful awk and grep scripts for parsing apache logs?
You can do pretty much anything with Apache log files with awk alone. Apache log files are basically whitespace-separated, and you can pretend the quotes don't exist and access whatever information you are interested in by column number. The only time this breaks down is if you have the combined log format and are interested in user agents, at which point you have to use quotes (") as the separator and run a separate awk command. The following will show you the IPs of every user who requests the index page, sorted by the number of hits:
awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }
END { for (i in ipcount) {
printf "%15s - %d\n", i, ipcount[i] } }' logfile.log
$7 is the requested URL. You can add whatever conditions you want at the beginning. Replace the $7 == "/" with whatever information you want.
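For reference, this is roughly how the fields line up when a combined-format line is split with -F'[ "]+' (the log line below is made up for illustration):

# 1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/5.0 ..."
#
#   $1  = client IP (1.2.3.4)
#   $4  = start of the timestamp ([10/Oct/2023:13:55:36)
#   $6  = method (GET)
#   $7  = requested URL (/)
#   $9  = status code (200)
#   $10 = response size (2326)
#   $11 = referer (http://example.com/)
#   $12 onwards = the user agent, split on its internal spaces, which is why
#                 you need quotes as the separator for user-agent work, as noted above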
If you replace the $1 in (ipcount[$1]++), then you can group the results by other criteria. Using $7 would show what pages were accessed and how often. Of course then you would want to change the condition at the beginning. The following would show what pages were accessed by a user from a specific IP:
awk -F'[ "]+' '$1 == "1.2.3.4" { pagecount[$7]++ }
END { for (i in pagecount) {
printf "%15s - %d\n", i, pagecount[i] } }' logfile.log
You can also pipe the output through sort to get the results in order, either as part of the shell command or within the awk script itself:
awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }
END { for (i in ipcount) {
printf "%15s - %d\n", i, ipcount[i] | sort } }' logfile.log
The latter would be useful if you decided to expand the awk script to print out other information. It's all a matter of what you want to find out. These should serve as a starting point for whatever you are interested in.
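If you want to sort on the shell side instead, a rough equivalent of the first command, ordered by hit count in descending order, would be:

awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }
    END { for (i in ipcount) {
        printf "%15s - %d\n", i, ipcount[i] } }' logfile.log | sort -rn -k3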
One thing I've never seen anyone else do, for reasons that I can't imagine, is to change the Apache log file format to a more easily parseable version with the information that actually matters to you.
For example, we never use HTTP basic auth, so we don't need to log those fields. I am interested in how long each request takes to serve, so we'll add that in. For one project, we also want to know (on our load balancer) if any servers are serving requests slower than others, so we log the name of the server we're proxying back to.
Here's an excerpt from one server's apache config:
# We don't want to log bots, they're our friends
BrowserMatch Pingdom.com robot
# Custom log format, for testing
#
# date proto ipaddr status time req referer user-agent
LogFormat "%{%F %T}t %p %a %>s %D %r %{Referer}i %{User-agent}i" standard
CustomLog /var/log/apache2/access.log standard env=!robot
What you can't really tell from this is that between each field is a literal tab character (\t). This means that if I want to do some analysis in Python, maybe show non-200 statuses for example, I can do this:
for line in file("access.log"):
    line = line.split("\t")
    if line[3] != "200":
        print line
Or if I wanted to do 'who is hotlinking images?' it would be
if line[6] in ("","-") and "/images" in line[5]:
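If you'd rather stay in the shell, rough awk equivalents of those two checks (assuming the same tab-separated format and a local access.log) would be:

# non-200 responses: the status code is the 4th tab-separated field
awk -F'\t' '$4 != "200"' access.log

# hotlinked images: empty or "-" referer (field 7) on a request for /images (field 6)
awk -F'\t' '($7 == "" || $7 == "-") && index($6, "/images")' access.log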
For IP counts in an access log, the grep one-liner shown elsewhere in this thread:
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" logfile | sort -n | uniq -c | sort -n
becomes something like this:
cut -f 3 log | sort | uniq -c | sort -n
Easier to read and understand, and far less computationally expensive (no regex), which, on 9 GB logs, makes a huge difference in how long it takes. Where this gets REALLY neat is if you want to do the same thing for user agents. If your logs are space-delimited, you have to do some regular expression matching or string searching by hand. With this format, it's simple:
cut -f 8 log | sort | uniq -c | sort -n
Exactly the same as the above. In fact, any summary you want to do is essentially exactly the same.
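For instance, a referrer summary is just another field number (the Referer column is field 7 in the format above):

cut -f 7 log | sort | uniq -c | sort -n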
Why on earth would I spend my system's CPU on awk and grep when cut will do exactly what I want orders of magnitude faster?
Forget about awk and grep. Check out asql. Why write unreadable scripts when you can use SQL-like syntax for querying the log file? E.g.:
asql v0.6 - type 'help' for help.
asql> load /home/skx/hg/engaging/logs/access.log
Loading: /home/skx/hg/engaging/logs/access.log
asql> select COUNT(id) FROM logs
46
asql> alias hits SELECT COUNT(id) FROM logs
ALIAS hits SELECT COUNT(id) FROM logs
asql> alias ips SELECT DISTINCT(source) FROM logs;
ALIAS ips SELECT DISTINCT(source) FROM logs;
asql> hits
46
asql> alias
ALIAS hits SELECT COUNT(id) FROM logs
ALIAS ips SELECT DISTINCT(source) FROM logs;
Here is a script to find the top URLs, top IPs and top user agents from the most recent N log entries:
#!/bin/bash
# Usage
# ls-httpd type count
# Eg:
# ls-httpd url 1000
# will find top URLs in the last 1000 access log entries
# ls-httpd ip 1000
# will find top IPs in the last 1000 access log entries
# ls-httpd agent 1000
# will find top user agents in the last 1000 access log entries
type=$1
length=$2
if [ "$3" == "" ]; then
log_file="/var/log/httpd/example.com-access_log"
else
log_file="$3"
fi
if [ "$type" = "ip" ]; then
tail -n $length $log_file | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | sort -n | uniq -c | sort -n
elif [ "$type" = "agent" ]; then
tail -n $length $log_file | awk -F\" '{print $6}'| sort -n | uniq -c | sort -n
elif [ "$type" = "url" ]; then
tail -n $length $log_file | awk -F\" '{print $2}'| sort -n | uniq -c | sort -n
fi
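For example, assuming you save it as ls-httpd and make it executable, a run against a specific log might look like:

./ls-httpd agent 5000 /var/log/httpd/example.com-access_log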
Source
For IP counts in an access log:
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" log | sort -n | uniq -c | sort -n
It's a bit ugly, but it works. I also use the following with netstat (to see active connections):
netstat -an | awk '{print $5}' | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | egrep -v "(`for i in \`ip addr | grep inet |grep eth0 | cut -d/ -f1 | awk '{print $2}'\`;do echo -n "$i|"| sed 's/\./\\\./g;';done`127\.|0\.0\.0)" | sort -n | uniq -c | sort -n
They're some of my favorite "one liners" :)