How to use awk variables in regular expressions?
I have a file called domain which contains some domains. For example:
google.com
facebook.com
...
yahoo.com
And I have another file called site which contains some site URLs and numbers. For example:
image.google.com 10
map.google.com 8
...
photo.facebook.com 22
game.facebook.com 15
..
Now I want to sum the numbers for each domain. For example, google.com gets 10+8. So I wrote an awk script like this:
BEGIN{
    while(getline dom < "./domain" > 0) {
        domain[dom]=0;
    }
    for(dom in domain) {
        while(getline < "./site" > 0) {
            if($1 ~/$dom$) { #if $1 end with $dom
                domain[dom]+=$2;
            }
        }
    }
}
But the line if($1 ~/$dom$) doesn't work the way I want, because the variable $dom in the regular expression is interpreted literally. So, my first question is:
Is there any way to use the variable $dom in a regular expression?
Also, as I'm new to writing scripts:
Is there a better way to solve this problem?
awk can match against a variable if you don't use the // regex markers.
if ( $0 ~ regex ){ print $0; }
In this case, build up the required regex as a string:
regex = dom"$"
Then match against the regex variable:
if ( $1 ~ regex ) {
domain[dom]+=$2;
}
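One caveat with a regex built from dom: an unescaped . in a dynamic regex matches any character, so google.com$ also matches a host ending in googlexcom. Below is a minimal sketch, assuming the domains contain no regex metacharacters other than dots. The function name sum_for_domain, the variable re, and the extra anchor at a label boundary are illustrative assumptions, not part of the question.

```shell
# Hypothetical helper: read "host count" lines on stdin and sum the counts
# of all hosts that end in the domain given as the first argument.
sum_for_domain() {
    awk -v dom="$1" '
        BEGIN {
            re = dom
            gsub(/\./, "[.]", re)     # "google.com" becomes "google[.]com"
            re = "(^|[.])" re "$"     # assumption: match only at a label boundary
        }
        $1 ~ re { total += $2 }       # dynamic regex: no // markers around re
        END { print total + 0 }       # "+ 0" prints 0 when nothing matched
    '
}
```

For example, `sum_for_domain google.com < site` would add up only the lines whose first field ends in .google.com or is google.com itself.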
First of all, the variable is dom, not $dom. Think of $ as an operator that extracts the field whose number is stored in the variable dom.
Secondly, awk will not interpolate variables between the // markers; whatever is in there is taken as literal regex text.
You want the match() function, whose 2nd argument can be a string that is treated as a regular expression:
if (match($1, dom "$")) {...}
I would code a solution like:
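awk '
    # first file (domain): start every domain at 0
    FNR == NR {domain[$1] = 0; next}
    # second file (site): add the count to the first domain that $1 ends with
    {
        for (dom in domain) {
            if (match($1, dom "$")) {
                domain[dom] += $2
                break
            }
        }
    }
    END {for (dom in domain) {print dom, domain[dom]}}
' domain site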
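If the domains should be compared as plain text rather than as regular expressions, a suffix test built from substr() and length() sidesteps regex metacharacters altogether. The following is only a sketch under the same two-file layout; the names count_by_suffix and ends_with are hypothetical, and the trailing sort is there only because the order of for (dom in domain) is unspecified.

```shell
# Hypothetical wrapper: $1 is the domain file, $2 is the site file.
count_by_suffix() {
    awk '
        # Hypothetical helper: true if string s ends with the literal text suffix.
        function ends_with(s, suffix,    n, m) {
            n = length(s); m = length(suffix)
            return m <= n && substr(s, n - m + 1) == suffix
        }
        FNR == NR { domain[$1] = 0; next }    # first file: the domain list
        {
            for (dom in domain)
                # the host is the domain itself or a subdomain of it
                if ($1 == dom || ends_with($1, "." dom)) {
                    domain[dom] += $2
                    break
                }
        }
        END { for (dom in domain) print dom, domain[dom] }
    ' "$1" "$2" | sort
}
```

It would be run as `count_by_suffix domain site`.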
One way using an awk script:
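BEGIN {
    FS = "[. ]"    # split on dots and spaces
    OFS = "."
}
# first file: index each domain by its first label, e.g. "google"
FNR == NR {
    domain[$1] = $0
    next
}
# second file: $1 is the subdomain, $2 the first label of the domain, $NF the count
FNR < NR {
    if ($2 in domain) {
        # rejoin fields 2 .. NF-1 with dots to rebuild the domain name
        for ( i = 2; i < NF; i++ ) {
            if ($i != "") {
                line = (line ? line OFS : "") $i
            }
        }
        total[line] += $NF
        line = ""
    }
}
END {
    for (i in total) {
        printf "%s\t%s\n", i, total[i]
    }
}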
Run like:
awk -f script.awk domain.txt site.txt
Results:
facebook.com 37
google.com 18
You clearly want to read the site file once, not once per entry in domain. Fixing that, though, is trivial.
Equally, variables in awk (other than fields $0 .. $9, etc.) are not prefixed with $. In particular, $dom is the field whose number is given by the variable dom (typically, that's going to be field 0, since domain strings don't convert to any other number).
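To see the difference between dom and $dom, here is a small sketch; the function name show_dollar_dom and the sample input line are made up for illustration.

```shell
# Hypothetical demo: feed one sample line to awk and print $dom and $n.
show_dollar_dom() {
    printf 'image.google.com 10\n' | awk '{
        dom = "google.com"   # non-numeric string, so it converts to 0 as a field number
        print $dom           # same as $0: the whole input line
        n = 2
        print $n             # field number 2, the count
    }'
}
```

The first print shows the whole line and the second shows 10, which is why $dom inside the condition never referred to the domain name.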
I think you need to find a way to get the domain from the data read from the site file. I'm not sure if you need to deal with sites with country domains such as bbc.co.uk as well as sites in the GTLDs (google.com etc). Assuming you are not dealing with country domains, you can use this:
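BEGIN {
    # parenthesize getline so that "> 0" is parsed as a comparison
    while ((getline dom < "./domain") > 0) domain[dom] = 0
    FS = "[ .]+"    # split on runs of spaces and dots
    while ((getline < "./site") > 0)
    {
        topdom = $(NF-2) "." $(NF-1)    # last two labels before the count
        domain[topdom] += $NF
    }
    for (dom in domain) print dom " " domain[dom]
}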
In the second while loop, there are NF fields; $NF contains the count, and $1 .. $(NF-1) contain components of the domain. So, topdom ends up containing the top domain name, which is then used to index into the array initialized in the first loop.
Given the data in the question (minus the lines of dots), the output is:
yahoo.com 0
facebook.com 37
google.com 18
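If the input files really do contain filler lines of dots, or hosts whose domain is not listed in the domain file, the same idea can be guarded a little. This is only a sketch under those assumptions: the name sum_top_domains is hypothetical, the files are passed as arguments instead of being read with getline, and the output is piped through sort because the order of for (dom in domain) is unspecified.

```shell
# Hypothetical wrapper: $1 is the domain file, $2 is the site file.
sum_top_domains() {
    awk '
        BEGIN { FS = "[ .]+" }            # split on runs of spaces and dots
        FNR == NR {                       # first file: the domain list
            if ($1 != "") domain[$0] = 0  # skip blank lines and lines of dots
            next
        }
        NF >= 3 {                         # second file: host labels, then the count
            topdom = $(NF-2) "." $(NF-1)
            if (topdom in domain) domain[topdom] += $NF   # ignore unlisted domains
        }
        END { for (dom in domain) print dom, domain[dom] }
    ' "$1" "$2" | sort
}
```

Like the version above, this still assumes two-label domains, so country domains such as bbc.co.uk are not handled.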