What's the quickest way to count the number of each character in a file?
If you want some real speed:
echo 'int cache[256],x,y;char buf[4096],letters[]="tacgn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -include stdio.h -include unistd.h -xc -; ./a.out < file; rm a.out;
Is an incredibly fast pseudo-one-liner.
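For readability, here's the same program the echo feeds to gcc, unpacked into a conventional source file with the includes spelled out (a sketch; the behavior is identical):
#include <stdio.h>
#include <unistd.h>

int cache[256];            /* one counter per possible byte value */
char buf[4096];            /* the 4K read buffer */
char letters[] = "tacgn-"; /* the characters we report on */

int main(void)
{
    int x, y;
    /* read the file in 4K chunks and bump a counter for every byte */
    while ((x = read(0, buf, sizeof buf)) > 0)
        for (y = 0; y < x; y++)
            cache[(unsigned char)buf[y]]++;
    /* print totals for just the characters we care about */
    for (x = 0; x < (int)(sizeof letters - 1); x++)
        printf("%c: %d\n", letters[x], cache[letters[x]]);
    return 0;
}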
A simple test shows that on my Core i7 CPU 870 @ 2.93GHz it counts at just over 600MB/s:
$ du -h bigdna
1.1G bigdna
$ time ./a.out < bigdna
t: 178977308
a: 178958411
c: 178958823
g: 178947772
n: 178959673
-: 178939837
real 0m1.718s
user 0m1.539s
sys 0m0.171s
Unlike solutions involving sorting, this one runs in constant (4K) memory, which is very useful if your file is far larger than your RAM.
And, of course with a little bit of elbow grease, we can shave off 0.7 seconds:
echo 'int cache[256],x,buf[4096],*bp,*ep;char letters[]="tacgn-"; int main(){while((ep=buf+(read(0,buf,sizeof buf)/sizeof(int)))>buf)for(bp=buf;bp<ep;bp++){cache[(*bp)&0xff]++;cache[(*bp>>8)&0xff]++;cache[(*bp>>16)&0xff]++;cache[(*bp>>24)&0xff]++;}for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -O2 -include stdio.h -include unistd.h -xc -; ./a.out < file; rm a.out;
Nets just over 1.1GB/s finishing in:
real 0m0.943s
user 0m0.798s
sys 0m0.134s
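Unpacked, the speedup comes from reading the buffer as ints and splitting each word into four bytes with shifts, so the loop bookkeeping is paid once per four bytes instead of once per byte (a sketch of the same program; note it silently ignores a trailing partial word if the read size isn't a multiple of sizeof(int)):
#include <stdio.h>
#include <unistd.h>

int cache[256];
int buf[4096];             /* a 16K buffer, read as 32-bit words */
char letters[] = "tacgn-";

int main(void)
{
    int x, *bp, *ep;
    /* one loop iteration now processes sizeof(int) bytes */
    while ((ep = buf + (read(0, buf, sizeof buf) / sizeof(int))) > buf)
        for (bp = buf; bp < ep; bp++) {
            cache[(*bp)       & 0xff]++;   /* byte 0 */
            cache[(*bp >> 8)  & 0xff]++;   /* byte 1 */
            cache[(*bp >> 16) & 0xff]++;   /* byte 2 */
            cache[(*bp >> 24) & 0xff]++;   /* byte 3 */
        }
    for (x = 0; x < (int)(sizeof letters - 1); x++)
        printf("%c: %d\n", letters[x], cache[letters[x]]);
    return 0;
}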
For comparison, I tested some of the other solutions on this page which seemed to have some kind of speed promise.
The sed/awk solution made a valiant effort, but died after 30 seconds. With such a simple regex, I expect this to be a bug in sed (GNU sed version 4.2.1):
$ time sed 's/./&\n/g' bigdna | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}'
sed: couldn't re-allocate memory
real 0m31.326s
user 0m21.696s
sys 0m2.111s
The perl method seemed promising as well, but I gave up after running it for 7 minutes:
$ time perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c' < bigdna
^C
real 7m44.161s
user 4m53.941s
sys 2m35.593s
grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
Will do the trick as a one-liner. A little explanation is needed, though.
grep -o foo.text -e A -e T -e C -e G -e N -e -
greps the file foo.text for the letters A, T, C, G, and N and for the - character, using one -e flag for each character you want to search for. The -o option also makes it print each match on its own line, one character per line.
sort
sorts the output so that identical characters end up adjacent. This sets the stage for the next tool.
uniq -c
counts the duplicate consecutive occurrences of any line. In this case, since we have a sorted list of characters, we get a neat count of each of the characters we grepped out in the first step.
If foo.text contained the string GATTACA-, this is what I'd get from this set of commands:
[geek@atremis ~]$ grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
1 -
3 A
1 C
1 G
2 T
Try this one, inspired by @Journeyman's answer.
grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
The key is knowing about the -o option for grep. This splits the match up, so that each output line corresponds to a single instance of the pattern, rather than the entire line for any line that matches. Given this knowledge, all we need is a pattern to use, and a way to count the lines. Using a regex, we can create a disjunctive pattern that will match any of the characters you mention:
A|T|C|G|N|-
This means "match A or T or C or G or N or -". The grep manual describes the various regular expression syntaxes you can use.
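As an aside, a bracket expression does the same job without alternation; the trailing - is treated literally because it's the last character inside the brackets, and plain (non -E) grep suffices:
grep -o '[ATCGN-]' foo.txt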
Now we have output that looks something like this:
$ grep -o -E 'A|T|C|G|N|-' foo.txt
A
T
C
G
N
-
-
A
A
N
N
N
Our last step is to merge and count all the similar lines, which can simply be accomplished with a sort | uniq -c, as in @Journeyman's answer. The sort gives us output like this:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort
-
-
A
A
A
C
G
N
N
N
N
T
Which, when piped through uniq -c, finally resembles what we want:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
2 -
3 A
1 C
1 G
4 N
1 T
Addendum: If you want to total the number of A, C, G, N, T, and - characters in a file, you can pipe the grep output through wc -l instead of sort | uniq -c. There are lots of different things you can count with only slight modifications to this approach.
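For example, with the same foo.txt contents shown above, the grand total is just the sum of the per-character counts:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | wc -l
12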