What's the quickest way to count the number of each character in a file?
If you want some real speed:
echo 'int cache[256],x,y;char buf[4096],letters[]="tacgn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -include stdio.h -include unistd.h -xc -; ./a.out < file; rm a.out;
Is an incredibly fast pseudo-one-liner.
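For readability, here's the same program the echo feeds to gcc, unpacked into a conventional source file with the includes spelled out (a sketch; the behavior is identical):
#include <stdio.h>
#include <unistd.h>

int cache[256];            /* one counter per possible byte value */
char buf[4096];            /* the 4K read buffer */
char letters[] = "tacgn-"; /* the characters we report on */

int main(void)
{
    int x, y;
    /* read the file in 4K chunks and bump a counter for every byte */
    while ((x = read(0, buf, sizeof buf)) > 0)
        for (y = 0; y < x; y++)
            cache[(unsigned char)buf[y]]++;
    /* print totals for just the characters we care about */
    for (x = 0; x < (int)(sizeof letters - 1); x++)
        printf("%c: %d\n", letters[x], cache[letters[x]]);
    return 0;
}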
A simple test shows that on my Core i7 CPU 870 @ 2.93GHz it counts at just over 600MB/s:
$ du -h bigdna
1.1G bigdna
$ time ./a.out < bigdna
t: 178977308
a: 178958411
c: 178958823
g: 178947772
n: 178959673
-: 178939837
real 0m1.718s
user 0m1.539s
sys 0m0.171s
Unlike solutions involving sorting, this one runs in constant (4K) memory, which is very useful if your file is far larger than your RAM.
And, of course with a little bit of elbow grease, we can shave off 0.7 seconds:
echo 'int cache[256],x,buf[4096],*bp,*ep;char letters[]="tacgn-"; int main(){while((ep=buf+(read(0,buf,sizeof buf)/sizeof(int)))>buf)for(bp=buf;bp<ep;bp++){cache[(*bp)&0xff]++;cache[(*bp>>8)&0xff]++;cache[(*bp>>16)&0xff]++;cache[(*bp>>24)&0xff]++;}for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -O2 -include stdio.h -include unistd.h -xc -; ./a.out < file; rm a.out;
Nets just over 1.1GB/s finishing in:
real 0m0.943s
user 0m0.798s
sys 0m0.134s
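Unpacked, the speedup comes from reading the buffer as ints and splitting each word into four bytes with shifts, so the loop bookkeeping is paid once per four bytes instead of once per byte (a sketch of the same program; note it silently ignores a trailing partial word if the read size isn't a multiple of sizeof(int)):
#include <stdio.h>
#include <unistd.h>

int cache[256];
int buf[4096];             /* a 16K buffer, read as 32-bit words */
char letters[] = "tacgn-";

int main(void)
{
    int x, *bp, *ep;
    /* one loop iteration now processes sizeof(int) bytes */
    while ((ep = buf + (read(0, buf, sizeof buf) / sizeof(int))) > buf)
        for (bp = buf; bp < ep; bp++) {
            cache[(*bp)       & 0xff]++;   /* byte 0 */
            cache[(*bp >> 8)  & 0xff]++;   /* byte 1 */
            cache[(*bp >> 16) & 0xff]++;   /* byte 2 */
            cache[(*bp >> 24) & 0xff]++;   /* byte 3 */
        }
    for (x = 0; x < (int)(sizeof letters - 1); x++)
        printf("%c: %d\n", letters[x], cache[letters[x]]);
    return 0;
}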
For comparison, I tested some of the other solutions on this page which seemed to have some kind of speed promise.
The sed/awk solution made a valiant effort, but died after 30 seconds. With such a simple regex, I expect this to be a bug in sed (GNU sed version 4.2.1):
$ time sed 's/./&\n/g' bigdna | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}'
sed: couldn't re-allocate memory
real 0m31.326s
user 0m21.696s
sys 0m2.111s
The perl method seemed promising as well, but I gave up after running it for 7 minutes:
$ time perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c' < bigdna
^C
real 7m44.161s
user 4m53.941s
sys 2m35.593s
grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
Will do the trick as a one-liner. A little explanation is needed, though.
grep -o foo.text -e A -e T -e C -e G -e N -e -
greps the file foo.text for the letters A, T, C, G, and N and for the - character, using one -e flag for each character you want to search for. The -o option also makes it print each match on its own line, one character per line.
sort
sorts the output so that identical characters end up adjacent. This sets the stage for the next tool.
uniq -c
counts the duplicate consecutive occurrences of any line. In this case, since we have a sorted list of characters, we get a neat count of each of the characters we grepped out in the first step.
If foo.text contained the string GATTACA-, this is what I'd get from this set of commands:
[geek@atremis ~]$ grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
1 -
3 A
1 C
1 G
2 T
Try this one, inspired by @Journeyman's answer.
grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
The key is knowing about the -o option for grep. This splits the match up, so that each output line corresponds to a single instance of the pattern, rather than the entire line for any line that matches. Given this knowledge, all we need is a pattern to use, and a way to count the lines. Using a regex, we can create a disjunctive pattern that will match any of the characters you mention:
A|T|C|G|N|-
This means "match A or T or C or G or N or -". The grep manual describes the various regular expression syntaxes you can use.
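As an aside, a bracket expression does the same job without alternation; the trailing - is treated literally because it's the last character inside the brackets, and plain (non -E) grep suffices:
grep -o '[ATCGN-]' foo.txt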
Now we have output that looks something like this:
$ grep -o -E 'A|T|C|G|N|-' foo.txt
A
T
C
G
N
-
-
A
A
N
N
N
Our last step is to merge and count all the similar lines, which can simply be accomplished with a sort | uniq -c, as in @Journeyman's answer. The sort gives us output like this:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort
-
-
A
A
A
C
G
N
N
N
N
T
Which, when piped through uniq -c, finally resembles what we want:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
2 -
3 A
1 C
1 G
4 N
1 T
Addendum: If you want to total the number of A, C, G, N, T, and - characters in a file, you can pipe the grep output through wc -l instead of sort | uniq -c. There are lots of different things you can count with only slight modifications to this approach.
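For example, with the same foo.txt contents shown above, the grand total is just the sum of the per-character counts:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | wc -l
12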