Count line lengths in file using command line tools

Problem

If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?

Example:

file.txt

this
is
a
sample
file
with
several
lines
of
varying
length

Running count_line_lengths file.txt would give:

Length Occurrences
1      1
2      2
4      3
5      1
6      2
7      2

Ideas?

This

counts the line lengths using awk, then
sorts the (numeric) line lengths using sort -n and finally
counts the unique line length values using uniq -c.

$ awk '{print length}' file.txt | sort -n | uniq -c
      1 1
      2 2
      3 4
      1 5
      2 6
      2 7

In the output, the first column is the number of lines with the given length, and the second column is the line length.
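To get the length-first layout asked for in the question, the two columns can be swapped in a final awk stage. This is only a sketch: the function name count_line_lengths is borrowed from the question, and the header text and column width are assumptions.

```shell
# count_line_lengths FILE: print each line length and how many lines have it.
# (Function name taken from the question; header and padding are assumptions.)
count_line_lengths() {
    awk '{ print length }' "$1" |   # one length per input line
        sort -n |                   # bring equal lengths together, numerically
        uniq -c |                   # prefix each distinct length with its count
        awk 'BEGIN { print "Length Occurrences" }
             { printf "%-6s %s\n", $2, $1 }'   # swap to "length count"
}
```

Called as count_line_lengths file.txt, it should print the header followed by one row per distinct length.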

Pure awk (the iteration order of for (i in a) is unspecified, so the lengths may come out unsorted):

awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt

4 3
5 1
6 2
7 2
1 1
2 2
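Because awk does not guarantee the order in which for (i in a) visits the array, the result above is not sorted by length. One possible fix, if leaving pure awk is acceptable, is to pipe the result through sort -n. The function name line_length_histogram is hypothetical.

```shell
# line_length_histogram FILE: "length count" pairs, ordered by length.
# (Hypothetical helper name; the awk program is the one shown above.)
line_length_histogram() {
    awk '{ ++a[length($0)] } END { for (i in a) print i, a[i] }' "$1" |
        sort -n   # numeric sort on the leading length field
}
```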

Using bash arrays:

#!/bin/bash

# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
while IFS= read -r line; do
    ((histogram[${#line}]++))
done < file.txt

echo "Length Occurrence"
for length in "${!histogram[@]}"; do
    printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done

Example run:

$ ./t.sh
Length Occurrence
1      1
2      2
4      3
5      1
6      2
7      2

Using Perl (keys sorted numerically so the output is ordered by length):

$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for sort { $a <=> $b } keys %c;' file.txt

Output

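You can accomplish this using basic Unix utilities only (this works as long as no line contains whitespace, because for line in $(cat file.txt) splits on words, not lines):

$ printf "%s %s\n" $(for line in $(cat file.txt); do printf '%s' "$line" | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")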
1 1
2 2
4 3
5 1
6 2
7 2

How does it work?

Here's the source file:

$ cat file.txt
this
is
a
sample
file
with
several
lines
of
varying
length

Replace each line of the source file with its length:

$ for line in $(cat file.txt); do printf '%s' "$line" | wc -c; done
4
2
1
6
4
4
7
5
2
7
6

Sort and count the number of length occurrences:

$ for line in $(cat file.txt); do printf '%s' "$line" | wc -c; done | sort -n | uniq -c
      1 1
      2 2
      3 4
      1 5
      2 6
      2 7

Swap and format the numbers:

$ printf "%s %s\n" $(for line in $(cat file.txt); do printf '%s' "$line" | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2
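The for line in $(cat file.txt) loop splits its input on any whitespace, so a line such as "two words" would be counted as two separate lines. A sketch of the same pipeline that reads whole lines instead; the function name lengths_basic is hypothetical, and wc -c counts bytes rather than characters:

```shell
# lengths_basic FILE: same steps as above, but one iteration per whole line.
# (Hypothetical helper name; wc -c counts bytes, not characters.)
lengths_basic() {
    while IFS= read -r line; do
        printf '%s' "$line" | wc -c    # length of the line, newline excluded
    done < "$1" |
        sort -n |                      # order the lengths numerically
        uniq -c |                      # count each distinct length
        sed -E 's/^ *([0-9]+) +([0-9]+)$/\2 \1/'   # "count length" -> "length count"
}
```

Here sed already produces one "length count" pair per line, so the outer printf from the original command is not needed.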
