Why does the command uniq -c put whitespace at the beginning?
By default, uniq -c right-justifies the frequency in a field 7 characters wide, then separates the frequency from the item with a single space.
Source: https://www.thelinuxrain.com/articles/tweaking-uniq-c
Remove the leading spaces with sed:
$ sort input | uniq -c | sort -nr | sed 's/^\s*//' > output
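To make the layout concrete, here is a minimal Python sketch that mimics the formatting described above (count right-justified in 7 characters, then one space). The function name uniq_c_line is hypothetical and used only for illustration; it is not part of uniq or of any library.

```python
def uniq_c_line(count: int, item: str) -> str:
    # Hypothetical helper: right-justify the count in a 7-character field,
    # then add a single space and the item.
    return f"{count:>7} {item}"

line = uniq_c_line(1, "test")
print(repr(line))           # repr() makes the left padding visible
print(repr(line.lstrip()))  # lstrip() drops the padding, like the sed command
```

A count wider than 7 digits simply takes the space it needs, so the padding only appears for smaller counts.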
uniq -c adds leading whitespace. E.g.
$ echo test
test
$ echo test | uniq -c
      1 test
You could add a command at the end of the pipeline to remove it. E.g.
$ echo test | uniq -c | sed 's/^\s*//'
1 test
FWIW you can use a different sorting tool for more flexibility. Python is one such tool.
Source
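#!/usr/bin/python3
import sys, operator, collections
# Count each input line (trailing newline stripped) in a hash-based Counter.
counter = collections.Counter(map(operator.methodcaller('rstrip', '\n'), sys.stdin))
# most_common() yields (item, count) pairs from most to least frequent.
for item, count in counter.most_common():
    print(count, item)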
In theory this would even be faster than the sort tool for large inputs, since the above program uses a hash table to identify duplicate lines instead of a sorted list. (Alas, it does not place lines of identical count in a natural, sorted order; since Python 3.7 they come out in the order first encountered. This can be amended and still be faster than two sort invocations.)
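One possible way to make that amendment is sketched below. It assumes ties should be listed in ascending lexicographic order, which is only one reading of "natural order". The function name count_lines is hypothetical. Only the distinct lines are sorted, not the whole input, which is why this may still beat two sort invocations on inputs with many duplicates.

```python
import collections

def count_lines(lines):
    # Hypothetical helper: count lines with their trailing newline removed.
    counter = collections.Counter(line.rstrip("\n") for line in lines)
    # Primary key: count, descending. Secondary key: the line itself, ascending.
    return sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))

for item, count in count_lines(["b\n", "a\n", "b\n", "c\n", "a\n"]):
    print(count, item)
```

In the real script, sys.stdin would be passed in place of the sample list.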
Output Format
If you want more flexibility on the output format you can look into the print() and format() built-in functions.
For instance, to print the count in octal, zero-padded to 8 digits (so up to 7 leading zeros), followed by a tab instead of a space, and with a NUL character instead of a newline as the line terminator, replace the last line with the following (keeping its indentation inside the loop):
print(format(count, '08o'), item, sep='\t', end='\0')
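As a quick check of what that format specification produces, the sketch below builds the same record as a string. The helper name format_record is hypothetical and exists only for this illustration.

```python
def format_record(count: int, item: str) -> str:
    # '08o' means: octal digits, zero-padded to a total width of 8.
    # The tab and NUL mirror sep='\t' and end='\0' in the print() call.
    return format(count, "08o") + "\t" + item + "\0"

# Decimal 10 is 12 in octal, so six zeros are added in front.
print(repr(format_record(10, "test")))
```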
Usage
Store the script in a file, say sort_count.py, and invoke it with Python:
python3 sort_count.py < input