Handy parsing for numbers with unit suffixes?

Say you have data with quantities in human-readable format, such as the output of du -h, and want to operate on those numbers further, for example by piping the data through grep and summing a subset of it. You do this ad hoc on many systems you've never seen before, with only minimal utilities available. You want suffix conversions for all the standard 10^n suffixes.

Is there a GNU/Linux utility that converts the suffixed numbers to plain numbers within a pipeline? Do you have a bash function that does this, or some Perl that is easy to remember, instead of a long chain of regex replacements or several sed steps?

38M     /var/crazyface/courses/200909-90147
2.7M    /var/crazyface/courses/200909-90157
1.1M    /var/crazyface/courses/200909-90159
385M    /var/crazyface/courses/200909-90161
1.3M    /var/crazyface/courses/200909-90169
376M    /var/crazyface/courses/200907-90171
8.0K    /var/crazyface/courses/200907-90173
668K    /var/crazyface/courses/200907-90175
564M    /var/crazyface/courses/200907-90178
4.0K    /var/crazyface/courses/200907-90179

| grep 200907 | <amazing suffix conversion> | awk '{s+=$1} END {print s}'


Relevant references:

  • How can I sort du -h output by size
  • https://stackoverflow.com/questions/2557649/convert-memory-size-human-readable-into-actual-number-bytes-in-perl

Solution 1:

Based on my answer at one of the questions you linked to:

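awk '{
    # Suffix position in the list gives the exponent: K=1, M=2, ... (0 if none)
    ex = index("KMGTPEZY", substr($1, length($1)))
    # awk strings are 1-indexed; strip the suffix only if there is one
    val = (ex > 0) ? substr($1, 1, length($1) - 1) : $1

    # Decimal multiples as asked; use 1024^ex for binary units
    prod = val * 10^(ex * 3)

    sum += prod
}
END {printf "%.0f\n", sum}'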
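
Since GNU du -h prints binary units (powers of 1024) unless --si is given, a variant of the same idea with base 1024 may be closer to the real byte counts. This is only a sketch: the function name sum_human_sizes is hypothetical, and it assumes each input line starts with the size followed by whitespace.

```shell
# Hypothetical helper: read "SIZE<whitespace>PATH" lines on stdin and print
# the total in bytes, treating K, M, G, ... as powers of 1024.
sum_human_sizes() {
    awk '
    {
        size = $1
        # Position of the last character in the suffix list (0 if no suffix).
        ex = index("KMGTPEZY", substr(size, length(size)))
        # Strip the suffix only when one was found.
        val = (ex > 0) ? substr(size, 1, length(size) - 1) : size
        sum += val * 1024 ^ ex
    }
    END { printf "%.0f\n", sum }'
}

# Example with three sizes from the sample data (paths shortened).
printf '%s\n' '8.0K /a' '668K /b' '4.0K /c' | sum_human_sizes   # prints 696320
```

It would slot into the pipeline from the question as `| grep 200907 | sum_human_sizes`, replacing both the conversion step and the final awk.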

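Another method rewrites each suffix as a multiplication and lets bc do the arithmetic. It assumes the input has already been reduced to the size column, and the $a0 command needs GNU sed: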

sed 's/G/ * 1000 M/;s/M/ * 1000 K/;s/K/ * 1000/; s/$/ +\\/; $a0' | bc

Solution 2:

You can use Perl regular expressions to do this. For example:

# Multipliers for the binary suffixes; no suffix means plain bytes.
my %amplifier = ('' => 1, K => 1024, M => 1024**2, G => 1024**3);
my $value = 0;
# Match a leading number, an optional suffix, then whitespace.
if ($line =~ /^(\d+\.?\d*)([KMG]?)\s/) {
   $value = $1 * $amplifier{$2};
}

This is a simple script that you can use as a starting point. Hope it helps!

Solution 3:

Personally, I'd just not use the -h flag in the first place. The "human readable" version rounds off numbers which will need to be rounded again when you convert back, getting even less accurate. (For instance, 2.7MiB is 2831155.2 bytes. What did you do with the other 0.8th of a byte??!)
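
If dropping -h is an option, the summation needs no suffix conversion at all. A minimal sketch, assuming du is run with -k (sizes in KiB); the function name sum_kib_as_bytes is made up for illustration:

```shell
# Hypothetical helper: sum the first column of `du -k`-style output
# (sizes in KiB) and print the total in bytes.
sum_kib_as_bytes() {
    awk '{ s += $1 } END { printf "%d\n", s * 1024 }'
}

# Intended use (paths taken from the question):
#   du -k /var/crazyface/courses/* | grep 200907 | sum_kib_as_bytes

# Stand-in input so the example runs anywhere: 8 KiB + 668 KiB.
printf '8\t/a\n668\t/b\n' | sum_kib_as_bytes   # prints 692224
```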

Otherwise, you can ask units to convert KiB/MiB/GiB to plain "B" and it'll handle this, but you'd have to do something like the following (assuming your output is tab-separated; otherwise adjust the cut accordingly):

{your output} | cut -f1 | xargs -I {} units -1t {}iB B | awk '{s+=$1}END{printf "%d\n",s}'
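
A different tool from the one above, offered only as a possible alternative: where GNU coreutils includes numfmt, it can do the conversion in a single step. The sketch below assumes numfmt is installed, which may not hold on minimal or non-GNU systems. --from=iec treats K, M, G as powers of 1024 (use --from=si for powers of 1000), and --field=1 converts only the first column. The wrapper name sum_with_numfmt is hypothetical.

```shell
# Hypothetical wrapper: convert the first column with numfmt, then sum it.
# Assumes GNU coreutils numfmt is available on the system.
sum_with_numfmt() {
    numfmt --from=iec --field=1 |
        awk '{ s += $1 } END { printf "%d\n", s }'
}

# Three sizes from the sample data, tab-separated like du output.
printf '8.0K\t/a\n668K\t/b\n4.0K\t/c\n' | sum_with_numfmt
```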

Solution 4:

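VALUE=$1

# Expand each suffix into a multiplication: G -> *1024m, m -> *1024k, k -> *1024.
# One pass in this order is enough, so no loop is needed.
# Integers only: bash arithmetic cannot handle inputs such as 2.7M.
VALUE=${VALUE//[gG]/*1024m}
VALUE=${VALUE//[mM]/*1024k}
VALUE=${VALUE//[kK]/*1024}

# With the * signs removed, what is left must be a positive integer.
if [ "${VALUE//\*/}" -gt 0 ] 2>/dev/null; then
        echo "VALUE=$((VALUE))"
else
        echo "ERROR: invalid size, please enter a correct size"
fi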