Is there a way to make any command output machine readable / friendly?

The actual issue I'm facing is: I want to sample disk usage / free disk space periodically, say once a day, and store all the output in a database or similar.

I saw that du uses spaces to separate values, but didn't see any switch to change that format.

But the general question is this: is there a common way to turn any command output into a more machine-readable version (e.g. CSV, XML, etc.)?


First, no, 'du' uses line breaks to separate values. Each line is a pair of size + path, separated by whitespace. This is quite machine-readable – common tools such as awk or cut can work with it easily.

For example, awk '{print $1}' gets you the 1st column – and if you're using du -s to get only the total, this immediately gives you just the number.
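If the consumer is a script rather than a shell pipeline, the same split is easy to do there. The following is only a sketch, assuming a du that accepts the POSIX -s and -k options; the function names parse_du_line and du_total_kib are made up for this example.

```python
import subprocess


def parse_du_line(line: str) -> tuple[int, str]:
    """Split one line of du output into (size, path).

    Only the first run of whitespace is treated as the separator,
    so spaces inside the path are preserved.
    """
    size, path = line.rstrip("\n").split(None, 1)
    return int(size), path


def du_total_kib(path: str) -> int:
    """Return the total usage of `path` in KiB, as reported by `du -sk`."""
    out = subprocess.run(
        ["du", "-sk", path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # With -s there is a single summary line; take the last line to be safe.
    return parse_du_line(out.splitlines()[-1])[0]
```

Calling du_total_kib("/var/log") would then return a plain integer, ready to be stored.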

In general, though, command output is just text, and there's great variety in how that text can be formatted – so there cannot possibly be a generic tool that works with every program.

Some programs have built-in options to output CSV or JSON or YAML (or rarely XML), but if the program itself cannot output something in machine-readable format, you end up using various text-processing tools (such as awk, grep, perl, sed, cut) to mangle it into something usable and just kind of hope that the format stays constant.
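As an illustration of that kind of hand-rolled conversion, here is a rough Python sketch that turns whitespace-separated lines into CSV. It assumes you know how many columns the program prints and that only the last column may contain spaces; columns_to_csv is a hypothetical helper, not part of any tool mentioned here.

```python
import csv
import io


def columns_to_csv(text: str, ncols: int) -> str:
    """Convert whitespace-separated lines into CSV text.

    Each line is split into at most `ncols` fields, so any spaces in
    the last field (typically a path) are kept intact.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        writer.writerow(line.split(None, ncols - 1))
    return buf.getvalue()
```

The csv module takes care of quoting fields that themselves contain commas or quotes, which is the part that is easy to get wrong with awk or sed.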

Related to your specific problem, there are two ways to get disk usage: you can sum up the sizes of all individual files (which is slow and takes a bit of I/O), or you can ask the filesystem to give you the "free space" number that it already tracks (which is therefore very fast). So instead of du, you should be using df.

  • 'df' does have more columns in its output; fortunately most of them (except the last one) are more or less guaranteed to never contain any spaces (well, at least for local disks), so it is also okay to use 'awk' to get the correct column.

    avail=$(df -P / | awk 'NR == 2 {print $4}')
    

    ('df' has the -P option to make it always output fields in this specific order that's POSIX-mandated, even if the default order may vary between systems. The columns are filesystem, total blocks, used, available, capacity, and mount point, so with df -P, available space is always the 4th field.)

  • Linux (GNU Coreutils) 'df' actually has the --output option to make it output just the columns that you want, but this is probably not available in other variants (e.g. Busybox df):

    avail=$(df --output=avail / | tail -1)
    
  • There are other tools which can query free space of a filesystem, e.g. findmnt --df on Linux which has a similar option to output individual columns as well as one to hide the header:

    avail=$(findmnt -b -n -o AVAIL /)
    

    It can also output JSON, which is easily machine-readable (and within shell, the jq tool can be used to process it – Python or Perl would of course have built-in modules to parse JSON into dicts or arrays):

    data=$(findmnt --json --df --bytes /)
    avail=$(jq -r ".filesystems[0].avail" <<< "$data")
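
For the Python route mentioned above, a minimal sketch could look like this. It assumes findmnt is installed and that its JSON has the same "filesystems" array the jq example relies on; the function names are invented for illustration.

```python
import json
import subprocess


def parse_findmnt_avail(json_text: str) -> int:
    """Extract the available byte count from `findmnt --json --df --bytes` output."""
    data = json.loads(json_text)
    # int() accepts both a JSON number and a numeric string,
    # in case the value is emitted as a quoted string.
    return int(data["filesystems"][0]["avail"])


def findmnt_avail(path: str = "/") -> int:
    """Run findmnt for `path` and return the available space in bytes."""
    out = subprocess.run(
        ["findmnt", "--json", "--df", "--bytes", path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_findmnt_avail(out)
```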
    

The other approach is to bypass the tool and directly do whatever it does under the hood. For example, if you were programming in C or similar, you wouldn't run any of those tools to check disk usage. All of those programs end up making the statfs() system call, so you could simply call statfs("/") yourself.

Similarly, if you're writing your monitoring script in Python, you don't need to bother with any output parsing – just call os.statvfs("/") and the available block count is in .f_bavail (multiply it by .f_frsize to get bytes).
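Put together with the "store it once a day" part of the question, that might look roughly like the sketch below. The table name disk_free, its columns, and the function names are all assumptions made up for the example; any database would do, SQLite is used only because it ships with Python.

```python
import os
import sqlite3
import time


def free_bytes(path: str = "/") -> int:
    """Bytes available to unprivileged users on the filesystem holding `path`."""
    st = os.statvfs(path)
    # f_bavail is counted in units of f_frsize (the fragment size).
    return st.f_bavail * st.f_frsize


def record_sample(db: sqlite3.Connection, path: str = "/") -> None:
    """Append one timestamped free-space sample to a (hypothetical) table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS disk_free "
        "(ts INTEGER, path TEXT, free_bytes INTEGER)"
    )
    db.execute(
        "INSERT INTO disk_free VALUES (?, ?, ?)",
        (int(time.time()), path, free_bytes(path)),
    )
    db.commit()
```

A daily cron job could open the database with sqlite3.connect() and call record_sample() once. On Unix, shutil.disk_usage(path).free is computed from the same two statvfs fields, if you prefer a ready-made wrapper.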


There is a way to turn any blank-delimited output into CSV containing, for example, the first two fields.

The Stack Overflow post "Append text after du output in csv" provides the following bash function:

sizeFolder(){
  du -h --max-depth=1 --block-size=1M "$TMP_DIR" | sort -hr | awk 'NR==FNR{a[NR]=$0; next} {print $1, $2, a[FNR]}' textfile -
}