What is a reliable code (dependable code) to count files using Bash?

Counting files ...

Example #1: an ls command to count directories and regular files, tested on Ubuntu 20.04.3, specifically Bash 5.0.17(1):

ls -1 | wc -l

Good news: the above ls command counts a file name with spaces correctly, since the whole name stays on one line (on a terminal, GNU ls displays it quoted as 'file name').

Bad news: the above ls command miscounts. One (1) file with a newline in its file name counts as two (2) files.
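A minimal demonstration of the miscount, run in an empty scratch directory:

```shell
cd "$(mktemp -d)"
# create ONE file whose name contains a newline
touch 'a
b'
ls -1 | wc -l    # prints 2, although there is only 1 file
```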

Problem: miscounting files.

Example #2: an ls command that gives the correct file count:

ls -1qi  | grep -o '^ *[0-9]*' | wc -l

The above command correctly counts files with a newline in their names: it counts the list of inode numbers, not the file names themselves.

On its own, the shortened ls command:

ls -1qi

correctly shows a file name with spaces on a single line, and also shows a file name with a newline on a single line (-q replaces the newline with '?').
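To see the effect of -q, in an empty scratch directory:

```shell
cd "$(mktemp -d)"
touch 'a b' 'a
b'
# with -q, each entry occupies exactly one line;
# the newline in the second name is displayed as '?'
ls -1q | wc -l    # prints 2
```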

How to create problem files for testing? Use:

touch 'a b' 'a  b' a$'\xe2\x80\x82'b a$'\xe2\x80\x83'b a$'\t'b a$'\n'b

To run the command recursively, add R:

ls -1qRi
ls -1qRi | grep -o '^ *[0-9]*'
ls -1qRi | grep -o '^ *[0-9]*' | wc -l
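Putting the pieces together, a quick sanity check (in a scratch directory; the $'…' quoting requires bash, ksh or zsh): the six problem files created by the touch command above should count as six:

```shell
cd "$(mktemp -d)"
# six distinct names: spaces, en space, em space, tab, newline
touch 'a b' 'a  b' a$'\xe2\x80\x82'b a$'\xe2\x80\x83'b a$'\t'b a$'\n'b
ls -1qi | grep -o '^ *[0-9]*' | wc -l    # prints 6
```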

Problem:

  • When to use ls in code?
  • When not to use ls in code?

Reference A:

Why you shouldn't parse the output of ls is explained here. In short, parsing ls is bad practice.

Reference B:

This post explains why not to parse ls (and what to do instead).

The Example #2 code skirts one problem (one snag): a file with a newline in its file name.

ls -1qi  | grep -o '^ *[0-9]*' | wc -l

What counting code is more reliable than Example #2?

ls -1qi  | grep -o '^ *[0-9]*' | wc -l

Reliable counting code means code that can correctly:

  • Count directories
  • Count regular files
  • Count symbolic links
  • Count hidden files
  • Count and display a file name with spaces
  • Count and display a file name with a new line
  • Count in one (1) directory
  • Count recursively

Said another way: To count files, what is a reliable code (dependable code)?


Solution 1:

Analysis

ls with -1 and -q is not the worst way to count files in some cases. Both options are defined in the POSIX specification of ls, so you can call them portable.

The standard "do not parse ls" article fights the idea of using ls to get filename(s) reliably. If you want to count entries then you don't really need filenames and sometimes carefully used ls may work for you. There are general problems though:

  1. To tell apart regular files from symlinks or directories you need to use -l and to examine drwxr-xr-x or similar string. If you want to tell apart hidden from non-hidden at the same time (and you use -a to print hidden files) then you need to check if there is a dot at the beginning of the filename. It's not trivial with -l because a dot with totally different meaning may appear earlier in the line. Yes, ls -p can help with spotting directories without -l, but this only works for directories. And there is ls -F that can be kinda helpful. Depending on what files you want to count, different options for ls along with different patterns for grep are needed. This may turn ugly quite fast. Note an approach like this is exactly what we mean by "parsing" in this context: analyzing some possibly convoluted structure to get the information you need. This leads us to the main problem.
  2. The output of ls is not designed to be parsed. It is designed to be easily readable by humans. Parsing ls is like hammering a screw. In some cases the end result may be acceptable but the hammer is not for this.
  3. If you get used to parsing the output of ls in cases where it can work then you will be more eager to parse ls in cases where it cannot work reliably. If all you have is a hammer, everything looks like a nail.

Basic solution: find

What is the right replacement for ls here? IMO it's find. Let's build an example command from scratch and analyze some quirks.

First of all, the default action in find is -print. It prints pathnames to stdout as newline-terminated strings. If the pathname itself contains at least one newline character then there will be more lines than files. This means find . | wc -l is not a good way to count all files. The right portable way is:

# count all files `find' can find, starting from `.', recursively 
find . -exec printf a \; | wc -c

where a can be any one-byte character. For each file find finds (including .!), one byte is printed; wc -c counts these bytes (you could as well print fixed lines and count lines). The downside is that -exec spawns a separate printf process for every file, which is costly and slow. With GNU find you can make find itself do the job of printf:

find . -printf a | wc -c

The above command should perform better, it's not portable though. A portable and probably improved approach is to use the printf builtin of your sh:

# count all files `find' can find, starting from `.', recursively
find . -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c

With this approach one sh will process many pathnames and call printf a for each. There may be more than one sh spawned (to avoid argument list too long error, find is that smart), still way less than one per pathname. (Note: find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?)

I wrote "probably improved approach" because printf may or may not be a builtin in your sh. If it's not a builtin then the command will perform slightly worse than the one with -exec printf …. In practice printf is a builtin in basically any implementation of sh. Still formally this is not required.
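One way to check whether printf is a builtin in your sh (the exact wording of the output varies between shells):

```shell
sh -c 'type printf'    # e.g. "printf is a shell builtin"
```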


Magic begins – adding tests

find is well equipped to perform various tests on files it visits. Want to count regular files? Here:

# count all regular files `find' can find, starting from `.', recursively
find . -type f \
       -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c

Hidden files? Here:

# count all hidden files `find' can find, starting from `.', recursively
find "$PWD" -name '.*' \
            -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c

Note if I used find . … then the current working directory would match -name '.*' regardless of its "real" name. I used (properly quoted) $PWD to make find recognize the current working directory under its "real" name.
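A quick way to see the problem, assuming an empty scratch directory: even with no hidden files present, find . -name '.*' matches the starting point itself:

```shell
cd "$(mktemp -d)"
find . -name '.*' | wc -l    # prints 1: '.' itself matches '.*'
```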

You can combine tests. Hidden regular files? Here:

# count all hidden regular files `find' can find, starting from `.', recursively
find "$PWD" -type f -name '.*' \
            -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c

You can test virtually anything. Remember -exec foo … \; is also a test, it succeeds iff foo returns exit status 0; this way you can build custom tests (example).
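As a hypothetical illustration of such a custom test (the head -c 1 probe is my own example, not from the linked answer): count regular files whose first byte is #, using -exec … \; as a per-file test:

```shell
# the first -exec is a test (succeeds iff the first byte is '#'),
# the second -exec does the counting as before
find . -type f \
       -exec sh -c '[ "$(head -c 1 "$1")" = "#" ]' find-sh {} \; \
       -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c
```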

The downside is find is not easy to master. Common surprises:

  • -o,
  • rounding (in some tests),
  • unintuitive meaning of -n, n and +n (in some tests).

You may find this answer of mine useful (especially "theory" and "pitfalls"). Simple tests like -type f are quite straightforward though.

Then there is the recursiveness. find . finds . and everything below. In GNU find one can use -mindepth 1 to omit the starting point(s); similarly -maxdepth 1 suppresses descending into subdirectories. In other words GNU find . -mindepth 1 -maxdepth 1 should find what ls -A prints. I believe BSD find uses -depth 1 for this (note -depth n is very different from -depth). All these are not portable. This answer provides a portable solution:

Generally though, it's depth 1 you want (-mindepth 1 -maxdepth 1) as you don't want to consider . (depth 0), and then it's even simpler:

find . ! -name . -prune -extra-conditions-and-actions

And this leads us to the following example:

# count all hidden regular files `find' can find, inside `.', non-recursively
find . ! -name . -prune -type f -name '.*' \
         -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c

Note it's safe to use . here (no need for $PWD) because . does not pass ! -name . and therefore the fact it matches -name '.*' is irrelevant. Actually if we used $PWD, we would complicate things because we would need to replace ! -name . with something else and in general this would be non-trivial.
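A sanity check for the prune idiom, in a scratch directory: it should count exactly what ls -A lists (the entries of ., hidden included, non-recursive):

```shell
cd "$(mktemp -d)"
touch a .b
mkdir c
# 'c' is pruned (not descended into) but still counted as an entry
find . ! -name . -prune \
       -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c    # prints 3
```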

One big difference between ls … | … | wc -l and any of our find … | wc -c pipelines is: with find we don't parse anything. Our tests in find test directly what we want to test, not some textual representation of it; they don't rely on our understanding of the output format of any tool (like ls or whatever). Pathnames with spaces, newlines or whatever cannot break things because they never appear in anything we process.

Another important difference is the ability of find to run virtually any test.


Magic continues – multiple counters

We know how to count files matching virtually any criteria. Each of the commands we used in the previous section gives us a single number. If we wanted two numbers, e.g. the total number of files and the number of regular files, then we could run two different find … | wc -c commands. This would be sub-optimal because:

  • each find would traverse the directory tree on its own; caching in the OS may mitigate the problem, but still;
  • if something creates regular files in the meantime, it may happen you will find more regular files than files in total; each number will be in some sense correct at the time it is calculated, yet as a tuple they won't make sense.

For these reasons one may want a single find to give us two (or more) numbers somehow.

Note: from now on I don't bother myself with stopping find . from testing . itself or from being recursive. Additionally for brevity I will use non-portable -printf (that works in GNU find) where I need it; the examples way above should suffice if you need a portable equivalent.

This (sub-optimal) code counts the total number of files and the number of regular files, using one find:

find . -printf 'files total\n' \
       -type f -printf 'regular files\n' \
| sort | uniq -c

And this (sub-optimal) code counts the number of directories, the number of regular files, the number of symlinks, finally other files:

find . \( -type d -printf 'directories\n' \) \
    -o \( -type f -printf 'regular files\n' \) \
    -o \( -type l -printf 'symlinks\n' \) \
    -o -printf 'of other types\n' \
| sort | uniq -c

It's sub-optimal because there are at least three problems with it:

  • The counting is performed by uniq -c, it needs prior sort. But in general sorting is not required to count things: you see a thing of some type and you increase the respective counter. If the directory tree is huge then sort will do a lot of work. It will be good to replace sort | uniq -c with some tool(s) more fitted for the job.
The order of lines in the final output depends on how sort orders the strings. You would probably prefer other types to appear in the last line of output, but we cannot easily control the order.
  • If there are no symlinks at all then there will be no line stating 0 symlinks. Suppose you see 2 of other types and no line mentioning symlinks. Then it's natural to assume "other types" include symlinks, but this is not true.

With awk we can solve all the three problems:

find . \( -type d -printf 'd\n' \) \
    -o \( -type f -printf 'f\n' \) \
    -o \( -type l -printf 'l\n' \) \
    -o -printf 'o\n' \
| awk '
    BEGIN {count["d"]=0; count["f"]=0; count["l"]=0; count["o"]=0}
    {count[$0]++}
    END {
      printf("%s directories\n", count["d"])
      printf("%s regular files\n", count["f"])
      printf("%s symlinks\n", count["l"])
      printf("%s of other types\n", count["o"])
      print "------"
      printf("%s files total\n", count["d"]+count["f"]+count["l"]+count["o"])
    }'

Now we are in control of what our code prints. We can even make it print lines like N_DIRS=123 and eval the output in a shell script, so shell variables are created to be used later in the script.
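A sketch of that eval idea, assuming GNU find's -printf; the variable names (N_DIRS, N_FILES) and the two-way split are my own illustration:

```shell
# make find/awk emit shell assignments, then eval them into variables
eval "$(
  find . \( -type d -printf 'd\n' \) -o -printf 'f\n' \
  | awk '{c[$0]++} END {printf "N_DIRS=%d N_FILES=%d\n", c["d"], c["f"]}'
)"
echo "$N_DIRS directories, $N_FILES non-directories"
```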

Note how I used awk to sum four numbers. I could make find additionally print t (like "total") for any single file and count appearances of t with awk, then I wouldn't need to sum in awk. My point is the general scheme is quite flexible. It's basically this:

  1. find performs tests of our choice and prints tokens (lines) of our choice. A file may generate zero, one or more tokens, depending on what we want. Obviously the more you are familiar with find, the more complex logic you can code without bugs.
  2. awk counts how many times each token appears.
  3. awk may do additional calculations.
  4. awk prints the result, using format of our choice.

Even more complex code, shell function

The following function never tests . (because I think usually you don't want to count the current working directory). It's recursive when invoked as count -R, non-recursive otherwise. If needed, get rid of non-portable -printf like we did earlier in this answer.

# should work in sh
count() (
   unset IFS
   if [ "$1" = -R ]; then
      arg=''
   else
      arg='-prune'
   fi

   find . ! -name . $arg \( \
         \( -type d \( -name '.*' -printf 'dh\n' -o -printf 'dn\n' \) \) \
      -o \( -type f \( -name '.*' -printf 'fh\n' -o -printf 'fn\n' \) \) \
      -o \( -type l \( -name '.*' -printf 'lh\n' -o -printf 'ln\n' \) \) \
      -o \(            -name '.*' -printf 'oh\n' -o -printf 'on\n' \) \
                         \) \
   | awk '
      BEGIN {
         count["dn"]=0; count["dh"]=0
         count["fn"]=0; count["fh"]=0
         count["ln"]=0; count["lh"]=0
         count["on"]=0; count["oh"]=0
      }
      {  count[$0]++ }
      END {
         tn=count["dn"]+count["fn"]+count["ln"]+count["on"]
         th=count["dh"]+count["fh"]+count["lh"]+count["oh"]
         t=tn+th
         printf("%9d directories    (%9d non-hidden, %9d hidden)\n", count["dn"]+count["dh"], count["dn"], count["dh"])
         printf("%9d regular files  (%9d non-hidden, %9d hidden)\n", count["fn"]+count["fh"], count["fn"], count["fh"])
         printf("%9d symlinks       (%9d non-hidden, %9d hidden)\n", count["ln"]+count["lh"], count["ln"], count["lh"])
         printf("%9d of other types (%9d non-hidden, %9d hidden)\n", count["on"]+count["oh"], count["on"], count["oh"])
         print "-----------------------------------------------------------------"
         printf("%9d files total    (%9d non-hidden, %9d hidden)\n", t, tn, th)
    }'
)

I guess the awk code is less elegant than it could be; and I'm not really sure if it's totally portable (I haven't studied the specification thoroughly).

In general one should double-quote variables like $arg, but here we need $arg to disappear when empty. The possible values are safe, unless $IFS contains a character that appears in the string -prune. I deliberately designed the function to always run in a subshell (f() (…) syntax instead of f() {…}) where I unset IFS to make sure the unquoted $arg will always work.

Example output:

$ count -R
    40130 directories    (    40043 non-hidden,        87 hidden)
   363974 regular files  (   362220 non-hidden,      1754 hidden)
     6797 symlinks       (     6793 non-hidden,         4 hidden)
       25 of other types (       25 non-hidden,         0 hidden)
-----------------------------------------------------------------
   410926 files total    (   409081 non-hidden,      1845 hidden)