What is a reliable (dependable) way to count files using Bash?
Counting files ...
Example #1: ls
code to count directories and count regular files with Ubuntu 20.04.3 or specifically
bash50171$
ls -1 | wc -l
Good news, above ls
code handles a file name with spaces since it puts single quotes ' '
around 'file name'
Bad news, above ls
code miscounts: One (1) file with a newline in its file name, counts as two (2) files.
Problem: miscounting files.
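For instance, here is a minimal sketch of the miscount (the /tmp/count-demo directory is just an arbitrary scratch location for the demonstration):
# create an empty scratch directory (any empty directory will do)
mkdir /tmp/count-demo && cd /tmp/count-demo
# create one file whose name contains a newline
touch a$'\n'b
# prints 2, although the directory contains only 1 file
ls -1 | wc -l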
Example #2: ls code that gives the correct file count:
ls -1qi | grep -o '^ *[0-9]*' | wc -l
The above ls command correctly counts files with a newline, because what it counts is a list of inode numbers.
The shortened ls command is:
ls -1qi
It correctly shows a file name with spaces using ' '. It correctly shows a file name with a newline using ' '.
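Continuing the scratch-directory demonstration from Example #1 (the inode number below is made up; yours will differ): with -q the embedded newline is shown as ?, so each entry stays on one line, and counting the leading inode numbers yields the correct result.
$ ls -1qi | cat
5242881 a?b
$ ls -1qi | grep -o '^ *[0-9]*' | wc -l
1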
How to create the problem files? Use:
touch 'a b' 'a b' a$'\xe2\x80\x82'b a$'\xe2\x80\x83'b a$'\t'b a$'\n'b
To run the command recursively, add R:
ls -1qRi
ls -1qRi | grep -o '^ *[0-9]*'
ls -1qRi | grep -o '^ *[0-9]*' | wc -l
Problem:
- When to use ls in code?
- When not to use ls in code?
Reference A: Why you shouldn't parse the output of ls is explained here. In short, parsing ls is bad practice.
Reference B: This post explains why not to parse ls (and what to do instead).
The Example #2 code skirts one problem (one snag): a file with a newline in its file name.
ls -1qi | grep -o '^ *[0-9]*' | wc -l
What counting code is more reliable than Example #2?
ls -1qi | grep -o '^ *[0-9]*' | wc -l
Reliable counting code means code that correctly does the following:
- Count directories
- Count regular files
- Count symbolic links
- Count hidden files
- Count and display a file name with spaces
- Count and display a file name with a newline
- Count in one (1) directory
- Count recursively
Said another way: what is a reliable (dependable) way to count files?
Solution 1:
Analysis
ls with -1 and -q is not the worst way to count files in some cases. Both options are defined in the POSIX specification of ls, so you can call them portable.
The standard "do not parse ls" advice argues against using ls to get filenames reliably. If you want to count entries then you don't really need the filenames, and carefully used ls may sometimes work for you. There are general problems though:
- To tell apart regular files from symlinks or directories you need to use -l and to examine drwxr-xr-x or similar strings. If you want to tell apart hidden from non-hidden at the same time (and you use -a to print hidden files) then you need to check if there is a dot at the beginning of the filename. It's not trivial with -l because a dot with a totally different meaning may appear earlier in the line. Yes, ls -p can help with spotting directories without -l, but this only works for directories. And there is ls -F that can be kinda helpful. Depending on what files you want to count, different options for ls along with different patterns for grep are needed (see the sketch after this list). This may turn ugly quite fast. Note an approach like this is exactly what we mean by "parsing" in this context: analyzing some possibly convoluted structure to get the information you need. This leads us to the main problem.
- The output of ls is not designed to be parsed. It is designed to be easily readable by humans. Parsing ls is like hammering a screw: in some cases the end result may be acceptable, but the hammer is not for this.
- If you get used to parsing the output of ls in cases where it can work, then you will be more eager to parse ls in cases where it cannot work reliably. If all you have is a hammer, everything looks like a nail.
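To make that concrete, here is a sketch of the kind of ls parsing the first point describes, for counting non-hidden directories and regular files; it is shown as an illustration of the anti-pattern, not a recommendation, and it still breaks for names containing newlines:
# fragile: count entries whose ls -l type column is 'd' (directories)
ls -l | grep -c '^d'
# fragile: count entries whose ls -l type column is '-' (regular files)
ls -l | grep -c '^-'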
Basic solution: find
What is the right replacement for ls here? IMO it's find. Let's build an example command from scratch and analyze some quirks.
First of all, the default action in find is -print. It prints pathnames to stdout as newline-terminated strings. If a pathname itself contains at least one newline character then there will be more lines than files. This means find . | wc -l is not a good way to count all files. The right portable way is:
# count all files `find' can find, starting from `.', recursively
find . -exec printf a \; | wc -c
where a can be any one-byte character. For each file find finds (including . itself!), one byte is printed; wc -c counts these bytes (you could just as well print fixed lines and count lines). The downside is that -exec spawns a separate printf process for every file, which is costly and slow. With GNU find you can make find itself do the job of printf:
find . -printf a | wc -c
The above command should perform better, but it's not portable. A portable and probably improved approach is to use the printf builtin of your sh:
# count all files `find' can find, starting from `.', recursively
find . -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c
With this approach one sh will process many pathnames and call printf a for each. There may be more than one sh spawned (to avoid an "argument list too long" error; find is that smart), but still far fewer than one per pathname. (Note: find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?)
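As a side note, the role of that extra find-sh word can be seen in isolation: in sh -c, the first operand after the code becomes $0, and only the remaining operands become the positional parameters that for f do …; done iterates over. A tiny sketch:
# the word after the quoted code becomes $0; the rest become "$1", "$2", ...
sh -c 'echo "$0 received: $@"' find-sh file1 file2 file3
# prints: find-sh received: file1 file2 file3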
I wrote "probably improved approach" because printf
may or may not be a builtin in your sh
. If it's not a builtin then the command will perform slightly worse than the one with -exec printf …
. In practice printf
is a builtin in basically any implementation of sh
. Still formally this is not required.
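If you are curious about your own sh, one way to check (a sketch; the exact wording of the output differs between shells, and a few minimal shells may lack the type utility) is:
# reports whether printf resolves to a shell builtin or an external command
sh -c 'type printf'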
Magic begins – adding tests
find is well equipped to perform various tests on the files it visits. Want to count regular files? Here:
# count all regular files `find' can find, starting from `.', recursively
find . -type f \
-exec sh -c 'for f do printf a; done' find-sh {} + | wc -c
Hidden files? Here:
# count all hidden files `find' can find, starting from `.', recursively
find "$PWD" -name '.*' \
-exec sh -c 'for f do printf a; done' find-sh {} + | wc -c
Note that if I used find . … then the current working directory would match -name '.*' regardless of its "real" name. I used (properly quoted) $PWD to make find recognize the current working directory under its "real" name.
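The quirk is easy to reproduce. In a scratch directory containing no hidden entries at all, the first command below still prints . (its literal basename is ., which the pattern .* matches), while the second, starting from the absolute path, prints nothing, assuming the directory's real name does not start with a dot:
# `.' matches -name '.*' even though the directory is not hidden
find . -name '.*'
# the starting point is now tested under its real name, so nothing is printed
find "$PWD" -name '.*'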
You can combine tests. Hidden regular files? Here:
# count all hidden regular files `find' can find, starting from `.', recursively
find "$PWD" -type f -name '.*' \
-exec sh -c 'for f do printf a; done' find-sh {} + | wc -c
You can test virtually anything. Remember -exec foo … \; is also a test: it succeeds iff foo returns exit status 0; this way you can build custom tests (example).
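As a sketch of such a custom test, the following counts regular files that contain the string TODO; the string and the starting point are arbitrary, and grep -q simply provides the exit status find needs:
# -exec grep -q … \; acts as a test: only files containing TODO pass it
# and reach the counting stage
find . -type f -exec grep -q 'TODO' {} \; \
       -exec sh -c 'for f do printf a; done' find-sh {} + | wc -c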
The downside is find is not easy to master. Common surprises:
- the behavior of -o (operator precedence),
- rounding (in some tests),
- the unintuitive meaning of -n, n and +n (in some tests).
You may find this answer of mine useful (especially "theory" and "pitfalls"). Simple tests like -type f are quite straightforward though.
Then there is the recursiveness. find . finds . and everything below it. In GNU find one can use -mindepth 1 to omit the starting point(s); similarly -maxdepth 1 suppresses descending into subdirectories. In other words, GNU find . -mindepth 1 -maxdepth 1 should find what ls -A prints. I believe BSD find uses -depth 1 for this (note -depth n is very different from -depth). None of these are portable. This answer provides a portable solution:
Generally though, it's depth 1 you want (-mindepth 1 -maxdepth 1) as you don't want to consider . (depth 0), and then it's even simpler: find . ! -name . -prune -extra-conditions-and-actions
And this leads us to the following example:
# count all hidden regular files `find' can find, inside `.', non-recursively
find . ! -name . -prune -type f -name '.*' \
-exec sh -c 'for f do printf a; done' find-sh {} + | wc -c
Note it's safe to use . here (no need for $PWD) because . does not pass ! -name . and therefore the fact it matches -name '.*' is irrelevant. Actually, if we used $PWD, we would complicate things because we would need to replace ! -name . with something else, and in general this would be non-trivial.
One big difference between ls … | … | wc -l and any of our find … | wc -c commands is: with find we don't parse anything. Our tests in find test directly what we want to test, not some textual representation of it; they don't rely on our understanding of the output format of any tool (like ls or whatever). Pathnames with spaces, newlines or whatever cannot break things because they never appear in anything we process.
Another important difference is the ability of find to run virtually any test.
Magic continues – multiple counters
We know how to count files matching virtually any criteria. Each of the commands we used in the previous section gives us a single number. If we wanted two numbers, e.g. the total number of files and the number of regular files, then we could run two different find … | wc -c commands. This would be sub-optimal because:
- each find would traverse the directory tree on its own; caching in the OS may mitigate the problem, but still;
- if something creates regular files in the meantime, it may happen that you find more regular files than files in total; each number will be in some sense correct at the time it is calculated, yet as a tuple they won't make sense.
For these reasons one may want a single find to give us two (or more) numbers somehow.
Note: from now on I don't bother with stopping find . from testing . itself or from being recursive. Additionally, for brevity, I will use the non-portable -printf (which works in GNU find) where I need it; the examples way above should suffice if you need a portable equivalent.
This (sub-optimal) code counts the total number of files and the number of regular files, using one find:
find . -printf 'files total\n' \
-type f -printf 'regular files\n' \
| sort | uniq -c
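Its output looks something like this (the counts are made up; uniq -c prefixes each distinct line with the number of times it occurred):
     17 files total
      9 regular files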
And this (sub-optimal) code counts the number of directories, the number of regular files, the number of symlinks, and finally files of other types:
find . \( -type d -printf 'directories\n' \) \
-o \( -type f -printf 'regular files\n' \) \
-o \( -type l -printf 'symlinks\n' \) \
-o -printf 'of other types\n' \
| sort | uniq -c
It's sub-optimal because there are at least three problems with it:
- The counting is performed by uniq -c, which needs a prior sort. But in general, sorting is not required to count things: you see a thing of some type and you increase the respective counter. If the directory tree is huge then sort will do a lot of work. It would be good to replace sort | uniq -c with some tool(s) better fitted for the job.
- The final sequence of lines depends on how sort sorts the strings. You will probably prefer of other types in the last line of output, but we cannot easily affect the sequence.
- If there are no symlinks at all then there will be no line stating 0 symlinks. Suppose you see 2 of other types and no line mentioning symlinks. Then it's natural to assume "other types" include symlinks, but this is not true.
With awk we can solve all three problems:
find . \( -type d -printf 'd\n' \) \
-o \( -type f -printf 'f\n' \) \
-o \( -type l -printf 'l\n' \) \
-o -printf 'o\n' \
| awk '
BEGIN {count["d"]=0; count["f"]=0; count["l"]=0; count["o"]=0}
{count[$0]++}
END {
printf("%s directories\n", count["d"])
printf("%s regular files\n", count["f"])
printf("%s symlinks\n", count["l"])
printf("%s of other types\n", count["o"])
print "------"
printf("%s files total\n", count["d"]+count["f"]+count["l"]+count["o"])
}'
Now we are in control of what our code prints. We can even make it print lines like N_DIRS=123 and eval the output in a shell script, so that shell variables are created to be used later in the script.
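A minimal sketch of that idea, counting just directories versus non-directories (the variable names N_DIRS and N_FILES are my own choice, -printf again requires GNU find, and eval is only safe here because we fully control what awk prints):
# capture shell-assignment lines produced by awk, then eval them
counts=$(
  find . \( -type d -printf 'd\n' \) -o -printf 'f\n' \
  | awk '
      BEGIN {count["d"]=0; count["f"]=0}
      {count[$0]++}
      END {printf("N_DIRS=%d\nN_FILES=%d\n", count["d"], count["f"])}'
)
eval "$counts"
echo "directories: $N_DIRS, non-directories: $N_FILES"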
Note how I used awk to sum four numbers. I could make find additionally print t (like "total") for every single file and count the appearances of t with awk; then I wouldn't need to sum in awk (see the sketch after the list below). My point is the general scheme is quite flexible. It's basically this:
- find performs tests of our choice and prints tokens (lines) of our choice. A file may generate zero, one or more tokens, depending on what we want. Obviously, the more familiar you are with find, the more complex logic you can code without bugs.
- awk counts how many times each token appears.
- awk may do additional calculations.
- awk prints the result, using a format of our choice.
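For completeness, here is a sketch of the t-token variant mentioned above (still GNU -printf; symlinks are folded into "of other types" to keep it short): every file prints t in addition to exactly one type token, so awk can report the total without summing.
# every file prints t plus exactly one type token
find . -printf 't\n' \( \
       \( -type d -printf 'd\n' \) \
    -o \( -type f -printf 'f\n' \) \
    -o -printf 'o\n' \) \
| awk '
    BEGIN {count["t"]=0; count["d"]=0; count["f"]=0; count["o"]=0}
    {count[$0]++}
    END {
      printf("%s directories\n", count["d"])
      printf("%s regular files\n", count["f"])
      printf("%s of other types\n", count["o"])
      printf("%s files total\n", count["t"])
    }'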
Even more complex code, shell function
The following function never tests . (because I think usually you don't want to count the current working directory). It's recursive when invoked as count -R, non-recursive otherwise. If needed, get rid of the non-portable -printf like we did earlier in this answer.
# should work in sh
count() (
unset IFS
if [ "$1" = -R ]; then
arg=''
else
arg='-prune'
fi
find . ! -name . $arg \( \
\( -type d \( -name '.*' -printf 'dh\n' -o -printf 'dn\n' \) \) \
-o \( -type f \( -name '.*' -printf 'fh\n' -o -printf 'fn\n' \) \) \
-o \( -type l \( -name '.*' -printf 'lh\n' -o -printf 'ln\n' \) \) \
-o \( -name '.*' -printf 'oh\n' -o -printf 'on\n' \) \
\) \
| awk '
BEGIN {
count["dn"]=0; count["dh"]=0
count["fn"]=0; count["fh"]=0
count["ln"]=0; count["lh"]=0
count["on"]=0; count["oh"]=0
}
{ count[$0]++ }
END {
tn=count["dn"]+count["fn"]+count["ln"]+count["on"]
th=count["dh"]+count["fh"]+count["lh"]+count["oh"]
t=tn+th
printf("%9d directories (%9d non-hidden, %9d hidden)\n", count["dn"]+count["dh"], count["dn"], count["dh"])
printf("%9d regular files (%9d non-hidden, %9d hidden)\n", count["fn"]+count["fh"], count["fn"], count["fh"])
printf("%9d symlinks (%9d non-hidden, %9d hidden)\n", count["ln"]+count["lh"], count["ln"], count["lh"])
printf("%9d of other types (%9d non-hidden, %9d hidden)\n", count["on"]+count["oh"], count["on"], count["oh"])
print "-----------------------------------------------------------------"
printf("%9d files total (%9d non-hidden, %9d hidden)\n", t, tn, th)
}'
)
I guess the awk code is less elegant than it could be, and I'm not really sure if it's totally portable (I haven't studied the specification thoroughly).
In general one should double-quote variables like $arg, but here we need $arg to disappear when empty. The possible values are safe, unless $IFS contains a character appearing in the string -prune. I deliberately designed the function to always run in a subshell (the f() (…) syntax instead of f() {…}), where I unset IFS to make sure the unquoted $arg will always work.
Example output:
$ count -R
40130 directories ( 40043 non-hidden, 87 hidden)
363974 regular files ( 362220 non-hidden, 1754 hidden)
6797 symlinks ( 6793 non-hidden, 4 hidden)
25 of other types ( 25 non-hidden, 0 hidden)
-----------------------------------------------------------------
410926 files total ( 409081 non-hidden, 1845 hidden)