How can I operate on all files of a certain type if they might not have the right extension?
This question is prompted by a short script I found in a Linux magazine. As evidence that I didn't make this up, here's a picture of it:
I would like to write a letter to the editor of this publication about what's wrong with this and how to write it better.
The script attempts to capture JPEG files into a variable, so that something (compression using lepton) can be done with them.
for jpeg in `echo "$(file $(find ./ ) |
grep JPEG | cut -f 1 -d ':')"`
do
/path/to/command "$jpeg"
...
Apparently in this instance we can't trust the files to be named with a .jpg extension, so we can't catch them with something like
for f in *.JPG *.jpg *.JPEG *.jpeg ; do ...
because the writer has used file to check their type. But if the filenames can't be trusted to have a sensible extension, then I don't see how we can trust them not to be -rf * or (; \ $!| or have newlines or whatever else.
How can I sanely capture files into a variable by type with for or while, or perhaps avoid doing so by using find with -exec, or some other method?
Bonus for insights into and demonstrations of what's wrong with the code in the picture.
I've tagged this question with [bash] since it's about a bash script, but if you feel like answering with a way to do this that doesn't use bash, then please feel free to do that.
0. The script wants to do something like this.
The script shown in your question tries to enumerate files and check if they are JPEGs, but does neither reliably. It tries to pass all the paths to file in a single run and extract both filenames and types from the output of file, which is reasonable since it may be faster than running file again and again for each file. But to do it correctly, you need to be careful about how the paths are passed to file, how file delimits its output, and how you consume that output. You can use this:
#!/bin/bash
find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
read -r mimetype
case "$mimetype" in image/jpeg)
# Bash placed the filename in "$REPLY" -- put commands that use it here.
# You can have as many commands as you want before the closing ";;" token.
;;
esac
done
That's one of several correct ways. (It does not need to set IFS=; see below.) find with + passes multiple path arguments to file and only runs it as many times as necessary to process them all, usually just once. Credit goes to αғsнιη for the idea of passing --mime-type to file to obtain the MIME type, which contains the information you actually want and is easy to parse.
A detailed explanation follows. I've used the specific task of JPEG compression as an example. That's what the script you showed is for, and lepton
has some oddities that should be considered in deciding how to improve that script. If you just want to see a script that runs lepton
on each JPEG file, you can skip to section 7. Putting It All Together.
The term path has several definitions. In this answer I use it to mean pathname.
1. Installing lepton
The script you showed is meant to traverse a directory hierarchy, find JPEG images, and process them with the lossless JPEG compressor lepton
. For the main motivation of your question, the command may not really matter, but different commands have different syntax. Some commands accept multiple input filenames for a single run. Most accept --
to indicate the end of options. I'll use lepton
as my example. The lepton
command doesn't accept multiple input filenames and doesn't recognize --
.
To use lepton
, install it first. It's officially packaged for Ubuntu 17.04 and later (sudo apt install lepton
). For earlier Ubuntu releases, or to use a newer version than is packaged for your release, clone its git
repository (git clone https://github.com/dropbox/lepton.git
) and build the source as instructed in the README. Or you might be able to find a PPA.
Depending how you install it, lepton
may be in /usr/bin
, /usr/local/bin
, or elsewhere. Probably you will want it somewhere in $PATH
; then you can run it as lepton
. The script you showed uses absolute paths to lepton
and the standard utilities mv
and rm
, but not to the other standard utilities file
, find
, grep
and cut
. (This is Bash, so echo
--pointless in that script anyway--is a shell builtin. exit
is always a builtin.) Though this isn't one of the script's serious flaws, there's no discernible reason for such inconsistency. Unless you're writing a script to tolerate not having $PATH
set sensibly--in which case you must use absolute paths for all external commands--I suggest invoking standard commands, and those you've installed, by name and letting $PATH resolve them.
2. Running lepton
Cautions and General Information
I tested with lepton v1.0-1.2.1-104-g209463a (from Git). lepton
was released back in July 2016 so I'd guess the current syntax will keep working. But future versions may add features. If you're reading this years from now, you might check if lepton
has added support for tasks that once required scripting.
Please be careful what command-line arguments you pass. For example, I tried running lepton
with -verbose
as the first argument and art.jpg
as the second. It interpreted -verbose
as an input filename and quit with an error, but not before truncating art.jpg
--which it interpreted as an output filename--down to zero bytes. Fortunately I had a backup!
You can pass zero, one, or two paths to lepton
. In all cases, it examines its input file or stream to see if it contains JPEG or Lepton data. JPEG is compressed to Lepton; Lepton is decompressed to JPEG. lepton
will remove and add file extensions but doesn't use them to decide what to do.
Zero Filenames — lepton -
reads from stdin and writes to stdout.
Thus lepton - < infile > outfile
is one way to read from infile
and write to outfile
, even if their names start with -
(like options do). But the method I'll use passes paths that start with .
, so I won't have to worry about this.
One Filename — lepton infile
reads infile
and names its own output file.
This is how the script you showed uses lepton
.
If the content of infile
looks like a JPEG, lepton
outputs a Lepton file; if its content looks like a Lepton file, lepton
outputs a JPEG. lepton
decides how it wants to name its output file by stripping an extension from infile
, if any, and adding either a .jpg
or .lep
extension depending on what kind of file it is creating. But it does not use the extension it is removing (if any) to infer the type of file it is operating on.
It considers the last .
and anything after it as an extension. If infile
is a.b.c
, you get a.b.lep
or a.b.jpg
. If the filename starts with a .
with no other .
s, lepton
still regards that as an extension: from a JPEG called .abc
you get .lep
. Only .
in the filename--not directory names--triggers this, so from a Lepton file x/fo.o/abc
you get x/fo.o/abc.jpg
(which you want), not x/fo.jpg
(which would be bad).
If the output filename obtained this way names an existing file, _s are added to the end, after the extension, until it doesn't, and the name with added underscores is used: abc.lep, abc.lep_, abc.lep__, etc.; xyz.jpg, xyz.jpg_, xyz.jpg__, etc.
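The naming rule described above can be modelled in a few lines of Bash. This is only a sketch of the behavior as described in this answer, for the JPEG-to-Lepton direction; predict_lep_name is a hypothetical helper name and is not part of lepton.

```bash
#!/bin/bash
# Hypothetical helper (not part of lepton): predicts the output name that
# the one-path syntax would choose for a JPEG, following the rule above.
predict_lep_name() {
    local path=$1 dir base out
    case "$path" in
        */*) dir=${path%/*}/ base=${path##*/} ;;   # split off the directory part
        *)   dir= base=$path ;;
    esac
    # Drop the last "." and everything after it, in the filename only.
    out=$dir${base%.*}.lep
    # Append underscores until the name does not refer to an existing file.
    while [ -e "$out" ]; do out+=_; done
    printf '%s\n' "$out"
}

predict_lep_name 'a.b.c'
predict_lep_name '.abc'
predict_lep_name 'x/fo.o/abc'
predict_lep_name '01. Milan wide-angle sunset'
```

Running it in a directory where none of the predicted names exist shows the stripped names, including the surprising 01.lep for the last example.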
This works best when your files are named in a sensible way.
Automatically removing and adding extensions and adding underscores avoids a problem you'd otherwise have to manage yourself--preventing data loss when the output file already exists. But it also exposes what might be a deep design flaw in the script you showed. If your files are named sensibly, then all your JPEG files end in .jpg
or .jpeg
(maybe capitalized), and no non-JPEG files are so named. But then you don't have to examine the files with file
to find out which ones are JPEGs!
Thus the premise of the script you showed is that files might not be named reasonably. It's always bad for a script to behave wrong or unexpectedly on filenames containing spaces, *
, and other special characters. So its behavior of splitting on whitespace and expanding globs (the outer unquoted command substitution, intended just to split separate filenames, does this) is especially bad. See Byte Commander's excellent answer for details. This is probably the worst flaw in the script you showed.
But it's also worth considering what happens to filenames whose last .
doesn't conceptually begin a file extension. Suppose Pictures
has four files, all JPEGs: 01. Milan wide-angle sunset
, 01. Milan wide-angle sunset highres
, 02. Kyle birthday party prep - blooper cakes
, and 03. The subtle found art of unopened expired paint cans with peeling labels
. Then for f in ~/Pictures/0*; do lepton "$f"; done
creates 01.lep
, 01.lep_
, 02.lep
, and 03.lep
--probably not what you want.
If you have JPEGs not named .jpg
or maybe .jpeg
, the best general approach is to rename them that way and investigate any naming conflicts that arise while doing so. But that's beyond the scope of this answer.
Those renaming problems happen with JPEGs not named like JPEGs, not non-JPEGs named like JPEGs. Yet even then, there may be a better solution. If the problem is ._
files from macOS and you don't want to delete them, just exclude files with a leading ._
(or even a leading .
). Still, passing just one path to lepton
avoids data loss (due to its _
appending rules); if the main goal is to exclude non-JPEGs, the basic idea is sound even though the implementation needs fixing.
So I'll use the one-path lepton infile
syntax. But anyone who considers automating lepton
like this on strangely named files should remember the generated .lep
files may be named in ways that don't reveal the input filenames.
Two Filenames — lepton infile outfile
does exactly what you expect.
But just because you expect it doesn't make it the right thing to do.
As with the other ways to run lepton
, lepton
determines whether infile
is a JPEG to be compressed or a Lepton file to be decompressed by examining its content. If infile
is a JPEG, lepton
writes a Lepton file named outfile
; if infile
is a Lepton file, lepton
writes a JPEG named outfile
. With this two-path syntax, lepton
doesn't change your specified output filename in any way. It doesn't add or remove extensions or append _
s to resolve naming conflicts. If outfile
already exists, it is overwritten.
You may want that, but if not and you use this syntax then you have to solve the problem yourself by making your script adjust the output filenames. You may be able to do this in a way that serves you better than lepton
's own scheme when run with just one path argument. But I won't try to guess your specific needs and preferences; I'll just use the one-path syntax.
3. Passing Multiple Paths From find
to file
The script you showed tries to use file $(find ./ )
to pass one path per argument to file
by running find
in command substitution. This often won't work, because $(find ./ )
splits on whitespace, which filenames can contain. It is common for files--especially images!--and folders to have spaces in their names. The script you showed treats a path ./abc/foo bar.jpg
as two paths, ./abc/foo
and bar.jpg
. In the best case, neither exists; if they do, you unintentionally operate on the wrong thing. And the original path won't be processed at all.
Although the breadth of this problem can be lessened by setting IFS=$'\n'
so word splitting is only performed between lines (\n
represents a newline character), this isn't a good solution. Besides being awkward, it can still fail, as file and directory names may contain newlines. I advise against naming files or directories with them except to test programs or scripts for bugs. But such names can be created, including by accident where you don't expect them. The only characters a filename cannot contain are the path separator /
and the null character. The null character is thus the only one that can't appear in a path and the only safe choice to delimit lists of arbitrary paths. That's why find
has a -print0
action and xargs
has a -0
option.
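Here is a small sketch of why a newline-delimited list miscounts. The scratch directory and the two filenames are invented for the demonstration.

```bash
#!/bin/bash
# Scratch directory with one ordinary name and one containing a newline
# (both names are made up for this demonstration).
demo=$(mktemp -d)
nl='
'
touch "$demo/plain.jpg" "$demo/two${nl}lines.jpg"

# Newline-delimited output: the second file occupies two lines.
by_newline=$(find "$demo" -type f | wc -l)

# Null-delimited output: exactly one null character per path.
by_null=$(find "$demo" -type f -print0 | tr -dc '\0' | wc -c)

echo "lines seen:  $by_newline"
echo "paths found: $by_null"
rm -rf -- "$demo"
```

There are two files, but the newline-delimited listing has three lines. Counting null terminators gives two.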
This can be done correctly with find . -print0 | xargs -0 ... but you don't need a third utility to pass paths from find to file. find's -exec action is sufficient. Arguments after -exec build the command to run, until \; or +. find ... -exec ... \; runs a command once per file, while find ... -exec ... + passes the command as many paths as it can per run, which is usually faster. Typically all the arguments fit and the command runs just once. In rare cases the command line would be too long and find runs the command more than once. So the + form is only safe for running commands that (a) take their path arguments at the end and (b) work the same in one run with multiple filenames as they do in separate runs.
lepton
is an example of a command that must not be run using the +
form of -exec
because it does not accept multiple source filenames. The first would be the input, the second would be the output, and others would be excessive. But many commands do do the same thing when run once with several arguments as when run several times with one argument, and file
is one of them.
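To see the difference between the two forms of -exec, the sketch below counts how many times a trivial command is started. The scratch directory, the three filenames, and the sh -c 'echo run' stand-in command are all just for illustration.

```bash
#!/bin/bash
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# \; form: one sh process per file, each printing one line.
runs_each=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l)

# + form: the paths are appended to a single command line where they fit.
runs_batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "per-file form started the command $runs_each time(s)"
echo "batched form started the command  $runs_batched time(s)"
rm -rf -- "$demo"
```

With only three short paths, the batched form needs a single invocation, while the per-file form needs three.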
This command will generate the table:
find . -exec file --mime-type -r0F '' {} +
When it invokes file, find replaces {} with as many paths as will fit on one command line. The + marks the end of the command and selects this batching behavior.
The options --mime-type -r0F '' passed to file are explained below.
Some people quote {}
, e.g., '{}'
. It's fine to do so, but neither Bash nor other Bourne-style shells require it. Bash and some other shells support brace expansion, but an empty pair of braces is not expanded. I choose not to quote {}
, in light of the misconception that quoting {}
prevents find
from performing word splitting. Even if your shell required {}
to be quoted, this would still have nothing to do with word splitting, because find
never does that. (If you wanted word splitting, you'd have to tell find
to -exec
a shell.) And find
can't tell if you've written {}
or '{}'
--the shell turns '{}'
into {}
(during quote removal) before passing it to find
.
4. Emitting a Usable ⟨Path, File Type⟩ Table with file
The Problem
The reason I must pass some options to file
--and can't just use find . -exec file {} +
--is that the table file
generates by default is ambiguous:
01. Milan wide-angle sunset: JPEG image data, JFIF standard 1.01, resolution (DPI), density 1x1, segment length 16, baseline, precision 8, 1400x1400, frames 3
02. Kyle birthday party prep - blooper cakes: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 512x512, frames 3
first line
second line: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x500, frames 3
Those three rows look like four; one filename contains a newline. Filenames can also contain colons, so it won't always be clear where the filename ends. Way more confusing examples than shown above are possible.
The description column also has way more information than we need. Byte Commander explains one reason grep
ing for JPEG
in each whole row returns wrong results: a non-JPEG file with JPEG
in its name gives a false positive. (The point of checking the type is that you can't rely on the name, so this is quite a self-defeating bug in the script you showed.) But even when you know you're looking in the description column, it may still contain JPEG
even if that's not the type:
$ touch empty.JPEG # not a JPEG
$ gzip -k empty.JPEG
$ file empty.JPEG*
empty.JPEG: empty
empty.JPEG.gz: gzip compressed data, was "empty.JPEG", last modified: Mon Aug 28 16:37:56 2017, from Unix
Byte Commander's answer solved this by (a) passing the -b
option to file
, causing it to omit the paths, :
separator, and spaces in front of the type, then (b) using grep
to check if the description begins with JPEG
(the ^
anchor in the pattern ^JPEG image data,
does this). This works if you keep track of the paths passed to file
--not a problem for Byte Commander's method, which ran file
separately for each path anyway.
The Solution
I must use a different solution, because my goal is to parse both paths and types from file
's output so that file
needn't be run separately for each file. Fortunately file
in Ubuntu has many options. I use file --mime-type -r0F '' paths
:
- --mime-type prints a MIME type rather than a detailed description. This is all I need, and then I can just perform an exact match against the whole thing. For a JPEG, file --mime-type shows image/jpeg in the description column. (See also αғsнιη's answer.)
- According to man file, -r causes unprintable characters not to be replaced with octal escapes like \003. I believe I would otherwise need to add a step to convert such sequences back to the actual characters, which probably can't be done reliably--what if such a sequence appears literally in a filename? (file doesn't escape \ as \\.) I say "I believe" as I haven't managed to get file to print out such an escape sequence, and I'm not sure it really does so in the filename column. Either way, -r is safe here.
- -0 is the key option here. Without it, this method couldn't work reliably. It makes file print a null character--the one character that is never allowed in paths because it is usually used to mark the ends of strings in C programs--immediately after the filename. This marks the break, in each row, between the two columns of the table.
- -F '' makes file print nothing ('' is an empty argument) instead of :. The colon is unreliable (it can appear in filenames) and of no benefit here since a null character is already being printed to indicate the end of the path column and the start of the description column.
To make find
run file --mime-type -r0F '' paths
I use -exec file --mime-type -r0F '' {} +
. find
's -exec
action replaces {} +
with the paths.
5. Consuming the Table
I created the table this way:
find . -exec file --mime-type -r0F '' {} +
As detailed above, this places a null character after each path. It would be handy if the description were also null-terminated, but file
won't do that--the description always ends with a newline. So I must alternately read until a null character, then assume there is more text and read it until a newline. I must do this for each file and stop when nothing is left.
Reading Each Row
That combination--read text that may contain a newline until a null character, then read text that can't contain a newline until a newline--isn't how any of the common Unix utilities are normally used. The approach I will take is to pipe the output of find
to a loop. Each iteration of the loop reads a single row of the table by using the read
shell builtin twice, with different options.
To read the path, I use:
read -rd ''
- -r is read's only standard option and you should almost always use it. Without it, backslash escapes like \n from the input are translated into the characters they represent. We don't want that.
- Normally, read reads until it sees a newline. To ignore newlines and stop at a null character instead, I use the -d option, which Bash provides, to specify a different character. For a null character, pass the empty argument ''.
- I'm already using a Bash extension (the -d option), so I may as well avail myself of Bash's default behavior when no variable name is passed to read. It puts everything it read--except the terminating character--in the special variable $REPLY. Normally read strips whitespace ($IFS characters) from the beginning and end of the input, and it's a common idiom to write IFS= read ... to prevent that. When reading implicitly to $REPLY in Bash, this is not necessary.
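The whitespace behavior just described can be checked without find or file. In this sketch the padded string stands in for a path field followed by a null character. The variable names implicit, named, and kept are arbitrary. Process substitution is used instead of a pipe so the variables are still set afterwards.

```bash
#!/bin/bash
# No variable name: Bash stores the text in REPLY with whitespace intact.
read -rd '' < <(printf '  padded name  \0')
implicit=$REPLY

# Named variable with default IFS: leading and trailing whitespace is stripped.
read -rd '' named < <(printf '  padded name  \0')

# Named variable with IFS emptied for this one command: whitespace is kept.
IFS= read -rd '' kept < <(printf '  padded name  \0')

printf '[%s]\n' "$implicit" "$named" "$kept"
```

Only the middle line of output loses its padding.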
To read the description, I use:
read -r mimetype
- No backslashes should appear in the MIME type, but it's good practice to pass -r to read unless you want \ escapes translated.
- This time, I am specifying a variable name explicitly. Call it what you like. I've chosen mimetype.
- This time, the absence of IFS= to prevent leading and trailing whitespace from being stripped is significant. I want it removed. This drops the spaces from the beginning of the description that file writes to make the table more human-readable when it is shown in a terminal.
Composing the Loop
The loop should continue as long as there is another path to be read. The read
command returns true (in shell programming this is zero, unlike almost all other programming languages) when it successfully reads something, and false (in shell programming, any nonzero value) when it doesn't. So the common while read
idiom is useful here. I pipe (|
) the output of find
--which is the output of one or (rarely) more file
commands--to the while
loop.
find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
read -r mimetype
# Commands using "$REPLY" and "$mimetype" go here.
done
Inside the loop, I read the rest of the row to obtain the description (read -r mimetype
). I don't bother checking if this succeeded. file
should only ever output complete rows even if it encounters errors. (file
sends error and warning messages to standard error, so they won't appear in the pipeline to corrupt the table.) You should be able to rely on this.
If you want to check if read -r mimetype
succeeded anyway, you can use if
. Or you can include it in the while
loop condition:
find . -exec file --mime-type -r0F '' {} + |
while read -rd '' && read -r mimetype; do
# Commands using "$REPLY" and "$mimetype" go here.
done
You can see I also split the top line for readability. (No \
is required to split at |
.)
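If you'd like to try the loop logic without touching real files, you can feed it a made-up table. In the sketch below, fake_table is a hypothetical stand-in that prints rows in the shape described above (path, null character, padding, MIME type, newline), and the paths in it are invented. The loop reads from process substitution rather than a pipe. In a pipeline, Bash normally runs the loop in a subshell, so the jpegs array would be lost when the loop ends.

```bash
#!/bin/bash
# Hypothetical stand-in for: find . -exec file --mime-type -r0F '' {} +
# Each row is: path, null character, padding spaces, MIME type, newline.
fake_table() {
    printf './art.jpg\0  image/jpeg\n'
    printf './fo.o/abc\ndef \0 image/jpeg\n'
    printf './notes.txt\0 text/plain\n'
}

jpegs=()
while read -rd '' && read -r mimetype; do
    # REPLY holds the path exactly as written; mimetype has its padding stripped.
    case "$mimetype" in image/jpeg) jpegs+=("$REPLY");; esac
done < <(fake_table)

printf 'found %d JPEG path(s)\n' "${#jpegs[@]}"
printf '[%s]\n' "${jpegs[@]}"
```

The second path keeps both its embedded newline and its trailing space.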
Testing the Loop
If you want to test the loop before proceeding, you can put this command under (or instead of) the # Commands...
comment:
printf '[%s] [%s]\n\n' "$REPLY" "$mimetype"
The loop output looks something like this, depending on what you have in the directory (and I have left out most entries, for brevity):
[.] [inode/directory]
[./stuv] [inode/x-empty]
[./ghi
jkl] [inode/x-empty]
[./fo.o/abc
def ] [image/jpeg]
[./fo.o/wyz.lep] [application/octet-stream]
[./fo.o/wyz] [image/jpeg]
This is just to see if the loop works right. Placing the table's entries in [
]
like this wouldn't help the script do what it needs to do, as paths may contain [
, ]
, and consecutive newlines.
6. Using the Extracted Path and File Type
In each iteration of the loop, "$REPLY"
contains the path and "$mimetype"
contains the type description. To find out if "$REPLY"
names a JPEG file, check if "$mimetype"
is exactly image/jpeg
.
You can compare strings using if
and [
/test
(or [[
) with =
. But I prefer case
:
find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
read -r mimetype
case "$mimetype" in image/jpeg)
# Put commands here that use "$REPLY".
;;
esac
done
If you just wanted to show the JPEGs' paths in the same format as above--to help test with paths containing newlines--the entire case
...esac
statement could be:
case "$mimetype" in image/jpeg) printf '[%s]\n\n' "$REPLY";; esac
But the goal is to run lepton
on each JPEG file. To do that, use:
case "$mimetype" in image/jpeg) lepton "$REPLY";; esac
7. Putting It All Together
Adding that lepton
command, and a hashbang line to run it with Bash, here's the complete script:
#!/bin/bash
find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
read -r mimetype
case "$mimetype" in image/jpeg) lepton "$REPLY";; esac
done
lepton
reports what it is doing but it doesn't show filenames. This alternative script prints a message with each path before running lepton
on it:
#!/bin/bash
find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
read -r mimetype
case "$mimetype" in image/jpeg)
printf '\nProcessing "%s":\n' "$REPLY" >&2
lepton "$REPLY"
esac
done
I've printed the messages to standard error (>&2
), since that's where lepton
sends its own messages. That way, the output all stays together when piped or redirected. Running that script produces output like this (but more of it if you have more than two JPEGs):
Processing "./art.jpg":
lepton v1.0-1.2.1-104-g209463a
6777856 bytes needed to decompress this file
56363 86007
65.53%
2635854 bytes needed to decompress this file
56363 86007
65.53%
Processing "./fo.o/abc
def ":
lepton v1.0-1.2.1-104-g209463a
6643508 bytes needed to decompress this file
36332 46875
77.51%
2456117 bytes needed to decompress this file
36332 46875
77.51%
The repetition in each stanza--which also appears when you run lepton
without printing filenames--is because lepton
checks that its output files can decompress correctly.
The script you showed had exit 0
at the end. You can do that if you like. It causes the script to always report success. Otherwise the script returns the exit status of the last command run--which is probably preferable. Either way, it may report success even if find
, file
, or lepton
encountered problems, if the last lepton
command succeeded. You can, of course, expand the script with more sophisticated error handling code.
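As one possible starting point for error handling, Bash's pipefail option changes how a pipeline's exit status is computed. The sketch below uses false | true as a stand-in for a pipeline whose first command fails. The variable names are arbitrary.

```bash
#!/bin/bash
# By default a pipeline reports only the status of its last command.
plain_status=0
false | true || plain_status=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
strict_status=0
false | true || strict_status=$?
set +o pipefail

echo "default: $plain_status, pipefail: $strict_status"
```

In the script above, enabling pipefail would let a failing find show up in the script's exit status. Failures of individual lepton runs inside the loop would still need to be tracked separately, for example by counting them.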
8. Maybe You Want The Paths, Too
If you want to generate a list of paths separate from lepton
's own output, you can take advantage of lepton
's behavior of writing to standard error by printing the paths to standard output instead. In that case, you probably want to print just the paths and not a "Processing" message. You may optionally want to terminate the paths with null characters instead of newlines, as then you can process the list without breaking on paths that contain newlines.
#!/bin/bash
case "$1" in
-0) format='%s\0';;
*) format='%s\n';;
esac
find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
read -r mimetype
case "$mimetype" in image/jpeg)
printf "$format" "$REPLY"
lepton "$REPLY"
esac
done
When you run that script, you can pass the -0
flag to make it emit null characters instead of newlines. That script does not do proper Unix-style option processing: it only checks the first argument you pass; passing the flag repeatedly in the same argument (-00
) doesn't work; and no option-related error messages are ever generated. This limitation is for brevity, and because you probably don't need anything more sophisticated, as the script doesn't support any non-option arguments and -0
is the only possible option.
On my system I called that script jpeg-lep3
and put it in ~/source
, then ran ~/source/jpeg-lep3 -0 > out
, which printed just lepton
's output to my terminal. If you do something like that, you can test that null characters were properly written between paths using:
xargs -0 printf '[%s]\n\n' < out
Code first:
Let's do this with Bash's special globs and a for
loop:
#!/bin/bash
shopt -s globstar dotglob
for f in ./** ; do
if file -b -- "$f" | grep -q '^JPEG image data,' ; then
# do whatever you want with the JPEG file "$f" in here:
md5sum -- "$f"
fi
done
Explanation:
First of all, we need to make the Bash globs more useful by enabling the globstar
and dotglob
shell options. Here is their description from man bash
in the SHELL BUILTIN COMMANDS section about shopt
:
dotglob
If set, bash includes filenames beginning with a `.' in the results of
pathname expansion.
globstar
If set, the pattern ** used in a pathname expansion context will match
all files and zero or more directories and subdirectories. If the pattern
is followed by a /, only directories and subdirectories match.
Then we use this new "recursive glob" ./**
in a for
loop to iterate over all files and folders inside the current directory and all its subdirectories. Please always use absolute paths or explicit relative paths starting with a ./
or ../
in your globs, not just **
, to prevent problems with special file names like ~
.
Now we test each file (and folder) name with the file
command for its contents. The -b
option prevents it from printing the file name again before the content information string, which makes filtering more safe.
Now we know that the content information of all valid JPG/JPEG files must start with JPEG image data,
, which is what we test the output of file
for with grep
. We use the -q
option to suppress any output, as we are only interested in grep
's exit code, which indicates if the pattern matched or not.
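To see what the ^ anchor buys us, we can apply the same pattern to two description strings of the kind file -b prints. The helper name is_jpeg_description is made up for this sketch, and the two strings are samples rather than output captured from your files.

```bash
#!/bin/bash
# Hypothetical helper: applies the anchored pattern to one description string.
is_jpeg_description() {
    printf '%s\n' "$1" | grep -q '^JPEG image data,'
}

real=no
fake=no
is_jpeg_description 'JPEG image data, JFIF standard 1.01' && real=yes
is_jpeg_description 'gzip compressed data, was "empty.JPEG"' && fake=yes

echo "JPEG description matched: $real"
echo "gzip description matched: $fake"
```

The second string contains JPEG, but not at the start, so the anchored pattern rejects it.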
If it matched, the code inside the if
/then
block will be executed. We can do anything we want in here. The current JPEG filename is available in the shell variable $f
. We just have to make sure to always put it in double quotes to prevent the accidental evaluation of filenames with special characters like spaces, newlines, or symbols. It is also usually best to separate it from other arguments by placing it after --
, which causes most commands to interpret it as a filename even if it's something like -v
or --help
that would otherwise be interpreted as an option.
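The following sketch shows why the ./ prefix matters. It creates a scratch directory containing a file called -rf (an invented example) and counts how many glob results begin with a dash. count_dashed is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: counts arguments that start with "-".
count_dashed() {
    local n=0 name
    for name in "$@"; do
        case "$name" in -*) n=$((n + 1));; esac
    done
    echo "$n"
}

demo=$(mktemp -d)
touch "$demo/-rf" "$demo/plain"

cd "$demo" || exit 1
bare=$(count_dashed *)        # a bare glob yields "-rf" as is
prefixed=$(count_dashed ./*)  # every result starts with "./"
cd - >/dev/null || exit 1
rm -rf -- "$demo"

echo "option-like names from *:   $bare"
echo "option-like names from ./*: $prefixed"
```

With the bare glob, one name could be mistaken for an option by a command that receives it. With the ./ prefix, none can. Adding -- gives the same protection for commands that support it.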
Bonus question:
Time to blow up some code, for science! Here is the version from your question/book:
for jpeg in `echo "$(file $(find ./ ) |
grep JPEG | cut -f 1 -d ':')"`
do
/path/to/command "$jpeg"
done
First of all, allow me to mention how needlessly complicated this is. We have several levels of nested command substitution, written in two different syntaxes (`` and $()), and they are only necessary because of the incorrect/suboptimal usage of find.
Here find
just lists all files and prints their names, one per line. Then the full output is passed to file
to examine each of them. But wait! One file name per line? What about file names containing newlines? Right, those will break it!
$ ls --escape ne*ne
new\nline
$ file $(find . -name 'ne*ne' )
./new: cannot open `./new' (No such file or directory)
line: cannot open `line' (No such file or directory)
Actually even simple spaces break it too, because the shell's word splitting treats them as separators as well, so file receives the pieces as separate arguments. You can't even quote the "$(find ./ )" here as a remedy, because that would then pass the whole multi-line output as one single filename argument.
$ ls simple*
simple spaces.jpg
$ file $(find ./ -name 'simple*')
./simple: cannot open `./simple' (No such file or directory)
spaces.jpg: cannot open `spaces.jpg' (No such file or directory)
Next step, the file
output gets scanned with grep JPEG
. Don't you think it's a bit easy to trick such a simple pattern, especially as the output of plain file
always contains the file name as well? Basically everything with "JPEG" in its file name will trigger a match, no matter what it contains.
$ echo "to be or not to be" > IAmNoJPEG.txt
$ file IAmNoJPEG.txt | grep JPEG
IAmNoJPEG.txt: ASCII text
Okay, so we have the file
output of all JPEG files (or those who pretend to be one), now they process all lines with cut
to extract the original file name from the first column, separated by a colon... Guess what, let's try this on a file with a colon in its name:
$ ls colon*
colons:evil.jpeg
$ file colon* | grep JPEG | cut -f 1 -d ':'
colons
So to conclude, the approach from your book works, but only if all files it checks do not contain any spaces, newlines, colons and probably other special characters and do not contain the string "JPEG" anywhere in their filenames. It is also kind of ugly, but as beauty lies in the eye of the beholder, I'm not going to ramble about that.