How can I delete a file if it starts with <html> in bash?

I need a bash command to delete the entire file if the file itself begins with <html>.

I'm not sure of the best way to go about this...

Context: I download a series of files via curl requests. Most of the time the downloads and processing work fine, but sometimes a download request results in a 404 for whatever reason. When that happens, the contents of the downloaded file begin with an html tag, and when the rest of my processing hits this file, it hangs. So I want to run a command before my other processing to cat each of the files and delete any file that starts with this html tag.


Solution 1:

To address the question that prompted you to ask this one, rather than the one you actually asked:

curl can tell you the status code in addition to downloading the file. You do not need to check the file's contents for that. An example of how to check the status is

status=$(curl -w '%{http_code}' "${url}" -o "${file}")
test "${status}" -eq 200 || rm -- "${file}"

The various options you can use with -w are documented in the manual, and depending on your needs, you may want to extend this to output more information and parse it, and/or change the check of the status code to allow more than merely 200.
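
As a rough sketch of what "allow more than merely 200" could look like, the snippet below accepts any 2xx status. The function names status_ok and fetch_or_discard are made up for illustration, and treating every 2xx code as success is an assumption that may not fit your server.

```shell
# Return success (0) if the given HTTP status code is in the 2xx range.
status_ok() {
    case "$1" in
        2[0-9][0-9]) return 0 ;;
        *)           return 1 ;;
    esac
}

# Download URL ($1) into file ($2) and remove the file unless the status is 2xx.
# fetch_or_discard is a hypothetical helper name, not part of curl.
fetch_or_discard() {
    url=$1
    file=$2
    # -s silences the progress meter; -w prints only the status code on stdout.
    status=$(curl -s -o "$file" -w '%{http_code}' "$url")
    if ! status_ok "$status"; then
        rm -f -- "$file"
        return 1
    fi
}
```

You would then call something like fetch_or_discard "${url}" "${file}" for each download. When curl cannot get a response at all, it reports the code as 000, which this check also rejects.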

Solution 2:

You could use this find command to delete every file whose first line consists solely of <html>:

find . -type f -exec sh -c 'sed q "$1" | grep -q "^<html>$" && rm -- "$1"' sh {} \;

Drop the trailing $ from the pattern to also match files whose first line merely begins with <html>.
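
If you already have the list of downloaded files, a loop that looks only at the first six bytes may be simpler than find. This is a minimal sketch, not a definitive answer: the function names are hypothetical, and it assumes the error page starts with the exact lowercase text <html> (a page starting with <!DOCTYPE html> or <HTML> would not match).

```shell
# Return success if the file ($1) starts with the literal bytes "<html>".
starts_with_html() {
    # Read only the first 6 bytes; tr drops NUL bytes that binary files may contain.
    [ "$(head -c 6 < "$1" | tr -d '\0')" = "<html>" ]
}

# Delete every regular file among the arguments that starts with "<html>".
delete_html_files() {
    for f in "$@"; do
        # Skip anything that is not an existing regular file.
        [ -f "$f" ] || continue
        if starts_with_html "$f"; then
            rm -- "$f"
        fi
    done
}
```

For example, delete_html_files downloads/* would remove the error pages from a directory called downloads (the directory name is only an illustration) and leave the other files untouched.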