Deleting millions of files

I had a directory fill up with millions of GIF images. Too many for the rm command to handle.

I have been trying the find command like this:

find . -name "*.gif" -print0 | xargs -0 rm

The problem is, it bogs down my machine really badly and causes timeouts for customers, since it's a server.

Is there any quicker way to delete all these files... without locking up the machine?


Quicker is not necessarily what you want. You may want to actually run slower, so the deletion chews up fewer resources while it's running.

Use nice(1) to lower the priority of a command.

nice find . -name "*.gif" -delete

For I/O-bound processes, nice(1) might not be sufficient. The Linux scheduler does take I/O into account, not just CPU, but you may want finer control over I/O priority.

ionice -c 2 -n 7 find . -name "*.gif" -delete

If that doesn't do it, you could also add a sleep to really slow it down.

find . -name "*.gif" -exec sleep 0.01 \; -delete

Since you're running Linux and this task is probably I/O-bound, I advise giving your command idle I/O scheduler priority using ionice(1):

ionice -c3 find . -name '*.gif' -delete

Compared to your original command, I guess this may even spare some CPU cycles by avoiding the pipe to xargs.


No.

There is no quicker way, apart from a soft format of the disk. The files are given to rm in batches (up to the command-line length limit, which can also be set for xargs), which is much better than calling rm on each file individually. So no, there is definitely no faster way.
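For illustration, if you want explicit control over how many files go into each rm invocation, xargs can cap the count per call with -n (a minimal sketch; the 1000-per-batch figure is an arbitrary choice):

# pass at most 1000 file names to each rm invocation
find . -name '*.gif' -print0 | xargs -0 -n 1000 rm -f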

Using nice (or renice on a running process) helps only partially, because it schedules the CPU resource, not the disk! And the CPU usage will be very low. This is a Linux weakness - if one process "eats up" the disk (i.e. works a lot with it), the whole machine gets stuck. A kernel modified for real-time usage could be a solution.
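If the deletion is already running, you can still deprioritize it in place; roughly like this (the PID 12345 is just a placeholder for whatever ps reports for your find/rm):

# lower the CPU priority of the already-running process
renice -n 19 -p 12345
# on Linux, also drop its I/O priority to the idle class
ionice -c3 -p 12345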

What I would do on the server is manually let other processes do their job - include pauses so the server can "breathe":

find . -name "*.gif" > files
split -l 100 files files.
for F in files.*; do
    xargs rm < "$F"
    sleep 5 
done

This will wait 5 seconds after every 100 files. It will take much longer, but your customers shouldn't notice any delays.
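The same pause-between-batches idea can also be done without the temporary files, if you prefer a one-liner; a rough sketch using the same arbitrary 100-files/5-seconds figures as above:

# delete in batches of 100 files, sleeping 5 seconds between batches
find . -name '*.gif' -print0 |
    xargs -0 -n 100 sh -c 'rm -f -- "$@"; sleep 5' sh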


If the files to be deleted vastly outnumber the files that are left behind, walking the tree of files to be deleted and doing all those filesystem updates may not be the most efficient approach. (It's analogous to doing clumsy reference-counted memory management, visiting every object in a large tree to drop its reference, instead of making everything unwanted into garbage in one step and then sweeping through what is reachable to clean up.)

That is to say, clone the parts of the tree that are to be kept to another volume. Re-create a fresh, blank filesystem on the original volume. Copy the retained files back to their original paths. This is vaguely similar to copying garbage collection.
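In outline, assuming the data lives on its own partition, the procedure might look something like this (the /dev/sdb1 device, the /data and /backup paths, and the ext4 filesystem type are all placeholders):

# 1. copy the files you want to keep somewhere else
rsync -a /data/keep/ /backup/keep/
# 2. unmount and re-create a blank filesystem on the original partition
umount /data
mkfs.ext4 /dev/sdb1
# 3. mount it again and copy the retained files back to their original paths
mount /dev/sdb1 /data
rsync -a /backup/keep/ /data/keep/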

There will be some downtime, but it could be better than continuous bad performance and service disruption.

It may be impractical in your system and situation, but it's easy to imagine obvious cases where this is the way to go.

For instance, suppose you wanted to delete all files in a filesystem. What would be the point of recursing and deleting one by one? Just unmount it and do a "mkfs" over top of the partition to make a blank filesystem.

Or suppose you wanted to delete all files except for half a dozen important ones? Get the half a dozen out of there and ... "mkfs" over top.

Eventually there is a break-even point: when enough files have to stay, it becomes cheaper to do the recursive deletion, taking into account other costs like any downtime.