rm on a directory with millions of files

The data=writeback mount option deserves to be tried, in order to prevent journaling of file data (metadata is still journaled). This should be done only for the duration of the deletion; there is, however, a risk of data loss if the server is shut down or rebooted during the delete operation.

According to this page:

    Some applications show very significant speed improvement when it is used. For example, speed improvements can be seen (...) when applications create and delete large volumes of small files.

The option is set either in fstab or during the mount operation, replacing data=ordered with data=writeback. The file system containing the files to be deleted has to be remounted.
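For example, a sketch assuming the filesystem is /dev/sda3 mounted on /data (substitute your own device and mount point); the data= mode generally cannot be changed on a live remount, so unmount first:

# umount /data
# mount -o data=writeback /dev/sda3 /data
(delete the files)
# umount /data
# mount -o data=ordered /dev/sda3 /data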


Update August 2021

This answer continues to attract a lot of attention, and I feel it's so woefully out of date that it's somewhat redundant now.

Doing a find ... -delete is most likely going to produce acceptable results in terms of performance.
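For example, assuming the files live under /tmp/many (the same invocation benchmarked below):

$ find /tmp/many -type f -delete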

The one area I felt might yield higher performance is tackling the 'removing' part of the problem rather than the 'listing' part.

I tried it and it didn't help, but I feel it's useful to explain what I did and why.

In today's newer kernels, through the use of the io_uring subsystem (see man 2 io_uring_setup), it is actually possible to perform unlinks asynchronously -- meaning we can submit unlink requests without waiting or blocking to see the result.

This program basically reads a directory, submits hundreds of unlinks without waiting for the result, then reaps the results later once the system is done handling the request.

It tries to do what dentls did, but uses io_uring. It can be compiled with gcc -o dentls2 dentls2.c -luring.

#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <errno.h>
#include <sched.h>

#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>

#include <linux/io_uring.h>
#include <liburing.h>

/* Try to keep the queue size to under two pages, as internally it's stored in
 * the kernel as contiguously ordered pages. Basically, the bigger you make it
 * the higher order the allocation becomes and the less likely you'll have the
 * contiguous pages to support it, despite not hitting any user limits.
 * Keeping the queue size at order 1 reduces the chance of ENOMEM here.
 * Ring size internally is roughly 24 bytes per entry plus overheads I haven't
 * accounted for.
 */
#define QUEUE_SIZE 256

/* Globals to manage the queue */
static volatile int pending = 0;
static volatile int total_files = 0;

/* Probes kernel uring implementation and checks if action is 
 * supported inside the kernel */
static void probe_uring(
    struct io_uring *ring)
{
  struct io_uring_probe *pb = NULL;

  pb = io_uring_get_probe_ring(ring);

  /* Can we perform an IO uring unlink in this kernel? */
  if (!pb || !io_uring_opcode_supported(pb, IORING_OP_UNLINKAT)) {
    free(pb);
    errno = ENOTSUP;
    err(EXIT_FAILURE, "Unable to configure uring");
  }

  free(pb);
}


/* Place an unlink call for the specified file/directory on the ring */
static int submit_unlink_request(
    int dfd,
    const char *fname,
    struct io_uring *ring)
{
  char *fname_cpy = NULL;
  struct io_uring_sqe *sqe = NULL;

  /* Fetch a free submission entry off the ring */
  sqe = io_uring_get_sqe(ring);
  if (!sqe)
    /* Submission queue full */
    return 0;

  /* Copy the filename only once we have a slot, so a full queue doesn't
   * leak the copy. It must stay valid until the request completes. */
  fname_cpy = strdup(fname);
  if (!fname_cpy)
    err(EXIT_FAILURE, "Cannot allocate memory for filename");

  pending++;
  /* Format the unlink call for submission */
  io_uring_prep_rw(IORING_OP_UNLINKAT, sqe, dfd, fname_cpy, 0, 0);
  sqe->unlink_flags = 0;

  /* Set the data to just be the filename. Useful for debugging
   * at a later point */
  io_uring_sqe_set_data(sqe, fname_cpy);

  return 1;
}


/* Submit the pending queue, then reap the queue
 * clearing up room on the completion queue */
static void consume_queue(
    struct io_uring *ring)
{
  char *fn;
  int i = 0, bad = 0;
  int rc;
  struct io_uring_cqe **cqes = NULL;

  if (pending < 0)
    abort();

  cqes = calloc(pending, sizeof(struct io_uring_cqe *));
  if (!cqes)
    err(EXIT_FAILURE, "Cannot find memory for CQE pointers");

  /* Notify about submitted entries from the queue (this is an async call) */
  io_uring_submit(ring);

  /* We can immediately take a peek to see if we've anything completed */
  rc = io_uring_peek_batch_cqe(ring, cqes, pending);

  /* Iterate the list of completed entries. Check nothing crazy happened */
  for (i=0; i < rc; i++) {
    /* This returns the filename we set earlier */
    fn = io_uring_cqe_get_data(cqes[i]);

    /* Check the error code of the unlink calls */
    if (cqes[i]->res < 0) {
      errno = -cqes[i]->res;
      warn("Unlinking entry %s failed", fn);
      bad++;
    }

    /* Clear up our CQE */
    free(fn);
    io_uring_cqe_seen(ring, cqes[i]);
  }

  /* Every reaped entry, successful or not, is no longer pending */
  pending -= rc;
  total_files += rc - bad;
  free(cqes);
}



/* Main start */
int main(
    const int argc,
    const char **argv)
{
  struct io_uring ring = {0};
  struct stat st = {0};
  DIR *target = NULL;
  int dfd;
  struct dirent *fn;

  /* Check initial arguments passed make sense */
  if (argc < 2)
    errx(EXIT_FAILURE, "Must pass a directory to remove files from.");

  /* Check path validity */
  if (lstat(argv[1], &st) < 0)
    err(EXIT_FAILURE, "Cannot access target directory");

  if (!S_ISDIR(st.st_mode)) 
    errx(EXIT_FAILURE, "Path specified must be a directory");

  /* Open the directory */
  target = opendir(argv[1]);
  if (!target)
    err(EXIT_FAILURE, "Opening the directory failed");
  dfd = dirfd(target);

  /* Create the initial uring for handling the file removals */
  if (io_uring_queue_init(QUEUE_SIZE, &ring, 0) < 0)
    err(EXIT_FAILURE, "Cannot initialize URING");

  /* Check the unlink action is supported */
  probe_uring(&ring);

  /* As of writing this code, getdents doesn't have io_uring support,
   * but checking the kernel mailing list indicates it's in progress.
   * For now, we'll just do a layman's readdir(). These days there's no
   * actual difference between it and making the getdents() call ourselves.
   */
  while ((fn = readdir(target)) != NULL) {
    if (fn->d_type != DT_REG)
      /* Pay no attention to non-files */
      continue;

    /* Add to the queue until it's full, then consume it
     * once it's full.
     */
    while (!submit_unlink_request(dfd, fn->d_name, &ring)) {
      /* When the queue becomes full, consume queued entries */
      consume_queue(&ring);
      /* This yield is here to give the uring a chance to
       * complete pending requests */
      sched_yield();
    }
  }

  /* Out of files in directory to list. Just clear the queue */
  while (pending) {
    consume_queue(&ring);
    sched_yield();
  }

  printf("Total files: %d\n", total_files);

  io_uring_queue_exit(&ring);
  closedir(target);
  exit(0);
}

The results were, ironically, the opposite of what I suspected. But why?

TMPFS with 4 million files

$ time ./dentls2 /tmp/many
Total files: 4000000

real    0m6.459s
user    0m0.360s
sys 0m24.224s

Using find:

$ time find /tmp/many -type f -delete

real    0m9.978s
user    0m1.872s
sys 0m6.617s

BTRFS with 10 million files

$ time ./dentls2 ./many
Total files: 10000000

real    10m25.749s
user    0m2.214s
sys 16m30.865s

Using find:

$ time find ./many -type f -delete

real    7m1.328s
user    0m9.209s
sys 4m42.000s

So it looks as if batched syscalls don't improve real time here. The new dentls2 spends much more time working (roughly four times as much system time) only to produce worse performance: a net loss in overall efficiency and worse latency.

The cause is that io_uring spawns kernel dispatcher threads to do the unlink work internally, but the directory inode being worked on can only be modified by a single writer at a time.

Basically, using the uring we're creating lots of little threads, but only one of them is allowed to delete from the directory at a time. We've just created a bunch of contention and eliminated the advantage of doing batched IO.

Using eBPF you can measure the unlink latencies and watch what causes the delays.

In the case of BTRFS, it's the kernel function btrfs_commit_inode_delayed_inode which acquires the lock when unlink is called.

With dentls2

# /usr/share/bcc/tools/funclatency btrfs_commit_inode_delayed_inode
    Tracing 1 functions for "btrfs_commit_inode_delayed_inode"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 18       |                                        |
       512 -> 1023       : 120      |                                        |
      1024 -> 2047       : 50982    |                                        |
      2048 -> 4095       : 2569467  |********************                    |
      4096 -> 8191       : 4936402  |****************************************|
      8192 -> 16383      : 1662380  |*************                           |
     16384 -> 32767      : 656883   |*****                                   |
     32768 -> 65535      : 85409    |                                        |
     65536 -> 131071     : 21715    |                                        |
    131072 -> 262143     : 9719     |                                        |
    262144 -> 524287     : 5981     |                                        |
    524288 -> 1048575    : 857      |                                        |
   1048576 -> 2097151    : 293      |                                        |
   2097152 -> 4194303    : 220      |                                        |
   4194304 -> 8388607    : 255      |                                        |
   8388608 -> 16777215   : 153      |                                        |
  16777216 -> 33554431   : 56       |                                        |
  33554432 -> 67108863   : 6        |                                        |
  67108864 -> 134217727  : 1        |                                        |

avg = 8533 nsecs, total: 85345432173 nsecs, count: 10000918

Using find ... -delete:

# /usr/share/bcc/tools/funclatency btrfs_commit_inode_delayed_inode
Tracing 1 functions for "btrfs_commit_inode_delayed_inode"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 34       |                                        |
       512 -> 1023       : 95       |                                        |
      1024 -> 2047       : 1005784  |****                                    |
      2048 -> 4095       : 8110338  |****************************************|
      4096 -> 8191       : 672119   |***                                     |
      8192 -> 16383      : 158329   |                                        |
     16384 -> 32767      : 42338    |                                        |
     32768 -> 65535      : 4667     |                                        |
     65536 -> 131071     : 3597     |                                        |
    131072 -> 262143     : 2860     |                                        |
    262144 -> 524287     : 216      |                                        |
    524288 -> 1048575    : 22       |                                        |
   1048576 -> 2097151    : 6        |                                        |
   2097152 -> 4194303    : 3        |                                        |
   4194304 -> 8388607    : 5        |                                        |
   8388608 -> 16777215   : 3        |                                        |

avg = 3258 nsecs, total: 32585481993 nsecs, count: 10000416

You can see from the histogram that find spends 3258 nanoseconds on average in btrfs_commit_inode_delayed_inode but dentls2 spends 8533 nanoseconds in the function.

The histogram also shows that, overall, the io_uring threads spend at least twice as long on the lock, with the majority of calls taking 4096-8191 nanoseconds versus the majority in find taking 2048-4095 nanoseconds.

find is single-threaded and isn't contending for the lock, whereas dentls2 is effectively multi-threaded (due to the uring workers), which produces lock contention; the delays experienced are reflected in the analysis.

Conclusion

All in all, on modern systems (as of writing this) there is less and less you can do in software to make deletion go faster than it already does.

It used to be that by reading a large buffer from the disk you could consolidate expensive IO into one large sequential read, instead of the seeky IO that small getdents() buffers typically ended up producing.

Also, thanks to other improvements, the overhead of simply invoking a system call is smaller, and major improvements in sequential/random IO access times have eliminated the big IO bottlenecks we used to experience.

On my systems, this problem has become memory/CPU bound. There's a single-accessor problem on (at least) BTRFS which limits you to a single CPU/program's worth of unlinks per directory at a time. Trying to batch the IOs yields at best minor improvements, even in the ideal circumstances of using tmpfs, and it is typically worse on a real-world filesystem.

To top it off, we really don't have this problem anymore -- gone are the days of 10 million files taking 4 hours to remove.

Just do something simple like find ... -delete. No amount of optimization I tried seemed to yield major performance improvements worth the coding (or analysis) effort over a simple default setup.


Original Answer

Whilst a major cause of this problem is ext3 performance with millions of files, the actual root cause is different.

When a directory needs to be listed, readdir() is called on it, yielding a list of files. readdir() is a POSIX call, but the real Linux system call being used underneath is getdents(), which lists directory entries by filling a buffer with entries.

The problem mainly comes down to the fact that readdir() uses a fixed 32 KB buffer to fetch entries. As a directory grows larger and larger (its size increases as files are added), ext3 gets slower and slower at fetching entries, and readdir()'s 32 KB buffer is only sufficient to hold a fraction of the entries in the directory. This causes readdir() to loop over and over, invoking the expensive system call again and again.

For example, on a test directory I created with over 2.6 million files inside, running "ls -1 | wc -l" under strace shows a large output of many getdents() system calls.

$ strace ls -1 | wc -l
brk(0x4949000)                          = 0x4949000
getdents(3, /* 1025 entries */, 32768)  = 32752
getdents(3, /* 1024 entries */, 32768)  = 32752
getdents(3, /* 1025 entries */, 32768)  = 32760
getdents(3, /* 1025 entries */, 32768)  = 32768
brk(0)                                  = 0x4949000
brk(0x496a000)                          = 0x496a000
getdents(3, /* 1024 entries */, 32768)  = 32752
getdents(3, /* 1026 entries */, 32768)  = 32760
...

Additionally, the time spent listing this directory was significant.

$ time ls -1 | wc -l
2616044

real    0m20.609s
user    0m16.241s
sys 0m3.639s

The way to make this a more efficient process is to call getdents() manually with a much larger buffer. This improves performance significantly.

Now, you're not supposed to call getdents() yourself directly, so no interface exists to use it normally (check the man page for getdents to see!); however, you can call it via syscall() and make your system call invocation far more efficient.

This drastically reduces the time it takes to fetch these files. I wrote a program that does this.

/* I can be compiled with the command "gcc -o dentls dentls.c" */

#define _GNU_SOURCE

#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <fcntl.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent {
        long           d_ino;
        off_t          d_off;
        unsigned short d_reclen;
        char           d_name[256];
        char           d_type;
};

static int delete = 0;
char *path = NULL;

static void parse_config(
        int argc,
        char **argv)
{
    int option_idx = 0;
    static struct option loptions[] = {
      { "delete", no_argument, &delete, 1 },
      { "help", no_argument, NULL, 'h' },
      { 0, 0, 0, 0 }
    };

    while (1) {
        int c = getopt_long(argc, argv, "h", loptions, &option_idx);
        if (c < 0)
            break;

        switch(c) {
          case 0: {
              break;
          }
 
          case 'h': {
              printf("Usage: %s [--delete] DIRECTORY\n"
                     "List/Delete files in DIRECTORY.\n"
                     "Example %s --delete /var/spool/postfix/deferred\n",
                     argv[0], argv[0]);
              exit(0);                      
              break;
          }

          default:
          break;
        }
    }

    if (optind >= argc)
      errx(EXIT_FAILURE, "Must supply a valid directory\n");

    path = argv[optind];
}

int main(
    int argc,
    char** argv)
{

    parse_config(argc, argv);

    int totalfiles = 0;
    int dirfd = -1;
    int offset = 0;
    int bufcount = 0;
    void *buffer = NULL;
    char *d_type;
    struct linux_dirent *dent = NULL;
    struct stat dstat;

    /* Standard sanity checking stuff */
    if (access(path, R_OK) < 0) 
        err(EXIT_FAILURE, "Could not access directory");

    if (lstat(path, &dstat) < 0) 
        err(EXIT_FAILURE, "Unable to lstat path");

    if (!S_ISDIR(dstat.st_mode))
        errx(EXIT_FAILURE, "The path %s is not a directory.\n", path);

    /* Allocate a buffer three times the directory's reported size to store dents */
    if ((buffer = calloc(dstat.st_size*3, 1)) == NULL)
        err(EXIT_FAILURE, "Buffer allocation failure");

    /* Open the directory */
    if ((dirfd = open(path, O_RDONLY)) < 0) 
        err(EXIT_FAILURE, "Open error");

    /* Switch directories so unlink() operates inside the target */
    if (fchdir(dirfd) < 0)
        err(EXIT_FAILURE, "Unable to change into target directory");

    if (delete) {
        printf("Deleting files in ");
        for (int i=5; i > 0; i--) {
            printf("%u. . . ", i);
            fflush(stdout);
            sleep(1);
        }
        printf("\n");
    }

    while ((bufcount = syscall(SYS_getdents, dirfd, buffer, dstat.st_size*3)) > 0) {
        offset = 0;
        dent = buffer;
        while (offset < bufcount) {
            /* Don't print thisdir and parent dir */
            if (!((strcmp(".",dent->d_name) == 0) || (strcmp("..",dent->d_name) == 0))) {
                d_type = (char *)dent + dent->d_reclen-1;
                /* Only print files */
                if (*d_type == DT_REG) {
                    printf ("%s\n", dent->d_name);
                    if (delete) {
                        if (unlink(dent->d_name) < 0)
                            warn("Cannot delete file \"%s\"", dent->d_name);
                    }
                    totalfiles++;
                }
            }
            offset += dent->d_reclen;
            dent = buffer + offset;
        }
    }

    if (bufcount < 0)
        err(EXIT_FAILURE, "Error reading directory entries");
    fprintf(stderr, "Total files: %d\n", totalfiles);
    close(dirfd);
    free(buffer);

    exit(0);
}

Whilst this does not combat the underlying fundamental problem (lots of files in a filesystem that performs poorly with them), it's likely to be much, much faster than many of the alternatives being posted.

As a final thought, one should remove the affected directory and remake it afterwards. Directories only ever increase in size and can remain poorly performing, even with only a few files inside, due to the size of the directory itself.
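A minimal sketch of that swap, assuming the directory is named data, nothing else is writing to it meanwhile, and no special ownership or permissions need preserving:

$ mkdir data.new
$ ./dentls --delete data >/dev/null
$ rmdir data
$ mv data.new data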

Edit: I've cleaned this up quite a bit. Added an option that allows you to delete on the command line at runtime, and removed a bunch of the treewalk stuff which, honestly, looking back was questionable at best and was also shown to produce memory corruption.

You can now do dentls --delete /my/path

New results, based on a directory with 1.82 million files.

## Ideal ls Uncached
$ time ls -u1 data >/dev/null

real    0m44.948s
user    0m1.737s
sys 0m22.000s

## Ideal ls Cached
$ time ls -u1 data >/dev/null

real    0m46.012s
user    0m1.746s
sys 0m21.805s


## dentls uncached
$ time ./dentls data >/dev/null
Total files: 1819292

real    0m1.608s
user    0m0.059s
sys 0m0.791s

## dentls cached
$ time ./dentls data >/dev/null
Total files: 1819292

real    0m0.771s
user    0m0.057s
sys 0m0.711s

Was kind of surprised this still works so well!