MacOSX shell directory utilities very slow with large directories (millions of files) - any alternatives?
Due to a problem with contact synchronization (I am not sure of the source of the problem; probably a program crash on a power cut, which caused an inconsistency in the contacts database file), the synchronization process created nearly 7M files in Images/:
hostname:Images username$ pwd
/Users/username/Library/Application Support/AddressBook/Sources/4D81D34B-C932-4578-8A31-4E2E244B3875/Images
hostname:Images username$ ls
^C
hostname:Images username$ ls | wc -l
6797073
(the result was after hours)
hostname:Images username$ cd ..
hostname:4D81D34B-C932-4578-8A31-4E2E244B3875 username$ ls -l
total 600224
-rw-r--r--@ 1 username staff 409600 Aug 2 17:43 AddressBook-v22.abcddb
-rw-r--r--@ 1 username staff 32768 Aug 3 00:13 AddressBook-v22.abcddb-shm
-rw-r--r--@ 1 username staff 2727472 Aug 2 23:26 AddressBook-v22.abcddb-wal
drwx------ 65535 username staff 231100550 Aug 2 23:26 Images
-rw-r--r--@ 1 username staff 45056 Dec 7 2017 MailRecents-v4.abcdmr
-rw-r--r--@ 1 username staff 32768 Dec 7 2017 MailRecents-v4.abcdmr-shm
-rw-r--r--@ 1 username staff 4152 Dec 7 2017 MailRecents-v4.abcdmr-wal
drwx------ 5 username staff 170 Feb 26 18:51 Metadata
-rwxr-xr-x 1 username staff 0 Dec 7 2017 OfflineDeletedItems.plist.lockfile
-rwxr-xr-x 1 username staff 0 Dec 7 2017 Sync.lockfile
-rwxr-xr-x 1 username staff 0 Dec 7 2017 SyncOperations.plist.lockfile
When I tried to use shell tools (ls, find), I did not get any result in a time reasonable for interactive work (it took hours), regardless of options like ls -f to disable sorting (which seems to help on other UNIX-like OSs), etc.
The ls process grew to around 1 GB in size and ran for HOURS before outputting any result.
My question is: am I missing some tricky option to make this work reasonably for large directories (outputting results as it goes, e.g. to filter or process further), or are these tools just not written to scale? Or maybe there are better file/directory utilities for MacOSX? (I haven't tried any GUI app on that directory, thinking it better not to...)
I have written a fairly trivial C program that reads the directory entries and outputs the info as it goes:
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <unistd.h>

int main ( const int argc,
           const char * const * argv )
{
    if ( argc < 2 )
    {
        fprintf ( stderr, "usage: %s <directory>\n", argv[0] );
        return 1;
    }
    const char * const dirpath = argv[1];
    DIR *dirp = opendir ( dirpath );
    if ( dirp == NULL )
        return (-1);

    int count = 0;
    struct stat statbuf;
    /* Stream the listing: print each entry as soon as it is read. */
    for ( struct dirent *entry = readdir ( dirp ) ;
          entry != NULL ;
          entry = readdir ( dirp ), count++ )
    {
        /* Build "<dirpath>/<name>"; snprintf always NUL-terminates
           and cannot overflow the buffer. */
        char filepath [ PATH_MAX + 1 ];
        snprintf ( filepath, sizeof filepath, "%s/%s",
                   dirpath, entry->d_name );
        if ( stat ( filepath, &statbuf ) != 0 )
            statbuf.st_size = 0;
        printf ( "%s %lld\n", entry->d_name,
                 (long long) statbuf.st_size );
    }
    closedir ( dirp );
    printf ( "%d\n", count );
    return 0;
}
which actually does work (it outputs each entry right after reading it) and has a memory footprint of around 300K. So it is not a problem of the OS (filesystem, driver, standard library or whatever), but of the tools, which basically do not scale well. (I know they support more options etc., but for a basic directory listing, without sorting or anything fancy, ls should work better, i.e. not allocate 1GB of memory, while find should perform its action for each entry it finds and matches, not read them all first, as it apparently does...)
Has anyone experienced this, and is there a good way to deal with such huge directories (which utilities to use) on MacOSX? (Or is writing a custom system utility necessary in such a case?)
(It is an exceptional situation of course, and it occurred for the first time on my system - but the OS supports such large directories, and the basic shell tools should deal with them in a reasonable way...)
EDIT: Small fix in the program (the filepath was lacking '/').
So it seems I am answering myself; maybe somebody will find my conclusions useful.
Standard Unix tools (MacOSX, Linux) are not good for managing directories in such extreme situations as described. They are very efficient (maybe as efficient as possible at what they do) in normal situations - but this design leads to extensive memory usage (in GBs) and a very long time before any result appears when a directory contains millions of files. For instance, ls reads all directory entries first (as fast as possible) and only then outputs them from a list/array in memory - which seems to be the most efficient approach for a normally sized directory. However, in extreme cases (with millions of files), this approach fails in practical use: e.g. it is not possible to learn anything about the contents of the directory for hours (as it was in my case).
From a practical point of view, it would be good if such tools had an option to turn on "stream" processing - processing the entries in manageable amounts, e.g. in batches of 1000 entries (if one-by-one would be too slow).
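To illustrate the idea (this is just a sketch, not an existing tool option - the batch size of 1000, the NAME_LEN limit and the process_batch handler are illustrative choices of mine), batched processing could look roughly like this:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define BATCH_SIZE 1000     /* arbitrary example value */
#define NAME_LEN   1024     /* generous upper bound for an entry name */

/* Illustrative per-batch handler - here it just prints the names,
   but it could sort, filter or stat the batch before printing. */
static void process_batch ( char names[][NAME_LEN], int n )
{
    for ( int i = 0 ; i < n ; i++ )
        printf ( "%s\n", names[i] );
}

int main ( const int argc, const char * const * argv )
{
    if ( argc < 2 )
        return 1;
    DIR *dirp = opendir ( argv[1] );
    if ( dirp == NULL )
        return (-1);

    static char names [ BATCH_SIZE ][ NAME_LEN ];
    int n = 0;
    for ( struct dirent *entry = readdir ( dirp ) ;
          entry != NULL ;
          entry = readdir ( dirp ) )
    {
        strncpy ( names[n], entry->d_name, NAME_LEN - 1 );
        names[n][NAME_LEN - 1] = '\0';
        if ( ++n == BATCH_SIZE )    /* batch full: process and reset */
        {
            process_batch ( names, n );
            n = 0;
        }
    }
    if ( n > 0 )                    /* final, partial batch */
        process_batch ( names, n );
    closedir ( dirp );
    return 0;
}

The memory footprint stays bounded by the batch size, and the first results appear after the first 1000 entries instead of after all of them.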
For now, it seems necessary to create a custom system utility that will do the job, i.e. list / delete / move the files in such a directory. It could probably be done in something other than C (Python, Perl), but since that could be slower, I wouldn't advise it (even a small per-file overhead multiplied by millions of files may be significant overall).
It seems non-technical users are out of luck here; they have to ask for help.
One thing that might be interesting to note here is that the relation between the time required for operations on a directory and the number of files in it is not linear (this case was MacOS's HFS; other filesystems can vary either way). Deleting the 6.7M files using the program above (just "unlinking" instead of printing) took a few days (on an older iMac with an HDD, not an SSD). I was logging the names of the files as they were being deleted - at first, the rate was about 100K files a day, while removing the last 1-1.5 million took around 3 hours.
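For reference, the deletion variant I mean is essentially the listing program with the stat/printf replaced by unlink; a minimal sketch along those lines (not verbatim what I ran - the logging and error handling details are illustrative) could look like this:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main ( const int argc, const char * const * argv )
{
    if ( argc < 2 )
        return 1;
    const char * const dirpath = argv[1];
    DIR *dirp = opendir ( dirpath );
    if ( dirp == NULL )
        return (-1);

    long removed = 0;
    for ( struct dirent *entry = readdir ( dirp ) ;
          entry != NULL ;
          entry = readdir ( dirp ) )
    {
        /* Never try to unlink the "." and ".." entries. */
        if ( strcmp ( entry->d_name, "." ) == 0
          || strcmp ( entry->d_name, ".." ) == 0 )
            continue;
        char filepath [ PATH_MAX + 1 ];
        snprintf ( filepath, sizeof filepath, "%s/%s",
                   dirpath, entry->d_name );
        if ( unlink ( filepath ) == 0 )
        {
            printf ( "%s\n", entry->d_name );   /* log progress as it goes */
            removed++;
        }
        else
            perror ( filepath );
    }
    closedir ( dirp );
    printf ( "removed %ld entries\n", removed );
    return 0;
}

Deleting the entry just returned by readdir() and then continuing the scan is the usual pattern; whether already-removed entries still show up later in the scan is unspecified, but that is harmless here since unlink() on a missing name simply fails.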
Thanks for the comments.