Is there a faster way than this to find all the files in a directory and all sub directories?

I'm writing a program that needs to search a directory and all its sub directories for files that have a certain extension. This is going to be used both on a local, and a network drive, so performance is a bit of an issue.

Here's the recursive method I'm using now:

private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    files.AddRange(fiArr);

    DirectoryInfo[] diArr = di.GetDirectories();

    foreach (DirectoryInfo info in diArr)
    {
        GetFileList(fileSearchPattern, info.FullName, files);
    }
}

I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.

While I'm creating a list of FileInfo objects now all I really care about is the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see what files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything that I can do to optimize this file search around querying for the files on a shared network drive?


Update 1

I tried creating a non-recursive method that does the same thing by first finding all the sub directories and then iteratively scanning each directory for files. Here's the method:

public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);

    List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
    dirList.Add(rootDir);

    List<FileInfo> fileList = new List<FileInfo>();

    foreach (DirectoryInfo dir in dirList)
    {
        fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
    }

    return fileList;
}

Update 2

Alright so I've run some tests on a local and a remote folder both of which have a lot of files (~1200). Here are the methods I've run the tests on. The results are below.

  • GetFileListA(): Non-recursive solution in the update above. I think it's equivalent to Jay's solution.
  • GetFileListB(): Recursive method from the original question
  • GetFileListC(): Gets all the directories with static Directory.GetDirectories() method. Then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List
  • GetFileListD(): Marc Gravell's solution using a queue and returns IEnumberable. I populated a List with the resulting IEnumerable
    • DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path. Called GetFiles using SearchOption.AllDirectories
  • Directory.GetFiles: No additional method created. Called the static GetFiles method of the Directory using using SearchOption.AllDirectories
Method                       Local Folder       Remote Folder
GetFileListA()               00:00.0781235      05:22.9000502
GetFileListB()               00:00.0624988      03:43.5425829
GetFileListC()               00:00.0624988      05:19.7282361
GetFileListD()               00:00.0468741      03:38.1208120
DirectoryInfo.GetFiles       00:00.0468741      03:45.4644210
Directory.GetFiles           00:00.0312494      03:48.0737459

. . .so looks like Marc's is the fastest.


Solution 1:

Try this iterator block version that avoids recursion and the Info objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

Note also that 4.0 has inbuilt iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; less arrays)

Solution 2:

Cool question.

I played around a little and by leveraging iterator blocks and LINQ I appear to have improved your revised implementation by about 40%

I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.

Here is the meat of it

private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);

    return from directoriesWithFiles in ReturnFiles(dirList, searchPattern).SelectMany(files => files)
           select directoriesWithFiles;
}

private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}

Solution 3:

The short answer of how to improve the performance of that code is: You cant.

The real performance hit your experiencing is the actual latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is of course excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now)

Given the original constraints there are several solutions already posted that more or less elegantly wrap the iteration process (However, since I assume that I'm reading from a single hard-drive, parallelism will NOT help to more quickly transverse a directory tree, and may even increase that time since you now have two or more threads fighting for data on different parts of the drive as it attempts to seek back and fourth) reduce the number of objects created, etc. However if we evaluate the how the function will be consumed by the end developer there are some optimizations and generalizations that we can come up with.

First, we can delay the execution of the performance by returning an IEnumerable, yield return accomplishes this by compiling in a state machine enumerator inside of an anonymous class that implements IEnumerable and gets returned when the method executes. Most methods in LINQ are written to delay execution until the iteration is performed, so the code in a select or SelectMany will not be performed until the IEnumerable is iterated through. The end result of delayed execution is only felt if you need to take a subset of the data at a later time, for instance, if you only need the first 10 results, delaying the execution of a query that returns several thousand results won't iterate through the entire 1000 results until you need more than ten.

Now, given that you want to do a subfolder search, I can also infer that it may be useful if you can specify that depth, and if I do this it also generalizes my problem, but also necessitates a recursive solution. Then, later, when someone decides that it now needs to search two directories deep because we increased the number of files and decided to add another layer of categorization you can simply make a slight modification instead of re-writing the function.

In light of all that, here is the solution I came up with that provides a more general solution than some of the others above:

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
    return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
    return depth == 0
        ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
        : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
            directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}

On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently, there's no checking, handling, or permissions requests, and the code will throw file permission exceptions if it encounters a directory it doesn't have access to iterate through.

Solution 4:

This takes 30 seconds to get 2 million file names that meet the filter. The reason this is so fast is because I am only performing 1 enumeration. Each additional enumeration affects performance. The variable length is open to your interpretation and not necessarily related to the enumeration example.

if (Directory.Exists(path))
{
    files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".xml") || s.EndsWith(".csv"))
    .Select(s => s.Remove(0, length)).ToList(); // Remove the Dir info.
}