Is there a way to efficiently yield every file in a directory containing millions of files?
tl;dr update: As of Python 3.5 (currently in beta), just use os.scandir.
As I've written earlier, since iglob is just a facade for a real iterator, you will have to call low-level system functions in order to get entries one at a time, as you want. Fortunately, calling low-level functions is doable from Python. The low-level functions are different for Windows and Posix/Linux systems.
- If you are on Windows, check whether win32api has a call to read "the next entry from a dir", or how to proceed otherwise.
- If you are on Posix/Linux, you can call libc functions directly through ctypes and fetch one file-dir entry (including naming information) at a time.
The documentation on the C functions is here: http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory
http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory
I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system, but this snippet may not work on yours[footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying that the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.
Here is the snippet I've put together using ctypes and libc; it allows you to get each filename and perform actions on it. Note that ctypes automatically gives you a Python string when you call str(...) on the char array defined in the structure. (I am using the print statement, which implicitly calls Python's str.)
#!/usr/bin/env python2
from ctypes import *

libc = cdll.LoadLibrary("libc.so.6")
# Declare pointer return types explicitly; the ctypes default of c_int
# truncates addresses on 64-bit systems.
libc.opendir.restype = c_void_p
libc.readdir64.restype = c_void_p

class Dirent(Structure):
    # This layout must match the dirent64 struct in /usr/include/dirent.h
    _fields_ = [("d_ino", c_void_p),
                ("d_off", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)]

dir_ = libc.opendir("/home/jsbueno")
while True:
    p = libc.readdir64(dir_)
    if not p:  # NULL pointer: no more entries in the directory
        break
    entry = Dirent.from_address(p)
    print entry.d_name
libc.closedir(dir_)
update: Python 3.5 is now in beta, and in Python 3.5 the new os.scandir function is available as the materialization of PEP 471 ("a better and faster directory iterator"). It does exactly what is asked for here, besides a lot of other optimizations: it can deliver up to a 9-fold speed increase over os.listdir when listing large directories under Windows (a 2-3-fold increase on Posix systems).
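For illustration, here is a minimal sketch of lazy iteration with os.scandir on Python 3.5+ (the directory path is just the example path used above):

import os

# os.scandir yields os.DirEntry objects lazily instead of building
# the whole name list in memory, as os.listdir does.
for entry in os.scandir("/home/jsbueno"):
    if entry.is_file(follow_symlinks=False):
        print(entry.name)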
[footnote-1] The dirent64 C struct is determined at C compile time for each system.
The glob module in Python, from 2.5 onwards, has an iglob function which returns an iterator. An iterator is exactly for the purpose of not storing huge values in memory.
glob.iglob(pathname)
Return an iterator which yields the same values as glob() without
actually storing them all simultaneously.
For example:
import glob

for eachfile in glob.iglob('*'):
    # act upon eachfile
    print(eachfile)
Since you are using Linux, you might want to look at pyinotify. It would allow you to write a Python script which monitors a directory for filesystem changes -- such as the creation, modification or deletion of files.
Every time such a filesystem event occurs, you can arrange for the Python script to call a function. This would be roughly like yielding each filename once, while being able to react to modifications and deletions.
It sounds like you already have a million files sitting in a directory. In this case, if you were to move all those files to a new, pyinotify-monitored directory, then the filesystem events generated by the creation of new files would yield the filenames as desired.
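A minimal sketch of that approach, assuming pyinotify is installed (the watched path and the handler's actions are placeholders):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # fires once a newly written file has been closed
        print("created:", event.pathname)
    def process_IN_DELETE(self, event):
        print("deleted:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch("/path/to/monitored/dir",
             pyinotify.IN_CLOSE_WRITE | pyinotify.IN_DELETE)
pyinotify.Notifier(wm, Handler()).loop()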
@jsbueno's post is really useful, but is still kind of slow on slow disks, since libc's readdir() only reads 32K of directory entries at a time. I am not an expert on making system calls directly in Python, but I outlined how to write code in C that will list a directory with millions of files, in a blog post at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/.
The ideal would be to call getdents() directly in Python (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) so that you can specify a read buffer size when loading directory entries from disk, rather than calling readdir(), which as far as I can tell has a buffer size defined at compile time.
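As a rough sketch of that idea, here is one way to reach getdents64 from Python through the raw syscall interface. This assumes x86-64 Linux, where getdents64 is syscall number 217; the directory path and buffer size are placeholders:

import ctypes, os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
SYS_getdents64 = 217  # x86-64 only; the number differs on other architectures

fd = os.open("/some/huge/dir", os.O_RDONLY | os.O_DIRECTORY)
BUF_SIZE = 5 * 1024 * 1024  # ask the kernel for ~5 MB of entries per call
buf = ctypes.create_string_buffer(BUF_SIZE)

while True:
    nread = libc.syscall(SYS_getdents64, fd, buf, BUF_SIZE)
    if nread <= 0:  # 0 means end of directory; -1 means error
        break
    # buf now holds packed linux_dirent64 records: d_ino (8 bytes),
    # d_off (8), d_reclen (2), d_type (1), then the null-terminated name.
    offset = 0
    while offset < nread:
        d_reclen = int.from_bytes(buf.raw[offset + 16:offset + 18], "little")
        name = buf.raw[offset + 19:offset + d_reclen].split(b"\0", 1)[0]
        print(name.decode())
        offset += d_reclen
os.close(fd)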
What I want is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
No method will reveal a filename which "changed". It's not even clear what you mean by "filenames change, new files are added, and files are deleted". What is your use case?
Let's say you have three files: a.a, b.b, c.c.

Your magical "iterator" starts with a.a. You process it.

The magical "iterator" moves to b.b. You're processing it. Meanwhile, a.a is copied to a1.a1 and a.a is deleted. What now? What does your magical iterator do with these? It has already passed a.a. Since a1.a1 sorts before b.b, it will never see it. What's supposed to happen for "filenames change, new files are added, and files are deleted"?

The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?
Process A is continuously writing files to a storage location. Process B (the one I'm writing) will iterate over these files, do some processing based on the filename, and move the files to another location.
Don't use the naked file system for coordination.
Use a queue.
Process A writes files and enqueues an add/change/delete memento onto a queue.
Process B reads the memento from the queue and then does the follow-on processing on the file named in the memento.
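A minimal sketch of that pattern using the standard library's multiprocessing.Queue (the memento format, file paths, and processing steps are placeholders):

from multiprocessing import Process, Queue

def process_a(q):
    # Writer: after each file is fully written, enqueue a memento for it.
    for i in range(3):
        path = "/staging/file_%d.dat" % i
        # ... write the file at `path` ...
        q.put(("add", path))
    q.put(None)  # sentinel: no more work

def process_b(q):
    # Reader: consume one memento at a time; nothing accumulates in memory.
    while True:
        memento = q.get()
        if memento is None:
            break
        action, path = memento
        # ... process the file named in the memento, then move it ...
        print(action, path)

if __name__ == "__main__":
    q = Queue()
    b = Process(target=process_b, args=(q,))
    b.start()
    process_a(q)
    b.join()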