Python code performance on big data: os.path.getsize
- To get a feel for how fast you can get, try running and timing du -k on the directory. You probably won't get faster than that with Python for a full listing.
- If you're running on Python < 3.5, try upgrading or using the scandir package for a nice performance improvement (see the import fallback sketch after the code below).
- If you don't really need the whole list of files but can live with, e.g., the largest 1000 files: avoid keeping the list and use heapq.nlargest with a generator:
import heapq
import os

def get_sizes(root):
    for path, dirs, files in os.walk(root):
        # prune hidden directories in place so os.walk doesn't descend into them
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        for file in files:
            full_path = os.path.join(path, file)
            try:
                # keeping the size first means no need for a key function,
                # which can affect performance
                yield (os.path.getsize(full_path), full_path)
            except OSError:
                # file may have vanished or be unreadable; skip it
                pass

for size, name in heapq.nlargest(1000, get_sizes(r"c:\some\path")):
    print(name, size)
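For the Python < 3.5 case mentioned above, here is a minimal sketch of how you might fall back to the third-party scandir backport; the try/except import pattern is my own suggestion, not something the code above requires:

# A hedged sketch: prefer the scandir backport's walk() on older Pythons,
# otherwise use the stdlib os.walk (already scandir-based on 3.5+).
try:
    from scandir import walk   # pip install scandir (backport for Python < 3.5)
except ImportError:
    from os import walk        # Python 3.5+: os.walk already uses os.scandir

get_sizes() above could then call walk(root) instead of os.walk(root) with no other changes.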
EDIT - to get even faster on Windows: os.scandir yields entries that already contain the size, which helps avoid another system call. This means using os.scandir and recursing yourself instead of relying on os.walk, which doesn't expose that information.
There's a similar working example, the get_tree_size() function in the scandir PEP 471, that can easily be modified to yield names and sizes instead. Each entry's size is accessible with entry.stat(follow_symlinks=False).st_size.
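As a rough illustration of that modification, here is a minimal sketch following the shape of get_tree_size() from PEP 471 but yielding (size, path) tuples so it can feed heapq.nlargest directly; the function name iter_sizes, the hidden-directory filter, and the deliberately simple error handling are my own choices:

import heapq
import os

def iter_sizes(root):
    # Recurse with os.scandir; on Windows each DirEntry already carries the
    # file size from the directory listing, so entry.stat(follow_symlinks=False)
    # usually doesn't need an extra system call.
    try:
        it = os.scandir(root)
    except OSError:
        return
    with it:
        for entry in it:
            try:
                if entry.is_dir(follow_symlinks=False):
                    if not entry.name.startswith('.'):   # same hidden-dir filter as above
                        yield from iter_sizes(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield (entry.stat(follow_symlinks=False).st_size, entry.path)
            except OSError:
                pass

for size, name in heapq.nlargest(1000, iter_sizes(r"c:\some\path")):
    print(name, size)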