Determining disk space usage & file counts for extremely large data sets (any tricks?)

Solution 1:

I'd suggest breaking the data up into multiple partitions if that is at all feasible. No matter what tools you use, scanning through that many files is going to take time. If it were on separate partitions, you could at least narrow the problem down to a single partition first. But that may not be an option for what you are doing.

du is likely the best tool for what you are looking for. Here's how I use it:

If your directory structure looks like:

/mount/1/abc/123/456/789
/mount/1/def/stuff/morestuff/evenmorestuff
/mount/2/qwer/wer/erty

I would run:

du -s /mount/*/* | sort -n

That will give you the total usage of each second-level directory, sorted by size. If it takes a long time to run, direct the output to a file and let it run overnight (there's a sketch of that below).

Your output will look like this:

10000 /mount/1/abc
20000 /mount/1/def
23452 /mount/2/qwer

Then you just hope that breaks it down enough to see where the problem spots are.
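
For the overnight run mentioned above, a minimal sketch might be (the report path is just an example):

# Keep the scan running after you log out and write the sorted report to a file
nohup sh -c 'du -s /mount/*/* 2>/dev/null | sort -n > /var/tmp/du-report.txt' &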

If this is a regular issue, you could have cron run that command every night at a time when the system isn't as busy and save the output to a file. Then you immediately have some recent data to look at when you notice the problem.
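
A crontab entry for that might look like the following (the schedule, report path, and file name are assumptions - for example, dropped into /etc/cron.d/du-report):

# Hypothetical /etc/cron.d/du-report: run at 02:30 nightly, keep only the latest report
30 2 * * * root du -s /mount/*/* 2>/dev/null | sort -n > /var/log/du-report.txt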

One other option you may wish to look at is quotas - if this is shared storage and everyone is using a different user account, setting very high quotas can prevent runaway processes from eating gobs of storage space.
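
A sketch of what that might look like with setquota, assuming user quotas are already enabled on the filesystem (the user name and limits are made up):

# Hypothetical: cap user "someuser" at 500 GiB soft / 550 GiB hard on /mount/1;
# block limits are given in 1 KiB blocks, and the two zeros leave inodes unlimited
setquota -u someuser 524288000 576716800 0 0 /mount/1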

Solution 2:

I make this recommendation often to augment the usual df -i and du -skh solutions...

Look into the ncdu utility. It's an ncurses-based disk utilization graphing tool. You'll get output similar to the listing below, with file counts and a summary of directory sizes. It's available for CentOS/RHEL.

Also see: https://serverfault.com/questions/412651/console-utility-to-know-how-disk-space-is-distributed/412655#412655

ncdu 1.7 ~ Use the arrow keys to navigate, press ? for help                                                         
--- /data ----------------------------------------------------------------------------------------------------------
  163.3GiB [##########] /docimages                                                                                  
   84.4GiB [#####     ] /data
   82.0GiB [#####     ] /sldata
   56.2GiB [###       ] /prt
   40.1GiB [##        ] /slisam
   30.8GiB [#         ] /isam
   18.3GiB [#         ] /mail
   10.2GiB [          ] /export
    3.9GiB [          ] /edi   
    1.7GiB [          ] /io     
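
If you want to try it, something like the following should work (the EPEL package name and the scan path are just what I'd expect on CentOS/RHEL - adjust for your setup):

# ncdu is typically packaged in the EPEL repository on CentOS/RHEL
yum install ncdu

# Scan /data without crossing filesystem boundaries (-x); -q lowers the screen
# refresh rate, which helps when running it over a slow SSH session
ncdu -x -q /data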

Solution 3:

I use this command to find the largest files in a directory or on the whole system, but I'm not sure how well it scales in your environment:

find / -type f -size +100000k -exec ls -lh {} \; 2>/dev/null | awk '{ print $9 " : " $5 }'

If you want, you can leave out the awk statement (I just use it to clean up the output). The find command will recurse through directories searching for files larger than the given size in kilobytes. It will then execute ls -lh on each matching file, giving something like:

-rw-r--r-- 1 username group 310M Feb 25  2011 filename

The awk statement then reduces that to:

filename : 310M

The thing I find most useful about this command is that you can specify the minimum size of the files. As said before, I have no idea how CPU- or time-intensive this would be in your environment.
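
If you have GNU find, here's a variant that skips spawning ls for every match - just a sketch using the same size threshold, and note it prints the size in bytes rather than human-readable form:

# GNU find only: print "path : size-in-bytes" directly, no ls or awk needed
find / -type f -size +100000k -printf '%p : %s bytes\n' 2>/dev/null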