How is DaisyDisk so fast?

DaisyDisk scans my Mac's HD blazing fast compared to, say, du. I'm wondering what's the trick. I suspect it wouldn't be so fast on non-Mac filesystems, but I haven't tried. Any clues?


Solution 1:

I've never used DaisyDisk, but judging by the video demo on their site, it seems they're using a few tricks to make it fast.

First off, are you sure that du is slower? Try running du / >/dev/null and see whether it is any faster than DaisyDisk. Mind you, whichever tool runs first will leave much of the filesystem metadata in the cache, so the second run has an advantage when you time them.

du is pretty quick since it only reads directory entries and file metadata, never file contents, and reports paths and sizes. The only way to know what kind of file something is would be to guess from the file extension or to look inside the file to determine its type (e.g. UNIX "magic" bytes). The file extension route is fast; examining the file is obviously much slower since you have to open and read it.
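To make the cost difference concrete, here is a minimal Python sketch of the two approaches. The function names and the tiny signature table are my own, purely for illustration, and say nothing about how DaisyDisk or any other tool actually does it.

```python
import mimetypes

# A few well-known file signatures. Illustrative only, far from a complete table.
MAGIC_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

def guess_by_extension(path):
    """Cheap: looks only at the name and never touches the file's contents."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime

def guess_by_magic(path):
    """Costlier: has to open the file and read its first few bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    for signature, mime in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return mime
    return None  # unknown signature
```

The extension guess needs no I/O beyond the directory listing you already have, while the magic-byte check costs one open and one read per file, which adds up over hundreds of thousands of files.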

With the du output, you could quickly build the view of the top-level directories: filter it in code and draw sized pie slices, as DaisyDisk does. If the user drills down into a directory (e.g. /Users), you repeat the process, focusing only on that level. Since DaisyDisk only shows you the top 10 or 20 space hogs, it doesn't need to go into detail about the smaller files (notice it conveniently lumps them into "Smaller Files 750MB" or some such label). At this point it still hasn't needed to dig into the actual files, and if it doesn't guess by extension, it only has to determine the "magic" of a few large files, which goes very quickly.

So what we're probably seeing is that it quickly determines the name, path and size of every file on the drive (as du demonstrates is possible), but cleverly shows only the top offenders to help you get to what you're interested in. Most people won't use a tool like this to hunt down nit-picky little files; you'd go to the Finder for that, if you even bother at that level. There doesn't really appear to be anything special here, except that the program only needs to examine a file's type or content when the user specifically asks for it, and so avoids that heavy work most of the time.
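The approach I'm guessing at can be sketched in a few lines of Python. This is my speculation turned into code, not DaisyDisk's implementation; tree_size, top_slices and the limit parameter are made-up names, and it sums apparent file sizes (st_size) rather than the disk blocks that du reports.

```python
import os

def tree_size(path):
    """Total apparent size in bytes under path, read from metadata only."""
    total = 0
    stack = [path]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        else:
                            total += entry.stat(follow_symlinks=False).st_size
                    except OSError:
                        continue  # entry vanished or is unreadable
        except OSError:
            continue  # unreadable directory
    return total

def top_slices(path, limit=10):
    """Sizes of the immediate children of path, largest first, rest lumped together."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = tree_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((entry.name, size))
    sizes.sort(key=lambda item: item[1], reverse=True)
    slices = sizes[:limit]
    rest = sum(size for _name, size in sizes[limit:])
    if rest:
        slices.append(("Smaller Files", rest))  # label borrowed from the demo
    return slices
```

Drilling down into a slice is then just another call to top_slices on that child directory. No file is ever opened, only its metadata is read.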

So what makes DaisyDisk special versus (say) the free "Grand Perspective" application for OSX? Slick interface for one (I do like the examine file and drag to collect/delete feature), but also I think GP does examine the files while it scans since it shows everything in its graphical view and colors by file type. You could do the color by type mechanism with a du approach as well, but you're only guessing based on file extension and/or where you found the file.

All in all, it is a slick application with a clever interface. Why is it fast? Because they appear to have taken short cuts to avoid heavy lifting until it would absolutely be necessary. For me, I'm fine with Grand Perspective :-)

Solution 2:

I'm the developer of DaisyDisk. It would take a while to explain how we achieve this on the engineering side, but I can assure you that the app doesn't take any "shortcuts" or use any "tricks". The scanning is real and full.

As already mentioned, it's hard to make a precise measurement because of disk caching. Every experiment will give you a different time, depending on many factors. But it's true that DaisyDisk is far faster than any other disk scanner. This is especially noticeable on SSDs. I haven't tried comparing it with "du", though.

Solution 3:

I'm not able to measure any large difference between du and DaisyDisk, other than that the native app is in some cases slower than the command line tool.

time du ~ > /dev/null 2>&1

The first run of du took 0m7.947s and the second took 0m5.465s; DaisyDisk took about 8 seconds by stopwatch both times.
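If you want to repeat this kind of measurement without a stopwatch, a small Python helper along these lines would do. It's only a sketch: time_runs is a name I made up, and it measures wall-clock time per run, so the first run includes whatever was not yet cached.

```python
import subprocess
import time

def time_runs(command, runs=2):
    """Run command `runs` times with output discarded; return seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard stdout and stderr so terminal output doesn't skew the timing.
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        timings.append(time.perf_counter() - start)
    return timings
```

For example, time_runs(["du", "/Users/me"]) (with your own home directory in place of the hypothetical /Users/me) returns two timings, and the gap between them gives a rough idea of how much the cache helps the second run.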

My guess is that you are seeing delays caused by the command line tools printing to the screen, or by them measuring more of the disk. Are you using DaisyDisk to scan as administrator?

Solution 4:

I have noticed that du uses getattrlist() in a single thread. I found this in some code that Apple open-sourced a while back, around the High Sierra release. DaisyDisk probably uses getattrlistbulk() with multiple threads, since I noticed more even usage across CPU cores.

On my Mac, DaisyDisk is always 2x to 3x faster than du. I tested this with my home folder of 100 GB, 20K directories and 350K files. The file system does some caching, so it is better to test this on a cold system, or with a large directory, to reduce the caching benefits.

I found this code snippet helpful for getattrlistbulk - https://www.snip2code.com/Snippet/526248/A-sample-on-how-to-properly-use-getattrl

This snippet still uses a single thread, and that thread was I/O-bound: my CPU usage was only around 35%. If you somehow spread the work across multiple threads, I do expect this to be super fast.
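To illustrate what spreading the work across threads could look like, here is a rough Python sketch. Note the swap: it uses os.scandir rather than getattrlistbulk(), which as far as I know the Python standard library does not expose, so it shows only the threading idea and not the bulk-attribute call. The function names and the workers parameter are mine, and I haven't measured whether it beats du.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scan_directory(path):
    """Return (bytes of files directly in path, list of its subdirectories)."""
    size, subdirs = 0, []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                try:
                    if entry.is_dir(follow_symlinks=False):
                        subdirs.append(entry.path)
                    else:
                        size += entry.stat(follow_symlinks=False).st_size
                except OSError:
                    continue  # entry vanished or is unreadable
    except OSError:
        pass  # unreadable directory counts as empty
    return size, subdirs

def parallel_tree_size(root, workers=8):
    """Sum apparent file sizes under root, scanning each level's directories in parallel."""
    total = 0
    level = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while level:
            next_level = []
            # Every directory on this level is scanned by a worker thread.
            for size, subdirs in pool.map(scan_directory, level):
                total += size
                next_level.extend(subdirs)
            level = next_level
    return total
```

Going level by level keeps the bookkeeping trivial, at the cost of waiting for the slowest directory on each level before moving on. A shared work queue would keep the threads busier.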