Why is the output of `du` often so different from `du -b`?
Solution 1:
Apparent size is the number of bytes your applications think are in the file. It's the amount of data that would be transferred over the network (not counting protocol headers) if you decided to send the file over FTP or HTTP. It's also the result of `cat theFile | wc -c`, and the amount of address space that the file would take up if you loaded the whole thing using `mmap`.
Disk usage is the amount of space that can't be used for something else because your file is occupying that space.
In most cases, the apparent size is smaller than the disk usage because the disk usage counts the full size of the last (partial) block of the file, and apparent size only counts the data that's in that last block. However, apparent size is larger when you have a sparse file (sparse files are created when you seek somewhere past the end of the file, and then write something there -- the OS doesn't bother to create lots of blocks filled with zeros -- it only creates a block for the part of the file you decided to write to).
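A minimal sketch showing both cases side by side (the filenames are arbitrary, and the exact numbers depend on your filesystem's block size, assumed here to be 4096 bytes):
# Small file: disk usage (one full block) exceeds apparent size.
printf 'hello' > small
du -B1 small                    # 4096     small
du -B1 --apparent-size small    # 5        small
# Sparse file: apparent size exceeds disk usage.
truncate -s 1M sparse           # creates a hole; no data blocks are written
du -B1 sparse                   # 0        sparse
du -B1 --apparent-size sparse   # 1048576  sparse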
Solution 2:
Minimal block granularity example
Let's play a bit to see what is going on.
`mount` tells me I'm on an ext4 partition mounted at `/`.
I find its block size with:
stat -fc %s .
which gives:
4096
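If you prefer, the same number can be read from the ext4 superblock (the device name below is a placeholder assumption; substitute your own, e.g. from `df .`):
# Print the filesystem block size from the superblock; /dev/sda1 is an assumption.
sudo tune2fs -l /dev/sda1 | grep 'Block size'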
Now let's create some files with sizes `1`, `4095`, `4096` and `4097`:
#!/usr/bin/env bash
for size in 1 4095 4096 4097; do
  dd if=/dev/zero of=f bs=1 count="${size}" status=none
  echo "size ${size}"
  echo "real $(du --block-size=1 f)"
  echo "apparent $(du --block-size=1 --apparent-size f)"
  echo
done
and the results are:
size 1
real 4096 f
apparent 1 f
size 4095
real 4096 f
apparent 4095 f
size 4096
real 4096 f
apparent 4096 f
size 4097
real 8192 f
apparent 4097 f
So we see that any file of size up to `4096` bytes in fact takes up `4096` bytes on disk.
Then, as soon as we cross that with `4097`, the disk usage goes up to `8192`, which is `2 * 4096`.
It is clear then that the disk always stores data at a block boundary of `4096` bytes.
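In other words, the reported disk usage is just the apparent size rounded up to the next block boundary; a sketch of that arithmetic in shell:
# Round an apparent size up to the next multiple of the block size.
size=4097
block=4096
echo $(( (size + block - 1) / block * block ))   # prints 8192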
What happens to sparse files?
I haven't investigated what the exact on-disk representation is, but it is clear that `--apparent-size` does take sparseness into consideration.
This can lead to apparent sizes being larger than the actual disk usage.
For example:
dd seek=1G if=/dev/zero of=f bs=1 count=1 status=none
du --block-size=1 f
du --block-size=1 --apparent-size f
gives:
8192 f
1073741825 f
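`stat` reports the same two numbers directly: `%s` is the apparent size in bytes, and `%b` counts allocated blocks of `%B` bytes each (typically 512). Run on the file from the example above:
stat -c 'apparent=%s allocated_blocks=%b block_unit=%B' f
# apparent=1073741825 allocated_blocks=16 block_unit=512   (16 * 512 = 8192)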
Related: How to test if sparse file is supported
What to do if I want to store a bunch of small files?
Some possibilities are:
- use a database instead of the filesystem (see the sketch after this list): Database vs File system storage
- use a filesystem that supports block suballocation
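As a rough illustration of the database option (a sketch only; the table layout and filenames are assumptions, and `readfile()` is a helper built into the sqlite3 shell):
# Pack many small payloads into one SQLite file instead of many tiny files.
sqlite3 blobs.db 'CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB);'
printf 'tiny payload' > one
sqlite3 blobs.db "INSERT INTO files VALUES ('one', readfile('one'));"
sqlite3 blobs.db "SELECT name, length(data) FROM files;"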
Bibliography:
- https://serverfault.com/questions/565966/which-block-sizes-for-millions-of-small-files
- https://askubuntu.com/questions/641900/how-file-system-block-size-works
Tested in Ubuntu 16.04.
Solution 3:
Compare (for example) `du -bm` to `du -m`.
The `-b` sets `--apparent-size --block-size=1`, but then the `m` overrides the block size to be `1M`.
Similarly for `-bh` versus `-h`: the `-bh` means `--apparent-size --block-size=1 --human-readable`, and again the `h` overrides that block size.
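A sparse file makes the override easy to see (a quick sketch; the filename and size are arbitrary):
# A 10 MiB sparse file: no data blocks are allocated.
truncate -s 10M f
du -m f     # 0   f   (disk usage, shown in 1M units)
du -bm f    # 10  f   (apparent size, but still displayed in 1M units)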
Solution 4:
Files and folders have their real size and the size on disk.
- `--apparent-size` is the file's or folder's real size.
- Size on disk is the number of bytes the file or folder actually takes up on disk; this is what plain `du` reports.
If you find that the apparent size is almost always several orders of magnitude higher than the disk usage, it means you have a lot of sparse files, or files with internal fragmentation or indirect blocks.
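One way to hunt for such files is GNU `find`'s `%S` directive, which prints the ratio of allocated bytes to apparent size (values well below 1 indicate sparse files):
# List files whose allocated size is far below their apparent size.
find . -type f -printf '%S\t%s\t%p\n' | sort -n | head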
Solution 5:
Because by default `du` gives disk usage, which is the same as or larger than the file size. As the `du` man page says under `--apparent-size`:
print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be
larger due to holes in (`sparse') files, internal fragmentation, indirect blocks, and the like