Ceph's raw space usage

I can't understand Ceph's raw space usage.

I have 14 HDDs (14 OSDs) on 7 servers, 3 TB each, so ~42 TB of raw space in total.

ceph -s 
     osdmap e4055: 14 osds: 14 up, 14 in
      pgmap v8073416: 1920 pgs, 6 pools, 16777 GB data, 4196 kobjects
            33702 GB used, 5371 GB / 39074 GB avail

I created 4 block devices, 5 TB each:

df -h
/dev/rbd1       5.0T  2.7T  2.4T  54% /mnt/part1
/dev/rbd2       5.0T  2.7T  2.4T  53% /mnt/part2
/dev/rbd3       5.0T  2.6T  2.5T  52% /mnt/part3
/dev/rbd4       5.0T  2.9T  2.2T  57% /mnt/part4

df shows that 10.9 TB is used in total, while ceph shows 33702 GB used. With 2 copies it should be ~22 TB, but I have 33.7 TB used, so ~11 TB is unaccounted for.
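
To spell out my arithmetic (the figures come from the df output above):

echo "2.7 + 2.7 + 2.6 + 2.9" | bc          # 10.9 TB used according to df
echo "(2.7 + 2.7 + 2.6 + 2.9) * 2" | bc    # 21.8 TB I would expect with 2 copies
# ceph reports 33702 GB ~ 33.7 TB used, so roughly 11 TB are unaccounted for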

ceph osd pool get archyvas size
size: 2


ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    39074G     5326G       33747G         86.37
POOLS:
    NAME          ID     USED      %USED     MAX AVAIL     OBJECTS
    data          0          0         0         1840G           0
    metadata      1          0         0         1840G           0
    archyvas      3      4158G     10.64         1840G     1065104
    archyvas2     4      4205G     10.76         1840G     1077119
    archyvas3     5      3931G     10.06         1840G     1006920
    archyvas4     6      4483G     11.47         1840G     1148291

Both the block devices and the OSD filesystems use XFS.


One possible source of confusion is GB vs. GiB and TB vs. TiB (base 10 vs. base 2), but that cannot explain all of the difference here.
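
As a rough sanity check (assuming, as I believe these older Ceph releases do, that the "GB" figures are really GiB):

echo "14 * 3 * 10^12 / 2^30" | bc    # 14 disks x 3 TB (decimal) = 39115 GiB
# close to the 39074G SIZE that ceph df reports; the small remainder is
# partitioning/filesystem overhead on the OSDs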

Ceph/RBD will try to allocate space for your volumes "lazily". This is why, although you created four 5 TB volumes, it reports about 16 TB of data, not 20. But 16 TB is still more than the "active" contents of your RBD-backed filesystems, which add up to only around 11 TB, as you say. Several things to note:

When you delete files in your RBD-backed filesystems, the filesystem will internally mark the blocks as free, but usually won't try to "return" them to the underlying block device (RBD). If your kernel's RBD client is recent enough (3.18 or newer), you should be able to use fstrim to return the freed blocks to RBD. I suspect you have created and deleted other files on these filesystems, right?
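
For example (run as root; the mount points are the ones from your df output):

fstrim -v /mnt/part1
fstrim -v /mnt/part2
fstrim -v /mnt/part3
fstrim -v /mnt/part4
# -v prints how many bytes were discarded, i.e. handed back to the block device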

There is also some filesystem overhead beyond the net data usage shown by df. Besides superblocks and other filesystem-internal data structures, some overhead is to be expected from the granularity at which RBD allocates data. I think RBD always allocates 4 MB chunks, even when only a portion of a chunk is used.
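
You can check the chunk (object) size of an image with rbd info; the pool/image name below is just a placeholder, substitute one of yours:

rbd info archyvas/your-image-name
# look at the "order" line: order 22 means 2^22 bytes = 4 MB objects (the default)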


I am no Ceph expert, but let me take a guess.

The block devices are probably not mounted with the discard option. So any data you write and then delete no longer shows up in the filesystem (/mnt/part1), but since it was written once and never trimmed, it still occupies space on the underlying storage.
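
If you want future deletes to be passed down automatically, you could mount with the discard option (XFS supports it), for example via /etc/fstab; a periodic fstrim, as mentioned in the other answer, is often the cheaper alternative, since online discard can cost some performance:

# /etc/fstab - device and mount point taken from the question
/dev/rbd1  /mnt/part1  xfs  defaults,discard  0  0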

If you look at USED for your pools and add those together, you get 16777 GB, which equals what ceph -s shows as data. And if you multiply that by two (two copies), you get 33554 GB, which is pretty much the space used.
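
In numbers:

echo "4158 + 4205 + 3931 + 4483" | bc        # 16777 GB, the "data" figure in ceph -s
echo "(4158 + 4205 + 3931 + 4483) * 2" | bc  # 33554 GB, close to the 33702 GB "used"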