Ceph's raw space usage
I can't understand Ceph's raw space usage.
I have 14 HDDs (14 OSDs) across 7 servers, 3 TB per HDD, so ~42 TB of raw space in total.
ceph -s
osdmap e4055: 14 osds: 14 up, 14 in
pgmap v8073416: 1920 pgs, 6 pools, 16777 GB data, 4196 kobjects
33702 GB used, 5371 GB / 39074 GB avail
I created 4 block devices, 5 TB each:
df -h
/dev/rbd1 5.0T 2.7T 2.4T 54% /mnt/part1
/dev/rbd2 5.0T 2.7T 2.4T 53% /mnt/part2
/dev/rbd3 5.0T 2.6T 2.5T 52% /mnt/part3
/dev/rbd4 5.0T 2.9T 2.2T 57% /mnt/part4
df shows 10.9 TB used in total, while ceph shows 33702 GB used. With 2 copies it should be ~22 TB, yet 33.7 TB is used, so roughly 11 TB is unaccounted for.
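My rough arithmetic with the df numbers above (treating TB loosely and ignoring the GB/GiB difference):

awk 'BEGIN { used = 2.7 + 2.7 + 2.6 + 2.9; print used, used * 2 }'    # ~10.9 TB of data, ~21.8 TB expected with 2 copies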
ceph osd pool get archyvas size
size: 2
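For reference, the replication factor of every pool can be listed in one go (the exact output format varies between Ceph releases):

ceph osd dump | grep 'replicated size'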
ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
39074G 5326G 33747G 86.37
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 0 0 1840G 0
metadata 1 0 0 1840G 0
archyvas 3 4158G 10.64 1840G 1065104
archyvas2 4 4205G 10.76 1840G 1077119
archyvas3 5 3931G 10.06 1840G 1006920
archyvas4 6 4483G 11.47 1840G 1148291
Both the block devices and the OSD filesystems are XFS.
One possible source of confusion is GB vs. GiB/TB vs. TiB (base 10/base 2), but that cannot explain all of the difference here.
Ceph/RBD will try to "lazily" allocate space for your volumes. This is why, although you created four 5 TB volumes, it reports about 16 TB of data, not 20 TB. But 16 TB is more than the sum of the "active" contents of your RBD-backed filesystems, which is only around 11 TB, as you say. Several things to note:
When you delete files in your RBD-backed filesystems, the filesystems will internally mark the blocks as free, but usually will not try to "return" them to the underlying block device (RBD). If your kernel RBD client is recent enough (3.18 or newer), you should be able to use fstrim to return freed blocks to RBD (see the sketch below). I suspect that you have created and deleted other files on these filesystems, right?
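For example, something along these lines should hand the freed blocks back to RBD, assuming the mount points from your df output and a kernel with RBD discard support (3.18+):

fstrim -v /mnt/part1
fstrim -v /mnt/part2
fstrim -v /mnt/part3
fstrim -v /mnt/part4

The -v flag makes fstrim report how many bytes were actually discarded.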
There is also some filesystem overhead beyond the net data usage shown by df. Besides "superblocks" and other filesystem-internal data structures, some overhead is to be expected from the granularity at which RBD allocates data. I think RBD will always allocate 4 MB chunks, even when only a portion of a chunk is used.
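You can check the chunk (object) size of an image with rbd info; the pool and image names below are placeholders for your own:

rbd info archyvas/<image-name>

Look for the order/object-size line in the output; order 22 corresponds to 4 MB objects.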
I am no Ceph expert, but let me guess a little.
The block devices are not mounted with the discard option. So any data you write and later delete does not show up on the filesystem (/mnt/part1), but since it was written once and never trimmed, it stays on the underlying storage.
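If you want deletes passed straight down to RBD, one option is to mount with discard enabled (a sketch; online discard has some performance cost, so running fstrim periodically, as suggested above, is often preferred):

mount -o discard /dev/rbd1 /mnt/part1    # after unmounting first, or add "discard" to the fstab entry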
If you look at USED for your pools and add those together, you get 16777 GB, which equals what ceph -s shows. And if you multiply that by two (two copies), you get 33554 GB, which is pretty much the space used.
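The arithmetic, using the USED column from your ceph df output:

echo $(( 4158 + 4205 + 3931 + 4483 ))    # 16777 GB of data across the four archyvas pools
echo $(( 16777 * 2 ))                    # 33554 GB with two copies, close to the 33702 GB reported as used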