SSD, Erase Block Size & LVM: PV on raw device, Alignment

Solution 1:

Yes, I also checked the on-disk layouts of MBR/PBR/GPT/MD/LVM and came to the same conclusion.

For your case (LVM on a raw disk), if the LVM PE (physical extent) area is 1MB-aligned with pvcreate, you can be sure that all further data allocations will stay aligned, as long as you keep allocation sizes at multiples of 1MB.
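For example, a minimal sketch (the device name /dev/sdb is a placeholder): request an explicit 1MiB data alignment, then verify where the PV data area actually starts. On reasonably recent LVM the default data alignment is already 1MiB, so this mostly serves as verification.

pvcreate --dataalignment 1m /dev/sdb
pvs -o +pe_start --units b /dev/sdb    # pe_start should be a multiple of 1048576 bytes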

Since both "vgcreate -s" and "lvcreate -L" handles size-without-unit as MB value by default, you probably do not need to care much about alignment once you've done your pvcreate properly. Just make sure not to give size in %/PEs (for lvcreate -l) and B(byte)/S(512B - sector is always 512B in LVM)/K(KB) (for vgcreate -s and lvcreate -L).

=== added for clarification ===

Just as a follow-up: while an SSD may have a 1024KB erase block size as a whole device, each internal flash chip's erase block size / read-write page size is probably more like 32KB-128KB / 512B-8KB.

Although this depends on each SSD's controller, the I/O penalty from an extra read-modify-write cycle probably won't occur as long as you keep your writes aligned to the erase block size of each internal chip, which is 32KB-128KB in the example above. You just want each single write request to be big enough (= the erase block size of the SSD as a whole device), so you can expect better performance by efficiently driving all internal chips/channels in parallel.

My understanding is that 1024KB alignment is only a safety measure, as controller behavior varies by vendor and flash chip specs change rapidly. It is more important that OS-level write requests be issued in large bundles (1024KB, in this case).

Now, having said that, running mkfs(8) on a 1MB-aligned LVM block device will almost certainly break 1MB alignment for filesystem-level data/metadata. Most filesystems only care about 4KB alignment, so this is probably not perfect for SSDs (although, IIRC, recent filesystems like btrfs try to keep 64KB+ alignment when allocating internal contiguous blocks). But many filesystems do have a feature to bundle writes (e.g. stripe-size configuration) to get performance out of RAID, and that can be used to make write requests to the SSD near-optimal; see the sketch below.
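As a hedged example of that stripe-size trick (the 128KB-by-8 geometry is only an illustrative assumption, not a recommendation for any particular drive), XFS accepts a stripe hint at mkfs time, and ext4 has an analogous stride/stripe-width extended option (see mke2fs(8) for its exact syntax):

# hint a 128KB stripe unit with 8 units per stripe (= 1MB full stripe),
# so the allocator prefers 1MB-sized, 1MB-aligned writes
mkfs.xfs -d su=128k,sw=8 /dev/vg0/lv0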

I really wanted to back this statement with actual data, but it was difficult to prove because today's SSD controllers are so intelligent that they won't show much performance degradation once both the alignment size and the write size are "big enough". Just make sure your layout is not ill-aligned (avoid <4KB alignment at all costs) and your writes are not too small (1024KB is big enough).

Also, if you really care about the I/O penalty, double-check by disabling the device cache and benchmarking with a synced read-write-rewrite test.
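A rough sketch of the write leg of such a test (device paths are placeholders, the drive may or may not honor the cache toggle, and the dd overwrites the LV, so only run it on scratch space):

hdparm -W 0 /dev/sdb    # turn off the drive's volatile write cache
dd if=/dev/zero of=/dev/vg0/lv0 bs=1M count=1024 oflag=direct,dsync   # synchronous 1MB direct writes
hdparm -W 1 /dev/sdb    # re-enable the write cache afterwards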

Solution 2:

To my understanding, the defaults are already good enough. I don't think you need to worry about the --dataalignment option, as LVM will automatically try to align everything based on the values exported in sysfs; see the "data_alignment_detection" option in lvm.conf:

# By default, the start of a PV's data area will be a multiple of
# the 'minimum_io_size' or 'optimal_io_size' exposed in sysfs.
# - minimum_io_size - the smallest request the device can perform
#   w/o incurring a read-modify-write penalty (e.g. MD's chunk size)
# - optimal_io_size - the device's preferred unit of receiving I/O
#   (e.g. MD's stripe width)
# minimum_io_size is used if optimal_io_size is undefined (0).
# If md_chunk_alignment is enabled, that detects the optimal_io_size.
# This setting takes precedence over md_chunk_alignment.
# 1 enables; 0 disables.
data_alignment_detection = 1
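If you want to see the values LVM will pick up (sda is a placeholder), they come straight from the block layer's queue limits:

cat /sys/block/sda/queue/minimum_io_size    # smallest I/O without a read-modify-write penalty
cat /sys/block/sda/queue/optimal_io_size    # preferred I/O size; 0 means the device does not report one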

Furthermore, it is not necessary to pass a --physicalextentsize to vgcreate, as the default is already 4MB.