XFS on LVM on hardware RAID: correct parameters?
I have 10 disks of 8 TB each in a hardware RAID6 (thus, 8 data disks + 2 parity). Following the answer to a very similar question, I hoped that all necessary parameters would be detected automatically. However, when I created the XFS file system at the end, I got:
# mkfs.xfs /dev/vgdata/lvscratch
meta-data=/dev/vgdata/lvscratch isize=256 agcount=40, agsize=268435455 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=10737418200, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
It looks as if no striping information was detected (sunit=0, swidth=0). Because different sites use different terms (strip size, stripe size, stripe chunk, ...), I would like to ask whether I got the manual parameters right.
The RAID 6 has been set up with a strip size of 256KB:
# ./storcli64 /c0/v1 show all | grep Strip
Strip Size = 256 KB
Thus, the stripe size is 8*256KB = 2048KB = 2MB. Is this correct? According to this (and if I understand it correctly), pvcreate has to use the strip (or chunk) size as the argument to --dataalignment:
# pvcreate --dataalignment 256K /dev/sdb
Physical volume "/dev/sdb" successfully created
Note that I used the whole RAID device without partitions. Now a
# vgcreate vgdata /dev/sdb
Volume group "vgdata" successfully created
with a default PE Size of 4MB should be fine because it is a multiple of the stripe size of 2MB. Correct?
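The arithmetic behind this step can be checked with a short sketch. It only does the multiplication and the divisibility test; the variable names are illustrative and the 4 MiB value is the default PE size mentioned above.

```shell
#!/bin/sh
# Values taken from the question; variable names are illustrative only.
strip_kib=256      # strip size per disk, as reported by storcli
data_disks=8       # RAID6 with 10 disks: 8 data + 2 parity
pe_kib=4096        # default LVM physical extent size (4 MiB)

# Full stripe = strip size times the number of data disks.
stripe_kib=$((strip_kib * data_disks))

# A remainder of 0 means the PE size is a whole multiple of the full stripe.
remainder=$((pe_kib % stripe_kib))

echo "full stripe: ${stripe_kib} KiB, PE size modulo stripe: ${remainder}"
```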
Now, a part of the vgroup is assigned to a logical volume:
# lvcreate -L 40T vgdata -n lvscratch
Logical volume "lvscratch" created.
Finally, the file system is created but now with the correct arguments (stripe size of 2MB, stripe width of 8):
# mkfs.xfs -d su=2048k,sw=8 /dev/vgdata/lvscratch
meta-data=/dev/vgdata/lvscratch isize=256 agcount=41, agsize=268434944 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=10737418240, imaxpct=5
= sunit=512 swidth=4096 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Is this approach correct? Is there anything to keep in mind when extending the logical volume or the volume group? I suppose that if the volume group were extended with another RAID6 array, its strip size should equal that of the present RAID6.
EDIT: My confusion seems to stem mainly from the different usage of the terms connected to "stripe". The manufacturer of my RAID controller, LSI or Avago, defines the terms in the following way:
Stripe Width
Stripe width is the number of drives involved in a drive group where striping is implemented. For example, a four-disk drive group with disk striping has a stripe width of four.
Stripe Size
The stripe size is the length of the interleaved data segments that the RAID controller writes across multiple drives, not including parity drives. For example, consider a stripe that contains 64 KB of disk space and has 16 KB of data residing on each disk in the stripe. In this case, the stripe size is 64 KB, and the strip size is 16 KB.
Strip Size
The strip size is the portion of a stripe that resides on a single drive.
Wikipedia (and IBM) seem to use other definitions:
The segments of sequential data written to or read from a disk before the operation continues on the next disk are usually called chunks, strides or stripe units, while their logical groups forming single striped operations are called strips or stripes. The amount of data in one chunk (stripe unit), often denominated in bytes, is variously referred to as the chunk size, stride size, stripe size, stripe depth or stripe length. The number of data disks in the array is sometimes called the stripe width, but it may also refer to the amount of data within a stripe.
The amount of data in one stride multiplied by the number of data disks in the array (i.e., stripe depth times stripe width, which in the geometrical analogy would yield an area) is sometimes called the stripe size or stripe width. Wide striping occurs when chunks of data are spread across multiple arrays, possibly all the drives in the system. Narrow striping occurs when the chunks of data are spread across the drives in a single array.
Even in the Wikipedia text above, "stripe size" is used with two different meanings. However, I now suppose that, when creating the XFS file system, the size of a single chunk stored on a single drive has to be given as the argument to su. Thus, it should be mkfs.xfs -d su=256k,sw=8 in the command above. Correct?
Solution 1:
Rather than "strip size" and "stripe size", the XFS man pages use the terms "stripe unit" and "stripe width" respectively.
This makes it possible to decode the otherwise confusing text in the mkfs.xfs(8) man page:
sunit=value
This is used to specify the stripe unit for a RAID
device or a logical volume. The value has to be
specified in 512-byte block units. Use the su subop‐
tion to specify the stripe unit size in bytes. This
suboption ensures that data allocations will be
stripe unit aligned when the current end of file is
being extended and the file size is larger than
512KiB. Also inode allocations and the internal log
will be stripe unit aligned.
su=value
This is an alternative to using sunit. The su sub‐
option is used to specify the stripe unit for a RAID
device or a striped logical volume. The value has to
be specified in bytes, (usually using the m or g
suffixes). This value must be a multiple of the
filesystem block size.
So, with your array reporting a strip size of 256KiB, you would specify either su=256K or sunit=512 (because 512 512-byte blocks equals 256KiB).
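As a minimal sketch of that conversion (plain shell arithmetic, nothing is sent to mkfs.xfs; the variable names are made up for illustration):

```shell
#!/bin/sh
# Strip size reported by the controller, in KiB (256 KB in the question).
strip_kib=256

# su= is given in bytes (size suffixes such as k are accepted).
su_bytes=$((strip_kib * 1024))

# sunit= is the same quantity counted in 512-byte units.
sunit=$((su_bytes / 512))

echo "su=${strip_kib}k is equivalent to sunit=${sunit}"
```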
swidth=value
This is used to specify the stripe width for a RAID
device or a striped logical volume. The value has to
be specified in 512-byte block units. Use the sw
suboption to specify the stripe width size in bytes.
This suboption is required if -d sunit has been
specified and it has to be a multiple of the -d
sunit suboption.
sw=value
suboption is an alternative to using swidth. The sw
suboption is used to specify the stripe width for a
RAID device or striped logical volume. The value is
expressed as a multiplier of the stripe unit, usu‐
ally the same as the number of stripe members in the
logical volume configuration, or data disks in a
RAID device.
When a filesystem is created on a logical volume
device, mkfs.xfs will automatically query the logi‐
cal volume for appropriate sunit and swidth values.
With 10 spindles (8 data, 2 parity) you would specify either sw=8 (data spindles) or swidth=4096 (the strip size multiplied by data spindles, i.e. 2MiB, expressed in 512-byte units).
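The same kind of arithmetic applies to the stripe width. The sketch below assumes the 256 KiB strip and 8 data disks from the question and counts swidth in 512-byte units, as the man page excerpt specifies; the variable names are illustrative.

```shell
#!/bin/sh
strip_kib=256    # strip size per disk
data_disks=8     # parity disks are not counted

# sw= is simply a multiplier of the stripe unit: the number of data disks.
sw=$data_disks

# swidth= is the full stripe counted in 512-byte units.
sunit=$((strip_kib * 1024 / 512))
swidth=$((sunit * data_disks))

# Full stripe in KiB, for comparison with the controller's "stripe size".
stripe_kib=$((strip_kib * data_disks))

echo "sw=${sw} swidth=${swidth} (full stripe: ${stripe_kib} KiB)"
```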
Note that mkfs.xfs interprets sunit and swidth as being specified in units of 512B sectors; unfortunately, that is not the unit they are reported in. xfs_info and mkfs.xfs report them in multiples of your basic block size (bsize) and not in 512B sectors.
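To make the reporting unit concrete, here is a small sketch that decodes the "sunit=512 swidth=4096 blks" line from the question's second mkfs.xfs run (bsize=4096) and then computes what the report should look like for a 256 KiB strip on 8 data disks. The variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the mkfs.xfs output in the question (su=2048k,sw=8).
bsize=4096
sunit_blks=512
swidth_blks=4096

# Reported values are in file system blocks, so multiply by bsize.
sunit_kib=$((sunit_blks * bsize / 1024))
swidth_kib=$((swidth_blks * bsize / 1024))
echo "reported: stripe unit ${sunit_kib} KiB, stripe width ${swidth_kib} KiB"

# Expected report for a 256 KiB strip and 8 data disks.
expected_sunit_blks=$((256 * 1024 / bsize))
expected_swidth_blks=$((expected_sunit_blks * 8))
echo "expected: sunit=${expected_sunit_blks} swidth=${expected_swidth_blks} blks"
```

Decoded this way, the run with su=2048k,sw=8 describes a 2 MiB stripe unit and a 16 MiB stripe width, which is eight times larger than the array's actual geometry.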
TL;DR:
The easiest way to specify these is usually by strip size and data spindle count, thus su= strip size and sw= number of data spindles (parity disks are not counted).
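For the array in the question, that rule gives the command sketched below. The script only assembles and prints the command line rather than running it; the -N option is added so that, when the printed command is run by hand, mkfs.xfs prints the resulting parameters without creating the file system. The device path is the one from the question and the variable names are illustrative.

```shell
#!/bin/sh
strip_kib=256                 # strip size reported by the controller
data_disks=8                  # 10 disks in RAID6: 8 data + 2 parity
dev=/dev/vgdata/lvscratch     # logical volume from the question

# -N: print the file system parameters only, do not write anything.
cmd="mkfs.xfs -N -d su=${strip_kib}k,sw=${data_disks} ${dev}"

echo "$cmd"
```

Once the printed sunit and swidth values look right, dropping -N creates the file system for real.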