How to tune a Dell PowerVault MD3600i SAN and its initiators for best performance?

As a recent owner of a Dell PowerVault MD3600i, I'm seeing some odd results.

I have a dedicated 24-port 10GbE switch (PowerConnect 8024), set up with 9K jumbo frames.

The MD3600i has two RAID controllers, each with two 10GbE Ethernet NICs. There's nothing else on the switch; a single VLAN carries the SAN traffic.
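One way to confirm that jumbo frames actually pass end-to-end from an initiator to the array (the target IP and interface name below are placeholders for one of the controller iSCSI ports and the SAN-facing NIC):

# 8972 = 9000-byte MTU - 20-byte IP header - 8-byte ICMP header; -M do sets don't-fragment
ping -M do -s 8972 -c 4 192.168.130.101

# Check the initiator NIC MTU as well
ip link show eth2 | grep -o 'mtu [0-9]*'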

Here's my multipath.conf:

defaults {
    udev_dir        /dev
    polling_interval    5
    selector        "round-robin 0"
    path_grouping_policy    multibus
    getuid_callout      "/sbin/scsi_id -g -u -s /block/%n"
    prio_callout        none
    path_checker        readsector0
    rr_min_io       100
    max_fds         8192
    rr_weight       priorities
    failback        immediate
    no_path_retry       fail
    user_friendly_names yes
#   prio            rdac
}
blacklist {
    device {
        vendor  "*"
        product "Universal Xport"
    }
#   devnode "^sd[a-z]"
}

devices {
    device {
           vendor "DELL"
           product "MD36xxi"
           path_grouping_policy group_by_prio
           prio rdac 
        #  polling_interval  5
           path_checker rdac
           path_selector "round-robin 0"
           hardware_handler "1 rdac"
           failback immediate
           features "2 pg_init_retries 50"
           no_path_retry 30
           rr_min_io 100
           prio_callout "/sbin/mpath_prio_rdac /dev/%n"
       }
}

And iscsid.conf:

node.startup = automatic
node.session.timeo.replacement_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

After my tests, I can barely reach 200 Mb/s read/write.

Should I expect more than that? Given that it has dual 10 GbE connections, my expectation was somewhere around 400 Mb/s.

Any ideas? Guidelines? Troubleshooting tips?

EDIT:

The array is set up as a single 5.7 TB logical volume. The disks are all 1 TB 7.2k 6Gb/s SAS (ST1000NM0001), and the RAID level is RAID 10.

Some lines from the switch configuration:

interface Te1/0/23
storm-control broadcast
storm-control multicast
spanning-tree portfast
mtu 9000
switchport access vlan 40
exit
...
iscsi cos vpt 5
management access-list "default"
permit service ssh priority 1
permit service http priority 2
permit service https priority 3

And the multipath output:

[root@xnode4 ~]# multipath -ll -v2
multipath.conf line 30, invalid keyword: prio
mpath1 (36d4ae520009bd7cc0000030e4fe8230b) dm-2 DELL,MD36xxi
[size=5.5T][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=400][active]
 \_ 7:0:0:0   sdc 8:32  [active][ready]
 \_ 9:0:0:0   sde 8:64  [active][ready]
 \_ 11:0:0:0  sdi 8:128 [active][ready]
 \_ 13:0:0:0  sdn 8:208 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 10:0:0:0  sdj 8:144 [active][ghost]
 \_ 12:0:0:0  sdh 8:112 [active][ghost]
 \_ 8:0:0:0   sdd 8:48  [active][ghost]
 \_ 6:0:0:0   sdb 8:16  [active][ghost]

Solution 1:

Judging by your comments and edits, your bottleneck might be the storage itself.

First, assuming you have write caching enabled, all your writes should run at line speed until the cache is full. You can measure this fairly easily by finding out how much cache you have and running a 100% write benchmark with less data than that.

Second, once the cache starts destaging data to disk, write performance on RAID 10 (assuming the controllers aren't introducing bottlenecks) will be half the read performance, because each write goes to two disks while a read is served from only one. One benefit of RAID 10 is that there's no parity to calculate, so it's unlikely the controllers' processors simply can't keep up.
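For the cache-sized write test, a minimal sketch using fio (the device path and the 2g size are placeholders; adjust the size to stay below your controller's write cache):

# WARNING: this writes to the raw multipath device and will destroy any data on it.
# Keep --size below the controller's write cache so the test measures cache speed.
fio --name=cache-write --filename=/dev/mapper/mpath1 --rw=write --bs=1m \
    --size=2g --direct=1 --ioengine=libaio --iodepth=16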

Next, if your benchmark measures a mixture of reads and writes, the performance you get from the storage controller will depend on the type of IO. If it's sequential, you'll see a higher number of MB/s but a lower number of IO/s; if it's random small-block, you'll see low MB/s but as many IO/s as your disks can provide. Each 7200 RPM disk delivers only a limited number of IO/s when you're reading unpredictably, so the number of drives in your RAID multiplied by the IO/s per drive is your theoretical performance cap.
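As a rough back-of-the-envelope example (assuming a fully populated 12-bay MD3600i and roughly 75-100 random IO/s per 7.2k drive, both of which are assumptions about your setup):

# Rough random-IO ceiling: number of spindles x IO/s per spindle
drives=12          # assumption: fully populated 12-bay MD3600i
iops_per_drive=80  # ballpark figure for a 7.2k nearline SAS disk
echo "$(( drives * iops_per_drive )) random IO/s ceiling"            # ~960 IO/s
echo "$(( drives * iops_per_drive * 4 / 1024 )) MiB/s at 4 KiB IOs"  # only ~3-4 MiB/s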

Lastly, if all the storage is in one big volume presented as a single LUN, your command queue might be saturated. Normal operating systems have a configurable command queue depth (the number of outstanding IOs they'll line up for the storage), and each volume/LUN has its own queue. Another problem with having all the storage in the same LUN is that IO for that LUN is generally sent to a single controller. Even on active/active storage systems (which I'm not sure yours is), volumes can have an affinity for one controller over the other. The goal would be to create a bunch of volumes and split them evenly between the controllers.
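You can at least inspect the queue depths in play on the Linux side; a sketch using the device names from your multipath output (dm-2 corresponds to mpath1 there):

# Per-path SCSI queue depth of each iSCSI block device
for d in /sys/block/sd*/device/queue_depth; do
    echo "$d: $(cat "$d")"
done

# Request limit on the multipath device itself (dm-2 = mpath1 above)
cat /sys/block/dm-2/queue/nr_requests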

Solution 2:

Benchmark it with a single disk, then do it again with all your disks in a RAID 0.

RAID 0 has none of the RAID 10 or RAID 5 overhead.

Also look at the cache settings on the MD. The default cache block size is 4k, but it can go up to 32k. I saw up to a 30% difference in speed between those two values; test it against your own workloads, though.
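If I remember the CLI correctly, the cache block size can also be changed with SMcli rather than the GUI; the syntax below is from memory and the management IP is a placeholder, so check the MD Storage Manager CLI guide before running it:

# Placeholder management IP for one controller; verify syntax in the CLI guide.
SMcli 192.168.128.101 -c "show storageArray profile;"            # current cacheBlockSize is listed here
SMcli 192.168.128.101 -c "set storageArray cacheBlockSize=16;"   # or 32, depending on workload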

Use something like SQLIO, where you can run more threads. My numbers only started to look good once I pushed the array harder.
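On Linux, an fio run with several jobs and a deeper queue serves the same purpose as SQLIO with more threads (device path, job count, and runtime below are placeholders):

# Sequential read with multiple workers and a deeper queue; arrays like this
# often only show their real throughput with plenty of outstanding IO.
fio --name=seq-read --filename=/dev/mapper/mpath1 --rw=read --bs=1m \
    --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 \
    --group_reporting --time_based --runtime=60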

And verify the MD is configured for 10G. The port option is either 10G or 1G; it does not auto-negotiate.

Solution 3:

Maybe you want to increase the cache block size on the array from 4k to 16k or 32k (especially if you are running a sequential workload).