Fiber multipath fails: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

I am trying to copy(~7TB of data using rsync) between two server in same data center in the backend its using EMC VMAX3

After copying ~30-40GB of data multipath start failing

Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK 

[root@test log]# multipath -ll |grep -i fail
 |- 1:0:0:15 sdq  65:0   failed ready running
  - 3:0:0:15 sdai 66:32  failed ready running

We are using default multipath.conf

HBA driver version  8.07.00.26.06.8-k

HBA model QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter

OS: CentOS 64-bit/2.6.32-642.6.2.el6.x86_64
Hardware:Intel/HP ProLiant DL380 Gen9

Already verified this solution and checked with EMC everything looks good https://access.redhat.com/solutions/438403

Some more info

- There is no drop/error packet on the network side.

Filesystem is mounted with noatime,nodiratime
Filesystem ext4(Already tried xfs but same error)
LVM is in striped mode(Started with linear option and then converted to striped)
Already disabled THP
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Whenever multipath start failing process goes to D state
System firmware upgraded
Tried with latest version of qlogic driver
Tried with different scheduler(noop,deadline,cfq)
Tried with different tuned profile(enterprise-storage)

Vmcore collected during the time of issue

I am able to collect vmcore during the time of issue

  KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
DUMPFILE: vmcore  [PARTIAL DUMP]
    CPUS: 36
    DATE: Fri Dec 16 00:11:26 2016
  UPTIME: 01:48:57
  LOAD AVERAGE: 0.41, 0.49, 0.60
   TASKS: 1238
NODENAME: test.example.com
 RELEASE: 2.6.32-642.6.2.el6.x86_64
 VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
 MACHINE: x86_64  (2297 Mhz)
  MEMORY: 511.9 GB
   PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
     PID: 15840
 COMMAND: "kjournald"
    TASK: ffff884023446ab0  [THREAD_INFO: ffff88103def4000]
     CPU: 2
   STATE: TASK_RUNNING (PANIC)

After Enbaling Debug mode on the qlogic sid

qla2xxx [0000:0b:00.0]-3822:5: FCP command status: 0x2-0x0 (0x70000) nexus=5:1:0 portid=1f0160 oxid=0x800 cdb=2a200996238000038000 len=0x70000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d42580 cp=ffff88276d249480.
qla2xxx [0000:84:00.0]-3822:7: FCP command status: 0x2-0x0 (0x70000) nexus=7:0:3 portid=450000 oxid=0x4de cdb=2a20098a5b0000010000 len=0x20000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d421c0 cp=ffff8880237e0880.

Solution 1:

This is an HP ProLiant DL380 Gen9 server. Pretty standard enterprise-class server.

Can you give me information on the server's firmware revision?
Is EMC PowerPath actually installed? If so, check here.

Do you have the HP Management Agents installed? If so, do you have the ability to post the output of hplog -v.

Have you seen anything in the ILO4 log? Is the ILO accessible?

Can you describe all of the PCIe cards installed in the system's slots?

For RHEL6-specific tuning, I highly recommend XFS, running tuned-adm profile enterprise-storage and ensuring your filesystems are mounted nobarrier (the tuned profile should handle that).

For the volumes, please ensure that you're using the dm (multipath) devices instead of /dev/sdX. See: https://access.redhat.com/solutions/1212233

Looking at what you've presented so far and the check listed at Redhat's support site (and the description here), I can't rule out the potential for HBA failure or PCIe riser problems. Also, there's a slight possibility that there's an issue on the VMAX side.

Can you swap PCIe slots and try again? Can you swap cards and try again?

Is the firmware on the HBA current? Here's the most recent package from December 2016.

Firmware 6.07.02 BIOS 3.21

A DID_ERROR typically indicates the driver software detected some type of hardware error via an anomaly within the returned data from the HBA.

A hardware or san-based issue is present within the storage subsystem such that received fibre channel response frames contain invalid or conflicting information that the driver is not able to use or reconcile.

Please review the systems hardware, switch error counters, etc. to see if there is any indication of where the issue might lie. The most likely candidate is the HBA itself.

Solution 2:

This looks to me like one of your SFPs has soft-failed... Look in your storage switch for errors on the port while you are doing a large copy.

I had a similar issue recently where everything looked great. Server vendor signed off on their stuff, storage vendor said their stuff looks good, swore the SFPs are all fine... SFP still showed as up and functional, until large amounts of data were sent across the MPIO interface and lots of errors on the storage switch port would start getting logged.

I had to replace all fiber cables with new ones, then switch SFPs with spares I had on hand to prove to the vendor that the SFP was bad, even though it appeared fine otherwise.

Solution 3:

I know that if you will change in /etc/sysconfig/mkinitrd/multipath MULTIPATH=NO on MULTIPATH=YES and at file /etc/multipath.conf - comment next:

blacklist {devnode "*"}

Turn on auto-load:

chkconfig multipathd on

Turn on module download:

modprobe dm-multipath

modprobe dm-round-robin

On autocfg:

multipath -v2

Reload server, cheeking all up:

lsmod | grep dm_

watching multi-path :

multipath -ll

Solution 4:

Finally issue is resolved

Error: TECH PREVIEW: DIF/DIX support may not be fully supported.

I constantly saw this message in dmesg during the time of issue and Keep on ignoring this message

On further debugging, I found out Kernel is in tainted state

 cat /proc/sys/kernel/tainted **So it's a combination of  TAINT_TECH_PREVIEW and TAINT_WARN**
 536871424

 lsmod |egrep -i "dif|dix" 
 crc_t10dif              1209  1 sd_mod

 modinfo crc_t10dif
 filename:       /lib/modules/2.6.32-642.6.2.el6.x86_64/kernel/lib/crc-t10dif.ko
 softdep:        pre: crct10dif
 license:        GPL
 description:    T10 DIF CRC calculation
 srcversion:     52BC47DEA6DD58B87A2D9C1
 depends:        
 vermagic:       2.6.32-642.6.2.el6.x86_64 SMP mod_unload modversions

As per RedHat

DIF is a new feature recently added to the SCSI Standard. It increases the size of the commonly-used 512-byte disk block from 512 to 520 bytes. The extra bytes comprise the Data Integrity Field (DIF). The basic idea is that the HBA will calculate a checksum value for the data block on writes, and store it in the DIF. The storage device will confirm the checksum on receive, and store the data plus checksum. On a read, the checksum will be checked by the storage device and by the receiving HBA.

The Data Integrity Extension (DIX) allows this check to move up the stack: the application calculates the checksum and passes it to the HBA, to be appended to the 512 byte data block. This provides a full end-to-end data integrity check

Some vendors have adopted the name Protection Information (PI) to refer to the DIF/DIX functionality. There is one difficulty associated with DIF/DIX on Linux - the memory management system may change the data buffer while it is queued for a write. If it does this, then the memory management system must remember to keep that page marked dirty after the I/O succeeds. If the memory management system changes the data in the buffer after the checksum is calculated, but before the write is done, then the checksum test will fail, the write will fail, and the filesystem will go read-only, or some similar failure will occur.

Because of this, users of Red Hat Enterprise Linux 6 should note the following: The DIF/DIX hardware checksum feature must only be used with applications that exclusively issue O_DIRECT I/O. These applications may use the raw block device, or the XFS file system in O_DIRECT mode. (XFS is the only filesystem that does not fall back to buffered IO when doing certain allocation operations). Only applications designed for use with O_DIRECT I/O and DIF/DIX hardware should enable this feature.

DIF/DIX is a Tech Preview in RHEL 6.0. There are currently just two driver/hba combinations that have this support: Emulex lpfc and LSI mpt2sas. There are just a few storage vendors who support it: the Netapp Engenio FC RAID array, and certain Hitachi SAS disks. We expect additional storage vendors to support this feature in the future.

As we are using EMC we decided to disable this feature and that did the trick

   cat /etc/modprobe.d/qla2xxx.conf
   options qla2xxx ql2xenabledif=0 ql2xenablehba_err_chk=0
   Back up existing initramfs:  cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
   Rebuild initramfs:  dracut -f -v
   Verify that /etc/modprobe.d/qla2xxx.conf is the same as the one in initramfs (time and size should be the same):     lsinitrd | grep qla2xxx.conf; ls -al /etc/modprobe.d/qla2xxx.conf