LSI RAID controller errors on DB import - How to troubleshoot?
We're importing a database dump on an Oracle system (RHEL 5.9, kernel 2.6.18-348.6.1.el5). The import does not complete and eventually errors out with:
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 248 logical extent 0 of file 280 in group 1 on disk 1 allocation unit 986
Errors in file /u01/app/oracle/diag/rdbms/dbprod/DBPROD/trace/DBPROD_lgwr_24520.trc:
ORA-00345: redo log write error block 509314 count 2023
ORA-00312: online log 1 thread 1: '+DATA/dbprod/redo01.log'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
There are corresponding errors in the kernel ring buffer and /var/log/messages:
Jun 12 18:54:42 db1-test kernel: megasas: build_ld_io error, sge_count = 51
Jun 12 18:54:42 db1-test kernel: megasas: Err returned from build_and_issue_cmd
Jun 12 18:54:42 db1-test kernel: megasas: build_ld_io error, sge_count = 51
Jun 12 18:54:42 db1-test kernel: megasas: Err returned from build_and_issue_cmd
Jun 12 18:54:42 db1-test kernel: megasas: build_ld_io error, sge_count = 51
Jun 12 18:54:42 db1-test kernel: megasas: Err returned from build_and_issue_cmd
Jun 12 18:54:42 db1-test kernel: sd 0:2:1:0: timing out command, waited 360s
Jun 12 18:54:42 db1-test kernel: sd 0:2:1:0: Unhandled error code
Jun 12 18:54:42 db1-test kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
Jun 12 18:54:42 db1-test kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
The drive array containing the import is a 10-disk SAS array in RAID 1+0 using 300GB 10k disks. The RAID controller is an LSI MegaRAID SAS 9260-8i. No disk or adapter errors are reported via MegaCLI.
- Is this a hardware issue?
- Is there any way to troubleshoot? The RAID controller status is fine. The disks and logical drives report healthy.
- Is this a Linux OS or tuning issue? I'll try with different I/O schedulers to be sure. CFQ is default.
Edit:
Other schedulers have been tried with the same result. There is a third-party (Vormetric) filesystem encryption module running in this setup. Removing it allows the import to complete. So now I'm wondering if this is a deficiency in the module or if it is triggering a bad condition in the LSI driver.
During the import, we're hitting 14,000 write IOPS.
In recent attempts, the system stalls entirely with the following on the console.
Last top output before freeze.
Ultimately Sergey is right - this is a driver problem. But let's check things out first:
First off, you'll want to use the deadline I/O scheduler rather than CFQ. deadline, as its name implies, puts an expiry time on each request and so aims to have every I/O complete in a timely manner.
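For reference, here is a minimal sketch of checking and switching the scheduler at runtime through sysfs. The helper names active_scheduler and set_scheduler are made up for illustration, and sdb below is only a placeholder for whichever device backs the array.

```shell
# Extract the active scheduler (the bracketed entry) from a line such as
# "noop anticipatory deadline [cfq]", which is the format used by
# /sys/block/<dev>/queue/scheduler.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Switch one block device to a scheduler (needs root on a real system).
# $1 = device name (placeholder, e.g. sdb)
# $2 = scheduler name (e.g. deadline)
# $3 = sysfs root, defaults to /sys (only overridden for dry runs)
set_scheduler() {
    echo "$2" > "${3:-/sys}/block/$1/queue/scheduler"
}
```

Typical use would be `active_scheduler < /sys/block/sdb/queue/scheduler` to see what is in effect, then `set_scheduler sdb deadline` as root. A change made this way does not survive a reboot; booting with elevator=deadline on the kernel command line makes it the default.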
Grab the events from the megaraid card:
megacli -adpeventlog -getevents -f /tmp/megaraid-$(date +%F_%T) -aALL
Check the SMART data on the disks (you will need to build a new smartmontools for this to work):
# megacli -pdlist -a0 |grep 'Device Id'
Device Id: 10
Device Id: 9
# smartctl -a /dev/sda -d megaraid,9
«…»
# smartctl -a /dev/sda -d megaraid,10
«…»
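With ten disks behind the controller, typing each ID by hand gets tedious. A rough sketch of generating the smartctl commands from the MegaCLI listing follows; the function names pd_device_ids and smart_commands are hypothetical, and it assumes the "Device Id: N" line format shown above.

```shell
# Read `megacli -pdlist -a0` output on stdin and print each device ID,
# one per line, taken from the "Device Id: N" lines.
pd_device_ids() {
    awk -F': *' '/^Device Id/ { print $2 }'
}

# Read the same listing on stdin and print one smartctl command per disk.
# $1 = a block device exposed by the controller (defaults to /dev/sda)
smart_commands() {
    pd_device_ids | while read -r id; do
        echo "smartctl -a ${1:-/dev/sda} -d megaraid,$id"
    done
}
```

Running `megacli -pdlist -a0 | smart_commands` only prints the commands so you can review them first; append `| sh` once they look right.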
If everything looks OK, go ahead and try out the latest driver from LSI.
There is a third-party (Vormetric) filesystem encryption module running in this setup. Removing it allows the import to complete. So now I'm wondering if this is a deficiency in the module or if it is triggering a bad condition in the LSI driver.
The Vormetric module is likely doing something incompatible, yes. I would start by talking with them about how their module is screwing up your system under high load.