Data Transfer Pauses on LSI 9271 RAID Controller

I have a server equipped with an LSI 9271-8i RAID controller, with 4 x 4TB disks organized as RAID-5 and 1 x 8TB disk as JBOD (which the controller calls RAID-0).

When I copy large amounts of data (~1 TB), I observe the following: for the first few gigabytes the transfer speed is fine and limited by disk or network speed, usually ~100MB/s. After a while, however, the transfer pauses completely for approx. 20-30 seconds and then continues with roughly the next 1 GB. I am copying many files of between 10MB and 500MB each; during a pause robocopy stays on one file and moves on to the next once the pause is over. As a result, the overall transfer rate drops to ~20MB/s.
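As a rough sanity check on these numbers, here is a small Python sketch of the burst-and-pause pattern. The burst size (~1 GB), burst rate (~100MB/s) and pause length (20-30 seconds) come from the description above; the function name and the convention 1 GB = 1000 MB are assumptions made for illustration.

```python
def effective_rate_mb_s(burst_mb, burst_rate_mb_s, pause_s):
    """Average throughput when every burst is followed by a complete stall."""
    burst_time_s = burst_mb / burst_rate_mb_s  # time spent actually copying
    return burst_mb / (burst_time_s + pause_s)


# ~1 GB bursts at ~100 MB/s, each followed by a 20 s or 30 s pause
for pause_s in (20, 30):
    rate = effective_rate_mb_s(1000, 100, pause_s)
    print(f"pause {pause_s} s -> {rate:.1f} MB/s")
```

Under these assumptions the average works out to roughly 25-33 MB/s, which is in the same range as the observed ~20MB/s if the bursts are a little shorter or the pauses a little longer than estimated.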

During the pause, browsing files on the drives is not possible, and in one case I received a controller reset error message ("Controller encountered a fatal error and was reset"). Querying the controller with the CLI tool is also not possible during the pause (the result is only displayed once the pause is over).

I have observed this behaviour when copying from:

  • gigabit network to RAID-5 volume
  • gigabit network to JBOD volume
  • JBOD to RAID-5
  • RAID-5 to JBOD

Nothing else looks suspicious to me: temperatures (disks, BBU) are within the valid range, and the controller temperature seems a bit high but is also within spec. No checks are running on the RAID, and no rebuild is in progress.

Any guesses?

Before I replace the controller, I want to try improving the thermal situation. Does this behaviour sound like a possible thermal issue?

I find it strange that the first 20-30 GB work fine and the pauses do not occur before that. If I leave the server alone for a while and retry, a few GB are again copied fine. The only naive explanation I have is that the controller gets too hot. Why the controller and not the disks? The RAID-5 disks are 7200rpm and stacked very closely, while the single JBOD disk is 5400rpm and has a lot of air around it. It would be strange if both showed the same overheating symptoms.
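One way to check whether the stalls really begin only after a certain amount of data has been written is to time each chunk of a long sequential write and record where the slow chunks fall. The following is a minimal Python sketch under stated assumptions: the function name `find_stalls`, the 64 MB chunk size and the 5 second threshold are illustrative choices, and it exercises only local writes through the filesystem, not robocopy or the network path.

```python
import os
import time


def find_stalls(path, total_mb, chunk_mb=64, stall_s=5.0):
    """Write about total_mb of data to path in chunk_mb pieces.

    Returns a list of (mb_written_before_chunk, seconds) for every chunk
    whose write plus flush took at least stall_s seconds.
    """
    # One random megabyte repeated; the content itself does not matter here.
    chunk = os.urandom(1024 * 1024) * chunk_mb
    stalls = []
    written_mb = 0
    with open(path, "wb") as f:
        while written_mb < total_mb:
            start = time.monotonic()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data out of the OS cache
            elapsed = time.monotonic() - start
            if elapsed >= stall_s:
                stalls.append((written_mb, elapsed))
            written_mb += chunk_mb
    return stalls
```

Running it against a scratch file on each volume (for example with `total_mb=50000`) and comparing the position of the first recorded stall after a cold start and after a long idle period would show whether the "first 20-30 GB are fine" pattern is reproducible without the network involved.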


Solution 1:

I had a similar issue with a 9260-16i. It was not temperature, as I have dual 92mm fans blowing right on the LSI. I have a second server set up the same way, and it was fine. What I discovered was that the server with the issues was set to a 64K stripe size, while the working server had a 256K stripe size. I backed up the problem server, rebuilt the drive group with a 256K stripe, and then formatted the OS drive with 64K clusters (since I have multi-GB files). I have been moving data back since with no hesitations, basically running at full gigabit NIC speed on writes: over 350GB per hour, non-stop, with no pauses.
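To put the 350GB per hour figure next to the ~100MB/s mentioned in the question, here is a quick conversion in Python. The helper name and the use of decimal units (1 GB = 1000 MB) are assumptions, and the gigabit figure is the raw line rate before any protocol overhead.

```python
GIGABIT_MB_S = 1000 / 8  # 125 MB/s raw line rate, before protocol overhead


def gb_per_hour_to_mb_s(gb_per_hour):
    """Convert a sustained GB/hour figure to MB/s (decimal units)."""
    return gb_per_hour * 1000 / 3600


rate = gb_per_hour_to_mb_s(350)
print(f"{rate:.1f} MB/s, {rate / GIGABIT_MB_S:.0%} of raw gigabit")
```

That is about 97 MB/s sustained, which matches the unstalled rate reported in the question.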