RAID Read/Write Speed Gradually Slows
This is actually a home server, but I felt the problem was complicated enough not to belong on Super User, and it could easily apply to a professional situation.
I have a file server running Debian (Lenny 5.0.4) with XFS on an LVM volume on top of a software RAID 5; the OS drive is separate from the RAID. It also runs Apache, Samba, and PostgreSQL. Side note: before the RAID 5 critics crucify me, I'm using RAID 5 because I get more bang for the buck on raw drive space while still having some fault tolerance.
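For completeness, the layering can be confirmed with the standard LVM reporting tools (the volume group name matches the /dev/mapper/oomox-lvm entry in my fstab further down):

pvs   # physical volumes - should show the md array backing the VG
vgs   # the volume group built on the RAID
lvs   # the logical volume that XFS lives on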
When the box is freshly started (after a shutdown or reboot), reading from and writing to its Samba share maxes out the gigabit network connection. Over time this slowly degrades, eventually dropping below 10 MB/s; after another reboot, the speed returns to maxing out the connection.
Why is this happening, and is there a way to 'clear' out whatever's causing it without taking the server down?
Thanks in advance!
EDIT: To answer @LapTop006's question, the output of cat /proc/mdstat is the same right after a reboot as when it's slow:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[0] sda[5] sdb[4] sdf[3] sdg1[2] sde1[1]
4883799680 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
unused devices: <none>
According to xfs_db's frag command:
actual 58969, ideal 23904, fragmentation factor 59.46%
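For anyone reproducing this, the report came from running xfs_db read-only against the filesystem device, roughly:

xfs_db -r -c frag /dev/mapper/oomox-lvm

I gather xfs_fsr /raid can defragment the filesystem online, though I haven't tested whether that helps here.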
EDIT 2: I'm using the standard Debian kernel. cat /etc/fstab outputs this for my OS drive and RAID:
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/sda1 / ext3 errors=remount-ro 0 1
/dev/mapper/oomox-lvm /raid xfs defaults 0 2
To be honest, I'm not exactly the biggest Linux guru, and I didn't create the RAID or LVM from the command line (e.g. with mdadm or mkfs.xfs); I used the guided RAID setup in the Debian installer, and have only used the command line when I needed to add drives to the array.
When it starts slowing down again I'll post the iostat output.
EDIT 3:
Whether slow or fast, iostat shows reads and writes spread evenly across all the drives. I also tried setting
socket options = TCP_NODELAY
in the Samba config as per @Avery Payne's advice, but it was still slow. At least the problem has been narrowed down, though, since restarting Samba alone fixed the issue. This is pretty odd, as I never had this problem until fairly recently.
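For clarity, "restarting Samba" here just means the stock Debian init script, which restarts smbd and nmbd:

/etc/init.d/samba restart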
FINAL EDIT: I tried @David Spillett's suggestion of running
time dd if=/dev/sda of=/dev/null
for each drive when it's slow, to see if there's any difference from when it's fast, and there isn't. So the problem is clearly with Samba.
I'm awarding the answer to @Avery Payne. Although @David Spillett's answer has a great slew of troubleshooting techniques, @Avery Payne pointed me in the right direction for solving this issue. I'll post again if I ever find the final solution.
Thanks everyone!
When the box is freshly started (after a shutdown or reboot), reading from and writing to its Samba share maxes out the gigabit network connection. Over time this slowly degrades, eventually dropping below 10 MB/s; after another reboot, the speed returns to maxing out the connection.
The problem is most likely not in the OS or hardware but in your Samba config. Do you have your TCP options set correctly in Samba? Some options can cause client access to degrade, either by slowing down TCP flows or by adding overhead.
Your RAID and fstab look fine.
Follow-up to comment(s):
In smb.conf you should have the following line in your global section:
socket options = TCP_NODELAY
More information can be found in the performance-tuning chapter of the Samba HOWTO Collection:
http://samba.org/samba/docs/man/Samba-HOWTO-Collection/speed.html
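As a minimal sketch of where that line goes (the share definition below is hypothetical, purely to show placement):

[global]
    socket options = TCP_NODELAY

[files]
    path = /raid
    read only = no

Running testparm afterwards will confirm the file still parses cleanly, and Samba needs a restart for the change to take effect.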
A few thoughts that might help you rule some things out:
Could you have a memory leak somewhere that is resulting in the machine swapping like mad after a while? Check free -m when the problem is apparent.
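For example, something along these lines (the interval is arbitrary) will show whether swap usage creeps up over time:

watch -n 60 free -m

If the swap "used" figure keeps climbing between samples, some process is leaking memory.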
Also, could you have a problem with the RAID software deciding it needs to perform a resync? Check /proc/mdstat when you are experiencing slowness to look for this (though I wouldn't expect a reboot to resolve it - any such resync should resume after the restart).
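A resync in progress shows up as an extra progress line under the array, roughly like this (illustrative output, not from your machine):

md0 : active raid5 sdd1[0] sda[5] sdb[4] sdf[3] sdg1[2] sde1[1]
      4883799680 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [=>...................]  resync =  5.0% (244189984/4883799680) finish=850.2min speed=90945K/sec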
Have you ruled out local I/O issues? How fast does the array perform for local processes when the problem is apparent? If local processes can't access the array at normal speed, then Samba is not the issue (conversely, if they can while network access is slow, that supports the opposite). If the drives do seem slow locally, you can look for further evidence by verifying that the network is not slow as well, running simple tests with netcat and pv (see http://www.interphero.com/?p=116 or search for "netcat speedtest" for other examples).
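A rough sketch of both tests (file name, hostname, and port are hypothetical; netcat flags per the traditional netcat shipped with Debian):

# local read speed through the filesystem
dd if=/raid/some-large-file of=/dev/null bs=1M

# raw network throughput: listener on the server...
nc -l -p 9999 > /dev/null

# ...and sender on a client, with pv showing the transfer rate
dd if=/dev/zero bs=1M count=1000 | pv | nc fileserver 9999

If the netcat test still runs at near-gigabit speed while Samba is slow, the network layer is in the clear.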
Could it be a firmware issue with one or more of your drives? Check to see if there have been any such updates from the manufacturer. Also, it could just be one drive that is behaving oddly. When the speed problem presents itself, try time dd if=/dev/sda of=/dev/null, repeating for each drive a few times and taking an average. If one drive comes out much slower than the others then perhaps it has a problem and needs replacing (or a firmware update, if there is a known issue).
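A quick sketch of looping over all the array members (device names taken from the mdstat output in the question; block size and count are arbitrary):

for d in sda sdb sdd1 sde1 sdf sdg1; do
    echo "$d:"
    dd if=/dev/$d of=/dev/null bs=1M count=1024
done

dd reports the throughput of each run when it finishes.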
Have you ruled out a network card problem (hardware or driver)? You could try swapping it out for another Gbit card (with a different chipset) to see if that makes a difference.
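Before swapping hardware, ethtool can at least confirm the link hasn't silently renegotiated down to 100 Mbit or half duplex (interface name assumed to be eth0):

ethtool eth0

ifconfig eth0 is also worth a glance for climbing error or dropped-packet counters.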
If the problem does appear to be Samba and not the RAID array, network card, or anything else, is a full reboot required to fix it, or is simply restarting Samba enough? (Or restarting both Samba and winbindd, if the server participates in a domain that way?)
A side note on your RAID5 comment:
The main problem with RAID5 is write performance, especially for significant numbers of small writes. This can kill performance for heavy database work, but for a basic file server role (which your situation sounds like) that spends the majority of its time performing bulk reads, it has little or no noticeable effect. If you ever do find write performance to be a problem, try the shiny new RAID10 driver in 3-drive mode: similar read performance to 3-drive RAID5 (or 2-drive RAID0), but write performance more like that of a 2-drive RAID1, while maintaining the same redundancy (any one drive can die at a time). The RAID10 driver may still be classified as "experimental" in all but the newest kernels, though.
The other issue with RAID5 is how long it takes to rebuild the array if one drive is replaced. I doubt 3-drive RAID10 is any better in that regard.
As a point of reference: Linux's RAID10 over three drives is similar to what RAID controllers in some IBM servers call RAID1E.
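For the curious, creating such an array looks roughly like this (device names are hypothetical; n2 is the default "near" layout, which keeps two copies of every block):

mdadm --create /dev/md1 --level=10 --raid-devices=3 --layout=n2 /dev/sdb1 /dev/sdc1 /dev/sdd1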