Tools for Troubleshooting SAN Performance Bottlenecks

What are the best tools for troubleshooting SAN performance bottlenecks?


Solution 1:

A lot depends on the hardware you're playing with. Bottlenecks can come from a variety of sources:

  • Host-based bottlenecks: Sometimes a server just can't shovel I/O blocks fast enough. To diagnose that, use whatever performance metrics your operating system (or application) provides.
  • Fabric-based bottlenecks: Brocade switches expose performance metrics both as handy charts and as actual numbers. Following these can show where you're running into issues, such as saturated ISLs. SMI-S should help you here, if you have the ability to use it.
  • Array bottlenecks: These come in a variety of flavors, such as saturated controllers and overworked disk groups. As with the switches, newer arrays should support SMI-S for tracking things down.

Solution 2:

Sorry this is so Windows-centric, but the PAL (Performance Analysis of Logs) tool - http://www.codeplex.com/PAL - is useful for identifying problems with SAN setups, though you may have to collect .blg performance counter logs over a fairly long time period. Hope this helps.

Solution 3:

Your choice of tool depends on your hardware platform. In any case, bottlenecks will show up at one of three points in your architecture:

  1. Host
  2. Switch Fabric
  3. Storage Array

You will need a tool (or tools) capable of monitoring each of these components. You might want to adopt a best-of-breed strategy and use three different tools, or you might prefer a Lord of the Rings approach and select a single tool "to rule them all." Whatever works for you. Start by contacting your vendor(s) to see which tools are available for your devices. Once you are gathering metrics at each of these points in your I/O chain, you can identify where to focus your effort.

Solution 4:

Monitor disk queue length on servers:

  • perfmon/scom on Windows
  • sar on Unix
  • Virtual Center / esxtop on VMware
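
If you just want a quick look at queue length on a Linux host without setting up sar, here is a minimal sketch. It assumes the kernel exposes /proc/diskstats in the usual layout, where the 14th whitespace-separated field is the weighted time spent doing I/O in milliseconds; dividing the change in that counter by the elapsed time gives an average queue length over the interval. The function names are made up for illustration.

```python
import time


def parse_diskstats(text):
    """Return {device: weighted ms spent doing I/O} from /proc/diskstats text."""
    weighted = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        # fields[2] is the device name, fields[13] the weighted I/O time (ms)
        weighted[fields[2]] = int(fields[13])
    return weighted


def avg_queue_length(before, after, interval_ms):
    """Average number of queued/in-flight requests per device over the interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return {
        dev: (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return per-device queue length."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    start = time.monotonic()
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return avg_queue_length(before, after, elapsed_ms)


if __name__ == "__main__":
    for device, qlen in sorted(sample().items()):
        print(f"{device}: {qlen:.2f}")
```

A queue length that stays high while the array and fabric look idle points at the host; if it climbs together with array or switch latency, look further down the chain.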

Solution 5:

If you want an all-in-one enterprisey solution, take a look at TPC for Disk/Fabric from IBM. You can monitor any components of your SAN (that support SMI-S as well as other standards) from one interface and be able to view or query historical data.

If this isn't an option, you can query the various SAN devices for their statistics and set up some sort of RRD monitoring to graph the performance and identify the bottlenecks.
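
As a rough illustration of the do-it-yourself route, the sketch below shows the two ideas an RRD-style setup relies on: turning ever-increasing device counters (bytes transferred on a port, for example) into per-second rates, and keeping only a fixed number of recent samples. It is plain Python rather than any particular RRD tool, how you poll the counters is left out, and all names here are hypothetical.

```python
from collections import deque


def counter_rate(prev, curr, seconds, counter_bits=64):
    """Per-second rate between two readings of a monotonically increasing counter.

    A negative delta is treated as a single wrap of a counter_bits-wide counter.
    (A device reboot that resets the counter would look the same; this sketch
    does not try to tell the two apart.)
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev
    if delta < 0:
        delta += 2 ** counter_bits
    return delta / seconds


class RoundRobinSeries:
    """Fixed-size store: once full, each new sample overwrites the oldest."""

    def __init__(self, slots):
        self._samples = deque(maxlen=slots)

    def add(self, timestamp, value):
        self._samples.append((timestamp, value))

    def __len__(self):
        return len(self._samples)

    def average(self):
        if not self._samples:
            return None
        return sum(v for _, v in self._samples) / len(self._samples)

    def peak(self):
        """Return the (timestamp, value) pair with the highest value."""
        if not self._samples:
            return None
        return max(self._samples, key=lambda s: s[1])


if __name__ == "__main__":
    series = RoundRobinSeries(slots=3)
    readings = [(0, 0), (10, 3000), (20, 9000), (30, 10000), (40, 20000)]
    for (t0, c0), (t1, c1) in zip(readings, readings[1:]):
        series.add(t1, counter_rate(c0, c1, t1 - t0))
    print(series.average(), series.peak())
```

Graphing one such series per host port, ISL, and array controller side by side usually makes it obvious which one flattens out first.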

Most disk subsystems and switches have some sort of built-in performance monitoring in the form of live graphs - try looking at those as well.

(disclaimer: my company sells TPC)