How does a SAN architecture work and, more importantly, scale?

I'm trying to get my head around SAN infrastructure, and I was hoping some of you with more experience could help me understand how scaling works with a SAN.

Imagine that you have some servers, each with an HBA. They connect either directly or via a switch to a SAN controller. The SAN controller then presents one or more LUNs, which most likely map to a RAID array on a storage device.
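To make sure I'm picturing the chain correctly, here's a rough sketch (plain Python, purely illustrative; all of the names are mine) of the mapping as I understand it:

```python
# Illustrative only: the chain as I understand it, not any vendor's object model.
servers = {
    "web01": {"hba_ports": ["wwpn-a1", "wwpn-a2"]},   # server-side HBA (initiator) ports
}

controller = {
    "array_ports": ["wwpn-c1", "wwpn-c2"],            # controller-side (target) ports
    "luns": {
        0: {"raid_group": "RAID6-group-1", "size_gb": 2048},
        1: {"raid_group": "RAID10-group-2", "size_gb": 512},
    },
    # LUN masking: which host is allowed to see which LUN
    "masking": {"web01": [0]},
}

# The fabric's only job is to let an initiator port reach the target ports it is
# zoned to -- everything else (LUNs, RAID) lives behind the controller.
```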

So if I understand correctly, the "controller" represents a performance bottleneck. If you need lots of performance then you add more controllers with connections to their own storage, which then get mapped to the servers that need them.

I imagine you can get some very high-performance controllers with huge storage capacity, as well as cheaper controllers with a lower performance ceiling? And if you have a switch, you can then add several of those lower-performance controllers to your network as you need them?

Please tear apart my understanding if I have it wrong, but I'm trying to work out how you connect HBAs from a server to storage without the fabric simply representing "magic".


Solution 1:

The controller as a performance bottleneck is quite true, and in some architectures it can represent a single point of failure as well. This has been known for quite some time. For a while there were vendor-specific techniques for working around it, but the industry as a whole has since converged on something called MPIO, or Multi-Path I/O.

With MPIO you can present the same LUN over multiple paths through the storage fabric. If the server's HBA and the storage array's HBA each have two connections to the fabric, the server has four separate paths to the LUN. It can go beyond this if the storage supports it; it is quite common for the larger disk-array systems to have dual-controller setups, with each controller presenting an active connection to the LUN. Add in a server with two separate HBA cards, plus two physically separate paths connecting the HBA/controller pairs, and you can have a storage path with no single point of failure.
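To make the path count concrete, here's a toy sketch (Python, with made-up names; not any real MPIO stack's API): the paths the host can use are simply the combinations of host-side and array-side ports that can reach each other through the fabric.

```python
from itertools import product

host_ports = ["host-hba0", "host-hba1"]      # two connections from the server
array_ports = ["ctrl-port0", "ctrl-port1"]   # two connections from the array

# Every (initiator, target) pair the fabric zoning allows is a usable path.
paths = [(h, a) for h, a in product(host_ports, array_ports)]
print(len(paths), "paths to the LUN:", paths)   # 4 paths

# A trivial round-robin policy: spread I/O across all live paths.
def pick_path(i, live_paths=paths):
    return live_paths[i % len(live_paths)]
```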

The fancier controllers will indeed be a full active/active pair, with both controllers actually talking to the storage (generally there is some form of shared cache between the controllers to help with coordination). Middle-tier devices may pretend to be active/active, but only a single controller is actually doing work at any given time; the standby controller can pick up immediately should the active one go silent, and no I/O operations are dropped. Lower-tier devices are simple active/standby, where all I/O goes along one path and only moves to another path when the active path dies.
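A very rough way to picture the difference between the tiers (this is a sketch of the routing policies only, with invented names; it is nothing like real array firmware, and it doesn't model how fast or cleanly takeover happens, which is where the middle and lower tiers really differ):

```python
# Toy model of how I/O gets routed to one of two controllers for a LUN.
class DualControllerLun:
    def __init__(self, mode):
        self.mode = mode                     # "active/active" or "active/standby"
        self.controllers = ["ctrl-A", "ctrl-B"]
        self.failed = set()
        self.rr = 0

    def route_io(self):
        live = [c for c in self.controllers if c not in self.failed]
        if not live:
            raise IOError("no surviving controller")
        if self.mode == "active/active":
            self.rr += 1
            return live[self.rr % len(live)]   # both controllers do real work
        # active/standby: everything goes to the first live controller; the
        # other one only matters once the active controller fails.
        return live[0]

    def fail(self, ctrl):
        self.failed.add(ctrl)                  # surviving controller takes over
```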

Having multiple active controllers can indeed provide better performance than a single active controller. And yes, add enough systems hitting storage, and enough fast storage behind the controllers, and you can saturate the controllers to the point that every attached server notices. A good way to simulate this is to force a parity RAID volume to rebuild.

Not all systems are able to leverage MPIO for multiple active paths; that's still somewhat new. Also, one of the problems all of the controllers have to solve is ensuring that I/O operations are committed in order, regardless of which path an operation came in on and which controller received it. That problem gets harder with every controller you add. Storage I/O is a fundamentally serialized operation and doesn't cope well with massive parallelization.
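A toy illustration of the ordering problem (invented and hugely simplified, not a real protocol): if writes to the same block arrive via different controllers, the array has to apply them in the order the host issued them, not the order they happened to arrive.

```python
import heapq

# Each write carries a host-assigned sequence number; the array must not apply
# sequence N+1 before N, no matter which controller the write landed on.
pending = []          # min-heap of (sequence, controller, data)
next_to_apply = 0
applied = []

def receive(seq, controller, data):
    """A write lands on some controller, possibly out of order."""
    global next_to_apply
    heapq.heappush(pending, (seq, controller, data))
    # Drain everything that is now contiguous with what has been applied.
    while pending and pending[0][0] == next_to_apply:
        s, ctrl, d = heapq.heappop(pending)
        applied.append((s, ctrl, d))      # commit to disk in host order
        next_to_apply += 1

# Writes 0 and 2 arrive on controller A; write 1 arrives later on controller B.
receive(0, "A", "aaa")
receive(2, "A", "ccc")   # must wait: 1 hasn't been seen yet
receive(1, "B", "bbb")   # now 1 and 2 can both be applied
print(applied)           # [(0, 'A', 'aaa'), (1, 'B', 'bbb'), (2, 'A', 'ccc')]
```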

You can get some gains by adding controllers, but the gains fade rapidly in light of the added complexity required to make it work at all.

Solution 2:

The problem with trying to gang together low-performance devices is the way software accesses storage. A typical program issues a read request, and the semantics of the read require the operating system to hand the result back to the process before it can carry on. The operating system, in general, has no way to know what that process or thread will want next. It can try to guess with readahead, but it's not always right.
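The dependency is easiest to see with pointer chasing: each read tells you where the next read has to go, so you can't even issue request N+1 until request N has come back. A rough sketch (Python, with an invented on-disk layout):

```python
import os

def walk_chain(fd, first_offset, block_size=4096):
    """Follow a chain of blocks where each block's first 8 bytes hold the
    offset of the next block (0 means end of chain). Every read depends on the
    previous one, so total time is roughly (number of blocks) x (latency of one
    read), no matter how many idle controllers sit behind the LUN."""
    offset, blocks = first_offset, []
    while offset != 0:
        block = os.pread(fd, block_size, offset)        # blocks until data arrives
        blocks.append(block)
        offset = int.from_bytes(block[:8], "little")    # where to read next
    return blocks
```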

So if you try to use lots of mediocre controllers, you wind up with high latency, because mediocre controllers take longer to get requests onto the wire and to deliver the results back off it. The fact that you have a bunch of other controllers sitting idle doesn't buy you any extra speed.

There is some application dependence here. Some workloads issues lots of requests from many different places or use asynchronous file reading APIs that allow the same thread to issue multiple requests before waiting for any of them to complete. Some applications benefit greatly from readahead, removing a lot of the latency. But if you want a general-purpose solution that performs well, you need controllers that provide low latency so that you're not waiting around for the controller before you can even figure out what data you need next.