SQL Server disk design on an iSCSI SAN
It's standard practice to separate log and data files onto different disks, away from the OS (and to do the same for tempdb, backups and the swap file). Does this logic still make sense when your drives are all SAN-based and your LUNs are not carved off specific disks or RAID sets - they are just part of the x number of drives on the SAN, and the LUN is just a space allocation?
Solution 1:
Log and data drives have different access patterns that conflict with each other (at least in theory) when they share a drive.
Log Writes
Log access consists of a very large number of small sequential writes. Somewhat simplistically, DB logs are ring buffers containing a list of instructions to write data items out to particular locations on the disk. These writes must be guaranteed to complete, so they have to go out to physical disk.
Ideally, logs should be on a quiet (i.e. not shared with anything else) RAID-1 or RAID-10 volume. Logically, you can view the process as the main DBMS writing out log entries and one or more log reader threads that consume the logs and write the changes out to the data disks (in practice, the process is optimised so that the data writes are written out immediately where possible). If there is other traffic on the log disks, the heads are moved around by these other accesses and the sequential log writes become random log writes. These are much slower, so busy log disks can create a hotspot which acts as a bottleneck on the whole system.
Data Writes
(updated) Log writes must be committed to the disk (referred to as stable media) for a transaction to be valid and eligible to commit. One can logically view this as log entries being written and then used as instructions to write data pages out to the disk by an asynchronous process. In practice the disk page writes are actually prepared and buffered at the time the log entry is made, but they do not need to be written immediately for the transaction to be committed. The dirty buffers are written out to stable media (disk) by the Lazy Writer process (thanks to Paul Randal for pointing this out), which this TechNet article discusses in a bit more detail.
Data file access is a heavily random pattern, so sharing the same physical disks with logs can create an artificial bottleneck on system performance. The log entries must be written for the transaction to commit, so random seeks slowing down this process (random I/O is much slower than sequential log I/O) effectively turn the log from a sequential into a random-access device. This creates a serious performance bottleneck on a busy system and should be avoided. The same applies when sharing temporary areas with log volumes.
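One way to see this from the SQL Server side (a rough check, not a definitive benchmark) is to compare per-file write latency in sys.dm_io_virtual_file_stats; persistently high write stalls on the log files relative to the data files are a symptom of exactly this kind of contention:

    -- Rough sketch: cumulative per-file write latency since the last restart.
    -- High average write stalls on LOG files relative to ROWS (data) files
    -- suggest log writes are competing with random data-file I/O.
    SELECT  DB_NAME(vfs.database_id) AS database_name,
            mf.physical_name,
            mf.type_desc,            -- ROWS = data file, LOG = transaction log
            vfs.num_of_writes,
            vfs.io_stall_write_ms,
            CASE WHEN vfs.num_of_writes = 0 THEN 0
                 ELSE vfs.io_stall_write_ms / vfs.num_of_writes
            END AS avg_write_stall_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
              ON mf.database_id = vfs.database_id
             AND mf.file_id = vfs.file_id
    ORDER BY avg_write_stall_ms DESC;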
The role of caching
SAN controllers tend to have large RAM caches, which can absorb the random access traffic to a certain extent. However, for transactional integrity it is desirable to have disk writes from a DBMS guaranteed to complete. When a controller is set to use write-back caching, the dirty blocks are cached and the I/O call is reported as complete to the host.
This can smooth out a lot of contention problems as the cache can absorb a lot of I/O that would otherwise go out to the physical disk. It can also optimise the parity reads and writes for RAID-5, which lessens the effect on performance that RAID-5 volumes have.
These are the characteristics that drive the 'Let the SAN deal with it' school of thought, although this view has some limitations:
Write-back caching still has failure modes that can lose data, and the controller has fibbed to the DBMS, saying blocks have been written out to disk when in fact they haven't. For this reason, you may not want to use write-back caching for a transactional application, particularly anything holding mission-critical or financial data where data integrity problems could have serious consequences for the business.
SQL Server (in particular) uses I/O in a mode where a flag (called FUA, or Forced Unit Access) forces physical writes to the disk before the call returns. Microsoft has a certification program and many SAN vendors produce hardware that honours these semantics (requirements summarised here). In this case no amount of cache will optimise disk writes, which means that log traffic will thrash if it is sitting on a busy shared volume.
If the application generates a lot of disk traffic, its working set may overrun the cache, which will also cause write contention issues.
If the SAN is shared with other applications (particularly on the same disk volume), traffic from other applications can generate log bottlenecks.
Some applications (e.g. data warehouses) generate large transient load spikes that make them quite anti-social on SANs.
Even on a large SAN, separate log volumes are still recommended practice. You may get away with not worrying about layout on a lightly used application. On really large applications, you may even get a benefit from multiple SAN controllers. Oracle publish a series of data warehouse layout case studies where some of the larger configurations involve multiple controllers.
Put responsibility for performance where it belongs
On something with large volumes or where performance could be an issue, make the SAN team accountable for the performance of the application. If they are going to ignore your recommendations for configuration, then make sure that management are aware of this and that responsibility for system performance lies in the appropriate place. In particular, establish acceptable guidelines for key DB performance statistics such as I/O waits and page latch waits, or agree acceptable application I/O SLAs.
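As a starting point for those guidelines (a sketch only; sensible thresholds are workload-specific), the cumulative wait statistics can be pulled straight from sys.dm_os_wait_stats and tracked over time:

    -- Sketch: I/O-related waits accumulated since the last restart.
    -- WRITELOG = waiting on log flushes, PAGEIOLATCH_* = waiting on data page reads.
    SELECT  wait_type,
            waiting_tasks_count,
            wait_time_ms,
            CASE WHEN waiting_tasks_count = 0 THEN 0
                 ELSE wait_time_ms / waiting_tasks_count
            END AS avg_wait_ms
    FROM    sys.dm_os_wait_stats
    WHERE   wait_type IN ('WRITELOG', 'PAGEIOLATCH_SH', 'PAGEIOLATCH_EX')
    ORDER BY wait_time_ms DESC;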
Note that having responsibility for performance split across multiple teams creates an incentive to finger-point and pass the buck to the other team. This is a known management anti-pattern and a formula for issues that drag out for months or years without ever being resolved. Ideally, there should be a single architect with authority to specify application, database and SAN configuration changes.
Also, benchmark the system under load. If you can arrange it, secondhand servers and direct-attach arrays can be purchased quite cheaply on eBay. If you set up a box like this with one or two disk arrays you can experiment with the physical disk configuration and measure the effect on performance.
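If all you need is a relative comparison between two layouts, even a crude T-SQL loop of autocommitted single-row inserts will do, since each commit forces a synchronous log flush; the table name and row count below are purely illustrative:

    -- Crude log-flush stress sketch: run against the same database on each
    -- physical layout and compare elapsed times. Hypothetical table name.
    CREATE TABLE dbo.LogStressTest (id INT IDENTITY PRIMARY KEY, payload CHAR(100) NOT NULL);
    GO
    DECLARE @i INT = 0, @start DATETIME2 = SYSDATETIME();
    WHILE @i < 100000
    BEGIN
        -- each autocommitted insert is its own transaction and its own log flush
        INSERT INTO dbo.LogStressTest (payload) VALUES (REPLICATE('x', 100));
        SET @i += 1;
    END;
    SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) AS elapsed_ms;
    GO
    DROP TABLE dbo.LogStressTest;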
As an example, I have done a comparison between an application running on a large SAN (an IBM Shark) and a two-socket box with a direct-attach U320 array. In this case, £3,000 worth of hardware purchased off eBay outperformed a £1M high-end SAN by a factor of two - on a host with a roughly equivalent CPU and memory configuration.
From this particular incident, it might be argued that having something like this lying around is a very good way to keep SAN administrators honest.
Solution 2:
I'm assuming that the Equallogic tag and the content of the request mean that you are talking about an Equallogic SAN. What follows is specifically about Equallogic and does not apply to other SAN types.
With Equallogic arrays the specific disks used for volumes can't be specified as precisely as they can with, say, EMC Clariion arrays, so the approach has to be a bit different.
Equallogic architecture is very automated and dynamic. Its basic building block is the array unit, not RAID packs\groups within an array as seen in other SANs. Each array is entirely configured for RAID 5, 6, 10 or 50, although this does not imply that there is only one RAID group per array; you just never get to decide or interact with them at that level. You put arrays into Storage pools and your pools then belong to a Storage Group. The Storage Group has a cluster\virtual IP address that you use as the iSCSI discovery target for all volumes within that group - the EQL Group management software and the host MPIO stack handle the IP-level redirection needed to route to the most appropriate port on the individual arrays when requesting blocks of data, but that is something you have little or no ability to control.
Storage volumes are assigned from the total free space in each pool. All volumes within a pool are spread across all arrays in that pool (well, up to a maximum of 4 separate arrays) in order to distribute network I/O across the total number of network interfaces (2-4 per Eql array depending on model) and I/O across as many controllers as possible. The Equallogic management software monitors volume\array performance over time and dynamically optimizes the distribution of blocks across the member arrays. In general, unless you know what you are doing, you should put all arrays in a single pool and let it do its thing; just remember to configure your high-speed disks (SAS 10k\15k) with RAID 10 and your medium-speed disks with RAID 50 or 5, so that the optimization process actually chooses the genuinely high-performance drives. It may take quite a few days (7+) to reach an optimum state, but in general it should hit a balanced distribution quite fast, as it immediately distributes volumes over as many arrays as it can (again, up to 4) when they are initially created.
To a rough approximation you will have somewhere between 2,500 and 5,000 IOPS per PS array, depending on drive type and RAID type. If you provide enough total IOPS then the automated management process should eventually give you good performance, even if you simply lump all volumes into a single pool.
However, if you want to guarantee that your logs, databases, temp stores, OS drives etc. are actually isolated from each other, you can do a couple of things. Firstly, you can define a RAID preference for a volume, which guarantees that the specific volume is always stored only on arrays of that RAID type (if they are present in the pool the volume belongs to). Secondly, you can define tiered storage pools that only contain arrays delivering the various grades of performance you require for each particular tier, and then distribute your volumes into the appropriate pools. The health warning that comes with this approach is that you will generally need a lot of arrays for it to actually deliver better overall performance; that may matter less to you than guaranteeing the performance of your critical volumes, though, so it is often still the best choice. Dell's reference architecture for Oracle DBs uses one pool with 2 RAID 10 arrays for Data, the Voting disk and the OCR, and a separate pool with a single RAID 5 array for the Flash Recovery Area.
At every point with Equallogic you should ask yourself whether the decisions you are making about enforced partitioning will provide better aggregate performance for your volumes in terms of available network interfaces, disk spindles and controllers. If you can't answer that, opt for the minimum number of pools and let it handle the details, or get an Equallogic specialist in to do a real design. If you only have one array then there is nothing you can do in terms of separating volumes.
Solution 3:
We store our DBs on single SAN boxes but with separate data, log and backup LUNs, each on a different disk group tiered by speed - we have our logs on RAID 10 15krpm LUNs, data on RAID 1 10/15krpm LUNs and backups on RAID 5 7.2krpm LUNs. We also present logs and data through different controllers on the same SAN.
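On the SQL Server side, taking advantage of that layout is just a matter of where the files are created; a minimal sketch (database name, paths and sizes are placeholders for LUNs presented from the disk groups described above) looks like this:

    -- Sketch only: D:, L: and B: stand in for the data, log and backup LUNs.
    CREATE DATABASE SalesDB
    ON PRIMARY (NAME = SalesDB_data, FILENAME = 'D:\SQLData\SalesDB.mdf', SIZE = 50GB)
    LOG ON     (NAME = SalesDB_log,  FILENAME = 'L:\SQLLogs\SalesDB.ldf', SIZE = 10GB);

    BACKUP DATABASE SalesDB TO DISK = 'B:\SQLBackup\SalesDB_full.bak';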
Solution 4:
Great question!
First take a look at Brent Ozar's "Steel Cage BlogMatch" debate on this issue.
At our company, for most servers, we put Data and Logs on the same SAN drive, and leave it up to the SAN team to make sure everything works right.
I'm starting to think this isn't the best strategy, especially for higher-volume servers. The underlying problem is that I really have no way to verify that the SAN team is doing anything more than slapping together enough drives for the space we need. We don't run I/O benchmarks against the SAN drives from our side or anything; we kind of just assume that they are "doing their job" (adjusting for performance as well as space), which is probably a bit naive.
My other thought is that the kind of access data and logs need is different. I'll try to find the article I read recently that talked about how the two drive types really should be optimized in very different ways (I think one needed optimization for sequential writes, the other for random reads, something like that).