RAID 0 for Development
Hello fellow administrators, I am seeking high-level guidance on the following situation:
First, the context of the environment: internal, all virtual (VMware), development use only, performance-optimized across the full stack, downtime is acceptable (1-2 days for a few servers at a time), budget conscious, heavy-write OLTP workload, 10 Gbps links between the SAN (Synology all-flash SAS) and the hosts, a small team with no formal DBAs, all databases in the simple recovery model, SAN volumes are ext4, and the LUNs are thick provisioned.
Since I was just a baby administrator, backups and redundancy have been pounded into my head, and I followed that until now. But the budget is limited and there is a significant amount of data: 90 TB across 20 servers (SQL Server on Linux, Ubuntu, to avoid Windows licensing costs) with about 40 databases. Therefore we use RAID 0. We do this because we have heavy write workloads and the use case / application / business requires high throughput even for development. All drives are on Synology's support list.
Many situations led to the current configuration. The config: single-volume storage pools (4 x 4 TB or 8 TB SSDs in RAID 0), single volume, single LUN, single VMFS datastore. A pool on 4 TB drives holds 2-6 VMs (each 2-6 TB); double that for 8 TB drives. Provisioning is eager-zeroed thick. SAN LUNs use 98% of available capacity; everything else uses 100%. I know this reduces visibility across the board for capacity planning; that is handled elsewhere and not covered here. Because we use RAID 0 for cost savings and performance, we limit each pool to 4 drives to reduce the number of servers affected if a drive fails. This also helps keep the servers from running into each other, since there is low appetite for using VMware I/O limiting.
For the sake of the conversation, let's say a significant budget increase ($2,000+) is not possible. It should be known that we have full C-level sign-off on the risk of downtime.
Last piece: we also have a couple of 50 TB datastores where the storage pools are configured as RAID 10 (8 x 7.2K HDDs) instead of RAID 0 with SSDs, and that level of performance is not enough, because the workloads are just too much for the IOPS that HDDs can produce.
This brings us to my question: given the restrictions, is this a good approach for performance? What have others done with similar goals and restrictions? Please remember that downtime is acceptable for a few servers at a time in the case of a drive failure, because these are not production workloads; those run in AWS and Azure.
I know this question crosses many areas, but I also know many DBAs nowadays have had to become familiar with those areas. I am really looking for advice from those in similar situations.
Thank you
Complete a backup restore test during the day. Destroy the storage volumes to simulate a RAID 0 storage pool failure, which will take the test systems down. Copy from backup media and complete a restore. If the organization is satisfied with the recovery and tolerates that amount of downtime, then a RAID 0 scheme could work. (I am skeptical they will tolerate a few hours down, but maybe.)
Restore tests are useful on any storage, but extra important when a single drive failure forces a restore.
Doing such a restore test during business hours is important. Drive failures don't wait for after hours, so this forces users to appreciate how much downtime a restore really means. Also, you, the system admin, should not have to work odd hours for a test system that is documented to be of lesser importance.
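Before the drill, you can set expectations with a back-of-envelope estimate of the restore window from data size and sustained restore throughput. A minimal sketch; the 16 TB pool size and 800 MB/s figure are hypothetical placeholders, not measurements from your environment:

```python
def estimate_restore_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Rough restore window: data size divided by sustained restore throughput.

    Ignores backup verification, VM re-provisioning, and admin time, so
    treat the result as a lower bound on real downtime.
    """
    megabytes = data_tb * 1024 * 1024          # TB -> MB
    return (megabytes / throughput_mb_s) / 3600

# Hypothetical example: one 4-drive RAID 0 pool (~16 TB of databases)
# restored over a 10 Gbps link that sustains ~800 MB/s end to end.
hours = estimate_restore_hours(16, 800)
print(f"~{hours:.1f} hours to restore")  # roughly 5.8 hours, best case
```

Run the real drill with a stopwatch and compare; the gap between the estimate and reality is usually the interesting number.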
Regarding performance: for your capacity planning, define an IOPS budget. Look at IOPS numbers at the database, host, or storage array level, and observe when performance is acceptable.
7200 RPM drives under a small-block random load might get 70 IOPS each, raw. Not a lot. Divide your IOPS requirement by this to approximate the number of spindles required. Do the same for solid state, which should be thousands of IOPS per drive. Compare price per IOPS as well as price per capacity.
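The spindle math above can be sketched as follows. The 40,000 IOPS target, per-drive IOPS, and drive prices are made-up placeholder figures for illustration, not quotes; plug in your observed budget and vendor pricing:

```python
import math

def drives_needed(required_iops: int, iops_per_drive: int) -> int:
    """Round up: you can't buy a fraction of a drive."""
    return math.ceil(required_iops / iops_per_drive)

def price_per_iops(drive_price: float, iops_per_drive: int) -> float:
    return drive_price / iops_per_drive

required = 40_000                        # hypothetical IOPS budget from observation
hdd_count = drives_needed(required, 70)  # ~70 raw IOPS per 7.2K HDD
ssd_count = drives_needed(required, 20_000)  # tens of thousands per SSD

print(f"{hdd_count} HDDs vs {ssd_count} SSDs to hit {required} IOPS")
print(f"HDD $/IOPS: {price_per_iops(150, 70):.2f}")      # $150/drive assumed
print(f"SSD $/IOPS: {price_per_iops(500, 20_000):.4f}")  # $500/drive assumed
```

With numbers like these, the HDD option needs hundreds of spindles, which is why your RAID 10 HDD pools can't keep up even though HDDs win on price per capacity.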
This barely scratches the surface of the possibilities for storage design. For example, hybrid arrays with both SSDs and spindles are possible. However, those work best in storage that has a caching tier, or an obvious bottleneck like RAID 4's dedicated parity drive. Uniform storage is simpler to manage for most RAID types.