How can I safely use storage thin provisioning?
I have storage that allows me to thin provision my volumes presented to the clients. Is this safe? What are the best practices?
Generically, whether you're talking about SCSI LUNs (SAN) or network file systems (NAS), thin provisioned storage is when you tell the storage client that it has more space than you've actually allocated to it. This has no risks on its own, but if you don't have enough actual storage to allow every single container to grow to the full promised size, that's called overprovisioning and it entails risk.
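To make the distinction concrete, here is a minimal sketch of the arithmetic. The volume sizes, the pool size, and the function name `overprovisioning_ratio` are all made up for illustration; no particular storage product reports it this way.

```python
def overprovisioning_ratio(promised_tb, physical_tb):
    """Return total promised capacity as a fraction of physical capacity."""
    if physical_tb <= 0:
        raise ValueError("physical capacity must be positive")
    return sum(promised_tb) / physical_tb


# Hypothetical numbers: sizes promised to clients vs. the disk actually behind them.
volumes_tb = [10, 10, 20, 5]
pool_tb = 30

ratio = overprovisioning_ratio(volumes_tb, pool_tb)
print(f"promised {sum(volumes_tb)} TB on {pool_tb} TB physical: {ratio:.0%}")
if ratio > 1:
    print("overprovisioned: not every volume can grow to its full promised size")
```

A ratio at or below 1 is thin provisioning without overprovisioning: every volume could still grow to its full size.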
Advantages
The advantages of thin provisioning and overprovisioning are compelling. Many consumers of storage (servers, file share users, etc.) request far more storage than they initially need, and keep asking for more as they grow so that they always have a safe margin. A single, centrally provisioned margin for growth is far more efficient than hundreds of small ones. Without thin provisioning and overprovisioning, utilization of the underlying storage can be very low; with them, you can run at a much higher utilization rate.
Risks
All the risks of this scenario come from overprovisioning. The more you overprovision, the higher your risk. The danger is that consumption completely fills the available physical storage, which will generally cause all the storage containers in that pool to fail in one way or another. Filesystems will go read-only or offline, and LUNs will go offline.
Best practice
In order to get the benefits of higher utilization that come with overprovisioning while mitigating the risk, you need to constantly monitor the storage and be able to take action when required.
- Use software to monitor and alert on pool utilization. If nothing built into the box will do this, write it yourself. Most storage supports CLI commands whose output can be read by a script that you schedule to run frequently. The frequency should be high enough that none of your pools can fill up between polling events.
- Establish a baseline threshold. All new pools of storage with overprovisioned clients should get this applied by default. This threshold should be the most conservative one in your environment.
- For smaller pools, use a lower threshold. If you give yourself 30% of headroom as warning on a 100TB pool, you have a lot more time to add disk than with 30% of headroom on a 10TB pool, assuming both are capable of ingesting writes at the same speed.
- Adjust the threshold up if you're less overprovisioned. If you have a pool that's provisioned at only 106% of its physical capacity, hitting 70% utilization isn't nearly as risky as it is on a pool that's provisioned at 200%.
- Adjust your thresholds based on how much time you need to add space to a pool. In my shop, we keep online storage in each box held back for growth in any pool, and more storage on a shelf ready to be installed into any storage box. We do this for enough types of storage that we can handle growth in any pool.
- Wherever possible and applicable, thin out your storage. Deduplication decreases your utilization, and if you are using LUNs, zero page reclaim and clients that can release (unmap) blocks when they delete data both help.
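The threshold reasoning above can be sketched roughly as follows. This is only an illustration: the function names, the 0.5 TB/hour worst-case ingest rate, and the 48-hour lead time are assumptions, and you would feed in real numbers from whatever CLI or API your storage exposes.

```python
def hours_until_full(capacity_tb, used_tb, ingest_tb_per_hour):
    """Time left before the pool fills at a sustained worst-case ingest rate."""
    if ingest_tb_per_hour <= 0:
        return float("inf")
    return max(capacity_tb - used_tb, 0) / ingest_tb_per_hour


def alert_threshold(capacity_tb, ingest_tb_per_hour, lead_time_hours):
    """Highest utilization fraction that still leaves lead_time_hours to add disk."""
    reserve_tb = ingest_tb_per_hour * lead_time_hours
    return max(0.0, 1.0 - reserve_tb / capacity_tb)


def should_alert(capacity_tb, used_tb, threshold):
    """True once utilization has reached the threshold."""
    return used_tb / capacity_tb >= threshold


# Same 30% headroom and same assumed ingest rate, very different warning time.
for capacity_tb in (100, 10):
    used_tb = 0.7 * capacity_tb
    left = hours_until_full(capacity_tb, used_tb, ingest_tb_per_hour=0.5)
    limit = alert_threshold(capacity_tb, ingest_tb_per_hour=0.5, lead_time_hours=48)
    print(f"{capacity_tb} TB pool: {left:.0f} h until full, alert at {limit:.0%}")
```

With these assumed numbers the small pool comes out with a threshold of zero, which is the sketch's way of saying that no threshold gives you 48 hours on that pool: you would need spare capacity already online, or a shorter lead time.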
The point and purpose of thin provisioning is similar to the reason to use consolidated storage in the first place: by consolidating, you get a better peak capacity with a lower average needed.
But be under no illusions: thin provisioning is pretending to allocate something without actually doing so. There are many reasons this is useful. Two key ones are:
- Higher utilization: any space inside a volume that isn't completely full is wasted disk. Most systems don't run at 100% full all the time (and are generally assumed to be 'in trouble' if they are).
- Deferred spending: if I give you 10TB today, but you fill it at 2TB per year, I can probably pay less if I wait before buying the disks.
You have two gotchas arising from this though:
- Running out of disk too fast: someone who starts filling 'their' disks can run the rest of the enterprise out of space.
- Spindle counts: buying fewer disks means you've got fewer spindles and thus fewer IOPS, which means your disks will run hotter and your performance will be worse.
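A back-of-the-envelope sketch of the spindle-count gotcha. The 1TB disk size and 150 IOPS per disk are assumed figures for illustration only, and the sketch ignores RAID overhead and caching.

```python
import math


def disks_needed(capacity_tb, disk_tb):
    """Number of disks required to hold capacity_tb (no RAID overhead)."""
    return math.ceil(capacity_tb / disk_tb)


def aggregate_iops(n_disks, iops_per_disk):
    """Rough raw IOPS available from n_disks spindles."""
    return n_disks * iops_per_disk


DISK_TB = 1          # assumed disk size
IOPS_PER_DISK = 150  # assumed per-spindle figure

thick = disks_needed(10, DISK_TB)  # buy everything that was promised
thin = disks_needed(2, DISK_TB)    # buy only what is used so far

print(f"thick: {thick} disks, ~{aggregate_iops(thick, IOPS_PER_DISK)} IOPS")
print(f"thin:  {thin} disks, ~{aggregate_iops(thin, IOPS_PER_DISK)} IOPS")
```

The client still sees a 10TB volume in both cases, but under these assumptions the thin pool has a fifth of the spindles behind it.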
Things I would suggest as best practices for thin provisioning:
- Get management 'buy in' to the risks involved.
- Set an 'acceptable' oversubscription ratio. (This is a business risk decision, so hand it upwards.)
- Also consider individual volume sizes. A 20TB volume is more likely to gobble up space than a lot of 100GB volumes.
- Have capacity (or a purchase order) ready to go when you start running low (based on 'free space' or 'volume size'). You don't get as much warning that you're about to run out, and you probably can't wait until the next quarter or financial year to backfill: you're not buying new capacity any more, you're backfilling capacity you've already 'sold'.
- Consider the theoretical max capacity of your storage system. Think very carefully about what you'll do if you go past it.
- Pay close attention to your performance, both IOPS and throughput. You probably won't get a good response to 'how much performance do you need' questions, but you may find you 'run out' of performance faster than you would otherwise. Set a threshold for this too.
- Consider your charging accordingly. You save money by thin provisioning, but you will NEED some of it back to keep up with your thin provisioning model.
I can't stress that last point enough. You may well have customers who ask for storage and never use it. That's money you didn't spend, and it represents a saving. However, that's not the same as customers who take a while to use it (e.g. more than a financial year): there you save money by buying bigger/cheaper disks next year. But you DON'T get away with 'selling' the space up front and just hoping that no one ever uses it. You may well end up filling the whole lot over time, and you need to be ready to backfill.
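One rough way to see when the backfill bill arrives is to project each customer's usage forward and find the first year it exceeds what you physically own. The customer figures, the field names, and the linear growth model are assumptions for illustration; real growth is rarely this tidy.

```python
def projected_usage(customers, years):
    """Total projected usage after `years`, capping each customer at their promised size."""
    return sum(
        min(c["promised_tb"], c["used_tb"] + c["growth_tb_per_year"] * years)
        for c in customers
    )


def first_year_short(customers, physical_tb, horizon_years=10):
    """First year in which projected usage exceeds physical capacity, or None."""
    for year in range(horizon_years + 1):
        if projected_usage(customers, year) > physical_tb:
            return year
    return None


# Hypothetical customers: one fills steadily, one asked for space and never uses it.
customers = [
    {"promised_tb": 10, "used_tb": 2, "growth_tb_per_year": 2},
    {"promised_tb": 10, "used_tb": 1, "growth_tb_per_year": 0},
]

year = first_year_short(customers, physical_tb=8)
print(f"20 TB 'sold' on 8 TB physical: backfill needed by year {year}")
```

The idle customer is a genuine saving in this model, but the steady one still forces a purchase, so the money charged up front has to cover it.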