Calculating days until disk is full

We keep a "mean time till full" (or "mean time to failure") metric for this purpose, using the statistical trend and its standard deviation to add smarter (less dumb) logic on top of a simple static threshold.

Simplest Alert: Just an arbitrary threshold. It doesn't take the actual disk-space usage pattern into account at all.

  • Example: current% > 90%

Simple TTF: A little smarter. Take the unused percentage minus a buffer and divide it by a zero-protected consumption rate. Not very statistically robust, but it has saved my butt a few times when my users upload their cat video corpus (true story). A minimal sketch follows the example below.

  • Example: (100% - 5% - current%) / MAX(rate(current%), 0.001%)
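That arithmetic is small enough to sketch in Python; this assumes used_pct is a 0-100 gauge and that you already track a consumption rate (the names here are illustrative, not from any particular tool):

    # Simple TTF: remaining headroom divided by a zero-protected rate.
    BUFFER_PCT = 5.0  # alert before the disk is literally full

    def simple_ttf_hours(used_pct: float, rate_pct_per_hour: float) -> float:
        remaining = 100.0 - BUFFER_PCT - used_pct
        return remaining / max(rate_pct_per_hour, 0.001)  # guard zero/negative rates

Note that remaining goes negative once you're inside the buffer, which is exactly the wart the next variant clamps away.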

Better TTF: I wanted to avoid alerting on static read-only volumes sitting at 99% (unless they actually change), I wanted more proactive notice for noisy volumes, and I wanted to detect applications with unmanaged disk-space footprints. Oh, and the occasional negative values in the Simple TTF just bothered me. Again, a sketch follows the example below.

  • Example: MAX(100% - 1% - stdev(current%) - current%, 0) / MAX(rate(current%), 0.001%)
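The same idea with the dynamic buffer, sketched under the assumption that you keep a recent window of samples (again, names are illustrative):

    import statistics

    STATIC_BUFFER_PCT = 1.0

    def better_ttf_hours(samples: list[float], rate_pct_per_hour: float) -> float:
        """TTF with a stddev-sized dynamic buffer, clamped at zero.

        samples: recent used-percentage readings (0-100), newest last.
        """
        stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
        headroom = 100.0 - STATIC_BUFFER_PCT - stdev - samples[-1]
        return max(headroom, 0.0) / max(rate_pct_per_hour, 0.001)

Noisy volumes carry a larger stddev, which shrinks the effective headroom and buys the earlier notice; the outer clamp keeps the result from going negative.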

I still keep a static buffer of 1%. Both the standard deviation and the consumption rate increase on abnormal usage patterns, which sometimes overcompensates. In Grafana or Alertmanager terms you'll end up with some rather expensive subqueries, but I did get the smoother timeseries and less noisy alerts I was seeking.

  • Example: clamp_min((100 - 1 - stddev_over_time(usedPct{}[12h:]) - max_over_time(usedPct{}[6h:])) / clamp_min(deriv(usedPct{}[12h:]), 0.00001), 0)

Quieter drives make for very smooth alerts, and longer ranges tame even the noisiest public volumes.


Honestly "Days Until Full" is really a lousy metric anyway -- filesystems get REALLY STUPID as they approach 100% utilization.
I really recommend using the traditional 85%, 90%, 95% thresholds (warning, alarm, and critical you-really-need-to-fix-this-NOW, respectively) - this should give you lots of warning time on modern disks (let's say a 1TB drive: 85% of a terabyte still leaves you lots of space but you're aware of a potential problem, by 90% you should be planning a disk expansion or some other mitigation, and at 95% of a terabyte you've got 50GB left and should darn well have a fix in motion).
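If you want those tiers in code, here's a minimal sketch (the exact threshold values and the function name are illustrative policy, not a standard):

    # Traditional static tiers; tune the percentages to your disks and workload.
    TIERS = [(95.0, "critical"), (90.0, "alarm"), (85.0, "warning")]

    def disk_severity(used_pct: float) -> str | None:
        """Return the most severe tier exceeded, or None if usage is healthy."""
        for threshold, level in TIERS:
            if used_pct >= threshold:
                return level
        return None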

This also ensures that your filesystem functions more-or-less optimally: it has plenty of free space to deal with creating/modifying/moving large files.

If your disks aren't modern (or your usage pattern involves bigger quantities of data being thrown onto the disk) you can easily adjust the thresholds.


If you're still set on using a "days until full" metric, you can extract the data from Graphite and do some math on it. IBM's monitoring tools implement several days-until-full metrics, which can give you an idea of how to implement it, but basically you're taking the rate of change between two points in history.

For the sake of your sanity, you could use the derivative from Graphite (which will give you the rate of change over time) and project using that, but if you REALLY want "smarter" alerts I suggest using daily and weekly rates of change, calculated from peak usage for the day/week (sketched after the next paragraph).

The specific projection you use (smallest rate of change, largest rate of change, average rate of change, weighted average, etc.) depends on your environment. IBM's tools offer so many different views because it's really hard to nail down a one-size-fits-all pattern.
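As a hedged sketch of the daily-peak approach: this assumes you've already pulled (timestamp, used_pct) samples out of Graphite (e.g., via its render API), and the function and parameter names are illustrative:

    from collections import defaultdict

    def days_until_full(samples, projection="max"):
        """Project days until 100% from daily peak usage.

        samples: iterable of (unix_ts, used_pct) pairs, oldest first.
        projection: 'min', 'max', or 'avg' daily rate of change.
        """
        # Peak usage per calendar day (86400-second buckets).
        peaks = defaultdict(float)
        for ts, pct in samples:
            peaks[ts // 86400] = max(peaks[ts // 86400], pct)

        days = sorted(peaks)
        # Day-over-day rate of change between consecutive daily peaks.
        rates = [(peaks[b] - peaks[a]) / (b - a) for a, b in zip(days, days[1:])]
        growth = [r for r in rates if r > 0]
        if not growth:
            return float("inf")  # flat or shrinking: no projected exhaustion

        pick = {"min": min, "max": max, "avg": lambda g: sum(g) / len(g)}[projection]
        return (100.0 - peaks[days[-1]]) / pick(growth)

Projecting with the max daily rate alerts earliest; the average is calmer. Which one fits is, as above, environment-specific.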


Ultimately no algorithm is going to be very good at doing the kind of calculation you want. Disk utilization is driven by users, and users are the antithesis of the Rational Actor model: All of your predictions can go out the window with one crazy person deciding that today is the day they're going to perform a full system memory dump to their home directory. Just Because.


We've recently rolled out a custom solution for this using linear regression.

In our system the primary source of disk exhaustion is stray log files that aren't being rotated.

Since these grow very predictably, we can fit a linear regression to the disk utilization (e.g., z = numpy.polyfit(times, utilization, 1)) and then solve for the 100% mark given the linear model (e.g., (100 - z[1]) / z[0], since polyfit returns the slope first and the intercept second).
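A self-contained version of that calculation, using synthetic data just to show the shape of it:

    import numpy as np

    # Synthetic week of samples at 90-minute intervals (112 points), mimicking
    # a steadily growing log volume; substitute real (time, used_pct) data.
    times = np.arange(112) * 90 * 60.0               # seconds since start
    utilization = 60.0 + times * (5.0 / times[-1])   # drifts from 60% to 65%

    # Fit used_pct = z[0] * t + z[1]; polyfit returns slope first, intercept second.
    z = np.polyfit(times, utilization, 1)

    if z[0] > 0:
        seconds_left = (100.0 - z[1]) / z[0] - times[-1]  # 100% mark minus "now"
        print(f"projected full in {seconds_left / 86400:.1f} days")
    else:
        print("utilization flat or shrinking; no projected exhaustion")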

The deployed implementation is a Ruby class using GSL (shared as a gist), though numpy works quite well too.

Feeding this a week's worth of average utilization data at 90-minute intervals (112 points) has been able to pick out likely candidates for disk exhaustion without too much noise so far.

The class in the gist is wrapped in another class that pulls data from Scout, alerts to Slack, and sends some runtime telemetry to statsd. I'll leave that bit out since it's specific to our infrastructure.