How to gracefully avoid divide by zero in Prometheus

There are times when you need to divide one metric by another metric.

For example, I'd like to calculate a mean latency like this:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])

If there is no activity during the specified time period, the rate() in the denominator becomes 0 and the result of the division becomes NaN (the numerator is 0 as well, so this is 0/0). If I then aggregate the results (avg(), sum(), or whatever), the whole aggregation result becomes NaN.
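
A quick way to see this behaviour, assuming you have access to the Prometheus expression browser, is to divide constant vectors. Each line below is meant to be run as a separate query:

vector(0) / vector(0)        # 0/0 evaluates to NaN
vector(1) / vector(0)        # a non-zero numerator gives +Inf instead
sum(vector(0) / vector(0))   # the NaN sample makes the sum NaN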

So I add a check for zero in the denominator:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)

This removes the NaNs from the result vector, but it also breaks the line on the graph into fragments, because the filtered-out periods now have no samples at all.

Let's mark periods of inactivity with a value of 0 to make the graph continuous again:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
or
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > bool 0

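This effectively replaces the NaNs with 0: where the filter removed a series, the or branch supplies 0 (the bool comparison returns 0 when the denominator is not greater than 0). The graph is continuous and aggregations work OK.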

But the resulting query is rather cumbersome, especially when you need more label filtering and an aggregation over the results. Something like this:

avg(
    1000 * increase({__name__=~".*_hystrix_command_latency_total_seconds_sum", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s])
    /
    (increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > 0)
    or
    increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > bool 0
) by (command_group, command_name)

Long story short: are there any simpler ways to deal with zeros in the denominator? Or any common practices?


Solution 1:

"If there is no activity during the specified time period, the rate() in the divider becomes 0 and the result of division becomes NaN."

This is the correct behaviour; NaN is what you want the result to be.

"aggregations work OK."

You can't meaningfully aggregate ratios: averaging per-series ratios gives every series the same weight, no matter how many requests it handled. You need to aggregate the numerator and the denominator separately and then divide.

So:

   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m]))
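
If the NaN samples from idle periods still need to be dropped, the > 0 filter from the question can be applied to the aggregated denominator. This is only a sketch that reuses the metric names above; the 5m window is an assumption and should be adapted to your scrape interval:

# Groups with no requests in the window are filtered out of the denominator,
# so the division produces no sample for them instead of NaN.
sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
/
(sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m])) > 0)

With this form, idle groups simply have no data point for that period, so the graph shows a gap rather than a misleading value.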

Solution 2:

Finally I have a solution for my specific problem:

A division by zero leads to NaN being displayed. That is technically correct, but it is not what the user wants to see (it does not fulfil the business requirement).

So I searched a bit and found a "solution" for my problem in the Grafana community:

Wrap your problematic query as max(YOUR_PROBLEMATIC_QUERY or vector(-1)). An additional value mapping for the fallback value then leads to a useful output.

(Of course you have to adapt the solution to your problem: min or max, and a fallback such as vector(42), vector(101), vector(...).)

Update (1)

However, depending on the query, it can be a bit trickier. For example, I have another query that also returns NaN as the result of a division by zero, and the solution above does not work for it. I had to wrap the query in parentheses and append > 0 or on() vector(100).
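
As an illustration only, this is roughly what that pattern looks like. The metric names are borrowed from the question and the fallback value 100 is taken from the text above, so both are assumptions rather than my actual query:

(
    sum(rate(hystrix_command_latency_total_seconds_sum[5m]))
    /
    sum(rate(hystrix_command_latency_total_seconds_count[5m]))
) > 0                  # a comparison with NaN is false, so the NaN sample is dropped
or on() vector(100)    # used only when the left-hand side returns nothing at all

Note that > 0 also drops genuine zero results; >= 0 keeps them while still filtering NaN. Because on() matches on no labels, the fallback appears only when the whole left-hand side is empty, so this works best when the query is aggregated down to a single series.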