PyMC3/Arviz: CDF value from trace
I have a sample from PyMC3 and I'm trying to get a cumulative probability from it, e.g. P(X < 0). I currently use this:
trace = pymc3.sample(return_inferencedata=True)
prob_x_lt_zero = (trace.posterior.X < 0).sum() / trace.posterior.X.size
Is there a better way to do this, either with some helper function from Arviz or XArray?
I haven't found any .cdf()
method or anything similar.
It's weird that such basic functions are missing, but more advanced ones are there, such as trace.posterior.X.quantile()
.
I would recommend your original approach evaluating the condition and averaging (that is basically using the empirical cdf) instead of using the KDE.
There is no equivalent that I know of, probably also because there is no equivalent in numpy either (which has both quantile
and percentile
). There is one in scipy: scipy.stats.percentileofscore but I wouldn't recommend it either unless you are working with discrete data and need the kind
argument to evaluate ties (i.e. would you care or notice any difference between using <
or <=
?). This scipy function also takes only a scalar as value to evaluate the ecdf against.
My recommendation therefore is to stick with your method, but modify a bit the implementation, so it also works when evaluating multiple values at the same time and when not reducing all the dimensions:
import arviz; import xarray
x = xarray.DataArray([-.1, 0, .1]) # skip that if working with scalars
post = arviz.load_arviz_data("rugby").posterior
prob_x_lt_zero = (post.atts < x).mean(("chain", "draw"))
which returns the probabilities for each of the 3 values we are evaluating at all 6 teams.
<xarray.DataArray (team: 6, dim_0: 3)>
array([[0. , 0. , 0.0485],
[0.347 , 0.975 , 1. ],
[0. , 0.004 , 0.4245],
[0.64 , 0.994 , 1. ],
[1. , 1. , 1. ],
[0. , 0. , 0. ]])
Coordinates:
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_0
You could approximate the CDF with a kernel density estimate, but I am not convinced that this is better than your current approach:
import arviz
grid, pdf = arviz.kde(trace.posterior.X.values, cumulative=True)
idx = np.sum(grid < 0) - 1
prob_x_lt_zero = pdf[idx]