Solution 1:

Update: I realized a problem with my previous post; here's an update with a general philosophy of "marginalizing out" unwanted covariates.

Introduction

Suppose we have a mean model \begin{align*} \mathbb{E}[Y|X^1, X^2] &= \beta_0 + \beta_1 X^1 + \beta_2 X^2 \tag{$\heartsuit$}\\ \mathbb{E}[Y|X^1, X^2] &= e^{\beta_0 + \beta_1 X^1 + \beta_2 X^2} \tag{$\diamondsuit$} \end{align*} Assuming $X^1 \perp \!\!\! \perp X^2$, we have from $(\heartsuit)$ the marginal model \begin{align*} \mathbb{E}[Y|X^1] = \mathbb{E}[\mathbb{E}[Y|X^1, X^2] \mid X^1] = \beta_0 + \beta_1 X^1 + \beta_2\mathbb{E}[X^2|X^1] = \beta_0^* + \beta_1^* X^1 \end{align*} with $(\beta_0^*, \beta_1^*) = (\beta_0 + \beta_2\mathbb{E}[X^2], \beta_1)$, using $\mathbb{E}[X^2|X^1] = \mathbb{E}[X^2]$ by independence, or from $(\diamondsuit)$ the marginal model \begin{align*} \mathbb{E}[Y|X^1] = \mathbb{E}[\mathbb{E}[Y|X^1, X^2] \mid X^1] = e^{\beta_0^* + \beta_1^* X^1} \end{align*} where $(\beta_0^*, \beta_1^*) = (\beta_0 + \log \mathbb{E}[e^{\beta_2 X^2}], \beta_1)$, again dropping the conditioning on $X^1$ by independence. In both cases, where $Y$ is assumed to follow a linear or log-linear model in the covariates $X^1, X^2$, the marginal coefficient $\beta_1^*$ is numerically equal to the coefficient $\beta_1$ from the full conditional model.
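
As a quick numerical sanity check of the linear case $(\heartsuit)$ (a minimal sketch, not part of the original post; the coefficient values and normal covariates are arbitrary assumptions), dropping an independent $X^2$ from the regression shifts only the intercept:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(size=n)
x2 = rng.normal(loc=1.0, size=n)                 # independent of x1, non-zero mean
y = 0.5 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

marginal = sm.OLS(y, sm.add_constant(x1)).fit()  # regress on x1 only
print(marginal.params)                           # ≈ (0.5 + 1.5 * E[x2], 2.0) = (2.0, 2.0)
```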

Consistently estimating a parameter

Generalized estimating equations set up equations of the form \begin{align*} \sum_{k=1}^{K} \mathbf{D}_k^\intercal \mathbf{V}_k^{-1}(\mathbf{Y}_k - \boldsymbol{\mu}_k) = \mathbf{0} \end{align*} where $\mathbf{Y}_k$ is a vector of observations believed to be correlated with one another, while $\mathbf{Y}_k \perp \!\!\! \perp \mathbf{Y}_{k'}$ for $k \neq k'$ (the vectors are independent of each other). Without diving too deep into the theory of GEEs, the relevance here is that, as long as the mean model is correctly specified, fitting a GEE will return consistent estimates of the desired coefficients. And a GLM is a special case of a GEE! What this implies, for example, is that if we have a model \begin{align*} Y_i = \beta_0 + \beta_1 X^1_i + \beta_2 X^2_i + \epsilon_i, \qquad \epsilon_i \sim F \end{align*} for any distribution $F$ with mean 0 and finite second moment (say, $\epsilon_i \sim \text{Exponential}(\frac{1}{2021}) - 2021$), then fitting as if the errors were normal would still yield consistent estimates of $(\beta_0, \beta_1, \beta_2)$! Of course, the standard errors for these estimators would be incorrect and would need to be adjusted with sandwich estimators, but the consistency is not affected.
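
For instance (a minimal sketch with made-up coefficients; the $\text{Exponential}(\frac{1}{2021}) - 2021$ example above behaves the same way, just with much noisier estimates because of its large variance), fitting ordinary least squares "as if normal" to skewed errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2021)
n = 100_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.exponential(scale=1.0, size=n) - 1.0   # mean 0, heavily skewed, non-normal
y = 1.0 + 2.0 * x1 - 3.0 * x2 + eps

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit(cov_type="HC0")           # robust (sandwich) standard errors
print(fit.params)                                # ≈ (1, 2, -3)
```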

Application to current problem

The mean of $\text{Beta}(p, 1)$ is $\frac{p}{p+1}$, which is not a tractable form for the marginalization trick because of the "non-separability" of $X^2$ from $X^1$. My idea was to search for a transformation that makes things more tractable, and one such transformation is $\widetilde{Y}_i = -\log(Y_i)$: if $Y_i \sim \text{Beta}(p, 1)$ then $-\log(Y_i) \sim \text{Exponential}(p)$, which gives the mean model \begin{align*} \mathbb{E}[\widetilde{Y}_i|X^1, X^2] = \frac{1}{p} = \frac{1}{\beta_0 + \beta_1 X^1 + \beta_2 X^2} \end{align*} This still isn't quite there, but if you're willing to accept the functional form \begin{align*} p = e^{\beta_0 + \beta_1 X^1 + \beta_2 X^2} \end{align*} (which arguably makes more sense, since it guarantees $p > 0$ as required by the beta distribution), we have \begin{align*} \mathbb{E}[\widetilde{Y}_i|X^1, X^2] = e^{-\beta_0 - \beta_1 X^1 - \beta_2 X^2} \end{align*} and therefore the marginalization trick gives \begin{align*} \mathbb{E}[\widetilde{Y}_i|X^1] = e^{\beta_0^* + \beta_1^* X^1 } \end{align*} where $(\beta_0^*, \beta_1^*) = (-\beta_0 + \log \mathbb{E}[e^{-\beta_2 X^2}], -\beta_1)$. Essentially, we're treating $X^2$ like the errors in a normal regression model, while noting that they may not have mean 0 and would therefore bias the intercept. That's not a problem, since we are interested in $\beta_1$!

Final procedure

  1. Transform $\widetilde{Y} = -\log(Y)$
  2. Regress $\widetilde{Y}$ on $X^1$ with a log link, using whatever working distribution you like (exponential/gamma, normal). The resulting slope coefficient is consistent for $-\beta_1$; see the sketch below.
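
As an end-to-end check (a minimal sketch, not from the original answer; the simulated coefficients, the normal $X^2$, and the gamma working family in statsmodels are all assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                          # unobserved covariate
beta0, beta1, beta2 = 0.5, 1.0, -0.7
p = np.exp(beta0 + beta1 * x1 + beta2 * x2)      # shape parameter of Beta(p, 1)
y = rng.beta(p, 1.0)

y_tilde = -np.log(y)                             # step 1: Exponential(p) given (x1, x2)
X = sm.add_constant(x1)
# step 2: log-link GLM of y_tilde on x1 only; the gamma family (the exponential is a
# special case) is just a working model, so use robust (sandwich) standard errors
fit = sm.GLM(y_tilde, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit(cov_type="HC0")
print(fit.params[1])                             # ≈ -beta1 = -1.0
```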

Solution 2:

Maybe one can go numerical. In the presence of latent variables we have to maximize the marginal likelihood, which here reads:

$\log L=\sum_i \log \int p(x^1_i)\,p(x^2_i)\,\big(\beta_1 x_i^1+\beta_2 x^2_i\big)\,y_i^{\beta_1 x_i^1+\beta_2 x^2_i - 1}\,dx^2_i$ [1]

One could try the EM algorithm and see whether it simplifies things (though the nasty additive term mixing the parameters does not help...).
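
Alternatively, [1] can be maximized directly by numerical integration. Below is a hedged sketch (not from the original answer): it assumes $x^2_i \sim N(0,1)$, borrows the log link $p = e^{\beta_1 x^1_i + \beta_2 x^2_i}$ from Solution 1 so that $p > 0$, and integrates out the latent $x^2_i$ with Gauss-Hermite quadrature; swap in your own prior and link.

```python
import numpy as np
from scipy.optimize import minimize

# quadrature rule for integrals against a standard normal density
nodes, weights = np.polynomial.hermite.hermgauss(40)
x2_nodes = np.sqrt(2.0) * nodes
w = weights / np.sqrt(np.pi)

def neg_log_marginal_likelihood(beta, x1, y):
    beta1, beta2 = beta
    # p has shape (n, K): one column per quadrature node for the latent x2
    p = np.exp(beta1 * x1[:, None] + beta2 * x2_nodes[None, :])
    density = p * y[:, None] ** (p - 1.0)        # Beta(p, 1) density at each y_i
    # the p(x^1_i) factor in [1] is constant in the parameters and is dropped
    return -np.sum(np.log(density @ w))

# toy data; in practice plug in your observed x1 and y
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
y = rng.beta(np.exp(1.0 * x1 + 0.5 * rng.normal(size=500)), 1.0)

# note: with a symmetric prior on x2, only |beta2| is identified
result = minimize(neg_log_marginal_likelihood, x0=[0.0, 0.1], args=(x1, y), method="Nelder-Mead")
print(result.x)                                  # rough estimates of (beta1, beta2)
```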

Another approach is to upgrade $x^2_i$, $\beta_1$, and $\beta_2$ all to random variables with prior distributions and either sample from the posterior or take a MAP estimate for all of them.

This can be done using PyMC3:

https://docs.pymc.io/en/v3/

https://discourse.pymc.io/t/maximum-likelihood-estimation-of-a-bayesian-model/967

I would try to modify the code from here (the first linear regression example) in two ways:

https://docs.pymc.io/en/v3/pymc-examples/examples/getting_started.html

  1. The distribution of $Y$ should follow your beta distribution;
  2. Upgrade $X^2$ to be a random variable with a prior distribution, rather than a fixed one;

Having done this, you should be able to obtain MAP estimates for all variables and samples from the posterior. Each $x^2_i$ will then be treated as a parameter of the model, so you would be sampling a space whose dimension grows with the number of observations you have; I am therefore not sure how stable this approach would be.
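
A minimal sketch of such a model in PyMC3 (the priors, the standard-normal latent $X^2$, and the log link borrowed from Solution 1 to keep $p > 0$ are illustrative assumptions, not part of the linked example):

```python
import numpy as np
import pymc3 as pm

x1_obs = np.random.normal(size=200)        # observed covariate (placeholder data)
y_obs = np.random.beta(2.0, 1.0, size=200) # placeholder response in (0, 1)

with pm.Model() as model:
    beta0 = pm.Normal("beta0", mu=0.0, sigma=10.0)
    beta1 = pm.Normal("beta1", mu=0.0, sigma=10.0)
    beta2 = pm.Normal("beta2", mu=0.0, sigma=10.0)
    x2 = pm.Normal("x2", mu=0.0, sigma=1.0, shape=len(y_obs))  # point 2: latent X^2, one per row
    p = pm.math.exp(beta0 + beta1 * x1_obs + beta2 * x2)       # log link keeps p > 0
    y = pm.Beta("y", alpha=p, beta=1.0, observed=y_obs)        # point 1: Beta(p, 1) likelihood

    map_estimate = pm.find_MAP()        # MAP for the betas and every latent x2_i
    trace = pm.sample(1000, tune=1000)  # or draw samples from the posterior instead
```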