What does it mean to do MLE with a continuous variable?

I am struggling with the semantics of continuous random variables.

For example, we do maximum likelihood estimation, in which we try to find the parameter $\theta$ which, for some observed data $D$, maximizes the likelihood $P(\theta|D)$.

But my understanding of this is $$P(\theta = x) = P(x\leq\theta\leq x) = \int_x^xp(t)dt = 0$$ so I am not sure how any $\theta$ can result in a non-zero probability.

Intuitively I understand what it means to find the "most probable" $\theta$, but I am having trouble uniting it with the formal definition.


EDIT: In my class we defined $L(\theta:D)=P(D|\theta)=\prod_i P(D_i|\theta)$ (assuming i.i.d. data, where the $D_i$ are the individual observations). Then we want to find $\text{argmax}_\theta \prod_i P(D_i|\theta)$.

I was incorrect above about finding $P(\theta)$, but it seems to me we're still trying to find a maximal probability in a setting where all such probabilities are zero. Some answerers suggested that we're actually trying to find the maximum probability density, but I don't understand why this is true.


It seems to me like whoever defined it for you was being hand-wavy (and, I would argue, careless). For continuous random variables, the likelihood is defined to be the joint density of the data $\mathcal D$, viewed as a function of the unknown parameter $\theta$; i.e. for a vector-valued observation $\mathbf x$ we have $$L(\theta \mid \mathbf x) := f(\mathbf x \mid \theta) $$ where $f$ is the density of the random vector $\mathbf X$. For discrete random vectors, take $f$ to be a probability mass function instead. This is the definition of likelihood that you should use. Then, to get the MLE, take an $\arg \max$ over $\theta$. To get a definition that does not treat continuous and discrete random variables as separate cases, one needs to introduce measure theory and the notion of the Radon-Nikodym derivative, in which case the notion of density generalizes so that mass functions are a type of density, and the arbitrariness vanishes.
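As a minimal numerical sketch of this definition (the model, data, and grid below are my own illustrative choices, not anything from the question): take i.i.d. draws from a normal distribution with known standard deviation $1$ and unknown mean $\theta$, evaluate the joint (log) density of the observed data over a grid of candidate values of $\theta$, and take the arg max.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: i.i.d. draws from N(theta_true, 1); theta_true is unknown to the estimator.
rng = np.random.default_rng(0)
theta_true = 2.0
x = rng.normal(theta_true, 1.0, size=100)

# Likelihood L(theta | x) = joint density of the observed data as a function of theta.
# For i.i.d. data this is a product of marginal densities; we work on the log scale
# for numerical stability.
thetas = np.linspace(0.0, 4.0, 2001)
log_lik = np.array([norm.logpdf(x, loc=t, scale=1.0).sum() for t in thetas])

theta_hat = thetas[np.argmax(log_lik)]
print(theta_hat)   # close to the sample mean ...
print(x.mean())    # ... which is the exact MLE in this model, a useful sanity check
```

In this particular model the arg max is available in closed form (it is the sample mean), which is why the grid search and `x.mean()` should agree up to the grid spacing.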


"The most probable $\theta$" is a misleading way of saying it, although it is very frequently encountered. The MLE is actually the value of $\theta$ that makes the observed data more probable than it would be with any other value of $\theta$. Except that, as you note, with continuous random variables, the probability of the observed data is always $0$.

"$P(\theta=x)$" is nowhere involved.

Suppose we will observe the realized value of the random variable $X$. Given the value of $\theta$, the probability that $X$ is in any measurable set $A$ is $\displaystyle\int_A f(x \mid \theta)\,dx$. Given the observation $X=x$, the MLE is the value of $\theta$ that maximizes the likelihood function $L(\theta) = f(x\mid\theta)$.
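For a concrete instance of this (a standard textbook example, not specific to the setup above): suppose $X$ is exponential with rate $\theta$, so $f(x\mid\theta)=\theta e^{-\theta x}$ for $x>0$, and we observe $X=x$. Then $$L(\theta) = \theta e^{-\theta x}, \qquad \frac{d}{d\theta}\log L(\theta) = \frac{1}{\theta} - x = 0 \;\implies\; \hat\theta = \frac{1}{x},$$ even though $P(X=x)=0$ for every $x$: the quantity being maximized is a density, not a probability.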

Notice that if, for example, $X$ has a Poisson distribution with expected value $\theta$, then $\theta$ could be, say, $3.2781$, but $X$ must always be in $\{0,1,2,3,\ldots\}$. If it is observed that $X=4$, it makes no sense to say that we are then considering $P(\theta=4)$. So "$P(\theta=x)$" does not enter into what we're doing.
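In this Poisson example, the likelihood of the observation $X=4$ is $$L(\theta) = P(X=4\mid\theta) = \frac{e^{-\theta}\theta^4}{4!},$$ and setting the derivative of $\log L(\theta) = -\theta + 4\log\theta - \log 4!$ to zero gives $\hat\theta = 4$. Here the likelihood happens to be a genuine probability (a pmf value), but it is a probability of the data given $\theta$, never a probability statement about $\theta$.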


Although the others are correct, to address your confusion directly: the likelihood function is the joint probability density of the observed data, viewed as a function of the parameter. So in finding the maximum likelihood estimate, you find the parameter value that makes this density highest. The density is positive even though the probability of observing exactly the given $x$ is $0$. The likelihood for continuous distributions is a density function; your misunderstanding of this fact is what is causing your confusion. For discrete distributions it is the probability mass function at the observed values of the data, as a function of the parameter $\theta$. Also, $L(\theta:D)=P(D|\theta)$ is the joint density; it only factors into the product of the marginal densities when the observations are independent (which is the usual case).
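A tiny numerical sketch of that distinction, using an illustrative standard normal density and a made-up observed value: the density at the observed point is strictly positive, while the probability of a small interval around it shrinks to $0$ with the interval.

```python
import numpy as np
from scipy.stats import norm

# A hypothetical observed value from a N(0, 1) model, chosen only for illustration.
x_obs = 0.7

# The density at the observed point is strictly positive ...
density = norm.pdf(x_obs, loc=0.0, scale=1.0)

# ... while the probability of landing in a shrinking interval around it goes to 0.
for eps in [1e-1, 1e-3, 1e-6]:
    prob = norm.cdf(x_obs + eps, 0.0, 1.0) - norm.cdf(x_obs - eps, 0.0, 1.0)
    print(f"eps={eps:g}: P(|X - x_obs| <= eps) ~ {prob:.2e}, density = {density:.3f}")
```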


When using the maximum likelihood estimation principle, the parameter that you are trying to estimate is not a random variable. That is, there is some true parameter $\theta^*$, which is a fixed (non-random), but unknown quantity.

The maximum likelihood estimator is formed as $\hat \theta = \arg\max_\theta \prod_{i=1}^n f(x_i|\theta)$, which is a random variable, as it depends on the data ${\cal D} = \{x_i\}_{i=1}^n$.

Now, as you pointed out, $P(\hat \theta = \theta^*) = 0$. However, there are a number of ways in which you can evaluate the goodness of your estimator. For example, under some conditions, you can show that $E(\hat \theta - \theta^*)^2$ is small, or that $\hat \theta \xrightarrow{a.s.} \theta^*$ as $n \rightarrow \infty$.
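As a rough simulation sketch of that last point (the normal model and the particular sample sizes below are my own illustrative choices): for i.i.d. $N(\theta^*, 1)$ data the MLE of the mean is the sample mean, and the empirical value of $E(\hat \theta - \theta^*)^2$ shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 2.0   # fixed, non-random, "true" parameter
n_reps = 5000      # number of simulated data sets per sample size

for n in [10, 100, 1000]:
    # For N(theta, 1) data the MLE of theta is the sample mean of each data set.
    theta_hat = rng.normal(theta_star, 1.0, size=(n_reps, n)).mean(axis=1)
    mse = np.mean((theta_hat - theta_star) ** 2)
    print(f"n={n:5d}: empirical E(theta_hat - theta*)^2 ~ {mse:.4f}")
```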


Here's still another way to view the MLE, that really helped clarify it for me:

You're taking the derivative of the likelihood (the density or mass function, viewed as a function of the parameter you're trying to estimate) and finding a local maximum by setting that derivative equal to $0$.

That's what the MLE is. To look at it from the viewpoint of a normal distribution, you're finding the exact value (or the formula for it) of the peak (the point of highest density, which for the normal is the mean), because that's where the derivative changes sign and is therefore $0$ for an instant.

The log step works because the log is a strictly increasing function: it changes the height of the curve but not the location of its maximum, so the $\arg\max$ is unaffected. What it does change is the algebra: it turns the product of densities into a sum, which almost always makes the derivative more straightforward.
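For example, for i.i.d. observations $x_1,\dots,x_n$ from $N(\theta,\sigma^2)$ with $\sigma$ assumed known, the log turns the product into a sum: $$\log L(\theta) = \sum_{i=1}^n \log f(x_i\mid\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\theta)^2.$$ Setting the derivative with respect to $\theta$ to zero gives $$\frac{d}{d\theta}\log L(\theta) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\theta) = 0 \;\implies\; \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i,$$ the sample mean, which is exactly the peak described above.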

Hope this helps!