I am a bit confused about the derivation of the MLE for Uniform$(0,\theta)$.

I understand that $L(\theta)={\theta}^{-n}$ is a decreasing function and to find the MLE we want to maximize the likelihood function.

What is confusing me is that if a function is decreasing, then wouldn't the function be maximized at the smallest input rather than the largest?

Thank you in advance for your help.


Welcome back to MSE.

This is one of those things that makes sense once it has been explained to you correctly, without any gaps. Unfortunately, in my experience, most answers, and even professors, don't explain all of the details.

Suppose $X_1, \dots, X_n$ are independent and distributed $\text{Uniform}(0, \theta)$, with $\theta > 0$.

Let $\mathbf{I}$ denote the indicator function, where $$\mathbf{I}(\cdot) = \begin{cases} 1, & \cdot \text{ is true} \\ 0, & \cdot \text{ is false.} \end{cases}$$

The probability density function of any of the $X_i$, for $i \in \{1, \dots, n\}$, can be written like so: $$f_{X_i}(x_i \mid \theta) = \dfrac{1}{\theta}\cdot\mathbf{I}(0<x_i<\theta)\text{.}$$ The likelihood function is thus given by $$\begin{align} L(\theta)&=f_{X_1, \dots, X_n}(x_1, \dots, x_n \mid \theta)\\ &=\prod_{i=1}^{n}f_{X_i}(x_i \mid \theta) \\ &= \dfrac{1}{\theta^n}\prod_{i=1}^{n}\mathbf{I}(0 < x_i < \theta)\text{.} \end{align}$$

The following claim, although used, is often omitted from explanations:

Claim. Let $A$ and $B$ be events. Then $\mathbf{I}(A)\cdot \mathbf{I}(B)=\mathbf{I}(A \cap B)$.

I leave the proof of this to you. Note that $ 0 < x_i < \theta$ is the same as requiring both $x_i > 0$ and $x_i < \theta$. Hence, we write $$\begin{align} L(\theta)&=\dfrac{1}{\theta^n}\prod_{i=1}^{n}\mathbf{I}(0 < x_i < \theta) \\ &= \dfrac{1}{\theta^n}\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)\mathbf{I}(x_i < \theta)] \\ &= \dfrac{1}{\theta^n}\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)]\prod_{j=1}^{n}[\mathbf{I}(x_j < \theta)]\text{.} \end{align}$$

It will be clear why I split the product as above in a bit.

The claim given above extends to any finite number of events as well. Thus,

$$\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)] = \mathbf{I}(x_1 > 0 \cap x_2 > 0 \cap \cdots \cap x_n > 0)$$ and $$\prod_{j=1}^{n}[\mathbf{I}(x_j < \theta)] = \mathbf{I}(x_1 < \theta \cap x_2 < \theta \cap \cdots \cap x_n < \theta)\text{.}$$

The next claims are often omitted as well from explanations:

Claim 1. Given $x_1, \dots, x_n \in \mathbb{R}$, $x_1, \dots, x_n < k$ if and only if $$x_{(n)}:=\max_{1 \leq i \leq n}x_i < k\text{.}$$

Claim 2. Given $x_1, \dots, x_n \in \mathbb{R}$, $x_1, \dots, x_n > k$ if and only if $$x_{(1)}:=\min_{1 \leq i \leq n}x_i > k\text{.}$$

Thus $$\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)] = \mathbf{I}(x_1 > 0 \cap x_2 > 0 \cap \cdots \cap x_n > 0) = \mathbf{I}(x_{(1)} > 0)$$ and $$\prod_{j=1}^{n}[\mathbf{I}(x_j < \theta)] = \mathbf{I}(x_1 < \theta \cap x_2 < \theta \cap \cdots \cap x_n < \theta) = \mathbf{I}(x_{(n)} < \theta)\text{.}$$ The likelihood function is thus $$L(\theta) = \dfrac{1}{\theta^n}\mathbf{I}(x_{(1)} > 0)\mathbf{I}(x_{(n)} < \theta)\text{.}\tag{*}$$ Now, consider the above as a function of $\theta$. For all intents and purposes, $\mathbf{I}(x_{(1)} > 0)$ is irrelevant when it comes to maximization of $L$ with respect to $\theta$, because it is independent of $\theta$. So, the part that really matters is $$L(\theta) \propto \dfrac{1}{\theta^n}\mathbf{I}(x_{(n)} < \theta) = \dfrac{1}{\theta^n}\mathbf{I}(\theta > x_{(n)})\text{.}\tag{**}$$ Generally, when doing maximum-likelihood estimation, we assume that the observed $x_i$ fall within the support of the given distribution, so we'll just assume $x_{(1)} > 0$.
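If you want to convince yourself that (*) really is the same thing as the product of the individual densities, here is a minimal Python sketch. The sample values and the function names are my own hypothetical choices for illustration, not part of the question.

```python
import math

def density(x, theta):
    """Uniform(0, theta) density with open support 0 < x < theta."""
    return 1.0 / theta if 0 < x < theta else 0.0

def likelihood_product(xs, theta):
    """Likelihood written as the product of the individual densities."""
    return math.prod(density(x, theta) for x in xs)

def likelihood_closed_form(xs, theta):
    """Likelihood written as in (*): theta^(-n) * I(x_(1) > 0) * I(x_(n) < theta)."""
    n = len(xs)
    indicator = 1.0 if (min(xs) > 0 and max(xs) < theta) else 0.0
    return indicator / theta ** n

# Hypothetical observed sample; here x_(1) = 0.8 and x_(n) = 3.1.
sample = [0.8, 2.5, 1.7, 3.1]

for theta in [2.0, 3.1, 3.2, 5.0]:
    a = likelihood_product(sample, theta)
    b = likelihood_closed_form(sample, theta)
    print(f"theta={theta}: product={a:.6g}, closed form={b:.6g}")
```

The two columns should agree up to floating-point rounding, and both are exactly $0$ whenever $\theta \leq x_{(n)}$.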

Remember to view (**) as a function of $\theta$. If $\theta \leq x_{(n)}$, note that $L(\theta) = 0$ because of the indicator function. This cannot be the maximized value of $L$: $L$ is, at its crux, a product of probability density values, so it is nonnegative, and $0$ is in fact the smallest value it can take.

So, in attempting to maximize $L$, suppose that $\theta > x_{(n)}$. For $n$ fixed, we obtain $$L(\theta) \propto\dfrac{1}{\theta^n}\text{.}$$ Now, note that $\dfrac{1}{\theta^n}$ is indeed a decreasing function of $\theta$ with $n$ fixed. Thus, we must make $\theta$ as small as possible, given our restriction of $\theta > x_{(n)}$.

Note. Technically, no smallest such $\theta$ exists: the restriction $\theta > x_{(n)}$ is strict, so $L(\theta)$ gets arbitrarily close to $x_{(n)}^{-n}$ as $\theta$ decreases to $x_{(n)}$ but never attains it. This is often ignored in many textbooks.
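To see this numerically, here is a small sketch that evaluates the likelihood at values of $\theta$ approaching $x_{(n)}$ from above, and then at $x_{(n)}$ itself. The sample and the step sizes are hypothetical values I picked for illustration.

```python
def likelihood(xs, theta):
    """Likelihood for the open support (0, theta): theta^(-n) if x_(n) < theta, else 0."""
    n = len(xs)
    return theta ** (-n) if (min(xs) > 0 and max(xs) < theta) else 0.0

# Hypothetical observed sample with x_(n) = 3.1.
sample = [0.8, 2.5, 1.7, 3.1]
x_max = max(sample)

# theta approaching x_(n) from above: the likelihood keeps increasing.
thetas = [x_max + eps for eps in (1.0, 0.1, 0.01, 0.001)]
values = [likelihood(sample, t) for t in thetas]
for t, v in zip(thetas, values):
    print(f"theta={t:.3f}: L={v:.6g}")

# At theta = x_(n) exactly, the strict inequality fails and L drops to 0.
print(f"theta={x_max}: L={likelihood(sample, x_max)}")
```

So the smaller we take $\theta$ (while staying above $x_{(n)}$), the larger $L$ gets, but plugging in $x_{(n)}$ itself gives $0$.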

Most textbooks will then say that the maximum likelihood estimator of $\theta$ is $$\hat{\theta}_{\text{MLE}} = X_{(n)}\text{.}$$

Note. Technically, the above result is false. The MLE does not exist, because $\theta$ cannot take on the value $x_{(n)}$ itself. For this answer to be correct, the support of the uniform PDF must include $\theta$ itself (because the maximum likelihood estimator equals one of the $X_i$). The reason for this is discussed in Lecture 2: Maximum Likelihood Estimators from MIT OpenCourseWare 18-443 Statistics for Applications. For the textbook answer to hold, the $(0, \theta)$ in the question should be $(0, \theta]$.
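As a rough check of the last point, the sketch below uses the closed support $(0, \theta]$, so the indicator becomes $\mathbf{I}(x_{(n)} \leq \theta)$, and searches a grid of candidate values of $\theta$. The sample, the grid, and the function name are hypothetical choices of mine; a grid search is only an illustration, not a proof.

```python
def likelihood_closed_support(xs, theta):
    """Likelihood for Uniform(0, theta]: theta^(-n) if x_(n) <= theta, else 0."""
    n = len(xs)
    return theta ** (-n) if (min(xs) > 0 and max(xs) <= theta) else 0.0

# Hypothetical observed sample with x_(n) = 3.1.
sample = [0.8, 2.5, 1.7, 3.1]
x_max = max(sample)

# Candidate values of theta on both sides of x_(n); k = 0 gives x_(n) exactly.
grid = [x_max + k * 0.01 for k in range(-100, 201)]

# Pick the grid point with the largest likelihood.
best = max(grid, key=lambda t: likelihood_closed_support(sample, t))
print(f"argmax on grid: {best}, x_(n): {x_max}")
```

With the closed support, the maximum on the grid is attained at $\theta = x_{(n)}$, which matches the textbook estimator $\hat{\theta}_{\text{MLE}} = X_{(n)}$.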