Why do we consider the log-likelihood instead of the likelihood for a Gaussian distribution?
- It is extremely useful, for example, when you want to calculate the joint likelihood for a set of independent and identically distributed points. Assuming that you have your points: $$X=\{x_1,x_2,\ldots,x_N\} $$ The total likelihood is the product of the likelihoods of the individual points, i.e.: $$p(X\mid\Theta)=\prod_{i=1}^Np(x_i\mid\Theta) $$ where $\Theta$ are the model parameters: the mean vector $\mu$ and the covariance matrix $\Sigma$. If you use the log-likelihood you end up with a sum instead of a product: $$\ln p(X\mid\Theta)=\sum_{i=1}^N\ln p(x_i\mid\Theta) $$
- Also, in the case of a Gaussian, it allows you to avoid computing the exponential:
$$p(x\mid\Theta) = \dfrac{1}{(\sqrt{2\pi})^d\sqrt{\det\Sigma}}e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$$ Which becomes:
$$\ln p(x\mid\Theta) = -\frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln(\det \Sigma)-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)$$
- As you mentioned, $\ln x$ is a monotonically increasing function, so log-likelihoods preserve the same ordering as the likelihoods:
$$p(x\mid\Theta_1)>p(x\mid\Theta_2) \Leftrightarrow \ln p(x\mid\Theta_1)>\ln p(x\mid\Theta_2)$$
- From the standpoint of computational complexity, summing is less expensive than multiplying (although nowadays the difference is negligible). More importantly, the likelihoods of many points become very small, so you quickly run out of floating-point precision and hit an underflow. That is why it is far more convenient to use the logarithm of the likelihood; simply try to calculate the likelihood of a few hundred points by hand with a pocket calculator - it is almost impossible. The sketch below makes this concrete.
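A minimal numerical sketch of the points above (the dimensions, sample size, and parameters are made up for illustration; NumPy and SciPy are assumed to be available). It evaluates the Gaussian log-density directly from the closed form, so no exponential is ever computed, and then compares the sum of log-likelihoods with the product of raw likelihoods to show the underflow:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Made-up parameters and an i.i.d. sample: d = 3 dimensions, N = 2000 points.
d, N = 3, 2000
mu = np.zeros(d)
Sigma = np.eye(d)
X = rng.multivariate_normal(mu, Sigma, size=N)

# Log-density evaluated directly from the closed form above: no exponential needed.
diff = X - mu                                      # shape (N, d)
maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
logp = (-0.5 * d * np.log(2 * np.pi)
        - 0.5 * np.log(np.linalg.det(Sigma))
        - 0.5 * maha)

# The sum of log-likelihoods is a moderate negative number ...
print(logp.sum())

# ... while the product of the raw likelihoods underflows to 0.0 in double precision.
print(np.prod(np.exp(logp)))

# Sanity check against SciPy's implementation of the same log-density.
print(np.allclose(logp, multivariate_normal(mean=mu, cov=Sigma).logpdf(X)))
```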
Additionally, in a classification framework you can simplify the calculations even further. The ordering remains valid if you drop the division by $2$ and the $d\ln(2\pi)$ term, because these are class-independent. Also, if the covariance of both classes is the same ($\Sigma_1=\Sigma_2$), you can remove the $\ln(\det \Sigma)$ term as well.
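As a sketch of that simplification (the two-class parameters and the test point below are hypothetical, and equal class priors are assumed so the prior term can be ignored): only the terms that differ between classes are kept, and the ordering of the scores is unchanged.

```python
import numpy as np

def discriminant(x, mu, Sigma):
    """Simplified class score: proportional to the Gaussian log-likelihood with the
    class-independent terms (the overall factor 1/2 and d*ln(2*pi)) dropped."""
    diff = x - mu
    return -np.log(np.linalg.det(Sigma)) - diff @ np.linalg.inv(Sigma) @ diff

# Hypothetical parameters for two classes (equal priors assumed).
mu1, Sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, Sigma2 = np.array([2.0, 2.0]), np.eye(2)

x = np.array([0.5, 0.3])
# Pick the class with the larger score; if Sigma1 == Sigma2 the log-determinant
# term cancels as well, leaving only the Mahalanobis distances to compare.
label = 1 if discriminant(x, mu1, Sigma1) > discriminant(x, mu2, Sigma2) else 2
print(label)   # -> 1 for this made-up point
```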
First of all, as stated, the log is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood. Furthermore, one can make use of $\ln(ab) = \ln(a) + \ln(b)$: many equations simplify significantly because one gets sums where one had products, and one can then maximize simply by taking derivatives and setting them equal to $0$.
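As a concrete illustration of that last step (a standard one-dimensional example, not taken from the answer above): for i.i.d. $x_i \sim \mathcal{N}(\mu,\sigma^2)$,
$$\frac{\partial}{\partial\mu}\sum_{i=1}^N\ln p(x_i\mid\mu,\sigma^2)=\frac{\partial}{\partial\mu}\sum_{i=1}^N\left[-\tfrac{1}{2}\ln(2\pi\sigma^2)-\frac{(x_i-\mu)^2}{2\sigma^2}\right]=\sum_{i=1}^N\frac{x_i-\mu}{\sigma^2}=0\;\Rightarrow\;\hat\mu=\frac{1}{N}\sum_{i=1}^N x_i,$$
whereas differentiating the raw product $\prod_{i=1}^N p(x_i\mid\mu,\sigma^2)$ would require applying the product rule across all $N$ factors.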