I'm taking a machine learning course. The professor has a model for linear regression. Where $h_\theta$ is the hypothesis (proposed model. linear regression, in this case), $J(\theta_1)$ is the cost function, $m$ is the number of elements in the training set, $x^{(i)}$ and $y^{(i)}$ are the variables of the training set element at $i$

$$h_\theta = \theta_1x$$

$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

What I don't understand is why he is dividing the sum by $2m$.


The $\frac{1}{m}$ is to "average" the squared error over the number of components so that the number of components doesn't affect the function (see John's answer).

So now the question is why there is an extra $\frac{1}{2}$. In short, it doesn't matter. The solution that minimizes $J$ as you have written it will also minimize $2J=\frac{1}{m} \sum_i (h(x_i)-y_i)^2$. The latter function, $2J$, may seem more "natural," but the factor of $2$ does not matter when optimizing.

The only reason some authors like to include it is because when you take the derivative with respect to $x$, the $2$ goes away.


I assume that the $\frac{1}{m}$ component is obvious and therefore I will focus on the $\frac{1}{2}$ part. I personally doubt that so many authors would decide to include this confusing term just achieve a little bit simpler gradient formulas. Note that there are ways of finding the solution to the linear regression equations that doesn't involve gradients. I will provide another explanation.

When we try to evaluate the machine learning models we assume that our observations are not fully accurate but rather contain some kind of error. For example, imagine measuring a length using some low quality ruler. One of the simplest assumptions would be that we introduce some Gaussian error:

$$ \epsilon \thicksim \mathcal{N}(0, 1) $$

Those parameters are usually safe because we perform some kind of data normalization anyway. We can now compute a probability that our prediction $\hat{y}$ equals our target value $y$ up to this measurement error:

$$ \hat{y} + \epsilon = y $$

We can treat $\hat{y} + \epsilon$ as a new random variable $\widetilde{y} \sim \mathcal{N}(\hat{y}, 1)$. We have just added a constant $\hat{y}$ to our zero-centered random variable $\epsilon$. This random variable $\widetilde{y}$ is our probabilistic estimation of the observation. Instead of stating that for given input $x$ we will observe the output $\hat{y}$ (which would not be true due to the errors) we state that we will most probably observe something around $\hat{y}$. We can compute the likelihood of actually observing the $\hat{y}$ or $y$ as well as any other number using the Gaussian PDF:

$$ p(x) = \frac{1}{{\sigma \sqrt {2\pi } }}exp\left({{\frac{ - \left( {x - \mu } \right)^2 }{2\sigma^2}}}\right) \\ $$

In our case $\mu = \hat{y}$ and $\sigma = 1$:

$$ p(y) = \frac{1}{{\sqrt {2\pi } }}exp\left({{\frac{ - \left( {y - \hat{y} } \right)^2 }{2}}}\right) \\ $$

Note that this is the function that we would actually like to maximize - the probability of observing the true value $y$ given our model. Since our main goal is maximization we can apply a monotone function like logarithm and ignore the constants.

$$ log~p(y) = \frac{ - \left( {y - \hat{y} } \right)^2 }{2} + const $$

Once we get rid of the constant and the minus sign we obtain the squared error term for a single example in our dataset. We can average over all of the examples to get the MSE formula.

$$ MSE(y, \hat{y}) = \frac{1}{2m}\sum_i^m (y - \hat{y})^2 $$

Note that we can similarly derive the formula for the logistic regression loss, i.e. cross-entropy or log-loss.