Logistic regression - Prove That the Cost Function Is Convex
I'm reading the HoleHouse Stanford Machine Learning notes on logistic regression.
You can do a find on "convex" to see the part that relates to my question.
Background:
$h_\theta(X) = sigmoid(\theta^T X)$ --- hypothesis/prediction function
$y \in \{0,1\}$
Normally, we would have the cost function for one sample $(X,y)$ as:
$y(1 - h_\theta(X))^2 + (1-y)(h_\theta(X))^2$
It's just the squared distance from 1 or 0 depending on y.
However, the lecture notes mention that this is a non-convex function so it's bad for gradient descent (our optimisation algorithm).
So, we come up with one that is supposedly convex:
$-y \log(h_\theta(X)) - (1 - y) \log(1 - h_\theta(X))$
You can see why this makes sense if we plot -log(x) from 0 to 1:
[Plot of $-\log(x)$ for $x$ from 0 to 1: the curve decreases from $\infty$ to 0.]
i.e. if y = 1 then the cost goes from $\infty$ to 0 as the hypothesis/prediction moves from 0 to 1.
My question is:
How do we know that this new cost function is convex?
Here is an example of a hypothesis function that will lead to a non-convex cost function:
$h_\theta(x) = sigmoid(1 + x^2 + x^3)$
leading to cost function (for y = 1):
$-log(sigmoid(1 + x^2 + x^3))$
which is a non-convex function, as we can see if we graph it.
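(A quick numerical sketch of that non-convexity using NumPy; the grid and step size are my own choices. A central-difference estimate of the second derivative changes sign, which a convex function would never do.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x):
    # cost for y = 1 with the hypothesis sigmoid(1 + x^2 + x^3)
    return -np.log(sigmoid(1 + x**2 + x**3))

xs = np.linspace(-3, 3, 601)
h = xs[1] - xs[0]
# central-difference estimate of g''(x)
second_diff = (g(xs[2:]) - 2 * g(xs[1:-1]) + g(xs[:-2])) / h**2
assert second_diff.min() < 0 < second_diff.max()  # both signs appear -> non-convex
```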
Here I will prove that the loss function below is a convex function. \begin{equation} L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0)) - (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0)) \right) \end{equation}
Then I will show that the loss function below, which the questioner proposed, is NOT a convex function.
\begin{equation} L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2 + (1-y^i) \sigma(\theta^T x^i + \theta_0)^2 \right) \end{equation}
To prove that solving a logistic regression using the first loss function is solving a convex optimization problem, we need to prove two facts.
$ \newcommand{\reals}{{\mathbf{R}}} \newcommand{\preals}{{\reals_+}} \newcommand{\ppreals}{{\reals_{++}}} $
Suppose that $\sigma: \reals \to \ppreals$ is the sigmoid function defined by
\begin{equation} \sigma(z) = 1/(1+\exp(-z)) \end{equation}
The functions $f_1:\reals\to\reals$ and $f_2:\reals\to\reals$ defined by $f_1(z) = -\log(\sigma(z))$ and $f_2(z) = -\log(1-\sigma(z))$ respectively are convex functions.
A (twice-differentiable) convex function of an affine function is a convex function.
Proof) First, we show that $f_1$ and $f_2$ are convex functions. Since \begin{eqnarray} f_1(z) = -\log(1/(1+\exp(-z))) = \log(1+\exp(-z)), \end{eqnarray} \begin{eqnarray} \frac{d}{dz} f_1(z) = -\exp(-z)/(1+\exp(-z)) = -1 + 1/(1+\exp(-z)) = -1 + \sigma(z), \end{eqnarray} the derivative of $f_1$ is a monotonically increasing function, so $f_1$ is a (strictly) convex function (Wiki page for convex function).
Likewise, since \begin{eqnarray} f_2(z) = -\log(\exp(-z)/(1+\exp(-z))) = \log(1+\exp(-z)) + z = f_1(z) + z, \end{eqnarray} \begin{eqnarray} \frac{d}{dz} f_2(z) = \frac{d}{dz} f_1(z) + 1. \end{eqnarray} Since the derivative of $f_1$ is a monotonically increasing function, so is the derivative of $f_2$, hence $f_2$ is also a (strictly) convex function. This proves the first fact.
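A small numerical sanity check of this argument (a sketch, not part of the proof; the grid limits are arbitrary): the derivatives $f_1'(z) = \sigma(z) - 1$ and $f_2'(z) = \sigma(z)$ should both be non-decreasing in $z$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

zs = np.linspace(-20, 20, 2001)
f1_prime = sigmoid(zs) - 1.0   # derivative of f1(z) = -log(sigmoid(z))
f2_prime = sigmoid(zs)         # derivative of f2(z) = -log(1 - sigmoid(z)) = f1'(z) + 1

assert (np.diff(f1_prime) >= 0).all()  # non-decreasing derivative -> f1 convex
assert (np.diff(f2_prime) >= 0).all()  # non-decreasing derivative -> f2 convex
```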
Now we prove the second claim. Let $f:\reals^m\to\reals$ be a twice-differentiable convex function, let $A\in\reals^{m\times n}$ and $b\in\reals^m$, and define $g:\reals^n\to\reals$ by $g(y) = f(Ay + b)$. Then the gradient of $g$ with respect to $y$ is \begin{equation} \nabla_y g(y) = A^T \nabla_x f(Ay+b) \in \reals^n, \end{equation} and the Hessian of $g$ with respect to $y$ is \begin{equation} \nabla_y^2 g(y) = A^T \nabla_x^2 f(Ay+b) A \in \reals^{n \times n}. \end{equation} Since $f$ is a convex function, $\nabla_x^2 f(x) \succeq 0$, i.e., it is a positive semidefinite matrix for all $x\in\reals^m$. Then for any $z\in\reals^n$, \begin{equation} z^T \nabla_y^2 g(y) z = z^T A^T \nabla_x^2 f(Ay+b) A z = (Az)^T \nabla_x^2 f(Ay+b) (A z) \geq 0, \end{equation} hence $\nabla_y^2 g(y)$ is also a positive semidefinite matrix for all $y\in\reals^n$, so $g$ is convex (Wiki page for convex function).
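A numerical illustration of this second fact with randomly generated matrices (the sizes are arbitrary): if $H$ is positive semidefinite, then so is $A^T H A$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
B = rng.standard_normal((m, m))
H = B @ B.T                      # random positive semidefinite m x m matrix (stands in for the Hessian of f)
A = rng.standard_normal((m, n))  # random m x n matrix

composed = A.T @ H @ A               # plays the role of the Hessian of g(y) = f(Ay + b)
print(np.linalg.eigvalsh(composed))  # all eigenvalues >= 0 (up to round-off)
```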
Now the objective function to be minimized for logistic regression is \begin{equation} \begin{array}{ll} \mbox{minimize} & L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0)) - (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0)) \right) \end{array} \end{equation} where $(x^i, y^i)$ for $i=1,\ldots, N$ are the $N$ training data. This is a sum of convex functions of linear (hence, affine) functions in $(\theta, \theta_0)$. Since a sum of convex functions is a convex function, this problem is a convex optimization problem.
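Here is a sketch in NumPy (synthetic data, arbitrary sizes) that evaluates this loss and spot-checks convexity in $(\theta, \theta_0)$ via the midpoint inequality $L\big(\tfrac{p+q}{2}\big) \le \tfrac{1}{2}\big(L(p) + L(q)\big)$ for random parameter pairs.

```python
import numpy as np

def logistic_loss(theta, theta0, X, y):
    z = X @ theta + theta0
    # -y*log(sigma(z)) - (1-y)*log(1-sigma(z)), written in the numerically
    # stable form log(1 + exp(-z)) + (1 - y) * z  (= y*f1(z) + (1-y)*f2(z))
    return np.sum(np.logaddexp(0.0, -z) + (1.0 - y) * z)

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.standard_normal((N, d))
y = (rng.random(N) < 0.5).astype(float)

for _ in range(1000):
    p, q = rng.standard_normal(d + 1), rng.standard_normal(d + 1)
    mid = (p + q) / 2
    lhs = logistic_loss(mid[:d], mid[d], X, y)
    rhs = (logistic_loss(p[:d], p[d], X, y) + logistic_loss(q[:d], q[d], X, y)) / 2
    assert lhs <= rhs + 1e-9  # midpoint convexity holds for every random pair
```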
Note that if we maximized this loss function instead of minimizing it, the problem would NOT be a convex optimization problem. So the direction is critical!
Note also that, whether the algorithm we use is stochastic gradient descent, plain gradient descent, or any other optimization algorithm, it is solving this convex optimization problem. And even if we use nonconvex nonlinear kernels for the feature transformation, it is still a convex optimization problem, since the loss function is still a convex function in $(\theta, \theta_0)$.
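To illustrate that last point, here is a sketch with an explicit (hypothetical) nonlinear feature map standing in for the feature transformation: the loss is still convex in the parameters, because the parameters still enter only through an affine function.

```python
import numpy as np

def phi(x):
    # a hypothetical nonlinear feature map of a scalar input
    return np.stack([x, x**2, np.sin(x)], axis=1)

def loss(theta, theta0, F, y):
    z = F @ theta + theta0
    return np.sum(np.logaddexp(0.0, -z) + (1.0 - y) * z)  # same stable form as above

rng = np.random.default_rng(1)
x = rng.standard_normal(40)
y = (x > 0).astype(float)
F = phi(x)  # fixed N x 3 feature matrix; constant with respect to the parameters

# same midpoint-convexity spot check, now in the transformed-feature model
for _ in range(500):
    p, q = rng.standard_normal(4), rng.standard_normal(4)
    m = (p + q) / 2
    assert loss(m[:3], m[3], F, y) <= (loss(p[:3], p[3], F, y) + loss(q[:3], q[3], F, y)) / 2 + 1e-9
```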
Now the new loss function proposed by the questioner is \begin{equation} L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2 + (1-y^i) \sigma(\theta^T x^i + \theta_0)^2 \right) \end{equation}
First we show that $f(z) = \sigma(z)^2$ is not a convex function in $z$. If we differentiate this function, we have \begin{equation} f'(z) = \frac{d}{dz} \sigma(z)^2 = 2 \sigma(z) \frac{d}{dz} \sigma(z) = 2 \exp(-z) / (1+\exp(-z))^3. \end{equation} Since $f'(0)=1/4 > 0$, $\lim_{z\to\infty} f'(z) = 0$, and $f'(z)$ is differentiable, the mean value theorem implies that there exists $z_0 > 0$ such that $f''(z_0) < 0$. But a convex function must have a nonnegative second derivative everywhere, therefore $f(z)$ is NOT a convex function.
Now if we let $N=1$, $x^1 = 1$, $y^1 = 0$, $\theta_0=0$, and $\theta\in\reals$, then $L(\theta, 0) = \sigma(\theta)^2$, so $L(\theta,0)$ is not a convex function, which completes the proof!
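A numerical confirmation of this counterexample (a quick sketch; the test points are my own choice): with these settings, $L(\theta, 0) = \sigma(\theta)^2$ violates the midpoint convexity inequality between $\theta = 0$ and $\theta = 10$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def L(theta):
    # the loss with N = 1, x^1 = 1, y^1 = 0, theta_0 = 0
    return sigmoid(theta) ** 2

a, b = 0.0, 10.0
assert L((a + b) / 2) > (L(a) + L(b)) / 2  # midpoint lies ABOVE the chord -> not convex
```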
However, solving a non-convex optimization problem with gradient descent is not necessarily a bad idea. (Almost) every deep learning problem is solved by stochastic gradient descent, because it is practically the only way to solve it (other than evolutionary algorithms).
I hope this is a self-contained (rigorous) proof of the argument. Please leave feedback if anything is unclear or if I have made mistakes.
Thank you. - Sunghee
You are looking at the wrong variable. What we need is for $J(\theta)$ to be convex (as a function of $\theta$), so you need $Cost(h_{\theta}(x), y)$ to be a convex function of $\theta$, not of $x$. Note that the function inside the sigmoid is linear in $\theta$.
Consider a twice-differentiable function of one variable, $f(z)$. If the second derivative of $f(z)$ is (always) non-negative, then $f(z)$ is convex. So consider the function $$ j(z) = -y\log(\sigma(z)) - (1-y)\log(1-\sigma(z)) $$ where $\sigma(z) = sigmoid(z)$ and $0\leq y \leq 1$ is a constant. You can show that $j(z)$ is convex by taking its second derivative. Compare with the case where you instead take $$ k(z) = y(1-\sigma(z))^2 + (1-y)\sigma(z)^2 $$ as in the question: is $k(z)$ convex? And what if you take $\tilde{\sigma}(z) = sigmoid(1+z^2+z^3)$ instead of $\sigma(z)$?
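For reference, carrying out the suggested differentiation of $j(z)$ (using the identity $\frac{d}{dz}\sigma(z) = \sigma(z)(1-\sigma(z))$) gives $$ j'(z) = \sigma(z) - y, \qquad j''(z) = \sigma(z)(1-\sigma(z)) \geq 0, $$ so $j(z)$ is convex for any fixed $0 \leq y \leq 1$.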
Now, the composition of a convex function with a linear function is convex (can you show this?). Note that $Z(\theta) := \theta^T X$ is a linear function in $\theta$ (where $X$ is constant). Therefore, $J(\theta) := j(Z(\theta))$ is convex as a function of $\theta$.