What is the definition of a first order method?
The term "first order method" is often used to categorize a numerical optimization method for say constrained minimization, and similarly for a "second order method".
I wonder what is the exact definition of a first (or second) order method.
Does this term related to the convergence rate of the method or to the fact that one utilize explicitly first (or second) derivatives of the function to be minimized?
More particularly, Newton's method is considered a second order method. It uses second order derivatives (the Hessian) and has a quadratic convergence rate. But what if, for example, the energy is not convex? In this case the Hessian has to be modified to be (symmetric) positive definite. In such a case second order derivatives are used, but it is not clear what exactly the convergence rate is; it can be linear, depending on the actual modification of the Hessian. So is that a second order or a first order method?
Solution 1:
Difference between 1st and 2nd order algorithms:
Any algorithm that requires at least one first-derivative/gradient is a first order algorithm. In the case of a finite sum optimization problem, you may use only the gradient of a single sample, but this is still first order because you need at least one gradient.
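To make this concrete, here is a minimal sketch (my own toy least-squares example, with made-up data and step size) of single-sample SGD; each update touches only one sample's gradient, so the method is first order:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))            # toy design matrix
x_true = rng.normal(size=5)
b = A @ x_true                           # noiseless targets

x = np.zeros(5)
step = 0.01
for t in range(2000):
    i = rng.integers(len(b))             # pick a single sample at random
    grad_i = (A[i] @ x - b[i]) * A[i]    # gradient of 0.5*(a_i^T x - b_i)^2
    x -= step * grad_i                   # first-order update: only a gradient is used

print(np.linalg.norm(x - x_true))        # should be small
```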
A second order algorithm is any algorithm that uses any second derivative (in the scalar case, $f''$; in the multivariate case, any entry of the Hessian). To elaborate, Newton's step requires the Hessian, whose entries are second order derivatives, i.e. $H_{i,j} = \partial_i \partial_j f(x)$.
In general, an algorithm can be classified as an $n^{th}$ order optimization method if it uses the rank-$n$ tensor of $n^{th}$ order derivatives.
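For example, a single Newton step uses the full rank-2 tensor of second derivatives (the Hessian). A minimal sketch, assuming an analytic gradient and Hessian of a made-up test function:

```python
import numpy as np

def f(x):
    return x[0]**4 + x[0]*x[1] + (1.0 + x[1])**2

def grad(x):
    return np.array([4*x[0]**3 + x[1],
                     x[0] + 2*(1.0 + x[1])])

def hess(x):                                    # rank-2 tensor of second derivatives
    return np.array([[12*x[0]**2, 1.0],
                     [1.0,        2.0]])

x = np.array([1.0, 1.0])
for _ in range(20):
    x = x - np.linalg.solve(hess(x), grad(x))   # Newton step: solve H dx = -g

print(x, grad(x))                               # gradient should be ~0 at the end
```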
Important Distinction:
When an algorithm uses an approximated version of the second order information (the Hessian) for optimization, it may or may not be a second order algorithm. I am emphasizing this because, for example, you may approximate the Hessian by the outer product of the gradient, $H \approx \nabla f(x) \nabla f(x)^T$. In this case you are only using first order information, so this would be considered a first order algorithm. On the other hand, if you approximate the Hessian by computing only the diagonal elements $\partial_i \partial_i f(x)$, this would be a second order algorithm.
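A minimal sketch of the two surrogates side by side, on a made-up function $f(x) = \tfrac{1}{2}x_0^2 + x_0 x_1 + \tfrac{1}{4}x_1^4$ (the function and names are my own choices):

```python
import numpy as np

def grad(x):
    # gradient of f(x) = 0.5*x0^2 + x0*x1 + 0.25*x1^4
    return np.array([x[0] + x[1], x[0] + x[1]**3])

x = np.array([1.0, 2.0])
g = grad(x)

H_outer = np.outer(g, g)                    # built from the gradient alone -> first order information
H_diag  = np.diag([1.0, 3.0 * x[1]**2])     # true diagonal entries d_i d_i f(x) -> second order information
H_exact = np.array([[1.0, 1.0],
                    [1.0, 3.0 * x[1]**2]])  # full Hessian, for comparison

print(H_outer)
print(H_diag)
print(H_exact)
```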
The order of an algorithm can also be thought of as the order of the error resulting from the approximation of the derivative. Recall the finite-difference approximation of a derivative: $$\partial_i f(x) \approx \frac{f(x + \epsilon \Delta x_i) - f(x)}{\epsilon}$$
and the second derivative: $$\partial_{ij} f(x) \approx \frac{\partial_j f(x + \epsilon \Delta x_i) - \partial_j f(x)}{\epsilon},$$
where $\Delta x_i$ is the unit vector that equals $1$ in the $i^{th}$ coordinate and $0$ elsewhere. Hence, if you use an $\epsilon$ approximation for each derivative computation, you will inherently get an $\epsilon^2$ error in your optimization algorithm.
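Here is a minimal sketch of these nested finite differences (function, point, and $\epsilon$ are arbitrary choices of mine), comparing the approximations against the exact derivatives:

```python
import numpy as np

def f(x):
    return np.sin(x[0]) * np.exp(x[1])

def fd_grad_i(f, x, i, eps):
    e = np.zeros_like(x); e[i] = 1.0
    return (f(x + eps * e) - f(x)) / eps    # forward difference for d_i f

def fd_hess_ij(f, x, i, j, eps):
    e = np.zeros_like(x); e[i] = 1.0
    return (fd_grad_i(f, x + eps * e, j, eps) - fd_grad_i(f, x, j, eps)) / eps

x = np.array([0.5, 0.3])
eps = 1e-4
print(fd_grad_i(f, x, 0, eps), np.cos(x[0]) * np.exp(x[1]))      # approx vs exact d_0 f
print(fd_hess_ij(f, x, 0, 1, eps), np.cos(x[0]) * np.exp(x[1]))  # approx vs exact d_0 d_1 f
```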
Because it is touched on in your question, I would like to add that first order algorithms do not necessarily converge linearly: Stochastic Gradient Descent (SGD) converges sub-linearly, and many second order algorithms do not converge quadratically.
Solution 2:
"First-order", in the term "first-order method", does not refer to the convergence rate of the method. It refers to the fact that only first-order derivatives are used by the method.
Solution 3:
As far as I know, there is no exact, precise definition. Personally, I refer to first-order methods as those that use first-derivative information and second-order methods as those that use second-derivative information. I also count as second-order those methods that use second-order derivative information without forming the entire second derivative. This means that something like a Gauss-Newton approximation to the Hessian from a least-squares problem would be considered a second-order method. For example, if we have the problem
$$ J(x) = \frac{1}{2}\|g(x)\|^2, $$
the exact Hessian applied to a direction $\partial x$ is
$$ \nabla^2 J(x)\partial x = (g^{\prime\prime}(x)\partial x)^*g(x) + g^\prime(x)^*g^\prime(x)\partial x. $$
I'd still consider a Newton method based on the Hessian approximation
$$ H = g^\prime(x)^*g^\prime(x) $$
to be a second-order method. Technically, we only see first derivatives here, but to obtain this expression we needed information about how the second derivative of $J$ is constructed, not just about $g$. Also, under the right assumptions, we can get superlinear convergence with this approximation.
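As a minimal sketch of this idea (the toy residual, data, and starting point are my own), here is a Gauss-Newton iteration for $J(x) = \frac{1}{2}\|g(x)\|^2$ that solves with the surrogate $H = g^\prime(x)^*g^\prime(x)$ and drops the $(g^{\prime\prime}(x)\partial x)^*g(x)$ term:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = np.exp(-1.5 * t) + 0.3                    # synthetic data generated with x = (1.5, 0.3)

def g(x):                                     # residual of the model exp(-x0*t) + x1
    return np.exp(-x[0] * t) + x[1] - y

def jac(x):                                   # g'(x), first derivatives only
    return np.column_stack([-t * np.exp(-x[0] * t), np.ones_like(t)])

x = np.array([1.0, 0.0])
for _ in range(10):
    J, r = jac(x), g(x)
    dx = np.linalg.solve(J.T @ J, -J.T @ r)   # (g'^T g') dx = -g'^T g
    x = x + dx

print(x)                                      # should approach (1.5, 0.3)
```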
All that said, it is not correct to say that we need a Hessian that is positive definite everywhere to obtain quadratic convergence. If the second-order sufficient conditions are satisfied, the Hessian will be positive definite near the optimal solution, so we can obtain quadratic convergence locally. Really, Newton's method is only guaranteed to converge quadratically within a certain radius (unless we have something nice like a self-concordant function), so convergence-rate analysis doesn't generally apply when we're far from the solution. A well-designed globalization strategy such as a trust-region or line-search method should guarantee convergence to a stationary point from an arbitrary starting point, but such strategies generally say nothing about the convergence rate and do not require the Hessian to be positive definite everywhere.
Really, the game is to rely on globalization to get us close to the optimal solution and then let Newton's method take over and not interfere with these iterates. Since the second-order optimality conditions tell us that near the solution the Hessian will be positive definite, we can then rely on our convergence rate analysis under this assumption.
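A minimal sketch of that game, assuming a made-up smooth objective, a plain Armijo backtracking line search for globalization, and a crude fallback to the steepest-descent direction when the Newton direction is not a descent direction (all choices are mine, not a robust implementation):

```python
import numpy as np

def f(x):
    return (x[0] - 1.0)**2 + 10.0 * (x[1] - x[0]**2)**2      # Rosenbrock-like test function

def grad(x):
    return np.array([2.0*(x[0] - 1.0) - 40.0*x[0]*(x[1] - x[0]**2),
                     20.0*(x[1] - x[0]**2)])

def hess(x):
    return np.array([[2.0 - 40.0*(x[1] - 3.0*x[0]**2), -40.0*x[0]],
                     [-40.0*x[0],                        20.0]])

x = np.array([-1.2, 1.0])
for _ in range(50):
    g, H = grad(x), hess(x)
    dx = np.linalg.solve(H, -g)                       # Newton direction
    if g @ dx >= 0.0:                                 # not a descent direction: fall back to -g
        dx = -g
    t = 1.0
    while f(x + t*dx) > f(x) + 1e-4 * t * (g @ dx):   # Armijo backtracking (globalization)
        t *= 0.5
    x = x + t*dx                                      # near the solution, t = 1 is accepted

print(x, grad(x))                                     # should approach (1, 1) with ~zero gradient
```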