Recently, while learning about regularization methods in machine learning, I came across orthogonal polynomials (e.g. the Legendre polynomials). I looked them up on the Internet and found formulas, but no intuition behind them. Can someone explain in simple terms what these orthogonal polynomials are and when we should use them?


Okay, here is a primer on the stuff you need to know to see why we like Orthogonal Polynomials.


Vectors

In physics, vectors are usually things that live in three-dimensional space. In computer science, they tend to be lists of numbers. In mathematics, a vector is a thing that lives in a vector space. A vector space is designed so that all the simple things you want to do to vectors work: you can add them, there's a zero and additive inverses, and you can scale them by multiplying by numbers, which are elements of the field of scalars (which we can assume is $\mathbb{R}$ for the purposes of this discussion, so we are talking about real vector spaces). All the usual properties carry over from the vector space $\mathbb{R}^3$ over the field $\mathbb{R}$ that physics uses; what does not carry over at this stage is the notion of a product.

A vector space has a dimension associated to it, which is the smallest number of fixed vectors $v_i$ you need to generate the whole thing by using linear combinations $\sum_i \alpha_i v_i$, where the $\alpha_i$ are scalars (e.g. $\mathbb{R}^n$ has dimension $n$). This number is the most important quantity associated to a vector space, because one can prove that over the same field, if two vector spaces have the same dimension, they are isomorphic. A set of $v_i$ of minimal size that generates the whole space in this way is called a basis.

The sensible sort of map on vector spaces is one that keeps the addition and the scalar multiplication structure intact: $$ L\left(\sum_i \alpha_i u_i\right) = \sum_i \alpha_i L(u_i) $$ for any set of vectors $\{u_i\}$. These are called linear maps.

The definition of vector space covers the spaces we expect, but includes many objects that we would not initially expect to be vectors:

  • $\mathbb{R}^n$ is a vector space of dimension $n$, as it should be.
  • Polynomials of degree $n$ or less are a vector space of dimension $n+1$ (addition is defined pointwise). An example of a linear map is sending a polynomial $p$ to its value at a point, $p(a)$; see the sketch after this list.
  • Polynomials of all degrees are also a vector space.
  • Continuous functions on an interval $[a,b]$ (written $C[a,b]$) form a vector space using pointwise addition.
  • Infinite sequences of real numbers $(a_1,a_2,\dotsc)$ are also a vector space.
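
To make the polynomial example concrete, here is a minimal sketch (the representation by coefficient vectors and the name `evaluate` are just illustrative choices): a polynomial of degree at most $3$ is stored as its coefficient vector, and evaluation at a point is a linear map in exactly the sense above.

```python
import numpy as np

# A polynomial of degree <= 3 is stored as its coefficient vector (a_0, ..., a_3),
# so vector addition and scalar multiplication are just the ordinary numpy operations.
p = np.array([1.0, 2.0, 0.0, -1.0])   # 1 + 2x - x^3
q = np.array([0.0, 1.0, 3.0, 0.0])    # x + 3x^2

def evaluate(coeffs, a):
    """The linear map 'evaluate at the point a', sending p to p(a)."""
    return sum(c * a**k for k, c in enumerate(coeffs))

a = 2.0
# Linearity: (5p + q)(a) == 5 p(a) + q(a)
print(np.isclose(evaluate(5 * p + q, a), 5 * evaluate(p, a) + evaluate(q, a)))  # True
```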

The last three of these are infinite-dimensional, so the previous discussion does not apply completely, but we'll need infinite-dimensional spaces. It is worth remembering that an ordinary basis is defined to generate the space using finite linear combinations, so an infinite-dimensional space can have a very weird (and very big) basis in this sense. (Polynomials aren't so bad because they are finite linear combinations already, so $\{1,x,x^2,\dotsc\}$ is a perfectly reputable basis.)

Inner products and Orthogonality

The easiest product on $\mathbb{R}^3$ to generalise is the dot product $u \cdot v = \sum_{i=1}^3 u_i v_i$. Over a real vector space $V$, an inner product is a function $\langle \cdot,\cdot\rangle : V \times V \to \mathbb{R}$ that is

  • symmetric ($\langle u,v\rangle=\langle v,u\rangle$),
  • positive definite ($\langle u,u\rangle>0$ for all $u \neq 0$), and
  • bilinear. (Linear in each argument. Symmetry means we only need to check the first argument, $\langle \lambda u+\mu v,w\rangle = \lambda \langle u,w\rangle + \mu \langle v,w\rangle $ for any scalars $\lambda,\mu$).

The ordinary dot product is one of these. The example we shall need to worry about is $$ \langle f,g \rangle = \int_a^b f(x) g(x) w(x) \, dx, $$ where $f,g \in C[a,b]$ and $w$ is a positive function.
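
If it helps to see this inner product in action, here is a minimal numerical sketch, assuming the interval $[-1,1]$ and the simplest weight $w=1$ (the function name `inner` is just for illustration):

```python
import numpy as np
from scipy.integrate import quad

def inner(f, g, w=lambda x: 1.0, a=-1.0, b=1.0):
    """<f, g> = integral of f(x) g(x) w(x) over [a, b]."""
    value, _ = quad(lambda x: f(x) * g(x) * w(x), a, b)
    return value

print(inner(lambda x: x, lambda x: x))      # <x, x> = 2/3
print(inner(lambda x: 1.0, lambda x: x))    # <1, x> = 0 (to numerical precision)
```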

A vector space equipped with one of these is called an inner-product space.

The inner product essentially allows us to generalise the idea of angles, so we borrow the terminology from geometry: two vectors $u,v$ are said to be orthogonal if $$ \langle u,v\rangle = 0. $$ A vector $u$ is normalised if $\langle u,u\rangle=1$. A set of vectors where each vector is normalised and each pair of vectors is orthogonal is called orthonormal.

In a finite-dimensional inner product space, if I give you any old basis, there is an algorithm (called Gram–Schmidt) that will produce an orthonormal basis.
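
As a rough sketch of how the algorithm works (here in $\mathbb{R}^3$ with the ordinary dot product; the name `gram_schmidt` and the particular basis are just illustrative): subtract off the components along the vectors already produced, then normalise what is left.

```python
import numpy as np

def gram_schmidt(vectors):
    """Turn a basis (the rows of `vectors`) into an orthonormal basis."""
    ortho = []
    for v in vectors:
        # remove the components along the vectors found so far...
        for q in ortho:
            v = v - np.dot(q, v) * q
        # ...and normalise what is left
        ortho.append(v / np.linalg.norm(v))
    return np.array(ortho)

basis = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
Q = gram_schmidt(basis)
print(np.allclose(Q @ Q.T, np.eye(3)))   # True: the rows are orthonormal
```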

One final thing: given a vector $u$ and an orthonormal basis $\{v_i\}$, there is a unique way to express $u$ as a linear combination $u=\sum_i \alpha_i v_i$: taking inner products of both sides with $v_j$ gives $$ \langle v_j,u \rangle =\sum_i \alpha_i \langle v_i,v_j \rangle = \alpha_j, $$ using linearity of the inner product and orthonormality of the $v_i$. So $$ u = \sum_i \langle v_i,u \rangle v_i. $$
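
Here is a quick numerical check of that formula, using the rows of a rotation matrix as an orthonormal basis of $\mathbb{R}^3$ (this particular basis and vector are arbitrary choices for illustration):

```python
import numpy as np

theta = 0.3
V = np.array([[np.cos(theta), -np.sin(theta), 0.0],    # v_0
              [np.sin(theta),  np.cos(theta), 0.0],    # v_1
              [0.0,            0.0,           1.0]])   # v_2

u = np.array([2.0, -1.0, 4.0])
alphas = V @ u                # alpha_i = <v_i, u>
reconstruction = alphas @ V   # sum_i alpha_i v_i
print(np.allclose(reconstruction, u))   # True: u = sum_i <v_i, u> v_i
```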

This is where we have to take a short deviation into infinite-dimensional spaces and analysis, where things become rather more complicated.

Hilbert space

A Hilbert space is a complete inner-product space. Complete technically means that Cauchy sequences converge, but you can think of it as a space where every sequence that looks as though it ought to converge (its terms eventually get arbitrarily close to one another) actually converges to something still in the space. Given an inner-product space, we can make a Hilbert space by adding the appropriate limits in. Here, convergence means that $v_n \to v$ if $\langle v_n-v,v_n-v \rangle \to 0$ (this is called either convergence in mean square or strong convergence, but it doesn't really matter since it's the only sort we consider).

Any finite-dimensional inner-product space is a Hilbert space (no converging needs to happen), but this is not true in infinite dimensions. $C[a,b]$ is not a Hilbert space (one can construct continuous functions that converge to a function with a jump in it, for example), but we can complete it into a Hilbert space using the inner product given above; this new space is called $L^2[a,b]$, the space of square-integrable functions on $[a,b]$. Note that convergence is not pointwise: two functions in $L^2[a,b]$ "look the same" if they differ only on a finite set of points (more is true, but we shan't need the fully general statement).
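
For instance, here is a small numerical illustration of the "continuous functions converging to a jump" phenomenon: continuous ramps $f_n(x)=\max(-1,\min(1,nx))$ converge in mean square to the sign function on $[-1,1]$ (the particular functions are just a convenient choice).

```python
import numpy as np
from scipy.integrate import quad

f = lambda x, n: np.clip(n * x, -1.0, 1.0)   # a continuous ramp
target = lambda x: np.sign(x)                # has a jump at 0, so it is not in C[-1, 1]

for n in [1, 10, 100, 1000]:
    err, _ = quad(lambda x: (f(x, n) - target(x)) ** 2, -1, 1, points=[0.0])
    print(n, err)   # the mean-square error behaves like 2/(3n), so it tends to 0
```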

An orthonormal basis in a Hilbert space $H$ is a bit different: we mean an orthonormal set of vectors such that any element $u$ of $H$ can be approximated as closely as we like by a finite linear combination of elements of the basis (we say that these finite linear combinations are dense in $H$). In the case of $L^2[0,2\pi]$ with $w=1$, the classic example is Fourier series: any element of $L^2[0,2\pi]$ can be written as $\sum_{n=-\infty}^{\infty} a_n e^{inx}$, so the functions $e^{inx}$ (scaled by $1/\sqrt{2\pi}$ to normalise them) form an orthonormal basis of $L^2[0,2\pi]$.
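
As a small numerical check of this, here is a sketch using a step function as a stand-in for a general element of $L^2[0,2\pi]$, with the convention $a_n=\frac{1}{2\pi}\int_0^{2\pi}f(x)e^{-inx}\,dx$; the mean-square error of the partial sums decreases towards zero.

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: 1.0 if x < np.pi else -1.0     # a step function in L^2[0, 2*pi]

def coeff(n):
    # a_n = (1 / 2*pi) * integral of f(x) e^{-inx} dx
    re, _ = quad(lambda x: f(x) * np.cos(n * x), 0, 2 * np.pi, points=[np.pi])
    im, _ = quad(lambda x: f(x) * np.sin(n * x), 0, 2 * np.pi, points=[np.pi])
    return (re - 1j * im) / (2 * np.pi)

for N in [1, 5, 25]:
    a = {n: coeff(n) for n in range(-N, N + 1)}
    S = lambda x: sum(a[n] * np.exp(1j * n * x) for n in a).real   # partial sum
    err, _ = quad(lambda x: (f(x) - S(x)) ** 2, 0, 2 * np.pi, points=[np.pi], limit=200)
    print(N, err)   # shrinks as N grows
```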


Orthogonal Polynomials

With the preliminaries out of the way, we can finally move on to what orthogonal polynomials are. Given $L^2[a,b]$ with a weight $w$ such that polynomials have finite norm (e.g. $w=1$ on $[-1,1]$, or $w=e^{-x}$ on $[0,\infty)$), polynomials form a dense subspace of $L^2[a,b]$ (this is a consequence of something called the Stone–Weierstrass theorem, but never mind about that). Hence, one might want to approximate functions in $L^2[a,b]$ by polynomials. To do so, one requires an orthogonal set of polynomials, and this is where orthogonal polynomials come in.

Definition: Let $\langle f,g \rangle = \int_a^b f(x) g(x) w(x) \, dx$ (i.e. the inner product on $L^2[a,b]$ with weight $w$). A sequence of polynomials $(P_n)_{n\geqslant 0}$ is called a sequence of orthogonal polynomials for this space if

  • $P_n$ has degree $n$.
  • $\langle P_n,P_m \rangle = 0 $ when $n \neq m$.

One can show that each polynomial is unique up to an overall scaling. How? Apply Gram–Schmidt to the sequence $1,x,x^2,\dotsc$! For example, a simple computation of this kind gives the first few Legendre polynomials, $1,x,(3x^2-1)/2,\dotsc$. For various reasons, the scaling of orthogonal polynomials is not usually chosen so that $\langle P_n,P_n \rangle = 1$. The Legendre polynomials, for example, are normally defined to be scaled so that $P_n(1)=1$.
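Here is a sketch of that computation in sympy, assuming the weight $w=1$ on $[-1,1]$ and rescaling at the end with the $P_n(1)=1$ convention mentioned above (the helper name `inner` is just for illustration):

```python
import sympy as sp

x = sp.symbols('x')

def inner(f, g):
    # the inner product on [-1, 1] with weight w(x) = 1
    return sp.integrate(f * g, (x, -1, 1))

ortho = []
for p in [sp.Integer(1), x, x**2, x**3]:
    # Gram-Schmidt: subtract the components along the polynomials found so far
    for q in ortho:
        p = sp.expand(p - inner(p, q) / inner(q, q) * q)
    ortho.append(p)

# rescale so that P_n(1) = 1, the usual Legendre convention
legendre = [sp.expand(p / p.subs(x, 1)) for p in ortho]
print(legendre)   # [1, x, 3*x**2/2 - 1/2, 5*x**3/2 - 3*x/2]
```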

Gram–Schmidt is rather inefficient when you want a lot of terms. More commonly, orthogonal polynomials are computed from a three-term recurrence relation or defined by generating functions of one sort or another, which is often where the funny scalings come from.
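
For the Legendre polynomials the generating function is $(1-2xt+t^2)^{-1/2} = \sum_{n\geqslant 0} P_n(x)t^n$, and a quick sympy sketch reads the polynomials off as coefficients of the expansion in $t$ (the truncation order is an arbitrary choice):

```python
import sympy as sp

x, t = sp.symbols('x t')
g = 1 / sp.sqrt(1 - 2*x*t + t**2)           # generating function for the Legendre polynomials
expansion = sp.series(g, t, 0, 5).removeO()

for n in range(5):
    print(n, sp.expand(expansion.coeff(t, n)))
# 0 1
# 1 x
# 2 3*x**2/2 - 1/2
# ...
```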

Given a function in $L^2[a,b]$, we can expand it in terms of orthogonal polynomials in exactly the same way as one employs a Fourier series: it gives an expansion of the function in terms of simpler functions that are orthogonal to one another.

The usefulness of this expansion over the Fourier expansion often lies in the polynomials being solutions to certain differential equations. For example, Legendre polynomials appear entirely naturally when solving Laplace's equation in spherical coordinates, and one can use them to expand an axisymmetric solution to Laplace's equation in what is called the multipole expansion. This ability to expand functions in terms of solutions of a differential equation is a specific case of a more general phenomenon, which goes under the general title of Sturm–Liouville theory. Orthogonal polynomials are a particularly nice example of this.

Perhaps the best way to think of expansion in terms of orthogonal polynomials is as follows. Write $f(x) - \sum_{i=0}^n \alpha_i P_i(x) = R(x)$, where $R(x)$ is the error in the approximation. The coefficients $\alpha_i$ are chosen so that, for a given $n$, $ \int_a^b (R(x))^2 w(x) \, dx $ is minimised: $\sum_{i=0}^n \alpha_i P_i$ is the best approximation of $f$ (in this sense) by a polynomial of degree at most $n$. The orthogonality makes it easy to derive what the coefficients must be, and the answer $\alpha_i = \langle f,P_i \rangle / \langle P_i,P_i\rangle$ corresponds precisely with what we obtain by analogy with both the Fourier series idea and the calculation in finite-dimensional space.
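
To see this numerically, here is a sketch expanding $f(x)=e^x$ on $[-1,1]$ (weight $w=1$) in Legendre polynomials, using $\alpha_i = \langle f,P_i\rangle/\langle P_i,P_i\rangle$ together with the standard value $\langle P_i,P_i\rangle = 2/(2i+1)$; the choice of $f$ and the degree are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

f, N = np.exp, 5   # approximate e^x by a polynomial of degree at most 5

alphas = []
for i in range(N + 1):
    num, _ = quad(lambda x, i=i: f(x) * eval_legendre(i, x), -1, 1)   # <f, P_i>
    alphas.append(num / (2.0 / (2 * i + 1)))                          # divide by <P_i, P_i>

approx = lambda x: sum(a * eval_legendre(i, x) for i, a in enumerate(alphas))

# the weighted mean-square error int (f - approx)^2 w dx that the alphas minimise
err, _ = quad(lambda x: (f(x) - approx(x)) ** 2, -1, 1)
print(err)   # tiny, since e^x is smooth
```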

One may also understand $\sum_{i=0}^n \alpha_i P_i$ as the projection of $f$ onto the finite-dimensional space spanned by $P_0,P_1,\dotsc,P_n$; we note the analogy with orthogonal projection of a point onto a plane in three dimensions, which also gives the closest point in the plane to the original point.
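
The three-dimensional analogy can be sketched directly (the vectors below are arbitrary choices): orthonormalise a spanning set for the plane, then sum $\langle q_i, u\rangle q_i$, exactly as for the orthogonal polynomials.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 1.0])   # the plane is spanned by v1 and v2
v2 = np.array([0.0, 1.0, 1.0])
u = np.array([2.0, 3.0, 5.0])    # the point to project

Q, _ = np.linalg.qr(np.column_stack([v1, v2]))           # orthonormalise (QR does Gram-Schmidt for us)
proj = sum(Q[:, i].dot(u) * Q[:, i] for i in range(2))   # sum_i <q_i, u> q_i

# the residual is orthogonal to the plane, so proj is the closest point in the plane to u
print(np.allclose([(u - proj).dot(v1), (u - proj).dot(v2)], [0.0, 0.0]))   # True
```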