Why do we use $x - y$ rather than $x + y$ in the definition of the convolution? Is it just convention? (If we are thinking of convolutions as weighted averages, for instance against "good kernels," it should make no difference.)

Why $(f * g) (x) = \int f(y) g(x - y) dy$ rather than $(f * g) (x) = \int f(y) g(x + y) dy$?

Edit: I'm finding it really hard to choose a best answer. There are at least three very good ones here.


Intuitively, and abusing the notation a bit, you can consider the convolution as

$$ (f*g)(x) = \int_{p+q=x} f(p)g(q) $$

This makes it clear that $f*g = g*f$. On the other hand with your alternative definition we would get $$ (f*'g)(x) = \int_{q-p=x} f(p)g(q) $$ and therefore $(f*'g)(x) = (g*'f)(-x)$, which is untidy for no good reason.
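The symmetry claim can be checked numerically on finitely supported functions. The following is a minimal sketch, assuming functions on the integers stored as dictionaries; the names `conv_sum`, `conv_diff`, and the sample values of `f` and `g` are made up for illustration.

```python
from collections import defaultdict

def conv_sum(f, g):
    """Usual convolution: collect f(p)*g(q) over all pairs with p + q = x."""
    out = defaultdict(int)
    for p, f_p in f.items():
        for q, g_q in g.items():
            out[p + q] += f_p * g_q
    return dict(out)

def conv_diff(f, g):
    """Alternative definition: collect f(p)*g(q) over all pairs with q - p = x."""
    out = defaultdict(int)
    for p, f_p in f.items():
        for q, g_q in g.items():
            out[q - p] += f_p * g_q
    return dict(out)

# Hypothetical finitely supported functions on the integers.
f = {0: 1, 1: 2, 3: 5}
g = {0: 4, 2: -1}

commutes = conv_sum(f, g) == conv_sum(g, f)

fg = conv_diff(f, g)
gf = conv_diff(g, f)
# Reflect g *' f about the origin: x -> -x.
gf_reflected = {-x: value for x, value in gf.items()}
```

Here `commutes` is true, while `fg` differs from `gf` and agrees only with the reflected version `gf_reflected`.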


Consider the discrete analogue: Given two functions $a:\>k\mapsto a(k)$ and $b:\>l\mapsto b(l)$ we are collecting (i.e., summing up) for given $r$ all products $a(k)\,b(l)$ where $k+l=r$. This is the right thing to do, e.g., when multiplying two power series $$a(z):=\sum_{k=0}^\infty a_k z^k, \quad b(z):=\sum_{l=0}^\infty b_lz^l\ .$$ Then $c(z):=a(z)b(z)$ can be written as $c(z)=\sum_{r=0}^\infty c_r z^r$ with $$c_r:=\sum\nolimits_{k+l=r} a_k b_l=\sum_{l=0}^r a_{r-l}\, b_l\qquad(r\geq0)\ .$$ This is expressed by saying that the sequence $c:=(c_r)_{r\geq0}$ is the convolution of the two sequences $a:=(a_k)_{k\geq0}$ and $b:=(b_l)_{l\geq0}$, in short: $c=a*b$.
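For finite sequences (polynomials) the formula for $c_r$ can be tried out directly. This is only a sketch: the helper names `convolve` and `evaluate` and the sample coefficients are assumptions chosen for illustration.

```python
def convolve(a, b):
    """Return c with c_r = sum over k + l = r of a_k * b_l."""
    c = [0] * (len(a) + len(b) - 1)
    for k, a_k in enumerate(a):
        for l, b_l in enumerate(b):
            c[k + l] += a_k * b_l
    return c

def evaluate(coeffs, z):
    """Evaluate the polynomial with the given coefficients at z."""
    return sum(c_r * z**r for r, c_r in enumerate(coeffs))

a = [1, 2, 3]        # a(z) = 1 + 2z + 3z^2
b = [4, 0, 5]        # b(z) = 4 + 5z^2
c = convolve(a, b)   # coefficients of a(z) * b(z)
```

Evaluating `c` at any point gives the same value as multiplying the evaluations of `a` and `b` at that point, which is exactly the statement $c(z)=a(z)b(z)$.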

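A similar argument applies to the sum of two independent random variables $X$ and $Y$ that take the values $k$ and $l$ with probabilities $p_k$ and $q_l$, respectively: the probability that $X+Y=r$ is $\sum_{k+l=r} p_k\, q_l$, so the distribution of $X+Y$ is the convolution of the distributions of $X$ and $Y$.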
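As an illustration of the probabilistic reading, here is a small sketch for two fair dice. The dice and the helper name `pmf_of_sum` are assumptions made for the example; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def pmf_of_sum(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    P(X + Y = r) is the sum of p_k * q_l over all pairs with k + l = r.
    """
    out = {}
    for k, p_k in p.items():
        for l, q_l in q.items():
            out[k + l] = out.get(k + l, 0) + p_k * q_l
    return out

# A fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = pmf_of_sum(die, die)
```

The result assigns probability $1/6$ to a total of $7$ and $1/36$ to a total of $2$, and the probabilities add up to $1$.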

Translating this into a continuous setting we have $$(f*g)(x)=\int_{-\infty}^\infty f(x-t)\,g(t)\ dt\ ,$$ assuming that the integral on the right hand side makes sense.


A simple example helps:

Suppose the impulse response $g(x)$ is zero except at $x=10$, where $g(10) = 1$. This could mean "the dog barks 10 seconds after he has seen a cat".

Then the convolution could be explained as: The volume at which the dog is barking at time t is the amount of cats he has seen 10 seconds before time $t$. Which is $t$ minus $10$ seconds.