Divergence as transpose of gradient?
In his online lectures on Computational Science, Prof. Gilbert Strang often interprets the divergence as the "transpose" of the gradient, for example here (at 32:30); however, he does not explain the reason.
How is it that the divergence can be interpreted as the transpose of the gradient?
Here is a considerably less sophisticated point of view than some of the other answers. Recall that the dot product of vectors can be obtained by transposing the first vector. That is, $$ \textbf{v}^T \textbf{w} \;=\; \begin{bmatrix}v_x & v_y & v_z\end{bmatrix}\begin{bmatrix}w_x \\ w_y \\ w_z\end{bmatrix} \;=\; v_x w_x + v_y w_y + v_z w_z \;=\; \textbf{v}\cdot \textbf{w}. $$ (Here we are thinking of column vectors as being the "standard" vectors.)
In the same way, divergence can be thought of as involving the transpose of the $\nabla$ operator. First recall that, if $g$ is a real-valued function, then the gradient of $g$ is given by the formula $$ \nabla g \;=\; \begin{bmatrix}\partial_x \\ \partial_y \\ \partial_z \end{bmatrix}g \;=\; \begin{bmatrix}\partial_x g \\ \partial_y g \\ \partial_z g \end{bmatrix} $$ Similarly, if $F=(F_x,F_y,F_z)$ is a vector field, then the divergence of $F$ is given by the formula $$ \nabla^T F \;=\; \begin{bmatrix}\partial_x & \partial_y & \partial_z\end{bmatrix}\begin{bmatrix}F_x \\ F_y \\ F_z\end{bmatrix} \;=\; \partial_x F_x + \partial_y F_y + \partial_z F_z. $$ Thus, the divergence corresponds to the transpose $\nabla^T$ of the $\nabla$ operator.
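To make this concrete: for an (arbitrarily chosen) example field $F = (x^2,\; xy,\; yz)$, this reads $$ \nabla^T F \;=\; \partial_x(x^2) + \partial_y(xy) + \partial_z(yz) \;=\; 2x + x + y \;=\; 3x + y. $$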
This transpose notation is often advantageous. For example, the formula $$ \nabla^T (gF) \;=\; (\nabla^T g)F \,+\, g(\nabla^T F) $$ (where $\nabla^T g$ is the transpose of the gradient of $g$) seems much more obvious than $$ \text{div} (gF) \;=\; (\text{grad } g)\cdot F \,+\, g\;\text{div } F. $$ Indeed, this is the formula that leads to the integration by parts used in the video (valid when the boundary terms vanish): $$ \int\!\!\int g (\nabla^T F)\,dx\,dy \;=\; -\int\!\!\int (\nabla g)^T F \,dx\,dy. $$
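To spell out that last step (under the usual assumption that the boundary term vanishes, e.g. because $g$ or $F$ is zero on the boundary of the plane region $R$ of integration): integrating the product rule over $R$ and applying the divergence theorem gives $$ \int\!\!\int_R (\nabla^T g)F\,dx\,dy \,+\, \int\!\!\int_R g(\nabla^T F)\,dx\,dy \;=\; \int\!\!\int_R \nabla^T(gF)\,dx\,dy \;=\; \oint_{\partial R} g\,F\cdot\textbf{n}\,ds \;=\; 0, $$ where $\textbf{n}$ is the outward unit normal on $\partial R$; rearranging yields the integration-by-parts formula above.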
A "dual pair" in functional analysis consists of a topological vector space E and its dual space $E'$ of continuous linear functionals, or some subspace of this.
That is (for real vector spaces): for every element $e \in E$ and $e' \in E'$, we can form the pairing $$ \langle e, e' \rangle \in \mathbb{R}. $$ Example: if $E$ is a Hilbert space, then $E' \cong E$ (by the Riesz representation theorem) and the dual pairing is given by the scalar product.
In the case at hand we have two function spaces, and the dual pairing is defined to be $$ \int_{\Omega} u(x, y)\, v(x, y)\, dx\, dy. $$ When you have some operator $$ T: E \to E, $$ it is often possible to define the "transposed operator" $T': E' \to E'$ by the requirement that $$ \langle T e, e' \rangle = \langle e, T' e' \rangle $$ for all $e, e'$. In the context of Hilbert spaces, it is more common to talk about "adjoint operators". The name "transpose" is motivated by the fact that, for a linear operator on a finite-dimensional vector space, the transposed operator is represented by the transpose (the conjugate transpose, for a complex ground field) of the matrix that represents $T$ with respect to a fixed basis.
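For instance, in the finite-dimensional case $E = E' = \mathbb{R}^n$ with the pairing given by the standard scalar product, writing $T$ as a matrix $A$ we get $$ \langle T e, e' \rangle \;=\; (A e)^T e' \;=\; e^T A^T e' \;=\; \langle e, A^T e' \rangle, $$ so $T'$ is represented by the transposed matrix $A^T$, which is where the name comes from.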
In the case at hand, when we write down $$ \int_{\Omega} (- \operatorname{div} \operatorname{grad} u(x, y))\, v(x, y)\, dx\, dy $$ you'll see that this is the same as $$ \int_{\Omega} (\operatorname{grad} u(x, y)) \cdot (\operatorname{grad} v(x, y))\, dx\, dy $$ by integration by parts, if the boundary terms are zero. The $\cdot$ denotes the canonical scalar product of vectors in $\mathbb{R}^n$. So, if the boundary terms are zero, we have
$$ \langle - \operatorname{div} e, e' \rangle = \langle e, \operatorname{grad} e' \rangle, $$ where, strictly speaking, the dual pairing on each side is different, because the first is a dual pairing of functions with values in $\mathbb{R}$, while the second is for functions with values in $\mathbb{R}^2$. But neglecting this technical detail, the operator $\operatorname{grad}$ is in this sense the transposed operator of the operator $-\operatorname{div}$.
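For completeness, the integration by parts behind this is just the product rule $\operatorname{div}(v \operatorname{grad} u) = \operatorname{grad} u \cdot \operatorname{grad} v + v \operatorname{div} \operatorname{grad} u$ combined with the divergence theorem: $$ \int_{\Omega} (-\operatorname{div} \operatorname{grad} u)\, v\, dx\, dy \;=\; \int_{\Omega} \operatorname{grad} u \cdot \operatorname{grad} v\, dx\, dy \;-\; \int_{\partial \Omega} v\, (\operatorname{grad} u \cdot \mathbf{n})\, ds, $$ where $\mathbf{n}$ is the outward unit normal on $\partial \Omega$; the last term is exactly the boundary term that is assumed to vanish (for instance when $v = 0$ on $\partial \Omega$).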
Here is a discrete analogue of the situation, in which one really can literally take the transpose of a matrix analogous to the gradient and get a matrix analogous to the divergence.
Let $G$ be a finite graph with vertex set $V$ and edge set $E$, and let $\mathbb{R}^V, \mathbb{R}^E$ be the vector spaces of functions $V \to \mathbb{R}$ resp. $E \to \mathbb{R}$. (Actually $\mathbb{R}^E$ is slightly more complicated than this: we want to be able to refer to an edge from $u$ to $v$ as both $uv$ and as $vu$, subject to the condition that $f(uv) = -f(vu)$.) Given a function $f \in \mathbb{R}^V$, which we think of as a discrete analogue of a scalar function on $G$, we can define
$$\text{grad}(f)(uv) = f(v) - f(u)$$
and this gives a function $\text{grad}(f) \in \mathbb{R}^E$, which we think of as a discrete analogue of the gradient of $f$; the operator $\text{grad}$ itself, written as a matrix, is more typically known as the (oriented) incidence matrix of $G$. Note here the "fundamental theorem of discrete line integrals": if $v_1 \to v_2 \to \dots \to v_n$ is a path, then
$$\sum_{i=1}^{n-1} \text{grad}(f)(v_i v_{i+1}) = f(v_n) - f(v_1)$$
as expected. Now, both spaces $\mathbb{R}^V, \mathbb{R}^E$ come equipped with inner products given by
$$\langle a, b \rangle_V = \sum_{v \in V} a(v) b(v)$$
and
$$\langle a, b \rangle_E = \sum_{e \in E} a(e) b(e)$$
respectively. Generally speaking, if $A, B$ are a pair of inner product spaces and $T : A \to B$ is a linear operator, then under mild conditions (automatically satisfied in the finite-dimensional case we are in) there exists a unique linear operator $T^{\dagger} : B \to A$ such that
$$\langle Ta, b \rangle_B = \langle a, T^{\dagger} b \rangle_A.$$
$T^{\dagger}$ is the adjoint of $T$, and this is the abstract definition of the transpose of a matrix: you can verify that if you pick orthonormal bases of $A, B$ and write $T$ in terms of those bases, then $T^{\dagger}$ is precisely the transpose of $T$ in the usual sense. Thus the operator $\text{grad} : \mathbb{R}^V \to \mathbb{R}^E$ has an adjoint. If we take the transpose of the matrix representing $\text{grad}$ (with respect to the orthonormal bases given by the functions which are equal to $1$ on a particular vertex or edge and $0$ otherwise), we get $-\text{div}$, where $\text{div}$ is defined, for $g \in \mathbb{R}^E$, by
$$\text{div}(g)(u) = \sum_{uv \in E} g(uv).$$
If we think of $g$ as a flow on the graph $G$, then $\text{div}(g)(u)$ is precisely a measure of the net flow out of the vertex $u$, so it is an appropriate discrete analogue of the divergence; the minus sign matches the continuous answers above, where $\text{grad}$ is the transposed operator of $-\text{div}$. In fact, a discrete analogue of the divergence theorem holds as well.
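If it helps, here is a minimal numerical sketch of this discrete picture in Python/NumPy; the particular graph, the choice of edge orientations, and the random test vectors are arbitrary and only for illustration.

```python
import numpy as np

# A small graph on vertices 0,1,2,3 with oriented edges (tail, head).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n_vertices, n_edges = 4, len(edges)

# "grad" as a matrix B : R^V -> R^E, i.e. the oriented incidence matrix:
# row e has -1 at the tail and +1 at the head, so (B f)(uv) = f(v) - f(u).
B = np.zeros((n_edges, n_vertices))
for row, (u, v) in enumerate(edges):
    B[row, u] = -1.0
    B[row, v] = +1.0

rng = np.random.default_rng(0)
f = rng.standard_normal(n_vertices)   # a "scalar field" on the vertices
g = rng.standard_normal(n_edges)      # a "flow", one value per oriented edge

# Adjoint property: <B f, g>_E = <f, B^T g>_V with the standard inner products.
assert np.isclose((B @ f) @ g, f @ (B.T @ g))

# div(g)(u) = net flow out of u, as in the formula above.
div = np.zeros(n_vertices)
for (u, v), flow in zip(edges, g):
    div[u] += flow   # the flow leaves the tail u ...
    div[v] -= flow   # ... and enters the head v
assert np.allclose(B.T @ g, -div)   # the transpose of grad is -div
print("adjoint check and grad^T = -div both hold")
```

The last assertion is exactly the statement above: with these conventions, taking the transpose of the matrix representing $\text{grad}$ gives $-\text{div}$.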
In multivariable calculus, something similar to the above is happening, except that the spaces and inner products involved are more complicated.