Interpretation of Symmetric Normalised Graph Adjacency Matrix?
I'm trying to follow a blog post about Graph Convolutional Neural Networks. To set up some notation, the blog post denotes a graph $\mathcal{G}$, its adjacency matrix $A$, and its degree matrix $D$.
A section of that blog post then says:
I understand how an adjacency matrix can be row-normalised with $A_{row} = D^{-1}A$, or column normalised with $A_{col} = AD^{-1}$.
My question: is there some intuitive interpretation of a symmetrically normalized adjacency matrix $A_{sym} = D^{-1/2}AD^{-1/2}$?
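To make the question concrete, here is a small numerical sketch (my own, not from the blog post) of the three normalisations on a toy path graph:

```python
import numpy as np

# Toy undirected graph: a path 0 - 1 - 2 (node 1 has degree 2).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
deg = A.sum(axis=1)                      # node degrees
D_inv = np.diag(1.0 / deg)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

A_row = D_inv @ A                        # rows sum to 1
A_col = A @ D_inv                        # columns sum to 1
A_sym = D_inv_sqrt @ A @ D_inv_sqrt      # symmetric; entry (i,k) is a_ik / sqrt(d_i * d_k)

print(A_row.sum(axis=1))   # [1. 1. 1.]
print(A_col.sum(axis=0))   # [1. 1. 1.]
print(A_sym)
```

Rows of $A_{row}$ and columns of $A_{col}$ sum to one, while $A_{sym}$ is symmetric but its rows generally do not sum to one, and I'm looking for the intuition behind that trade-off.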
Solution 1:
TL;DR: $\mathrm{A}_{sym}$ takes a kind of average over your neighbours while accounting for how many neighbours they have (being connected to a node that is connected to every node carries less information than being connected to a node whose only neighbour is you). The square roots ensure that the largest eigenvalue is $\lambda_1=1$, which makes it possible to stack a large number of layers.
As requested, this answer is based on intuition, although Math Stack Exchange might not be the best place for that.
Preliminaries:
- suppose we have a connected graph $\mathcal{G}$
- $\mathcal{E}$ and $\mathcal{V}$ are the sets of edges and vertices respectively
- $\hat{\mathrm{A}}= \mathrm{A} + \mathrm{I} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ the (weighted) adjacency matrix plus the identity
- $\hat{\mathrm{D}} \in \mathbb{Z}^{|\mathcal{V}| \times |\mathcal{V}|}$ is the degree matrix (diagonal matrix with number of neighbours on the diagonal) of $\hat{\mathrm{A}}$
- $\mathrm{W} \in \mathbb{R}^{C \times C}$ is the trainable weight matrix ($C$ is the number of channels at each layer)
- $\mathrm{H}^{l+1}=\sigma(\hat{\mathrm{A}}_{?}\mathrm{H}^{l}\mathrm{W}) \in \mathbb{R}^{|\mathcal{V}| \times C}$ is the input to the $(l+1)^{th}$ layer, where $\hat{\mathrm{A}}_{?}$ denotes some normalisation of $\hat{\mathrm{A}}$. For simplicity let's drop $\sigma$ and $\mathrm{W}$ and consider $\mathrm{H}^{l+1}=\hat{\mathrm{A}}_{?}\mathrm{H}^{l}$.
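To make the propagation rule concrete, here is a minimal numpy sketch of one such layer (the names `normalise_sym` and `gcn_layer`, and the use of ReLU as $\sigma$, are my own choices, not from the blog post):

```python
import numpy as np

def normalise_sym(A_hat):
    """D^{-1/2} A_hat D^{-1/2}, with D the degree matrix of A_hat."""
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def gcn_layer(A_norm, H, W):
    """One propagation step H^{l+1} = sigma(A_norm H^l W), with sigma = ReLU."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy example: 3 nodes, C = 2 channels.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = A + np.eye(3)          # add self-loops: Â = A + I
A_norm = normalise_sym(A_hat)
H = np.random.randn(3, 2)      # node features H^l
W = np.random.randn(2, 2)      # trainable weights (fixed here)
H_next = gcn_layer(A_norm, H, W)
print(H_next.shape)            # (3, 2)
```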
Answer:
Here is an intuitive explanation of what would happen for different $\hat{\mathrm{A}}_{?}$:
- $\hat{\mathrm{A}}$: no normalization. Then $h_{ij}^{l+1}=\sum_{(i,k) \in \mathcal{E}} \hat{a}_{ik} h_{kj}^{l}$, so well-connected nodes end up with larger values. This is hard to optimise, as the node features will live in very different ranges. Moreover, the deeper the network, the larger the values become (well-connected nodes get large values, which makes their neighbours' values larger, which in turn makes their own values larger, ...). The last point is easy to see: $\mathrm{A}$ always has trace $0$, so its eigenvalues sum to $0$, and a non-empty graph therefore has at least one eigenvalue larger than $0$. After adding the identity, $\hat{\mathrm{A}}$ has an eigenvalue larger than $1$, so the largest eigenvalue explodes as layers are stacked (the eigenvalue claims in this list are checked numerically in the sketch after it).
- $\mathrm{A}_{row}=\mathrm{D}^{-1}\mathrm{A}$ is the row-normalized matrix (i.e. normalized by the degree of the start node of each edge). This corresponds to $h_{ij}^{l+1}=\frac{1}{\mathrm{d}_{i,i}}\sum_{(i,k) \in \mathcal{E}}a_{ik} h_{kj}^{l}$, i.e. taking the mean value of the neighbours. This makes optimisation easier, as every node stays in a similar range independently of the depth of the network ($\mathrm{A}_{row}$ is a stochastic matrix, so its largest eigenvalue is $1$). One issue is that you do not take the connectivity of your neighbours into account. Intuitively, if your neighbour is connected only to you, you share a stronger connection than if it is connected to every node.
- $\mathrm{A}_{col}=\mathrm{A}\mathrm{D}^{-1}$ is the column-normalized matrix (i.e. normalized by the degree of the end node of each edge). This corresponds to $h_{ij}^{l+1}=\sum_{(i,k) \in \mathcal{E}}\frac{a_{ik} h_{kj}^{l}}{\mathrm{d}_{k,k}}$, i.e. summing your neighbours' values while taking into account how many neighbours each of them has. This is similar to what is done in the PageRank algorithm, where you take into account the number of outgoing URLs of your neighbours to estimate how likely it is that a surfer starting there ends up on your site. This does not suffer from values that grow indefinitely when stacking layers (it has the same eigenvalues as $\mathrm{A}_{row}$), but well-connected nodes will still end up with much larger values. Although the degree of connectivity is important, it should not be the driving factor: we are interested in clusters and more complex connectivity patterns.
- $\mathrm{A}_{naive}=\mathrm{D}^{-1}\mathrm{A}\mathrm{D}^{-1}$, the naively symmetric normalized matrix, could be a way to naively fix the previous issues. This corresponds to $h_{ij}^{l+1}=\sum_{(i,k) \in \mathcal{E}}\frac{a_{ik} h_{kj}^{l}}{\mathrm{d}_{i,i}\,\mathrm{d}_{k,k}}$. A new issue appears, which is the opposite of the one with $\hat{\mathrm{A}}$: the deeper the network, the smaller the values become, eventually reaching $0$. Indeed, the largest eigenvalue is strictly smaller than that of $\mathrm{A}_{row}$, so it is less than $1$.
- $\mathrm{A}_{sym}=\mathrm{D}^{-1/2}\mathrm{A}\mathrm{D}^{-1/2}$ is the symmetric normalized matrix we want to understand intuitively. This corresponds to $h_{ij}^{l+1}=\sum_{(i,k) \in \mathcal{E}}\frac{a_{ik} h_{kj}^{l}}{\sqrt{\mathrm{d}_{i,i}\,\mathrm{d}_{k,k}}}$. It achieves the same thing as $\mathrm{A}_{naive}$, i.e. it combines our intuitions for $\mathrm{A}_{row}$ and $\mathrm{A}_{col}$: it takes into account both your own number of neighbours and your neighbours' numbers of neighbours. Note that the weight is symmetric (i.e. $w_{ik}=w_{ki}=\frac{a_{ik}}{\sqrt{\mathrm{d}_{i,i}\,\mathrm{d}_{k,k}}}$), which is very natural. Finally, we do not suffer from the previous issues, as all values stay in a similar range independently of the number of layers. Indeed, $\mathrm{A}_{sym}$ is similar to $\mathrm{A}_{row}$ because $\mathrm{A}_{sym} = \mathrm{D}^{1/2} \mathrm{A}_{row} \mathrm{D}^{-1/2}$, so they have the same eigenvalues (the largest being $1$).
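Here is a quick numerical check of the eigenvalue claims above. It is only a sketch under my own assumptions (a random undirected graph with 50 nodes and edge probability 0.1, self-loops added as in the preliminaries, and 10 stacked linear layers); none of it comes from the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Random undirected graph, then add self-loops: Â = A + I.
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T           # symmetric, zero diagonal
A_hat = A + np.eye(n)
deg = A_hat.sum(axis=1)
D_inv = np.diag(1.0 / deg)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

variants = {
    "A_hat (none)": A_hat,
    "A_row":        D_inv @ A_hat,
    "A_col":        A_hat @ D_inv,
    "A_naive":      D_inv @ A_hat @ D_inv,
    "A_sym":        D_inv_sqrt @ A_hat @ D_inv_sqrt,
}

H = rng.standard_normal((n, 4))          # random node features
for name, M in variants.items():
    lam_max = np.max(np.abs(np.linalg.eigvals(M)))
    norm_after_10 = np.linalg.norm(np.linalg.matrix_power(M, 10) @ H)
    print(f"{name:13s} largest |eigenvalue| = {lam_max:6.3f}, "
          f"|H| after 10 layers = {norm_after_10:.2e}")
```

With these settings one should see the unnormalised $\hat{\mathrm{A}}$ blow up, $\mathrm{A}_{naive}$ shrink towards zero, and $\mathrm{A}_{row}$, $\mathrm{A}_{col}$ and $\mathrm{A}_{sym}$ keep $|\mathrm{H}|$ in a stable range, matching the discussion above.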