regards graph convolutions as a diffusion process. It assumes that information is transferred from one node to one of its neighboring nodes with a certain transition probability so that information distribution can reach equilibrium after several rounds. DCNN defines the diffusion graph convolution (DGC) as
H^{(k)} = f(W^{(k)} ⊙ P^k X)    (5.49)
where f(·) is an activation function, and the probability transition matrix P ∈ R^{n×n} is computed by P = D^{-1}A. Note that in DCNN, the hidden representation matrix H^{(k)} has the same dimensions as the input feature matrix X and is not a function of its previous hidden representation matrix H^{(k-1)}. DCNN concatenates H^{(1)}, H^{(2)}, ⋯, H^{(K)} together as the final model output. As the stationary distribution of a diffusion process is a summation of the power series of probability transition matrices, DGC [49] sums up the outputs at each diffusion step instead of concatenating them. It defines the DGC by

H = Σ_{k=0}^{K} f(P^k X W^{(k)})    (5.50)

where W^{(k)} ∈ R^{D×F} and f(·) is an activation function. Using the power of a transition probability matrix implies that distant neighbors contribute very little information to a central node.
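As a concrete illustration, the sketch below is a minimal NumPy rendering of the diffusion convolutions in (5.49) and (5.50); the function names, the broadcast shape of the DCNN weights, and the choice of ReLU for f(·) are our own assumptions rather than details taken from the original papers.

```python
import numpy as np

def transition_matrix(A):
    """P = D^-1 A: row-normalize the adjacency matrix into transition probabilities."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # isolated nodes: avoid division by zero
    return A / deg

def dcnn_layer(A, X, W, k):
    """DCNN diffusion graph convolution, Eq. (5.49): H^(k) = f(W^(k) ⊙ P^k X).
    W is broadcast element-wise over P^k X; f(.) is taken as ReLU here."""
    Pk = np.linalg.matrix_power(transition_matrix(A), k)
    return np.maximum(0.0, W * (Pk @ X))

def dgc_output(A, X, Ws):
    """DGC, Eq. (5.50): H = sum_{k=0..K} f(P^k X W^(k)), summing over diffusion steps.
    Ws is a list of K+1 weight matrices of shape (D, F)."""
    P = transition_matrix(A)
    return sum(np.maximum(0.0, np.linalg.matrix_power(P, k) @ X @ W)
               for k, W in enumerate(Ws))
```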
Parametric graph convolution (PGC-DGCNN) [12, 39] increases the contributions of distant neighbors based on shortest paths. It defines a shortest path adjacency matrix S^{(j)}: if the shortest path from a node v to a node u is of length j, then S^{(j)}_{vu} = 1; otherwise it is 0. With a hyperparameter r that controls the receptive field size, PGC-DGCNN introduces the graph convolutional operation

H^{(k)} = ∥_{j=0}^{r} f((D̃^{(j)})^{-1} S^{(j)} H^{(k-1)} W^{(j,k)})    (5.51)

where D̃^{(j)}_{ii} = Σ_l S^{(j)}_{il}, H^{(0)} = X, and ∥ denotes the concatenation of vectors.
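To make the construction of S^{(j)} concrete, the following sketch is a minimal NumPy illustration, not the authors' implementation; the BFS-style distance computation, the function names, and the use of ReLU for f(·) are our own assumptions.

```python
import numpy as np

def shortest_path_adjacency(A, r):
    """Return S^(0), ..., S^(r), where S^(j)[v, u] = 1 iff the shortest path
    between v and u has length exactly j (BFS by repeated matrix products)."""
    n = A.shape[0]
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0)
    reach = np.eye(n, dtype=bool)
    frontier = np.eye(n, dtype=bool)
    for j in range(1, r + 1):
        expanded = (frontier.astype(int) @ (A > 0).astype(int)) > 0
        frontier = expanded & ~reach         # nodes first reached at step j
        dist[frontier] = j
        reach |= frontier
    return [(dist == j).astype(float) for j in range(r + 1)]

def pgc_dgcnn_layer(A, H, Ws, r):
    """One PGC-DGCNN layer, Eq. (5.51): concatenate, over j = 0..r,
    f((D~^(j))^-1 S^(j) H W^(j,k))."""
    outputs = []
    for S, W in zip(shortest_path_adjacency(A, r), Ws):
        deg = S.sum(axis=1, keepdims=True)   # D~^(j)_ii = sum_l S^(j)_il
        deg[deg == 0] = 1.0
        outputs.append(np.maximum(0.0, (S / deg) @ H @ W))
    return np.concatenate(outputs, axis=1)
```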
Partition graph convolution (PGC) [50] partitions a node's neighbors into Q groups based on certain criteria, not limited to shortest paths. PGC constructs Q adjacency matrices according to the neighborhood defined by each group. It then applies GCN [14] with a different parameter matrix to each neighbor group and sums the results:

H^{(k)} = Σ_{j=1}^{Q} Ā^{(j)} H^{(k-1)} W^{(j,k)}    (5.52)

where H^{(0)} = X, Ā^{(j)} = (D̃^{(j)})^{-1/2} Ã^{(j)} (D̃^{(j)})^{-1/2}, and Ã^{(j)} = A^{(j)} + I.
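The sketch below is our own minimal NumPy rendering of (5.52), assuming the group adjacency matrices A^{(j)} are already given; the function names and the ReLU nonlinearity are illustrative choices, not prescribed by [50].

```python
import numpy as np

def gcn_normalize(A_j):
    """Symmetric GCN normalization of one group adjacency matrix:
    A_bar^(j) = D~^(-1/2) (A^(j) + I) D~^(-1/2)."""
    A_tilde = A_j + np.eye(A_j.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def pgc_layer(A_groups, H, Ws):
    """One PGC layer, Eq. (5.52): sum over the Q groups of
    A_bar^(j) H W^(j,k), followed here by a ReLU nonlinearity."""
    out = sum(gcn_normalize(A_j) @ H @ W_j for A_j, W_j in zip(A_groups, Ws))
    return np.maximum(0.0, out)
```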
MPNN [51] outlines a general framework of spatial‐based ConvGNNs. It treats graph convolutions as a message passing process in which information can be passed from one node to another along edges directly. MPNN runs K‐step message passing iterations to let information propagate further. The message passing function (namely the spatial graph convolution) is defined as
h_v^{(k)} = U_k(h_v^{(k-1)}, Σ_{u∈N(v)} M_k(h_v^{(k-1)}, h_u^{(k-1)}, x^e_{vu}))    (5.53)

where h_v^{(0)} = x_v, and U_k(·) and M_k(·) are functions with learnable parameters. After deriving the hidden representation of each node, h_v^{(K)} can be passed to an output layer to perform node-level prediction tasks or to a readout function that generates a representation of the entire graph from the node representations:

h_G = R({h_v^{(K)} | v ∈ G})    (5.54)
where R(·) represents the readout function with learnable parameters. MPNN can cover many existing GNNs by assuming different forms of U_k(·), M_k(·), and R(·).
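The following sketch shows the K-step message-passing loop of (5.53) with a sum readout in the spirit of (5.54). It is a minimal NumPy illustration under our own assumptions: MPNN leaves M_k(·), U_k(·), and R(·) as learnable components, and the linear-plus-ReLU forms, weight shapes, and function names used here are assumed for illustration only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mpnn_forward(neighbors, X, E, W_msg, W_upd, K):
    """K rounds of message passing, Eq. (5.53), plus a sum readout, Eq. (5.54).
    neighbors : dict {v: list of neighbor indices N(v)}
    X         : (n, d) initial node features, h_v^(0) = x_v
    E         : dict {(v, u): edge feature vector x_vu^e of length d_e}
    W_msg[k]  : (d, 2*d + d_e) weights of the assumed linear M_k
    W_upd[k]  : (d, 2*d)       weights of the assumed linear U_k
    """
    d = X.shape[1]
    H = X.copy()
    for k in range(K):
        H_next = np.zeros_like(H)
        for v, nbrs in neighbors.items():
            # messages M_k(h_v^(k-1), h_u^(k-1), x_vu^e), summed over u in N(v)
            m_v = np.zeros(d)
            for u in nbrs:
                m_v += relu(W_msg[k] @ np.concatenate([H[v], H[u], E[(v, u)]]))
            # update U_k(h_v^(k-1), m_v)
            H_next[v] = relu(W_upd[k] @ np.concatenate([H[v], m_v]))
        H = H_next
    h_G = H.sum(axis=0)          # readout R(.) taken here as a simple sum
    return H, h_G
```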
Graph Isomorphism Network (GIN): Reference [52] finds that previous MPNN-based methods are incapable of distinguishing different graph structures based on the graph embeddings they produce. To remedy this drawback, GIN adjusts the weight of the central node by a learnable parameter ε^{(k)}. It performs graph convolutions by
h_v^{(k)} = MLP((1 + ε^{(k)}) h_v^{(k-1)} + Σ_{u∈N(v)} h_u^{(k-1)})    (5.55)
where MLP(·) represents a multi-layer perceptron.

As the number of neighbors of a node can vary from one to a thousand or even more, it is inefficient to take the full size of a node's neighborhood. GraphSAGE [23] adopts sampling to obtain a fixed number of neighbors for each node. It performs graph convolutions by
h_v^{(k)} = σ(W^{(k)} · f_k(h_v^{(k-1)}, {h_u^{(k-1)}, ∀u ∈ S_{N(v)}}))    (5.56)

where h_v^{(0)} = x_v, f_k(·) is an aggregation function that is invariant to permutations of node orderings (e.g., a mean, sum, or max function), and S_{N(v)} is a random sample of the node v's neighbors.
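For concreteness, the sketch below implements the GIN update of (5.55) and a sampled mean-aggregation GraphSAGE update of (5.56) for a single node. It is a minimal NumPy rendering under our own assumptions: the two-layer MLP, the mean aggregator over the node and its sampled neighbors, the fixed sample size, and the ReLU nonlinearity are illustrative choices, not prescribed by [52] or [23].

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gin_update(v, neighbors, H, eps, W1, W2):
    """GIN, Eq. (5.55): h_v = MLP((1 + eps) h_v + sum_{u in N(v)} h_u),
    with the MLP taken here as a two-layer ReLU network (an assumption)."""
    agg = (1.0 + eps) * H[v] + H[neighbors[v]].sum(axis=0)
    return W2 @ relu(W1 @ agg)

def graphsage_update(v, neighbors, H, W, sample_size, rng):
    """GraphSAGE, Eq. (5.56): aggregate a fixed-size random sample of
    neighbors together with h_v via a mean (one admissible choice of f_k),
    then apply W^(k) and a nonlinearity."""
    sampled = rng.choice(neighbors[v], size=sample_size, replace=True)
    agg = np.vstack([H[v][None, :], H[sampled]]).mean(axis=0)
    return relu(W @ agg)

# Example: rng = np.random.default_rng(0); graphsage_update(0, nbrs, H, W, 5, rng)
```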
GAT [53] assumes that the contributions of neighboring nodes to the central node are neither identical, as in GraphSAGE [23], nor predetermined, as in GCN [14]. GAT adopts attention mechanisms to learn the relative weights between two connected nodes. The graph convolutional operation according to GAT is defined as
h_v^{(k)} = σ(Σ_{u∈N(v)∪{v}} α_{vu}^{(k)} W^{(k)} h_u^{(k-1)})    (5.57)

where h_v^{(0)} = x_v. The attention weight α_{vu}^{(k)} measures the connective strength between the node v and its neighbor u:

α_{vu}^{(k)} = softmax(g(a^T [W^{(k)} h_v^{(k-1)} ∥ W^{(k)} h_u^{(k-1)}]))    (5.58)
where g(·) is a LeakyReLU activation function and a is a vector of learnable parameters. The softmax function ensures that the attention weights sum up to one over all neighbors of the node v.
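To illustrate (5.57) and (5.58), the following sketch is a minimal single-head NumPy version under our own assumptions; GAT in practice uses multi-head attention, which is omitted here, and the choice of ReLU for σ(·) is illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_update(v, neighbors, H, W, a):
    """Single-head GAT update, Eqs. (5.57)-(5.58).
    H : (n, d) node features, W : (d_out, d) weights, a : (2*d_out,) attention vector."""
    nbrs = list(neighbors[v]) + [v]                  # N(v) U {v}
    Wh_v = W @ H[v]
    Wh_u = np.stack([W @ H[u] for u in nbrs])        # (|N(v)|+1, d_out)
    # unnormalized scores g(a^T [W h_v || W h_u]) with g(.) = LeakyReLU
    scores = leaky_relu(np.array([a @ np.concatenate([Wh_v, w]) for w in Wh_u]))
    # softmax over the neighborhood gives the attention weights alpha_vu^(k)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # weighted combination of transformed neighbors, then the nonlinearity sigma(.)
    return np.maximum(0.0, (alpha[:, None] * Wh_u).sum(axis=0))
```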
Graph pooling modules: After a GNN generates node features, we can use them for the final task. However, using all of these features directly can be computationally challenging, so a downsampling strategy is needed. Depending on the objective and the role it plays in the network, different names are given to this strategy: (i) the pooling operation aims to reduce the number of parameters by downsampling the nodes to generate smaller representations, thereby avoiding overfitting, permutation invariance, and computational complexity issues; and (ii) the readout operation is mainly used to generate a graph-level representation based on node representations (see Figure 5.5). Their mechanisms are very similar. In this section, we use pooling to refer to all kinds of downsampling strategies applied to GNNs. Nowadays, mean/max/sum pooling, already illustrated in Section 5.1, is the most primitive and effective way to implement downsampling, since calculating the mean/max/sum