Graph Neural Networks

CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks (CNNs) are a class of deep neural networks particularly effective for processing grid-structured data, such as images.

They use filters (kernels) stacked in successive layers to increase depth and promote a hierarchical extraction of features.

Up to now, we have seen that the 2D spatial cross-correlation between an input image $I$ and a square kernel $K \in R^{d \times d}$ is defined as:

S(i,j) = \sum_{m=-k}^{k} \sum_{n=-k}^{k} I(i+m, j+n) \cdot K(m,n)

Where:

$d = 2k + 1$ is the kernel size (square)

e.g. $d = 3 \rightarrow k = 1$
$I(i + m, j + n)$ : input pixel
$K(m, n)$ : kernel weight

THE EUCLIDEAN LATTICE ASSUMPTION

From a geometric point of view, the input image can be seen as a regular 2D grid of values. Conceptually, each pixel exists on a discrete 2D lattice (e.g., $\mathbb{Z}^2$ ), which can be viewed as a 2D Euclidean plane.

A discrete lattice is a regular grid of points in Euclidean space where each point has integer coordinates, and the distance between neighboring points is fixed.

image-removebg-preview (1).png

Having such a regular and predictable structure is important because it allows the kernel to slide uniformly over the image, applying the same operation everywhere.

The cross-correlation operation slides a kernel across this grid, systematically visiting each location. So at each position, the kernel interacts with a local patch of the input. This means the operation captures local relationships — how each pixel relates to its immediate neighbors in that 2D space.

Note: Although the operation is discrete, the grid itself can be viewed as a sampling of a continuous 2D Euclidean domain.

CROSS-CORRELATION 1D ON A 1D EUCLIDEAN LATTICE

Just like in 2D, the 1D grid can be seen as a discretization of a continuous Euclidean line $\mathbb{R}$ , sampled at regular intervals.

Both the input signal and the kernel are defined on a 1D Euclidean lattice, and the kernel slides along this lattice, computing local interactions at each position.

A GRAPH-BASED PERSPECTIVE

Under this new viewpoint, the input image can be seen as a signal defined on a (Euclidean) graph.

A graph $G = (V, E)$ consists of:

A set of nodes (vertices) $V$
A set of edges $E \subseteq V \times V$

Graphs can be:

Directed or undirected
Weighted or unweighted

A graph is defined as a lattice graph when its nodes are arranged on a regular grid (e.g., 1D, 2D, or 3D). In this model, we can map the image structure as follows:

Each pixel corresponds to a node in the graph.
Each node is associated with a feature vector (e.g., a triplet of raw RGB values).
The edge set $E$ connects each node to its spatial neighbors using a fixed connectivity rule (e.g., 4-connected: top, bottom, left, right).

PROBLEM OF STANDARD N-D CONVOLUTION

Traditional convolutional architectures assume data is defined over regular Euclidean grids (e.g., 1D signals, 2D images).

However, not all types of data exhibit this regular structure. Many real-world domains — such as:

are non-Euclidean.

In these domains, entities (or nodes) have irregular connections: some nodes are highly connected while others are not, and the relationships among nodes do not follow a fixed grid.

Graph Neural Network (GNN)

OBJECTIVE

The objective is to define a neural network that can handle graph-structured data.

The main challenges in a Graph Neural Network (GNN) are:

Irregular structure: nodes may have different numbers of neighbors, meaning there is no fixed-size receptive field.
No spatial locality: unlike images, there is no inherent notion of distance or order between nodes.

Permutation invariance: the output should not depend on the order in which nodes are processed.
No global coordinate system: there is no fixed layout or orientation to rely on.
Dynamic topology: edge connectivity may evolve over time or depend on node features.

GRAPH CONVOLUTIONAL NETWORKS

Graph Convolutional Networks (GCNs) generalize the concept of convolution to graph-structured data. As with standard neural networks:

A Graph Neural Network is composed of multiple successive layers.
It includes input features, hidden activations (often called node embeddings), activation functions, and output responses.
Gradient descent and backpropagation are used to update the model parameters.

The key idea in GCNs is message passing, meaning that the information is propagated across the graph by allowing each node to communicate with its neighbors.

In practice, this means that each node updates its state by aggregating information from its neighbors and updating its own representation accordingly. So we have two main steps:

Aggregation (Leveraging the graph structure)

Each node aggregates information from its neighbors by combining their embeddings using an aggregation function (e.g., sum, mean, max, or a learned function).
Node update

The aggregated information is then passed through a neural network layer (e.g., a simple MLP) to generate a new node representation (embedding).

This updated embedding captures both the node’s own features and those of its local neighborhood.

This process can be stacked across multiple layers, enabling nodes to gradually capture information from larger neighborhoods.

AN EXAMPLE: GCN

GCN Layer represents one of the most prominent examples of graph-based neural networks.

It takes in input:

Node features matrix $X \in R^{n \times d}$ where each row corresponds to the feature vector of a node.
Adjacency matrix $A \in R^{n \times n}$ $A \in R^{n \times n}$ which encodes the graph structure — which nodes are connected. It can be:
- binary
- weighted.

In a simple, unweighted graph, if node i and node j are connected, then $A_{ij} = 1$ ; otherwise, $A_{ij} = 0$

THE ROLE OF THE ADJACENCY MATRIX

The adjacency matrix $A$ is a key component of GNNs:

Acts as a structural prior

The adjacency matrix encodes geometric or relational constraints of the domain.
Enables the model to reflect spatial or structural constraints of the data.
Helps reduce overfitting by:
- Enforcing locality, meaning that each node only aggregates from its true neighbors.
- It introduces an inductive bias, built-in assumptions about how the data is structured, which helps the model generalize better on unseen data.

The adjacency matrix is typically derived from domain knowledge — it is fixed manually and remains constant across layers. In contrast, the node features (embeddings) are updated layer by layer through the message-passing mechanism.

When the structure is not known, the adjacency matrix $A$ can be constructed automatically using heuristic methods, such as $k$ -nearest neighbors in feature space or distance thresholds.

MESSAGE PASSING

Layer-wise propagation rule:

H^{(l)} = \sigma \left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)} \right)

$\hat{A} = A + I$ → adjacency matrix with self-loops added
$\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ → symmetric normalization of the adjacency matrix
$\hat{D}$ with $\hat{D}_{i,i} = \sum_{j} \hat{A}_{i,j}$ is the (diagonal) degree matrix
$H^{(l-1)} \in R^{|V| \times F_l}$ → node embeddings from the previous layer
$W^{(l)} \in R^{F_{l-1} \times F_l}$ → learnable weight matrix (projection)
$\sigma(\cdot)$ → activation function (e.g., ReLU)

PROJECTION

The same operation can be expressed more intuitively in a node-wise (local) form:

Aggregation: $a_v^{(l)} = \sum_{u \in N(v) \cup \{v\}} \frac{1}{\sqrt{\hat{d}_v \hat{d}_u}} h_u^{(l-1)}$

Each node $v$ collects (aggregates) features from its neighbors $u \in N(v)$ and from itself, weighting each contribution by the inverse of the node degrees. This keeps the aggregation balanced regardless of how many neighbors each node has.
Projection: $z_v^{(l)} = a_v^{(l)} \cdot W^{(l)}$

The aggregated message $a_v^{(l)}$ is then transformed (linearly projected) into a new feature space through the learnable weights $W^{(l)}$ .
Non-linearity: $h_v^{(l)} = \sigma\left(z_v^{(l)}\right)$

Finally, a non-linear activation (e.g., ReLU) is applied to obtain the updated embedding of node $v$

RECEPTIVE FIELD AND THEORETICAL MOTIVATIONS

A GCN layer aggregates information from each node’s immediate (1-hop) neighbors. Stacking multiple layers allows the model to capture information from higher-order neighborhoods:

1 layer → information from direct neighbors (1-hop)
2 layers → information from neighbors of neighbors (2-hop)
…
k layers → information from the k-hop neighborhood

This defines the receptive field of each node — the portion of the graph that influence a node’s representation.

The full two-layer GCN can be written as:

$Z = \text{softmax}\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X W^{(0)}\right) W^{(1)}\right)$

The GCN can be theoretically motivated by spectral graph theory:

It represents a first-order approximation of spectral graph convolutions.
Based on the spectral formulation of convolution on graphs.

Applications of GNNs

Node Classification

In node classification, the goal is to predict a label for each node in the graph. Every node $v \in V$ is associated with an input feature vector $x_v$ , which encodes its attributes or properties.

The model processes these features through multiple GNN layers, allowing each node to aggregate information from its neighbors and produce a node embedding $h_v \in \mathbb{R}^d$ .

A final classification layer then maps these embeddings to class scores:

$z_v = W h_v + b \in \mathbb{R}^C$

The predicted label for each node is obtained by taking the class with the highest score:

$\hat{y}_v = \arg\max_c z_v[c]$

The overall model output is a matrix:

$Z \in \mathbb{R}^{N \times C}$

where:

$N$ is the number of nodes in the graph and each row corresponds to a node’s class scores.
$C$ is the number of possible classes.

TWO SETTINGS

Transductive setting, the entire graph is available during training, but only a subset of nodes has known labels. The main objective is to predict the labels of the remaining nodes within the same graph. Since the full structure is accessible, the model can take advantage of the connections among all nodes — both labeled and unlabeled. Even though the unlabeled nodes are not directly supervised, they still contribute to the learning process by sharing structural information that helps the model understand the overall graph context.
Inductive setting requires the model to generalize to completely new nodes or even entirely unseen graphs. These nodes are not part of the training data, so the model cannot rely on their connections during learning. Instead, it must base its predictions on the available node features and their local neighborhood structure. The goal is for the model to learn representations that can effectively transfer across different graphs and remain useful in new, unseen scenarios.

Link Prediction

In link prediction, the task is to predict whether a link (or edge) exists between two nodes in a graph. Given a pair of nodes $u$ and $v$ , the goal is to predict if $(u, v) \in \varepsilon$ .

This is typically formulated as a binary classification problem on node pairs, although it can also become a multi-class problem when different edge types are considered.

A common pipeline involves three main steps:

Learning node embeddings $h_u$ and $h_v$ using a GNN.
Computing an edge representation → $s_{uv} = \text{Decoder}(h_u, h_v)$
Mapping this representation to class scores with a linear layer $z_{uv} = W s_{uv} + b$
Prediction step $\hat{y}_{uv} = \arg\max_c z_{uv}[c]$

Graph Classification

In graph classification, the goal is to assign a label to an entire graph rather than to individual nodes or edges.

Each graph $G = (V, E)$ is treated as a single data point. Node features are first processed by several GNN layers to obtain node embeddings $h_v$ , which are then aggregated into a single graph-level vector through a global pooling operation:

$h_{\text{graph}} = \text{POOLING}(\{h_v \mid v \in V\})$

Finally, a multi-layer perceptron (MLP) is used to predict the label for the entire graph:

$\hat{y}_G = \text{MLP}(h_{\text{graph}})$

Problem

In Graph Neural Networks (GNNs), the graph convolutional layers do not change the number of nodes in the graph. After several convolutional layers, we still obtain a representation for each individual node, but not a single, unified representation for the entire graph.

This poses a problem when we want to perform graph-level predictions, such as classifying an entire graph rather than its individual nodes. In such cases, we need a mechanism to aggregate information from all nodes into one compact vector that represents the whole graph.

Solution: Global Pooling

The Global Pooling mechanism addresses this issue by aggregating the embeddings of all nodes into a single global representation. This operation effectively creates a “super-node” that summarizes the structural and feature information of the entire graph.

Common global pooling methods include:

Global average pooling: $h_{\text{graph}} = \frac{1}{|V|} \sum_{v \in V} h_v$
Attention-based pooling: $h_{\text{graph}} = \sum_{v \in V} \alpha_v h_v , \quad \text{with} \quad \alpha_v = \text{softmax}(a^\top h_v)$

While these methods summarize all node embeddings in a single step, other approaches introduce hierarchical pooling, similar to what convolutional neural networks (CNNs) do in image processing.

EXAMPLE

A common application of graph classification is action recognition from human skeleton data. Here, each sequence of human poses is represented as a spatio-temporal graph, where:

Spatial edges connect body joints within each frame.
Temporal edges connect the same joint across consecutive frames.

By applying Graph Neural Networks (GNNs) or Spatio-Temporal Graph Convolutional Networks (ST-GCNs), it becomes possible to jointly model both spatial and temporal relationships, enabling accurate classification of the performed action.

Graph Attention Networks (GAT)

PROBLEMS OF GCN

The layer-wise propagation rule of a Graph Convolutional Network (GCN) is defined as:

$H^{(l)} = \sigma \left( \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)} \right)$

Although this formulation enables effective message passing across graph nodes, standard GCNs suffer from several limitations:

Over-smoothing: As the number of layers increases, node representations from different parts of the graph become indistinguishable making it difficult to separate nodes by class or role.
Aggregation: In each layer, every neighboring node contributes equally to the target node’s update. This lack of adaptive weighting or attention ignores the fact that some neighbors may be more informative or relevant than others.
Limited expressivity: The model’s capacity to capture complex relationships is constrained by the relations encoded in the adjacency matrix.

GRAPH ATTENTION NETWORKS (GATs)

Graph Attention Networks (GATs) extend traditional Graph Convolutional Networks (GCNs) by integrating an attention mechanism that enables nodes to assign different levels of importance to their neighbors during message passing.

Key concepts:

Learned edge weights:

Unlike GCNs, GATs do not rely on a pre-defined weighted adjacency matrix. Instead, they learn attention coefficients that determine how much influence each neighbor should have.
Local adaptivity:

Different neighbors receive different weights, allowing the model to focus more on relevant nodes.

In essence, through the attention mechanism, each node attends to its local neighborhood, using the adjacency matrix as a structural guide.

This is done in five steps:

Linear Projection

Each node’s feature vector is first linearly transformed:

$h_i^W = W h_i, \quad \forall i$

where:
- $W \in \mathbb{R}^{F' \times F}$ is a learnable weight matrix shared across all nodes.
Attention Coefficients:

For every edge $(i, j)$ , the model computes an unnormalized attention score between node $i$ and its neighbor $j$ :

$e_{ij} = \text{LeakyReLU}\left( a^\top [h_i^W \Vert h_j^W] \right)$

where:
- $[\cdot \Vert \cdot]$ denotes vector concatenation.
- $a \in \mathbb{R}^{2F'}$ is a learnable weight vector.
The affinity score $e_{i,j}$ indicates the importance of node $j$ ’s features to node $i$ .

In the most general case, the model allow all nodes to attend to every other node, dropping all structural information. However, to maintain locality, GATs introduce masked attention, which restricts attention to first-order neighbors:
$e_{ij} \text{ is computed only if} j \in N(i)$
where $N(i)$ is the first-order neighbors of $i$ (including $i$ ) expressed in the adjacency matrix.

Advantage. The graph is not required to be undirected; the computation of the attention coefficient $\alpha_{ij}$ can be simply omitted if the edge $j \rightarrow i$ is not present.
Normalization:

The unnormalized attention coefficients are converted into normalized attention scores across all neighbors of $i$ :

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N(i)} \exp(e_{ik})}$
Aggregation

The final node representation is obtained by taking a weighted sum of the transformed features of its neighbors:

$h'_i = \sigma \left( \sum_{j \in N(i)} \alpha_{ij} h_j^W \right)$

Multi-head Attention

To stabilize learning and enhance expressivity, GATs often employ multi-head attention. In this case $K$ independent attention mechanisms are applied, and their outputs are concatenated:

$h'_i = \Big\Vert_{k=1}^K \sigma \left( \sum_{j \in N(i)} \alpha_{ij}^{(k)} W^{(k)} h_j \right)$

Note that, in this setting, the final returned output, $h'_i$ , will consist of $K \cdot F'$ features (rather than $F'$ ) for each node.

GATs can also be combined with residual connections to improve gradient flow and stabilize deeper architectures.

COMPARISON: GAT VS SELF-ATTENTION

Aspect	GAT	Self-Attention (Transformers)
Input structure	Graph (nodes + edges)	Sequence (e.g., tokens)
Neighborhood	Defined by adjacency matrix	Full attention (every token attends to all others)
Sparsity	Sparse: only local neighbors	Dense: full pairwise interactions
Attention mechanism	Single-layer MLP over concatenated node pairs	Dot-product between learned query and key vectors
Position encoding	Implicit via graph topology	Explicit (e.g., sinusoidal or learned)
Inductive bias	Local, relational inductive bias from graph edges	Temporal or spatial bias via positional encodings

GraphCast

WEATHER FORECASTING WITH GNNs

GraphCast is a state-of-the-art model for medium-range weather prediction that uses Graph Neural Networks to model the Earth’s atmosphere:

Nodes: grid points on a spherical mesh (icosahedral grid)
Edges: connect spatially adjacent locations, representing local interactions in the atmosphere
Node features: meteorological variables such as wind, temperature, pressure, humidity, etc.

Performance:

For 10-day forecasts, GraphCast achieves accuracy comparable to traditional physics-based systems like HRES, the leading deterministic operational forecast model.
Offers significantly faster inference, enabling rapid predictions across the globe.

ARCHITECTURE

GraphCast combines temporal sequence modeling with GNN-based message passing:

Message-passing GNN layers are used to propagate information across neighboring nodes in the atmospheric graph.
The model is trained on decades of historical weather data to predict future states at each grid point.

To capture both global and local weather patterns, GraphCast represents the Earth’s atmosphere using graph meshes at multiple resolutions:

Coarse meshes provide a global context for large-scale weather patterns.
Fine meshes capture localized phenomena such as storms or temperature gradients.

This hierarchical graph representation allows GraphCast to effectively balance computational efficiency with high-resolution predictive accuracy, making it one of the most advanced GNN-based weather forecasting systems to date.

Conclusions

Graph Neural Networks (GNNs) extend traditional neural networks to graph-structured data, effectively combining node features with the underlying graph topology.

GNNs operate through message passing, where each node updates its representation by aggregating information from its neighbors.
Different GNN variants—such as GCN, GAT, GATv2, and others—introduce alternative mechanisms for aggregation and weighting, enabling more flexible and expressive modeling of complex graphs.
GNNs have proven effective across a variety of tasks:
- Node classification – labeling individual nodes based on features and neighborhood context
- Link prediction – predicting the existence or type of edges between nodes
- Graph classification – assigning labels to entire graphs
- Applications span social networks, biological networks, recommendation systems, and many more domains.
Despite their successes, challenges remain, including scalability to very large graphs, over-smoothing of node representations, interpretability of learned embeddings, and handling of dynamic or evolving graphs.

Key Takeaways

Topic	Key Point
From grids to graphs	CNNs assume regular Euclidean grids; GNNs generalize to irregular, non-Euclidean structures
Message passing	Core GNN mechanism: each node aggregates neighbor features, then updates its own embedding
GCN layer	Uses symmetric normalization of the adjacency matrix with self-loops for stable propagation
Adjacency matrix	Encodes graph structure as a fixed prior; enforces locality and reduces overfitting
Receptive field	Stacking $k$ GNN layers lets each node capture information from its $k$ -hop neighborhood
Node classification	Predict labels per node; can be transductive (same graph) or inductive (new graphs)
Link prediction	Predict edges between node pairs using learned embeddings + decoder
Graph classification	Global pooling aggregates all node embeddings into a single graph-level representation
GAT	Learns attention coefficients over neighbors instead of using fixed adjacency weights
GAT multi-head	Runs $K$ independent attention heads in parallel; concatenates outputs for richer representations
GraphCast	State-of-the-art weather forecasting using GNNs on multi-resolution spherical meshes