Transformers

Temporal Convolutional Architectures

INTRODUCTION

A sequence model is a model that takes as input a sequence of items (words, letters, time-series values, audio signals, etc.) and produces either a single output or another sequence of outputs.


Each position in the sequence is often called a time step. The item at each time step, also referred to as a token, is represented by a set of numerical features.

While sequence modeling is historically associated with Recurrent Neural Networks (RNNs), other architectures such as Convolutional Neural Networks (CNNs) can also be utilized.

CNN

Even though Convolutional Neural Networks (CNNs) are primarily associated with image analysis, their operations can also be effectively adapted for sequence data.


In image processing, a Conv2D operation uses a two-dimensional kernel that slides across the image’s spatial dimensions (height and width). In contrast, for time series or sequential data, this same principle is applied through a Conv1D operation.

In this approach, the kernel is a one-dimensional filter that moves along the temporal axis. This allows the model to preserve critical sequential information, such as:

  • Temporal dependencies in time series (e.g., causal patterns)
  • Syntactic order in text (e.g., noun–verb–object).

FROM 2D TO 1D CONVOLUTION

When shifting from 2D to 1D, we still operate on multi-featured data. For example, a sentence is a sequence where each token is described by multiple features.


In this context, the 1D convolution kernel (e.g., of size 5) slides along only one dimension: the temporal axis (the sequence length).

There is, however, an “implicit” second dimension: the feature size. The kernel’s second dimension must always match the number of features in the input sequence. For instance, if the input tokens have 3 features and the kernel size is 5, the kernel’s actual shape is (5, 3). It slides temporally, but at each step, it processes all 3 features simultaneously.

The discrete 1D convolution is mathematically defined as:

$$(x * \omega)(n) = \sum_{i=0}^{k-1} x(n - i)\,\omega(i)$$

Where:

  • $x$ is our input vector of length $n$
  • $\omega$ is our kernel of length $k$

Pay attention to the term $x(n - i)$. This indicates that the kernel “looks back” to compute the output. To calculate the value at step $n$, it needs input from steps $n$ back to $n-k+1$. This has a critical consequence: without padding, we cannot compute the output for the very first time steps (e.g., at $n=0$ or $n=1$), because the kernel would need to access data from “before” the start of the sequence, where there is nothing. The first output can only be computed once the kernel has $k$ items to process.
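The formula can be sketched in a few lines of plain Python (a minimal illustration, not an optimized implementation). Note how the first valid output appears at $n = k-1$, so a length-$T$ input yields $T - k + 1$ outputs:

```python
def conv1d(x, w):
    """Discrete 1D convolution (x * w)(n) = sum_i x(n - i) * w(i), valid positions only."""
    k = len(w)
    # The first computable output is at n = k - 1, where the kernel fully overlaps x.
    return [sum(x[n - i] * w[i] for i in range(k)) for n in range(k - 1, len(x))]

# A difference kernel over a ramp: each output sees x[n] - x[n-2].
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # → [2, 2, 2]
```

A 5-element input with a size-3 kernel produces only 3 outputs, exactly the padding effect described above.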

DILATED CONVOLUTION

A major disadvantage of using standard CNNs for sequences is that to capture long-term dependencies in a sequence, the network needs a large receptive field. This traditionally means we must use an extremely deep network or a very large kernel.

To mitigate this, dilated convolutions can be used. This technique introduces gaps between the kernel elements, allowing it to cover a wider input region without increasing the number of parameters.

The discrete dilated convolution is defined as:

$$(x * \omega)(n) = \sum_{i=0}^{k-1} x(n - d \cdot i)\,\omega(i)$$

Where:

  • $k$ is the kernel size
  • $d$ is the dilation factor, an integer that specifies how many items in the sequence are skipped.

For a kernel of size $k = 3$:

  • Standard Convolution ($d = 1$):

    The kernel looks at adjacent points in time and covers only 3 elements, capturing short-term dependencies.

    • Inputs seen: $[x_1, x_2, x_3]$
    • Receptive Field $= 3$
  • Dilated Convolution ($d = 2$):

    The kernel “skips” one element ($d - 1$) between each input, covering 5 elements.

    • Inputs seen: $[x_1, x_3, x_5]$
    • Receptive Field $= 5$

CAUSAL CONVOLUTION

In a standard convolution (as used in image processing), the kernel is typically centered, meaning it “looks” at data both backward and forward around the current position.

However, when dealing with time series, this is not desirable because the output at a given time $t$ should not depend on future inputs, as the future has not yet occurred.

To enforce this constraint, we use a causal convolution, which ensures that the output at time step $t$ depends only on inputs from previous or current time steps ($\le t$). This is typically achieved by shifting the kernel (or using padding) so that it only looks backward.

This simple causal constraint makes CNNs suitable for sequence modeling tasks where causality is critical, such as in autoregressive models.
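Putting the two ideas together, a causal dilated convolution can be sketched by left-padding the input with $d(k-1)$ zeros, so every output aligns with its input position and depends only on current and past steps (a minimal sketch, not a library implementation):

```python
def causal_dilated_conv1d(x, w, d=1):
    """Causal dilated convolution: output[n] depends only on x[n], x[n-d], ..., x[n-d*(k-1)].

    Left-padding with d*(k-1) zeros keeps the output the same length as the input."""
    k = len(w)
    p = d * (k - 1)
    xp = [0] * p + list(x)  # zeros stand in for the (unknown) past
    return [sum(xp[p + n - d * i] * w[i] for i in range(k)) for n in range(len(x))]

# Sum kernel with d=2: output[n] = x[n] + x[n-2] + x[n-4] (missing past treated as 0).
print(causal_dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1, 1], d=2))  # → [1, 2, 4, 6, 9, 12]
```

Unlike the unpadded version, the output has the same length as the input, which is what allows TCN blocks to be stacked freely.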


TEMPORAL CONVOLUTIONAL NETWORK

A Temporal Convolutional Network (TCN) is a model architecture composed of stacked residual blocks, where each block utilizes dilated causal convolutions.


A key feature is the use of increasing dilation factors ($d = 1, 2, 4, \dots$) in successive layers. As the network deepens, the receptive field grows exponentially. This allows the model to efficiently capture long-range dependencies and “see” a large portion of the input sequence without requiring an excessive number of layers.
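The growth of the receptive field can be checked with a small calculation. For a stack of dilated convolutions with one kernel-$k$ convolution per layer, the receptive field is $1 + (k-1)\sum_l d_l$ (TCN residual blocks typically apply two convolutions per block, which enlarges it further; this sketch counts one convolution per layer for simplicity):

```python
def receptive_field(k, dilations):
    """Receptive field of stacked dilated convolutions, one kernel-k conv per layer."""
    return 1 + (k - 1) * sum(dilations)

# Doubling dilations (1, 2, 4, ...) make the receptive field grow exponentially with depth.
for depth in range(1, 5):
    dils = [2 ** i for i in range(depth)]
    print(depth, receptive_field(3, dils))  # 1→3, 2→7, 3→15, 4→31
```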


Similar to how architectures like VGG are built from blocks, a TCN is composed of these residual blocks. Each individual block contains a small pipeline that is typically applied twice:

  • A dilated causal 1D convolution (with a specific kernel size $k$, e.g., $k=3$).
  • Weight Normalization (used to stabilize and speed up training by decoupling weight direction from magnitude).
  • ReLU activation (for non-linearity).
  • Dropout (for regularization to reduce overfitting).

After this two-layer pipeline, a residual connection (or skip connection) adds the original input of the block to the block’s final output. If the input and output have different dimensions (e.g., due to a change in the number of filters), a 1x1 convolution is applied to the input (on the skip connection) to match its shape before the addition.
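In NumPy terms, a 1x1 convolution over a (channels, time) tensor is just the same linear map applied at every time step, which is why it can resize the skip path. A minimal sketch, with a random tensor standing in for the convolutional pipeline’s output:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, T = 3, 8, 16

x = rng.normal(size=(c_in, T))          # block input: (channels, time)
f_x = rng.normal(size=(c_out, T))       # stand-in for the dilated-conv pipeline output
W_1x1 = rng.normal(size=(c_out, c_in))  # 1x1 conv = same linear map at every time step

# Channel counts differ, so the skip path is projected before the addition.
y = f_x + W_1x1 @ x
print(y.shape)  # (8, 16)
```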

TCN VS RNN

Temporal Convolutional Networks (TCNs) can outperform recurrent models like LSTMs and GRUs on a wide range of sequence modeling tasks.

Their appealing properties, when compared to RNNs, include:

  • Parallelism: TCNs are highly parallelizable. Unlike RNNs, which must process data sequentially (where the output at time $t$ depends on the computation from $t-1$), the convolutions in a TCN can be computed in parallel across the entire sequence. This results in significantly faster training, especially on long sequences.
  • Stable Training: TCN training is generally more stable as the backpropagation path is not sequential, which avoids the vanishing and exploding gradient problems common in RNNs.
  • Higher effective memory: the “memory” (or receptive field) of a TCN is explicitly controlled by the network’s depth and dilation factors ($d$). This allows the model to efficiently capture very long-range dependencies. In contrast, an RNN’s memory is tied to its hidden state, and attempting to capture long dependencies often leads to gradient instability during backpropagation through time.
  • Explainability: Since TCNs are fundamentally convolutional networks, they are compatible with standard CNN explainability tools. Methods like Grad-CAM (Class Activation Maps) can be used to generate activation maps, providing insights into which parts of the input sequence (or which features) were most important for a given prediction.

Transformer

FROM TEMPORAL CONVOLUTIONS TO TRANSFORMERS

Temporal Convolutional Networks (TCNs) model temporal dependencies using a fixed inductive bias: local convolutions with an expanding receptive field. This means that the model processes data along the temporal axis in order, enforcing causality — what happens in the past influences what happens in the future, and never the other way around.

This inductive bias is powerful, as it ensures stability and causal consistency, but it may not always be the best assumption for every problem. In some cases, relevant information is non-local in time, meaning that an item can depend on elements that are very distant in the sequence or follow long-range dependencies that do not obey a strict temporal order.

Example: Code Understanding

In source code, the opening of a for loop or a bracket { may influence variables or a corresponding closing bracket } dozens of lines later. This creates a long-range dependency that is difficult for a model based on local convolutions to capture.

TRANSFORMER ARCHITECTURE


The solution is the family of so-called attention-based models, or Transformers. This architecture is composed of two main components:

  • Encoder — on the left in the figure
  • Decoder — on the right in the figure

Both the encoder and decoder consist of a stack of $N$ identical layers (or “blocks”), typically with $N = 6$, forming a deep architecture capable of capturing complex and long-range dependencies within the data.

Transformers have become ubiquitous across AI applications:

  • Machine translation and text generation
  • Question answering and dialogue systems
  • Computer vision (Vision Transformer, ViT)

TRANSFORMER ENCODER

A Transformer encoder block receives an input sequence of tokens — for example, $T$ tokens, each represented by a vector of dimension $d_{model}$ — and produces an output sequence of the same shape.

No pooling or dimensionality reduction is applied: the encoder preserves the sequence length and dimensionality, but transforms the content of each token as it passes through the layers.

This is in contrast with CNNs, which often change the dimensions (e.g., feature/channel size) of their input.

Each encoder block is composed of the following main components:

  • A Multi-Head Attention (MHA) layer (detailed later).
  • Add & Norm:
    • A Residual (skip) connection is applied, adding the input of the MHA layer to its output, to facilitate gradient flow.
    • Layer Normalization is applied to the result.


  • A Feed-Forward Neural Network (FFN), a simple MLP applied independently and in parallel to each token embedding. It is typically composed of two linear layers with a ReLU activation in between.

    The standard design first expands the dimensionality and then contracts it back to the original size:

    • Layer 1 (Expand): $\text{Linear}(d_{\text{model}} \rightarrow d_{\text{ff}})$
    • ReLU Activation
    • Layer 2 (Contract): $\text{Linear}(d_{\text{ff}} \rightarrow d_{\text{model}})$

    Where $d_{ff}$ is the “feed-forward” inner dimension, often set to $4 \times d_{model}$.

  • Add & Norm:

    • Another Residual connection adds the input of the FFN to its output.
    • Layer Normalization is applied again.
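The position-wise FFN with its residual connection can be sketched in NumPy. Because the same weights are applied to every token, the whole sequence is processed with two matrix multiplications (the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff = 5, 8, 32  # d_ff = 4 * d_model

X = rng.normal(size=(T, d_model))          # one row per token
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

H = np.maximum(X @ W1 + b1, 0.0)   # expand + ReLU, applied to every token at once
out = X + (H @ W2 + b2)            # contract back to d_model, then residual connection
print(out.shape)  # (5, 8): sequence length and dimensionality are preserved
```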

Layer Normalization

Layer Normalization (LN) is a technique used to stabilize and speed up training.

Unlike Batch Normalization, which normalizes activations across the batch dimension, Layer Normalization operates across the feature dimension — i.e., it normalizes the features within each individual sample (token).

Given an input vector $x$ representing one token with $H$ features, LN computes:

$$LN(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$$

where:

  • $\mu = \frac{1}{H} \sum_i x_i$ → the mean of the features of the token $x$
  • $\sigma^2$ → the variance over the features
  • $\gamma, \beta$ → learnable parameters that allow the model to rescale and shift the normalized output
  • $\epsilon$ → small constant for numerical stability

Thus, normalization is applied per token, not across different tokens in the same batch.

In Transformers, Layer Normalization is applied multiple times throughout both the encoder and decoder stacks — typically after each residual connection — ensuring that at every layer, the activations maintain a stable scale and distribution.
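A minimal NumPy sketch of per-token Layer Normalization ($\gamma$ and $\beta$ would be learned in practice; here they are fixed to 1 and 0):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token (row) over its feature dimension, then rescale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # 4 tokens, 8 features each
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))

# With gamma=1, beta=0 each token now has ~zero mean and ~unit variance,
# regardless of the other tokens in the batch.
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))  # True
```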

Multi-Head Attention

THE CORE IDEA OF ATTENTION

In many sequence tasks, not all input elements are equally relevant for producing a given output. Traditional models (like RNNs or CNNs) process information uniformly or with a fixed receptive field, which limits their ability to focus on what truly matters.

Attention solves this problem by allowing the model to dynamically assign “importance” weights to different input elements based on their relevance to the current context. At each step, the model effectively asks: “Which tokens should I pay more attention to right now?”

Given an input sequence of $T$ tokens, $x$, the attention mechanism produces a new sequence of contextualized representations:

$$Z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_T \end{bmatrix}$$

where each output vector $z_t$ is a weighted combination of all input vectors $x_i$:

$$z_t = \sum_i \alpha_{t,i} x_i \qquad \alpha_{t,i} = \text{softmax}(s_{t,i})$$

The attention coefficients $\alpha_{t,i}$ are not fixed. They are learned functions of the input, computed through dedicated parametric modules. Thus, the model learns where to look depending on the specific input sequence.

For example, to create $z_1$, the model might learn that it is composed of 80% information from $x_1$, 10% from $x_3$, and 10% from $x_7$.

ATTENTION AS CONTENT-BASED MEMORY ACCESS

Instead of compressing all contextual information into a single hidden state (as done in RNNs), the attention mechanism allows the model to directly query the entire sequence of inputs and retrieve the most relevant information on demand.

This process functions as a form of content-based memory access, where retrieval depends on what is stored, not where it is stored.

The model maintains a set of memory slots constructed from the entire input sequence $x = [x_1, x_2, \dots, x_T]$. For each token $x_i$, the model generates two vectors:

  • Key ($k_i$): Acts as an “index” or “label.” It represents what kind of information this token offers (e.g., “I am a subject,” “I am a verb”).
  • Value ($v_i$): Represents the token’s actual content or what information it provides.

The complete set of $(k_i, v_i)$ pairs constitutes the memory.

To retrieve information, the model generates a Query vector $q$, which describes what the model is looking for. This query $q$ is used to retrieve the most relevant information $z$ from the memory by computing a weighted combination of all values $v_i$.

The weights $\alpha_i$ are determined by computing the similarity between the query $q$ and each corresponding key $k_i$, as formalized in the algorithm below.


Algorithm 2 Content-based Retrieval


Require: query $q$, keys $\{k_i\}$, values $\{v_i\}$

for each memory slot $i$ do

$s_i \leftarrow \text{similarity}(q, k_i)$

end for

$\alpha \leftarrow \text{softmax}(s)$ — attention weights

$z \leftarrow \sum_i \alpha_i v_i$ — retrieved information

return $z$
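Algorithm 2 translates almost line-by-line into NumPy, using the dot product as the similarity function:

```python
import numpy as np

def retrieve(q, keys, values):
    """Content-based retrieval: dot-product similarity -> softmax -> weighted sum of values."""
    s = np.array([q @ k for k in keys])  # similarity score per memory slot
    a = np.exp(s - s.max())
    a = a / a.sum()                      # attention weights alpha (sum to 1)
    return a @ np.array(values)          # retrieved information z

keys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
values = [np.array([10.0]), np.array([20.0])]

# A query close to the first key retrieves mostly the first value.
z = retrieve(np.array([5.0, 0.0]), keys, values)
print(z)  # close to [10.]
```

Retrieval depends only on the query’s similarity to the keys, not on which slot the value sits in.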


QUERY, KEYS AND VALUES


In the attention mechanism, the Query $Q$, Key $K$, and Value $V$ matrices are generated by projecting the entire input sequence $X$ through three distinct, trainable linear layers (weight matrices):

$$Q = XW^Q \qquad K = XW^K \qquad V = XW^V$$

The input tokens are projected into a new dimension $d_k$, following these relationships:

  • Input → The input sequence matrix $X$ consists of $T$ tokens and has the shape $(T, d_{\text{model}})$
  • Weights → Each weight matrix $W$ maps the input dimension to the new one and therefore has shape $(d_{\text{model}}, d_k)$.
  • Output → The resulting matrices $Q, K, V$ all share the shape $(T, d_k)$. Each row corresponds to the respective vector of a token — for example, the first row represents $q_1$ for the first token, the second row represents $q_2$ for the second token, and so on.

Since all three components originate from the same input $X$, the attention mechanism is referred to as self-attention.
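The three projections can be sketched in NumPy (the dimensions below are illustrative); note that $Q$, $K$, and $V$ are all computed from the same $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 16, 4

X = rng.normal(size=(T, d_model))   # input sequence: one row per token
WQ = rng.normal(size=(d_model, d_k))
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

# Self-attention: queries, keys, and values all come from the same input X.
Q, K, V = X @ WQ, X @ WK, X @ WV
print(Q.shape, K.shape, V.shape)  # (6, 4) (6, 4) (6, 4)
```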

SINGLE-HEAD ATTENTION

This mechanism acts as a form of content-based memory access: the model can look back at all inputs and retrieve information on demand.

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Where $Q, K, V$ are the matrices representing Queries, Keys, and Values, respectively. The calculation involves four steps:

  1. Compute Similarity Scores

    The dot product $QK^T$ produces a similarity matrix where the entry $(t, i)$ measures how similar token $t$ (query) is to token $i$ (key).

    Shape: $(T, d_k) \cdot (d_k, T) \rightarrow (T, T)$

  2. Scale

    To stabilize the gradients, the scores are scaled by dividing by $\sqrt{d_k}$. A typical value for the dimension is $d_k = 64$.

  3. Normalize (Softmax)

    To convert the raw scaled scores into attention weights (which sum to 1), a softmax function is applied to each row (row-wise) of the scaled score matrix. Let’s call this resulting attention matrix $A$.

  4. Compute Weighted Sum

    Finally, the attention weight matrix $A$ is multiplied by the Value matrix $V$:

    $$Z = A \cdot V$$

    Shape: $(T, T) \cdot (T, d_v) \rightarrow (T, d_v)$

    The result $Z$ is the new output matrix, where each row $z_i$ is the new representation of token $i$, computed as a weighted sum of all value vectors $v_j$ in the sequence.
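The four steps combine into a few lines of NumPy (a minimal sketch of scaled dot-product attention, taking $d_v = d_k$ for simplicity):

```python
import numpy as np

def softmax(s):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T): steps 1-3 (scores, scale, softmax)
    return A @ V, A                      # step 4: weighted sum of values

rng = np.random.default_rng(0)
T, d_k = 5, 4
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
Z, A = attention(Q, K, V)

print(Z.shape)                          # (5, 4)
print(np.allclose(A.sum(axis=1), 1.0))  # True: each row of A sums to 1
```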

MULTI-HEAD ATTENTION


In multi-head attention, the linear projections and the attention computation are repeated in parallel multiple times (e.g., $h = 8$ heads), each using a distinct set of weights.

Each head produces an output of dimension $d_k$, and the resulting outputs are concatenated along the feature axis. The concatenated vector is then projected through a final linear layer to restore the dimensionality of $d_{model}$:

$$\text{MHA}(Q, K, V) = \text{concat}(Z_1, \dots, Z_h)W^O$$

where:

$$Z_i = \text{head}_i = \text{Attention}\left(XW^Q_i, XW^K_i, XW^V_i\right)$$
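A minimal NumPy sketch of multi-head attention (head count and sizes are illustrative): each head gets its own projection weights, the head outputs are concatenated, and $W^O$ restores $d_{model}$:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2
d_k = d_model // h                 # each head works in a smaller subspace

X = rng.normal(size=(T, d_model))
heads = []
for _ in range(h):                 # every head has its own W^Q, W^K, W^V
    WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ WQ, X @ WK, X @ WV))

WO = rng.normal(size=(h * d_k, d_model))
out = np.concatenate(heads, axis=-1) @ WO  # concat along features, project back
print(out.shape)  # (5, 8): back to (T, d_model)
```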


FROM TOKEN REPRESENTATION TO CLASSIFICATION

The encoder produces a sequence of token embeddings, where each vector represents a contextualized version of its corresponding input token.

To perform classification, this sequence must be reduced to a single, fixed-size representation. Two common strategies are:

  • Average pooling: Compute the mean of all token embeddings and use the resulting vector as input to a Multi-Layer Perceptron (MLP) classifier:

    $$z_{\text{avg}} = \frac{1}{T} \sum_{t=1}^{T} z_t$$

    This produces a single feature vector with the same dimensionality as the original embeddings. The MLP then maps this vector to the number of output classes through a final linear layer.

  • [CLS] token: Introduce a special classification token prepended to the input sequence. Its embedding, $z_{\text{[CLS]}}$, after passing through the encoder, serves as a global representation of the entire sequence.

    The [CLS] embedding is then fed into an MLP for classification, while the remaining token embeddings are discarded.
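Both reduction strategies are one-liners once the encoder output $Z$ is available (shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 7, 16
Z = rng.normal(size=(T, d_model))  # encoder output: one contextualized vector per token

z_avg = Z.mean(axis=0)             # average pooling over the sequence
z_cls = Z[0]                       # [CLS] strategy: take the first (prepended) token

print(z_avg.shape, z_cls.shape)    # (16,) (16,): fixed-size inputs for the MLP head
```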

SUMMARY

Both Temporal Convolutional Networks (TCNs) and Transformers are powerful architectures for modeling sequential data, but they rely on different inductive biases and computational mechanisms. While TCNs are based on convolutional operations, Transformers use attention to model dependencies between sequence elements. The following table summarizes their main differences and characteristics:

| Aspect | Temporal Convolutional Networks (TCNs) | Transformers |
| --- | --- | --- |
| Modeling Mechanism | Use causal and dilated convolutions to capture temporal dependencies. | Use attention mechanisms to model relationships between all elements in a sequence. |
| Temporal Coverage | Capture local-to-mid-range temporal structures efficiently. | Learn arbitrary long-range dependencies with no fixed receptive field. |
| Training Stability | Generally stable and easy to train due to convolutional nature. | Training can be more complex but allows greater flexibility. |
| Context Handling | Best suited for sequential and time-ordered data. | Naturally handle multimodal and non-sequential contexts. |
| Applications / Influence | Effective for tasks with structured temporal signals (e.g., time series). | Serve as the foundation of modern architectures like BERT, GPT, and ViT. |

Loss Function for Regression

CLASSICAL REGRESSION LOSSES

  • Mean Squared Error (MSE)

    $$L_{MSE} = \frac{1}{N} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2$$

    • Penalizes large errors quadratically: This characteristic makes the loss function highly sensitive to outliers.
    • Reduces small errors: Conversely, if an error is small ($< 1$), squaring it makes it even smaller (e.g., $0.2^2 = 0.04$).
    • Optimization: The function is smooth and continuously differentiable, making it stable and well-suited for gradient-based optimization.
  • Mean Absolute Error (MAE)

    $$L_{MAE} = \frac{1}{N} \sum_{t=1}^{N} |y_t - \hat{y}_t|$$

    • Robust to outliers: It penalizes all deviations linearly.
    • Optimization: The function is not differentiable at zero (it has a “kink” or a discontinuous gradient). This can lead to instability during the training and optimization process, especially when using standard gradient-based methods.
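The two losses are easy to compare numerically; a single outlier dominates MSE far more than MAE:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

y     = np.array([1.0, 2.0, 3.0, 100.0])  # last point is an outlier
y_hat = np.array([1.0, 2.0, 3.0, 0.0])    # perfect except on the outlier

# The quadratic penalty blows the single error of 100 up to 10000 / 4.
print(mse(y, y_hat))  # 2500.0
print(mae(y, y_hat))  # 25.0
```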

FROM REGRESSION TO CLASSIFICATION VIA BINNING

By discretizing a continuous target variable into intervals (bins), we can transform a regression problem into a classification one and train the model using a cross-entropy loss to predict which interval the target value belongs to.

Age Prediction from an Image

  • Target variable (Regression): age $\in [0, 100]$
  • Target variable (Classification): Define $K$ bins, e.g.:
    • Class 1: [0–10]
    • Class 2: [10–20]
    • …
    • Class 10: [90–100]
  • Train a classifier with $K$ output classes (one per age interval)
  • The final prediction can be the center of the predicted bin (e.g., 25 for the [20–30] bin) or, more robustly, a weighted average of all bin centers based on the output class probabilities.

It is particularly useful when small variations in the target value are not critical, and the main interest lies in the broader range or category of the prediction.

Classification problems are often easier to optimize than regression ones, since the loss function provides a stronger and more focused learning signal through the softmax output — the gradient clearly points toward the correct class.
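A minimal NumPy sketch of the binning scheme for the age example, including the more robust expectation-based decoding (the class probabilities below are a hypothetical softmax output):

```python
import numpy as np

edges = np.arange(0, 101, 10)   # 10 bins: [0,10), [10,20), ..., [90,100]
centers = edges[:-1] + 5.0      # bin centers: 5, 15, ..., 95

# Regression target -> class label (np.digitize returns a 1-based bin index).
age = 34.0
label = np.digitize(age, edges) - 1
print(label, centers[label])    # 3 35.0: age 34 falls into the [30,40) bin

# More robust decoding: expectation of the bin centers under the class probabilities.
probs = np.zeros(10)
probs[[2, 3, 4]] = [0.2, 0.6, 0.2]  # hypothetical softmax output over the 10 classes
print(probs @ centers)              # 35.0
```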

LIMITATIONS OF MSE AND MAE

Both Mean Absolute Error (MAE) and Mean Squared Error (MSE) assume symmetric errors, meaning that overestimation and underestimation are penalized equally.

However, in many real-world scenarios, this assumption is not appropriate:

  • The cost of underpredicting can differ significantly from that of overpredicting (e.g., in energy demand, risk assessment, or stock forecasting).
  • In some cases, we are not interested in predicting a single “average” value, but rather in estimating a range of possible outcomes.

For example, with MSE, an error of +10 and an error of –10 contribute equally to the loss, since the sign is lost when squaring the difference. This symmetry ignores situations where one type of error is more critical than the other.

To address this limitation, we can use the Quantile Loss.

QUANTILES

The $\tau$-quantile of a probability distribution is the value $q_\tau$ such that a random variable $Y$ from the distribution will be less than or equal to $q_\tau$ with probability $\tau$:

$$P(Y \le q_\tau) = \tau$$


  • $\tau = 0.9$ (90th percentile):

    $q_{0.9}$ is the value such that 90% of the observations are below it and 10% are above.

If the 90th percentile of exam scores is 85, it means 90% of students scored below 85.

  • $\tau = 0.5$ (median):

    The median is the 0.5 quantile — half the observations are below it, half above.

In the ordered set [1, 3, 5, 7, 9], the median is 5.

  • $\tau = 0.1$ (10th percentile):

    $q_{0.1}$ is the value such that 10% of the data is below it and 90% above.

In income data, the 10th percentile is the income level below which the poorest 10% of people fall.

QUANTILE LOSS

The Quantile Loss allows the model to estimate specific quantiles of the target distribution, rather than just the mean. This introduces asymmetry in the loss, because the penalty for overestimation and underestimation depends on the chosen quantile $\tau$.

$$L_\tau(y, \hat{y}) = \begin{cases} \tau (y - \hat{y}) & \text{if } y \ge \hat{y}, \\ (1 - \tau)(\hat{y} - y) & \text{otherwise.} \end{cases}$$

Properties:

  • $y$ is the target, while $\hat{y}$ is the estimate.
  • $\tau \in (0, 1)$ is a hyperparameter that controls which quantile we want to estimate.
    • $\tau = 0.5 \to$ median (50th percentile)
    • $\tau = 0.9 \to$ 90th percentile (upper bound)
    • $\tau = 0.1 \to$ 10th percentile (lower bound)

Intuition

By adjusting $\tau$, we can control how much the model penalizes one side of the error:

  • For $\tau = 0.9$: underestimations are penalized 9× more than overestimations → the model prefers to slightly overpredict.
  • For $\tau = 0.1$: overestimations are penalized 9× more → the model prefers to underpredict.
  • For $\tau = 0.5$: both sides are penalized equally → proportional to MAE (scaled by 1/2).

The model “pays” more for errors on one side of the prediction, depending on the chosen quantile $\tau$.
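The case-based definition above is equivalent to the compact “pinball” form $\max(\tau e, (\tau - 1)e)$ with $e = y - \hat{y}$, which is easy to implement and verify:

```python
import numpy as np

def quantile_loss(y, y_hat, tau):
    """Pinball loss: tau * e when e = y - y_hat >= 0 (underestimate), (1 - tau) * |e| otherwise."""
    e = y - y_hat
    return np.mean(np.maximum(tau * e, (tau - 1) * e))

# With tau = 0.9, an underestimate costs 9x more than an overestimate of the same size.
print(quantile_loss(np.array([1.0]), np.array([0.0]), 0.9))  # 0.9  (underestimate by 1)
print(quantile_loss(np.array([0.0]), np.array([1.0]), 0.9))  # 0.1  (overestimate by 1)
```

Training with several values of $\tau$ (e.g., 0.1, 0.5, 0.9) yields a prediction interval rather than a single point estimate.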

Key Takeaways

| Topic | Key Point |
| --- | --- |
| 1D Convolution | Adapts CNN convolutions to sequential data by sliding a kernel along the temporal axis |
| Dilated Convolution | Introduces gaps between kernel elements to expand the receptive field without adding parameters |
| Causal Convolution | Ensures output at time $t$ depends only on inputs $\le t$, enforcing temporal causality |
| TCN | Stacks residual blocks with dilated causal convolutions; parallelizable and stable vs. RNNs |
| Transformer | Encoder–decoder architecture using attention instead of convolutions to model arbitrary long-range dependencies |
| Self-Attention | Each token queries all others via Q, K, V projections; computes weighted combinations based on content similarity |
| Multi-Head Attention | Runs multiple attention heads in parallel, then concatenates — captures diverse relational patterns |
| Layer Normalization | Normalizes per-token features (not across the batch) to stabilize training in deep transformers |
| MSE vs MAE | MSE penalizes outliers quadratically; MAE is robust but non-differentiable at zero |
| Quantile Loss | Asymmetric loss that estimates specific quantiles of the target distribution, useful when over/underprediction costs differ |