The structure of Longformer

Tiya Vaj
Feb 29, 2024

The Longformer architecture is an extension of the Transformer model, specifically designed to handle longer sequences of text efficiently. Here’s an overview of its structure:

1. Self-Attention Mechanism:
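Like the standard Transformer, Longformer relies on self-attention to capture dependencies between tokens in the input sequence. Standard self-attention compares every token with every other token, so its time and memory cost grows quadratically with sequence length. Longformer replaces this full attention with a pattern whose cost grows linearly, which is what makes long inputs practical.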

2. Global Attention:
Longformer adds global attention on a small number of pre-selected positions, typically task-specific tokens such as the classification token or the question tokens in question answering. A token with global attention attends to every position in the sequence, and every position attends to it. This lets information travel across the whole input, so the model can relate tokens that are far apart.
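To make the global pattern concrete, here is a minimal sketch in plain Python. The function name `global_attention_mask` and the choice of position 0 as the global token are assumptions for illustration, not part of any library.

```python
def global_attention_mask(seq_len, global_positions):
    """Return a seq_len x seq_len grid; True means query i may attend to key j.

    Hypothetical helper: a pair (i, j) is allowed whenever either token
    is marked as global, which makes the pattern symmetric.
    """
    is_global = [False] * seq_len
    for pos in global_positions:
        is_global[pos] = True
    return [[is_global[i] or is_global[j] for j in range(seq_len)]
            for i in range(seq_len)]


# Mark only the first token (for example a classification token) as global.
mask = global_attention_mask(6, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the first row and the first column are fully filled: the global token sees everything, and everything sees the global token.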

3. Sparse Attention:
Longformer employs a sparse attention pattern to stay within the memory and compute limits of long sequences. Instead of attending to every token in the input, each token attends only to a subset: its local window plus any tokens marked for global attention. With a fixed window size, the number of attended pairs grows linearly with sequence length rather than quadratically.
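A quick way to see the saving is to count how many (query, key) pairs each scheme has to score. The sketch below assumes a symmetric window in which each token looks `window // 2` positions to each side. The function names are made up for this example, and the sequence length and window size are illustrative values.

```python
def full_attention_pairs(seq_len):
    """Every token attends to every token: seq_len squared pairs."""
    return seq_len * seq_len


def windowed_attention_pairs(seq_len, window):
    """Count pairs (i, j) with |i - j| <= window // 2, clipped at the edges."""
    half = window // 2
    total = 0
    for i in range(seq_len):
        first = max(0, i - half)
        last = min(seq_len - 1, i + half)
        total += last - first + 1
    return total


n, w = 4096, 512  # illustrative values
full = full_attention_pairs(n)
sparse = windowed_attention_pairs(n, w)
print(f"full: {full:,} pairs, windowed: {sparse:,} pairs, ratio: {full / sparse:.1f}x")
```

Doubling `n` roughly doubles the windowed count but quadruples the full count, which is the linear-versus-quadratic difference in practice.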

4. Local Attention:
Alongside global attention, Longformer uses local (sliding-window) attention, where each token attends directly to the tokens within a fixed distance on either side of it. Local attention captures short-range context, and stacking layers widens the effective receptive field, while global attention covers long-range dependencies.
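The local pattern can be sketched the same way. This is a minimal illustration, assuming a window that extends `window // 2` positions to each side; `sliding_window_mask` and `attended_positions` are hypothetical helper names.

```python
def sliding_window_mask(seq_len, window):
    """True where query i may attend to key j under a symmetric local window."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]


def attended_positions(mask, i):
    """List the key positions that query i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = sliding_window_mask(seq_len=8, window=4)
print(attended_positions(mask, 3))  # a token in the middle of the sequence
print(attended_positions(mask, 0))  # a token at the left edge has a clipped window
```

The mask forms a diagonal band: tokens in the middle see two neighbours on each side, and tokens at the edges see fewer because the window is clipped.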

5. Positional Embeddings:
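Longformer uses positional embeddings to encode where each token sits in the input sequence. These are learned absolute position embeddings, as in RoBERTa, the model Longformer is initialized from; the released checkpoints extend them from 512 to 4,096 positions by copying the pretrained embeddings repeatedly. Without positional embeddings, self-attention would have no information about token order.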
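One way to give a model pretrained on short inputs a longer position table is to tile the embeddings it already has, which is the initialization the Longformer paper describes for going from 512 to 4,096 positions. The sketch below is a simplified illustration of that idea on a toy table. The function name is hypothetical, and it ignores details a real checkpoint conversion has to handle, such as reserved padding positions.

```python
def extend_position_embeddings(pretrained, new_max_len):
    """Build a longer position table by repeating the pretrained rows in order.

    `pretrained` is a list of embedding vectors, one per position.
    Position p in the new table gets a copy of row p % len(pretrained).
    """
    old_len = len(pretrained)
    return [list(pretrained[pos % old_len]) for pos in range(new_max_len)]


# Toy table: 3 positions, 2 dimensions each (real tables are far larger).
toy_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
extended = extend_position_embeddings(toy_table, new_max_len=7)
for pos, vector in enumerate(extended):
    print(pos, vector)
```

The copied rows are only a starting point; they would normally be fine-tuned further so that the model learns to tell the repeated segments apart.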

6. Transformer Layers:
Longformer stacks multiple Transformer blocks, each comprising a self-attention module, a feed-forward network, and layer normalization. Stacking these layers lets the model build hierarchical features and capture complex patterns in the input sequence.
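Putting the pieces together, a single block can be sketched with NumPy as masked self-attention followed by a feed-forward network, each wrapped in a residual connection and layer normalization. This is a toy, single-head sketch under several assumptions: the parameter names are invented, the weights are random, ReLU stands in for the activation, and the mask is applied to a dense score matrix, so it shows the attention pattern but not the memory savings of a real banded implementation. It also reuses one set of projections for local and global attention, which is a simplification.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def masked_self_attention(x, w_q, w_k, w_v, mask):
    """Single-head attention where disallowed (query, key) pairs get ~zero weight."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)            # block pairs outside the pattern
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def transformer_block(x, params, mask):
    """Attention then feed-forward, each with a residual connection and layer norm."""
    attn_out, _ = masked_self_attention(
        x, params["w_q"], params["w_k"], params["w_v"], mask)
    x = layer_norm(x + attn_out)
    hidden = np.maximum(0.0, x @ params["w_1"])      # ReLU as a stand-in activation
    return layer_norm(x + hidden @ params["w_2"])


rng = np.random.default_rng(0)
n, d = 8, 4                                          # toy sequence length and width
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= 1      # local window: one neighbour per side
mask[0, :] = True                                    # token 0 is global: it sees all tokens
mask[:, 0] = True                                    # and all tokens see it
params = {name: rng.normal(size=shape) for name, shape in [
    ("w_q", (d, d)), ("w_k", (d, d)), ("w_v", (d, d)),
    ("w_1", (d, 4 * d)), ("w_2", (4 * d, d))]}
x = rng.normal(size=(n, d))
out = transformer_block(x, params, mask)
print(out.shape)
```

Stacking several such blocks is what lets information spread beyond a single window: each layer passes context one window further, and the global token offers a one-hop shortcut between any two positions.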

Overall, the Longformer architecture extends the Transformer model to handle longer sequences by incorporating global and sparse attention mechanisms while maintaining the ability to capture both local and global dependencies within the input text. This makes Longformer well-suited for tasks involving long documents or sequences of text.


