Understanding Attention: Implementing a Transformer from Scratch in PyTorch
When Vaswani et al. published “Attention Is All You Need” in 2017, it reshaped the landscape of Natural Language Processing. The paper introduced the Transformer architecture, which discarded recurrent connections entirely in favor of self-attention.
In this blog post, we walk through the critical steps of rebuilding that original mathematical breakthrough from scratch in PyTorch.
1. Discarding Recurrence: The Need for Positional Encoding
Because Transformers process tokens in parallel rather than sequentially, the model inherently lacks an understanding of order. How does it know that the word “dog” came before “bites”?
The answer is Positional Encoding. We add fixed sinusoidal position embeddings to the standard token embeddings: sines for even dimensions and cosines for odd dimensions, with wavelengths forming a geometric progression across the embedding dimensions. This gives the model a consistent signal from which it can infer both absolute and relative positions.
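Concretely, the whole table can be precomputed once and added to the token embeddings. Below is a minimal PyTorch sketch of the paper’s formula; the function name `positional_encoding` is our own choice, not from the paper:

```python
import torch

def positional_encoding(max_len, d_model):
    """Fixed sinusoidal table of shape (max_len, d_model)."""
    pos = torch.arange(max_len).float().unsqueeze(1)       # (max_len, 1)
    i = torch.arange(d_model)                              # dimension index
    # pos / 10000^(2 * floor(i/2) / d_model), broadcast to (max_len, d_model)
    angle = pos / 10000 ** ((2 * (i // 2)).float() / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle[:, 0::2])   # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle[:, 1::2])   # odd dimensions: cosine
    return pe
```

In practice this table is registered as a buffer and simply added to the embedding output, so it costs nothing at training time.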
```python
# Angle term for position pos and embedding dimension i (of d_model total):
angle = lambda pos, i, d_model: pos / 10000 ** (2 * (i // 2) / d_model)
# Even dimensions i use sin(angle); odd dimensions use cos(angle).
```
2. The Core Mechanic: Scaled Dot-Product Attention
The defining aspect of the Transformer is its attention calculation. Simply put, attention is a method for deciding which other words in the sentence each individual word should attend to when building its representation.
- Queries, Keys, and Values: Think of a soft database lookup. A Query asks a question about the sequence; Keys advertise what each token offers; Values carry the content that is actually retrieved.
- Scaling: We matrix-multiply Queries with Keys, then divide the result by the square root of the key dimension ($\sqrt{d_k}$) so that large-magnitude logits do not push the softmax into regions with near-zero gradients.
- Softmax: The scaled scores pass through a softmax, producing attention “weights” that sum to one per query; these weights form a weighted average over the Values.
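The three steps above condense into a few lines of PyTorch. A sketch, assuming the helper name `attention` and the convention that `mask == 0` marks forbidden positions (both our own choices):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k); scores: (..., seq_len, seq_len)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # scaled dot product
    if mask is not None:
        # forbidden positions get -inf logits, i.e. zero weight after softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights
```

Returning the weights alongside the output is handy for visualizing what each token attends to.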
3. Creating Breadth: Multi-Head Attention
A single attention mechanism can only express one pattern of relationships at a time, yet a word may relate to its neighbors in several ways at once: syntactically (linking subject and verb), semantically, or by coreference. One set of weights is forced to average these signals together.
By linearly projecting Queries, Keys, and Values into $h$ lower-dimensional subspaces (the “heads”), the implementation runs $h$ independent scaled dot-product attention computations in parallel. We concatenate the resulting vectors and pass them through a final linear projection, letting the model combine these overlapping contextual cues.
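A minimal module sketch of this split-attend-concatenate pattern, assuming `d_model` divides evenly by `h`; the layer names `w_q` through `w_o` are our own:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project into h heads, attend per head, concatenate, re-project."""
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final linear projection

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # project, then reshape to (batch, heads, seq_len, d_k)
        def split(x, w):
            return w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = scores.softmax(dim=-1) @ v
        # merge heads back: (batch, seq_len, h * d_k) == (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```

Note that the per-head work is batched as one big matrix multiply over the head dimension, not an explicit Python loop.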
4. The Building Blocks: Encoders and Decoders
The Encoder
In the original paper, the embedded sequence passes through a stack of six identical Encoder layers. Each layer uses:
- A Multi-Head Self-Attention sublayer that relates every token to all others simultaneously.
- A position-wise Feed-Forward Network that processes each token independently.
Residual connections (adding a sublayer’s input to its output) and Layer Normalization wrap each sublayer, keeping gradients stable as tokens move up the stack and accumulate context from the entire input.
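Putting the sublayers together, one encoder block might look like the sketch below, using PyTorch’s built-in `nn.MultiheadAttention` for brevity and the paper’s post-norm ordering; the default sizes (512-dim model, 8 heads, 2048-dim FFN) follow the paper:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention + position-wise FFN, each wrapped in residual + LayerNorm."""
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)          # self-attention: Q = K = V = x
        x = self.norm1(x + self.drop(attn_out))   # residual add, then LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```

Stacking six of these (e.g. in an `nn.ModuleList`) reproduces the encoder side of the original architecture.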
The Decoder
The Decoder mirrors the Encoder but introduces a major constraint and a second attention step:
- Masked Self-Attention: During training we block attention to future positions with a causal (lower-triangular) mask, setting the corresponding logits to $-\infty$ before the softmax. This mimics the autoregressive constraint at inference time, so the model cannot “cheat” by peeking at the next true token.
- Cross-Attention: A second Multi-Head Attention step that takes the Encoder’s final output as Keys and Values, queried by the Decoder’s own contextual Queries. This binds the source-language sequence to the target sequence being generated.
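Both decoder-specific pieces fit in one block. A sketch, again leaning on `nn.MultiheadAttention` (whose boolean `attn_mask` marks *disallowed* positions with `True`) and the paper’s post-norm ordering; the argument name `memory` for the encoder output is our own:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention, cross-attention over the encoder output, FFN;
    residual + LayerNorm around each sublayer."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, memory):
        t = x.size(1)
        # causal mask: True marks future positions the model may NOT attend to
        future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(x, x, x, attn_mask=future)
        x = self.norms[0](x + a)
        # cross-attention: Queries from the decoder, Keys/Values from the encoder
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norms[1](x + a)
        return self.norms[2](x + self.ffn(x))
```

The source and target sequences may have different lengths; only the Keys and Values come from the encoder side, so the output keeps the decoder’s sequence length.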
Conclusion
Building a Transformer from scratch forces a concrete understanding of attention, masking, and scaled representations. By removing the dependency on slow, serial recurrent loops, the “Attention Is All You Need” architecture made training massively parallelizable and laid the foundation for the Large Language Models we rely upon today.
