Transformers are widely used in the NLP domain and provide state-of-the-art performance across a wide range of tasks. They form the backbone of many modern models, such as BERT, GPT, and T5.
The Transformer architecture consists of two main components:
- Encoder
- Decoder
Figure 1: Transformer Architecture (from the paper Attention Is All You Need, Vaswani et al., 2017)
The positional encoding matrix is added to the input embeddings to provide the model with information about the positions of tokens in the sequence. The positional encoding vectors are generated using sinusoidal functions:
- Even-indexed dimensions:
$$\text{PE}_{(p, 2i)} = \sin\left(\frac{p}{10000^{2i/d}}\right)$$
- Odd-indexed dimensions:
$$\text{PE}_{(p, 2i+1)} = \cos\left(\frac{p}{10000^{2i/d}}\right)$$
Where:
- $p$: Position in the sequence
- $i$: Dimension index
- $d$: Dimensionality of the encoding
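The two formulas above can be sketched in a few lines of NumPy (an illustrative implementation, not code from the original post; the function name is mine):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]        # p, shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(p, 2i)
    pe[:, 1::2] = np.cos(angles)  # PE(p, 2i+1)
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

Because the wavelengths form a geometric progression across dimensions, each position receives a unique pattern, and relative offsets correspond to linear transformations of the encoding.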
The encoder processes the input sequence as a whole. For each token in the sequence, it computes the following:
- Query, Key, and Value vectors: These are derived using learnable weight matrices.
- Scaled Dot-Product Attention: The attention mechanism calculates the relevance of the other tokens in the sequence to the current token. It is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$: Query matrix
- $K$: Key matrix
- $V$: Value matrix
- $d_k$: Dimension of the key vectors
The decoder generates the output sequence one token at a time, attending to both the encoder outputs and the tokens generated so far. It shares many components with the encoder but adds causal masking to ensure proper autoregressive generation.
- Input Sequence: Tokenized and processed by the encoder.
- Attention Mechanism: Captures dependencies within and across sequences.
- Output Sequence: Generated one token at a time, conditioned on the encoder outputs and previously generated tokens.
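The causal masking mentioned above can be illustrated with a small NumPy sketch (my own example, not from the original post): future positions are set to a large negative score before the softmax, so each token attends only to itself and earlier tokens.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask: position i may attend to positions <= i
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V, mask):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -1e9)  # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

n, d = 5, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
out, w = masked_attention(x, x, x, causal_mask(n))
# Entries above the diagonal of w are (numerically) zero
```

During training this lets the decoder process the whole target sequence in parallel while still behaving as if it generated tokens left to right.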
During training, the training loss and validation loss are monitored to evaluate model performance. The figure below shows example loss curves:

Figure 2: Training and validation loss curves
Here are some examples of predicted tokens generated by the Transformer model compared to the true tokens for a neural machine translation task:
| Example | Predicted Tokens | True Tokens |
|---|---|---|
| 1 | a man in an orange hat, something something. | a man in an orange hat staring at something. |
| 2 | a is walking across grass grass grass in front of a white fence. | a boston terrier is running on lush green grass in front of a white fence. |
| 3 | a girl in a jacket is at toy with a toy of. | a girl in karate uniform breaking a stick with a front kick. |
| 4 | five people in winter clothes and helmets are with the snow with with in the background. | five people wearing winter jackets and helmets stand in the snow, with in the background. |
| 5 | people are walking the roof of a house. | people are fixing the roof of a house. |
| 6 | a man dressed a blue clothing, a group of men in dark suits and a are around a woman in in a street. | a man in light-colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a gown. |
| 7 | a group of people are in front of a outdoor. | a group of people standing in front of an igloo. |
| 8 | a boy in a red jersey is trying to hit the the the the while while while the other, the blue jersey, trying to be the. | a boy in a red uniform is attempting to avoid getting out at home plate, while the catcher in the blue uniform is attempting to catch him. |
| 9 | a guy working on a building. | a guy works on a building. |
| 10 | a man in a vest is sitting on a chair while holding . | a man in a vest is sitting in a chair and holding magazines. |
- Attention is All You Need
- Towards Data Science - Build Your Own Transformer
- The AI Summer - Transformer Overview
- Machine Learning Mastery - Building Transformers


