Transformers are widely used in the NLP domain and provide state-of-the-art performance across a wide range of tasks. They form the backbone of many modern models, such as BERT, GPT, and T5.
The Transformer architecture consists of two main components:
- Encoder
- Decoder
Figure 1: Transformer Architecture (from the paper Attention Is All You Need, Vaswani et al., 2017)
The positional encoding matrix is added to the input embeddings to provide the model with information about the positions of tokens in the sequence. The positional encoding vectors are generated using sinusoidal functions:
- Even-indexed dimensions:
$$\text{PE}_{(p, 2i)} = \sin\left(\frac{p}{10000^{2i/d}}\right)$$
- Odd-indexed dimensions:
$$\text{PE}_{(p, 2i+1)} = \cos\left(\frac{p}{10000^{2i/d}}\right)$$
Where:
- $p$: Position in the sequence
- $i$: Dimension index
- $d$: Dimensionality of the encoding
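The two formulas above can be sketched in a few lines of NumPy (an illustrative implementation, not code from the original post; the function name is mine):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]        # p, shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(p, 2i)
    pe[:, 1::2] = np.cos(angles)  # PE(p, 2i+1)
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

Because the wavelengths form a geometric progression across dimensions, each position receives a unique pattern, and relative offsets correspond to linear transformations of the encoding.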
The encoder processes the input sequence as a whole. For each token in the sequence, it computes the following:
- Query, Key, and Value vectors: These are derived using learnable weight matrices.
- Scaled Dot-Product Attention: The attention mechanism calculates the relevance of the other tokens in the sequence to the current token. It is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$: Query matrix
- $K$: Key matrix
- $V$: Value matrix
- $d_k$: Dimension of the key vectors
The decoder generates the output sequence one token at a time, attending to both the encoder outputs and the tokens generated so far. It shares many components with the encoder but adds causal masking to ensure proper autoregressive generation.
- Input Sequence: Tokenized and processed by the encoder.
- Attention Mechanism: Captures dependencies within and across sequences.
- Output Sequence: Generated one token at a time, conditioned on the encoder outputs and previously generated tokens.
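The causal masking mentioned above can be illustrated with a small NumPy sketch (my own example, not from the original post): future positions are set to a large negative score before the softmax, so each token attends only to itself and earlier tokens.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask: position i may attend to positions <= i
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V, mask):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -1e9)  # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

n, d = 5, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
out, w = masked_attention(x, x, x, causal_mask(n))
# Entries above the diagonal of w are (numerically) zero
```

During training this lets the decoder process the whole target sequence in parallel while still behaving as if it generated tokens left to right.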
During training, the training loss and validation loss are monitored to evaluate model performance. The figure below shows example loss curves:

Figure 2: Training and validation loss curves
Here are some examples of predicted tokens generated by the Transformer model compared to the true tokens for a neural machine translation task:
| Example | Predicted Tokens | True Tokens |
|---|---|---|
| 1 | a man in an orange hat, something something. | a man in an orange hat staring at something. |
| 2 | a is walking across grass grass grass in front of a white fence. | a boston terrier is running on lush green grass in front of a white fence. |
| 3 | a girl in a jacket is at toy with a toy of. | a girl in karate uniform breaking a stick with a front kick. |
| 4 | five people in winter clothes and helmets are with the snow with with in the background. | five people wearing winter jackets and helmets stand in the snow, with in the background. |
| 5 | people are walking the roof of a house. | people are fixing the roof of a house. |
| 6 | a man dressed a blue clothing, a group of men in dark suits and a are around a woman in in a street. | a man in light-colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a gown. |
| 7 | a group of people are in front of a outdoor. | a group of people standing in front of an igloo. |
| 8 | a boy in a red jersey is trying to hit the the the the while while while the other, the blue jersey, trying to be the. | a boy in a red uniform is attempting to avoid getting out at home plate, while the catcher in the blue uniform is attempting to catch him. |
| 9 | a guy working on a building. | a guy works on a building. |
| 10 | a man in a vest is sitting on a chair while holding . | a man in a vest is sitting in a chair and holding magazines. |
- Attention is All You Need
- Towards Data Science - Build Your Own Transformer
- The AI Summer - Transformer Overview
- Machine Learning Mastery - Building Transformers


