Transformer-Model-from-Scratch

Transformers are widely used in the NLP domain and deliver state-of-the-art performance across a wide range of tasks. They form the backbone of many modern models, including BERT, GPT, and T5.


Transformer Overview

The Transformer architecture consists of two main components:

  1. Encoder
  2. Decoder

Figure 1: Transformer Architecture (from the paper Attention Is All You Need, Vaswani et al., 2017)


Positional Encoding

The positional encoding matrix is added to the input embeddings to provide the model with information about the positions of tokens in the sequence. The positional encoding vectors are generated using sinusoidal functions:

  • Even-indexed dimensions: $$\text{PE}_{(p, 2i)} = \sin\left(\frac{p}{10000^{2i/d}}\right)$$
  • Odd-indexed dimensions: $$\text{PE}_{(p, 2i+1)} = \cos\left(\frac{p}{10000^{2i/d}}\right)$$

Where:

  • $p$: Position in the sequence
  • $i$: Dimension index
  • $d$: Dimensionality of the encoding
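The sinusoidal encoding above can be sketched in PyTorch as follows (the function name and shapes are illustrative, not the repository's actual code):

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 10000^(2i/d) for each even dimension index 2i
    div_term = torch.pow(
        10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even-indexed dimensions
    pe[:, 1::2] = torch.cos(position / div_term)  # odd-indexed dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
```

The resulting matrix is simply added to the token embeddings, e.g. `x = token_embeddings + pe[:seq_len]`.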


Encoder

The encoder processes the input sequence as a whole. For each token in the sequence, it computes the following:

  • Query, Key, and Value vectors: These are derived using learnable weight matrices.
  • Scaled Dot-Product Attention: The attention mechanism calculates the relevance of other tokens in the sequence to the current token. It is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:

  • $Q$: Query matrix
  • $K$: Key matrix
  • $V$: Value matrix
  • $d_k$: Dimension of the key vectors
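A minimal PyTorch sketch of this attention formula (tensor shapes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k). Returns (output, attention_weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    if mask is not None:
        # Positions where the mask is 0 are excluded from attention.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ v, weights

q = k = v = torch.randn(2, 4, 8)
out, weights = scaled_dot_product_attention(q, k, v)
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.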

Decoder

The decoder generates the output sequence one token at a time, attending to both the encoder outputs and the tokens generated so far. It shares many components with the encoder but adds causal masking to ensure proper autoregressive generation.
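The causal mask can be built as a lower-triangular boolean matrix; a sketch (the helper name is illustrative):

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

mask = causal_mask(4)
# Applied before the softmax, scores where the mask is False are set to -inf,
# so each token attends only to itself and earlier tokens.
```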


How It Works

  1. Input Sequence: Tokenized and processed by the encoder.
  2. Attention Mechanism: Captures dependencies within and across sequences.
  3. Output Sequence: Generated one token at a time, conditioned on the encoder outputs and previously generated tokens.
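The three steps above correspond to a standard encoder-decoder forward pass. As a sketch, PyTorch's built-in `nn.Transformer` module can express the same flow (all dimensions and layer counts here are illustrative, not the repository's configuration):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 64, 100
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
out_proj = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (1, 10))   # tokenized input sequence
tgt = torch.randint(0, vocab_size, (1, 7))    # tokens generated so far
# Causal mask so the decoder cannot attend to future target positions
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

hidden = model(embed(src), embed(tgt), tgt_mask=tgt_mask)  # (1, 7, d_model)
logits = out_proj(hidden)                                  # next-token scores
```

At inference time, this forward pass is repeated in a loop, appending the highest-scoring token to `tgt` each step.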

Training and Validation Loss

During training, the training loss and validation loss are monitored to evaluate model performance and detect overfitting.

Figure: Training and validation loss curves
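A minimal sketch of tracking both losses per epoch (a toy linear model and random data stand in for the actual Transformer and dataset):

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real setup would use the Transformer and a tokenized corpus.
vocab_size = 100
model = nn.Linear(16, vocab_size)
criterion = nn.CrossEntropyLoss()          # standard loss for token prediction
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_x, train_y = torch.randn(32, 16), torch.randint(0, vocab_size, (32,))
val_x, val_y = torch.randn(8, 16), torch.randint(0, vocab_size, (8,))

train_losses, val_losses = [], []
for epoch in range(3):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(train_x), train_y)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())

    model.eval()
    with torch.no_grad():                  # no gradients needed for validation
        val_losses.append(criterion(model(val_x), val_y).item())
```

Plotting `train_losses` against `val_losses` produces the loss curves described above.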


Sample Output

Here are some examples of predicted tokens generated by the Transformer model compared to the true tokens for a neural machine translation task:

| Example | Predicted Tokens | True Tokens |
|---|---|---|
| 1 | a man in an orange hat, something something. | a man in an orange hat staring at something. |
| 2 | a is walking across grass grass grass in front of a white fence. | a boston terrier is running on lush green grass in front of a white fence. |
| 3 | a girl in a jacket is at toy with a toy of. | a girl in karate uniform breaking a stick with a front kick. |
| 4 | five people in winter clothes and helmets are with the snow with with in the background. | five people wearing winter jackets and helmets stand in the snow, with in the background. |
| 5 | people are walking the roof of a house. | people are fixing the roof of a house. |
| 6 | a man dressed a blue clothing, a group of men in dark suits and a are around a woman in in a street. | a man in light-colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a gown. |
| 7 | a group of people are in front of a outdoor. | a group of people standing in front of an igloo. |
| 8 | a boy in a red jersey is trying to hit the the the the while while while the other, the blue jersey, trying to be the. | a boy in a red uniform is attempting to avoid getting out at home plate, while the catcher in the blue uniform is attempting to catch him. |
| 9 | a guy working on a building. | a guy works on a building. |
| 10 | a man in a vest is sitting on a chair while holding . | a man in a vest is sitting in a chair and holding magazines. |

Acknowledgements

