Transformer Models 101: Getting Started — Part 2

The complex math behind transformer models, in simple words

Nandini Bansal
Towards Data Science



In the previous article, we looked at how the Encoder block of the Transformer model works in detail. If you haven’t read it yet, I recommend doing so before starting this one, as the concepts covered there are carried forward in this article.

If you have already read it, awesome! Let’s get started with a deep dive into the Decoder block and the complex maths associated with it.

Decoder of Transformer

Like the Encoder block, the Decoder block of the Transformer model consists of N stacked decoders that function sequentially, each accepting the input from the previous decoder. However, that is not the only input a decoder accepts: the sentence representation generated by the Encoder block is fed to every decoder in the Decoder block. Therefore, we can conclude that each decoder accepts two different inputs:

  • Sentence representation from the Encoder Block
  • The output of the previous Decoder
Fig 1. Encoder & Decoder blocks functioning together (Image by Author)

Before we delve any deeper into the different components that make up a Decoder, it is essential to have an intuition of how the decoder in general generates the output sentence or target sentence.

How is the target sentence generated?

At timestep t=1, only the <sos> (start of sentence) token is passed as input to the decoder block. Based on <sos>, the decoder block generates the first word of the target sentence.

At the next timestep, i.e. t=2, the input to the decoder block includes the <sos> token as well as the first word generated by the decoder block. The next word is generated based on this input.

Similarly, with every timestep increment, the length of the input to the decoder block grows, as the word generated in the previous timestep is appended to the current input sentence.

When the decoder block completes generating the entire target sentence, it produces the <eos> (end of sentence) token.

You can think of it as a recursive process!

Fig. 2 Recursive generation of output tokens using Decoder (Image by Author)
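To make the timestep-by-timestep description above concrete, here is a minimal Python sketch of the greedy generation loop. Note that `transformer_step` is a hypothetical stand-in for the full model's forward pass (not part of any library); it would take the encoder representation and the tokens generated so far, and return the most probable next token.

```python
def greedy_decode(transformer_step, encoder_representation, max_len=50,
                  sos_token="<sos>", eos_token="<eos>"):
    """Generate the target sentence one token per timestep."""
    generated = [sos_token]                          # t = 1: only <sos> is fed in
    for _ in range(max_len):
        next_token = transformer_step(encoder_representation, generated)
        generated.append(next_token)                 # the input grows every timestep
        if next_token == eos_token:                  # stop once <eos> is produced
            break
    return generated
```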

Now, this is what is supposed to happen when input is given to the transformer model and we are expecting an output. But at the time of training/finetuning the transformer model, we already have the target sentence in the training dataset. So how does it work?

This brings us to an extremely important concept in Decoders: Masked Multi-head Attention. Sounds familiar? Of course it does. In the previous part, we understood the concept of Multi-head Attention used in the Encoder block. Let us now understand how the two differ.

Masked Multi-head Attention

The decoder block generates the target sentence word by word, and hence the model has to be trained the same way, so that it can make accurate predictions even when only a limited set of tokens is available.

Hence, as the name suggests, before calculating the self-attention matrix we mask all the tokens to the right of the current position, i.e. the words that have not been predicted yet. This ensures that the self-attention mechanism only considers the tokens that will actually be available to the model at each recursive step of prediction.

Let us take a simple example to understand it:

Fig 3. Masked Multi-head attention matrix representation (Image by Author)

The steps and formula to calculate the self-attention matrix are the same as in the Encoder block. We will cover the steps at a high level in this article; for a deeper understanding, please feel free to head to the previous part of this article series.

  • Generate embeddings for the target sentence and obtain the target matrix Y
  • Transform the target sentence into Q, K & V by multiplying the target matrix Y with the weight matrices Wq, Wk & Wv (randomly initialised and then learned)
  • Calculate the dot product of Q and K-transpose
  • Scale the dot product by dividing it by the square root of the key dimension (dk)
  • Apply the mask to the scaled matrix by replacing all masked cells (the future tokens) with -inf
  • Apply the softmax function to the matrix and multiply it with the Vi matrix to generate the attention matrix Zi
  • Concatenate the multiple attention matrices Zi into a single attention matrix M
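The steps above map almost line for line onto code. Below is a minimal NumPy sketch of a single masked self-attention head; the weight matrices `Wq`, `Wk` and `Wv` would be learned parameters in a real model, and the multi-head version simply runs several such heads and concatenates their outputs Zi into M.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Y, Wq, Wk, Wv):
    """One head of masked self-attention over the target embeddings Y (N x d_model)."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv                      # project the target matrix into Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # scaled dot product of Q and K-transpose
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # cells above the diagonal = future tokens
    scores = np.where(future, -np.inf, scores)            # mask them out with -inf
    return softmax(scores) @ V                            # softmax, then multiply with V -> Zi
```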

This attention matrix will be fed to the next component of the Decoder block along with the input sentence representation generated by the Encoder block. Let us now understand how both of these matrices are consumed by the Decoder block.

Multi-head Attention

This sublayer of the Decoder block is also known as the “Encoder-Decoder Attention Layer”, as it accepts both the masked attention matrix (M) and the sentence representation from the Encoder (R).

Fig 4. Multihead Attention Mechanism (Image by Author)

The calculation of the self-attention matrix is very similar to how it is done in the previous step with a small twist. Since we have two input matrices for this layer, they are transformed into Q, K & V as follows:

  • Q is generated using Wq and M
  • K & V matrices are generated using Wk & Wv with R

By now you must’ve understood that every step and calculation that goes behind the Transformer model has a very specific reason. Similarly, there is also a reason why each of these matrices is generated using a different input matrix. Can you guess?

Quick Hint: The answer lies in how the self-attention matrix is calculated...

Yes, you got it right!

If you recall, when we understood the concept of self-attention using an input sentence, we talked about how it calculates attention scores while mapping the source sentence to itself. Every word in the source sentence is compared against every other word in the same sentence to quantify the relationships and understand the context.

Here we are doing the same thing, the only difference being that we are comparing each word of the input sentence (via K-transpose) with the words of the target sentence (via Q). This helps us quantify how similar the two sentences are to each other and understand the relationships between their words.

Fig 5. Attention Matrix Representation with input sentence & target sentence (Image by Author)

In the end, the attention matrix Zi generated will have one row per target word, i.e. N rows, where N = word count of the target sentence.

Since this is also a multi-head attention layer, to generate the final attention matrix, multiple attention matrices are concatenated.
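Putting it together, here is a minimal NumPy sketch of one Encoder-Decoder attention head, under the same assumptions as the earlier sketch (the weight matrices are learned parameters). Notice that no mask is needed here, since the full source sentence is always available to the decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(M, R, Wq, Wk, Wv):
    """One head of Encoder-Decoder attention: queries come from the decoder side (M),
    keys and values come from the encoder representation (R)."""
    Q = M @ Wq                                   # target-side queries
    K = R @ Wk                                   # source-side keys
    V = R @ Wv                                   # source-side values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # each target word scored against each source word
    return softmax(scores) @ V                   # one head's output Zi (one row per target word)
```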

With this, we have covered all the unique components of the Decoder block. However, some other components function the same as in Encoder Block. Let us also look at them briefly:

  • Positional Encoding — Just like in the encoder block, to preserve the word order of the target sentence, we add positional encoding to the target embedding before feeding it to the Masked Multi-head Attention layer.
  • Feedforward Network — This sublayer in the decoder block is a classic neural network with two dense layers and a ReLU activation in between. It accepts input from the multi-head attention layer, performs a non-linear transformation on it and generates contextualised vectors.
  • Add & Norm Component — This is a residual connection followed by layer normalisation. It helps the model train faster while ensuring no information from the sub-layers is lost.

We have covered these concepts in detail in Part 1.
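For completeness, here is a rough NumPy sketch of the Feedforward Network and Add & Norm sub-layers described above, assuming `W1`, `b1`, `W2`, `b2` are learned weights (in the original paper the inner layer is wider than the model dimension, 2048 vs. 512). A real implementation would also include the learnable scale and bias of layer normalisation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two dense layers with a ReLU activation in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer_output)
```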

With this, we have wrapped up the internal working of the Decoder block as well. As you might have guessed, both the Encoder & Decoder blocks are used to process their inputs and generate contextualized vectors. So who does the actual next-word prediction? Let’s find out.

Linear & Softmax Layer

Sitting on top of the Decoder stack, this layer accepts the output matrix generated by the last decoder in the stack as its input. The output matrix is transformed into a logit vector whose size equals the vocabulary size. We then apply the softmax function to this logit vector to generate a probability for each word, and the word with the highest probability is predicted as the next word. The model is optimized for cross-entropy loss using the Adam optimizer.
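A minimal sketch of this final prediction step, assuming `W_vocab` and `b_vocab` are the learned parameters of the linear layer and `vocab` is the list of words in the vocabulary:

```python
import numpy as np

def predict_next_word(decoder_output, W_vocab, b_vocab, vocab):
    """Project the last decoder position onto the vocabulary and pick the
    highest-probability word (greedy prediction)."""
    logits = decoder_output[-1] @ W_vocab + b_vocab     # logit vector of size len(vocab)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]
```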

To avoid overfitting, dropout layers are added after every sub-layer of the encoder/decoder network.

That’s all there is to the Transformer model. With this, we have completed the in-depth walk-through of the Transformer architecture in the simplest language possible.

Conclusion

Now that you know all about Transformer models, it shouldn’t be difficult for you to build on this knowledge and delve into more complex LLM architectures such as BERT, GPT, etc.

You may refer to the below resources for the same:

  1. https://huggingface.co/blog/bert-101
  2. https://jalammar.github.io/illustrated-gpt2/

I hope this 2-part article has made Transformer models a little less intimidating to understand. If you found it useful, please spread the good word.

Until next time!
