#1 LLM: Decoding LLM Transformer Architecture — Part 1

LAKSHMI VENKATESH
13 min read · Mar 3, 2024


Generative AI Theory & Practice series

Attention is all you need — Google Paper

This article has six sections:

  1. Section A: Why Transformer model is needed
  2. Section B: Encoder-Decoder Transformer Architecture
  3. Section C: Example of Encoder-Decoder Transformer Architecture
  4. Section D: What are the popular and notable Transformers
  5. Section E: The Attention Mechanisms — What is Attention and why do you need only attention?
  6. Section F: Quick Look at the Transfer Learning

Section A: Why the Transformer model is needed:

RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are the most well-known recurrent neural networks, and their limitations were the key catalysts for two well-known Transformers: GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

  1. RNN vs Transformer:

RNNs (Recurrent Neural Networks) process information like passing small packets over a telephone line: one piece of information gets delivered at a time, which can lead to mistakes or to forgetting parts of the message. Transformers, on the other hand, put everything out in the open at once, like writing on a whiteboard, making it easier to see and understand the whole message clearly and quickly.

Unrolled RNN — created by author

2. LSTM vs Transformer:

Both architectures are used for processing sequences of data. An LSTM processes data sequentially, one part at a time. It has a memory cell that carries information across the sequence to overcome short-term memory issues, which makes it good for tasks where sequence order is crucial, like time-series prediction.

A Transformer processes the entire sequence at once, not sequentially. It uses attention mechanisms to weigh the importance of different parts of the input data and excels in tasks that benefit from understanding the whole sequence context, like language translation.

LSTMs and RNNs are like reading a book page by page, with the LSTM remembering the entire plot and the RNN forgetting a few characters, while Transformers are like scanning the entire book and understanding everything all at once.

Created by author
  • Cell State (Memory): Think of this as the long-term memory of the LSTM cell. It carries information throughout the processing of the sequence, and only gets updated by the gates below.
  • Hidden State: This is like short-term memory. It represents the current state of the cell and is used for the cell’s output.
  • Input: This is the new information that the LSTM cell will process.
  • Forget Gate: Decides what information to throw away from the cell state. It looks at the current input and the previous hidden state to make this decision.
  • Input Gate: Decides what new information to store in the cell state. It also uses the current input and the previous hidden state to make this decision.
  • Output Gate: Decides what the next hidden state should be. The next hidden state contains information that’s based on the current input and the memory of the cell.
  • tanh: This is a mathematical function that helps to regulate the values flowing through the network, keeping them between -1 and 1. (A minimal numerical sketch of how these pieces fit together follows this list.)
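
To make the gate descriptions above concrete, here is a minimal NumPy sketch of a single LSTM cell step. The weight names (W_f, W_i, W_c, W_o) and shapes are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM step: the gates decide what to forget, what to store, and what to output.

    x_t:    current input vector
    h_prev: previous hidden state (short-term memory)
    c_prev: previous cell state (long-term memory)
    params: dict of weight matrices and biases (illustrative names)
    """
    # The gates look at the previous hidden state and the current input together.
    z = np.concatenate([h_prev, x_t])

    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate: what to drop from c_prev
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate: what new info to store
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values, kept between -1 and 1
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate: what to expose as the new hidden state

    c_t = f * c_prev + i * c_tilde   # update the long-term memory (cell state)
    h_t = o * np.tanh(c_t)           # new short-term memory, also the cell's output
    return h_t, c_t
```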

The choice between an RNN, an LSTM, or a Transformer model depends on the specific task, dataset characteristics, and computational resources. All three have proven effective in various applications, but the Transformer’s ability to manage large-scale data and capture complex relationships often gives it an edge in many modern NLP tasks.

3. NLP and Transformers:

Unlike a traditional RNN or LSTM sequence-to-sequence model, with the attention mechanism the encoder passes much more data to the decoder. Instead of passing only the final hidden state, the encoder passes all of its hidden states to the decoder. This gives the decoder more context than the final hidden state alone, as sketched below.
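
As a rough sketch of this idea (assuming simple dot-product scoring; the function and variable names are illustrative): the decoder scores every encoder hidden state against its current state, turns the scores into weights with a softmax, and mixes all the hidden states into a single context vector.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weight all encoder hidden states by their relevance to the current decoder state.

    decoder_state:  (d,)   the decoder's current hidden state
    encoder_states: (T, d) one hidden state per input position
    """
    scores = encoder_states @ decoder_state      # (T,) dot-product relevance scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()            # softmax over the input positions
    context = weights @ encoder_states           # (d,) weighted mix of all encoder states
    return context, weights
```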

Attention Network

The above is a simplified version of the attention network. Below is the actual architecture of the Transformer and the encoder-decoder framework of the Transformer architecture. Let’s delve a bit deeper to understand this architecture.

Section B: Encoder-Decoder Framework of the Transformer Architecture:

Decoding the Transformer Architecture.

Attention is all you need

Encoder:

The encoder processes input data, like a sentence, by simultaneously analyzing it to create a context-rich numerical representation. It’s structured in layers, using attention mechanisms to focus on understanding, not generating text, and outputs vectors that capture the essence of the input.

  • The encoder’s job is to process the input data (like a sentence in a language for translation, or any sequence of data for other tasks).
  • It reads and understands the entire input sequence all at once (thanks to the attention mechanism) and creates a representation of it. This is like creating a detailed map or summary of the input.
  • It does not need to worry about generating any new text; it focuses solely on analysis.
  • The encoder is typically composed of a stack of identical layers that include multi-head attention and feed-forward networks.
  • The output of the encoder is a set of vectors that encapsulate the information of the input sequence. These vectors carry context for each word, considering the entire sequence (see the sketch after this list).
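
One way to assemble such an encoder stack is with PyTorch’s built-in layers. This is only a sketch under assumed sizes (d_model=512, 8 heads, 6 layers, as in the original paper), not the article’s own implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # sizes from "Attention Is All You Need"

# One encoder layer = multi-head self-attention + feed-forward, each followed by Add & Norm.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

tokens = torch.randn(1, 10, d_model)   # a batch with 10 already-embedded tokens
memory = encoder(tokens)               # (1, 10, 512): one context-rich vector per token
```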

Decoder:

The decoder builds the final output step by step, using both the encoder’s clues and its own growing chain of words to figure out what comes next. It’s like completing a puzzle with hints from the encoder, making sure it doesn’t peek at the unseen pieces.

  • The decoder’s role is to generate the output data sequence step by step, one piece at a time. For instance, in language translation, it would generate the translated sentence word by word.
  • It takes in the encoder’s output and uses it to make predictions for the next element in the sequence. It starts this process with an initial input (like the start of a sentence) and builds the output sequentially.
  • The decoder also has layers similar to the encoder, but with an additional layer of attention that focuses on the encoder’s outputs. This is often called “encoder-decoder attention.”
  • The decoder is aware of what it has already generated, thanks to the masked multi-head attention, which prevents it from seeing the future output tokens when making predictions. This is necessary because, during training, the decoder has access to the full output sequence and we need to prevent it from “cheating.”

Left and Right side of the Transformer Architecture:

Left Side — Encoder:

  1. Inputs: This is where the data (like a sentence in English) enters the model. These are the words you give to the model, like a sentence you want to translate.
  2. Input Embedding: The words of the sentence are converted into a format the model can understand — think of it as translating words into a secret code. The model turns these words into numbers because it’s easier for the model to work with numbers. Each word gets a unique number code.
  3. Positional Encoding: This adds information about the order of the words because the secret code doesn’t naturally keep track of which word came first, second, and so on. The model needs to know the order of words because “I read a book” means something different from “Read I a book.” This step is like tagging each word with a little note about where it stands in line (a sketch of this sinusoidal encoding appears after this list).
  4. Multi-Head Attention: Here, the model pays attention to different parts of the sentence all at once. It can look at the beginning and the end of the sentence simultaneously to better understand the context. Here, the model looks at the sentence and decides which words are important to each other. For example, in “The cat sat on the wall,” it figures out that “cat” is related to “sat.”
  5. Add & Norm: After paying attention to the sentence, the model makes some adjustments to avoid confusion and to keep the information clear as it moves deeper into the system. After looking at the words, the model smooths out its understanding to make sure it’s not focusing too much on any one thing.
  6. Feed Forward: This is like a mini-brain that processes each word’s secret code individually, refining it further. Now the model takes its understanding and refines it, like polishing a gemstone to make it clearer.
  7. Nx: This means that the steps above (starting from the multi-head attention to feed forward) are repeated several times to deepen the understanding of the sentence. This symbol means that the model repeats its next set of steps several times to really understand the sentence.
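
Here is a minimal sketch of the sinusoidal positional encoding used in the original paper; the function name and shapes are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position tags: even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1) position of each word in line
    dims = np.arange(d_model)[None, :]          # (1, d_model) embedding dimension index
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates            # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# The position tags are simply added to the word embeddings before the encoder:
# embedded = word_embeddings + positional_encoding(seq_len, d_model)
```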

Right Side — Decoder:

  1. Outputs (Shifted Right): Here, the target data (like a sentence in French that corresponds to the English sentence) is fed into the model, also in the secret code form. It is shifted one position to the right so the model learns to predict the next word from the words that came before it. This is where the model starts to guess what the next word should be. It’s like the model is trying to predict the future, but it can only use the words it already knows.
  2. Output Embedding + Positional Encoding: Similar to the input side, the decoder processes the output words and their order. Just like with the input, the model turns these output words into numbers and keeps track of their order.
  3. Masked Multi-Head Attention: The model pays attention to the output sentence but does so in a way that it can only look at the words it has already predicted, not the ones it hasn’t gotten to yet — like reading a sentence with some words covered, guessing the covered words one at a time. When making its guesses, the model can’t cheat by looking ahead. This step ensures it only uses what it’s already guessed and what it learned from the left side (see the causal-mask sketch after this list).
  4. Add & Norm: Just like on the input side, this keeps things clear and on track in the decoder. Again, the model smooths out its thoughts before moving on.
  5. Multi-Head Attention: Now, the decoder also looks at what the encoder learned from the input sentence. This step is like comparing notes between the input and output to make better predictions. Now the model combines what it learned from the left side (the input sentence) with what it’s guessing to make sure it all makes sense together.
  6. Add & Norm and Feed Forward: These steps further refine the understanding and prediction of the output sentence. These are more refining steps, like double-checking your work to make sure it’s right.
  7. Linear Layer: This transforms the refined secret codes into a score for every word in the vocabulary before the final guess is made. The model now has to turn its polished internal representation into actual word choices; the next step turns these scores into the chances of each word being picked.
  8. Softmax: The model uses this step to make its best guess on what each word in the output sentence should be, turning its predictions into probabilities.
  9. Output Probabilities: Finally, the model gives the results in the form of probabilities for each possible word, and the word with the highest probability is chosen as the translation. Finally, the model presents its best guess as to what the next word in the sequence should be.
Created by author
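
Step 3 (masked multi-head attention) is what keeps the decoder from peeking ahead. A minimal sketch of that causal mask follows, with illustrative function names.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to positions up to and including i."""
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

def masked_attention_weights(scores, mask):
    """Turn raw attention scores into probabilities while hiding future positions."""
    scores = np.where(mask, -1e9, scores)               # future tokens get effectively zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
# mask[1] == [False, False, True, True]: the 2nd token may look at tokens 1-2, never at 3-4.
```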

Section C: Example of Encoder-Decoder Framework

Level 0 Block Diagram
Level 1: High-level flow, improving the text translation with an attention network
  • Left Side: This is the encoder part of a sequence-to-sequence model. RNN cells at different time steps process the input sentence in French, “Je suis scientifique”. The horizontal line labelled “Attention” combines the outputs of these RNN cells with different weights (the pink heatmap below shows the attention weights, with darker colors indicating higher weights) to create a context-sensitive representation for each word in the sentence.
  • Right Side: This is the decoder part of the model, which is generating the translated sentence in English. The blue rectangles are the decoder RNN cells, and they use the combined, context-sensitive representation from the encoder (via attention) to generate the translation “I am a Scientist” word by word. The decoder starts with an initial “GO” token and produces the translation until the “end of sentence” token is reached.
  • Attention Visualization (Bottom of Left Side): The heatmap shows how much attention the decoder gives to each word in the input sentence when predicting the next word in the translation. For example, when generating “Scientist”, the decoder pays most attention to “scientifique”, as indicated by the darker color in the heatmap.

Section D: What are the popular and notable Transformers?

  1. GPT-4: An evolution of GPT-3, GPT-4 is capable of taking both images and text as input, though specific technical details remain under wraps.
  2. GPT-3: Known for its staggering number of parameters (175 billion), GPT-3 is a massive model adept at generating human-like text and performing a variety of NLP tasks without task-specific training.
  3. BERT: This model generates contextually rich word embeddings and is widely employed across various NLP applications such as sentiment analysis and text classification.
  4. RoBERTa: An optimized version of BERT, RoBERTa has been trained on an even larger corpus of text and fine-tuned with more sophisticated techniques, achieving remarkable success across multiple benchmarks.
  5. T5: The Text-to-Text Transfer Transformer reframes all NLP tasks into a text-to-text format, allowing it to excel at tasks ranging from question-answering to document summarization.
  6. ALBERT: A lighter and more efficient version of BERT, ALBERT implements parameter-reduction techniques to speed up training without sacrificing performance.
  7. XLNet: This model uses an autoregressive method to learn bidirectional context, improving upon BERT’s limitations and excelling in tasks like text classification and question-answering.
  8. ULMFiT: One of the earlier models to use transfer learning effectively (it is built on LSTMs rather than the Transformer architecture), ULMFiT can be fine-tuned to a variety of tasks, with a particular strength in text classification.
  9. DistilBERT: A distilled version of BERT that retains most of its predecessor’s strengths while being smaller and faster, making it suitable for environments with limited computational resources.
  10. ELECTRA: A model trained with a novel approach of replacing some input tokens with synthetically generated ones, leading to efficiency and strong performance across various NLP tasks.
  11. DeBERTa: Enhanced with disentangled attention, DeBERTa improves upon the BERT architecture and achieves superior performance on several benchmarks.

These models represent the cutting-edge of Transformer technology, with each bringing unique strengths to various aspects of NLP.

Section E: The Attention Mechanisms — What is attention and why do you need only attention?

Encoder-Decoder with Attention Mechanism for RNNs

Note: In the original Transformer architecture, the RNN cell above is replaced with a feed-forward network.

Attention in AI and machine learning is a mechanism that enables models, especially those involved in processing language or sequences, to focus on specific parts of the input data when performing a task. This is similar to the way humans pay more attention to certain aspects of what they see or hear to better understand or respond to it. The attention mechanism improves the model’s ability to remember and relate different parts of the input, enhancing its performance in tasks like translation, summarization, and question-answering.

  1. Scaled Dot-Product Attention: A computation where queries and keys are dot-multiplied, scaled down, and then a softmax function is applied to determine the weights on the values.

Example: It’s like when you’re listening to someone talk but only pay close attention to certain keywords to grasp the story. The computer does something similar by focusing more on certain words.

Softmax: Softmax is a function that turns numbers into probabilities, which add up to 1. This helps in deciding which category something belongs to, making it useful for tasks like recognizing objects in pictures or understanding language.
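
Putting the scaled dot-product description and the softmax note together, a minimal NumPy sketch might look like this (shapes and names are illustrative, not a production implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn raw scores into probabilities that sum to 1."""
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query-key similarity, scaled down
    weights = softmax(scores)         # how much each value should count
    return weights @ V                # weighted sum of the values
```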

2. Multi-Head Attention: This involves running multiple attention mechanisms (heads) in parallel. Each head processes the input differently, allowing the model to attend to different parts of the input simultaneously and capture a richer diversity of information.

Example: Imagine trying to understand a story by listening to it multiple times, each time focusing on different details. That’s what the computer does here — it looks at the information in several different ways at once to get a fuller understanding.
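
A rough sketch of the same idea with several heads, reusing the scaled_dot_product_attention helper from the sketch above; the projection matrices here are random stand-ins for learned weights.

```python
import numpy as np

def multi_head_attention(X, n_heads, rng=np.random.default_rng(0)):
    """Run several attention 'heads' in parallel, each on its own projection of X."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Each head gets its own projections of the same input, so it can focus on different details.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        head_outputs.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    # Concatenate the heads back to d_model dimensions (the final output projection is omitted here).
    return np.concatenate(head_outputs, axis=-1)
```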

3. Self-Attention (also known as Intra-Attention): A mechanism where all the queries, keys, and values come from the same sequence. It enables an element to attend to all positions within the same sequence, thus capturing the internal structure of the sequence.

Example: This is when the computer looks at a sentence and considers how each word relates to all the others within the same sentence, helping it understand the sentence structure and meaning better.
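
Continuing the same sketch, self-attention simply means the queries, keys, and values are all projections of one and the same sequence (the variable names below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                 # 5 tokens of a single sentence, model dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv                # all three come from the SAME sequence X
out = scaled_dot_product_attention(Q, K, V)     # each token attends to every token in X
```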

Section F: Quick look at the Transfer Learning

Transfer learning leverages knowledge from one domain and applies it to a new model in another domain, whereas traditional supervised learning trains a separate model for each domain.

Created by author

Supervised Learning (Left Side):

  • There are two domains, Domain A and Domain B.
  • For each domain, there is a dedicated model, Model A and Model B, each trained and evaluated on its own domain.
  • The outcome is two sets of predictions, Predictions A and Predictions B, corresponding to each domain and model.

Transfer Learning (Right Side):

  • There is knowledge transfer from Domain A to Domain B.
  • The model trained on Domain A can be transferred to Domain B. The knowledge (learned weights and patterns) from Model A is then transferred to a new model for Domain B.
  • This new model for Domain B reuses the Domain A model but replaces the final layer with a new layer suitable for Domain B, adapting it to the new task (see the sketch after this list).
  • The result is a set of predictions for Domain B, benefiting from the knowledge transferred from Domain A.
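
A minimal PyTorch sketch of that idea: freeze the layers learned on Domain A and swap in a fresh final layer for Domain B. The layer sizes and class counts below are illustrative assumptions.

```python
import torch.nn as nn

# A stand-in for a model already trained on Domain A (sizes are illustrative).
model_a = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 5),            # final layer: 5 classes in Domain A
)

# Transfer to Domain B: freeze the learned layers, replace only the final layer.
for param in model_a.parameters():
    param.requires_grad = False  # keep Domain A's knowledge fixed

model_b = nn.Sequential(
    *list(model_a.children())[:-1],  # reuse Domain A's feature layers
    nn.Linear(32, 3),                # new, trainable head for Domain B's 3 classes
)
```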

Transfer learning leverages the knowledge from a related task that has already been learned to improve performance or reduce training time on a new, but related task. This is particularly useful when the new task has limited data available for training.

Part 2: https://luxananda.medium.com/decoding-llm-transformer-architecture-part-2-143edc8282fe

