Introduction
As an AI enthusiast diving into the fascinating realm of Generative AI, you’ve likely wondered how modern large language models (LLMs) like GPT make sense of the intricate meaning of a prompt. The answer lies in the self-attention block of the Transformer, which forms the base for all modern LLMs like GPT, BERT, and T5.
Positional Encodings: Giving Order to the Words
This is part two of the multi-post series on Transformers; in the previous post, we had a look at the self-attention mechanism. Positional Encodings are a critical component of Transformers: self-attention on its own has no notion of word order, and positional encodings give the model a way to tell how the words in a sentence are arranged. In this post, we’ll break down how Positional Encoding works alongside the self-attention block.
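To see why order matters, here is a minimal sketch, not a full Transformer: it uses random 4-dimensional vectors as stand-in embeddings and identity projections instead of learned Query/Key/Value weights. It shows that plain self-attention produces the same outputs for a shuffled sentence, only shuffled in the same way, so the model cannot tell which word came first.

```python
import numpy as np

def self_attention(X):
    # Single-head attention with identity projections, for illustration only.
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # stand-in embeddings for "Tom", "chases", "Jerry"
perm = [2, 0, 1]                 # shuffle the word order
out_original = self_attention(X)
out_shuffled = self_attention(X[perm])

# The outputs are the same rows, just reordered: attention alone is blind to position.
print(np.allclose(out_original[perm], out_shuffled))   # True
```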
Method 1: Why Not Just Use Integer Indices?
One could simply capture word positions by appending each word’s position in the sentence to its embedding. For example, in the sentence “Tom chases Jerry”,
if the word embeddings of “Tom”, “chases” and “Jerry” are Et, Ec and Ej respectively, then
“Tom” → concat( Et, 1 )
“chases” → concat( Ec, 2 )
“Jerry” → concat( Ej, 3 )
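As a concrete sketch of this idea (the 4-dimensional embeddings below are random placeholders, not learned vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["Tom", "chases", "Jerry"]
embeddings = {w: rng.normal(size=4) for w in words}   # stand-ins for Et, Ec, Ej

encoded = [
    np.concatenate([embeddings[w], [pos]])            # concat(embedding, position)
    for pos, w in enumerate(words, start=1)
]
print(encoded[0])   # "Tom": four embedding values followed by its position, 1
```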
However, this method has its own limitations:
- Unbounded Integers: Large integer values would dominate the embedding space, making it difficult for the model to balance learning between token meaning and positional information.
- Discrete Nature: Integer positions are discrete rather than continuous, which does not play well with gradient-based learning.
- Relative Positioning: Raw integer positions give the model no easy way to reason about the relative distances between tokens, which are crucial for understanding context.
Method 2: Sinusoidal Positional Encoding
To overcome these limitations, sinusoidal waves can be used: they are bounded, continuous, and better suited to expressing relative positions. In this method, the integer index of each word is passed through a sine function to obtain its positional encoding.
For example, in the sentence “Tom chases Jerry”:
“Tom” → concat( Et, sin(1) )
“chases” → concat( Ec, sin(2) )
“Jerry” → concat( Ej, sin(3) )
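A matching sketch of Method 2, again with random placeholder embeddings and the deliberately simplistic single sin() call described above:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["Tom", "chases", "Jerry"]
embeddings = {w: rng.normal(size=4) for w in words}        # stand-ins for Et, Ec, Ej

encoded = [
    np.concatenate([embeddings[w], [np.sin(pos)]])         # concat(embedding, sin(position))
    for pos, w in enumerate(words, start=1)
]
print(encoded[2])   # "Jerry": four embedding values followed by sin(3) ≈ 0.141
```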
However, this method isn’t perfectly clean either; it introduces two new problems:
- Problem 1: Periodicity of the function. Sine is periodic and bounded, so words at different (and possibly distant) positions can end up with very similar positional values.
- Problem 2: Adding vs. Concatenating Positional Encodings. Concatenation keeps growing the vector size, so the question becomes whether the positional signal should instead be added directly to the embedding.
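For reference, the encoding used in the original “Attention Is All You Need” paper addresses both points: it combines sine and cosine waves at many different wavelengths into a vector of the same size as the embedding and adds it rather than concatenating. A minimal sketch, with d_model = 8 chosen arbitrarily for illustration:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]              # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # shape (1, d_model // 2)
    angles = positions / (10000 ** (dims / d_model))      # one wavelength per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions: cosine
    return pe

embeddings = np.random.default_rng(0).normal(size=(3, 8))     # "Tom chases Jerry"
encoded = embeddings + sinusoidal_positional_encoding(3, 8)    # added, not concatenated
print(encoded.shape)                                           # (3, 8): dimensionality unchanged
```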
How Multi-Head Attention Works: Capturing Multiple Contexts
A single self-attention block captures only one context for a given set of words. To capture several, Transformers run multiple self-attention blocks (heads) in parallel, each with its own Query, Key, and Value projections, hence capturing different contexts in the same set of words and allowing the model to handle complex linguistic phenomena effectively.
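A minimal sketch of this idea, assuming d_model = 8, two heads, and random weights, and omitting the final output projection for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    d_head = X.shape[1] // n_heads
    heads = []
    for h in range(n_heads):
        # Each head projects the same words with its own Query, Key and Value weights.
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)                 # one context per head
    return np.concatenate(heads, axis=-1)         # heads are joined back together

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
X = rng.normal(size=(3, d_model))                 # "Tom chases Jerry"
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_model // n_heads)) for _ in range(3))
print(multi_head_attention(X, Wq, Wk, Wv, n_heads).shape)   # (3, 8)
```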
Conclusion
Understanding Multi-Head Attention and Positional Encoding forms a strong foundation for understanding Transformers at their full power. Whether it’s for machine translation, chatbot development, or content generation, understanding these mechanisms is key to harnessing the true potential of modern AI. Transformers have redefined the field of NLP, and as we continue to explore their capabilities, we will see even greater advances in how machines understand and generate human language.
FAQs
Q: What is Positional Encoding?
A: Positional Encoding is a method used in transformers to capture the positional information of words in a sentence.
Q: Why do we need Positional Encoding?
A: Self-attention on its own treats a sentence as an unordered set of words. Positional Encoding is necessary to tell the model the order in which the words actually appear.
Q: How does Multi-Head Attention work?
A: Multi-Head Attention uses multiple self-attention blocks for each word, each with its own Key, Query, and Value vectors, to capture different contexts in the same set of words.
Q: What are the limitations of simple positional encodings?
A: Naive integer-index encodings suffer from unbounded values, their discrete nature, and difficulty expressing relative positions between tokens.
Q: How do we overcome the limitations of Positional Encoding?
A: These limitations are addressed by using bounded, continuous sinusoidal waves and by adding the resulting positional encodings to the word embeddings rather than concatenating them.