The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture [Summary]

Explore 'The Transformer Blueprint' - an all-inclusive guide to understanding and applying the transformative power of the Transformer Neural Network Architecture.

Prof. Otto Nomos · Oct 05, 2023 · 17 min read

Original link: https://deeprevision.github.io/posts/001-transformer/
Author: Jean Nyandwi

Introduction

The Transformer is an artificial intelligence model that has become a revolutionary tool in computer science since its introduction in the 2017 paper "Attention Is All You Need." Initially created for tasks like translating one language to another, it has proven useful not just in natural language processing (the technology that helps computers understand human language), but also in other areas as a general-purpose tool.

To break this down simply, imagine if you had a tool that could not only help you understand and translate languages but also perform other tasks like identifying objects in photos or predicting stock market trends. That's what the Transformer model is like in the world of artificial intelligence.

In our deep dive, we'll peel back the layers of the Transformer model, exploring how it pays attention to the right information, how it encodes and decodes data, and how it forms its structure. Beyond the basics, we'll also check out bigger, more powerful models that use the Transformer's capabilities. Plus, we'll look at how the Transformer model is used outside of language processing and discuss the current issues and potential future developments related to this powerful AI tool. For those who want to learn more, we'll also share a list of resources where you can find open-source versions of the model and more information.

Neural Networks Before Transformers

Before the invention of the Transformer model, several other types of artificial intelligence (AI) models were used to try to understand and work with sequence data - that's data where the order matters, like the words in a sentence or the notes in a song.

  1. Multilayer Perceptrons (MLPs): These are a classic type of AI model, made up of multiple layers of nodes that each process the data a bit. The problem is, they don't consider the order of the data - it's like trying to understand a sentence by looking at all the words at once, without considering which word comes before or after another. Plus, MLPs need lots of parameters (the adjustable values the model has to learn) to handle this kind of data, which makes them inefficient.

  2. Convolutional Neural Networks (CNNs or ConvNets): These models are mostly used for image processing, but have also been used with text and video. They bundle computations in local parts of the input data, which is good for processing efficiency, but they struggle with variable-length data (like a sentence that can be short or long) and can require many layers to handle long-term dependencies, or relationships between elements far apart in the sequence.

  3. Recurrent Neural Networks (RNNs): These models were actually designed for sequence data. They're set up so that information can loop back within the model, which helps it remember earlier parts of the data. Variations of RNNs like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) are even better at handling longer sequences. However, RNNs can become unstable with very long sequences (their gradients vanish or explode), and they're hard to scale up for bigger tasks because they process elements one after another and can't be easily parallelized (run in parallel) on today's powerful hardware.

These limitations with MLPs, CNNs, and RNNs motivated the development of the Transformer model. Unlike these earlier models, Transformers can handle sequence data well, scale up efficiently, remain stable over long sequences, and consider the global context of the data. Now that we understand the issues with earlier models, we're ready to dive into the Transformer architecture!


Transformer Architecture

The Transformer is an AI model that processes sequential data, like text, audio, video, and even images (when they're broken down into a sequence of parts). What's unique about the Transformer is that it doesn't use recurrent or convolutional layers, which are common in other models. Instead, it relies on something called 'Attention', along with other basic layers including fully-connected layers, normalization layers, embedding layers, and positional encoding layers.

Let's visualize the Transformer like a factory assembly line for data. Here's how it works:

  1. Encoder: This is the first station on the assembly line, and it handles the input data. The job of the encoder is to transform the input sequence into a condensed representation. In the original Transformer model, the encoding process is repeated six times. Each repetition (or 'block') consists of two main sub-layers - multi-head attention and an MLP (multilayer perceptron) - each wrapped with a residual connection and layer normalization. Stacking more blocks lets the model capture more of the overall context of the input data, which usually leads to better results.

  2. Decoder: This is the second station, where the output is generated. The decoder is similar to the encoder, but it has an extra multi-head attention layer that operates on the output of the encoder. Essentially, the decoder's job is to merge the encoder output with the target sequence and make predictions for the next element in the sequence. The decoder also repeats the same number of times as the encoder (typically six in the original Transformer).

An important point about the decoder is that it's designed to 'mask' future data in the sequence, so the model can't 'cheat' by peeking ahead at elements it hasn't generated yet. During training, this forces the decoder to predict each element using only what comes before it, which is exactly how it has to operate when generating new sequences.
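
In code, this masking is commonly implemented as a lower-triangular 'causal' mask applied to the attention scores. Here's a minimal sketch (the helper name `causal_mask` is ours, not from the original paper):

```python
import torch

def causal_mask(seq_len):
    # Lower-triangular matrix: position i may only attend to positions 0..i,
    # so the decoder never "peeks" at future tokens.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```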

So, in simple terms, the Transformer model is like a factory line that takes in sequential data, processes it piece by piece while paying attention to its overall context, and then makes educated predictions about what comes next.


What Really is Attention?

Attention is the star player in the Transformer AI model. It's a mechanism that allows the AI to focus more on important parts of the input data and less on the less relevant parts. Let's compare it to reading a book. While reading, your brain naturally focuses on the key details that move the story forward, and pays less attention to the less important descriptions. The Attention mechanism in the Transformer model does something similar for data.

This idea first popped up in machine translation, where Attention was used to identify where the most important information in a sentence was concentrated. For example, when translating a sentence from English to French, not only does Attention help translate the words correctly, it also helps arrange the translated words in an order that makes sense in French.

Another early use of Attention was in image captioning. In this task, the AI model had to generate a sentence that describes what's happening in a picture. The Attention mechanism helped the model focus on the important parts of the picture when generating the caption.

Now, let's think of the Attention mechanism in the Transformer model like the eye of the AI. It helps the model focus on the important parts of the data while ignoring the less important bits, just like how our eyes naturally focus on what's important when we're reading a book or looking at a picture. Next, we'll take a look at how this 'eye' works in detail with the help of three elements: queries, keys, and values.

Attention Function: Query, Key, Value

The Attention function in AI can be likened to how a search engine works when you're looking for something. Let's say you're searching for papers about "attention" on an academic database like ArXiv. Here's how the three elements of the Attention function - query, key, and value - come into play:

  1. Query: This is your search term, like "attention" in our example. It's what you're looking for.

  2. Key: These are like the tags or categories that each paper on ArXiv has. When you search for something, the system will compare your query to these keys to find matches.

  3. Value: These are the actual papers in the database. Once the system finds keys that match your query, it'll return the papers (values) associated with those keys.

Now, let's translate this to the AI context. The Attention function uses these three elements to determine which parts of the input data (like a sentence or an image) the AI should focus on. Here's a step-by-step breakdown of what happens:

  1. It compares the query and key to measure how similar they are (like comparing your search term to the paper categories on ArXiv).

  2. This comparison is then scaled down to avoid situations where very large values would result in very small gradients, which could hamper learning.

  3. The scaled comparison is then normalized into a probability distribution using a softmax function, turning them into weights (imagine this as determining how relevant each paper is to your search term).

  4. These weights are then multiplied with the values, resulting in a weighted value, which tells the AI how much attention to give to each part of the input data.
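
To make these four steps concrete, here's a minimal PyTorch sketch of this attention function (simplified, with our own variable names; real implementations add dropout and more careful masking):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    # Step 1: compare queries with keys (dot-product similarity scores).
    scores = query @ key.transpose(-2, -1)
    # Step 2: scale down by sqrt(key dimension) so large scores don't squash gradients.
    scores = scores / key.size(-1) ** 0.5
    # Optional: hide positions the model shouldn't look at (mask is True where attending is allowed).
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    # Step 3: softmax turns the scores into a probability distribution (the attention weights).
    weights = F.softmax(scores, dim=-1)
    # Step 4: weight the values by relevance.
    return weights @ value, weights

# Toy usage: a "sentence" of 5 tokens, each represented by a 64-dimensional vector.
x = torch.randn(1, 5, 64)
output, weights = scaled_dot_product_attention(x, x, x)   # self-attention
print(output.shape, weights.shape)  # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])
```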

This design is also what makes the Transformer model so efficient: it can process all parts of the input data at once (parallelization) rather than one at a time, which matters especially for larger models and massive datasets.

The attention function we've just described is known as scaled dot-product attention, and it's only one variety. Other types include additive attention, content-based attention, location-based attention, and general attention. These can be applied to either the whole input data (global attention) or specific parts of it (local attention).

Multi-Head Attention

Multi-Head Attention is how the Transformer applies attention in practice. Essentially, it's a way of letting the model pay attention to different parts of the input at the same time, in parallel, and combine these 'attention views' to understand the input better.

Here's a simpler breakdown:

Imagine you're in a crowded room, and you're trying to understand multiple conversations at once. Rather than trying to focus on all the noise together, you could 'split' your hearing into several 'heads'. Each 'head' listens to a different conversation independently, gets the gist of it, and then all these separate understandings are combined for you to get a fuller picture of what's happening in the room. That's what Multi-Head Attention does.

The key point here is that each 'head' looks at the same conversation (or data in AI terms), but from different 'perspectives', highlighting different features or aspects. After they've done their individual work, all the different perspectives are merged together. This gives the AI model a more comprehensive understanding of the data.

An important thing about Multi-Head Attention is that it doesn't make the model noticeably more computationally expensive. This is because the dimension (the complexity or detail level) of each head is the model's dimension divided by the number of heads, so the overall amount of computation stays roughly the same.

To draw an analogy with another AI technique, Multi-Head Attention can be compared to a concept called 'depth-wise separable convolution' used in Convolutional Neural Networks (ConvNets), which are often used for image processing tasks. This method also splits input data into multiple channels, processes each channel independently, and then combines the outputs. The goal in both cases is to get a fuller and more nuanced understanding of the data.
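
As a rough sketch of how the 'heads' split and merge in code (an illustrative simplification, not a production implementation; the class and variable names are ours):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads   # each head works in a smaller subspace
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        batch, seq_len, d_model = query.shape

        # Project, then split the model dimension across heads: (batch, heads, seq, d_head).
        def split(x):
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(query)), split(self.k_proj(key)), split(self.v_proj(value))
        # Each head attends independently over its own subspace.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = scores.softmax(dim=-1) @ v
        # Merge the heads back together and mix them with a final linear layer.
        merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(merged)

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])
```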

Other Transformer Components

Let's break down the components of the Transformer architecture in AI in simpler terms:

  1. Multilayer Perceptrons (MLPs): MLPs are like tiny brains within the Transformer that help process information. Each one is made of two linear layers with a ReLU activation sandwiched in between, and it's applied to each position in the sequence separately but with the same weights.

  2. Embeddings and Positional Encoding Layers: These are like the Transformer's translation system. They convert input data (like words in a sentence) into vectors, which are like lists of numbers that computers can understand. This is done twice, once for the source (input) and once for the target (output). Embeddings help group similar words together. The positional encodings preserve the order of words in a sentence, because unlike us, the Transformer doesn't inherently know which word comes first, second, and so on (a small sketch of these encodings follows this list).

  3. Residual Connections, Layer Normalization, and Dropout: These are techniques that make the Transformer easier and faster to train. Residual connections give signals (and gradients) a shortcut through the model, helping it learn faster. Layer normalization keeps the activity levels across different parts of the Transformer in a consistent range, which also speeds up learning. Exactly where layer normalization sits within each block can vary, and it's an active area of research. Dropout randomly switches off parts of the network during training to prevent overfitting, which happens when a model fits the training data too closely and performs poorly on new data.

  4. Linear and Softmax Layers: These are the last step of the Transformer. The linear layer maps the decoded vectors (the Transformer's interpretation of the input data) to the size of the vocabulary (the number of unique words the model knows). Then, the softmax layer turns these into probabilities of what the next word (or token) in the sequence could be.
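
For item 2 above, here's a minimal sketch of the sinusoidal positional encodings used in the original Transformer (assuming an even model dimension; the function name is ours):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # Each position gets a unique pattern of sines and cosines at different frequencies,
    # which is added to the token embeddings so the model knows word order.
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)           # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                      # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(seq_len=10, d_model=512).shape)  # torch.Size([10, 512])
```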

So, in essence, a Transformer takes in words, translates them into vectors using embeddings, figures out their order using positional encodings, processes this information using MLPs, adjusts the signals using residual connections and layer normalization, prevents overfitting using dropout, and finally outputs probabilities for the next word in the sequence using the linear and softmax layers.
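
Putting several of these pieces together, here's a rough sketch of a single encoder block (an illustrative simplification that uses PyTorch's built-in attention layer, not the exact original implementation):

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        # The MLP: two linear layers with a ReLU in between, applied to each position independently.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection + layer normalization around the attention sub-layer.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection + layer normalization around the MLP sub-layer.
        x = self.norm2(x + self.dropout(self.mlp(x)))
        return x

block = TransformerEncoderBlock()
x = torch.randn(2, 10, 512)          # (batch, sequence length, model dimension)
print(block(x).shape)                # torch.Size([2, 10, 512])
```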

Visualizing Attention

"Visualizing Attention" is a way to peek into what a neural network, like a Transformer, pays attention to when it's processing information.

Think of it like reading a book. Your brain naturally pays more attention to certain words or phrases that it thinks are more important. The concept of attention in AI is similar. When a Transformer reads data (like a sentence), it pays more attention to certain parts over others.

Visualizing attention is like drawing a heat map over a sentence, showing where the Transformer is focusing. This can help us understand how the Transformer thinks and what parts of the data it considers most important.
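
As a toy illustration of such a heat map (using random stand-in weights rather than weights taken from a real trained model):

```python
import torch
import matplotlib.pyplot as plt

# Toy attention weights for a 5-token sentence (rows: attending token, columns: attended-to token).
tokens = ["the", "cat", "sat", "on", "mat"]
weights = torch.softmax(torch.randn(5, 5), dim=-1)   # stand-in for a real model's attention weights

plt.imshow(weights.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("attended-to token")
plt.ylabel("attending token")
plt.colorbar(label="attention weight")
plt.title("Attention heat map (toy example)")
plt.show()
```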

One of the benefits of attention is its ability to handle long-term dependencies. That means if a sentence refers to something mentioned much earlier, a Transformer is likely to remember it, unlike traditional models that tend to forget over time.

Attention also allows Transformers to process information in parallel, all at once, instead of one-by-one, making them faster and more efficient.

An additional benefit is that attention gives us a way to visualize and understand what the model is focusing on, unlike many other AI techniques. This "attention map" can highlight which parts of the data were most influential in the Transformer's output.

However, attention also comes with some challenges. For instance, it can require a lot of memory and computation power, especially with longer sequences of data. Also, while attention can provide some interpretability, it doesn't give a full picture of what's happening inside the complex model.

Large Language Transformer Models

Evolution of LLMs

Large Language Models (LLMs) have fundamentally transformed the way we interact with machines, especially in the realm of natural language. These models, like ChatGPT and Bard, can perform tasks that previously required data gathered specifically for each task.

LLMs are essentially transformer models on steroids. They have grown in size, from 65 million parameters in the original base model to billions of parameters in recent versions. A parameter here is an adjustable value that the model learns during training - roughly, a small unit of its knowledge.

Training an LLM starts with feeding it massive amounts of text data, such as books, articles, and web content. The aim is to help the model understand a broad range of topics. This phase, called the pretraining phase, happens in an unsupervised manner. That means the model isn't given any labels or categories; it's just asked to figure things out on its own.

To train these models, they are given an objective such as predicting the next word in a sentence (known as next-token prediction) or filling in a missing word (masked language modeling). This helps the model learn to understand and generate text.
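
As a rough sketch of what the next-token prediction objective looks like in code (with a tiny stand-in model in place of a real Transformer, just to make the example runnable):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A toy stand-in for a real Transformer decoder: embedding + linear projection to the vocabulary.
vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

# Next-token prediction: the target is simply the input shifted one step to the left.
tokens = torch.tensor([[5, 12, 7, 3, 9]])        # a toy tokenized "sentence"
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

logits = model(inputs)                           # (batch, sequence length, vocabulary size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```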

After pretraining, these models can be used directly or fine-tuned for specific tasks. Often you can simply show the model a few examples of a task in the prompt and let it infer how to perform it - this is known as few-shot learning. If no examples are provided and the model is asked to perform the task right away, it's called zero-shot learning.

However, for more complex tasks that are harder to explain through prompts, fine-tuning is often necessary. This is where the model is trained on specific data related to the task. This helps the model to perform better in specialized fields like mathematics, medicine, and scientific areas.

The field of LLMs is evolving, and each model can have unique design choices, ranging from encoder models, decoder models, to combined encoder-decoder architectures.

Vertical LLMs

Large Language Models (LLMs) are typically known as foundational models because they are trained on vast amounts of data and can be fine-tuned for many different tasks. However, while these foundational LLMs are excellent at general tasks, they might struggle with more complex tasks that require specific expertise.

Imagine you have a friend who knows a little bit about everything. This friend can help you with simple questions about a wide range of topics, but if you need in-depth information about something very specific, like law, medicine, or finance, they may not be the best resource. This is where vertical LLMs come in.

Vertical LLMs are like specialists. They are designed to excel in specific domains, essentially acting as an expert in that field. They're fine-tuned using data related to a specific subject, so they're much better at answering complex questions about that topic.

Just like you'd go to a doctor for medical advice or a lawyer for legal help, you'd use a vertical LLM for specific tasks. For example, MedPaLM and ClinicalGPT are vertical LLMs focused on medicine, FinGPT is specialized in finance, and Galactica and Minerva focus on science and mathematics. So while foundational models are like generalists, vertical LLMs are the specialists in the world of AI.


Transformers Beyond NLP: Vision and other Modalities

The Transformer model, first designed for tasks related to natural language processing, has significantly expanded its reach to other fields, including visual recognition and more.

Consider visual recognition. Before Transformers, Convolutional Neural Networks (ConvNets) were the best tools we had for visual tasks. But they have some limitations, mainly due to their spatial biases. In 2018, a new application of Transformers, called the Image Transformer, was introduced. It treated image generation like a text generation problem, sequentially creating pixels until it formed a complete image.

However, because images often have high resolution, it wasn't practical to apply this self-attention mechanism to larger images. A groundbreaking solution came with the Vision Transformer (ViT), which processed images as a sequence of smaller patches, significantly reducing the computational load.
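
The patching idea itself is simple. Here's a rough sketch of turning an image into a sequence of flattened patches (our own helper, not ViT's actual implementation, which also adds a linear projection, a class token, and positional embeddings):

```python
import torch

def image_to_patches(images, patch_size=16):
    # (batch, channels, H, W) -> (batch, num_patches, patch_size * patch_size * channels)
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return patches

imgs = torch.randn(1, 3, 224, 224)
print(image_to_patches(imgs).shape)  # torch.Size([1, 196, 768]) — 14x14 patches of 16x16x3 pixels
```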

ViT's success was also aided by unsupervised pretraining on vast amounts of unlabelled data, just like in NLP. Since then, ViT has been used as a base for many other projects, combining with ConvNets to achieve great results in various computer vision tasks.

But Transformers haven't stopped at language and vision. They're also used in reinforcement learning (used in games and robotics), speech recognition, and even multi-modal learning, which includes pretty much all forms of data.

In simple terms, imagine Transformers as a versatile toolbox. Initially, we only used this toolbox for language-related tasks. But soon, we found that it could be very useful for other tasks too, like recognizing images, learning from game play, understanding speech, and much more. However, despite their versatility, Transformers still have challenges and limitations that need to be addressed.

Transformer: Current Challenges and Future Directions

Transformers have shown impressive performance across various fields such as language, vision, robotics, and reinforcement learning. However, they come with a high computational cost: self-attention's time and memory requirements grow quadratically with the length of the input sequence. This makes it challenging to use Transformers on low-budget devices like smartphones and microcontrollers.

Imagine Transformers as an engine. They're extremely powerful, but they also consume a lot of fuel (computational power and memory), making them expensive to run. Some models, often called "xformers," claim to reduce this cost, but they're usually designed for specific tasks or devices and often fail to be as efficient and universal as the original Transformers.

In response to this, a new approach called FlashAttention has been developed. It computes attention values (a key aspect of Transformer's workings) much faster than standard methods. Think of it as an engine modification that significantly improves the fuel efficiency without losing any of the power.

FlashAttention does this by using two techniques: tiling and recomputation. Tiling involves splitting large matrices into smaller blocks and processing them one block at a time, which saves memory. Recomputation means not storing certain intermediate results, but recomputing them when needed, again saving memory. The idea is to trade a few extra computational operations (FLOPs) for far fewer reads and writes to GPU memory, because the GPUs running these computations are usually limited by memory bandwidth rather than by raw compute power.

FlashAttention is already being used in several software libraries, and there's an even faster version called FlashAttention2. It improves on the original by splitting different parts of the computation and parallelizing over different dimensions, further speeding up the computation process.
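
For example, recent PyTorch releases expose a fused attention call that can dispatch to FlashAttention-style kernels when the hardware and input types support it. A rough usage sketch (this particular example assumes a CUDA GPU and half-precision inputs):

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# PyTorch picks an efficient fused backend when it can, avoiding ever materializing
# the full (sequence x sequence) attention matrix in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```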

So in a nutshell, while Transformers are powerful tools in AI, they face challenges in terms of computational efficiency. But with techniques like FlashAttention, we're finding ways to make them faster and more efficient, just like improving the fuel efficiency of a powerful engine.

Transformers with Effective Long Contexts

Transformers, the powerful engines of AI, face a challenge when it comes to dealing with long context lengths. This means they have difficulty when they need to process and remember a lot of information at once. It's like trying to remember every detail of a long movie after watching it just once. This is problematic for tasks like carrying on long conversations, summarizing extensive documents, or making long-term plans, all of which need a lot of context to do well.

Recently, researchers have been trying to extend the 'memory span' of Transformers so they can handle more information at once. But researchers have also made an interesting discovery: even if a model can technically handle more context, its performance can actually drop as the context gets longer. To put it in simpler terms, just because a Transformer can take in a 3-hour movie doesn't mean it'll understand it any better than a 30-minute episode.

Furthermore, the researchers found that these models perform better if the most important information is at the start or end of the context, like a movie where all the important scenes are at the beginning or end. But if important details are in the middle, performance drops. So, these models are like 'U-shaped' reasoners.

All these findings are fascinating and could help in designing better AI models in the future. But it's essential to remember that this is still a new and active area of research. The hope is to create Transformers that can handle long sequences of information and understand it well, no matter where the crucial information is placed. This would be like having a perfect memory and comprehension for long movies, which is the ultimate goal for these AI models.

Multimodal Transformer

Imagine you're trying to build a tool that can understand and process not only text, but also pictures, audio, and other types of data - this is the idea behind multimodal transformers in AI. Just like a multitool, the goal is to make one model that's versatile enough to handle all these different 'modes' or types of data equally well.

Transformers, which are a type of AI model, have been used in a lot of areas like language, images, robotics, and speech. But creating a transformer that works equally well with all types of data without needing special adjustments is still a challenge. It's like trying to make a single multitool that works perfectly for every task, from screwing in a lightbulb to fixing a car engine.

Why is this so tough? Well, think about the difference between reading a book, looking at a picture, or listening to a song. Each type of data, or 'modality', has its own unique features that make it different from the others. Transformers are good at handling data that can be broken down into a sequence of pieces, like words in a sentence. But how we break down a picture or a song into pieces can vary greatly, so it's a challenge to create one model that can handle all of these effectively.

Creating such a model would be a huge step forward in AI. It would mean creating models that can smoothly switch between different types of data, which could open up new areas of research.

Currently, most of the best models in this area use separate processes to handle each type of data, and many are designed for visual language learning. Some examples of these models include Flamingo, Gato, ImageBind, OFA, Unified-IO, and Meta-Transformer. However, this piece does not delve deeply into the specifics of these models.


Open-source Implementations of Transformer

Imagine you want to build something complex, like a car. Instead of starting from scratch, you'd prefer to have a blueprint or even parts of the car ready-made. This is what open-source implementations of transformers in AI do - they give you ready-made parts to work with.

The original 'blueprint' for transformers was made in a tool called Tensor2Tensor, but it's not used anymore. Its successor is called Trax and is based on a tool called JAX.

There are several popular open-source implementations of transformers out there. Think of these like different brands of car parts that you can choose from. One of the most well-known is called the HuggingFace Transformer library. This library is like a toolkit that simplifies the process of using transformers for language and image tasks. It's user-friendly, neat, and has a large community of developers adding to it.
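
For instance, a few lines are enough to run a pretrained model through the HuggingFace library (a minimal sketch; the first call downloads a default pretrained model, and the exact model and score may differ on your setup):

```python
from transformers import pipeline

# A ready-made pipeline: tokenization, model, and post-processing bundled together.
classifier = pipeline("sentiment-analysis")
print(classifier("The Transformer architecture is remarkably versatile."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```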

Two other popular options are called minGPT and nanoGPT, created by a researcher named Andrej Karpathy. Additionally, there's a tool called x-transformers, which provides concise versions of various transformer models that are usually based on the latest research.

The good news is that you likely won't need to build a transformer from scratch. Modern tools for creating AI models, like PyTorch, Keras, and JAX, offer ready-made 'parts' or layers that you can easily use, just like picking out parts to build a car.
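
As an example of those ready-made parts, PyTorch's built-in layers let you assemble a full encoder stack in a few lines (a minimal sketch):

```python
import torch
import torch.nn as nn

# PyTorch ships ready-made Transformer "parts": here, a stack of six encoder blocks.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

x = torch.randn(2, 10, 512)    # (batch, sequence length, model dimension)
print(encoder(x).shape)        # torch.Size([2, 10, 512])
```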
