Both RNNs and LSTMs share one stubborn limitation: they read a sequence one step at a time. That makes them slow to train, because step fifty cannot be processed until step forty-nine is done. It also leaves long-range connections fragile, because information from a distant step has to survive every update in between. In 2017 a paper titled "Attention Is All You Need" proposed an architecture that abandoned recurrence completely. It was called the Transformer, and it is the foundation of essentially every large language model in use today.
Instead of passing information along a chain, attention lets every position in a sequence look directly at every other position and decide what is relevant. When the model processes the word "it" in a sentence, attention lets it look back at all the earlier words and figure out which noun "it" refers to, in a single step, no matter how far back that noun appeared. There is no chain to travel along and no memory to slowly decay.
The mechanism is often described as a soft lookup. Each word produces three vectors through learned projections: a query, which represents what it is looking for; a key, which represents what it offers; and a value, which is the actual content it carries. To work out how much one word should pay attention to another, the model compares that word's query against the key of every word in the sequence, its own included. The comparison is a dot product, and a softmax turns the resulting scores into weights that sum to one. Strong matches get high weights, weak matches get low ones, and the weights are used to blend the values together. Every word ends up with a new representation that mixes in information from the words most relevant to it. This is called self-attention.
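The soft lookup can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices `w_q`, `w_k`, and `w_v` are random stand-ins for weights a real model would learn, the sizes are arbitrary, and the division by the square root of the key dimension follows the scaling used in the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: the content each token carries
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (seq_len, seq_len) match scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # blended values, attention weights

# Illustrative sizes and random weights (assumptions, not from a trained model).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # one new vector per token
print(np.round(weights, 2))  # row i: how much token i attends to each token
```

Row `i` of `weights` is the blend recipe for token `i`: every token's value vector contributes in proportion to how well its key matched token `i`'s query.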
Doing this once captures one kind of relationship. Transformers run several attention operations in parallel, called heads, each free to focus on a different pattern. One head might track which adjective modifies which noun while another tracks subject and verb. Their results are combined, giving the model a rich, multi-faceted view of the sequence. This is multi-head attention.
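One common way to run several heads in parallel is to split the model dimension into equal slices, attend within each slice, and then concatenate the results and mix them with an output projection. The sketch below assumes that layout. The weight matrices, including the output projection `w_o`, are random placeholders, and `d_model` must be divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own attention pattern independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads back into one vector per token, then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, head_weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)           # same shape as the input
print(head_weights.shape)  # one attention map per head
```

With random weights the heads show no meaningful specialization. Any division of labour, such as one head tracking adjectives and another tracking subjects, only emerges through training.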
There is one thing attention loses by treating a sequence as a set of positions all visible at once: it no longer knows the order of the words. Since order obviously matters in language, Transformers add a positional encoding to each input, a signal that tells the model where each token sits in the sequence. With that added back in, the model gets both the parallelism of attention and an awareness of order.
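The text does not say which positional encoding is meant, and several exist. As one example, here is a sketch of the fixed sine and cosine scheme from the original paper; many later models learn their position signals instead. The function name and the sizes are illustrative, and this version assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sine/cosine positional encoding, shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

# Random stand-ins for token embeddings (an assumption for illustration).
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
token_embeddings = rng.normal(size=(seq_len, d_model))

pe = sinusoidal_positions(seq_len, d_model)
model_input = token_embeddings + pe  # position signal is simply added
print(np.round(pe[:3], 3))           # each position gets a distinct pattern
```

Because every position receives a different vector, two occurrences of the same word at different places in the sentence no longer look identical to the attention layers.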
A Transformer is a stack of identical blocks, and each block is simpler than it sounds. A block runs multi-head self-attention, then passes the result through a small feed-forward network that uses the GELU activation we met earlier. Around each of these two parts sits a residual connection, which adds the input back to the output so information and gradients can skip ahead freely, along with layer normalization to keep the numbers stable. Stack dozens of these blocks, train on enormous amounts of text, and you get a model that handles language remarkably well.
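The block described above can be sketched as follows. Several simplifications are assumptions made for brevity: attention is single-head, layer normalization has no learned scale or shift, normalization is applied after each residual addition (some models apply it before), GELU uses its common tanh approximation, and all parameters are small random values rather than trained ones.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v

def transformer_block(x, p):
    # Part 1: self-attention, wrapped in a residual connection and layer norm.
    x = layer_norm(x + attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Part 2: feed-forward network, wrapped the same way.
    hidden = gelu(x @ p["w_1"] + p["b_1"])
    x = layer_norm(x + hidden @ p["w_2"] + p["b_2"])
    return x

def make_params(rng, d_model, d_ff):
    # Hypothetical helper: small random weights standing in for trained ones.
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model),
        "w_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "w_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_blocks = 4, 8, 32, 3
x = rng.normal(size=(seq_len, d_model))

# Stack identical blocks, each with its own parameters.
blocks = [make_params(rng, d_model, d_ff) for _ in range(num_blocks)]
out = x
for params in blocks:
    out = transformer_block(out, params)
print(out.shape)  # the shape is unchanged, so blocks can be stacked freely
```

The input and output of a block have the same shape, which is what makes stacking dozens of them straightforward.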
Transformers come in a few shapes. The original Transformer paired an encoder with a decoder for translation. Encoder-style models such as BERT read an entire input at once and are well suited to understanding tasks like classification and search. Decoder-style models such as the GPT family generate text one token at a time and are what sit behind modern chat assistants. All are built from the same attention blocks; they differ mainly in how they are arranged and trained, and in whether a position is allowed to attend to the tokens that come after it.
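One concrete difference between the two styles is the attention mask. A decoder that generates text one token at a time must not see tokens it has not produced yet, so it blocks attention to later positions. An encoder lets every position see every other. The sketch below shows that difference on random queries and keys. The function name and the `causal` flag are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_weights(q, k, causal=False):
    """Attention weights, optionally masked so no token sees the future."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal marks "future" positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # softmax sends these to 0
    return softmax(scores, axis=-1)

# Random queries and keys (assumptions for illustration only).
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
q = rng.normal(size=(seq_len, d_k))
k = rng.normal(size=(seq_len, d_k))

encoder_w = attention_weights(q, k, causal=False)
decoder_w = attention_weights(q, k, causal=True)
print(np.round(encoder_w, 2))  # every entry is non-zero
print(np.round(decoder_w, 2))  # zeros above the diagonal
```

In the masked version the first token can only attend to itself, the second to the first two, and so on, which is what allows a decoder to be trained to predict each next token without peeking at it.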
The representations a Transformer learns are vectors that capture the meaning of text, and those vectors are exactly what people mean by embeddings. When you build a system that searches your own documents by meaning rather than keywords, you are using a Transformer to turn text into embeddings and comparing them. That is the heart of retrieval-augmented generation, covered in depth in the RAG field manual, and the practical questions of choosing and evaluating an embedding model are dealt with directly in the chapter on picking and evaluating embeddings. In other words, the architecture in this chapter is the engine underneath the applied AI work in the rest of the catalogue.
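Comparing embeddings usually means measuring how closely two vectors point in the same direction, and cosine similarity is a common choice for that. The sketch below ranks three documents against a query. The three-dimensional vectors are made up by hand purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

documents = [
    "How to reset your password",
    "Quarterly revenue report",
    "Recovering a locked account",
]

# Hand-written toy embeddings (hypothetical, not produced by any model).
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
])
query_embedding = np.array([0.9, 0.05, 0.0])  # stands in for "forgot my login"

scores = cosine_similarity(query_embedding, doc_embeddings)
ranking = np.argsort(-scores)  # best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

The two account-related documents score far above the revenue report even though the query shares no keywords with them, which is the behaviour that search by meaning relies on.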
You will almost never build a large Transformer from scratch. The data and compute required are enormous. Instead you load one that has already been trained and adapt it to your task, which is the subject of the next chapter. A typical workflow pulls a pretrained model from a library and runs your text through it with only a few lines of code, getting either embeddings or generated text back. The skill that matters is not implementing attention by hand but knowing how these models behave and how to apply them well.
The next chapter covers transfer learning, the practice of taking a model someone else trained on a massive dataset and adapting it to your own problem with a fraction of the data and effort.