Long Short-Term Memory (LSTM) Networks

The previous chapter ended on a problem: plain recurrent networks forget things quickly, because gradients shrink as they travel back across many time steps. For a long time this kept RNNs from being useful on anything but short sequences. The LSTM, short for Long Short-Term Memory, is a redesigned recurrent cell built to address this, and it powered much of the practical sequence modelling of the 2010s.

The Core Fix: A Separate Memory Channel

A plain RNN has one piece of state, the hidden state, that it both computes with and remembers through. The LSTM adds a second channel called the cell state, which you can picture as a conveyor belt running straight through the whole sequence. Information can ride along this belt with very little interference, which means something learned early can still be available many steps later. The hidden state still exists and does the moment-to-moment work, but the cell state is what gives the LSTM its long memory.

What controls this belt is a set of small neural gates that decide, at every step, what to keep, what to throw away, and what to read out.

The Three Gates

Each gate is a tiny layer that outputs values between 0 and 1, using the sigmoid function, where 0 means block completely and 1 means let through completely. There are three of them:

The forget gate looks at the current input and the previous hidden state and decides how much of the existing cell state to erase. When the topic of a sentence changes, this is what lets the network drop information that no longer matters.
The input gate decides what new information from the current step should be written into the cell state. A tanh layer proposes candidate values and the input gate decides how much of each to actually store.
The output gate decides what part of the updated cell state to expose as the new hidden state, which is what gets passed on and used for predictions.
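
The three gates above can be sketched as a single time step in NumPy. This is a minimal illustration rather than how any particular library implements it: the function name `lstm_step`, the stacked weight layout in `W`, and the toy sizes are assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    """Squash values into (0, 1) so they can act as soft gates."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout, not a library API).

    x_t:    input at this step, shape (n_in,)
    h_prev: previous hidden state, shape (n_units,)
    c_prev: previous cell state, shape (n_units,)
    W:      stacked gate weights, shape (4 * n_units, n_in + n_units)
    b:      stacked gate biases, shape (4 * n_units,)
    """
    n_units = h_prev.shape[0]
    # Every gate sees the current input and the previous hidden state.
    z = W @ np.concatenate([x_t, h_prev]) + b

    f = sigmoid(z[0 * n_units:1 * n_units])   # forget gate: how much old memory to keep
    i = sigmoid(z[1 * n_units:2 * n_units])   # input gate: how much of each candidate to store
    g = np.tanh(z[2 * n_units:3 * n_units])   # candidate values proposed by the tanh layer
    o = sigmoid(z[3 * n_units:4 * n_units])   # output gate: how much of the memory to expose

    c_t = f * c_prev + i * g                  # scale the old memory, then add the new
    h_t = o * np.tanh(c_t)                    # hidden state is a filtered view of the cell state
    return h_t, c_t


# Toy run: 5 time steps, 3 input features, 4 units, random untrained weights.
rng = np.random.default_rng(0)
n_in, n_units = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_units, n_in + n_units))
b = np.zeros(4 * n_units)

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

Note that the cell state `c` is carried from step to step alongside `h`, and that the only things applied to it directly are an elementwise scaling by the forget gate and an addition.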

Put together, at each step the LSTM forgets some of its old memory, writes in some new memory, and reads out a filtered view of it. The cell state is updated by scaling it with the forget gate and then adding the new content, rather than by pushing it through a weight matrix and a squashing function at every step. As long as the forget gate stays close to 1, gradients can flow back along the cell state with far less shrinkage, which is the main reason the LSTM remembers much longer than a plain RNN.
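
A rough back-of-the-envelope comparison makes the gradient argument concrete. Along the direct cell-state path, the gradient passed back one step is scaled by the forget gate value, whereas in a plain RNN it is scaled by a factor that combines the recurrent weight and the tanh derivative. The per-step factors below (0.5 and 0.98) are illustrative assumptions, not measured values, and the sketch ignores the other paths gradients can take.

```python
steps = 100

# Plain RNN: assume each backward step shrinks the gradient by a factor of 0.5
# (an illustrative stand-in for recurrent weight times tanh derivative).
rnn_factor = 0.5
rnn_gradient = rnn_factor ** steps

# LSTM cell-state path: each backward step scales the gradient by the forget
# gate value, assumed here to sit at 0.98 for the whole sequence.
forget_gate = 0.98
lstm_gradient = forget_gate ** steps

print(f"plain RNN after {steps} steps:      {rnn_gradient:.2e}")
print(f"LSTM cell path after {steps} steps: {lstm_gradient:.2e}")
```

Under these assumptions the plain RNN signal is on the order of 1e-30 after 100 steps, while the LSTM path retains roughly 13% of it. The flip side is that a forget gate that learns to sit near 0 will cut the gradient just as effectively as it erases the memory.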

The GRU, a Lighter Alternative

There is a popular simplified version called the GRU, or Gated Recurrent Unit. It folds the forget and input gates into a single update gate, adds a reset gate, and drops the separate cell state, so the hidden state carries the memory on its own. It has fewer parameters and usually trains a little faster, and in practice it often performs about as well as a full LSTM. It's common to try both and keep whichever does better on your data.
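
To put a number on "fewer parameters", the sketch below counts weights for the textbook formulations of both cells. The helper names are hypothetical, and the sizes (32 input features, 64 units) are chosen to match the Keras example in this chapter.

```python
def lstm_params(n_in, n_units):
    # Four blocks (forget gate, input gate, candidate, output gate), each with
    # input weights, recurrent weights, and one bias vector.
    return 4 * (n_units * (n_in + n_units) + n_units)


def gru_params(n_in, n_units):
    # Three blocks (update gate, reset gate, candidate) in the classic formulation.
    return 3 * (n_units * (n_in + n_units) + n_units)


print(lstm_params(32, 64))  # 24832
print(gru_params(32, 64))   # 18624
```

The GRU comes out at three quarters of the LSTM's size here. Framework implementations can differ slightly from the textbook count; for example, the default Keras GRU keeps a second set of biases, so the number it reports is a little higher.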

An LSTM in Keras

Using one is as simple as swapping the layer type. Everything else about the model stays the same:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(200, 32)),     # 200 time steps, 32 features each
    layers.LSTM(64),                   # swap SimpleRNN for LSTM
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

You can stack LSTM layers for more capacity, or wrap one in a bidirectional layer so it reads the sequence both forwards and backwards, which often helps on text where context comes from both directions.
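
A minimal sketch of both ideas together, assuming the same input shape as the model above. The layer sizes are arbitrary choices for illustration. The detail that tends to trip people up is `return_sequences=True`: every recurrent layer that feeds another recurrent layer has to emit one output per time step, not just the final one.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(200, 32)),     # 200 time steps, 32 features each
    # Reads the sequence forwards and backwards and concatenates the results.
    # return_sequences=True keeps one output per time step for the next layer.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # The last recurrent layer returns only its final output.
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

print(model.output_shape)
```

Keep in mind that a bidirectional layer needs the whole sequence up front, so it suits tasks like classifying a finished sentence better than step-by-step forecasting.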

Where LSTMs Stand Now

For much of the 2010s, LSTMs were the workhorse of machine translation, speech recognition, and text generation. They are still a strong and practical choice for many time series problems, for forecasting, and for cases where data is limited or models need to stay small. For large-scale language work, though, they have been overtaken by the architecture in the next chapter, which removed the one thing an LSTM cannot escape: the need to process a sequence one step at a time.

Key Takeaways

LSTMs solve the short-memory problem of plain RNNs by adding a cell state that carries information across long sequences.
Three gates, built from sigmoid layers, control what to forget, what to store, and what to output.
The mostly-additive cell state, scaled only by the forget gate, is what helps keep gradients from vanishing over long sequences.
The GRU is a lighter variant that often matches the LSTM.
LSTMs remain useful for time series and smaller tasks, while Transformers now lead large-scale language work.

What's Next

Even with its long memory, an LSTM still reads a sequence one element at a time. Because each step depends on the one before it, the steps cannot run in parallel, which makes training slow on long sequences and still limits how reliably the network can link distant parts of a sequence. The next chapter introduces the attention mechanism and the Transformer, the architecture that threw out recurrence entirely and now underpins nearly every modern language model.
