The previous chapter ended on a problem: plain recurrent networks forget things quickly, because gradients shrink as they travel back across many time steps. For a long time this kept RNNs from being useful on anything but short sequences. The LSTM, short for Long Short-Term Memory, is a redesigned recurrent cell that fixes this, and it powered most of the serious sequence modelling of the 2010s.
A plain RNN has one piece of state, the hidden state, that it both computes with and remembers through. The LSTM adds a second channel called the cell state, which you can picture as a conveyor belt running straight through the whole sequence. Information can ride along this belt with very little interference, which means something learned early can still be available many steps later. The hidden state still exists and does the moment-to-moment work, but the cell state is what gives the LSTM its long memory.
What controls this belt is a set of small neural gates that decide, at every step, what to keep, what to throw away, and what to read out.
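Each gate is a tiny layer that uses the sigmoid function to output values between 0 and 1, where 0 means block completely and 1 means let through completely. There are three of them: the forget gate, which decides how much of the old cell state to keep; the input gate, which decides how much new information to write into it; and the output gate, which decides how much of the cell state to read out into the hidden state.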
Put together, at each step the LSTM forgets some of its old memory, writes in some new memory, and reads out a filtered view of it. Because the cell state is updated mostly by addition, with the old value scaled only by the forget gate rather than pushed through a weight matrix and a squashing function at every step, gradients can flow back through it across many steps without collapsing to zero. This is the main reason the LSTM remembers far longer than a plain RNN.
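To make the forget, write, and read-out steps concrete, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the weight names `W`, `U`, `b`, and the ordering of the four blocks inside them are assumptions chosen for this illustration, not the layout any particular library uses. The second loop pushes the forget gate close to 1 and the input gate close to 0 to show the conveyor-belt behaviour: the cell state passes through 50 steps almost unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout).

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    The four blocks are stacked as: forget, input, output, candidate.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: how much old memory to keep
    i = sigmoid(z[1 * units:2 * units])   # input gate: how much new memory to write
    o = sigmoid(z[2 * units:3 * units])   # output gate: how much memory to read out
    g = np.tanh(z[3 * units:4 * units])   # candidate values to write
    c_t = f * c_prev + i * g              # additive update of the cell state
    h_t = o * np.tanh(c_t)                # filtered view becomes the hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units, steps = 3, 4, 50
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# Run the cell over a short random sequence.
h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(steps, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

# Conveyor belt: bias the forget gate towards 1 and the input gate towards 0,
# then check how much of the starting cell state survives.
b_keep = b.copy()
b_keep[0 * units:1 * units] = 10.0
b_keep[1 * units:2 * units] = -10.0
c_start = np.ones(units)
h_kept, c_kept = np.zeros(units), c_start.copy()
for x_t in rng.normal(size=(steps, input_dim)):
    h_kept, c_kept = lstm_step(x_t, h_kept, c_kept, W, U, b_keep)

print("cell state after", steps, "steps:", c_kept)
```

In a trained network the gates are not fixed like this; they depend on the current input and hidden state, so the cell learns when to hold on to information and when to overwrite it.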
There is a popular simplified version called the GRU, or Gated Recurrent Unit, which merges some of the gates and drops the separate cell state. It has fewer parameters and trains a little faster, and in practice it often performs about as well as a full LSTM. It's common to try both and keep whichever does better on your data.
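The "fewer parameters" claim can be checked with a little arithmetic. The sketch below assumes the textbook formulations: an LSTM layer has four weight blocks (three gates plus the candidate), a GRU has three (two gates plus the candidate), and each block has an input matrix, a recurrent matrix, and one bias vector. The helper names `lstm_params` and `gru_params` are made up for this example.

```python
def lstm_params(input_dim, units):
    # Four blocks: forget gate, input gate, output gate, candidate.
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # Three blocks: update gate, reset gate, candidate.
    return 3 * (units * (input_dim + units) + units)

# Same sizes as the model in this chapter: 32 input features, 64 units.
lstm_count = lstm_params(32, 64)
gru_count = gru_params(32, 64)
print(lstm_count, gru_count, gru_count / lstm_count)
```

With these sizes the GRU has three quarters of the LSTM's parameters. A library may report a slightly different GRU figure: Keras's default GRU, for instance, keeps separate input and recurrent biases, so its summary should show a few more parameters than this textbook count.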
Using an LSTM in Keras is as simple as swapping the layer type. Everything else about the model stays the same:
```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(200, 32)),         # 200 time steps, 32 features each
    layers.LSTM(64),                       # swap SimpleRNN for LSTM
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```
You can stack LSTM layers for more capacity, or wrap one in a bidirectional layer so it reads the sequence both forwards and backwards, which often helps on text where context comes from both directions.
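Both variations might look like the sketch below, which reuses the input shape from the model above. The layer sizes are arbitrary choices for illustration. The one detail that is easy to miss when stacking is `return_sequences=True`: every LSTM layer except the last has to output the full sequence, not just its final hidden state, so that the next LSTM layer has a sequence to read.

```python
from tensorflow.keras import layers, models

# Stacked: the first LSTM returns one output per time step
# so the second LSTM receives a sequence rather than a single vector.
stacked = models.Sequential([
    layers.Input(shape=(200, 32)),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid')
])

# Bidirectional: one LSTM reads forwards, a second reads backwards,
# and their final outputs are concatenated (64 + 64 = 128 values).
bidirectional = models.Sequential([
    layers.Input(shape=(200, 32)),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation='sigmoid')
])

stacked.summary()
bidirectional.summary()
```

A bidirectional layer needs the whole sequence before it can produce anything, so it suits tasks like classifying a finished sentence better than forecasting, where future values are not available at prediction time.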
For most of the last decade, LSTMs were the workhorse of machine translation, speech recognition, and text generation. They are still a strong and practical choice for many time series problems, for forecasting, and for cases where data is limited or models need to stay small. For large-scale language work, though, they have been overtaken by the architecture in the next chapter, which removed the one thing an LSTM cannot escape: the need to process a sequence one step at a time.
Even with its long memory, an LSTM still reads a sequence one element at a time, which makes it slow to train and, in practice, limits how far apart two related elements can be before the connection between them fades. The next chapter introduces the attention mechanism and the Transformer, the architecture that threw out recurrence entirely and now underpins nearly every modern language model.