The networks we've looked at so far assume every input is independent and complete. You hand a CNN a whole image and it gives you an answer. But a lot of data doesn't work that way. A sentence only makes sense word by word, in order. A stock price today depends on the days before it. Speech is a stream that unfolds over time. For this kind of sequential data we need a network with a sense of memory, and that is what recurrent neural networks provide.
Two things set sequences apart from the images a CNN handles. First, order carries meaning. "The dog bit the man" and "the man bit the dog" use identical words but mean very different things. Second, sequences vary in length. One review is ten words, another is three hundred. A fixed-size dense or convolutional input layer is a poor fit for data like this.
What we want is a network that reads a sequence one element at a time and carries forward some understanding of what it has seen so far, updating that understanding at each step.
An RNN processes a sequence one step at a time while maintaining a hidden state, which you can think of as its running memory. At each step it takes two things: the current input, and the hidden state left over from the previous step. It combines them to produce a new hidden state, which then carries forward to the next step. The same set of weights is used at every step, so the network applies the same logic whether it's reading the first word or the fiftieth.
That loop is the entire concept. Because the hidden state passes from one step to the next, information from earlier in the sequence can influence how later elements are interpreted. When you draw the loop out across all the time steps, it looks like a very deep network where each layer is a copy of the same cell, a picture often described as unrolling the network through time.
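The loop can be sketched in a few lines of NumPy. This is a minimal illustration, not how Keras implements it: the function name `rnn_forward`, the weight names `W_x`, `W_h` and `b`, the tanh activation, and the random weights are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b, h0=None):
    """Run a plain tanh RNN over one sequence.

    inputs has shape (time_steps, input_dim).
    Returns the hidden state at every step, shape (time_steps, hidden_dim).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in inputs:  # one element of the sequence at a time
        # Combine the current input with the previous hidden state.
        # The same W_x, W_h and b are reused at every step.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
time_steps, input_dim, hidden_dim = 100, 32, 64
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(time_steps, input_dim))
states = rnn_forward(sequence, W_x, W_h, b)
final_state = states[-1]  # the network's summary of the whole sequence
print(states.shape, final_state.shape)  # (100, 64) (64,)
```

Because the weights do not depend on the step index, the same function handles a sequence of any length: passing `sequence[:10]` returns ten hidden states instead of a hundred.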
Training works by the same backpropagation we covered earlier, applied to the unrolled network. The error at the end is propagated back through every time step to adjust the shared weights, a process called backpropagation through time. Conceptually it's the same chain rule, just run backwards along the sequence as well as through the layers.
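To make the chain rule concrete, here is a sketch of the gradient for the shared recurrent weights. It assumes the common tanh update $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ and a loss $L$ computed from the final state $h_T$; the symbols are notation introduced for this illustration.

```latex
\frac{\partial L}{\partial W_h}
  = \sum_{k=1}^{T}
    \frac{\partial L}{\partial h_T}
    \left( \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \right)
    \frac{\partial^{+} h_k}{\partial W_h},
\qquad
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^{2}\right) W_h
```

Here $\partial^{+} h_k / \partial W_h$ is the immediate derivative at step $k$ with $h_{k-1}$ held fixed, and the product is taken in order from $t = T$ down to $t = k+1$. The contribution from an early step $k$ passes through one factor per later time step, so its product is the longest.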
This is where RNNs run into trouble. Backpropagation through a long sequence multiplies many derivative factors together, one for each time step. As we saw in the backpropagation chapter, repeatedly multiplying numbers smaller than one drives the result toward zero, an effect known as the vanishing gradient problem. The weights then receive almost no learning signal from the early steps of the sequence. The practical consequence is that a plain RNN struggles to connect information across long gaps. By the time it has read a long paragraph, it has effectively forgotten how the paragraph began. Its memory is real but short.
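The shrinking product is easy to observe numerically. The sketch below assumes a tanh RNN and uses made-up recurrent weights `W_h`, rescaled so their largest singular value is 0.9. It tracks how strongly the hidden state at step t still depends on the initial state. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, time_steps = 16, 50

# Hypothetical recurrent weights, rescaled so the largest singular value is 0.9.
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_h *= 0.9 / np.linalg.norm(W_h, 2)

h = np.zeros(hidden_dim)
jacobian_product = np.eye(hidden_dim)  # d h_t / d h_0, built up step by step
norms = []
for _ in range(time_steps):
    x_term = rng.normal(size=hidden_dim)  # stands in for W_x @ x_t + b
    h = np.tanh(x_term + W_h @ h)
    step_jacobian = np.diag(1.0 - h**2) @ W_h  # d h_t / d h_{t-1}
    jacobian_product = step_jacobian @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: {norms[t - 1]:.2e}")
```

Each per-step factor has norm at most 0.9 here, so after 50 steps the product is bounded by 0.9 to the 50th power, roughly 0.005, and in practice it is smaller still because the tanh derivative is below one.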
This limitation is exactly what the next chapter addresses, and it's the reason simple RNNs are rarely used on their own for serious sequence work today.
Building one is straightforward. This example reads sequences and classifies them, the kind of setup you'd use for sentiment analysis on short text:
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100, 32)),          # 100 time steps, 32 features each
    layers.SimpleRNN(64),                   # the recurrent layer
    layers.Dense(1, activation='sigmoid')   # single probability for binary classification
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
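The SimpleRNN layer hides the per-step loop. You give it a sequence and, by default, it returns only the final hidden state. If you pass return_sequences=True, it returns the hidden state at every step instead, which is what you need when stacking recurrent layers.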
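Because the same weights are reused at every step, the size of the model depends on the feature and unit counts, not on the 100 time steps. A quick arithmetic check, assuming the usual layout of an input kernel, a recurrent kernel and a bias vector (the helper names are made up for this example):

```python
def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # weight matrix + bias
    return input_dim * units + units

features, rnn_units = 32, 64
rnn = simple_rnn_params(features, rnn_units)
head = dense_params(rnn_units, 1)
print(rnn, head, rnn + head)  # 6208 65 6273
```

These figures should match what model.summary() reports for the model above, and they stay the same if the sequence length changes from 100 to any other value.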
For years, recurrent networks were the standard tool for language tasks like translation, text generation, and speech recognition, which is the territory of the Natural Language Processing series. That has changed. The Transformer architecture, which we reach soon, has largely replaced RNNs for most language work because it handles long-range connections far better and trains faster. Recurrent networks are still useful for some time series problems and in settings where models must be small and run step by step, but for cutting-edge language work the field has moved on.
That said, understanding RNNs is not wasted effort. The ideas of a hidden state and processing a sequence over time carry directly into the more capable architectures that followed.
The short-memory problem held RNNs back for a long time, until a redesigned recurrent cell largely overcame it with a system of gates that decide what to remember and what to forget. That cell, the LSTM, is the subject of the next chapter.