Backpropagation and Gradient Descent

So far we have built a network that can make predictions (the introduction to artificial neural networks) and given it the non-linearity it needs to model complex patterns (activation functions). But a freshly built network is useless, because its weights are random and so are its predictions. This chapter covers the two ideas that turn that random network into one that works: gradient descent, which is how it improves, and backpropagation, which is how it knows what to improve. Together they are the learning engine of every neural network.

The Core Problem: Who Is to Blame?

A network might have millions of weights. When it makes a wrong prediction, the question is which weights caused the error, and in which direction each one should move to make the error smaller. Answering that efficiently, for millions of weights at once, is the entire challenge. Backpropagation is the algorithm that answers it, and gradient descent is what acts on the answer.

Step 1: Measuring the Error with a Loss Function

Before you can reduce an error you have to measure it. A loss function, also called a cost function, takes the network's prediction and the true answer and returns a single number describing how wrong it was, where lower is better and zero would be perfect. Two loss functions cover most cases. Mean squared error is used for regression, where you predict a number, and it averages the squared differences between predictions and truth:

MSE = (1/n) Σ (y_true - y_pred)²

Squaring punishes big mistakes far more than small ones and removes the sign. Cross-entropy loss is used for classification, where it measures how far the predicted probability distribution is from the true label, and it pairs naturally with the sigmoid and softmax outputs from the previous chapter. In both cases the loss is a function of the weights, so changing the weights changes the loss, and learning is just the search for the weights that make it as small as possible.
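The two losses can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function names, the eps clipping constant, and the sample values are assumptions made for the example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-9):
    """Binary cross-entropy; eps (an assumed constant) keeps log() away from zero."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Illustrative values, not taken from the tutorial's dataset
small_miss = mse([10.0], [9.0])   # off by 1
big_miss = mse([10.0], [7.0])     # off by 3, so the loss is 9 times larger

confident_right = binary_cross_entropy([1.0], [0.9])
confident_wrong = binary_cross_entropy([1.0], [0.1])

print("MSE:", small_miss, big_miss)
print("cross-entropy:", confident_right, confident_wrong)
```

An error three times larger costs nine times as much under MSE, and a confident wrong probability costs far more than a confident right one under cross-entropy.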

Step 2: Gradient Descent, Rolling Downhill

Picture the loss as a landscape of hills and valleys, where your horizontal position is the current set of weights and the height is the loss. You want to reach the lowest valley. You cannot see the whole landscape, since it has millions of dimensions, but at your current spot you can feel the slope under your feet, so the strategy is to take a step downhill and repeat. The gradient is that slope, and it points in the direction of steepest increase, so to decrease the loss you step in the opposite direction. This gives the most important equation in deep learning:

new_weight = old_weight - (learning_rate × gradient)

The learning rate controls how big each step is, and subtracting the gradient is what sends you downhill. Repeat this for every weight, over and over, and the loss slides toward a minimum.
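To see the update rule in isolation, here is a small sketch on a one-weight loss. The loss (w - 3)², the starting point, and the learning rate are assumptions picked so the minimum is easy to recognise; a real network has many weights, but each one follows the same rule.

```python
def loss(w):
    # Hypothetical one-weight loss with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0                 # arbitrary starting weight
learning_rate = 0.1
history = [loss(w)]

for step in range(50):
    # new_weight = old_weight - (learning_rate * gradient)
    w = w - learning_rate * gradient(w)
    history.append(loss(w))

print("final w:", w)
print("first and last loss:", history[0], history[-1])
```

Each step shrinks the distance to the minimum by the same factor, so the loss falls on every iteration and w settles close to 3.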

The Learning Rate: The Most Important Dial

If the learning rate is too small, training takes forever because the steps are tiny. If it is too large, you overshoot the valley, bouncing across it or even diverging as the loss climbs. When it is set well, you get a steady, efficient descent. Tuning it matters enough that it gets its own treatment in the hyperparameter tuning chapter, and the optimisers we cover next exist largely to adjust it intelligently for you.
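The three regimes are easy to reproduce on a toy problem. The sketch below assumes a one-weight loss (w - 3)² and three hand-picked learning rates; the specific numbers are illustrative and would differ for a real network.

```python
def final_loss(learning_rate, steps=50, start=0.0):
    """Run plain gradient descent on the toy loss (w - 3)^2 and return the final loss."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - learning_rate * grad
    return (w - 3.0) ** 2

initial_loss = (0.0 - 3.0) ** 2          # loss at the starting point

too_small = final_loss(0.001)   # barely moves in 50 steps
well_set = final_loss(0.1)      # settles close to the minimum
too_large = final_loss(1.1)     # every step overshoots further than the last

print("initial loss:", initial_loss)
print("too small:", too_small)
print("well set: ", well_set)
print("too large:", too_large)
```

With the tiny rate the loss is still most of the way up the hill after 50 steps, and with the large rate it ends far above where it started.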

Step 3: Backpropagation, Computing the Gradient Efficiently

Gradient descent needs the gradient of the loss with respect to every weight. Estimating each one separately, for example by nudging a single weight and re-running the network, would cost at least one extra forward pass per weight, which is hopelessly slow for millions of weights. Backpropagation is the algorithm that computes all of them in a single backward sweep through the network, reusing intermediate results. It works by applying the chain rule from calculus, which lets you find how a small change deep inside the network ripples out to the final loss by multiplying the local slopes along the path. The process has three parts:

The forward pass runs the input through the network, saving each layer's intermediate values, and computes the loss at the end.
The backward pass starts from the loss and works backwards layer by layer, using the chain rule to find how much each weight contributed to the loss, so the error signal is propagated back from output to input.
The update applies the gradient-descent rule to every weight using the gradients just computed.
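The three parts can be sketched on a network with one hidden layer. Everything specific here is an assumption for illustration: the layer sizes (2 inputs, 3 hidden units, 1 output), the sigmoid activations, the scaling of the inputs, the learning rate, and the random seed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs scaled to roughly [0, 1]; the scaling is an assumption of this sketch
X = np.array([[5, 7], [3, 4], [8, 6], [2, 3]], dtype=float) / 10.0
y = np.array([[1.0], [0.0], [1.0], [0.0]])
n = len(X)

# Hypothetical layer sizes: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
learning_rate = 0.5
losses = []

for step in range(200):
    # Forward pass: keep a1 and a2, because the backward pass needs them
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    p = np.clip(a2, 1e-9, 1 - 1e-9)
    losses.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))

    # Backward pass: chain rule, output layer first
    dz2 = (a2 - y) / n             # dL/dz2 for sigmoid + cross-entropy
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T               # error signal passed back to the hidden layer
    dz1 = da1 * a1 * (1 - a1)      # multiplied by the sigmoid slope
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Update: the gradient-descent rule applied to every parameter
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

print("first loss:", losses[0], "last loss:", losses[-1])
```

Note that dz2 is computed once and reused for dW2, db2, and da1, which is the reuse of intermediate results in miniature.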

Why activation functions mattered: The backward pass multiplies the slopes of the activation functions, together with the weights of each layer, along the way. If those slopes are tiny, as with a saturated sigmoid, the product shrinks toward zero by the time it reaches the early layers, which is the vanishing gradient problem from the activation functions chapter. If the per-layer factors are consistently larger than 1, which usually comes from large weights rather than from the activation itself, the product can blow up, which is the exploding gradient problem. This is the exact mechanism behind those warnings.
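A rough numerical picture of both problems, as a sketch: it multiplies one per-layer factor by itself for each layer of depth. The sigmoid slope is taken at its largest possible value (0.25, at z = 0), and the growing factor of 1.5 is an assumed stand-in for a layer whose combined slope and weight exceed 1.

```python
import numpy as np

def sigmoid_slope(z):
    """Derivative of the sigmoid at z."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

best_case = sigmoid_slope(0.0)    # 0.25, the largest slope a sigmoid can have
saturated = sigmoid_slope(5.0)    # far smaller once the unit saturates

# Vanishing: even the best-case slope shrinks quickly when multiplied per layer
for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}  product of slopes {best_case ** depth:.3e}")

vanish_10 = best_case ** 10

# Exploding: an assumed per-layer factor above 1 grows just as quickly
growing_factor = 1.5
explode_20 = growing_factor ** 20
print("saturated slope:", saturated)
print("factor 1.5 over 20 layers:", explode_20)
```

Real networks multiply matrices rather than single numbers, so this only shows the trend, not the exact size of the gradients.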

A Concrete Mental Model

Imagine the loss depends on a weight w through a chain: w affects z, z affects the activation a, and a affects the loss L. The chain rule says the gradient of the loss with respect to w is the product of the local slopes along that chain:

dL/dw = (dL/da) × (da/dz) × (dz/dw)

Backpropagation computes each of these local slopes once, starting at the loss and working back toward w (left to right in the formula above), and multiplies them. Because adjacent layers share parts of the chain, it reuses those shared pieces instead of recomputing them, and that reuse is what makes training millions of weights feasible.
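The chain can be checked with actual numbers. The sketch below assumes z = w × x, a sigmoid activation, and a squared-error loss L = (a - y)², with arbitrary values for w, x, and y; it then compares the chain-rule product with a numerical estimate of the slope.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_for(w, x, y):
    """The full chain w -> z -> a -> L, with an assumed squared-error loss."""
    z = w * x
    a = sigmoid(z)
    return (a - y) ** 2

w, x, y = 0.5, 2.0, 1.0     # arbitrary illustrative values

# Local slopes along the chain
z = w * x
a = sigmoid(z)
dL_da = 2.0 * (a - y)       # slope of (a - y)^2 with respect to a
da_dz = a * (1.0 - a)       # slope of the sigmoid
dz_dw = x                   # slope of w * x with respect to w

chain_rule_gradient = dL_da * da_dz * dz_dw

# Numerical check: nudge w a little in both directions and measure the change in L
h = 1e-6
numerical_gradient = (loss_for(w + h, x, y) - loss_for(w - h, x, y)) / (2 * h)

print("chain rule:", chain_rule_gradient)
print("numerical: ", numerical_gradient)
```

The two values agree to many decimal places, but the numerical version needs two extra evaluations of the loss for every weight, which is the cost backpropagation avoids.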

Batch, Stochastic, and Mini-Batch Gradient Descent

How much data you look at before each update is a choice. Batch gradient descent uses the entire dataset for one update, which gives an accurate gradient but is slow and memory-hungry. Stochastic gradient descent updates after every single example, which makes each update cheap but noisy; the noise makes progress erratic, yet it can help the weights escape poor local minima and saddle points. Mini-batch gradient descent updates after a small batch, commonly 32, 64, or 128 examples, and it is the practical default because it balances speed, stability, and hardware efficiency. One pass through all the mini-batches is one epoch.
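One way to produce mini-batches is to shuffle the example indices once per epoch and slice them. The helper name iterate_minibatches and the toy arrays below are assumptions made for this sketch, not part of any library.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that together cover the dataset once (one epoch)."""
    indices = rng.permutation(len(X))            # new random order each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)    # 10 toy examples, 2 features each
y = np.arange(10, dtype=float)

batch_sizes = [len(xb) for xb, yb in iterate_minibatches(X, y, batch_size=4, rng=rng)]
print(batch_sizes)   # three updates in this epoch; the last batch is smaller
```

Setting batch_size to len(X) turns this into batch gradient descent, and setting it to 1 gives stochastic gradient descent, so the three variants differ only in that one number.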

A Training Loop From Scratch (Python)

Here is gradient descent training a single neuron on the pass/fail task from the introduction, with no framework, so you can see every step:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Training data: [study, sleep] -> pass(1)/fail(0)
X = np.array([[5, 7], [3, 4], [8, 6], [2, 3]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)

# Start with random weights and bias
np.random.seed(42)
weights = np.random.randn(2)
bias = 0.0
learning_rate = 0.1

for epoch in range(2000):
    # Forward pass
    z = X @ weights + bias
    pred = sigmoid(z)

    # Loss (binary cross-entropy)
    loss = -np.mean(y * np.log(pred + 1e-9) + (1 - y) * np.log(1 - pred + 1e-9))

    # Backward pass (gradients)
    error = pred - y                 # dL/dz for sigmoid + cross-entropy
    grad_w = X.T @ error / len(X)
    grad_b = np.mean(error)

    # Update (gradient descent)
    weights -= learning_rate * grad_w
    bias    -= learning_rate * grad_b

    if epoch % 500 == 0:
        print(f"epoch {epoch:4d}  loss {loss:.4f}")

print("final weights:", weights, "bias:", bias)

Watch the loss fall as the loop runs, because that falling number is learning. The three phases, forward, backward, and update, map exactly onto the steps above. The tidy error = pred - y line is the gradient simplification you get when sigmoid pairs with cross-entropy, which is one reason that pairing is standard.
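The shortcut can be verified numerically. This sketch reuses the training data above with arbitrarily chosen weights (an assumption, picked so the sigmoid is not saturated) and compares the gradient from error = pred - y with a finite-difference estimate of the cross-entropy gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(weights, bias, X, y):
    """Binary cross-entropy of a single sigmoid neuron."""
    pred = sigmoid(X @ weights + bias)
    return -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))

X = np.array([[5, 7], [3, 4], [8, 6], [2, 3]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)
weights = np.array([0.1, -0.2])   # arbitrary values chosen for the check
bias = 0.0

# Gradient from the shortcut: dL/dz = pred - y, then dz/dw = X
error = sigmoid(X @ weights + bias) - y
analytic = X.T @ error / len(X)

# Gradient from central differences, one weight at a time
h = 1e-6
numeric = np.zeros_like(weights)
for i in range(len(weights)):
    step = np.zeros_like(weights)
    step[i] = h
    numeric[i] = (bce_loss(weights + step, bias, X, y)
                  - bce_loss(weights - step, bias, X, y)) / (2 * h)

print("shortcut:", analytic)
print("numeric: ", numeric)
```

The simplification works because the cross-entropy slope, -(y/a) + (1 - y)/(1 - a), multiplied by the sigmoid slope a(1 - a), reduces to a - y.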

The Same Thing in Keras

Frameworks hide all of this behind fit(). When you call it, Keras runs the forward pass, computes the loss, runs backpropagation, and applies the optimiser, for every mini-batch and every epoch:

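from tensorflow import keras
# Assumed model: a single sigmoid neuron, mirroring the hand-written version above
model = keras.Sequential([keras.Input(shape=(2,)), keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='sgd',          # gradient descent
              loss='binary_crossentropy')
model.fit(X, y, epochs=2000, batch_size=2)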

Everything you coded by hand is happening inside that one fit call. Understanding the manual version is what lets you debug it when training goes wrong, whether the loss is not falling, is exploding, or is stuck.

Connection to Classical Machine Learning

Gradient descent is not unique to neural networks. The same optimisation drives linear regression, logistic regression, and many other models. If you have followed the Machine Learning series you have already met gradient descent there, and neural networks simply apply it to a much larger, layered set of weights, with backpropagation making it tractable.
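As a sketch of that shared machinery, here is the same update rule fitting a straight line with MSE. The data (points on y = 2x + 1), the learning rate, and the step count are assumptions chosen so the right answer is known in advance.

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # start from zero slope and intercept
learning_rate = 0.05

for step in range(2000):
    pred = w * x + b
    error = pred - y
    # Gradients of MSE = mean((pred - y)^2) with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # The same update rule used for neural networks
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("slope:", w, "intercept:", b)
```

With only two parameters the gradients can be written out by hand; a neural network has too many for that, which is the gap backpropagation fills.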

Key Takeaways

A loss function turns "how wrong" into a single number to minimise: MSE for regression, cross-entropy for classification.
Gradient descent reduces the loss by stepping weights opposite to the gradient, using w = w - lr × gradient.
The learning rate sets the step size, where too big diverges and too small crawls.
Backpropagation uses the chain rule to compute every weight's gradient in one efficient backward sweep.
Vanishing and exploding gradients come from multiplying slopes through many layers.
Mini-batch gradient descent is the practical default.

What's Next

Plain gradient descent works, but it is slow and sensitive to the learning rate. In practice vanilla SGD is rarely used on its own; most training relies on smarter optimisers that adapt the step size automatically and add momentum to push through flat regions. That is where we go next.
