Activation Functions in Neural Networks

In the introduction to artificial neural networks we saw that a neuron computes a weighted sum, adds a bias, and then passes the result through an activation function. That last step looks like a small detail, but it is what separates a neural network from ordinary linear algebra. Remove it and even a hundred-layer network collapses into a single linear function. This chapter explains why that happens, walks through the activation functions you are most likely to use, and tells you which one to reach for in each situation.

Why Non-Linearity Is the Whole Point

Recall the neuron's core operation, z = w·x + b, which is a linear function. Now stack two layers with no activation between them. Layer one computes z₁ = W₁x + b₁ and layer two computes z₂ = W₂z₁ + b₂. Substitute one into the other:

z₂ = W₂(W₁x + b₁) + b₂
   = (W₂W₁)x + (W₂b₁ + b₂)
   = W'x + b'

The two layers reduce to a single linear layer with combined weights W' and bias b'. No matter how many linear layers you stack, the result is always equivalent to one linear layer, so it can only ever draw a straight line or a flat plane. It could never separate two classes that curve around each other, recognise a digit, or model language.
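The collapse can be checked numerically. The following is a minimal sketch, assuming arbitrary layer shapes and random weights chosen only for illustration (the names W_combined and b_combined are not from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two linear layers with no activation in between (shapes are arbitrary).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Pass x through both layers in turn.
two_layers = W2 @ (W1 @ x + b1) + b2

# Collapse them into one layer: W' = W2 W1, b' = W2 b1 + b2.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU between the layers the collapse no longer holds in general.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(with_relu, one_layer))
```

The second check normally prints False, although it can print True for an input whose first-layer outputs all happen to be positive, since ReLU leaves those unchanged.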

The activation function inserts a non-linear bend after each layer, and that bend is what lets a network combine simple pieces into very complex shapes. This non-linearity is what allows a neural network with enough units to act as a universal function approximator, and everything else in this series rests on it.

The Step Function, Where It Started

Early neuron models, including the 1958 perceptron, used a hard step function that output 1 above a threshold and 0 below it. It works for the simplest decisions, but it has a fatal flaw for learning: its slope is zero everywhere except at the jump, where it is undefined. As you will see in the chapter on backpropagation and gradient descent, training relies on the slope of the function to know which way to adjust the weights. A function with zero slope gives no such signal, so the step function is a historical starting point rather than a practical tool.
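To make the zero-slope problem concrete, here is a small sketch that estimates the slope of a step function with a finite difference. The helper names step and numerical_slope, and the threshold of 0, are assumptions made for this illustration:

```python
import numpy as np

def step(z, threshold=0.0):
    # Hard threshold: 1 where z is above the threshold, else 0.
    return (z > threshold).astype(float)

def numerical_slope(f, z, h=1e-4):
    # Central finite difference, a rough stand-in for the derivative.
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(step(z))                   # [0. 0. 1. 1.]
print(numerical_slope(step, z))  # [0. 0. 0. 0.]
```

Away from the jump the estimated slope is zero at every point, so a gradient-based update would have nothing to work with.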

Sigmoid

The sigmoid, or logistic, function smooths the step into an S-curve that squashes any input into the range 0 to 1:

sigmoid(z) = 1 / (1 + e^(-z))

Because it outputs a value between 0 and 1, it reads naturally as a probability, which is why it is the standard choice for the output neuron of a binary (yes-or-no) classifier. We used it for exactly this in the introduction's pass/fail example. Its weakness is that for large positive or negative inputs the curve flattens and its slope approaches zero. In deep networks this causes the vanishing gradient problem, where early layers stop learning because almost no signal reaches them. Its output is also not centred on zero, which can slow training down.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-5, 0, 5])))
# [0.0067 0.5    0.9933]

Use sigmoid for the output layer of binary classification, and avoid it in the hidden layers of deep networks.
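The saturation described above shows up directly in the derivative. This sketch uses the standard identity sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)); the function name sigmoid_grad is an assumption for this example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(np.round(sigmoid_grad(z), 5))
```

The slope is 0.25 at z = 0, falls to roughly 0.0066 at z = ±5, and is close to zero by z = ±10, which is the flattening that starves early layers of signal.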

Tanh (Hyperbolic Tangent)

Tanh is a rescaled and shifted sigmoid, tanh(z) = 2·sigmoid(2z) - 1, that outputs in the range -1 to 1:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Because its output is centred on zero, gradients flow more symmetrically and training is usually faster than with sigmoid. It was the default hidden-layer activation for years and still appears inside recurrent networks such as the LSTMs we cover later, where it sits alongside sigmoid gates. It does still saturate at the extremes, so it shares sigmoid's vanishing-gradient weakness in very deep stacks.
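A short sketch can confirm the relationship between tanh and sigmoid and show where tanh saturates. It relies on NumPy's built-in np.tanh; the variable names are chosen for this example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 7)  # -3, -2, ..., 3

# tanh is the sigmoid stretched to (-1, 1): tanh(z) = 2 * sigmoid(2z) - 1.
rescaled = 2 * sigmoid(2 * z) - 1
print(np.allclose(np.tanh(z), rescaled))  # True

# Derivative of tanh is 1 - tanh(z)^2: 1 at z = 0, shrinking toward 0 as |z| grows.
tanh_grad = 1 - np.tanh(z) ** 2
print(tanh_grad.max())  # 1.0
```

The peak slope of 1 is four times sigmoid's peak of 0.25, which is part of why tanh tends to train faster, but the slope still decays toward zero at the extremes.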

ReLU (Rectified Linear Unit), the Default

ReLU is the activation that made deep learning practical, and it is almost trivially simple:

relu(z) = max(0, z)

If the input is positive it passes through unchanged, and if it is negative the output is zero. It is extremely cheap to compute, and for positive inputs its slope is exactly 1, so it does not saturate. That largely solves the vanishing-gradient problem and lets very deep networks train. It also produces many exact zeros, so activations are sparse, which can make computation more efficient. The one catch is the dying ReLU problem. If a neuron's weights push it into the negative region for every input, it outputs zero for all of them and its gradient is zero too, so gradient updates cannot pull it back and a chunk of the network can quietly go dead.

def relu(z):
    return np.maximum(0, z)

print(relu(np.array([-3.0, -0.5, 2.0, 5.0])))
# [0. 0. 2. 5.]

Use ReLU for hidden layers as your default starting point in almost every modern feed-forward and convolutional network.
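The dying ReLU problem is easier to see through the gradient. The sketch below uses made-up pre-activation values for a single neuron across four inputs; relu_grad and both arrays are hypothetical names and data for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Slope is 1 for positive inputs and 0 for negative ones.
    # (At exactly z = 0 the slope is undefined; using 0 there is a common convention.)
    return (z > 0).astype(float)

# Hypothetical pre-activations of one neuron over a batch of four inputs.
healthy = np.array([-1.0, 0.5, 2.0, -0.3])
dead = np.array([-4.0, -2.5, -0.1, -3.0])   # negative for every input

print(relu_grad(healthy))  # [0. 1. 1. 0.]
print(relu_grad(dead))     # [0. 0. 0. 0.]
```

The healthy neuron still receives gradient from the inputs that activate it. The dead one receives none from any input, so nothing in the batch can move its weights.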

Leaky ReLU, ELU, and GELU, the Fixes

Several variants exist to patch ReLU's dying-neuron flaw. Leaky ReLU allows a small slope for negative inputs instead of flat zero, for example 0.01z, so the neuron always has a non-zero gradient and can recover. ELU, the exponential linear unit, smooths the negative side with an exponential curve, pushing the mean activation closer to zero, which can speed up convergence. GELU, the Gaussian error linear unit, is a smooth probabilistic gate that has become the standard activation inside Transformer models like BERT and GPT. If you work with modern large language models, for example when building retrieval-augmented systems, GELU is the activation under the hood. In practice, start with ReLU, move to Leaky ReLU or ELU if you see dead neurons, and use GELU for Transformer-style architectures.
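The three variants can be sketched in NumPy as follows. The alpha defaults are common choices rather than values fixed by this chapter, and gelu_tanh is the widely used tanh approximation of GELU, not the exact form based on the Gaussian error function:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Small slope alpha on the negative side instead of flat zero.
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # Exponential curve on the negative side, levelling off at -alpha.
    # np.minimum keeps exp from overflowing on the unused positive branch.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0)) - 1))

def gelu_tanh(z):
    # Tanh approximation of GELU (an approximation, not the exact erf form).
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(z))
print(np.round(elu(z), 3))
print(np.round(gelu_tanh(z), 3))
```

All three agree closely with ReLU for large positive inputs and differ only in how they treat the negative side: a straight leak, an exponential that flattens out, or a smooth dip that returns to zero.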

Softmax, for Multi-Class Output

Everything above acts on a single number. Softmax is different because it acts on a whole layer of outputs at once and turns them into a probability distribution that sums to 1:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

If your network classifies an image as one of ten digits, the output layer has ten neurons and softmax converts their raw scores into ten probabilities that add up to 1.0, and you pick the highest. This is the standard output activation for any multi-class classification problem.

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))
# [0.659 0.242 0.099]  -> sums to 1.0

Notice the - np.max(z) step. Raw exponentials can overflow for large inputs, and subtracting the maximum value first keeps the numbers small without changing the result, because the same factor e^(-max) appears in the numerator and the denominator and cancels. It is a small but important production detail.
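To see why the subtraction matters, compare a naive version with the stable one on deliberately large scores. The names softmax_naive and softmax_stable and the input values are assumptions made for this sketch:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)              # overflows to inf for large z
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # largest exponent is now e^0 = 1
    return e / e.sum()

big = np.array([1000.0, 999.0, 998.0])

# Silence the overflow warnings so the nan result is easy to see.
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(big))   # [nan nan nan]

print(np.round(softmax_stable(big), 3))  # approximately 0.665, 0.245, 0.090
```

The stable version gives the same answer as softmax on [2.0, 1.0, 0.0], since only the differences between scores matter.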

How to Choose

For hidden layers, use ReLU by default, switching to Leaky ReLU or ELU if you run into dead neurons, and use GELU inside Transformer blocks. Inside LSTMs and GRUs, the gates use sigmoid and the candidate states use tanh. For the output layer, use sigmoid for binary classification with one neuron, softmax for multi-class with one neuron per class, and usually no activation at all for regression, since you do not want to squash a numeric prediction.

Setting Activations in Keras

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation='relu'),    # hidden layer
    layers.Dense(64,  activation='relu'),    # hidden layer
    layers.Dense(10,  activation='softmax')  # 10-class output
])

The pattern is consistent across frameworks: ReLU-family activations in the hidden layers, and a task-appropriate activation (sigmoid, softmax, or none) on the output layer.
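For intuition about what such a model computes, here is a rough NumPy sketch of the same 784-128-64-10 forward pass. It is not how Keras is implemented. The weights are random placeholders rather than trained values, and the function and variable names are assumptions for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    # Row-wise softmax for a batch of score vectors.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes copied from the Keras model; the random weights are placeholders.
sizes = [784, 128, 64, 10]
params = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(x @ W1 + b1)          # hidden layer, ReLU
    h2 = relu(h1 @ W2 + b2)         # hidden layer, ReLU
    return softmax(h2 @ W3 + b3)    # output layer, softmax

batch = rng.random(size=(5, 784))   # five fake 784-value inputs
probs = forward(batch)
print(probs.shape)        # (5, 10)
print(probs.sum(axis=1))  # each row sums to 1, up to rounding
```

Each layer is a linear step followed by its activation, and only the final activation changes with the task.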

The Connection to Training

Notice a recurring theme. Every good or bad property of an activation function came back to its gradient, whether the slope vanishes, stays at 1, or dies at zero. That is not a coincidence. Activation functions are chosen almost entirely for how they behave during training, and training is driven by gradients. The next chapter on backpropagation makes this concrete by showing exactly how those gradients are computed and used to update the weights.
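As a rough preview of that gradient story, the sketch below multiplies the activation slopes across ten layers. It is a deliberate simplification: it ignores the weight matrices that also enter the real chain rule, and the depth of 10 is arbitrary:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

depth = 10

# Best case for sigmoid: every layer sits at z = 0, where the slope peaks at 0.25.
sigmoid_factor = sigmoid_grad(0.0) ** depth

# ReLU on its active (positive) side contributes a slope of exactly 1 per layer.
relu_factor = 1.0 ** depth

print(sigmoid_factor)  # 0.25 ** 10, roughly 9.5e-07
print(relu_factor)     # 1.0
```

Even in sigmoid's best case the signal shrinks by about six orders of magnitude over ten layers, while active ReLU units pass it through unchanged.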

Key Takeaways

Without a non-linear activation, any deep network collapses to a single linear layer.
ReLU is the default for hidden layers, and its variants (Leaky ReLU, ELU, GELU) fix specific problems.
Sigmoid suits binary output and softmax suits multi-class output.
Saturating functions like sigmoid and tanh cause vanishing gradients in deep networks.
Activation choice is fundamentally a choice about gradient behaviour.

What's Next

We have mentioned gradients, vanishing gradients, and the slope of the function several times without fully explaining them. It is time to open that box and see how a network actually computes those gradients and uses them to learn from its mistakes.
