In the introduction to artificial neural networks we saw that a neuron computes a weighted sum, adds a bias, and then passes the result through an activation function. That last step looks like a small detail, but it is what separates a neural network from ordinary linear algebra. Remove it and a hundred-layer network collapses into a straight line. This chapter explains why that happens, walks through every activation function you will actually use, and tells you which one to reach for in each situation.
Recall the neuron's core operation, z = w·x + b, which is a linear function. Now stack two layers with no activation between them. Layer one computes z₁ = W₁x + b₁ and layer two computes z₂ = W₂z₁ + b₂. Substitute one into the other:
z₂ = W₂(W₁x + b₁) + b₂
= (W₂W₁)x + (W₂b₁ + b₂)
= W'x + b'
The two layers reduce to a single linear layer with combined weights W' and bias b'. No matter how many linear layers you stack, the result is always equivalent to one linear layer, so it can only ever draw a straight line or a flat plane. It could never separate two classes that curve around each other, recognise a digit, or model language.
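The collapse can be checked numerically. The following is a minimal sketch, assuming small made-up layer shapes and random weights (none of these values come from the text): it passes an input through two linear layers, then through the single combined layer W' = W₂W₁, b' = W₂b₁ + b₂, and compares the results.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Two linear layers with no activation between them (shapes are illustrative)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Pass x through both layers in sequence
two_layers = W2 @ (W1 @ x + b1) + b2

# Collapse them into a single layer: W' = W2 W1, b' = W2 b1 + b2
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))
# True
```

The same check should hold for any number of stacked linear layers, since each extra layer just multiplies into the combined weight matrix.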
The activation function inserts a non-linear bend after each layer, and that bend is what lets a network combine simple pieces into very complex shapes. This is the property that lets a neural network act as a universal function approximator, and everything else in this series rests on it.
One of the earliest neuron models, the 1958 perceptron, used a hard step function that output 1 above a threshold and 0 below it. It works for the simplest decisions, but it has a fatal flaw for learning: its slope is zero everywhere except at the jump, where it is undefined. As you will see in the chapter on backpropagation and gradient descent, training relies on the slope of the function to know which way to adjust the weights. A function with zero slope gives no such signal, so the step function is a historical starting point rather than a practical tool.
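To make the zero-slope problem concrete, here is a small sketch. The helper names `step` and `numerical_slope` are illustrative, not from any library, and the slope is estimated with a simple central difference rather than computed analytically.

```python
import numpy as np

def step(z, threshold=0.0):
    """Hard step: 1 above the threshold, 0 otherwise."""
    return np.where(z > threshold, 1.0, 0.0)

def numerical_slope(f, z, h=1e-4):
    """Central-difference estimate of the slope of f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(step(z))
# [0. 0. 1. 1.]
print(numerical_slope(step, z))
# [0. 0. 0. 0.]
```

Away from the threshold, nudging the input changes nothing in the output, so there is no hint about which direction to move the weights.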
The sigmoid, or logistic, function smooths the step into an S-curve that squashes any input into the range 0 to 1:
sigmoid(z) = 1 / (1 + e^(-z))
Because it outputs a value between 0 and 1, it reads naturally as a probability, which is why it is the standard choice for the output neuron of a binary (yes or no) classifier. We used it for exactly this in the introduction's pass/fail example. Its weakness is that for large positive or negative inputs the curve flattens and its slope approaches zero. In deep networks this causes the vanishing gradient problem, where early layers stop learning because almost no signal reaches them. Its output is also always positive rather than centred on zero, which tends to slow training.
import numpy as np

def sigmoid(z):
    # Squash any real input into the range (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-5, 0, 5])))
# approximately [0.0067 0.5 0.9933]
Use sigmoid for the output layer of binary classification, and avoid it in the hidden layers of deep networks.
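The saturation described above shows up directly in the sigmoid's derivative, which is s·(1 − s) where s = sigmoid(z). The sketch below (the function name `sigmoid_grad` is illustrative) evaluates that slope at a few points and then multiplies the best-case slope across ten layers, a deliberately simplified picture of how the signal shrinks on its way back through a deep stack.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
grads = sigmoid_grad(z)
print(grads)  # peaks at 0.25 for z = 0 and shrinks rapidly toward the tails

# Even in the best case the slope is 0.25, so ten sigmoid layers in a row
# scale the signal by at most 0.25 ** 10 (ignoring the weights entirely)
best_case = 0.25 ** 10
print(best_case)  # roughly 9.5e-07
```

The slope is about 0.0066 at z = ±5 and about 0.000045 at z = ±10, which is why saturated sigmoid units pass almost nothing backwards.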
Tanh is a rescaled sigmoid that outputs in the range minus 1 to plus 1:
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Because its output is centred on zero, gradients flow more symmetrically and training is usually faster than with sigmoid. It was the default hidden-layer activation for years and still appears inside recurrent networks such as the LSTMs we cover later. It does still saturate at the extremes, so it shares sigmoid's vanishing-gradient weakness in very deep stacks.
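The "rescaled sigmoid" relationship can be written as tanh(z) = 2·sigmoid(2z) − 1. The sketch below checks that identity with NumPy's built-in `np.tanh` and compares the mean output of the two functions over a symmetric range of inputs (the range itself is an arbitrary choice for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-4, 4, 9)  # symmetric inputs: -4, -3, ..., 4

# NumPy provides tanh directly
t = np.tanh(z)

# Identity: tanh(z) = 2 * sigmoid(2z) - 1
rescaled = 2 * sigmoid(2 * z) - 1
print(np.allclose(t, rescaled))
# True

# Over symmetric inputs tanh averages to about 0, sigmoid to about 0.5
print(t.mean(), sigmoid(z).mean())
```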
ReLU is the activation that made deep learning practical, and it is almost trivially simple:
relu(z) = max(0, z)
If the input is positive it passes through unchanged, and if it is negative the output is zero. It is extremely cheap to compute, and for positive inputs its slope is exactly 1, so it does not saturate. That largely solves the vanishing-gradient problem and lets very deep networks train. It also produces many exact zeros, giving sparse activations in which only part of the network is active for any given input. The one catch is the dying ReLU problem: if a neuron's weights push it into the negative region for every input, it outputs zero forever and its gradient is zero, so it never recovers, and a chunk of the network can quietly go dead.
def relu(z):
    # Pass positive inputs through unchanged, clamp negatives to zero
    return np.maximum(0, z)

print(relu(np.array([-3.0, -0.5, 2.0, 5.0])))
# [0. 0. 2. 5.]
Use ReLU for hidden layers as your default starting point in almost every modern feed-forward and convolutional network.
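The dying ReLU problem is easy to reproduce in isolation. In the sketch below, the weights, the large negative bias, and the random inputs are all made up for illustration; the point is only that once the pre-activation is negative for every input, both the output and the gradient are zero everywhere.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at z = 0 by convention)."""
    return (z > 0).astype(float)

# A hypothetical neuron whose large negative bias keeps it in the negative region
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))  # 100 made-up inputs in [0, 1)
w = np.array([0.5, -0.2, 0.1])
b = -5.0

z = X @ w + b
print(relu(z).max())       # 0.0 -> the neuron never fires
print(relu_grad(z).sum())  # 0.0 -> no gradient, so no weight update can revive it
```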
Several variants exist to patch ReLU's dying-neuron flaw. Leaky ReLU allows a small slope for negative inputs instead of flat zero, for example 0.01z, so the neuron always has a non-zero gradient and can recover. ELU, the exponential linear unit, smooths the negative side with an exponential curve, pushing the mean activation closer to zero, which can speed up convergence. GELU, the Gaussian error linear unit, is a smooth probabilistic gate that has become the standard activation inside Transformer models like BERT and GPT, so if you work with modern large language models, for example when building retrieval-augmented systems, GELU is the activation under the hood. In practice, start with ReLU, move to Leaky ReLU or ELU if you see dead neurons, and use GELU for Transformer-style architectures.
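The three variants can be sketched in a few lines of NumPy. The default values `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are common choices rather than anything fixed by the text, and the GELU here is the exact form z·Φ(z) built on the standard library's `math.erf`; some implementations use a tanh-based approximation instead.

```python
import math
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Small slope alpha on the negative side instead of a flat zero."""
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    """Exponential curve on the negative side, levelling off at -alpha."""
    # np.minimum keeps exp() from overflowing on the branch np.where discards
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0)) - 1))

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1 + erf(z / math.sqrt(2)))

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))         # negatives scaled by 0.01, positives unchanged
print(np.round(elu(z), 4))   # negatives bend smoothly toward -1
print(np.round(gelu(z), 4))  # small negative dip, then close to z
```

All three leave large positive inputs essentially unchanged; they differ only in how they treat the negative side.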
Everything above acts on a single number. Softmax is different because it acts on a whole layer of outputs at once and turns them into a probability distribution that sums to 1:
softmax(z_i) = e^(z_i) / Σ e^(z_j)
If your network classifies an image as one of ten digits, the output layer has ten neurons and softmax converts their raw scores into ten probabilities that add up to 1.0, and you pick the highest. This is the standard output activation for any multi-class classification problem.
def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))
# approximately [0.659 0.242 0.099] -> sums to 1.0
Notice the - np.max(z) step. Raw exponentials can overflow for large inputs, and subtracting the maximum value first keeps the numbers small without changing the result. It is a small but important production detail.
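The effect of the max-subtraction can be seen by comparing it with a naive version. In this sketch `softmax_naive` is a deliberately unsafe helper written for the comparison, and the large scores are made up to trigger the overflow.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # no shift: overflows for large inputs
    return e / e.sum()

def softmax(z):
    e = np.exp(z - np.max(z))  # shift so the largest exponent is 0
    return e / e.sum()

# On small inputs the two versions agree
small = np.array([2.0, 1.0, 0.1])
print(np.allclose(softmax_naive(small), softmax(small)))
# True

# On large inputs exp(1000) overflows to inf, and inf / inf gives nan
large = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive_result = softmax_naive(large)
stable_result = softmax(large)

print(naive_result)   # all nan
print(stable_result)  # valid probabilities that sum to 1
```

The shift is safe because multiplying the numerator and denominator by the same constant e^(−max) cancels out, so the probabilities are mathematically unchanged.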
For hidden layers, use ReLU by default, switching to Leaky ReLU or ELU if you run into dead neurons, and use GELU inside Transformer blocks. Inside LSTMs and GRUs, the gates use sigmoid and the candidate states use tanh. For the output layer, use sigmoid for binary classification with one neuron, softmax for multi-class with one neuron per class, and usually no activation at all for regression, since you do not want to squash a numeric prediction.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation='relu'),    # hidden layer
    layers.Dense(64, activation='relu'),     # hidden layer
    layers.Dense(10, activation='softmax')   # 10-class output
])
The pattern is consistent across every framework: ReLU-family activations in the hidden layers, and a task-appropriate activation, sigmoid or softmax or none, on the output layer.
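To see what those activations do without a framework, here is a rough NumPy sketch of the same 784-128-64-10 forward pass. The weights are random and untrained, the weight scale and batch of inputs are arbitrary, and the helper names (`softmax_rows`, `forward`) are illustrative; this is not how Keras implements the model internally.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax_rows(z):
    """Row-wise softmax, so each example in the batch gets its own distribution."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]

# Random, untrained weights (small scale chosen arbitrarily) and zero biases
params = [(rng.normal(scale=0.05, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)              # ReLU in the hidden layers
    W, b = params[-1]
    return softmax_rows(h @ W + b)       # softmax on the 10-class output

batch = rng.uniform(0, 1, size=(5, 784))  # 5 made-up flattened inputs
probs = forward(batch)
print(probs.shape)                          # (5, 10)
print(np.allclose(probs.sum(axis=1), 1.0))  # True
```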
Notice a recurring theme. Every good or bad property of an activation function came back to its gradient, whether the slope vanishes, stays at 1, or dies at zero. That is not a coincidence. Activation functions are chosen almost entirely for how they behave during training, and training is driven by gradients. The next chapter on backpropagation makes this concrete by showing exactly how those gradients are computed and used to update the weights.
We have mentioned gradients, vanishing gradients, and the slope of the function several times without fully explaining them. It is time to open that box and see how a network actually computes those gradients and uses them to learn from its mistakes.