What Is Backpropagation?
Backpropagation (short for "backward propagation of errors") is the core algorithm used to train neural networks. It calculates how much each weight in the network contributes to the overall prediction error, then adjusts those weights to reduce the error. This process is repeated thousands or millions of times until the network produces accurate predictions.
How Backpropagation Works
Step 1: Forward Pass
Input data is passed through the network layer by layer. Each neuron applies its weights, bias, and activation function to produce an output. The final layer generates the network's prediction.
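The forward pass can be sketched in a few lines. This is a minimal illustration, assuming a small fully connected network stored as lists of weight matrices and bias vectors; the sigmoid activation is an illustrative choice, not prescribed by the article.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass through a small fully connected network.

    `weights` and `biases` are lists with one entry per layer.
    Returns all layer activations; the last one is the prediction.
    """
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b                   # weighted sum plus bias
        activations.append(1.0 / (1.0 + np.exp(-z)))  # sigmoid activation
    return activations
```

Caching every intermediate activation, as done here, is what makes the later backward pass cheap: each layer's gradient reuses the values computed on the way forward.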
Step 2: Loss Calculation
A loss function (also called a cost function) measures the difference between the network's prediction and the actual target value. Common loss functions include:
- Mean Squared Error (MSE) — for regression tasks
- Cross-Entropy Loss — for classification tasks
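Both losses above are a few lines of NumPy. A minimal sketch (the `eps` guard against `log(0)` is a common implementation detail, not part of the definitions):

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean Squared Error: average squared difference (regression)
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(p_pred, y_true, eps=1e-12):
    # Cross-entropy for one-hot targets and predicted probabilities
    # (classification); eps avoids taking log of exactly zero
    return -np.sum(y_true * np.log(p_pred + eps))
```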
Step 3: Backward Pass
The algorithm computes the gradient of the loss function with respect to each weight in the network, using the chain rule of calculus. Starting from the output layer and working backward:
- Calculate the gradient at the output layer
- Propagate gradients through each hidden layer
- Each weight receives a gradient indicating how much it should change
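The three steps above can be written out explicitly for a one-hidden-layer network. This is a sketch under stated assumptions: sigmoid hidden units, a linear output, and the loss taken as half the squared error (so the output gradient is simply `y_hat - y`); all names and shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(x, y, W1, b1, W2, b2):
    """Gradients for a one-hidden-layer network via the chain rule."""
    # Forward pass, caching intermediate values
    h = sigmoid(W1 @ x + b1)                 # hidden activations
    y_hat = W2 @ h + b2                      # linear output
    # Step 1: gradient at the output layer, for L = 0.5 * ||y_hat - y||^2
    delta2 = y_hat - y
    dW2 = np.outer(delta2, h)
    db2 = delta2
    # Step 2: propagate through the hidden layer (chain rule);
    # h * (1 - h) is the sigmoid derivative
    delta1 = (W2.T @ delta2) * h * (1 - h)
    # Step 3: each weight gets a gradient telling it how to change
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2
```

A standard sanity check for hand-derived gradients like these is to compare them against finite differences of the loss; the two should agree to several decimal places.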
Step 4: Weight Update
Using an optimization algorithm (like Stochastic Gradient Descent or Adam), weights are adjusted in the direction that reduces the loss:
w_new = w_old − learning_rate × gradient
The learning rate controls the size of each update step: too large and the updates overshoot the minimum; too small and training converges painfully slowly.
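The update rule above is one line of code:

```python
def sgd_update(w, grad, learning_rate=0.01):
    # One gradient-descent step: move against the gradient,
    # scaled by the learning rate
    return w - learning_rate * grad
```

With NumPy arrays the same line updates an entire weight matrix at once, which is how it is applied in practice.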
Why Backpropagation Matters
Backpropagation made it practical to train multi-layer neural networks by computing every weight's gradient efficiently in a single backward sweep; before it was widely adopted, there was no efficient, general way to assign credit to hidden-layer weights. Without it, deep learning as we know it would not exist.
Challenges with Backpropagation
| Challenge | Description |
|---|---|
| Vanishing Gradients | Gradients shrink to near-zero in deep networks, causing early layers to stop learning |
| Exploding Gradients | Gradients grow uncontrollably, causing unstable training |
| Computational Cost | Each training iteration requires a full forward and backward pass |
| Local Minima | The optimizer may get trapped in suboptimal solutions |
| Hyperparameter Sensitivity | Performance depends heavily on learning rate, batch size, and architecture choices |
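The vanishing-gradient problem from the table can be demonstrated numerically. Because the chain rule multiplies one derivative factor per layer, and the sigmoid's derivative never exceeds 0.25, the gradient reaching early layers shrinks geometrically with depth. The 20-layer figure below is an illustrative choice:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)   # maximum value is 0.25, at z = 0

depth = 20
grad = 1.0
for _ in range(depth):
    grad *= sigmoid_grad(0.0)   # best case: 0.25 per layer

print(grad)  # 0.25**20, about 9.1e-13: early layers barely learn
```

This is one reason modern networks favor ReLU-style activations, whose derivative is 1 over half their domain.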
Optimization Algorithms
Several optimization algorithms have been developed to improve on basic gradient descent:
- SGD (Stochastic Gradient Descent) — Updates weights after each randomly sampled example or mini-batch rather than the full dataset
- Adam — Combines momentum and adaptive learning rates; one of the most widely used optimizers
- RMSProp — Adapts learning rates based on recent gradient magnitudes
- AdaGrad — Adapts learning rates based on historical gradients
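To make the Adam entry concrete, here is a single scalar update step following the standard formulation (the default hyperparameters shown are the commonly cited ones; the function signature is illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v track exponential moving averages of
    the gradient (momentum) and squared gradient (adaptive scaling);
    t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (scaling)
    m_hat = m / (1 - beta1 ** t)              # correct startup bias
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Dividing by the running gradient magnitude is what gives Adam its per-parameter adaptive step size; RMSProp uses the same idea without the momentum term.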
The ELM Alternative: No Backpropagation Required
Extreme Learning Machines (ELMs) take a fundamentally different approach. Instead of iteratively adjusting weights through backpropagation, ELMs:
- Randomly assign input-to-hidden weights (and never change them)
- Compute output weights analytically using the Moore-Penrose pseudoinverse
This single-step solution eliminates backpropagation entirely, with reported training speedups of 100–1000x over conventional gradient-based training. For applications where training speed matters more than squeezing out marginal accuracy gains, ELMs offer a compelling alternative.
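The two ELM steps above fit in a dozen lines of NumPy. This is a minimal sketch: the hidden-layer size, the tanh activation, and the Gaussian weight initialization are illustrative choices, not requirements of the method.

```python
import numpy as np

def train_elm(X, y, n_hidden=50, rng=None):
    """Train an Extreme Learning Machine.

    Input-to-hidden weights are random and fixed; output weights are
    solved in one step with the Moore-Penrose pseudoinverse.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    W_in = rng.normal(size=(X.shape[1], n_hidden))  # random, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W_in + b)                       # hidden-layer outputs
    beta = np.linalg.pinv(H) @ y                    # analytic least-squares solve
    return W_in, b, beta

def predict_elm(X, W_in, b, beta):
    return np.tanh(X @ W_in + b) @ beta
```

Because the only "training" is one pseudoinverse computation, the cost is a single linear-algebra solve rather than many forward-backward passes, which is where the speed advantage comes from.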