Backpropagation Algorithm
In our previous blog, you learned about the feed-forward network used in a neural network, where we obtained our first predicted output. Now we will discuss the backpropagation algorithm in detail, which concludes the whole process.
Why do we need a backpropagation algorithm?
The first prediction we get from the feed-forward network usually differs greatly from the actual output, so we will not get the result we were hoping for, and the accuracy can be very low. The feed-forward pass alone therefore does not ensure accurate predictions, and we need a way to minimize the difference between the actual and predicted output to achieve higher accuracy. The backpropagation algorithm is a way to minimize this loss in order to achieve good predictions on unseen data.
All About the Loss Function
Let us consider the feed-forward network:
x1, x2 = Input Layer
h1, h2 = Hidden Layer
A1, A2 = Sigmoid Activation Function
A3, A4 = Softmax Activation Function
o1, o2 = Output Layer
In the figure of the feed-forward network, we got outputs of 0.24 and 0.14, while the actual outputs are 0.32 and 0.22. The difference may not look like much, but it is significant because we have standardized our data to the range 0 to 1. With this prediction we are going to get poor accuracy, so we calculate the loss (or error) between the actual and predicted output with the help of the loss function (also known as the error function).
The loss function is simply the sum of half the squared differences between the desired output and the predicted output.
Loss Function = ∑ ½ (Desired Output − Predicted Output)²
Here, ½ is a constant that cancels the factor of 2 produced when the squared term is differentiated.
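As a quick check, here is a minimal sketch of that loss computed for the example outputs above (0.24 and 0.14 predicted against 0.32 and 0.22 desired); the variable names are only for illustration.

desired = [0.32, 0.22]     # actual outputs from the example
predicted = [0.24, 0.14]   # outputs of the feed-forward pass

# Loss = sum of 1/2 * (desired - predicted)^2 over both outputs
loss = sum(0.5 * (d - p) ** 2 for d, p in zip(desired, predicted))
print(loss)  # 0.0064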
Now that we have the loss function, we have to minimize the loss in order to get more accuracy; in other words, we want to push the value of the loss function as close to zero as possible.
To reduce the value of the loss function, we have to update the values of our weights and biases, and for that we backpropagate the error through the network using the backpropagation algorithm.
Gradient Descent to Update Weights and Reduce the Loss Function
We use an algorithm called gradient descent to update the weights; see the figure below.
To update the weights, we follow these steps:
Step 1: Take an initial value of the weights, W(init)
Step 2: Move W(init) in the direction of the decreasing gradient (slope); once it reaches the local minimum, the error will have been reduced.
Step 3: Update the weights with the help of the below formula
Updated Weight = Weight − η * ∂(Error)/∂(Weight)
η = Learning rate
Repeat Step 3 until you reach the local minimum.
The learning rate is the step size taken while minimizing the loss or error function.
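To make these steps concrete, here is a minimal sketch of gradient descent on a made-up one-dimensional loss, loss(w) = (w − 3)², whose gradient is 2(w − 3); the toy loss, learning rate, and iteration count are assumptions for illustration only.

def grad_loss(w):
    # Gradient of the toy loss (w - 3)^2
    return 2 * (w - 3)

learning_rate = 0.1   # step size (eta)
w = 0.0               # Step 1: initial weight value

for _ in range(100):                      # repeat the update step
    w = w - learning_rate * grad_loss(w)  # move against the gradient

print(w)  # converges towards 3, the minimum of the toy loss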
Let us see the above discussion with the help of an example.
While backpropagating with the backpropagation algorithm, suppose we have to update the weight 'w5'. The procedure is as follows:
We will find out how the loss changes with respect to 'w5', so we take the partial derivative and apply the chain rule:
∂(Loss)/∂w5 = ∂(Loss)/∂out_o1 * ∂out_o1/∂net_o1 * ∂net_o1/∂w5
Here, out_o1 is the output of o1 and net_o1 is the weighted sum feeding into o1 (which is where w5 enters, multiplied by the hidden output out_h1). The three factors are:
∂(Loss)/∂out_o1 = out_o1 − actual_o1
∂out_o1/∂net_o1 = out_o1 (1 − out_o1), the derivative of the sigmoid
∂net_o1/∂w5 = out_h1, the output of the hidden neuron h1
You may want to note these values down for a better understanding. Now, at last, we update the weight using the formula we discussed in Step 3.
w5 (updated weight) = w5 − 0.6 * ∂(Loss)/∂w5
Here, 0.6 is the learning rate used in this example.
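As a small numeric sketch, take out_o1 = 0.24 and actual_o1 = 0.32 from the example above; the hidden output out_h1 = 0.6 and the current value w5 = 0.45 are assumed values purely for illustration.

out_o1 = 0.24     # predicted output of o1 (from the example)
actual_o1 = 0.32  # desired output of o1
out_h1 = 0.6      # output of hidden neuron h1 (assumed for illustration)
w5 = 0.45         # current value of w5 (assumed for illustration)
learning_rate = 0.6

# Chain-rule terms
dloss_dout = out_o1 - actual_o1       # d(Loss)/d(out_o1)
dout_dnet = out_o1 * (1 - out_o1)     # sigmoid derivative
dnet_dw5 = out_h1                     # d(net_o1)/d(w5)

dloss_dw5 = dloss_dout * dout_dnet * dnet_dw5
w5_updated = w5 - learning_rate * dloss_dw5
print(w5_updated)  # ≈ 0.4553, slightly larger than the old w5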
Therefore, this is how we calculate the gradients and update the weights, which minimizes the loss function and lets us achieve the desired accuracy.
Code Implementation of Backpropagation Algorithm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dataset.csv')
dataset = dataset.values
X, y = dataset[:, 1:], dataset[:, 0]
# Min-Max Scaler
X = (X - X.min()) / (X.max() - X.min())
class Backpropagation:
    def __init__(self, X, y):
        # Min-Max scale the inputs to the range 0 to 1
        self.X = (X - X.min()) / (X.max() - X.min())
        self.y = y

        # Network architecture: input -> 256 -> 64 -> number of classes
        self.H1_size = 256
        self.H2_size = 64
        self.OUTPUT_SIZE = len(np.unique(y))
        self.INPUT_SIZE = X.shape[1]
        self.losses = []

        # Initialize weights
        self.W1 = np.random.randn(self.INPUT_SIZE, self.H1_size)
        self.W2 = np.random.randn(self.H1_size, self.H2_size)
        self.W3 = np.random.randn(self.H2_size, self.OUTPUT_SIZE)

        # Initialize biases
        self.b1 = np.random.random((1, self.H1_size))
        self.b2 = np.random.random((1, self.H2_size))
        self.b3 = np.random.random((1, self.OUTPUT_SIZE))
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def softmax(self, z):
        # Subtract the row-wise max before exponentiating for numerical stability
        z = z - z.max(axis=1, keepdims=True)
        return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
    def forward(self, x):
        # Hidden layer 1: (N, INPUT_SIZE) -> (N, 256)
        Z1 = x.dot(self.W1) + self.b1
        A1 = self.sigmoid(Z1)
        # Hidden layer 2: (N, 256) -> (N, 64)
        Z2 = A1.dot(self.W2) + self.b2
        A2 = self.sigmoid(Z2)
        # Output layer: (N, 64) -> (N, OUTPUT_SIZE), softmax gives class probabilities
        Z3 = A2.dot(self.W3) + self.b3
        yhat = self.softmax(Z3)
        self.activations = [A1, A2, yhat]
        return yhat
    def backprop(self, x, y, yhat, learning_rate=0.01):
        # y is expected to be one-hot encoded, with the same shape as yhat
        A1, A2, yhat = self.activations

        # Compute gradients (softmax + cross-entropy gives delta3 = yhat - y)
        delta3 = yhat - y
        dldw3 = A2.T.dot(delta3)
        dldb3 = delta3.sum(axis=0, keepdims=True)

        delta2 = delta3.dot(self.W3.T) * (A2 * (1 - A2))   # sigmoid derivative
        dldw2 = A1.T.dot(delta2)
        dldb2 = delta2.sum(axis=0, keepdims=True)

        delta1 = delta2.dot(self.W2.T) * (A1 * (1 - A1))   # sigmoid derivative
        dldw1 = x.T.dot(delta1)
        dldb1 = delta1.sum(axis=0, keepdims=True)

        # Update weights and biases (gradient descent step)
        self.W3 -= dldw3 * learning_rate
        self.b3 -= dldb3 * learning_rate
        self.W2 -= dldw2 * learning_rate
        self.b2 -= dldb2 * learning_rate
        self.W1 -= dldw1 * learning_rate
        self.b1 -= dldb1 * learning_rate
    def compute_loss(self, y, yhat):
        # Cross-entropy loss: L = -sum(y * log(yhat)), with y one-hot encoded
        return -np.sum(y * np.log(yhat))

    def get_predictions(self, test):
        # Forward pass, then pick the class with the highest probability
        yhat = self.forward(test)
        preds = np.argmax(yhat, axis=1)
        return preds

    def accuracy(self, preds, true_labels):
        return (preds == true_labels).mean()
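The class above only defines the forward and backward passes; a minimal training loop is sketched below. The one-hot encoding helper, the number of epochs, and the learning rate are assumptions for illustration (the labels are assumed to be integers 0 to OUTPUT_SIZE − 1), not part of the original code.

def one_hot(labels, num_classes):
    # Convert integer labels into one-hot vectors, as backprop and compute_loss expect
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels.astype(int)] = 1
    return encoded

model = Backpropagation(X, y)
y_onehot = one_hot(y, model.OUTPUT_SIZE)

for epoch in range(100):                  # number of epochs chosen for illustration
    yhat = model.forward(model.X)         # feed-forward pass
    model.backprop(model.X, y_onehot, yhat, learning_rate=0.001)
    model.losses.append(model.compute_loss(y_onehot, yhat))

plt.plot(model.losses)                    # the loss should decrease over the epochs
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

preds = model.get_predictions(model.X)
print('Training accuracy:', model.accuracy(preds, y))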