...

Understand the backpropagation algorithm with Python code.

Backpropagation Algorithm 

In our previous blog, you learned about the feed-forward network used in a neural network; the feed-forward pass gave us our first predicted output. Now we shall discuss the backpropagation algorithm in detail, which concludes the whole process.

Why do we need a backpropagation algorithm?

The first prediction we get from the feed-forward network will usually differ greatly from the actual output, so we will not get the result we were hoping for; the accuracy could be as low as, say, 25%. The feed-forward pass alone therefore does not ensure accurate predictions, and we need a way to minimize the difference between the actual and predicted output to achieve higher accuracy. The backpropagation algorithm is the way we minimize this loss in order to make good predictions on unseen data.

All About the Loss Function

Let us consider the feed-forward network below.

[Figure: Architecture of a neural network, used to understand the loss function]

x1, x2 = Input Layer

h1, h2 = Hidden Layer

A1, A2 = Sigmoid Activation Function

A3, A4 = Softmax Activation Function

o1, o2 = Output Layer

In the figure of the feed-forward network, we got 0.24 and 0.14 as outputs, but the actual outputs are 0.32 and 0.22. You may think the difference is not much, but it is significant because we have standardized our data to the range 0 to 1. With this prediction we are going to get poor accuracy, so we calculate the loss (or error) between the actual and predicted output with the help of the loss function (also known as the error function).

The loss function is nothing but the sum of half the squared difference between the desired output and the predicted output.

Loss Function = ∑ ½ (Desired Output - Predicted Output)²

Here, ½ is a constant included to simplify the derivative later on.
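
To see this in action, here is a quick sketch in Python using the values from the figure above (predicted 0.24 and 0.14, desired 0.32 and 0.22):

# Loss for the figure's values: predicted (0.24, 0.14), desired (0.32, 0.22)
predicted = [0.24, 0.14]
desired   = [0.32, 0.22]
loss = sum(0.5 * (d - p) ** 2 for d, p in zip(desired, predicted))
print(loss)   # roughly 0.0064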

Now that we have the loss function, we have to minimize the loss in order to get more accuracy; in other words, we have to drive the value of the loss function as close to zero as possible.

To reduce the value of the loss function, we have to update the values of our weights and biases, and for that we have to backpropagate through the network with the help of the backpropagation algorithm.

 

[Figure: Backpropagation architecture in a neural network]

 

Gradient Descent to Update Weights and Reduce the Loss Function

We are going to use an algorithm to update our weights, known as gradient descent; see the figure below.

 

[Figure: Gradient descent used to update the weights and reduce the loss function]

 

In order to update the weights, we follow these steps:

Step 1: Take an initial value of the weights, W(init)

Step 2: Move W(init) in the direction of the decreasing gradient (slope); when it reaches the local minimum, the error has been reduced.

Step 3: Update the weights with the help of the below formula

           

                Updated weight = weight - α * ∂(error)/∂(weight)

α = Learning rate

Repeat Step 3 until you reach the local minimum; a minimal code sketch of this loop follows below.
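
To make these steps concrete, here is a minimal sketch of the update loop on a made-up one-variable error function, error(w) = (w - 3)², whose gradient is 2(w - 3); it only illustrates the update rule, not the network above.

# Gradient descent on a toy error function (an assumed example, not the network)
w = 10.0                    # Step 1: initial value of the weight
learning_rate = 0.1
for _ in range(100):        # Steps 2 and 3: move against the gradient
    gradient = 2 * (w - 3)  # derivative of (w - 3) ** 2
    w = w - learning_rate * gradient
print(w)                    # approaches 3, the minimum of the error function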

 

Learning Rate in Gradient Descent

 

The learning rate is the step size taken while minimizing the loss or error function.
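
Using the same toy error function as in the sketch above, you can see why the step size matters: a rate that is too small crawls towards the minimum, while a rate that is too large overshoots it and diverges. The numbers below are arbitrary and only for illustration.

# Effect of the learning rate on the toy error function (w - 3) ** 2
def step(w, lr):
    return w - lr * 2 * (w - 3)

w_small, w_large = 10.0, 10.0
for _ in range(20):
    w_small = step(w_small, 0.01)   # too small: still far from 3 after 20 steps
    w_large = step(w_large, 1.10)   # too large: bounces further away each step
print(w_small, w_large)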

Let us see the above discussion with the help of an example 

 

[Figure: The full feed-forward network]

 

While backpropagating with the backpropagation algorithm, let us say we have to update the weight 'w5'. See the procedure below:

We want to find how the loss changes with respect to 'w5', so we take the partial derivative:

∂(loss function)/∂w5 = ∂(loss function)/∂(out o1) * ∂(out o1)/∂(net o1) * ∂(net o1)/∂w5

This follows from the chain rule of partial differentiation. Here, net o1 is the weighted sum feeding the output node o1 (the term that w5 multiplies into), and out o1 is the activated output of o1.

Here, ∂(out o1)/∂(net o1) = out o1 * (1 - out o1)

out o1 = the output of node o1

∂(loss function)/∂(out o1) = out o1 - actual o1, and

∂(net o1)/∂w5 = out h1

It is a good idea to write these values down in your notebook for better understanding. Now, at last, we update the weight using the formula we discussed in Step 3.

w5 (updated weight) = w5 - 0.6 * ∂(loss function)/∂w5

Here, 0.6 is the learning rate.

This is how we calculate the gradients and update the weights, which minimizes the loss function and lets us achieve the desired accuracy.
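
To make the w5 update concrete, here is a small numeric sketch: out o1 = 0.24 and actual o1 = 0.32 come from the figure, while out h1 = 0.60 and the starting weight w5 = 0.45 are made-up values used purely for illustration.

# Updating w5 with assumed values (out_h1 and the initial w5 are hypothetical)
out_o1, actual_o1, out_h1 = 0.24, 0.32, 0.60
w5, learning_rate = 0.45, 0.6

d_loss_d_out_o1 = out_o1 - actual_o1       # ∂(loss function)/∂(out o1)
d_out_o1_d_net  = out_o1 * (1 - out_o1)    # ∂(out o1)/∂(net o1)
d_net_d_w5      = out_h1                   # ∂(net o1)/∂w5

d_loss_d_w5 = d_loss_d_out_o1 * d_out_o1_d_net * d_net_d_w5
w5_updated  = w5 - learning_rate * d_loss_d_w5
print(w5_updated)   # slightly larger than 0.45, since the gradient here is negative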

Code Implementation of Backpropagation Algorithm

 

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd

# Load the dataset: the first column holds the class label, the rest are features
dataset = pd.read_csv('dataset.csv')
dataset = dataset.values

X, y = dataset[:, 1:], dataset[:, 0]

# Min-Max Scaler: squash the features into the range 0 to 1
X = (X - X.min()) / (X.max() - X.min())

class Backpropagation:
    
    def __init__(self, X, y):
        # Min-max scale again (a no-op if X was already scaled to the range 0 to 1)
        self.X = (X - X.min()) / (X.max() - X.min())
        self.y = y
        self.H1_size = 256
        self.H2_size = 64
        self.OUTPUT_SIZE = len(np.unique(y))
        self.INPUT_SIZE = X.shape[1]
        self.losses = []
        
        # Initialize weights
        self.W1 = np.random.randn(self.INPUT_SIZE, self.H1_size)
        self.W2 = np.random.randn(self.H1_size, self.H2_size)
        self.W3 = np.random.randn(self.H2_size, self.OUTPUT_SIZE)
        
        # Initialize biases
        self.b1 = np.random.random((1, self.H1_size))
        self.b2 = np.random.random((1, self.H2_size))
        self.b3 = np.random.random((1, self.OUTPUT_SIZE))
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def softmax(self, z):
        # Subtract the row-wise maximum for numerical stability before exponentiating
        z = z - z.max(axis=1, keepdims=True)
        return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

    def forward(self, x):
        Z1   = x.dot(self.W1) + self.b1 # (N, H1_size) = (N, INPUT_SIZE)(INPUT_SIZE, H1_size) + (1, H1_size)
        A1   = self.sigmoid(Z1)
        Z2   = A1.dot(self.W2) + self.b2
        A2   = self.sigmoid(Z2)
        Z3   = A2.dot(self.W3) + self.b3
        yhat = self.softmax(Z3)
        
        self.activations = [A1, A2, yhat]
        
        return yhat
    
    def backprop(self, x, y, yhat, learning_rate=0.01):
        # NOTE: y is expected to be one-hot encoded here
        A1, A2, yhat = self.activations
        
        # Compute gradients; with softmax + cross-entropy the output error is simply yhat - y
        delta3 = yhat - y
        dldw3  = A2.T.dot(delta3)
        dldb3  = delta3.sum(axis=0, keepdims=True)
        
        delta2 = delta3.dot(self.W3.T) * (A2 * (1 - A2))
        dldw2  = A1.T.dot(delta2)
        dldb2  = delta2.sum(axis=0, keepdims=True)
    
        delta1 = delta2.dot(self.W2.T) * (A1 * (1 - A1))
        dldw1  = x.T.dot(delta1)
        dldb1  = delta1.sum(axis=0, keepdims=True)

        # Update Weights
        self.W3 -= dldw3 * learning_rate
        self.b3 -= dldb3 * learning_rate
        
        self.W2 -= dldw2 * learning_rate
        self.b2 -= dldb2 * learning_rate
        
        self.W1 -= dldw1 * learning_rate
        self.b1 -= dldb1 * learning_rate


    
    def compute_loss(self, y, yhat):
        # Cross-entropy loss: L = -sum(y * log(yhat)), with y one-hot encoded
        return -np.sum(y * np.log(yhat))

    def get_predictions(self, test):
        yhat = self.forward(test)
        preds = np.argmax(yhat, axis=1)
        return preds
    
    def accuracy(self, preds, true_labels):
        return (preds == true_labels).mean()
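
The class above only defines the forward and backward passes, so here is one possible way to train it end to end. It assumes y holds integer class labels (0, 1, 2, ...) that are one-hot encoded before computing the loss; the epoch count and learning rate are arbitrary choices for illustration.

# A possible training loop for the class above (epochs and learning rate are assumptions)
model = Backpropagation(X, y)

# One-hot encode the integer class labels
y_onehot = np.zeros((y.shape[0], model.OUTPUT_SIZE))
y_onehot[np.arange(y.shape[0]), y.astype(int)] = 1

for epoch in range(100):
    yhat = model.forward(model.X)
    model.losses.append(model.compute_loss(y_onehot, yhat))
    model.backprop(model.X, y_onehot, yhat, learning_rate=0.001)

preds = model.get_predictions(model.X)
print("Training accuracy:", model.accuracy(preds, y))

plt.plot(model.losses)   # the loss should trend downwards over the epochs
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()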

 
