Non-Linear Classification model using Deep neural networks (p-2)
We will be discussing how to perform binary classification using a non-linear neural network model. Concepts covered include the non-linear model, the feed-forward process, backpropagation, neural network architecture, and an example implementation.
If you are new to these concepts or not sure where to start, I recommend you visit Basics of deep neural networks with linear model.
If you are already familiar with the topics mentioned, you can dive into Multi-class classification network.
- Example code for the non-linear classification model is provided on GitHub
Non-Linear boundaries:
When a linear model is unable to represent a set of data, a non-linear model is used instead. What does a non-linear model look like?
So how are we going to obtain this curve?
We are going to combine two linear models to form a non-linear model (curve), as shown below.
Now, let's treat each linear model as an input node that contains some linear equation, and refer to the first model as x1 and the second model as x2. We multiply each linear model by a weight: x1 is multiplied by w1 and x2 by w2. And if you recall from our linear model explanation, we also treat the bias as a node, which is multiplied by some bias value b. Everything is then added up to obtain the linear combination. Recall that after the linear combination we apply a sigmoid activation function, ultimately resulting in the following curve.
Let us assume w1=1.5, w2=1, b=0.5.
Suppose the first model predicts that the blue point, marked in the image below, has a probability of 0.88 of being in the positive region, while the second model assigns the same point a probability of 0.64. We multiply the first model's probability by w1, that is, x1w1 = 0.88(1.5), the second model's probability by w2, that is, x2w2 = 0.64(1), and the bias node by b, that is, 1(0.5), then add them up to obtain the score.
0.88(1.5)+0.64(1)+1(0.5) = 2.46
Since we take a sigmoid of the linear combination in each of our models, we also take the sigmoid of this combined score, which converts it into a probability. In our case sigmoid(2.46) = 0.92, which indicates that in our new model the exact same point is 92% positive.
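To make the arithmetic concrete, here is a quick sketch in plain Python using the numbers from the example above:

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 1.5, 1.0, 0.5    # weights and bias from the example above
x1, x2 = 0.88, 0.64          # probabilities produced by the two linear models

score = x1 * w1 + x2 * w2 + b    # linear combination: 2.46
probability = sigmoid(score)     # approximately 0.92

print(round(score, 2), round(probability, 2))   # 2.46 0.92
```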
If you are not familiar with the sigmoid function, I recommend you visit the sigmoid function explanation.
Note that since we multiply each model by its respective weight, the input with the highest weight has the larger effect on the result. In our case w1 is clearly higher than w2, so our final model looks a little more like the first model.
Let's see what happens if we increase the weight of the second model to 3, that is, w2 = 3. If we take the linear combination with the new weights and apply the sigmoid, the resulting model looks a lot more like the second model, as you can see in the image below.
The point is this: linearly combining existing models to create new models that better classify hard data is the core of complex neural networks.
Neural Network Architecture and Feed Forward:
The linear model whose boundary line is given by the equation
-4x1 - x2 + 12 = 0
can be represented as a perceptron with weights -4 and -1 and a bias of 12.
Similarly, the other linear model, with equation
-(1/5)x1 - x2 + 5 = 0
is represented as a perceptron with weights -1/5 and -1 and a bias of 5.
As we discussed earlier, we combine both linear models, multiply each of them by its respective weight, add them up, and finally apply the sigmoid function to obtain a probability. This is done at each node to obtain the final curve.
The process we used to build this model is called the “feed forward process” of a deep neural network.
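As a sketch, the whole feed-forward pass for this two-model network fits in a few lines of Python. The hidden-node weights come from the two perceptrons above; the output-layer weights 1.5, 1 and bias 0.5 are the illustrative values used earlier:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x1, x2):
    # Hidden layer: each node is one of the linear models above
    h1 = sigmoid(-4.0 * x1 - x2 + 12.0)          # first perceptron: weights -4, -1, bias 12
    h2 = sigmoid(-(1.0 / 5.0) * x1 - x2 + 5.0)   # second perceptron: weights -1/5, -1, bias 5
    # Output layer: linearly combine the two models, then apply the sigmoid again
    return sigmoid(1.5 * h1 + 1.0 * h2 + 0.5)

# Probability that an example point is in the positive region
print(feed_forward(2.0, 2.0))
```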
For concision we can rearrange the notation of this neural network. Instead of representing our point as two distinct pairs of x1 and x2 input nodes, where each pair goes into a separate model, we can represent it as a single pair of x1 and x2 input nodes. That is, instead of having two of the same x1's, we have a single x1 that is multiplied by the weight of the first linear model and by the weight of the second linear model. We represent x2 in the same way.
The example above demonstrates the actual architecture of a neural network.
- The first layer is called the “input layer”, which contains the inputs x1 and x2.
- Instead of being processed directly in the output layer, these inputs must first go through the “hidden layer”, which is a set of linear models. You can use as many hidden layers as you require; in our case we used only 1. The more hidden layers you have, the deeper the model gets, which is why these are called deep neural networks.
- The final layer is the “output layer”, which results from the combination of our two linear models to obtain a non-linear model.
Note: Sometimes we need to combine non-linear models to obtain an even more complex model.
To find the error in a model we use cross entropy, which I discussed in my previous post; have a look at Cross entropy.
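As a quick refresher, for binary labels the cross-entropy error is the negative log-likelihood of the predictions. A minimal sketch:

```python
import math

def cross_entropy(labels, predictions):
    # Sum of -[y*ln(p) + (1 - y)*ln(1 - p)] over all labelled points;
    # confident but wrong predictions are penalised heavily
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, predictions))

print(cross_entropy([1, 1, 0], [0.9, 0.8, 0.2]))  # small error: predictions match the labels
print(cross_entropy([1, 1, 0], [0.2, 0.3, 0.9]))  # large error: predictions contradict the labels
```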
To recap:
- First, we conduct the feed-forward operation on previously labelled data to obtain predictions.
- Then we apply the error function to all of the probabilities, using cross entropy to determine the total error.
- The error is then backpropagated to update the weights of all of our models, and we repeat this iteratively at some learning rate until we obtain a good model.
So how does this backpropagation work?
Back propagation:
It is simply the reverse of feed forward. With feed forward we predict outputs for all of our training data; we then compare those predictions to the actual outputs of our labelled data. The more misclassifications there are, the larger the error, which is calculated using cross entropy. To reduce this error we apply some form of gradient descent, which I already discussed in detail. If you recall, the negative of the gradient takes us in the direction that minimizes the error the most. The same concept applies here.
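To make the idea concrete, here is a minimal sketch of one gradient-descent step for a single sigmoid unit with cross-entropy error (our network also has a hidden layer, but the same update rule reaches every weight via the chain rule). For the sigmoid plus cross-entropy combination, the gradient with respect to each weight works out to (prediction − label) × input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(weights, bias, x, y, learning_rate=0.1):
    # Feed forward: prediction for a single labelled point x
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    prediction = sigmoid(score)
    # Backpropagate: for sigmoid + cross-entropy, d(error)/d(w_i) = (prediction - y) * x_i
    error = prediction - y
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

# One update for the point (1.0, 2.0) with label 1 (illustrative numbers)
weights, bias = gradient_step([0.5, -0.3], 0.1, [1.0, 2.0], 1)
print(weights, bias)
```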
I recommend you visit the TensorFlow Playground and have fun learning with live examples.
Example Implementation:
Now that you have the idea, let's dive into an example. For the code, look at GitHub.
We are going to look at a complex classification problem whose data looks like this:
As you can see, our sample non-linear data is much more complex than the data from the previous linear model, shown below.
The data in question cannot be separated using a single line, so classifying it requires a deeper neural network. Let's build our classification model using Keras in Python.
As we discussed, we use simple linear models and combine them to form our required model. We use 1 hidden layer with 4 nodes (or 4 linear models), as shown below.
You might wonder, “is 4 a fixed number?” No, there is no single optimal number of nodes, but having a huge number of nodes or hidden layers can lead to overfitting.
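Here is a minimal sketch of such a model in Keras, assuming ring-shaped sample data generated with scikit-learn's make_circles (the actual dataset and hyperparameters in the GitHub repo may differ):

```python
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Non-linearly separable sample data: two concentric rings of points
X, y = make_circles(n_samples=500, noise=0.1, factor=0.2, random_state=0)

model = Sequential()
model.add(Dense(4, input_shape=(2,), activation='sigmoid'))  # hidden layer: 4 nodes (4 linear models)
model.add(Dense(1, activation='sigmoid'))                    # output layer: probability of the positive class
model.compile(optimizer=Adam(learning_rate=0.1),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, epochs=100, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```

With a single hidden layer of 4 sigmoid units, this is typically enough to draw a closed curve around the inner ring.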
That’s it; the code is available on GitHub with comments.
Congratulations on coming this far. You now have knowledge of linear and non-linear binary classification models using neural networks.
Now let's dive into Multi-class classification neural networks.