Non-Linear Classification model using Deep neural networks (p-2)
We will be discussing how to perform binary classification using a non-linear neural network model. Concepts covered include the non-linear model, the feed-forward process, backpropagation, neural network architecture, and an example implementation.
If you are new to these concepts or not sure where to start, I recommend you visit Basics of deep neural networks with linear model.
If you are already familiar with the topics mentioned, you can dive into Multi-class classification network.
- Example code for the non-linear classification model is provided on GitHub
Non-Linear boundaries:
When a linear model is unable to represent a set of data, a non-linear model is used instead. What does a non-linear model look like?
So how are we going to obtain this curve?
We are going to combine two linear models to form a non-linear model (curve), as shown below.
Now, let's treat each linear model as an input node that contains some linear equation, and refer to the first model as x1 and the second model as x2. We multiply each linear model by a weight: x1 is multiplied by w1 and x2 by w2. And if you recall from our linear model explanation, we also treat the bias as a node, which is multiplied by some bias value b. Everything is then added up to obtain the linear combination. Recall that after the linear combination we apply a sigmoid activation function, ultimately resulting in the following curve.
Let us assume w1=1.5, w2=1, b=0.5.
Suppose the first model predicts that the blue point, marked in the image below, has a probability of 0.88 of being in the positive region, while the second model assigns the same point a probability of 0.64. We multiply the first model's probability by w1, that is, x1w1 = 0.88(1.5), the second model's probability by w2, that is, x2w2 = 0.64(1), and the bias node by b, that is, 1(0.5), then add them up to obtain the score.
0.88(1.5)+0.64(1)+1(0.5) = 2.46
Since we take a sigmoid of the linear combination in each of our models, we also take the sigmoid of this combined score, which converts it into a probability. In our case sigmoid(2.46) = 0.92, which indicates that in our new model the exact same point is 92% positive.
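To make the arithmetic concrete, here is a quick sketch in plain Python using the numbers from the example above:

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 1.5, 1.0, 0.5    # weights and bias from the example above
x1, x2 = 0.88, 0.64          # probabilities produced by the two linear models

score = x1 * w1 + x2 * w2 + b    # linear combination: 2.46
probability = sigmoid(score)     # approximately 0.92

print(round(score, 2), round(probability, 2))   # 2.46 0.92
```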
If you are not familiar with the sigmoid function, I recommend you visit the sigmoid function explanation.
Note that since we multiply each model by its respective weight, the input with the highest weight has the larger effect on the result. In our case w1 is clearly higher than w2, so our final model looks a little more like the first model.
Let's see what happens if we increase the weight of the second model to 3, that is, w2 = 3. If we take the linear combination with the new weights and apply the sigmoid, the resulting model looks a lot more like the second model, as you can see in the image below.
The point is this: linearly combining existing models to create new models that better classify hard data is the core of complex neural networks.
Neural Network Architecture and Feed Forward:
The linear model whose boundary line is given by the equation
-4x1 - x2 + 12 = 0
can be represented as a perceptron with weights -4 and -1 and a bias of 12.
Similarly, the other linear model, with equation
-(1/5)x1 - x2 + 5 = 0
is represented as a perceptron with weights -1/5 and -1 and a bias of 5.
As we discussed earlier, we combine both linear models, multiply each of them by its respective weight, add them up, and finally apply the sigmoid function to obtain a probability. This is done at each node to obtain the final curve.
The process we used to build this model is called the “feed forward process” of a deep neural network.
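As a sketch, the whole feed-forward pass for this two-model network fits in a few lines of Python. The hidden-node weights come from the two perceptrons above; the output-layer weights 1.5, 1 and bias 0.5 are the illustrative values used earlier:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x1, x2):
    # Hidden layer: each node is one of the linear models above
    h1 = sigmoid(-4.0 * x1 - x2 + 12.0)          # first perceptron: weights -4, -1, bias 12
    h2 = sigmoid(-(1.0 / 5.0) * x1 - x2 + 5.0)   # second perceptron: weights -1/5, -1, bias 5
    # Output layer: linearly combine the two models, then apply the sigmoid again
    return sigmoid(1.5 * h1 + 1.0 * h2 + 0.5)

# Probability that an example point is in the positive region
print(feed_forward(2.0, 2.0))
```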
For concision we can rearrange the notation of this neural network. Instead of representing our point as two distinct pairs of x1 and x2 input nodes, where each pair goes into a separate model, we can represent it as a single pair of x1 and x2 input nodes. That is, instead of having two of the same x1's, we have a single x1 that is multiplied by the weight of the first linear model and by the weight of the second linear model. We represent x2 in the same way.
The example above demonstrates the actual architecture of a neural network.
- The first layer is called the “input layer”, which contains the inputs x1 and x2.
- Instead of being processed directly in the output layer, these inputs must first go through the “hidden layer”, which is a set of linear models. You can use as many hidden layers as you require; in our case we used only 1. The more hidden layers you have, the deeper the model gets, which is why these are called deep neural networks.
- The final layer is the “output layer”, which results from the combination of our two linear models to obtain a non-linear model.
Note: Sometimes we need to combine non-linear models to obtain an even more complex model.
To find the error in a model we use cross entropy, which I discussed in my previous post; have a look at Cross entropy.
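As a quick refresher, for binary labels the cross-entropy error is the negative log-likelihood of the predictions. A minimal sketch:

```python
import math

def cross_entropy(labels, predictions):
    # Sum of -[y*ln(p) + (1 - y)*ln(1 - p)] over all labelled points;
    # confident but wrong predictions are penalised heavily
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, predictions))

print(cross_entropy([1, 1, 0], [0.9, 0.8, 0.2]))  # small error: predictions match the labels
print(cross_entropy([1, 1, 0], [0.2, 0.3, 0.9]))  # large error: predictions contradict the labels
```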
To recap:
- First, we conduct the feed-forward operation on previously labelled data to obtain predictions.
- Then we apply the error function to all of the probabilities, using cross entropy to determine the total error.
- The error is then backpropagated to update the weights of all of our models, and we repeat this iteratively at some learning rate until we obtain a good model.
So how does this backpropagation work?
Back propagation:
It is simply the reverse of feed forward. With feed forward we predict outputs for all of our training data; we then compare those predictions to the actual outputs of our labelled data. The more misclassifications there are, the larger the error, which is calculated using cross entropy. To reduce this error we apply some form of gradient descent, which I already discussed in detail. If you recall, the negative of the gradient takes us in the direction that minimizes the error the most. The same concept applies here.
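To make the idea concrete, here is a minimal sketch of one gradient-descent step for a single sigmoid unit with cross-entropy error (our network also has a hidden layer, but the same update rule reaches every weight via the chain rule). For the sigmoid plus cross-entropy combination, the gradient with respect to each weight works out to (prediction − label) × input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(weights, bias, x, y, learning_rate=0.1):
    # Feed forward: prediction for a single labelled point x
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    prediction = sigmoid(score)
    # Backpropagate: for sigmoid + cross-entropy, d(error)/d(w_i) = (prediction - y) * x_i
    error = prediction - y
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

# One update for the point (1.0, 2.0) with label 1 (illustrative numbers)
weights, bias = gradient_step([0.5, -0.3], 0.1, [1.0, 2.0], 1)
print(weights, bias)
```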
I recommend you visit the TensorFlow Playground and have fun learning with live examples.
Example Implementation:
Now that you have the idea, let's dive into an example. For the code, look at GitHub.
We are going to look at a complex classification problem whose data looks like this:
As you can see, our sample non-linear data is much more complex than the data from the previous linear model, shown below.
The data in question cannot be separated using a single line, so classifying it requires a deeper neural network. Let's build our classification model using Keras in Python.
As we discussed, we use simple linear models and combine them to form our required model. We use 1 hidden layer with 4 nodes (or 4 linear models), as shown below.
You might wonder, “is 4 a fixed number?” No, there is no single optimal number of nodes, but having a huge number of nodes or hidden layers can lead to overfitting.
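Here is a minimal sketch of such a model in Keras, assuming ring-shaped sample data generated with scikit-learn's make_circles (the actual dataset and hyperparameters in the GitHub repo may differ):

```python
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Non-linearly separable sample data: two concentric rings of points
X, y = make_circles(n_samples=500, noise=0.1, factor=0.2, random_state=0)

model = Sequential()
model.add(Dense(4, input_shape=(2,), activation='sigmoid'))  # hidden layer: 4 nodes (4 linear models)
model.add(Dense(1, activation='sigmoid'))                    # output layer: probability of the positive class
model.compile(optimizer=Adam(learning_rate=0.1),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, epochs=100, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```

With a single hidden layer of 4 sigmoid units, this is typically enough to draw a closed curve around the inner ring.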
That’s it; the code is available on GitHub with comments.
Congratulations on coming this far. You now have knowledge of linear and non-linear binary classification models using neural networks.
Now let's dive into Multi-class classification neural networks.