Linear Classification Model Using Neural Networks (Basics of Deep Neural Networks) (Part 1)
We are going to build a linear binary classification model to understand core neural network concepts, including the sigmoid function, cross entropy, and gradient descent, with examples.
You can find the code for the example used in this post in the GitHub repository. For better understanding, take a look at the concepts below before proceeding to the code.
If you are already familiar with these concepts, I recommend you visit Deeper neural networks using Non-Linear Model.
Some might be confused between ANN, DNN, and CNN.
- Note that all neural networks are referred to as Artificial Neural Networks (ANN). Neural networks with more than one hidden layer are called Deep Neural Networks (DNN). Convolutional Neural Networks (CNN) are mainly used in image processing. We will discuss each of these one by one.
Target:
To build a linear classification model that classifies whether or not a person has diabetes.
Graph Description:
* Each point in the graph represents a person who has been tested and labeled as diabetic or not
* Blue points represent people who are not diabetic
* Red points represent people who are diabetic
With a person's age and blood glucose level on the x and y axes, we need to find the line that best separates the data, so that when a new person's details are entered, our model can classify which group that person belongs to; in other words, whether or not the person is at risk of diabetes.
Final linear model to be built looks like this:
Model Building:
The system starts with a random linear model to separate our data, calculates the errors associated with this model, and readjusts the weights to minimize the error and properly classify the data points. Let's have a look at each step closely, with examples.
A perceptron is the basic building block of a neural network, and it takes inspiration from the brain. What does a brain do? It takes input from our ears, eyes, and nose, processes it, and produces a result. A perceptron works in a similar way.
The steps involved in building the model are:
Step-1: Let's consider an example: place a random linear model into a node, which we can call a "model node" or perceptron. As discussed in the section above, just like the brain, our model node receives the inputs age (x1) and blood glucose (x2).
A line can be represented as
w1x1 + w2x2 + b = 0
where w1 and w2 are the weights, which dictate the slope, and b is the bias.
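The line equation above can be sketched as a small scoring function. The weight and input values below are made-up illustrations, not numbers from the post:

```python
# A minimal sketch of the linear model's score w1*x1 + w2*x2 + b.
# The weights, bias, and inputs here are arbitrary illustrative values.
def score(x1, x2, w1, w2, b):
    """Linear combination for one person's (age, glucose) inputs."""
    return w1 * x1 + w2 * x2 + b

# Example: age = 50, blood glucose = 140, with arbitrary weights.
s = score(50, 140, w1=0.04, w2=0.02, b=-4.0)
print(s)  # points with a positive score fall on one side of the line
```

A point with score 0 lies exactly on the line; the sign of the score tells us which side of the line the point is on.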
Step-2: These weights (w1, w2) start out as random values, so at the beginning we just have a random line that does not classify our data correctly. As the neural network learns more about the data it's dealing with, it adjusts the weights based on the errors that resulted from categorizing the data with the previous weights, until it comes up with a better model. So, how do we do this?
Step-3: Starting from our random model, we use the sigmoid function to predict a continuous probability for each point. This function is also known as the activation function. What is this sigmoid function?
Sigmoid Function (Activation Function) Theory:
We discussed that the system starts with a random linear model to separate our data, calculates the errors associated with this model, and then readjusts the weights to minimize the error and properly classify the data points. Now comes the question: how do we calculate the error? We will need a continuous error function.
Looking at the diagram above, there are clearly two misclassified points. We know the blue points need to be below the line and the red ones above it.
So, the error function assigns each misclassified point a large penalty. For better understanding, in the image below the size of each point reflects the size of its penalty. We detect these error variations and thus figure out which direction to move the line the most. The total error is the sum of the penalties associated with each point.
From the image above, we can see that the error value is high. So we move the line in the direction of the most error, as shown below, until all the penalties are sufficiently small, thus minimizing the error.
Let's re-think our perceptron model.
Step-4: In the second node, based on the score of each point, the model predicts a value of 0 or 1: any point with a positive score gets a 1, otherwise a 0. These are discrete predictions derived from a step function. The problem is that a step function jumps abruptly from one constant value to another; there is no in-between, it's just 1 or 0.
We need continuous probabilities, which is why we cannot use the step function: it only tells us yes or no. So, we use the sigmoid function instead.
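The sigmoid replacement for the step function can be sketched in a few lines; the example scores below are arbitrary illustrations:

```python
import math

# The sigmoid activation squashes any score into a probability
# between 0 and 1, replacing the hard 0/1 jump of the step function.
def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0.0))   # 0.5 -> a point exactly on the line
print(sigmoid(4.0))   # ~0.98 -> far on the positive side
print(sigmoid(-4.0))  # ~0.02 -> far on the negative side
```

Unlike the step function, scores near the line map to probabilities near 0.5, giving us the smooth "in-between" values we need.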
Step-5: Using these probabilities, we calculate the error with cross entropy. So what is cross entropy?
Cross Entropy Theory: (Binary Cross Entropy)
It is an error function used to calculate the total error associated with our linear model. Remember: the more incorrectly our model separates the data, the larger the entropy value, and thus the larger the error.
Note: we are using binary cross entropy, which cannot be used for multi-class classification. You might have guessed that already :)
The idea is that, given some data, the computer starts with some random model and, based on that model, needs to calculate the error.
If you look at the example given below, you can see how cross entropy is calculated for a single point. Here the label y = 1 (the person is diabetic) and the model's predicted probability is p = 0.95; plugging these into the formula gives the cross entropy for that point.
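The per-point calculation can be sketched in Python. It uses the standard binary cross entropy formula, -(y*log(p) + (1-y)*log(1-p)), with the y = 1, p = 0.95 values from the example above; the function name is my own:

```python
import math

# Binary cross entropy for a single labeled point.
# y is the true label (0 or 1), p is the predicted probability of label 1.
def cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(cross_entropy(1, 0.95))  # ≈ 0.05, a well-classified point
print(cross_entropy(1, 0.05))  # ≈ 3.0, a badly misclassified point
```

Notice the penalty behavior described earlier: a confident correct prediction costs almost nothing, while a confident wrong prediction costs a lot.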
In the same way, we calculate the cross entropy of each and every point and sum them to find the total error associated with our model.
Let us consider two models, a good one and a bad one, as shown below, and calculate the total error for both to observe the difference. Looking at the example models below, you can easily identify the better one: the model with the lower entropy (error) value classifies more accurately, and that is the one on the right.
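The comparison can be sketched numerically. The labels and probabilities below are made-up stand-ins for the two models in the image, not the actual values from the post:

```python
import math

# Binary cross entropy for one point (same formula as before).
def cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Four illustrative points with true labels 1, 1, 0, 0.
labels = [1, 1, 0, 0]
good_model_probs = [0.9, 0.8, 0.2, 0.1]   # confident and mostly right
bad_model_probs  = [0.6, 0.4, 0.6, 0.4]   # unsure and often wrong

good_error = sum(cross_entropy(y, p) for y, p in zip(labels, good_model_probs))
bad_error  = sum(cross_entropy(y, p) for y, p in zip(labels, bad_model_probs))
print(good_error, bad_error)  # the better model has the smaller total error
```

The model whose probabilities agree with the labels accumulates a much smaller total, which is exactly how we pick the better of the two lines.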
Step-6: Using these cross entropy values, we apply gradient descent, which keeps minimizing the error, thereby obtaining the linear model that classifies best.
So, what is this Gradient Descent?
Gradient Descent Theory:
Previously, we calculated the total error with our cross entropy function. Now we use gradient descent to minimize that error and obtain a model that better classifies the data, and we keep doing that over and over, through many iterations, until we obtain the best line.
To minimize the error, we need to take its gradient (the derivative of the error with respect to the weights). Subtracting the gradient from our linear parameters (weight1, weight2, and the bias) moves them in the direction that decreases the error function the most, ultimately resulting in a linear model with a smaller error. This process is repeated, minimizing the error until we obtain a line with a small enough error to correctly classify our data.
Unfortunately, in plain Python we cannot simply ask for the derivative of the error function; we have to derive the equation ourselves and then code it.
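As a sketch of what that hand-derived update looks like: for a sigmoid output with cross entropy loss, the derivative with respect to each weight simplifies to (p - y) times the corresponding input. The training points, learning rate, and iteration count below are made-up illustrations:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Each point: (x1 = age, x2 = glucose, y = label), with inputs scaled
# to small numbers so a simple fixed learning rate behaves well.
points = [(0.2, 0.3, 0), (0.3, 0.2, 0), (0.7, 0.8, 1), (0.8, 0.9, 1)]
w1, w2, b = 0.0, 0.0, 0.0   # start from an arbitrary line
lr = 0.5                     # learning rate

for _ in range(2000):
    for x1, x2, y in points:
        p = sigmoid(w1 * x1 + w2 * x2 + b)
        # Hand-derived gradient of cross entropy: (p - y) * input.
        w1 -= lr * (p - y) * x1
        w2 -= lr * (p - y) * x2
        b  -= lr * (p - y)

# After training, the learned line separates the two groups:
print(sigmoid(w1 * 0.75 + w2 * 0.85 + b))  # above 0.5 (diabetic side)
print(sigmoid(w1 * 0.25 + w2 * 0.25 + b))  # below 0.5 (non-diabetic side)
```

Each pass nudges the line toward the misclassified points, which is precisely the "move the line in the direction of the most error" behavior described earlier.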
Finally, by following all these steps, we have implemented a linear classification model using a neural network.
Code for the example linear classification neural network model (diabetes detection) is available on GitHub.
Note:
Congrats on coming this far; don't stop now. Now that you have an idea of how a linear neural network model works,
let's dive into the Non-Linear Classification Neural Network Model for deeper and more complex neural networks.