Convolutional Neural Networks - CNN Concepts (p-5)

Venkata Naveen Varma V
13 min read · Jun 20, 2021


We will be discussing the concepts of CNNs: CNN architecture, convolutional layers, pooling layers, the ReLU activation function, and fully connected layers.

In previous pages, we discussed binary classification with a linear neural model, non-linear models, multi-class classification, and an example project on digit recognition using a DNN.

If you are already familiar with these concepts, you can dive straight into the implementation of the digit recognition project using a CNN.

CNNs have become the go-to model for image classification, mainly due to their exceptional ability to extract important and distinctive features from images. They do so thanks to convolutional layers, which make use of the convolution operation.

Recall:

In the last page, we tried to classify handwritten digit images with regular artificial neural networks and reached a very limited accuracy. So, we now step into convolutional neural networks.

CNN:

Some applications that use CNNs are face recognition, object detection, and traffic sign classification.

CNNs are very effective at recognizing useful patterns within images because they understand that the spatial structure of the inputs is relevant, while typical neural networks ignore the spatial relevance of pixels, such as pixels being close together versus far apart.

CNNs also require far fewer parameters than regular artificial neural networks.

Let's consider an example from GitHub of a CNN classifying the image of a cat into its appropriate class.

CNN model example

Looking at the example above, we can see that one similarity between a CNN and a regular neural network is the input layer. Another similarity is the fully connected layer, which is essentially just a multilayer perceptron parameterized by weights and bias values; it makes use of the softmax activation function in the output layer and, after the many convolutions, outputs the probabilities of the image belonging to each class.

CNNs are designed to process data that has a known grid-like topology, so we will be using a CNN to solve an image-driven pattern recognition task. This makes sense, as images can be read as a grid of pixels.

grid representation

Grayscale images, which are what we'll be dealing with, consist of a 2D array of pixels, with each pixel value ranging from 0 to 255 depending on pixel intensity. The darker the pixel, the higher the intensity value, as shown in the image above.
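
To make this concrete, here is a minimal sketch (assuming Keras and its built-in copy of the MNIST dataset) that loads a grayscale image and inspects it as a 2D array of pixel intensities:

```python
# Minimal sketch: inspect a grayscale image as a 2D array of intensities.
# Assumes Keras and its built-in copy of the MNIST dataset.
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

image = X_train[0]               # a single grayscale image
print(image.shape)               # (28, 28): a 2D grid of pixels
print(image.min(), image.max())  # intensities lie in the range 0 to 255
```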

As we discussed in the MNIST digit recognition project, using a regular DNN for image classification gives limited accuracy and requires far more computational power. So we will be using a CNN.

Another drawback of a DNN is that it is prone to overfitting, whereas in a CNN we can make use of pooling layers, which act to continuously reduce the number of parameters and computations in the network.

pooling layers sample

CNN architecture with MNIST dataset example:

CNNs are very different from the regular ANNs we discussed before, as they are composed of three types of layers:

  • convolutional layers
  • pooling layers
  • fully connected layers
CNN Architecture with MNIST example.

Just like before, the input layer is where we pass in the pixel values of the image; in the image above there are still 784 of them, and we still have 10 output nodes, from 0 to 9, each corresponding to a class that the number being passed in can belong to. We still make use of the softmax function, which we discussed earlier, to make the final prediction.
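
As a refresher, softmax turns the raw output scores into probabilities that sum to one. A small NumPy sketch (the scores here are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtracting the max keeps the exponentials numerically stable
    # without changing the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([1.2, 0.3, 2.5])  # illustrative raw scores, one per class
probs = softmax(scores)
print(probs)           # approx. [0.20 0.08 0.72]: probabilities summing to 1
print(probs.argmax())  # 2: the index of the predicted class
```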

Let's focus on the convolutional layer first.

Convolutional Layers:

The main building block of the CNN is the convolutional layer. Its primary goal is to extract and learn specific image features that can be used to help classify the image.
As the name suggests, the layer employs an operation called convolution. Suppose we had the following image, where each pixel is denoted by some intensity from 0 to 255.

example of pixel intensities

This entire image is the input, and each pixel corresponds to an input node. All of these inputs, all of these pixels, are processed inside the convolutional layer by a convolutional filter, which is also known as the kernel or kernel matrix.

Kernel representation.

These kernels or filters are generally small in spatial dimensionality. Since we're dealing with a 3×3 kernel, we are applying a 3×3 convolution on our image. Kernel convolution, though it may sound overly complex, is actually really simple.

Kernel Convolution:

We perform the convolution operation by sliding the kernel over every location of the image. The amount by which we shift the kernel at every step is known as the "stride". In this case the stride is one, which makes the filter move one pixel at a time. "The bigger the stride, the smaller the corresponding feature map."
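
In general, convolving a K×K kernel with stride S over an N×N image produces a feature map of size ⌊(N − K)/S⌋ + 1 per side. For example, a 3×3 kernel with stride 1 over a 28×28 image yields a 26×26 feature map, while the same kernel with stride 2 yields only 13×13.
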
Let's look at a smaller image for simplicity.

example image for Kernel Convolution

The area where the operation takes place, the highlighted area, is known as the receptive field, whose dimensions correspond to the size of the kernel being used. In this case, we're performing the operation on a 3×3 receptive field, as shown above.
Inside the receptive field, we multiply every cell by the corresponding cell in the kernel. See the images below to understand this process.

Multiply the cells with corresponding kernel cells
multiply all cells in the same way as the above

Now add them up,

Add all of these values

Then, taking the average by dividing by the number of pixels in the receptive field, we get -6.1. This result is shown in the following feature map.

feature map

We then slide the kernel by a step size of 1, shifting our receptive field one unit to the right, and perform the same operations.

Shift the receptive field by 1 unit

We keep doing that until we've convolved the kernel over the entire image, which eventually results in the following feature map.

Notice how the general feature of the image is preserved: there is an edge between the small pixel intensity values and the larger ones.
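
Here is a minimal NumPy sketch of the operation just described (assuming a square image and kernel for brevity): slide the kernel over the image, multiply each receptive field by the kernel element-wise, sum the products, and divide by the number of pixels in the receptive field. Note that standard convolution implementations keep just the sum and skip the averaging step.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image, building the feature map.
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # The receptive field: the patch currently under the kernel.
            field = image[i*stride:i*stride+k, j*stride:j*stride+k]
            # Multiply cell-by-cell, sum, then average over the field.
            feature_map[i, j] = (field * kernel).sum() / kernel.size
    return feature_map

# Hypothetical 5x5 image with an edge between dark and bright pixels.
image = np.array([[10, 10, 10, 200, 200]] * 5, dtype=float)

# A vertical edge-detecting kernel.
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

print(convolve2d(image, kernel))  # largest values where the edge lies
```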

So why is this called a feature map?

Remember that the primary purpose of the convolutional layer is to extract and learn specific image features that we can use to help classify the image, and the feature map contains a specific feature of interest extracted from the original image: in this case, the edge. That being said, the kernel itself is the feature detector; convolving the kernel over the image is how we extract a feature from the original image onto a feature map.

Accordingly, every kernel is trained to have a distinct set of weights, which is what distinguishes one kernel from another. The values of the weights in the kernel are learned by the CNN during the training process through a gradient descent algorithm, which acts to minimize the error function by changing the kernel values to the ones that are best able to detect features in the image.

It's the same concept as before, where we used gradient descent to update weights in the direction that minimizes the error the most. In our case, that is the direction that best classifies the handwritten image.

CNNs possess an ability known as "translational invariance".
The idea is that if a kernel is able to detect a feature in one part of the image, then, since we're convolving it throughout the entire image, it is likely to detect the same feature somewhere else. In the image below, a kernel made to detect diagonal lines, when convolved throughout the entire image, identifies both diagonal lines on the corresponding feature map.

Translational invariance

Essentially, by sharing weights you enable the network to learn a single filter for a feature no matter where it appears in the image.
Different filters are able to detect different features in an image. The more filters we have, the more features we can extract from the original image, and thus the better the neural network's ability to recognize patterns in unseen images, as shown below.

Consider an example,

example

In this example, we performed three convolutions on the same input, using a different kernel in each convolution, with each kernel producing its own feature map. Combining all of these feature maps gives us the final output of the convolutional layer: a volume with a depth of three feature maps, where each feature map detected a distinct image feature that is then further processed by the neural network.

process

In the image above, we first used 15 different kernels and therefore ended up with 15 different feature maps. We stack them up along the depth dimension to form the full output volume of the first convolutional layer.

The deeper the resulting output volume, the more features we have extracted, thus improving the network's ability to classify the image.
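
As a sketch of how this looks in code (assuming Keras; the layer parameters mirror the example above but are otherwise illustrative), a convolutional layer with 15 kernels produces an output volume that is 15 feature maps deep:

```python
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters=15, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))

# The output volume stacks one 26x26 feature map per kernel.
model.summary()  # output shape: (None, 26, 26, 15)
```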

What we've dealt with so far are two-dimensional convolutions, and that is what we'll use when applying convolutions to images from the MNIST dataset. The reason is as follows:
grayscale image inputs are two-dimensional in the sense that they are represented by a single channel of pixel intensities, whereas RGB color images are composed of three channels and are thus three-dimensional.

Grayscale v/s RGB image

RGB images have not only a width and a height but also a depth, where depth corresponds to the number of color channels. Accordingly, when dealing with a 3D image, the kernel has to be three-dimensional as well: instead of being just a 3×3 kernel, it would be 3×3×3. The depth of the kernel must match the depth of the image.

In this section, we'll be dealing with a two-dimensional 28×28 grayscale image, and thus we will work with only two-dimensional kernels (not necessarily 3×3; they could be 5×5), but our depth will always be one.
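
A quick way to see the depth-matching rule (assuming Keras; the filter count of 8 is arbitrary) is to inspect the kernel weights the framework actually allocates:

```python
from keras.models import Sequential
from keras.layers import Conv2D

# Grayscale input: each 3x3 kernel has a depth of 1.
gray = Sequential([Conv2D(8, (3, 3), input_shape=(28, 28, 1))])
print(gray.layers[0].kernel.shape)  # kernel shape: (3, 3, 1, 8)

# RGB input: each 3x3 kernel has a depth of 3, matching the 3 channels.
rgb = Sequential([Conv2D(8, (3, 3), input_shape=(32, 32, 3))])
print(rgb.layers[0].kernel.shape)   # kernel shape: (3, 3, 3, 8)
```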

After convolving the kernel over the image, which results in a feature map, we then apply the "ReLU activation function".

ReLU activation function:

ReLU activation function

Any kind of neural network must contain non-linearity, since most of the real-world data that the neural network is required to learn is non-linear. Yet the convolution operation itself just convolves the kernel throughout the image.

What we're doing is performing element-wise multiplication and addition between the kernel and the image. This is a linear operation, resulting in a linear feature map.

The ReLU function is then used to introduce non-linearity: it replaces every negative value in the feature map with zero.
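
In code, ReLU is simply an element-wise maximum with zero. A one-line NumPy sketch applied to a small made-up feature map:

```python
import numpy as np

def relu(feature_map):
    # Replace every negative value with zero; positive values pass through.
    return np.maximum(0, feature_map)

fm = np.array([[-6.1, 3.2],
               [ 0.5, -1.8]])
print(relu(fm))  # [[0.  3.2]
                 #  [0.5 0. ]]
```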

Now we apply “pooling”.

Pooling:

The pooling layer acts to shrink the image stack by reducing the dimensionality of the representation of each feature map, thereby also reducing the computational complexity of the model.

Pooling is also done to help avoid overfitting.

There are three different types of pooling operations: sum, average, and max.

We're just going to talk about max pooling. We'll make use of the max pooling operation, which reports the maximum value within a rectangular neighborhood.

What we'll do is specify a kernel with 2×2 dimensionality and convolve it throughout each feature map such that it only takes the maximum value in each neighborhood, moving with a stride of two along the spatial dimensions of the input. Meaning, whenever it processes a rectangular neighborhood of the feature map obtained from the convolutional layer, it then shifts by a step size of two and processes the next rectangular neighborhood. If we continue this across the entire image for all the feature maps, we eventually end up with the following pooled feature maps.

example

Ultimately, what this does is scale down the feature map size to account only for the maximum values. But at the same time, notice that each feature map is still consistent with its feature of interest. The first scaled-down feature map still retains its feature of interest, the forward slash; the second still retains the X feature in the middle; and the third also retained its feature of interest. All the other feature maps likewise maintain their relative patterns, which can be better observed if the bright pixels in each feature map are highlighted.

Maintain the relative patterns

Why was it scaled down?

The reason is that it reduces computational cost, reduces the number of parameters in the network, and helps to reduce overfitting by providing an abstracted form of the original feature map. But even then, it still preserves the general patterns while being almost half the size of the original filtered image. Now our image is more manageable.

Max pooling provides a scale-invariant representation of the image, which is very useful as it allows the network to detect features no matter where they are located. Pooling also helps the network remain unaffected by small translations or distortions in the input image, because it takes only the maximum value in each local neighborhood.

In summary, pooling helps reduce overfitting by reducing the number of parameters and computations in the network and by providing a more abstracted, generalized representation of the original feature map.
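
Here is a minimal NumPy sketch of 2×2 max pooling with a stride of 2, as described above (the feature map values are made up):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the maximum value in each (size x size) neighborhood,
    # shifting by `stride` cells between neighborhoods.
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [7, 2, 9, 8],
               [3, 1, 4, 0]], dtype=float)
print(max_pool(fm))  # [[6. 5.]
                     #  [7. 9.]] (half the size, maxima preserved)
```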

Fully Connected Layer:

CNN example

First, the model extracts the necessary features in the convolutional and pooling layers, which we discussed in the sections above.

Subsequently, the output from the convolutional and pooling operations, that is, each feature map, must be flattened into a one-dimensional array of pixels to be fed into the input layer of the fully connected network, where each pixel corresponds to a node in the input layer.

The fully connected neural network is responsible for taking these features as inputs and processing them to attain a final probability for the class the image belongs to.

The fully connected layers work the same way as the multilayer perceptron, where every node in the preceding layer is connected to every node in the subsequent layer. This is what separates it from the convolutional layer: each neuron is connected to all the neurons in the previous layer, and each connection has its own weight.
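
Putting the pieces together, here is a hedged Keras sketch of this architecture (the layer sizes are illustrative, not the exact ones we'll use in the project): the feature maps are flattened into a one-dimensional array and passed through fully connected layers ending in a 10-way softmax.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(15, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())                        # feature maps -> 1D array of inputs
model.add(Dense(100, activation='relu'))    # every node sees every input
model.add(Dense(10, activation='softmax'))  # probabilities for digits 0-9
```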

How does the process actually work?

The fully connected layer simply updates its weights and bias values to minimize the total error function using a gradient descent algorithm, which we already discussed.

  • First, as always, random values are initialized for all the filters and parameters in the convolutional layers, and random values are also assigned to the weights and biases in the fully connected layer. Then, as always, the network receives inputs.
  • In the case of the example above, it receives a training image, which goes through the length of the neural network: its features are extracted with convolutions and scaled down into abstracted representations by the pooling layers.
  • After the necessary features are extracted, they are then classified. As our network attempts to classify the image, it outputs a prediction; here, malignant received the higher probability. This output prediction is compared to the target label. Suppose this image actually corresponds to a benign breast lesion, not a malignant one: the network made the wrong prediction, as shown below.
produces a wrong prediction
  • The overall error, that is, the cross-entropy value, is calculated, and it turns out the error is quite large.
  • To minimize this error, we must update all the filter values, weights, and bias values of our network.
  • Once again, this is done using backpropagation, where we still use gradient descent to update the parameters of the network based on the gradient of the error.
  • By minimizing the error, the network eventually learns the proper filter and parameter values to correctly extract features and classify them accordingly.
  • Ultimately, it outputs the correct prediction for the image.
Correct prediction

The values of the filter matrices in the convolutional layers and the connection weights in the fully connected layers are the only things that change during training.
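
As a preview of training (a minimal sketch assuming the `model` built above and the Keras copy of MNIST; Adam is used here as the gradient descent variant), compiling with cross-entropy loss and fitting is all it takes for backpropagation to update exactly those filter and connection weights:

```python
from keras.datasets import mnist
from keras.utils import to_categorical

(X_train, y_train), _ = mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1) / 255.0  # scale intensities to [0, 1]
y_train = to_categorical(y_train, 10)             # one-hot target labels

# Cross-entropy error, minimized by a gradient descent variant (Adam).
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=128)
```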

Now that we have completed the concepts of CNNs, let's continue with implementing our project on digit recognition using a CNN.
