Pallavi Dash

# Intuition to Generalization in Deep Learning Models

Our expectation from a neural network is that it delivers good performance when it encounters unseen data. Generally, when the dataset is small, the model tends to memorize the given data, and we run into a situation called the **overfitting** condition. The model then performs poorly on the testing dataset, and consequently makes many wrong predictions.

**Fig 1 : Graphical Representation of Overfitting Condition (taken from Wikipedia)**

This is where the concept of **generalization** comes in. It is the process of making the model learn relevant features from the training dataset and apply that knowledge to correctly predict outcomes on a testing or validation dataset.

Let us try to understand with an example:

Suppose our classification task is to recognize whether an image is of a hibiscus flower or not. So we train our model with 100 images of various hibiscus flowers.

**Fig 2(a): Training Dataset**

**Fig 2(b): Testing Dataset**

To our surprise, when we test our model with 2 new varieties of hibiscus flowers (which were absent from the training dataset), it is unable to identify them as hibiscus. This clearly indicates that our model has memorized the images in the training dataset and only recognizes those as hibiscus; it has not extracted the intricate features that distinguish a hibiscus from all other flowers. This is the lack of generalization that the model has not been able to overcome.

Therefore, in this article, we are going to cover a few methodologies that can be adopted to significantly reduce overfitting and improve the model's ability to generalize to the data:

- Augmentation of Training Dataset
- Controlling Number of Parameters
- Dropout of Neurons
- Regularization of Weights
- Early Stopping

__AUGMENTATION OF TRAINING DATASET__

One simple way of attaining better generalization is to augment the dataset used to train the model. The techniques for data augmentation differ between text and vision applications.

For vision-based augmentation, we can efficiently enlarge the training dataset by zooming, injecting noise, or flipping the images. This increases the probability of recreating images close to the ones in the testing dataset.
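As a minimal sketch of the idea (using NumPy on a tiny made-up "image"; real pipelines would use a library such as torchvision or Keras preprocessing layers), flipping and noise injection can look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, noise_std=0.05):
    """Return simple augmented variants of a single image array."""
    flipped = np.fliplr(image)                             # horizontal flip
    noisy = image + rng.normal(0, noise_std, image.shape)  # inject Gaussian noise
    noisy = np.clip(noisy, 0.0, 1.0)                       # keep pixels in [0, 1]
    return [flipped, noisy]

# Example: a tiny 2x2 grayscale "image" with pixel values in [0, 1]
img = np.array([[0.1, 0.9],
                [0.4, 0.6]])
variants = augment(img)
```

Each variant is a plausible new training example that the model has not literally seen before, which is exactly what discourages memorization.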

For text-based augmentation, the most popular techniques include back translation and Easy Data Augmentation methods such as synonym replacement.

**Fig 3: Data Augmentation using Back Translation**

**Fig 4: Easy Data Augmentation using Synonym Replacement**

Also, the method adopted primarily depends on the task. In computer vision, for instance, flipping images might be helpful for object recognition, but will be of little to no use when classifying handwritten digits, since a flipped digit is often no longer the same digit.
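A toy sketch of synonym replacement (with a hypothetical hand-written synonym table; a real implementation would draw synonyms from a resource such as WordNet):

```python
import random

# A toy synonym dictionary; in practice one would use WordNet or similar.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
}

def synonym_replace(sentence, p=1.0, seed=0):
    """Replace each word that has a known synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word]))  # swap in a synonym
        else:
            out.append(word)                        # keep the word as-is
    return " ".join(out)

augmented = synonym_replace("the quick fox is happy")
```

The augmented sentence keeps the original meaning and structure while varying the surface wording, giving the model more diverse training text.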

__CONTROLLING NUMBER OF PARAMETERS__

In simple words, the parameters of a neural network are the weights (and biases) attached to the connections between neurons. With each layer added, we add more neurons, and with them more connections. From here, we can establish that the number of parameters grows with the total number of neurons in the network.

With each added layer, we introduce more non-linear relationships between our input and output. This increases the chance of the model memorizing when we are handling a smaller dataset. Since our goal is to minimize memorization and maximize the model's general understanding, we can control the number of parameters according to the size and structure of our dataset. Though there is no direct formula for this, one can initially train the model with a huge number of parameters until it overfits, and then slowly and iteratively reduce the parameters, check the results, and adjust accordingly.
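To make the relationship concrete, here is a small sketch that counts the parameters of a fully connected network from its layer sizes (the example sizes are illustrative, not from the article):

```python
def count_params(layer_sizes):
    """Total weights + biases in a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    plus one bias per output neuron.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

big   = count_params([784, 512, 512, 10])  # a larger model
small = count_params([784, 64, 10])        # a much slimmer model
```

Shrinking the hidden layers from two layers of 512 neurons to one layer of 64 cuts the parameter count by more than an order of magnitude, which is the kind of iterative reduction described above.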

__DROPOUT OF NEURONS__

In order to achieve better results, we could train multiple configurations of the neural network and then average their weights to obtain the final weights. This technique is famously called an **ensemble**. But this task is computationally very expensive.

There is a cheaper way of achieving something similar, called dropout of neurons.

In a neural network with dropout, each neuron has a chance of being completely dropped in an iteration, which means that at certain points in time some neurons will be entirely disconnected from all the other neurons for that iteration.

What happens is, in the absence of these disconnected neurons, the remaining neurons connect to each other in a configuration different from the earlier setup, leading to various subnetworks of the original neural network. Because the neurons are dropped at random, each iteration produces a different configuration.

After several iterations, we will have trained multiple subnetworks of the main network, which is close to what we were trying to achieve through an ensemble, but is computationally far more feasible.

**Fig 5: In the above figure, the neurons shown in black are dropped during a particular iteration. The neurons to drop are selected at random. After several epochs, the weights learned across these subnetworks are aggregated to form the final weights of the network.**

__REGULARIZATION OF WEIGHTS__

There is a chance during the training of our neural network that a few neurons focus on a certain feature or trend in the dataset and hence get assigned huge weights. These weights then keep growing with the epochs and overfit the training data.

While training our model, we should note that for extracting a certain feature, all that matters is that its weight is relatively larger than the other weights; the absolute value need not be very large.

The disadvantage of large weights is that they increase the input-output variance, making the network very sensitive to even small changes in the input. Intuitively, similar inputs should produce very similar outputs; with huge weights, similar inputs yield widely varying results, and the network predicts wrongly.

Now, our task is to restrict the growth of the allocated weights and not allow them to become huge. But how can we accomplish this? One way is to append a portion of the weights to the loss function, so that when we minimize the loss during training, the weights are minimized as well.

This is exactly where L1 and L2 regularization come into play. The intuition behind L1 and L2 regularization is a large topic in itself; for the in-depth mathematical reasoning, please visit this __blog__.
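As a minimal sketch of the L2 variant (using mean squared error and illustrative values; the penalty coefficient `lam` is a hyperparameter one tunes in practice):

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    """Mean squared error plus an L2 penalty on the weights.

    Minimizing this combined loss also shrinks the weights toward zero,
    which is exactly the "append a portion of the weights" idea above.
    """
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(weights ** 2)
    return mse + penalty

w = np.array([3.0, -4.0])
# Here the prediction is perfect (mse = 0), so the remaining loss
# comes entirely from the weight penalty: 0.1 * (9 + 16) = 2.5.
loss = l2_regularized_loss(np.array([1.0]), np.array([1.0]), w, lam=0.1)
```

Even with a perfect prediction, the loss is non-zero, so gradient descent still has an incentive to shrink `w`; an L1 penalty would use `lam * np.sum(np.abs(weights))` instead.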

__EARLY STOPPING__

Generally, during training, the training error keeps decreasing, since that is what we are optimizing. The validation error, however, improves initially but starts increasing as training continues.

**Fig 6 : Graphical Representation of Early Stopping (taken from Analytics Vidhya)**

This happens because our model starts overfitting on the training dataset, resulting in wrong predictions on the validation dataset. In such scenarios, we can stop training as soon as our model's validation error stops improving. This strategy helps us prevent overfitting.
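A minimal sketch of the stopping rule with a "patience" counter (the validation-error sequence is made up for illustration; frameworks expose this as a ready-made callback, e.g. Keras's `EarlyStopping`):

```python
def train_with_early_stopping(val_errors, patience=2):
    """Return the epoch at which training should stop: when the validation
    error has not improved for `patience` consecutive epochs."""
    best = float("inf")
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, waited = err, 0     # improvement: reset the counter
        else:
            waited += 1               # no improvement this epoch
            if waited >= patience:
                return epoch          # validation error keeps rising: stop
    return len(val_errors) - 1        # ran out of epochs without triggering

# Validation error improves up to epoch 2, then starts climbing.
stop = train_with_early_stopping([0.9, 0.6, 0.5, 0.55, 0.6, 0.7])
```

In practice one would also restore the weights saved at the best epoch, so the final model is the one with the lowest validation error rather than the last one trained.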

Hope this blog helps people who have just started with Deep Learning. Any suggestions from the readers are most welcome!