RANDOM FOREST ALGORITHM AND ITS PYTHON IMPLEMENTATION
Random Forest is one of the most popular supervised machine learning algorithms. It can be used for both classification and regression tasks. Random Forest is based on ensemble learning, the practice of combining multiple models to solve a complex problem and improve overall performance. A Random Forest trains a number of decision trees on different random sub-samples of the given dataset and combines their outputs: for classification it takes the majority vote of the trees' predictions, and for regression it averages them. Because the final output does not rely on a single decision tree, the combined prediction is usually more accurate and less prone to overfitting than any individual tree. Adding more trees generally makes the result more stable, although the gain levels off and the training cost grows with the number of trees. This diagram explains the working of the Random Forest algorithm:
WHY USE RANDOM FOREST ?
1. Random Forest often trains quickly, because its trees are independent of one another and can be built in parallel.
2. Random Forest predicts output with high accuracy, even for large datasets with high dimensionality.
3. Random Forest can also maintain accuracy when a large proportion of the data is missing.
HOW DOES THE ALGORITHM WORK ?
Step-1: Select N random data points from the training set.
Step-2: Build a decision tree on the selected data points (the subset).
Step-3: Choose the number M of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until M trees have been built.
Step-5: For a new data point, collect the prediction of each decision tree and assign the point to the category that wins the majority of votes.
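The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any library: the helper names fit_forest and predict_forest and the toy two-cluster data are made up for this example, and scikit-learn's DecisionTreeClassifier stands in for "a decision tree". A full Random Forest additionally considers only a random subset of features at each split, which is left out here to stay close to the steps as written.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=10, sample_size=None, seed=0):
    """Train n_trees decision trees, each on a random sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X) if sample_size is None else sample_size
    trees = []
    for _ in range(n_trees):                    # Steps 3-4: repeat until M trees exist
        idx = rng.integers(0, len(X), size=n)   # Step 1: pick N random data points
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])                # Step 2: build a tree on that subset
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Step 5: every tree votes and the most common label wins."""
    votes = np.array([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])


# Toy data (an assumption for illustration): two well-separated clusters.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

trees = fit_forest(X, y, n_trees=15)
pred = predict_forest(trees, np.array([[0.0, 0.0], [4.0, 4.0]]))
print(pred)
```

Using an odd number of trees avoids tied votes in a two-class problem.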
EXAMPLE TO DEMONSTRATE RANDOM FOREST CLASSIFIER: Suppose there is a dataset that contains images of several kinds of fruit (as shown in the image below). This dataset is given to the Random Forest classifier, which divides it into subsets and gives one subset to each decision tree. During the training phase, each decision tree learns from its own subset. When a new data point arrives, every tree produces a prediction, and the Random Forest classifier takes the majority of those results as its final decision. The diagram below illustrates this.
APPLICATIONS OF RANDOM FOREST:
1. Banking: The banking sector mostly uses this algorithm to identify loan risk.
2. Medicine: With the help of this algorithm, disease risks can be identified.
3. Land Use: Areas of similar land use can be identified with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
ADVANTAGES OF RANDOM FOREST:
1. Random Forest can perform both classification and regression tasks, although it is mostly used for classification.
2. Random Forest handles large, high-dimensional datasets with ease.
3. Increasing the number of decision trees, each trained on its own subset of the dataset, generally improves the accuracy of the model and reduces the risk of overfitting.
PYTHON IMPLEMENTATION OF RANDOM FOREST ALGORITHM:
Implementation steps are as follows:
1. Data preprocessing.
2. Fitting the algorithm to the training set.
3. Predicting the test result.
4. Testing the accuracy of the result (creation of the confusion matrix).
5. Visualizing the test set results.
*INPUT DATASET :
The dataset used to demonstrate the Random Forest classification algorithm contains information about various users obtained from social networking sites. An automobile manufacturing company has launched a new SUV and wants to check how many users in the database would like to purchase the car. For this problem, we will build a machine learning model using the Random Forest algorithm. The dataset is shown in this image.
In this use case, we will predict the purchased variable (Dependent Variable) by using age and salary (Independent variables) from the dataset.
I. DATA PRE-PROCESSING :
*POINTS TO BE NOTED :
Why standardize before fitting an ML model? The logic is quite simple. Variables measured on different scales do not contribute equally to model fitting and to the learned function, and may end up creating a bias. To deal with this potential problem, feature-wise standardization (μ = 0, σ = 1) is usually applied prior to model fitting. StandardScaler() standardizes each column of X individually, so that each feature has μ = 0 and σ = 1. Tree-based models such as Random Forest split on thresholds and are largely insensitive to feature scale, so this step matters less here than for distance-based or gradient-based models, but it is harmless and keeps the workflow general.
fit_transform() is used on the training data so that we learn the scaling parameters of that data and scale it in one call. The fit step calculates the mean and variance of each feature in the training set, and the transform step then rescales every feature using its own mean and variance. It is the scaler, not the classifier, that learns these parameters, and they are later reused to scale the test data.
We want the same scaling applied to the test data, but without letting the test data influence anything that was learned. The test data should remain completely new to the model. The transform() method makes this possible: it scales the test data using the mean and variance computed from the training data. Why not call fit on the test data as well? Doing so would compute a new mean and variance, that is, a new scale for each feature, and information about the test set would leak into the preprocessing. The data we wanted to keep unseen would no longer be unseen, and we would not get a good estimate of how the model performs on new data, which is the ultimate goal of building a machine learning model. This is the standard procedure for scaling: it keeps the model from being biased towards a particular feature of the dataset and, at the same time, prevents it from learning the values or trends of the test data.
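A minimal sketch of this preprocessing step might look as follows. It is not the original code of this post: the synthetic age and salary columns, the made-up purchase rule, the 400 rows and the 25% test split are all assumptions chosen for illustration (the split gives 100 test rows, which matches the 92 + 8 predictions reported later). With the real data you would load the file into X and y instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the user data: column 0 is age, column 1 is salary.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 60, 400),
    rng.integers(15_000, 150_000, 400),
]).astype(float)
y = (X[:, 0] + X[:, 1] / 3_000 > 65).astype(int)  # made-up purchase rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std only

print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("train std: ", X_train_scaled.std(axis=0).round(6))
print("test mean: ", X_test_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only close to that, because they were scaled with the training statistics rather than their own.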
II. FITTING THE ALGORITHM ON THE TRAINING SET :
Now we will fit the Random Forest algorithm to the training set. To do so, we import the RandomForestClassifier class from the sklearn.ensemble module. The code is given below:
Here:
n_estimators = The number of trees in the Random Forest. In this example it is set to 10. The scikit-learn default was 10 in older releases and has been 100 since version 0.22. More trees usually give more stable predictions, at the cost of longer training and prediction time.
criterion = The function used to measure the quality of a split. Here we have taken "entropy", which selects splits by information gain.
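As a hedged sketch of the fitting step with the two parameters just described: the training data below is a synthetic stand-in (already-scaled random features with made-up labels), and random_state=0 is an assumption added only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the scaled training set (an assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# 10 trees, splits chosen by information gain ("entropy").
classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
)
classifier.fit(X_train, y_train)

print(len(classifier.estimators_))  # number of fitted trees: 10
```

After fitting, the individual trees are available in classifier.estimators_.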
III. PREDICTING THE RESULT :
Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create a new prediction vector y_pred. Below is the code for it:
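A minimal sketch of the prediction step, again on synthetic stand-in data rather than the post's dataset (the data, the split and random_state are assumptions). Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities and then picking the most probable class, which is a soft form of the majority vote described earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
).fit(X_train, y_train)

y_pred = classifier.predict(X_test)         # one class label per test row
y_proba = classifier.predict_proba(X_test)  # class probabilities averaged over trees

print(y_pred[:10])
print(y_proba[:3])
```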
IV. CONFUSION MATRIX :
Now we will create the confusion matrix to determine the number of correct and incorrect predictions. Below is the code for it:
As we can see in the above matrix, there are 4 + 4 = 8 incorrect predictions and 64 + 28 = 92 correct predictions, which is 92% accuracy on the 100 test samples.
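The arithmetic above can be reproduced with scikit-learn's confusion_matrix. The label vectors below are not the post's real predictions: they are constructed by hand so that they yield the four counts quoted in the text, and placing 64 in the "not purchased" row is an assumption. In scikit-learn's convention, rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-built labels that reproduce the quoted counts (assumed layout):
# 68 true class-0 rows (64 predicted 0, 4 predicted 1),
# 32 true class-1 rows (4 predicted 0, 28 predicted 1).
y_true = np.array([0] * 68 + [1] * 32)
y_pred = np.array([0] * 64 + [1] * 4 + [0] * 4 + [1] * 28)

cm = confusion_matrix(y_true, y_pred)
print(cm)

correct = int(np.trace(cm))   # diagonal: 64 + 28
total = int(cm.sum())
incorrect = total - correct   # off-diagonal: 4 + 4
accuracy = correct / total
print(correct, incorrect, accuracy)  # 92 8 0.92
```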
V. VISUALIZING TRAINING SET RESULTS :
The above image is the visualization result for the Random Forest classifier on the training set. It is very similar to the result of the Decision Tree classifier. Each data point corresponds to one user in user_data, and the purple and green regions are the prediction regions. The purple region covers users predicted not to purchase the SUV, and the green region covers users predicted to purchase it. In this Random Forest classifier we used 10 trees, each of which predicted Yes or No for the Purchased variable. The classifier took the majority of those predictions as its result.
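A plot of this kind can be produced by predicting the class for every point on a dense grid and colouring the grid by the result. The sketch below is one way to do it and is not the post's original plotting code: the data is a synthetic stand-in, the axis labels assume scaled age and salary features, and the "Agg" backend line is there only so the script runs without a display (remove it and call plt.show() for interactive use).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove for interactive use
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the scaled training set (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
).fit(X, y)

# Dense grid covering the data, with a margin of 1 on each side.
x1, x2 = np.meshgrid(
    np.arange(X[:, 0].min() - 1, X[:, 0].max() + 1, 0.02),
    np.arange(X[:, 1].min() - 1, X[:, 1].max() + 1, 0.02),
)
grid_points = np.column_stack([x1.ravel(), x2.ravel()])
grid_pred = clf.predict(grid_points).reshape(x1.shape)

cmap = ListedColormap(["purple", "green"])
fig, ax = plt.subplots()
ax.contourf(x1, x2, grid_pred, alpha=0.75, cmap=cmap)        # prediction regions
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors="k", s=15)  # actual users
ax.set_title("Random Forest (training set)")
ax.set_xlabel("Age (scaled)")
ax.set_ylabel("Salary (scaled)")
```

Passing the test set instead of the training set to the scatter call gives the corresponding test set plot.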
VI. VISUALIZING TEST SET RESULTS :
The above image is the visualization result for the test set. We can see that there are only a few incorrect predictions (8), with no visible sign of overfitting. We can get different, and possibly better, results by changing the number of trees in the Random Forest classifier (the value of the n_estimators parameter).
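To see how the number of trees affects the result, one can refit the classifier with several values of n_estimators and compare test accuracy. The sketch below does this on synthetic, deliberately noisy stand-in data (the data, the noise level and the list of tree counts are assumptions), so the exact scores will differ from those on the post's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
noise = rng.normal(scale=0.5, size=400)
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for n in (1, 10, 50, 200):
    clf = RandomForestClassifier(
        n_estimators=n, criterion="entropy", random_state=0
    )
    scores[n] = clf.fit(X_train, y_train).score(X_test, y_test)  # test accuracy
    print(f"n_estimators={n:>3}: accuracy={scores[n]:.2f}")
```

On a single split like this the scores fluctuate, so cross-validation would give a more reliable comparison.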
This blog gave a brief introduction to the Random Forest algorithm, focusing on how to use it as a classification algorithm for various machine learning tasks. It explained how the algorithm works and walked through a practical Python implementation. Hopefully this blog helps anyone who wishes to use Random Forest as a classification algorithm in their machine learning models.