Natasha Pradhan

Mar 14, 2021 · 3 min read

A Case to Study Supervised Learning: Logistic Regression, Naive Bayes, KNN, SVM

A case study of a dataset, analyzing in detail which of its variables the output depends on and which it is independent of.

We start by importing the required packages and loading the data file on which supervised learning will be applied.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the churn dataset (adjust the path to wherever the CSV is stored)
data = pd.read_csv(r"E:/D _ FOLDER TRANSFER/MACHINE LEARNING AI/Churn_Modelling.csv")

# Count missing values per column
data.isnull().sum()

# Inspect the unique values of every column
for value in data.columns:
    print(data[value].unique())

The data is checked for unique values and for null or placeholder values such as "NaN" or "?"; these are then replaced by the mean, the median, or the most common value of that column. The data is then plotted to analyze which variables affect the output; those that do not are dropped after the analysis.
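If the isnull() check above had turned up missing or placeholder values, a minimal cleaning sketch (assuming "?" is the placeholder string to look for) could look like this:

# Treat placeholder strings such as "?" as missing values
data = data.replace("?", np.nan)

# Fill numeric columns with the median and text columns with the most common value
for col in data.columns:
    if data[col].dtype == "object":
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].median())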

Data Analysis


 
# Distribution of each numeric variable, split by whether the customer exited
# (sns.distplot is deprecated in newer seaborn; sns.histplot/kdeplot are the replacements)
plt.figure(figsize=(12, 10))
sns.distplot(data.CreditScore[data.Exited == 0])
sns.distplot(data.CreditScore[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

# Count plots of the categorical variables, split by the target
sns.countplot(data.Geography, hue=data.Exited)
sns.countplot(data.Gender, hue=data.Exited)

plt.figure(figsize=(12, 10))
sns.distplot(data.Age[data.Exited == 0])
sns.distplot(data.Age[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

plt.figure(figsize=(12, 10))
sns.distplot(data.Tenure[data.Exited == 0])
sns.distplot(data.Tenure[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

plt.figure(figsize=(12, 10))
sns.distplot(data.Balance[data.Exited == 0])
sns.distplot(data.Balance[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

plt.figure(figsize=(12, 10))
sns.distplot(data.NumOfProducts[data.Exited == 0])
sns.distplot(data.NumOfProducts[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

plt.figure(figsize=(12, 10))
sns.distplot(data.EstimatedSalary[data.Exited == 0])
sns.distplot(data.EstimatedSalary[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

sns.countplot(data.HasCrCard, hue=data.Exited)
sns.countplot(data.IsActiveMember, hue=data.Exited)

# Identifier columns, plotted only to confirm they carry no signal
plt.figure(figsize=(12, 10))
sns.distplot(data.RowNumber[data.Exited == 0])
sns.distplot(data.RowNumber[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

plt.figure(figsize=(12, 10))
sns.distplot(data.CustomerId[data.Exited == 0])
sns.distplot(data.CustomerId[data.Exited == 1])
plt.legend(['no_leaving', 'leaving'])

From these plots we can see that "Age", "Geography", "Gender", "Balance", "NumOfProducts", "EstimatedSalary", "HasCrCard" and "IsActiveMember" all affect the output, so they are to be included in the training dataset.

"CreditScore", "Tenure", "RowNumber" and "CustomerId" do not affect the output much, so they are to be excluded from the training dataset.

data.dtypes

# Encode the string columns Geography and Gender as integers
from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
data.Geography = le1.fit_transform(data.Geography)

le2 = LabelEncoder()
data.Gender = le2.fit_transform(data.Gender)

# Correlation heatmap of the numeric columns
# (newer pandas may need data.corr(numeric_only=True) to skip the Surname column)
plt.figure(figsize=(12, 10))
cor = data.corr()
sns.heatmap(cor, annot=True, cmap='coolwarm')

For ease of training, all non-numeric columns should be encoded, with either LabelEncoder() or OneHotEncoder().

From this heatmap it is almost clear which variables the output depends on.
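The same ranking can be read numerically by sorting each column's correlation with the target (a small convenience snippet, not part of the original notebook):

# Correlation of every numeric column with the target, strongest first
print(cor['Exited'].drop('Exited').sort_values(ascending=False))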

We drop the columns that the output is independent of, and then define the input (ip) and output (op).

data1 = data.drop(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Tenure'], axis=1)
ip = data1.drop(['Exited'], axis=1)
op = data1['Exited']

# One-hot encode the Geography column (column 0 of ip)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('geography', OneHotEncoder(), [0])], remainder='passthrough')
# Keep the result numeric; dtype=np.str in the original would turn everything into strings
ip = np.array(ct.fit_transform(ip), dtype=float)
 

We split the data into training and testing sets by defining a test size, and standardize the features with StandardScaler before fitting each model. Each algorithm will then give a different accuracy and recall.

from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(ip, op, test_size=0.1)

# Fit the scaler on the training data only, then transform both splits
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(xtr)
xtr = sc.transform(xtr)
xts = sc.transform(xts)
 

Applying Logistic Regression

from sklearn.linear_model import LogisticRegression
alg = LogisticRegression()
alg.fit(xtr, ytr)

# Predict on the held-out test set
# (any new, unseen data would have to be scaled with the same fitted scaler
#  before calling alg.predict on it)
yp = alg.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a = accuracy_score(yts, yp)
b = recall_score(yts, yp)
print(a, b)
 

a=0.815
 
b=0.22164948453608246
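A recall of about 0.22 means the model catches only a small fraction of the customers who actually leave. A confusion matrix (an extra diagnostic, not in the original post) makes that explicit:

from sklearn.metrics import confusion_matrix

# Rows are the true classes (0 = stayed, 1 = exited), columns are the predicted classes
print(confusion_matrix(yts, yp))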

Applying Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(xtr, ytr)

yp1 = clf.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a1 = accuracy_score(yts, yp1)
b1 = recall_score(yts, yp1)
print(a1, b1)

a1=0.815
 
b1=0.3556701030927835

Applying KNN

With the KNN algorithm we can tune the model by varying 'n_neighbors' (a simple sweep is sketched after the results below).

from sklearn.neighbors import KNeighborsClassifier
alg = KNeighborsClassifier(n_neighbors=3)
alg.fit(xtr, ytr)

yp2 = alg.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a2 = accuracy_score(yts, yp2)
b2 = recall_score(yts, yp2)
print(a2, b2)

a2=0.829
 
b2=0.4845360824742268
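Since 'n_neighbors' is the main tuning knob, a rough sweep like the following (reusing the xtr/xts/ytr/yts split from above; a more careful search would use cross-validation) can help pick a value:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score

# Compare accuracy and recall on the test split for a few neighbourhood sizes
for k in (3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(xtr, ytr)
    pred = knn.predict(xts)
    print(k, accuracy_score(yts, pred), recall_score(yts, pred))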

Applying SVM

With SVM we can tune the model by varying 'gamma' (and 'C'); a cross-validated search is sketched after the results below.

from sklearn import svm
alg1 = svm.SVC(kernel='rbf', C=500, gamma=0.01)
alg1.fit(xtr, ytr)
yp3 = alg1.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a3 = accuracy_score(yts, yp3)
b3 = recall_score(yts, yp3)
print(a3, b3)
 

a3=0.861
 
b3=0.4381443298969072
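'C' and 'gamma' interact, so instead of fixing them by hand they can be chosen with a small cross-validated grid search (an optional extension with an illustrative grid, not part of the original code):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search over a small grid of C and gamma values,
# scored on recall since that is the weak spot of these models
params = {'C': [1, 10, 100, 500], 'gamma': [0.001, 0.01, 0.1]}
search = GridSearchCV(svm.SVC(kernel='rbf'), params, scoring='recall', cv=5)
search.fit(xtr, ytr)
print(search.best_params_, search.best_score_)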

In this case study I have tried to showcase a few supervised learning algorithms on the same dataset.
