By Natasha Pradhan

A Case Study of Supervised Learning: Logistic Regression, Naive Bayes, KNN, SVM

This case study analyses a dataset in detail to determine which variables the output depends on and which it is independent of.

First, we import the required packages and load the data file on which supervised learning will be implemented.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the churn dataset from a local path
data=pd.read_csv(r"E:/D _ FOLDER TRANSFER/MACHINE LEARNING AI/Churn_Modelling.csv")

# Count missing values in every column
data.isnull().sum()

# Inspect the unique values of every column
for value in data.columns:
    print(data[value].unique())
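
As a quick first look (a minimal sketch, not part of the original post), the shape and first few rows can also be printed before the detailed analysis:

# Sketch: quick structural overview of the dataset
print(data.shape)    # number of rows and columns
print(data.head())   # first few records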

The data is checked for unique values and for null values such as "NaN", "?", etc.; any such values are replaced by the mean, the median, or the most common value of that column. The data is then plotted graphically to analyse which variables affect the output and which do not; the latter are dropped after the analysis.
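
If isnull().sum() had reported missing values, the replacement step described above could look like this sketch (the chosen columns are only illustrative, and in this dataset they have no gaps to fill):

# Sketch: treat '?' markers as missing, then fill (illustrative only)
data=data.replace('?',np.nan)
data['Balance']=data['Balance'].fillna(data['Balance'].median())          # numeric: median
data['Geography']=data['Geography'].fillna(data['Geography'].mode()[0])   # categorical: most common value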


Data Analysis


plt.figure(figsize=(12,10))
sns.distplot(data.CreditScore[data.Exited==0])
sns.distplot(data.CreditScore[data.Exited==1])
plt.legend(['no_leaving','leaving'])

sns.countplot(data.Geography,hue=data.Exited)

sns.countplot(data.Gender,hue=data.Exited)

plt.figure(figsize=(12,10))
sns.distplot(data.Age[data.Exited==0])
sns.distplot(data.Age[data.Exited==1])
plt.legend(['no_leaving','leaving'])

plt.figure(figsize=(12,10))
sns.distplot(data.Tenure[data.Exited==0])
sns.distplot(data.Tenure[data.Exited==1])
plt.legend(['no_leaving','leaving'])

plt.figure(figsize=(12,10))
sns.distplot(data.Balance[data.Exited==0])
sns.distplot(data.Balance[data.Exited==1])
plt.legend(['no_leaving','leaving'])

plt.figure(figsize=(12,10))
sns.distplot(data.NumOfProducts[data.Exited==0])
sns.distplot(data.NumOfProducts[data.Exited==1])
plt.legend(['no_leaving','leaving'])

plt.figure(figsize=(12,10))
sns.distplot(data.EstimatedSalary[data.Exited==0])
sns.distplot(data.EstimatedSalary[data.Exited==1])
plt.legend(['no_leaving','leaving'])

sns.countplot(data.HasCrCard,hue=data.Exited)

sns.countplot(data.IsActiveMember,hue=data.Exited)

plt.figure(figsize=(12,10))
sns.distplot(data.RowNumber[data.Exited==0])
sns.distplot(data.RowNumber[data.Exited==1])
plt.legend(['no_leaving','leaving'])

plt.figure(figsize=(12,10))
sns.distplot(data.CustomerId[data.Exited==0])
sns.distplot(data.CustomerId[data.Exited==1])
plt.legend(['no_leaving','leaving'])
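
The repeated plotting blocks above could also be written as a single loop (a sketch using the same distplot call as the post; in recent seaborn releases distplot is deprecated in favour of histplot or kdeplot):

# Sketch: one distribution plot per numeric column, split by the Exited class
num_cols=['CreditScore','Age','Tenure','Balance','NumOfProducts',
          'EstimatedSalary','RowNumber','CustomerId']
for col in num_cols:
    plt.figure(figsize=(12,10))
    sns.distplot(data[col][data.Exited==0])
    sns.distplot(data[col][data.Exited==1])
    plt.legend(['no_leaving','leaving'])
    plt.title(col)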

Here as we can see "Age" is affecting the output hence its to be included in the training dataset.

Here as we can see "Tenure" is not affecting the output hence it's to be excluded from the training dataset.

As we can see "CreditScore" is not affecting the output that much it's to be excluded from the training dataset.

Here as we can see "Geography" is affecting the output hence its to be included in the training dataset.

Here as we can see "Gender" is affecting the output hence its to be included in the training dataset.

Here as we can see "Balance" is affecting the output hence its to be included in the training dataset.

Here as we can see "EstimatedSalary" is affecting the output hence its to be included in the training dataset.

Here as we can see "HasCrCard" is affecting the output hence its to be included in the training dataset.

Here as we can see "NumOfProducts" is affecting the output hence its to be included in the training dataset.

As we can see "RowNumber" is not affecting the output that much it's to be excluded from the training dataset.

Here as we can see "IsActiveMember" is affecting the output hence its to be included in the training dataset.

As we can see "CustomerId" is not affecting the output that much it's to be excluded from the training dataset.


data.dtypes

# Encode the categorical string columns ("Geography", "Gender") as integers
from sklearn.preprocessing import LabelEncoder
le1=LabelEncoder()
data.Geography=le1.fit_transform(data.Geography)

le2=LabelEncoder()
data.Gender=le2.fit_transform(data.Gender)

# Correlation heatmap of the (now numeric) columns
# (newer pandas versions may need data.corr(numeric_only=True) since "Surname" is still a string column)
plt.figure(figsize=(12,10))
cor=data.corr()
sns.heatmap(cor,annot=True,cmap='coolwarm')

For ease of training, all categorical columns should be encoded with either LabelEncoder() or OneHotEncoder().
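
The mapping that each encoder has learned can be inspected through its classes_ attribute (a small sketch, not in the original post):

# Sketch: which integer each category was mapped to
print(dict(zip(le1.classes_,le1.transform(le1.classes_))))   # Geography
print(dict(zip(le2.classes_,le2.transform(le2.classes_))))   # Gender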

From this heatmap it becomes fairly clear which variables the output depends on.
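
The same information can be read off numerically from the correlation matrix computed above (a minimal sketch):

# Sketch: correlation of every column with the target, strongest first
print(cor['Exited'].sort_values(ascending=False))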


Next we drop the columns that the output is independent of, and define the input (ip) and output (op).

data1=data.drop(['RowNumber','CustomerId','Surname','CreditScore',
'Tenure'],axis=1)
ip=data1.drop(['Exited'],axis=1)
op=data1['Exited']
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct=ColumnTransformer([('geography',OneHotEncoder(),[0])],
                     remainder='passthrough')
ip=np.array(ct.fit_transform(ip),dtype=float)   # keep the features numeric (dtype=np.str would turn them into strings)
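
One optional refinement, shown here only as a sketch with a hypothetical ct_alt (not part of the original post): OneHotEncoder can drop one of the three Geography dummy columns so the remaining features are not perfectly correlated.

# Sketch: drop the first dummy column to avoid the dummy-variable trap
ct_alt=ColumnTransformer([('geography',OneHotEncoder(drop='first'),[0])],
                         remainder='passthrough')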

Next we split the dataset into training and testing parts by defining a test size, and standardise the features with StandardScaler before fitting the models. We will get different accuracy and recall scores for the different algorithms.

from sklearn.model_selection import train_test_split
xtr,xts,ytr,yts=train_test_split(ip,op,test_size=0.1)

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
sc.fit(xtr)                  # fit the scaler on the training data only
xtr=sc.transform(xtr)
xts=sc.transform(xts)        # scale the test data with the training statistics
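
Since only a minority of the customers churn, the split can also be made stratified and reproducible (a sketch; stratify and the random_state value are my choices, not from the original post):

# Sketch: stratified, reproducible split (random_state is arbitrary)
xtr,xts,ytr,yts=train_test_split(ip,op,test_size=0.1,
                                 stratify=op,random_state=0)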

Applying Logistic Regression

from sklearn.linear_model import LogisticRegression
alg=LogisticRegression()
alg.fit(xtr,ytr)

yp=alg.predict(xts)
# To score new, unseen data x, it would be scaled with the already-fitted scaler first:
# x=sc.transform(x)
# alg.predict(x)
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a=accuracy_score(yts,yp)
b=recall_score(yts,yp)
print(a,b)

Output: accuracy a = 0.815, recall b = 0.22164948453608246
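
The accuracy looks reasonable, but the low recall means most customers who actually left are being missed. A confusion matrix makes this explicit (a sketch, not part of the original post):

# Sketch: error breakdown by class (rows: actual 0/1, columns: predicted 0/1)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(yts,yp))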

Applying Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()
clf.fit(xtr,ytr)

yp1=clf.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a1=accuracy_score(yts,yp1)
b1=recall_score(yts,yp1)
print(a1,b1)
Output: accuracy a1 = 0.815, recall b1 = 0.3556701030927835
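
GaussianNB also learns the prior probability of each class, which reflects how imbalanced the churn label is (a small sketch, not in the original post):

# Sketch: learned class priors of the Gaussian Naive Bayes model
print(clf.class_prior_)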

Applying KNN

With the KNN algorithm we can tune the model by varying 'n_neighbors'.

from sklearn.neighbors import KNeighborsClassifier
alg=KNeighborsClassifier(n_neighbors=3)
alg.fit(xtr,ytr)

yp2=alg.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a2=accuracy_score(yts,yp2)
b2=recall_score(yts,yp2)
print(a2,b2)
Output: accuracy a2 = 0.829, recall b2 = 0.4845360824742268
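
As mentioned above, n_neighbors is the main tuning knob for KNN; a small sketch (the candidate values are arbitrary, not from the original post) to compare a few settings on the test split:

# Sketch: test accuracy and recall for a few neighbourhood sizes
for k in [3,5,7,9,11]:
    knn=KNeighborsClassifier(n_neighbors=k)
    knn.fit(xtr,ytr)
    yk=knn.predict(xts)
    print(k,accuracy_score(yts,yk),recall_score(yts,yk))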

Applying SVM

With SVM we can tune the model by varying the kernel and the 'C' and 'gamma' parameters.

from sklearn import svm
alg1=svm.SVC(kernel='rbf',C=500,gamma=0.01)
alg1.fit(xtr,ytr)
yp3=alg1.predict(xts)

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
a3=accuracy_score(yts,yp3)
b3=recall_score(yts,yp3)
print(a3,b3)
Output: accuracy a3 = 0.861, recall b3 = 0.4381443298969072
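
A more systematic way to choose C and gamma is a small cross-validated grid search (a sketch; the grid values, scoring='recall' and cv=5 are my choices, not from the original post):

# Sketch: grid search over C and gamma for the RBF kernel
from sklearn.model_selection import GridSearchCV
grid=GridSearchCV(svm.SVC(kernel='rbf'),
                  {'C':[1,10,100,500],'gamma':[0.001,0.01,0.1]},
                  scoring='recall',cv=5)
grid.fit(xtr,ytr)
print(grid.best_params_,grid.best_score_)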

In this post I have tried to showcase a few supervised learning algorithms on the same dataset as a case study.
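
Putting the results from the runs above side by side (test accuracy / recall):

Logistic Regression: 0.815 / 0.222
Naive Bayes: 0.815 / 0.356
KNN (n_neighbors=3): 0.829 / 0.485
SVM (rbf, C=500, gamma=0.01): 0.861 / 0.438

SVM gives the highest accuracy here while KNN recovers the largest share of churners; since no random_state was fixed for the split, these numbers will vary from run to run.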
