This case study walks through a detailed variable analysis of a dataset, checking whether the output is dependent on or independent of each of the given variables.
First, import the required packages and load the data file on which supervised learning will be applied.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv(r"E:/D _ FOLDER TRANSFER/MACHINE LEARNING AI/Churn_Modelling.csv")
data.isnull().sum()            # count missing values per column
for value in data.columns:     # inspect the unique values of every column
    print(data[value].unique())
The data is checked for unique values and for null or placeholder values such as "NaN" or "?"; any missing entries are replaced by the mean, median, or most common value of that column. The data is then plotted to analyze which variables affect the output and which do not; the latter get dropped after the analysis.
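As a minimal sketch of that imputation step (this particular file has no missing values, so these lines are no-ops here, but they show the usual pattern):
data = data.replace('?', np.nan)                                  # turn placeholder tokens into NaN
data['Age'] = data['Age'].fillna(data['Age'].median())            # numeric column: fill with the median
data['Geography'] = data['Geography'].fillna(data['Geography'].mode()[0])  # categorical column: fill with the mode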
Data Analysis
# Distribution of each numeric feature, split by churn status.
# (sns.distplot is deprecated in recent seaborn; histplot with kde=True
# is the current equivalent.)
for col in ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts',
            'EstimatedSalary', 'RowNumber', 'CustomerId']:
    plt.figure(figsize=(12, 10))
    sns.histplot(data[col][data.Exited == 0], kde=True, stat='density')
    sns.histplot(data[col][data.Exited == 1], kde=True, stat='density')
    plt.legend(['no_leaving', 'leaving'])
    plt.show()

# Counts of each categorical feature, split by churn status.
for col in ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']:
    sns.countplot(x=col, hue='Exited', data=data)
    plt.show()
Here as we can see "Age" is affecting the output hence its to be included in the training dataset.
Here as we can see "Tenure" is not affecting the output hence it's to be excluded from the training dataset.
As we can see "CreditScore" is not affecting the output that much it's to be excluded from the training dataset.
Here as we can see "Geography" is affecting the output hence its to be included in the training dataset.
Here as we can see "Gender" is affecting the output hence its to be included in the training dataset.
Here as we can see "Balance" is affecting the output hence its to be included in the training dataset.
Here as we can see "EstimatedSalary" is affecting the output hence its to be included in the training dataset.
Here as we can see "HasCrCard" is affecting the output hence its to be included in the training dataset.
Here as we can see "NumOfProducts" is affecting the output hence its to be included in the training dataset.
As we can see "RowNumber" is not affecting the output that much it's to be excluded from the training dataset.
Here as we can see "IsActiveMember" is affecting the output hence its to be included in the training dataset.
As we can see "CustomerId" is not affecting the output that much it's to be excluded from the training dataset.
data.dtypes   # check which columns still need encoding
from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
data['Geography'] = le1.fit_transform(data['Geography'])
le2 = LabelEncoder()
data['Gender'] = le2.fit_transform(data['Gender'])
plt.figure(figsize=(12, 10))
cor = data.corr(numeric_only=True)   # Surname is still a string column, so restrict to numeric ones
sns.heatmap(cor, annot=True, cmap='coolwarm')
For ease of training, all non-numeric columns should be encoded with either LabelEncoder() or OneHotEncoder(). From this heatmap it is fairly clear which variables the output depends on.
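To make that reading explicit, one can rank the correlation of every feature against the target, using the cor matrix computed above:
print(cor['Exited'].drop('Exited').sort_values(ascending=False))   # features most correlated with churn first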
Next, drop the columns that the output is independent of, and define the input (ip) and output (op).
data1=data.drop(['RowNumber','CustomerId','Surname','CreditScore',
'Tenure'],axis=1)
ip=data1.drop(['Exited'],axis=1)
op=data1['Exited']
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct=ColumnTransformer([('geography',OneHotEncoder(),[0])],
remainder='passthrough')
ip = np.array(ct.fit_transform(ip), dtype=np.float64)   # np.str is removed in modern NumPy; features must be numeric
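To verify the resulting column order, scikit-learn (version 1.0 or later, an assumption about the installed version) can report the transformed feature names:
print(ct.get_feature_names_out())   # inspect the encoded column layout produced by the ColumnTransformer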
Split the data into training and testing sets by defining a test size, then fit a StandardScaler on the training set and transform both sets with it. We will get different model accuracy and recall for different algorithms.
from sklearn.model_selection import train_test_split
xtr,xts,ytr,yts=train_test_split(ip,op,test_size=0.1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(xtr)               # fit on the training set only, to avoid test-set leakage
xtr = sc.transform(xtr)
xts = sc.transform(xts)
Applying Logistic Regression
from sklearn.linear_model import LogisticRegression
alg=LogisticRegression()
alg.fit(xtr,ytr)
yp=alg.predict(xts)
# To score a new, unseen sample x (encoded in the same column order as ip):
# x = sc.transform(x)    # reuse the already-fitted scaler; do not re-fit it
# alg.predict(x)
from sklearn.metrics import accuracy_score, recall_score
a = accuracy_score(yts, yp)
b = recall_score(yts, yp)
print(a, b)
# 0.815 0.22164948453608246
Applying Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()
clf.fit(xtr,ytr)
yp1=clf.predict(xts)
a1 = accuracy_score(yts, yp1)
b1 = recall_score(yts, yp1)
print(a1, b1)
# 0.815 0.3556701030927835
Applying KNN
With the KNN algorithm we can tune the model by tuning n_neighbors; a small search over that hyperparameter is sketched after the results below.
from sklearn.neighbors import KNeighborsClassifier
alg=KNeighborsClassifier(n_neighbors=3)
alg.fit(xtr,ytr)
yp2=alg.predict(xts)
a2 = accuracy_score(yts, yp2)
b2 = recall_score(yts, yp2)
print(a2, b2)
# 0.829 0.4845360824742268
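A minimal sketch of that tuning, assuming we simply score a range of n_neighbors values on the held-out test set (cross-validation would be more robust):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
for k in range(1, 11):                     # try k = 1..10 and report test accuracy for each
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(xtr, ytr)
    print(k, accuracy_score(yts, knn.predict(xts)))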
Applying SVM
With SVM we can tune the model by tuning gamma (and the penalty C); a grid-search sketch follows the results below.
from sklearn import svm
alg1=svm.SVC(kernel='rbf',C=500,gamma=0.01)
alg1.fit(xtr,ytr)
yp3=alg1.predict(xts)
a3 = accuracy_score(yts, yp3)
b3 = recall_score(yts, yp3)
print(a3, b3)
# 0.861 0.4381443298969072
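A minimal grid-search sketch for those hyperparameters, using scikit-learn's GridSearchCV over an illustrative (assumed) parameter grid:
from sklearn import svm
from sklearn.model_selection import GridSearchCV
# Search a small grid of C and gamma values with 5-fold cross-validation
grid = GridSearchCV(svm.SVC(kernel='rbf'),
                    {'C': [1, 100, 500], 'gamma': [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(xtr, ytr)
print(grid.best_params_, grid.best_score_)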
I have tried to showcase a few supervised learning algorithms on the same dataset as a case study. Of the four, SVM gave the best accuracy (0.861), while KNN gave the best recall (0.485).