Chronic kidney disease (CKD) is a condition in which kidney function is gradually lost over a long period, ranging from months to years. In this project we trained several machine learning models and compared their predictions to find the best classifier for chronic kidney disease.
In our data we found that column 21 had a lot of missing values, so we removed it.
Data imputation
In the chronic_kidney_disease_full1.csv file, missing values are marked with a '?', so to deal with them we first converted every '?' to NA.
Our data has two types of features: numeric and categorical. We replaced each missing cell with the mean of its feature if it is numeric, or with the mode of its feature if it is categorical.
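The report's own R code is not shown here; a minimal sketch of this imputation step in Python with pandas (the column names `age` and `rbc` are just examples) could look like:

```python
import pandas as pd
import numpy as np

# '?' marks missing values in the CSV, so convert it to NaN first
df = pd.DataFrame({
    "age": ["48", "?", "62"],           # numeric feature with a missing cell
    "rbc": ["normal", "normal", "?"],   # categorical feature with a missing cell
}).replace("?", np.nan)

df["age"] = pd.to_numeric(df["age"])

# numeric column: fill missing cells with the column mean
df["age"] = df["age"].fillna(df["age"].mean())
# categorical column: fill missing cells with the column mode
df["rbc"] = df["rbc"].fillna(df["rbc"].mode()[0])
```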
Feature normalization
Because the numeric columns span very different ranges, which can inflate the error of distance-based models, we applied feature scaling to the data.
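As an illustration of the idea (not the report's R code), min-max scaling rescales every numeric feature to the same [0, 1] range:

```python
import numpy as np

def min_max_scale(x):
    """Rescale a numeric feature to [0, 1] so that wide-range columns
    do not dominate distance calculations."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scaled = min_max_scale([2.0, 4.0, 10.0])
```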
Methodology
To get a more reliable accuracy estimate, we used cross-validation by splitting the data into folds.
Used Algorithms
Different algorithms were used to get the most accurate one
1. KNN
2. Decision Tree
3. Logistic Regression
4. Naive Bayes
1. KNN
KNN is one of the simplest algorithms and can achieve high accuracy for binary classification. It classifies a new data point based on its similarity to existing points.
To apply the KNN model, we first choose the number of neighbors K. After computing the distance between the new data point and all existing points, the algorithm takes the K closest points and classifies the new point by majority vote. For example, if three of the closest five points belong to category A and two to category B, the new point is classified as category A.
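The majority-vote idea above can be sketched in a few lines of Python (the report's actual implementation is in R; the toy data below is made up to mirror the 3-vs-2 example):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by the majority label among its k nearest
    training points, using Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    labels = [y_train[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = ["A", "A", "A", "B", "B"]
pred = knn_predict(X, y, np.array([0.15]), k=5)  # 3 A's vs 2 B's
```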
code
This function builds the model and calculates the accuracy, sensitivity and specificity for each fold, then stores the output in a vector called statistics.
The lapply function returns its output as a list, so to work with it we convert cv_KNN into a data frame.
Then, to display the results, we use the paste function.
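The R code itself is not reproduced here, but the per-fold statistics it computes can be sketched in Python (the label names `ckd`/`notckd` are assumptions for illustration):

```python
def fold_statistics(y_true, y_pred, positive="ckd"):
    """Accuracy, sensitivity and specificity for one fold's predictions."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),      # fraction predicted correctly
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
    }
```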
The results
2. Decision Tree
It is an algorithm used to classify data. It applies the divide-and-conquer principle: the problem is split into parts that are solved separately, and the solutions are then combined. The decision tree is built by repeatedly choosing the best attribute to split on; the training set can be partitioned so that the depth of the tree stays small while the data is still categorized correctly. For more illustration you can check this link
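The "best attribute" choice mentioned above is typically made by an impurity measure; a minimal Python sketch using Gini impurity (one common criterion, not necessarily the one the report's R code uses) could be:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: how mixed a set of class labels is (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split gives the lowest weighted impurity,
    i.e. the 'best attribute' used to grow the tree."""
    n = len(labels)
    def split_impurity(attr):
        impurity = 0.0
        for value in set(row[attr] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[attr] == value]
            impurity += len(subset) / n * gini(subset)
        return impurity
    return min(attributes, key=split_impurity)
```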
code
The results
3. Logistic regression
To understand logistic regression you might want to check linear regression first. As you have seen in linear regression, we used a hypothesis function to predict the output, but this time we need the prediction to be a binary outcome of either 0 or 1. So we use the same hypothesis with a small modification: we pass it through the sigmoid function.
As seen in the figure above, the sigmoid function squashes the output into a range between zero and one. To predict whether the output is one or zero, we set a boundary value, 0.5 by default: when the output is >= 0.5 it is rounded up to 1, and when it is < 0.5 it is rounded down to 0. For further details please check: Logistic Regression
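The sigmoid-plus-boundary step described above can be sketched as (a Python illustration, not the report's R code; the weights `theta` are hypothetical):

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(theta, x, boundary=0.5):
    """Linear hypothesis theta . x passed through the sigmoid, then
    thresholded at the boundary value (0.5 by default) to get 0 or 1."""
    p = sigmoid(np.dot(theta, x))
    return 1 if p >= boundary else 0
```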
Code
The results
4. Naive Bayes
This method is based on Bayes' theorem from statistics. It makes two main assumptions: first, that every feature is independent of the others; second, that every feature contributes equally (has the same weight). It calculates the probability of each feature individually given the output, together with the probability of the output itself, and from these it predicts the output given the feature values. As mentioned in the logistic regression method, we use the same boundary value. For further details on the algorithm, check Naive Algorithm
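A bare-bones sketch of the counting behind Naive Bayes for categorical features (Python for illustration; the feature name `bp` and labels `ckd`/`notckd` are assumptions, and no smoothing is applied):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(feature=value | class) from raw counts,
    treating every feature as independent (the 'naive' assumption)."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, feature) -> counts of each value
    for row, label in zip(rows, labels):
        for feat, value in row.items():
            cond[(label, feat)][value] += 1
    return priors, cond

def nb_predict(priors, cond, row):
    """Score each class by prior * product of conditional probabilities,
    then return the highest-scoring class."""
    n = sum(priors.values())
    best, best_score = None, -1.0
    for label, count in priors.items():
        score = count / n
        for feat, value in row.items():
            score *= cond[(label, feat)][value] / count
        if score > best_score:
            best, best_score = label, score
    return best

rows = [{"bp": "high"}, {"bp": "high"}, {"bp": "normal"}]
labels = ["ckd", "ckd", "notckd"]
priors, cond = train_naive_bayes(rows, labels)
```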
Code
The results
Conclusion
From the results shown, although each model achieved good accuracy, you still need to check which model best suits your own data.