Detection of diabetes mellitus using machine learning algorithms


Research Scholar, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu, India, 9790540045
Department of Computer Science and Engineering, Chennai Institute of Technology, Chennai, Tamil Nadu, India

Abstract

The growth of technology has brought sophistication to our day-to-day activities, and this sophistication has brought with it many health issues. One of the most important of these, and now a common one, is Diabetes Mellitus. Diabetes Mellitus has affected over 246 million people worldwide, a majority of them women. The WHO reports that this number is expected to rise to over 380 million by 2025. Prevention is better than cure: predicting the disease at an early stage may save lives and allows preventive measures to be taken to control it. Health care data, however, is large and complex. In this paper, we take a heterogeneous data set and analyze the various factors that affect this disease. The machine learning algorithms used in this paper help us identify the attributes that play a major role in the diagnosis of Diabetes Mellitus.

Keywords

Machine learning, Diabetes Mellitus, Diagnosis, Prediction

Introduction

A chronic increase in the blood glucose level is called Diabetes Mellitus (DM), commonly referred to as diabetes. It is caused by the inability of the body to produce the amount of insulin required for its own needs, which may be due to impaired secretion of insulin, impaired insulin action, or both. High blood sugar levels over a prolonged period lead to renal failure, loss of vision and damage to several other tissues. One reason given for the rising incidence of diabetes is that women with diabetes are now able to have children. The incidence of diabetes is higher in persons above 40 years of age. Females, especially married ones, are at a higher risk of getting this disease. Obesity, dietary factors and heredity are other contributory factors. Alcoholic beverages increase appetite, encourage weight gain and, when taken in excess, damage the pancreas and thereby increase the risk of diabetes. In short, DM leads to several metabolic disorders in the body. DM can be classified into several types, of which there are two main clinical types:

  • The juvenile onset type or insulin dependent diabetes (IDDM), also known as Type 1 DM

  • The adult or maturity onset type or non-insulin dependent diabetes (NIDDM), also known as Type 2 DM

The main causes of Type 2 DM are:

  • Life style

  • Physical activity

  • Heredity

  • Diet

90% of people with diabetes are affected by Type 2 DM, and only 10% are affected by Type 1 DM. Another type is gestational diabetes, which occurs only during pregnancy. It may not carry a high risk, but it may later develop into Type 2 diabetes.

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/d7667095-d63e-4ed4-80e0-7f787dfa04d5-upicture1.png
Figure 1: Data set

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/1d645083-27f8-4358-9b50-5ca0ffaf5262-upicture2.png
Figure 2: Distribution of attributes

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/6ef8ccfa-db96-4849-9faf-adfa7df09467-upicture3.png
Figure 3: Decision tree for the attributes

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/6be221f2-e3e3-49f7-a8d1-01c5b0d56336-upicture4.png
Figure 4: Decision tree features

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/dbb2ffdb-92f9-44ab-adf6-0b4f0fc4635b-upicture5.png
Figure 5: Ranking of features using decision tree

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/78d3b4cc-0bb7-46cf-97a3-c4188f8b4885-upicture6.png
Figure 6: ROC curve for KNN and Decision tree

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/b2523419-d159-4a34-86ca-db4e84e8a2ea-upicture7.png
Figure 7: Confusion Matrix decision tree

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b0aecaba-813c-4d35-bbc2-728ee9f65248/image/28c91f69-125f-44f1-b8aa-af5b02836b21-upicture8.png
Figure 8: Confusion Matrix KNN

Table 1: Literature Survey

| Author & year | Data mining techniques | Accuracy |
| --- | --- | --- |
| (Hayashi & Yukita, 2016) | Re-RX algorithm, J48graft | 90% |
| (Agarwal et al., 2016) | Fuzzy ontology-based case-based reasoning | 76.78% |
| (Tirunagari, Poh, Hu, & Windridge, 2015) | SOM | 83.5% |
| (B & Srinivasan, 2014) | C4.5, SVM, k-NN, PNN, BLR | 86% |
| (Mohammadi, Hosseini, & Tabatabaee, 2014) | Bayesian Network, Decision Tree | 90.4% |
| (Rao & Kumar, 2014) | Genetic algorithm | 80.5% |
| (Vijayanv & Ravikumar, 2014) | KNN, K-means, ANFIS algorithm | 80% |
| (Beloufa & Chikh, 2013) | Bee Colony | 80% |
| (Lakshmi, 2013) | C4.5, SVM, k-NN, PNN, BLR, MLR, PLS-DA, PLS-LDA, k-means & Apriori | 74.7% |
| (Koklu & Unal, 2013) | Multilayer Perceptron, J48 and Naive Bayes Classifier | 74% |
| (Zarkogianni, Litsa, Vazeou, & Nikita, 2013) | SOM | 81% |
| (Ganji & Abadeh, 2011) | Ant Colony, FCS-ANTMINER | 78.5% |
| (Çalişir & Doğantekin, 2011) | LDA, MWSVM | 89.74% |
| (Kaur & Chhabra, 2014) | J-48, WEKA tool | 99.87% |

Signs and Symptoms

  • Frequent urination

  • Increased thirst

  • Increased appetite

  • Loss of weight

  • General weakness and fatigue

  • Pain in the legs

  • Irritability

  • Lack of concentration

  • Proneness to infection

  • Delayed wound healing

Table 1 shows the data mining techniques used in earlier studies on diabetes and the accuracy reported for each.

Machine Learning Algorithms

The following are a few of the machine learning algorithms in common use:

  • Decision Trees

  • Naive Bayes Classification

  • Support vector machines for classification problems

  • Random forest for classification and regression problems

  • Linear regression

  • Logistic Regression

  • Ensemble Methods

  • K-means for clustering problems

  • Apriori algorithm for association rule learning problems

  • Principal Component Analysis

  • Singular Value Decomposition

  • Independent Component Analysis

Proposed System

Attribute details

Pregnancies: Number of times pregnant.

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL). A 2-hour value between 140 and 200 mg/dL is called impaired glucose tolerance, or "prediabetes", and indicates an increased risk of developing diabetes over time. A glucose level of 200 mg/dL or higher is used to diagnose diabetes.

Blood Pressure: Diastolic blood pressure (mmHg). A diastolic value above 90 mmHg indicates high blood pressure (high probability of diabetes). A diastolic value below 60 mmHg indicates low blood pressure (lower probability of diabetes).
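The glucose and blood pressure thresholds described above can be expressed as simple categorization rules. The following is a minimal sketch, not the authors' code: the function names and the label "normal" for values outside the stated ranges are assumptions made for illustration.

```python
def glucose_category(two_hour_mg_dl):
    """Categorize a 2-hour oral glucose tolerance test value (mg/dL)."""
    if two_hour_mg_dl >= 200:
        return "diabetes"      # 200 mg/dL or higher is used to diagnose diabetes
    if two_hour_mg_dl >= 140:
        return "prediabetes"   # 140 to 200 mg/dL: impaired glucose tolerance
    return "normal"            # assumed label for values below 140 mg/dL


def diastolic_category(mm_hg):
    """Categorize a diastolic blood pressure reading (mmHg)."""
    if mm_hg > 90:
        return "high"          # above 90 mmHg: high blood pressure
    if mm_hg < 60:
        return "low"           # below 60 mmHg: low blood pressure
    return "normal"            # assumed label for 60 to 90 mmHg


print(glucose_category(155), diastolic_category(95))
```

Turning continuous measurements into a small number of categories like this is one possible way of categorizing the data before a classifier is trained.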

Data Preparation

Data cleaning and transformation of data: Several attributes in the data set (Figure 1) contain zero or null values, although these attributes cannot actually be zero. Such values have therefore been replaced by the mean of the corresponding column.
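The replacement of zero or null entries by the column mean can be sketched as follows. This is an illustration only: the helper name, the column names, the sample records, and the choice to compute the mean over the valid (non-zero, non-null) entries are all assumptions, since the paper does not give these details.

```python
from statistics import mean


def impute_with_column_mean(rows, columns):
    """Return a copy of rows in which zero/None entries of the given
    columns are replaced by the mean of that column's valid entries."""
    cleaned = [dict(row) for row in rows]  # leave the input untouched
    for col in columns:
        valid = [row[col] for row in rows if row[col] not in (0, None)]
        if not valid:
            continue  # nothing to base a mean on
        col_mean = mean(valid)
        for row in cleaned:
            if row[col] in (0, None):
                row[col] = col_mean
    return cleaned


# Hypothetical records; a zero glucose or a missing BMI is not plausible.
sample = [
    {"Glucose": 148, "BMI": 33.6},
    {"Glucose": 0, "BMI": 26.6},
    {"Glucose": 183, "BMI": None},
    {"Glucose": 89, "BMI": 28.1},
]
cleaned = impute_with_column_mean(sample, ["Glucose", "BMI"])
print(cleaned[1]["Glucose"], round(cleaned[2]["BMI"], 2))
```

Attributes such as the number of pregnancies can legitimately be zero, so only columns where zero is impossible should be passed to such a helper.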

Removing outliers

The graph in Figure 2 shows the distribution of each attribute in the data set. A careful study of the graph shows that insulin and skin thickness have outliers.

The outliers are removed using the interquartile range (IQR) method.
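A minimal sketch of IQR-based outlier removal is given below. The multiplier k = 1.5 is the conventional choice and is an assumption here, as are the function names, the quartile interpolation method, and the sample insulin values.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie within the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical insulin readings with one extreme value.
insulin = [80, 85, 90, 95, 100, 105, 110, 600]
print(iqr_bounds(insulin))
print(remove_outliers(insulin))
```

For tabular data, the same fences would be computed per attribute and any record falling outside them for insulin or skin thickness would be dropped.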

The algorithms used are the decision tree algorithm and the KNN algorithm.

The decision tree model (Figure 3) was applied to the training data set. The depth of the tree is 4 and the total number of nodes is 21.

With the help of the decision tree, we are able to select the features that play an important role in the detection of diabetes. Figure 4 and Figure 5 show the important features and their ranking.
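To illustrate how a decision tree can indicate which features matter, the sketch below scores each feature by the largest reduction in Gini impurity that a single threshold split on it achieves, and ranks the features by that score. It is a simplified, single-split illustration under stated assumptions: the paper does not name its splitting criterion, and the function names and sample values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_gain(values, labels):
    """Return (threshold, gain) for the best split 'value <= threshold'."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


def rank_features(columns, labels):
    """Order feature names by their best single-split impurity reduction."""
    gains = {name: best_split_gain(vals, labels)[1] for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical data: 1 = diabetic, 0 = non-diabetic.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "BloodPressure": [70, 72, 66, 74, 68, 72],
    "Glucose": [85, 90, 100, 150, 160, 170],
}
print(rank_features(columns, labels))
```

A full decision tree repeats this split search recursively on each child node until a stopping rule such as a maximum depth of 4 is reached, and accumulates the impurity reductions per feature over all nodes.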

Results and Discussion

Comparison between Decision tree and KNN

A Receiver Operating Characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curves, which plot the true positive rate against the false positive rate, are shown in Figure 6. The area under the curve is 0.72 for the decision tree, whereas it is 0.67 for KNN.

The dashed diagonal line represents the ROC curve of a random predictor. It serves as a baseline for checking whether a model is useful. The confusion matrix is tabulated for both methods (Figure 7; Figure 8).
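The two evaluation measures used here can be sketched in a few lines. The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which the sketch computes directly. The function names, the 0.5 decision threshold, and the sample labels and scores are assumptions for illustration and are not the paper's data.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.
    Ties between a positive and a negative count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(y_true, y_pred))
print(roc_auc(y_true, scores))
```

An area of 0.5 corresponds to the diagonal of a random predictor, so a useful model should score above it.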

Conclusions

The decision tree model achieved 76% accuracy. After considering various options to improve the accuracy, we were able to reach the desired accuracy by removing outliers, categorizing the data and limiting the tree depth to 4. During this process, only a few of the eight attributes turned out to play an important role: Glucose, BMI, Pregnancies, Age and Insulin were important, whereas skin thickness, Diabetes Pedigree Function and blood pressure had negligible effect. Hence we conclude that, of the two models compared, the decision tree performs better than KNN.

Conflict of Interest

The authors declare that they have no conflict of interest for this study.

Funding Support

The authors declare that they have no funding support for this study.