Detection of diabetes mellitus using machine learning algorithms
Abstract
Technological progress has made daily life more sophisticated, and this sophistication has brought many health problems with it. One of the most important, and now among the most common, is Diabetes Mellitus. Diabetes Mellitus affects over 246 million people worldwide, the majority of them women. The WHO reports that this number is expected to rise to over 380 million by 2025. Prevention is better than cure, but health care data is large and complex. Predicting the disease at an early stage may save lives and allows preventive measures to be taken to control it. In this paper, we analyze a heterogeneous data set to identify the factors that affect this disease. The machine learning algorithms used in this paper help determine which attributes play a major role in the diagnosis of Diabetes Mellitus.
Keywords
Machine learning, Diabetes Mellitus, Diagnosis, Prediction
Introduction
Diabetes mellitus (DM), commonly referred to as diabetes, is a chronic increase in the blood glucose level. It is caused by the body's inability to produce the amount of insulin it needs, which may be due to impaired insulin secretion, impaired insulin action, or both. High blood sugar levels over a prolonged period lead to renal failure, loss of vision and damage to several other tissues. The incidence of diabetes is increasing because female diabetics are able to have children. The incidence is higher in persons above 40 years of age. Females, especially married ones, are at higher risk of developing the disease. Obesity, dietary factors and heredity are other contributory factors. Alcoholic beverages increase appetite, encourage weight gain and, when taken in excess, damage the pancreas, thereby increasing the risk of diabetes. In short, DM leads to several metabolic disorders in the body. DM can be classified into several types, of which there are two main clinical types:
- The juvenile onset type, or insulin-dependent diabetes (IDDM)
- The adult or maturity onset type, or non-insulin-dependent diabetes
The main causes of Type 2 DM are:
- Lifestyle
- Physical activity
- Heredity
- Diet
About 90% of people with diabetes have Type 2 DM; only about 10% have Type 1 DM. Another type is gestational diabetes, which occurs only during pregnancy. It may not carry a higher risk in itself, but it may develop into Type 2 diabetes.
| Author & year | Data mining techniques | Accuracy |
|---|---|---|
| | Re-RX algorithm, J48graft | 90% |
| | Fuzzy ontology-based case-based reasoning | 76.78% |
| | SOM | 83.5% |
| | C4.5, SVM, k-NN, PNN, BLR | 86% |
| | Bayesian Network, Decision Tree | 90.4% |
| | Genetic algorithm | 80.5% |
| | KNN, K-means, ANFIS algorithm | 80% |
| | Bee Colony | 80% |
| | C4.5, SVM, k-NN, PNN, BLR, MLR, PLS-DA, PLS-LDA, k-means & Apriori | 74.7% |
| | Multilayer Perceptron, J48 and Naive Bayes Classifier | 74% |
| | SOM | 81% |
| | Ant Colony, FCS-ANTMINER | 78.5% |
| | LDA, MWSVM | 89.74% |
| | J48, WEKA tool | 99.87% |
Signs and Symptoms
- Frequent urination
- Increased thirst
- Increased appetite
- Loss of weight
- General weakness and fatigue
- Pain in the legs
- Irritability
- Lack of concentration
- Proneness to infection
- Delayed wound healing
Table 1 shows the data mining techniques used in earlier studies and the accuracy reported for each.
Machine Learning Algorithms
The following are a few of the machine learning algorithms:
- Decision trees
- Naive Bayes classification
- Support vector machines for classification problems
- Random forest for classification and regression problems
- Linear regression
- Logistic regression
- Ensemble methods
- K-means for clustering problems
- Apriori algorithm for association rule learning problems
- Principal Component Analysis
- Singular Value Decomposition
- Independent Component Analysis
Proposed System
| Attributes | Details |
|---|---|
| Pregnancies | Number of times pregnant |
| Glucose (plasma glucose concentration) | Plasma glucose at 2 hours in an oral glucose tolerance test (mg/dL). A 2-hour value between 140 and 200 mg/dL is called impaired glucose tolerance, or "prediabetes", and indicates an increased risk of developing diabetes over time. A glucose level of 200 mg/dL or higher is used to diagnose diabetes. |
| Blood Pressure | Diastolic blood pressure (mmHg). A diastolic B.P. above 90 indicates high B.P. (high probability of diabetes); a diastolic B.P. below 60 indicates low B.P. (lower probability of diabetes). |
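The thresholds in the attribute details above can be written as simple rules for turning raw measurements into categories. The following is a minimal sketch only: the function names are hypothetical, and the "normal" labels for values that fall outside the stated ranges are an assumption, since the text names only the impaired, diabetic, high and low bands.

```python
def glucose_category(two_hour_glucose_mg_dl: float) -> str:
    """Classify a 2-hour oral glucose tolerance test value (mg/dL)."""
    if two_hour_glucose_mg_dl >= 200:
        return "diabetes"
    if two_hour_glucose_mg_dl >= 140:
        return "impaired glucose tolerance"
    return "normal"  # assumption: label for values below 140 mg/dL


def diastolic_bp_category(diastolic_mm_hg: float) -> str:
    """Classify diastolic blood pressure (mmHg) using the 60 and 90 cut-offs."""
    if diastolic_mm_hg > 90:
        return "high"
    if diastolic_mm_hg < 60:
        return "low"
    return "normal"  # assumption: label for the 60 to 90 mmHg band


print(glucose_category(155), diastolic_bp_category(72))
```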
Data Preparation
Data cleaning and transformation of data: Several attributes contain zero or null values (Figure 1), although a valid measurement of these attributes cannot be zero. These values have therefore been replaced by the mean of the corresponding column.
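The replacement step can be sketched as follows. This is an illustration, not the authors' code: the helper name and the sample values are hypothetical, and the sketch assumes the column mean is taken over the valid (non-zero, non-null) entries only, which the text does not specify.

```python
from statistics import mean


def impute_invalid_with_mean(values):
    """Replace zero or None entries with the mean of the valid entries."""
    valid = [v for v in values if v is not None and v != 0]
    if not valid:
        return list(values)  # no valid entries to estimate a mean from
    fill = mean(valid)  # assumption: mean excludes the invalid entries
    return [fill if (v is None or v == 0) else v for v in values]


# Hypothetical glucose column with one zero and one missing value.
glucose = [148, 0, 183, None, 137]
cleaned = impute_invalid_with_mean(glucose)
print(cleaned)
```

Taking the mean over all entries, zeros included, would pull the fill value towards zero, which is why the sketch excludes them.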
Removing outliers
The graph in Figure 2 shows the distribution of each attribute in the data set. A careful study of the graph shows that insulin and skin thickness have outliers.
The outliers are removed using the interquartile range (IQR) method.
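A minimal sketch of IQR-based outlier removal is given here for illustration. The multiplier of 1.5 is the conventional choice and an assumption, as the paper does not state the value used; the helper names and the sample rows are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical insulin readings with one extreme value.
rows = [{"Insulin": v} for v in [80, 90, 100, 110, 120, 846]]
kept = remove_outliers(rows, "Insulin")
print(len(rows), "->", len(kept))
```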
The algorithms used are the decision tree and k-nearest neighbours (KNN) algorithms.
The decision tree model (Figure 3) was fitted to the training data set. The tree has a depth of 4 and a total of 21 nodes.
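To illustrate how a single node of such a tree can be chosen, the sketch below searches one attribute for the threshold that minimises the weighted Gini impurity of the two child nodes. The splitting criterion is an assumption (the paper does not name one), and the function names and toy data are hypothetical. A full tree repeats this search over all attributes and recurses until the depth limit is reached.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, gini(labels))  # no split is the baseline to beat
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold midway between neighbours
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical glucose values with outcome 1 = diabetic, 0 = non-diabetic.
glucose = [85, 89, 110, 137, 148, 183, 197]
outcome = [0, 0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(glucose, outcome)
print(threshold, impurity)
```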
With the help of the decision tree, we are able to select the features that play an important role in the detection of diabetes. Figures 4 and 5 show the important features and their ranking.
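The paper gives no configuration for the KNN classifier that the decision tree is compared against. As a point of reference, the sketch below shows k-nearest-neighbour classification under assumed settings: Euclidean distance, k = 3, and majority voting. The function name and the records are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (glucose, BMI) records with outcome 1 = diabetic, 0 = non-diabetic.
points = [(85, 26.6), (89, 28.1), (110, 30.0), (148, 33.6), (183, 23.3), (197, 30.5)]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, query=(160, 31.0))
print(prediction)
```

Because the distance is dominated by attributes with large numeric ranges, the features would normally be scaled before KNN is applied; the sketch omits that step for brevity.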
Results and Discussion
Comparison between Decision tree and KNN
A Receiver Operating Characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The ROC curve, which plots the true positive rate against the false positive rate, gives a good result for the decision tree (Figure 6). Its area under the curve is 0.72, whereas the area under the curve for KNN is 0.67.
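The area under an ROC curve can be computed by sweeping the decision threshold over the predicted scores and integrating with the trapezoidal rule. The sketch below illustrates the calculation on a small made-up example; the function names, labels and scores are hypothetical and are not the paper's data. It assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(y_true, scores))
print(area)
```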
The dashed diagonal line represents the ROC curve of a random predictor and serves as a baseline for checking whether the model is useful. The confusion matrices for both methods are shown in Figures 7 and 8.
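For reference, the entries of a binary confusion matrix and the accuracy derived from them can be computed as in the sketch below. The function names, labels and predictions are hypothetical and do not reproduce the matrices in the figures.

```python
def confusion_matrix(y_true, y_pred):
    """Return the counts (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
print(tn, fp, fn, tp, accuracy(tn, fp, fn, tp))
```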
Conclusions
The decision tree model achieved 76% accuracy. After considering various options for improving the accuracy, we reached the desired accuracy by removing outliers, categorizing the data and limiting the tree depth to 4. During this process, only a few of the eight attributes proved important: Glucose, BMI, Pregnancies, Age and Insulin. Skin thickness, Diabetes Pedigree Function and blood pressure had a negligible effect. We therefore conclude that the decision tree model is the better of the two when compared with KNN.
Conflict of Interest
The authors declare that they have no conflict of interest for this study.
Funding Support
The authors declare that they have no funding support for this study.