
# SYSTEM ANALYSIS

System analysis is the process of gathering facts about a system and identifying its problems. Its purpose is to study the system and understand its objectives, which helps improve the system so that it accomplishes its purpose.

6.4.1 EXISTING SYSTEM

In the existing systems, as discussed in the literature survey, diabetes risk scores were developed from samples collected in a particular region. Various diabetes risk score tools were built on age, family history, gender, waist circumference (WC), physical activity, blood pressure (BP), and smoking. Univariate regression analysis is used to identify the variables. To derive the risk score, the β coefficient values are obtained by logistic regression, and all β coefficients derived from the logistic regression are added to give the cumulative regression coefficient. The authors also determine the optimal cutoff value, sensitivity, and specificity; the optimum value is calculated from ROC analysis, and the system is validated using AUC.
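The cutoff selection described above can be sketched in a few lines. The source does not say which criterion the cited authors used to pick the optimum point on the ROC curve, so this sketch assumes Youden's index (sensitivity + specificity - 1); the function name `best_cutoff` and the sample scores are hypothetical.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, youden_j, sensitivity, specificity) for the cutoff
    that maximizes sensitivity + specificity - 1 (assumed criterion).

    A subject is predicted diabetic when score >= cutoff.
    Both classes (0 and 1) must be present in labels.
    """
    best = None
    for cutoff in sorted(set(scores)):
        predicted = [1 if s >= cutoff else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
        tn = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 0)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        # Strict comparison keeps the lowest cutoff when several tie.
        if best is None or j > best[1]:
            best = (cutoff, j, sens, spec)
    return best


# Hypothetical risk scores and outcomes (1 = diabetic), for illustration only.
example_scores = [10, 20, 30, 40, 50, 60, 70, 80]
example_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j, sens, spec = best_cutoff(example_scores, example_labels)
print(cutoff, sens, spec)
```

Other criteria, such as the point closest to the top-left corner of the ROC plot, would fit the same loop by changing the quantity being maximized.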

6.4.2 PROPOSED METHOD

In the proposed method, we used the IDRS as a template to reverse the calculation and create an imputed dataset for Asian and European countries. Because the aim is to provide an individual, age-specific, personalized T2D risk score, we calculated the β coefficient for each year of age instead of for an age group. Once this step is completed, the data are imputed according to the average value between the ranges, so that the value of β increases with the individual's specific age. A similar calculation was done for the China, Sri Lanka, Oman, Cambridge, France, UK, and Danish scores, and a similar approach was taken to personalize BMI and WC. Several risk score tools have been developed, but predicting the correct risk score without losing simplicity is a challenging task. The proposed system is compared with the existing risk score systems for accuracy and performance. It can also be applied to different ethnic groups. Therefore, the diabetes risk score system is designed using AI techniques, without any laboratory tests.

# METHODOLOGY

Machine learning is the scientific field dealing with the ways in which machines learn from experience. Python tools and modules are used in this work: matplotlib, numpy, and pyplot for plotting the output results, together with classification algorithms such as logistic regression, decision tree (DT), random forest (RF), linear models, and others. Accuracy, the confusion matrix, sensitivity, and specificity are calculated for each machine learning algorithm. Specificity, or true negative rate, is the percentage of healthy patients who are correctly identified as healthy; (1 - specificity) is the percentage of healthy patients who are mistakenly identified as diseased. Sensitivity, or true positive rate, is the percentage of patients with the disease who are correctly identified as having it. In machine learning classification models, one basic measure of model accuracy is AUC, the area under the ROC curve. ROC stands for receiver operating characteristic; the ROC curve plots sensitivity against 1 - specificity.
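As a minimal sketch of these measures, the following standard-library code computes the confusion-matrix counts, sensitivity, specificity, and AUC for a small set of hypothetical predictions. The AUC is computed as the probability that a randomly chosen diseased subject receives a higher score than a randomly chosen healthy one, which equals the area under the ROC curve; the labels, scores, and the 0.5 decision threshold are illustrative assumptions, not values from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = diseased."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate: diseased subjects correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: healthy subjects correctly identified."""
    return tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp), auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a demonstration; a library routine would be preferable on a dataset of several hundred thousand rows.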

The motivation behind this work is to detect T2D in individuals who are interested in knowing their risk score. Therefore, the diabetes risk score system is designed without any laboratory tests.

Its design steps are as follows.

• Study different diabetes risk score systems to understand the parameters that are used in risk estimation.
• Build a dataset using the parameters that are used for the prediction.
• Apply a suitable machine learning algorithm and find the score on this designed dataset (with some parameters).
• Ensure that the built dataset represents all the existing scoring systems and can represent people from all around the world.
• Add more features to the designed system, calculate the score again, and compare it with the original system. (How can the original system be fine-tuned if we add another feature?)
• Validate the data using the existing diabetes risk score systems.

The proposed model is shown in Figure 6.1.

FIGURE 6.1 Proposed model diagram.

6.5.1 DATASET SELECTION

To develop a uniform T2D risk scoring system for Asians and Europeans, we used scoring values from the IDRS (India), the Chinese score system, SLDRISK (Sri Lanka), and the Oman, Cambridge, France, UK, and Danish scores. The details of the scores of these systems are given in Table 6.2.

Table 6.2 presents the β coefficient values of the Asian systems for five parameters: age, WC, physical activity, family history, and BMI. Among the different parameters, WC plays a significant role in designing a uniform T2D risk scoring system for Asians. Insulin resistance increases as an individual becomes overweight. The risk of diabetes also increases if an individual has a family history, that is, if a parent or sibling of the subject has or had diabetes. As age increases, the risk of diabetes also increases owing to the lack of physical activity, yet T2D is also being observed in young people. Considering all these factors, we identified the strong parameters that affect T2D and developed the diabetes risk score system based on them. The important parameters are age, gender, WC, physical activity, family history, and BMI.

Variable: India, China, Sri Lanka, Oman, Cambridge, France, UK, Danish, Australia, Brazil, US

Age
<35: 0 0 0 0 0 0 0 0 0
35-49: 0.84 0.845 0.95 1.8 0 0.42 0 0.6926 0.455 0.743 0.95
>=50: 1.47 1.357 1.61 2.3 0.44 0.65 0.53 1.311 0.919 1.227 1.57
60-69: 0.861 0.94 0.94 1.8475 1.3 2.09
>=70: 1.16 1.26 1.645

Waist circumference
Female < 80, Male < 90: 0 0 0 0 0 0 0 0 0
Female 80-89, Male 90-99: 0.44 0.952 1 0.38 0.5 1.021 0.43 0.884 0.27
Female >= 90, Male >= 100: 0.81 1.493 0.64 1.424 0.56 1.411 1.12
>109: 2.271 0.956 0.86 1.99

Physical activity
Vigorous: 0 0 0 0 0 0 -0.34
Mild: 1.13 0.17
No: 1.45 0.352 0.32 0.268 0.6488 0.428 0

Family history
Two nondiabetic: 0 0 0 0 0 0 0 0 0 0
Either parent: 0.54 0.656 0.52 1.9 0.475 1.021 0.47 0.6835 0.624 0.67
Both parent: 0.83

BMI
<25: 0 0 0 0 0 0 0 0 0
25-29: 0.679 0.52 0.54 0.137 0.015 0.26 0.7401 0.569 0.473
30-34: 0.948 0.72 0.69 0.247 0.938 0.45 1.4672 1.224 1.802
>=35: 1.418 0.458 0.75 1.698 1.784

The risk score can be easily calculated using the β coefficient values, and it can be used mainly in developing countries. The β coefficient calculations are explained in Section 6.2.

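The risk probability can be calculated from the β coefficient values as follows:

$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$

where $x_1, x_2, \ldots$ are independent risk factors, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \ldots$ are regression coefficients.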
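A minimal sketch of this calculation, assuming the standard logistic regression form in which the intercept plus the weighted sum of risk factors is passed through the logistic function. The function name `risk_probability` and the intercept of -3.0 are hypothetical; the two coefficients are the IDRS values for the 35-49 age group and for one diabetic parent.

```python
import math


def risk_probability(intercept, coefficients, factors):
    """Probability from a logistic model: 1 / (1 + exp(-z)), where
    z = intercept + sum(coefficient * factor)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, factors))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept; coefficients are the IDRS values for
# age 35-49 (0.84) and either parent diabetic (0.54).
hypothetical_intercept = -3.0
coefficients = [0.84, 0.54]

# Each factor is 1 when the subject falls in that category, else 0.
p_both = risk_probability(hypothetical_intercept, coefficients, [1, 1])
p_none = risk_probability(hypothetical_intercept, coefficients, [0, 0])
print(p_none, p_both)
```

Because the logistic function is monotonic, adding any positive coefficient always raises the estimated probability, which is why the coefficients can also be summed directly into a simple additive score.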

6.5.2 DATA PREPARATION AND

The IDRS is used as a template to reverse the calculation and create an imputed dataset. Because the aim is to provide an individual, age-specific, personalized T2D risk score, the β coefficient is calculated for each year of age instead of for an age group. In the IDRS, the β coefficients for the age groups <35, 35-49, and >=50 are 0, 0.84, and 1.47, respectively. We created a continuous dataset for individual ages from 21 to 80 using these β coefficient values. To do so, the lowest value for the 21-34 age range is taken as -0.4, and the highest value for this range is calculated from the value of the next category, which gives 0.2. A similar technique is applied to the age categories 35-49 and 50-80, so the values obtained for the ages 35, 49, 50, and 80 are 0.699, 1.1, 1.2, and 1.64, respectively. Once this step is completed, the data are imputed according to the average value between the ranges, so that the β value increases with the individual's specific age. The same calculation is done for China, Sri Lanka, and Oman, and a similar approach was taken to personalize BMI and WC.
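The per-year imputation can be illustrated with a short sketch. The endpoint values for ages 21, 34, 35, 49, 50, and 80 are the ones stated above; the source does not spell out exactly how intermediate ages are filled in, so straight-line interpolation within each age band is an assumption here, and the names `AGE_BANDS` and `coefficient_for_age` are hypothetical.

```python
# (first age, last age, coefficient at first age, coefficient at last age)
AGE_BANDS = [
    (21, 34, -0.4, 0.2),
    (35, 49, 0.699, 1.1),
    (50, 80, 1.2, 1.64),
]


def coefficient_for_age(age):
    """Linearly interpolate the age coefficient inside its band
    (assumed scheme; only the band endpoints come from the text)."""
    for first_age, last_age, first_value, last_value in AGE_BANDS:
        if first_age <= age <= last_age:
            fraction = (age - first_age) / (last_age - first_age)
            return first_value + fraction * (last_value - first_value)
    raise ValueError("age must be between 21 and 80")


# One coefficient per year of age, 21 through 80 inclusive.
age_coefficients = {age: coefficient_for_age(age) for age in range(21, 81)}
print(round(age_coefficients[28], 4), round(age_coefficients[65], 4))
```

With these endpoints the coefficient rises strictly with every additional year, matching the statement that the value increases with the individual's specific age.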

6.5.2.1 FOR PHYSICAL ACTIVITY

Physical activity is one of the important parameters for predicting T2D. Three categories of physical activity are considered according to the IDRS: vigorous exercise with a β coefficient of 0, mild exercise with a β coefficient of 1.13, and no exercise with a β coefficient of 1.45. Based on the questions of the International Physical Activity Questionnaire, physical activity was categorized as low, moderate, and high. Vigorous physical activities are those that take hard physical effort and make you breathe much harder than normal, such as heavy lifting, digging, aerobics, and fast bicycling. Moderate activities are those that take moderate physical effort and make you breathe somewhat harder than normal.

6.5.2.2 FOR FAMILY HISTORY

Family history is another important parameter for predicting T2D. We considered three categories of family history according to the IDRS: two nondiabetic parents with a β coefficient of 0, either parent diabetic with a β coefficient of 0.54, and both parents diabetic with a β coefficient of 0.83. All these categories are included in the dataset creation.
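The two categorical parameters can be encoded as simple lookup tables. The sketch below uses the IDRS coefficient values quoted in this section for physical activity and family history and adds them, in line with the additive risk score described earlier; the dictionary keys and the function name `categorical_score` are hypothetical labels chosen for illustration.

```python
# IDRS coefficient values quoted in the text; key names are illustrative.
PHYSICAL_ACTIVITY = {"vigorous": 0.0, "mild": 1.13, "none": 1.45}
FAMILY_HISTORY = {"no_parent": 0.0, "either_parent": 0.54, "both_parents": 0.83}


def categorical_score(activity, family_history):
    """Sum of the physical activity and family history coefficients."""
    return PHYSICAL_ACTIVITY[activity] + FAMILY_HISTORY[family_history]


lowest = categorical_score("vigorous", "no_parent")
highest = categorical_score("none", "both_parents")
print(lowest, round(highest, 2))
```

A misspelled category raises `KeyError`, which is usually preferable to silently scoring an unknown category as zero.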

6.5.3 COMPUTATION FOR DATA IMPUTATION

Once the β coefficients are calculated as explained in the data computation part, the next essential stage is imputing the data. Here, the Python library scikit-learn is used, together with itertools, a Python module dedicated to permutations and combinations and one of the most useful corners of the Python 3 standard library. The function itertools.product() computes the Cartesian product of its input iterables. The module implements a number of iterator building blocks in a form suitable for Python and is an efficient tool for generating a variety of combinations. Initially, we took four parameters, namely, age, waist, physical activity, and family history; later, BMI was also added to the list. Once all the values were added to their respective lists, product(*) from the itertools module was applied. It acts like a nested for loop and yields all the combinations of the four parameters, giving a total of 514,384 samples. A similar approach was taken to create the datasets for India (IDRS) with BMI, China, SLDRISK, and Oman. The steps involved in creating the dataset are shown in Figure 6.2.
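The combination step can be sketched as follows. The value lists are deliberately shortened, hypothetical stand-ins for the real per-parameter lists, so the number of rows here is far smaller than in the actual dataset.

```python
from itertools import product

# Hypothetical, shortened value lists; the real lists hold one entry per
# imputed value of each parameter.
ages = [21, 22, 23]
waists = ["normal", "raised", "high"]
activities = ["vigorous", "mild", "none"]
family_histories = ["no_parent", "either_parent", "both_parents"]

parameters = [ages, waists, activities, family_histories]

# product(*parameters) yields every combination of one value per list.
samples = list(product(*parameters))

# Equivalent nested-loop form, shown only to make the behaviour explicit.
nested = [
    (a, w, p, f)
    for a in ages
    for w in waists
    for p in activities
    for f in family_histories
]

print(len(samples))        # 3 * 3 * 3 * 3 combinations
print(samples == nested)
```

Adding BMI later only means appending one more list to `parameters`; the call to `product` stays unchanged, which is the main convenience over writing the nested loops by hand.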

6.5.4 DESCRIPTION OF THE DATASET

The dataset is created for India, China, Sri Lanka and Oman based on certain important attributes and does not contain any missing values.

All variables were categorized as follows: age (21-34 years versus 35-49 and >=50 years), WC (men: <90 cm, 90-99 cm, >=100 cm; women: <80 cm, 80-89 cm, >=90 cm), BMI (weight in kg divided by the square of height in m; <25 versus 25-29, 30-34, and >=35), family history of diabetes (two nondiabetic parents versus either parent having diabetes and both parents having diabetes), and physical activity (vigorous exercise versus mild and no exercise). The outcome is binary: the training data are labeled such that 0 indicates no diabetes and 1 indicates diabetes. Table 6.3 lists the attributes of the created dataset.

FIGURE 6.2 Steps involved in creating the dataset.

TABLE 6.3 Attributes

| Attribute number | Attribute |
|---|---|
| 1 | Age |
| 2 | Waist |
| 3 | Physical activity |
| 4 | Family history |
| 5 | BMI |
| 6 | Outcome |

6.5.5 DATA MODELING AND ALGORITHMS USED FOR PREDICTION

Different algorithms, namely, multiple logistic regression, Gaussian Bayes (GB), RF, and DT [21], are applied to the imputed Indian, Chinese, SLDRISK, and Oman diabetic datasets. The data were divided into training (70%) and test (30%) sets, each comprising 50% T2D cases. Two combinations of parameters, (i) age, gender, physical activity, family history, and WC and (ii) age, gender, physical activity, family history, WC, and BMI, were used to assess the efficacy (specificity and sensitivity) of each algorithm using ROC and AUC with a 95% confidence interval. The outcomes of the algorithms were then compared with each other, and the best model was selected. In another approach, we used a consensus algorithm [22] to obtain the average of the scores from all the algorithms. Similarly, we developed the consensus-based Asian score, as described in Figure 6.3. The basic problem in determining a ranking consensus is to combine several rankings, produced by two or more decision makers, into a single consensus ranking. For the different Asian countries, the machine learning algorithms were applied first and the average value of each method was identified; the final prediction was then made using the consensus-based average rank algorithm, as shown in Figure 6.3.
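The averaging idea behind the consensus step can be sketched as follows. This is not the consensus algorithm of [22], whose details are not given here; it is a simplified illustration that averages the per-sample scores of several models and then applies an assumed 0.5 threshold. The model names and score values are hypothetical.

```python
def consensus_scores(model_scores):
    """Average the per-sample scores across all models.

    model_scores maps a model name to a list of scores, one per sample;
    all lists must have the same length.
    """
    n_models = len(model_scores)
    return [sum(sample) / n_models for sample in zip(*model_scores.values())]


def consensus_predictions(model_scores, threshold=0.5):
    """Label a sample 1 (diabetic) when its average score reaches the
    threshold (assumed value), else 0."""
    return [1 if s >= threshold else 0 for s in consensus_scores(model_scores)]


# Hypothetical predicted probabilities for three subjects.
example = {
    "logistic_regression": [0.9, 0.2, 0.6],
    "gaussian_bayes": [0.8, 0.3, 0.4],
    "random_forest": [0.7, 0.1, 0.5],
    "decision_tree": [1.0, 0.0, 0.0],
}

print([round(s, 3) for s in consensus_scores(example)])
print(consensus_predictions(example))
```

In the third subject the models disagree, and the average falls below the threshold; this is the kind of case where a consensus can differ from the single best model.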
