Machine Learning-Based Cardiovascular Disease Prediction Model: A Cohort Study on the Korean National Health Insurance Service Health Screening Database

Joung Ouk (Ryan) Kim; Yong-Suk Jeong; Jin Ho Kim; Jong-Weon Lee; Dougho Park; Hyoung-Seop Kim

doi:10.3390/diagnostics11060943

Diagnostics ◽

10.3390/diagnostics11060943 ◽

2021 ◽

Vol 11 (6) ◽

pp. 943

Author(s):

Joung Ouk (Ryan) Kim ◽

Yong-Suk Jeong ◽

Jin Ho Kim ◽

Jong-Weon Lee ◽

Dougho Park ◽

...

Keyword(s):

Machine Learning ◽

Health Insurance ◽

Prediction Model ◽

National Health Insurance ◽

National Health ◽

Prediction Models ◽

Characteristic Curve ◽

Health Screening ◽

Gradient Boosting ◽

Extreme Gradient Boosting

Background: This study proposes a cardiovascular disease (CVD) prediction model using machine learning (ML) algorithms based on the National Health Insurance Service-Health Screening datasets. Methods: We extracted 4699 patients aged over 45 as the CVD group, diagnosed according to the International Classification of Diseases system (I20–I25). In addition, 4699 random subjects without a CVD diagnosis were enrolled as a non-CVD group. Both groups were matched by age and gender. Various ML algorithms were applied to perform CVD prediction; the performances of all the prediction models were then compared. Results: The extreme gradient boosting, gradient boosting, and random forest algorithms exhibited the best average prediction accuracy (area under the receiver operating characteristic curve (AUROC): 0.812, 0.812, and 0.811, respectively) among all algorithms validated in this study. Based on AUROC, the ML algorithms improved CVD prediction performance compared to previously proposed prediction models. Preexisting CVD history was the most important factor contributing to the accuracy of the prediction model, followed by total cholesterol, low-density lipoprotein cholesterol, waist-height ratio, and body mass index. Conclusions: Our results indicate that the proposed health screening dataset-based CVD prediction model using ML algorithms is readily applicable, produces validated results, and outperforms previous CVD prediction models.
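
The AUROC figures quoted above can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case. The following is a minimal sketch of that pairwise definition, not the authors' code; the function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = case, 0 = non-case.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auroc(labels, scores)  # 8 of the 9 case/non-case pairs are ordered correctly
```

The pairwise loop is quadratic in the number of subjects, so it suits small illustrations; for cohorts of this size a rank-based implementation would be the more practical choice.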

Prediction of the risk of developing hepatocellular carcinoma in health screening examinees: a Korean cohort study

BMC Cancer ◽

10.1186/s12885-021-08498-w ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Chansik An ◽

Jong Won Choi ◽

Hyung Soon Lee ◽

Hyunsun Lim ◽

Seok Jong Ryu ◽

...

Keyword(s):

Machine Learning ◽

Hepatocellular Carcinoma ◽

Health Insurance ◽

Liver Disease ◽

Prediction Model ◽

National Health Insurance ◽

National Health ◽

Claim Data ◽

Health Screening ◽

Training Cohort

Background: Almost all Koreans are covered by mandatory national health insurance and are required to undergo health screening at least once every 2 years. We aimed to develop a machine learning model to predict the risk of developing hepatocellular carcinoma (HCC) based on the screening results and insurance claim data. Methods: The National Health Insurance Service-National Health Screening database was used for this study (NHIS-2020-2-146). Our study cohort consisted of 417,346 health screening examinees between 2004 and 2007 without cancer history, which was split into training and test cohorts by the examination date, before or after 2005. Robust predictors were selected using Cox proportional hazard regression with 1000 different bootstrapped datasets. Random forest and extreme gradient boosting algorithms were used to develop a prediction model for the 9-year risk of HCC development after screening. After optimizing a prediction model via cross validation in the training cohort, the model was validated in the test cohort. Results: Of the total examinees, 0.5% (1799/331,694) and 0.4% (390/85,652) in the training cohort and the test cohort were diagnosed with HCC, respectively. Of the selected predictors, older age, male sex, obesity, abnormal liver function tests, a family history of chronic liver disease, underlying chronic liver disease, chronic hepatitis virus or human immunodeficiency virus infection, and diabetes mellitus were associated with increased risk, whereas higher income, elevated total cholesterol, and underlying dyslipidemia or schizophrenic/delusional disorders were associated with decreased risk of HCC development (p < 0.001). In the test cohort, our model showed good discrimination and calibration. The C-index, AUC, and Brier skill score were 0.857, 0.873, and 0.078, respectively. Conclusions: A machine learning-based model could be used to predict the risk of HCC development based on the health screening examination results and claim data.
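
The Brier skill score mentioned above compares a model's Brier score with that of a reference forecast. The sketch below assumes the reference is a constant prediction equal to the observed event rate; the abstract does not state which reference the authors used, so that choice, the function names, and the toy data are all assumptions made for illustration.

```python
def brier_score(outcomes, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, probs)) / len(outcomes)


def brier_skill_score(outcomes, probs):
    """1 - BS_model / BS_reference, with the base rate as the reference forecast (assumed)."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score(outcomes, [base_rate] * len(outcomes))
    if reference == 0:
        raise ValueError("reference Brier score is zero; all outcomes are identical")
    return 1.0 - brier_score(outcomes, probs) / reference


# Hypothetical toy data: one event among four subjects.
outcomes = [1, 0, 0, 0]
probs = [0.7, 0.1, 0.2, 0.1]
toy_bss = brier_skill_score(outcomes, probs)
```

Under this definition a score of 0 means the model is no better than always predicting the base rate, and 1 means perfect probabilistic prediction. When events are rare, as with HCC here, the reference Brier score is already small, which is one reason skill scores for rare outcomes tend to be modest even when discrimination is good.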

Prediction of the Risk of Developing Hepatocellular Carcinoma in Health Screening Examinees: a Korean Cohort Study

10.21203/rs.3.rs-343547/v1 ◽

2021 ◽

Author(s):

Chansik An ◽

Jong Won Choi ◽

Hyung Soon Lee ◽

Hyunsun Lim ◽

Seok Jong Ryu ◽

...

Keyword(s):

Machine Learning ◽

Hepatocellular Carcinoma ◽

Health Insurance ◽

Liver Disease ◽

Prediction Model ◽

Chronic Liver Disease ◽

National Health Insurance ◽

National Health ◽

Claim Data ◽

Health Screening

Background: Almost all Koreans are covered by mandatory national health insurance and are required to undergo health screening at least once every 2 years. We aimed to develop a machine learning model to predict the risk of developing hepatocellular carcinoma (HCC) based on the screening results and insurance claim data. Methods: The National Health Insurance Service-National Health Screening database was used for this study (NHIS-2020-2-146). Our study cohort consisted of health screening examinees in 2004 or 2005 without cancer history, which was randomly split into training and test cohorts. Robust predictors were selected using Cox proportional hazard regression with 1,000 different bootstrapped datasets. Random forest and extreme gradient boosting algorithms were used to develop a prediction model for the 12-year risk of HCC development after screening. After optimizing a prediction model via cross validation in the training cohort, the model was validated in the test cohort. Results: Of 331,694 examinees, 0.8% were diagnosed with HCC during the follow-up period (median, 11.2 years). Of the selected predictors, older age, male sex, abnormal liver function tests, a family history of chronic liver disease, underlying chronic liver disease, chronic hepatitis virus or human immunodeficiency virus infection, and diabetes mellitus were associated with increased risk, whereas elevated total cholesterol and underlying dyslipidemia or schizophrenic/delusional disorders were associated with decreased risk of HCC development (p < 0.001). In the test cohort, our model showed good discrimination and calibration. The C-index, AUC, and Brier skill score were 0.868, 0.872, and 0.08, respectively. Conclusions: A machine learning-based model could be used to predict the risk of HCC development based on the health screening examination results and claim data.

Machine Learning Prediction Models for Chronic Kidney Disease using National Health Insurance Claim Data in Taiwan

10.1101/2020.06.25.20139147 ◽

2020 ◽

Author(s):

Surya Krishnamurthy ◽

Kapeleshh KS ◽

Erik Dovgan ◽

Mitja Luštrek ◽

Barbara Gradišek Piletič ◽

...

Keyword(s):

Machine Learning ◽

Chronic Kidney Disease ◽

Health Insurance ◽

Kidney Disease ◽

National Health Insurance ◽

National Health ◽

Performance Metrics ◽

Prediction Models ◽

Research Database ◽

Number Of Patients

Background and Objective: Chronic kidney disease (CKD) represents a heavy burden on the healthcare system because of the increasing number of patients, high risk of progression to end-stage renal disease, and poor prognosis of morbidity and mortality. The aim of this study is to develop a machine-learning model that uses the comorbidity and medication data, obtained from Taiwan’s National Health Insurance Research Database, to forecast whether an individual will develop CKD within the next 6 or 12 months, and thus forecast the prevalence in the population. Methods: A total of 18,000 people with CKD and 72,000 people without a CKD diagnosis, matched by propensity score, along with their past two years of medication and comorbidity data, were used to build a prediction model. A series of approaches were tested, including Convolutional Neural Networks (CNN). 5-fold cross-validation was used to assess the performance metrics of the algorithms. Results: For both the 6-month and 12-month models, the CNN approach performed best, with AUROCs of 0.957 and 0.954, respectively. The most prominent features in the tree-based models were identified, including diabetes mellitus, age, gout, and medications such as sulfonamides and angiotensins, which had an impact on the progression of CKD. Conclusions: The model proposed in this study can be a useful tool for policymakers, helping them predict the trends of CKD in the population in the next 6 to 12 months. Information provided by this model can allow close monitoring of people at risk, early detection of CKD, better allocation of resources, and patient-centric management.
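
The 5-fold cross-validation mentioned in the methods partitions the data into five folds and holds each one out in turn. The following is a minimal sketch of the index bookkeeping only, assuming a simple shuffled split; the abstract does not say whether the authors stratified by outcome or kept matched groups together, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage with 10 samples: each sample is held out exactly once.
splits = list(k_fold_indices(10, k=5))
```

A model would be fitted on each `train` list and scored on the matching `test` list, with the five scores then averaged.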

A Self-Care Prediction Model for Children with Disability Based on Genetic Algorithm and Extreme Gradient Boosting

Mathematics ◽

10.3390/math8091590 ◽

2020 ◽

Vol 8 (9) ◽

pp. 1590

Author(s):

Muhammad Syafrudin ◽

Ganjar Alfian ◽

Norma Latif Fitriyani ◽

Muhammad Anshari ◽

Tony Hadibarata ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Prediction Model ◽

Prediction Models ◽

Self Care ◽

Gradient Boosting ◽

Children With Disability ◽

Study Results ◽

Extreme Gradient Boosting ◽

Care Problems

Detecting self-care problems is one of the important and challenging issues for occupational therapists, since it requires a complex and time-consuming process. Machine learning algorithms have recently been applied to overcome this issue. In this study, we propose a self-care prediction model called GA-XGBoost, which combines genetic algorithms (GAs) with extreme gradient boosting (XGBoost) for predicting self-care problems of children with disability. Selecting the feature subset affects the model performance; thus, we utilize GA to find the optimum feature subsets and improve the model’s performance. To validate the effectiveness of GA-XGBoost, we present six experiments: comparing GA-XGBoost with other machine learning models and previous study results, a statistical significance test, impact analysis of feature selection and comparison with other feature selection methods, and sensitivity analysis of GA parameters. During the experiments, we use accuracy, precision, recall, and F1-score to measure the performance of the prediction models. The results show that GA-XGBoost obtains better performance than other prediction models and the previous study results. In addition, we design and develop a web-based self-care prediction tool to help therapists diagnose the self-care problems of children with disabilities. Therefore, appropriate treatment/therapy could be performed for each child to improve their therapeutic outcome.
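
The four metrics named above all derive from the confusion-matrix counts. The sketch below covers the binary case only, as an assumption made for illustration; the abstract does not say how the authors averaged the metrics across classes, and the function name and toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
toy_metrics = classification_metrics(y_true, y_pred)
```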

Machine learning to predict distal caries in mandibular second molars associated with impacted third molars

Scientific Reports ◽

10.1038/s41598-021-95024-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sung-Hwi Hur ◽

Eun-Young Lee ◽

Min-Kyung Kim ◽

Somi Kim ◽

Ji-Yeon Kang ◽

...

Keyword(s):

Machine Learning ◽

Decision Making ◽

Clinical Decision Making ◽

Prediction Models ◽

Contact Point ◽

Characteristic Curve ◽

Gradient Boosting ◽

Support Vector ◽

Third Molars ◽

Extreme Gradient Boosting

Impacted mandibular third molars (M3M) are associated with the occurrence of distal caries on the adjacent mandibular second molars (DCM2M). In this study, we aimed to develop and validate five machine learning (ML) models designed to predict the occurrence of DCM2Ms due to the proximity with M3Ms and determine the relative importance of predictive variables for DCM2Ms that are important for clinical decision making. A total of 2642 mandibular second molars adjacent to M3Ms were analyzed and DCM2Ms were identified in 322 cases (12.2%). The models were trained using logistic regression, random forest, support vector machine, artificial neural network, and extreme gradient boosting ML methods and were subsequently validated using testing datasets. The performance of the ML models was significantly superior to that of single predictors. The area under the receiver operating characteristic curve of the machine learning models ranged from 0.88 to 0.89. Six features (sex, age, contact point at the cementoenamel junction, angulation of M3Ms, Winter's classification, and Pell and Gregory classification) were identified as relevant predictors. These prediction models could be used to detect patients at a high risk of developing DCM2M and ultimately contribute to caries prevention and treatment decision-making for impacted M3Ms.

Machine Learning Prediction Models for Chronic Kidney Disease Using National Health Insurance Claim Data in Taiwan

Healthcare ◽

10.3390/healthcare9050546 ◽

2021 ◽

Vol 9 (5) ◽

pp. 546

Author(s):

Surya Krishnamurthy ◽

Kapeleshh KS ◽

Erik Dovgan ◽

Mitja Luštrek ◽

Barbara Gradišek Piletič ◽

...

Keyword(s):

Machine Learning ◽

Chronic Kidney Disease ◽

Health Insurance ◽

Kidney Disease ◽

National Health Insurance ◽

National Health ◽

Prediction Models ◽

Research Database ◽

Close Monitoring ◽

Number Of Patients

Chronic kidney disease (CKD) represents a heavy burden on the healthcare system because of the increasing number of patients, high risk of progression to end-stage renal disease, and poor prognosis of morbidity and mortality. The aim of this study is to develop a machine-learning model that uses the comorbidity and medication data obtained from Taiwan’s National Health Insurance Research Database to forecast the occurrence of CKD within the next 6 or 12 months before its onset, and hence its prevalence in the population. A total of 18,000 people with CKD and 72,000 people without CKD diagnosis were selected using propensity score matching. Their demographic, medication and comorbidity data from their respective two-year observation period were used to build a predictive model. Among the approaches investigated, the Convolutional Neural Networks (CNN) model performed best with a test set AUROC of 0.957 and 0.954 for the 6-month and 12-month predictions, respectively. The most prominent predictors in the tree-based models were identified, including diabetes mellitus, age, gout, and medications such as sulfonamides and angiotensins. The model proposed in this study could be a useful tool for policymakers in predicting the trends of CKD in the population. The models can allow close monitoring of people at risk, early detection of CKD, better allocation of resources, and patient-centric management.

Impact of comorbidity assessment methods to predict non-cancer mortality risk in cancer patients: a retrospective observational study using the National Health Insurance Service claims-based data in Korea

BMC Medical Research Methodology ◽

10.1186/s12874-021-01257-2 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Sanghee Lee ◽

Yoon Jung Chang ◽

Hyunsoon Cho

Keyword(s):

Risk Assessment ◽

Health Insurance ◽

Cancer Patients ◽

National Health Insurance ◽

National Health ◽

Cancer Mortality ◽

Prediction Models ◽

Assessment Methods ◽

Chronic Obstructive ◽

Mortality Risks

Background: Cancer patients’ prognoses are complicated by comorbidities. Prognostic prediction models with inappropriate comorbidity adjustments yield biased survival estimates. However, an appropriate claims-based comorbidity risk assessment method remains unclear. This study aimed to compare methods used to capture comorbidities from claims data and predict non-cancer mortality risks among cancer patients. Methods: Data were obtained from the National Health Insurance Service-National Sample Cohort database in Korea; 2979 cancer patients diagnosed in 2006 were considered. Claims-based Charlson Comorbidity Index was evaluated according to the various assessment methods: different periods in washout window, lookback, and claim types. The prevalence of comorbidities and associated non-cancer mortality risks were compared. The Cox proportional hazards models considering left-truncation were used to estimate the non-cancer mortality risks. Results: The prevalence of peptic ulcer, the most common comorbidity, ranged from 1.5 to 31.0%, and the proportion of patients with ≥1 comorbidity ranged from 4.5 to 58.4%, depending on the assessment methods. Outpatient claims captured 96.9% of patients with chronic obstructive pulmonary disease; however, they captured only 65.2% of patients with myocardial infarction. The different assessment methods affected non-cancer mortality risks; for example, the hazard ratios for patients with moderate comorbidity (CCI 3–4) varied from 1.0 (95% CI: 0.6–1.6) to 5.0 (95% CI: 2.7–9.3). Inpatient claims resulted in relatively higher estimates reflective of disease severity. Conclusions: The prevalence of comorbidities and associated non-cancer mortality risks varied considerably by the assessment methods. Researchers should understand the complexity of comorbidity assessments in claims-based risk assessment and select an optimal approach.

Significant Physical and Exercise-Related Variables for Exercise-Centred Lifestyle: Big Data Analysis for Gynaecological Cancer Patients

BioMed Research International ◽

10.1155/2021/5362406 ◽

2021 ◽

Vol 2021 ◽

pp. 1-8

Author(s):

Eun Joo Yang ◽

Hyunseok Jee

Keyword(s):

Ovarian Cancer ◽

Health Insurance ◽

Receiver Operating Characteristic Curve ◽

Receiver Operating Characteristic ◽

National Health Insurance ◽

National Health ◽

Operating Characteristic ◽

Uterine Cancer ◽

Characteristic Curve ◽

Operating Characteristic Curve

This study investigated the characteristics of gynaecological cancers and aimed to identify significant risk variables using the National Health Insurance Sharing Service database to develop practical interventions for affected patients. Data regarding patients with uterine and ovarian cancer from the National Health Insurance Sharing Service database were collected and analysed using Student’s t-test, logistic regression, and receiver operating characteristic curve analyses. Student’s t-test analyses revealed that age, body mass index, blood pressure, and waist variables differed significantly among patients with uterine cancer. Gamma-glutamyl transpeptidase levels were higher in patients with ovarian cancer than in patients with uterine cancer. Physical fitness function tests reflected the status of patients with cancer. Moreover, physical disability was associated with an increased incidence of ovarian cancer. Intensive exercise for 20 min more than 1 time per week must be avoided to prevent uterine cancer. Receiver operating characteristic curve analyses showed that the optimal cutoff value for one-leg standing time, a prognostic and preventive factor in ovarian cancer, was 9.50 s (sensitivity, 94.9%; specificity, 96.9%). Controlling significant variables for each gynaecological cancer type in an individualised and optimised manner is recommended, including by maintenance of an adjusted exercise-centred lifestyle.

An Interpretable Early Dynamic Sequential Predictor for Sepsis-Induced Coagulopathy Progression in the Real-World Using Machine Learning

Frontiers in Medicine ◽

10.3389/fmed.2021.775047 ◽

2021 ◽

Vol 8 ◽

Author(s):

Ruixia Cui ◽

Wenbo Hua ◽

Kai Qu ◽

Heran Yang ◽

Yingmu Tong ◽

...

Keyword(s):

Machine Learning ◽

Real World ◽

Time Series Data ◽

Time Window ◽

Medical Center ◽

Characteristic Curve ◽

Series Data ◽

Gradient Boosting ◽

Early Management ◽

Extreme Gradient Boosting

Sepsis-associated coagulation dysfunction greatly increases the mortality of sepsis. Irregular clinical time-series data remains a major challenge for AI medical applications. To detect and manage sepsis-induced coagulopathy (SIC) and sepsis-associated disseminated intravascular coagulation (DIC) early, we developed an interpretable real-time sequential warning model for real-world irregular data. Eight machine learning models, including novel algorithms, were devised to detect SIC and sepsis-associated DIC 8n (1 ≤ n ≤ 6) hours prior to onset. Models were developed on Xi'an Jiaotong University Medical College (XJTUMC) data and verified on Beth Israel Deaconess Medical Center (BIDMC) data. A total of 12,154 SIC and 7,878 International Society on Thrombosis and Haemostasis (ISTH) overt-DIC labels were annotated according to the SIC and ISTH overt-DIC scoring systems in the training set. The area under the receiver operating characteristic curve (AUROC) was used as the model evaluation metric. The eXtreme Gradient Boosting (XGBoost) model can predict SIC and sepsis-associated DIC events up to 48 h earlier with AUROCs of 0.929 and 0.910, respectively, and even reached 0.973 and 0.955 at 8 h earlier, achieving the highest performance to date. The novel ODE-RNN model achieved continuous prediction at arbitrary time points, with AUROCs of 0.962 and 0.936 for SIC and DIC predicted 8 h earlier, respectively. In conclusion, our model can predict sepsis-associated SIC and DIC onset up to 48 h in advance, which helps maximize the time window for early management by physicians.

Property Rental Price Prediction Using the Extreme Gradient Boosting Algorithm

IJIIS: International Journal of Informatics and Information Systems ◽

10.47738/ijiis.v3i2.65 ◽

2020 ◽

Vol 3 (2) ◽

pp. 54-59

Author(s):

Marco Febriadi Kokasih ◽

Adi Suryaputra Paramita

Keyword(s):

Prediction Model ◽

Prediction Models ◽

Gradient Boosting ◽

Fair Price ◽

Price Prediction ◽

Online Marketplace ◽

Extreme Gradient Boosting ◽

Property Owners ◽

Boosting Algorithm

Online marketplaces for property rental, such as Airbnb, are growing. Many property owners have begun renting out their properties to meet this demand. Determining a price that is fair for both property owners and tourists is a challenge. Therefore, this study aims to create software that can build a prediction model for property rental prices. The variables used in this study are listing features, neighbourhood, reviews, date, and host information. The prediction model is created from the dataset supplied by the user, processed with the Extreme Gradient Boosting algorithm, and then stored in the system. The result of this study is expected to provide prediction models for property rental prices that property owners and tourists can consult when considering renting a property. In conclusion, the Extreme Gradient Boosting algorithm is able to predict property rental prices with an average RMSE of 10.86, or 13.30%.
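
The abstract reports RMSE both as an absolute value and as a percentage, without saying what the percentage is relative to. The sketch below assumes the percentage is RMSE divided by the mean observed price; that reading is an assumption, not something the source confirms, and the function names and toy prices are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_percent_of_mean(actual, predicted):
    """RMSE expressed as a percentage of the mean observed value (assumed normalisation)."""
    mean_actual = sum(actual) / len(actual)
    return 100.0 * rmse(actual, predicted) / mean_actual


# Hypothetical nightly prices; every prediction is off by exactly 10.
actual = [100.0, 80.0, 120.0, 100.0]
predicted = [110.0, 70.0, 110.0, 110.0]
toy_rmse = rmse(actual, predicted)
toy_pct = rmse_percent_of_mean(actual, predicted)
```

Other normalisations, such as dividing by the price range or the median, would give a different percentage for the same RMSE, so the two reported figures cannot be cross-checked without knowing which one was used.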
