Predicting youth diabetes risk using NHANES data and machine learning

Mapping Intimacies ◽

10.1101/19007872 ◽

2019 ◽

Author(s):

Nita Vangeepuram ◽

Bian Liu ◽

Po-hsiang Chiu ◽

Linhua Wang ◽

Gaurav Pandey

Keyword(s):

Machine Learning ◽

Large Scale ◽

Diabetes Risk ◽

Screening Tools ◽

Data Set ◽

Screening Guidelines ◽

Race Ethnicity ◽

Pediatric Screening ◽

Serious Disease ◽

Better Than

Abstract: Type 2 diabetes has become alarmingly prevalent among youth in recent years. However, simple questionnaire-based screening tools to reliably identify diabetes risk and prevent the adverse effects of this serious disease are only available for adults, not for youth. As a first step in developing such a tool, we used a large-scale dataset from the National Health and Nutrition Examination Survey (NHANES) to examine the performance of a well-known adult diabetes risk self-assessment screener and published pediatric clinical screening guidelines in identifying youth with pre-diabetes/diabetes (pre-DM/DM) based on American Diabetes Association diagnostic biomarkers. We assessed the agreement between the adult screener/pediatric screening guidelines and biomarker diagnostic criteria by conducting comparisons using the overall data set and sub-datasets stratified by sex, race/ethnicity, and age. While the pediatric guidelines performed better than the adult screener in identifying youth with pre-DM/DM (sensitivity 43.1% vs 7.2%), both are inadequate for general deployment among youth. There were also notable differences in the performance of the pediatric guidelines across subgroups based on age, sex, and race/ethnicity. In an effort to improve pre-DM/DM screening, we also evaluated data-driven machine learning-based classification algorithms, several of which performed slightly but statistically significantly better than the pediatric screening guidelines.


Predicting youth diabetes risk using NHANES data and machine learning

Scientific Reports ◽

10.1038/s41598-021-90406-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nita Vangeepuram ◽

Bian Liu ◽

Po-hsiang Chiu ◽

Linhua Wang ◽

Gaurav Pandey

Keyword(s):

Machine Learning ◽

Large Scale ◽

Clinical Guideline ◽

Diabetes Risk ◽

Screening Tools ◽

Clinical Screening ◽

Predictive Values ◽

Screening Guideline ◽

Demographic Subgroups ◽

Sensitivity Specificity

Abstract: Prediabetes and diabetes mellitus (preDM/DM) have become alarmingly prevalent among youth in recent years. However, simple questionnaire-based screening tools to reliably assess diabetes risk are only available for adults, not youth. As a first step in developing such a tool, we used a large-scale dataset from the National Health and Nutrition Examination Survey (NHANES) to examine the performance of a published pediatric clinical screening guideline in identifying youth with preDM/DM based on American Diabetes Association diagnostic biomarkers. We assessed the agreement between the clinical guideline and biomarker criteria using established evaluation measures (sensitivity, specificity, positive/negative predictive value, F-measure for the positive/negative preDM/DM classes, and Kappa). We also compared the performance of the guideline to that of machine learning (ML) based preDM/DM classifiers derived from the NHANES dataset. Approximately 29% of the 2858 youth in our study population had preDM/DM based on biomarker criteria. The clinical guideline had a sensitivity of 43.1% and specificity of 67.6%, positive/negative predictive values of 35.2%/74.5%, positive/negative F-measures of 38.8%/70.9%, and Kappa of 0.1 (95% CI: 0.06–0.14). The performance of the guideline varied across demographic subgroups. Some ML-based classifiers performed comparably to or better than the screening guideline, especially in identifying preDM/DM youth (p = 5.23 × 10⁻⁵). We demonstrated that a recommended pediatric clinical screening guideline did not perform well in identifying preDM/DM status among youth. Additional work is needed to develop a simple yet accurate screener for youth diabetes risk, potentially by using advanced ML methods and a wider range of clinical and behavioral health data.
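The evaluation measures named in this abstract can all be derived from a single 2x2 table of guideline result versus biomarker status. The sketch below is illustrative only: the function name `screening_metrics` is hypothetical, and the four cell counts are not taken from the paper. They were back-calculated from the rounded figures in the abstract (2858 youth, about 29% prevalence, 43.1% sensitivity, 67.6% specificity), so the derived values match the reported ones only approximately.

```python
def screening_metrics(tp, fp, fn, tn):
    """Agreement measures for a 2x2 screening table (guideline vs. biomarker)."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # share of true cases the guideline flags
    specificity = tn / (tn + fp)   # share of non-cases the guideline clears
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    # F-measure per class: harmonic mean of that class's precision and recall.
    f_pos = 2 * ppv * sensitivity / (ppv + sensitivity)
    f_neg = 2 * npv * specificity / (npv + specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "ppv": ppv, "npv": npv,
        "f_pos": f_pos, "f_neg": f_neg, "kappa": kappa,
    }


# ASSUMPTION: hypothetical cell counts reconstructed from the rounded
# percentages in the abstract; the paper's actual table may differ.
metrics = screening_metrics(tp=357, fp=657, fn=472, tn=1372)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these reconstructed counts, a kappa near 0.1 shows how little the guideline agrees with biomarker status beyond chance, even though specificity alone looks moderate.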


Machine learning identifies an immunological pattern associated with multiple juvenile idiopathic arthritis subtypes

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2018-214354 ◽

2019 ◽

Vol 78 (5) ◽

pp. 617-628 ◽

Cited By ~ 5

Author(s):

Erika Van Nieuwenhove ◽

Vasiliki Lagou ◽

Lien Van Eyck ◽

James Dooley ◽

Ulrich Bodenhofer ◽

...

Keyword(s):

Machine Learning ◽

Juvenile Idiopathic Arthritis ◽

Large Scale ◽

Inflammatory Diseases ◽

Adaptive Immune System ◽

Healthy Children ◽

Learning Approaches ◽

Data Set ◽

Immune Signature ◽

Systemic Jia

Objectives: Juvenile idiopathic arthritis (JIA) is the most common class of childhood rheumatic diseases, with distinct disease subsets that may have diverging pathophysiological origins. Both adaptive and innate immune processes have been proposed as primary drivers, which may account for the observed clinical heterogeneity, but few high-depth studies have been performed. Methods: Here we profiled the adaptive immune system of 85 patients with JIA and 43 age-matched controls with in-depth flow cytometry and machine learning approaches. Results: Immune profiling identified immunological changes in patients with JIA. This immune signature was shared across a broad spectrum of childhood inflammatory diseases. The immune signature was identified in clinically distinct subsets of JIA, but was accentuated in patients with systemic JIA and those patients with active disease. Despite the extensive overlap in the immunological spectrum exhibited by healthy children and patients with JIA, machine learning analysis of the data set proved capable of discriminating patients with JIA from healthy controls with ~90% accuracy. Conclusions: These results pave the way for large-scale immune phenotyping longitudinal studies of JIA. The ability to discriminate between patients with JIA and healthy individuals provides proof of principle for the use of machine learning to identify immune signatures that are predictive of treatment response group.


Different firm responses to the COVID-19 pandemic shocks: machine-learning evidence on the Vietnamese labor market

International Journal of Emerging Markets ◽

10.1108/ijoem-02-2021-0292 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Lam Hoang Viet Le ◽

Toan Luu Duc Huynh ◽

Bryan S. Weber ◽

Bao Khac Quoc Nguyen

Keyword(s):

Machine Learning ◽

Labor Market ◽

Large Scale ◽

Government Support ◽

Policy Implications ◽

Machine Learning Techniques ◽

Firm Characteristics ◽

Data Set ◽

Content Type ◽

Firm Responses

Purpose: This paper aims to identify the disproportionate impacts of the COVID-19 pandemic on labor markets. Design/methodology/approach: The authors conduct a large-scale survey of 16,000 firms from 82 industries in Ho Chi Minh City, Vietnam, and analyze the data set using different machine-learning methods. Findings: First, job loss and reduction in state-owned enterprises have been significantly larger than in other types of organizations. Second, employees of foreign direct investment enterprises suffer a significantly lower labor income than those of other groups. Third, the adverse effects of the COVID-19 pandemic on the labor market are heterogeneous across industries and geographies. Finally, firms with high revenue in 2019 are more likely to adopt preventive measures, including the reduction of labor forces. The authors also find a significant correlation between firms' revenue and labor reduction, as both traditional econometrics and machine-learning techniques suggest. Originality/value: This study has two main policy implications. First, although government support through taxes has been provided, the authors highlight evidence that there may be some additional benefit from targeting firms that have characteristics associated with layoffs or other negative labor responses. Second, the authors provide information that shows which firm characteristics are associated with particular labor market responses such as layoffs, which may help target stimulus packages. Although the COVID-19 pandemic affects most industries and occupations, heterogeneous firm responses suggest that there could be several varieties of targeted policies: targeting firms that are likely to reduce labor forces or firms likely to face reduced revenue. In this paper, the authors outline several industries and firm characteristics that appear to be more directly associated with reduced employee counts or other negative labor responses, which may lead to more cost-effective stimulus.


Prediction of condition-specific regulatory genes using machine learning

Nucleic Acids Research ◽

10.1093/nar/gkaa264 ◽

2020 ◽

Vol 48 (11) ◽

pp. e62-e62 ◽

Cited By ~ 2

Author(s):

Qi Song ◽

Jiyoung Lee ◽

Shamima Akter ◽

Matthew Rogers ◽

Ruth Grene ◽

...

Keyword(s):

Machine Learning ◽

Transcription Factors ◽

Single Cell ◽

Control Cell ◽

Genomic Data ◽

Regulatory Genes ◽

Genomic Research ◽

Open Chromatin ◽

Data Set ◽

Better Than

Abstract: Recent advances in genomic technologies have generated data on large-scale protein–DNA interactions and open chromatin regions for many eukaryotic species. How to identify condition-specific functions of transcription factors using these data has become a major challenge in genomic research. To solve this problem, we have developed a method called ConSReg, which provides a novel approach to integrate regulatory genomic data into predictive machine learning models of key regulatory genes. Using Arabidopsis as a model system, we tested our approach to identify regulatory genes in data sets from single-cell gene expression and from abiotic stress treatments. Our results showed that ConSReg accurately predicted transcription factors that regulate differentially expressed genes with an average auROC of 0.84, which is 23.5–25% better than enrichment-based approaches. To further validate the performance of ConSReg, we analyzed an independent data set related to plant nitrogen responses. ConSReg provided better rankings of the correct transcription factors in 61.7% of cases, which is three times better than other plant tools. We applied ConSReg to Arabidopsis single-cell RNA-seq data, successfully identifying candidate regulatory genes that control cell wall formation. Our methods provide a new approach to define candidate regulatory genes using integrated genomic data in plants.


Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement

Information ◽

10.3390/info11060332 ◽

2020 ◽

Vol 11 (6) ◽

pp. 332

Author(s):

Ernest Kwame Ampomah ◽

Zhiguang Qin ◽

Gabriel Nyame

Keyword(s):

Machine Learning ◽

Stock Market ◽

Stock Price ◽

Superior Performance ◽

Operating Characteristics ◽

Training Set ◽

Data Set ◽

Test Set ◽

Ensemble Machine Learning ◽

Better Than

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries substantial risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models. In addition, ensemble ML models have been shown in the literature to outperform single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight different stock data sets from three stock exchanges (NYSE, NASDAQ, and NSE) are randomly collected and used for the study. Each data set is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). Kendall's W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics generated results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
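Kendall's W, the concordance measure mentioned in this abstract, summarizes how consistently several rankings order the same items, from 0 (no agreement) to 1 (identical rankings). The sketch below is a minimal illustration of the standard formula without tie correction. It assumes, for illustration, that each data set acts as a "rater" that ranks the models; the function name `kendalls_w` and the rank values are hypothetical and are not taken from the paper.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance for untied rankings.

    ranks: list of m rankings, each a list assigning ranks 1..n
    to the same n items (here: models), one ranking per rater.
    """
    m = len(ranks)       # number of raters (e.g., data sets)
    n = len(ranks[0])    # number of ranked items (e.g., models)
    # Total rank received by each item across all raters.
    totals = [sum(r[i] for r in ranks) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # Sum of squared deviations of the rank totals from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m**2 * (n**3 - n))


# ASSUMPTION: hypothetical ranks of four models on three data sets
# (1 = best); these numbers are invented for illustration.
example_ranks = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
w = kendalls_w(example_ranks)
print(f"Kendall's W = {w:.3f}")
```

A value of W close to 1 indicates that the rankings agree closely enough for an overall ordering of the models to be meaningful; tied ranks would require the corrected form of the statistic.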


Investigations on optimizing performance of the distributed computing in heterogeneous environment using machine learning technique for large scale data set

Materials Today Proceedings ◽

10.1016/j.matpr.2021.07.089 ◽

2021 ◽

Author(s):

Rajeev Pandey ◽

Sanjay Silakari

Keyword(s):

Machine Learning ◽

Distributed Computing ◽

Large Scale ◽

Heterogeneous Environment ◽

Machine Learning Technique ◽

Data Set ◽

Large Scale Data ◽

Learning Technique ◽

Scale Data


Learning to Solve Large-Scale Security-Constrained Unit Commitment Problems

INFORMS Journal on Computing ◽

10.1287/ijoc.2020.0976 ◽

2020 ◽

Cited By ~ 1

Author(s):

Álinson S. Xavier ◽

Feng Qiu ◽

Shabbir Ahmed

Keyword(s):

Machine Learning ◽

Linear Programming ◽

Integer Linear Programming ◽

Mixed Integer Linear Programming ◽

Large Scale ◽

Unit Commitment ◽

Energy Sector ◽

Mixed Integer ◽

Data Set ◽

Computational Performance

Security-constrained unit commitment (SCUC) is a fundamental problem in power systems and electricity markets. In practical settings, SCUC is repeatedly solved via mixed-integer linear programming (MIP), sometimes multiple times per day, with only minor changes in input data. In this work, we propose a number of machine learning techniques to effectively extract information from previously solved instances in order to significantly improve the computational performance of MIP solvers when solving similar instances in the future. Based on statistical data, we predict redundant constraints in the formulation, good initial feasible solutions, and affine subspaces where the optimal solution is likely to lie, leading to a significant reduction in problem size. Computational results on a diverse set of realistic and large-scale instances show that using the proposed techniques, SCUC can be solved on average 4.3 times faster with optimality guarantees and 10.2 times faster without optimality guarantees, with no observed reduction in solution quality. Out-of-distribution experiments provide evidence that the method is somewhat robust against data-set shift. Summary of Contribution. The paper describes a novel computational method, based on a combination of mixed-integer linear programming (MILP) and machine learning (ML), to solve a challenging and fundamental optimization problem in the energy sector. The method advances the state-of-the-art, not only for this particular problem, but also, more generally, in solving discrete optimization problems via ML. We expect that the techniques presented can be readily used by practitioners in the energy sector and adapted, by researchers in other fields, to other challenging operations research problems that are solved routinely.


Machine-Learning Algorithms Based on Screening Tests for Mild Cognitive Impairment

American Journal of Alzheimer's Disease & Other Dementias® ◽

10.1177/1533317520927163 ◽

2020 ◽

Vol 35 ◽

pp. 153331752092716

Author(s):

Jin-Hyuck Park

Keyword(s):

Machine Learning ◽

Cognitive Impairment ◽

Mild Cognitive Impairment ◽

Test Data ◽

Learning Algorithms ◽

Test System ◽

Machine Learning Algorithms ◽

Screening Tools ◽

Screening Tests ◽

Data Set

Background: The mobile screening test system for mild cognitive impairment (mSTS-MCI) was developed and validated to address the low sensitivity and specificity of the Montreal Cognitive Assessment (MoCA), which is widely used clinically. Objective: This study aimed to evaluate the efficacy of machine learning algorithms based on the mSTS-MCI and the Korean version of the MoCA. Method: In total, 103 healthy individuals and 74 patients with MCI were randomly divided into training and test data sets. The algorithm was trained with TensorFlow on the training data set, and its accuracy was then calculated on the test data set. The cost was calculated via logistic regression in this case. Result: The predictive power of the algorithms was higher than that of the original tests. In particular, the algorithm based on the mSTS-MCI showed the highest positive predictive value. Conclusion: The machine learning algorithms predicting MCI showed findings comparable to those of the conventional screening tools.


Determining Zygosity in Infant Twins – Revisiting the Questionnaire Approach

Twin Research and Human Genetics ◽

10.1017/thg.2021.24 ◽

2021 ◽

pp. 1-8

Author(s):

Irzam Hardiansyah ◽

Linnea Hamrefors ◽

Monica Siqueiros ◽

Terje Falck-Ytter ◽

Kristiina Tammimies

Keyword(s):

Machine Learning ◽

Large Scale ◽

Young Infant ◽

Twin Studies ◽

Data Set ◽

Physical Similarity ◽

Machine Learning Model ◽

Simulation Based ◽

Machine Learning Approach ◽

Established Technique

Abstract: Accurate zygosity determination is a fundamental step in twin research. Although DNA-based testing is the gold standard for determining zygosity, collecting biological samples is not feasible in all research settings or all families. Previous work has demonstrated the feasibility of zygosity estimation based on questionnaire (physical similarity) data in older twins, but the extent to which this is also a reliable approach in infancy is less well established. Here, we report the accuracy of different questionnaire-based zygosity determination approaches (traditional and machine learning) in 5.5-month-old twins. The participant cohort comprised 284 infant twin pairs (128 dizygotic and 156 monozygotic) who participated in the Babytwins Study Sweden (BATSS). Manual scoring based on an established technique validated in older twins accurately predicted 90.49% of the zygosities with a sensitivity of 91.65% and specificity of 89.06%. The machine learning approach improved the prediction accuracy to 93.10%, with a sensitivity of 91.30% and specificity of 94.29%. Additionally, we quantified the systematic impact of zygosity misclassification on estimates of genetic and environmental influences using simulation-based sensitivity analysis on a separate data set to show the implication of our machine learning accuracy gain. In conclusion, our study demonstrates the feasibility of determining zygosity in very young infant twins using a questionnaire with four items and builds a scalable machine learning model with better metrics, thus offering a viable alternative to DNA tests in large-scale infant twin studies.


Computational Prediction of the Isoform Specificity of Cytochrome P450 Substrates by an Improved Bayesian Method

10.21203/rs.2.9738/v1 ◽

2019 ◽

Author(s):

Hao Dai ◽

Yu-Xi Zheng ◽

Xiao-Qi Shan ◽

Yan-Yi Chu ◽

Wei Wang ◽

...

Keyword(s):

Machine Learning ◽

Cytochrome P450 ◽

Drug Interactions ◽

Predictive Models ◽

Large Scale ◽

Bayesian Method ◽

Small Data ◽

Human Beings ◽

Data Set ◽

Isoform Specificity

Abstract: Cytochrome P450 (CYP) is the most important drug-metabolizing enzyme in human beings. Each CYP isoform is able to metabolize a large number of compounds, and if patients take more than one drug during treatment, it is possible that some drugs will be metabolized by the same CYP isoform, leading to potential drug-drug interactions and side effects. Therefore, it is necessary to investigate the isoform specificity of CYP substrates. In this study, we constructed a data set consisting of 10 major CYP isoforms associated with 776 substrates, and used machine learning methods to construct predictive models based on the structural and physicochemical properties of the substrates. We also proposed a new method, called the Improved Bayesian method, which is suitable for small data sets and is able to construct more stable and accurate predictive models than other traditional machine learning models. Our method achieved an accuracy of 86% on the independent test, which was significantly better than that of the existing models. We believe that our proposed method will facilitate the understanding of drug metabolism and help the large-scale analysis of drug-drug interactions.
