Logistic regression and random forest unveil key molecular descriptors of druglikeness

Abstract: Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. The best performing classifier, built using molecular descriptors, achieved an area under the curve (AUC) score of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database consisting of 1738 small molecules. The chemical compounds of the screening database with a predictive probability of ≥ 0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds.
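The Gini importance mentioned in this abstract is built from decreases in Gini impurity at tree splits. The following is a minimal sketch of that building block for a single split, not the study's pipeline; the toy labels and the function names `gini_impurity` and `impurity_decrease` are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`,
    with each child weighted by its share of the parent's samples."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical toy node: 1 = extends lifespan, 0 = does not.
parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]  # a perfect split on some descriptor
decrease = impurity_decrease(parent, left, right)
```

In a random forest, decreases of this kind are typically summed per feature over all splits and trees and then normalised, which gives the ranking the abstract refers to.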


Prediction on the fluoride contamination in groundwater at the Datong Basin, Northern China: Comparison of Random Forest, Logistic Regression and Artificial Neural Network

Applied Geochemistry ◽

10.1016/j.apgeochem.2021.105054 ◽

2021 ◽

pp. 105054

Author(s):

Nafouanti Mouigni Baraka ◽

Junxia Li ◽

Nasiru Abba Mustapha ◽

Placide Uwamungu ◽

Dalal AL-Alimi

Keyword(s):

Neural Network ◽

Artificial Neural Network ◽

Logistic Regression ◽

Random Forest ◽

Northern China ◽

Fluoride Contamination ◽

Datong Basin ◽

Artificial Neural


Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models

Laboratory Investigation ◽

10.1038/s41374-021-00662-x ◽

2021 ◽

Author(s):

Catherine H. Feng ◽

Mary L. Disis ◽

Chao Cheng ◽

Lanjing Zhang

Keyword(s):

Colorectal Cancer ◽

Logistic Regression ◽

Feature Selection ◽

Random Forest ◽

Regression Models ◽

Multinomial Logistic Regression ◽

Logistic Regression Models ◽

Selection For


Preoperative predictions of in-hospital mortality using electronic medical record data

10.1101/329813 ◽

2018 ◽

Cited By ~ 1

Author(s):

Brian Hill ◽

Robert Brown ◽

Eilon Gabel ◽

Christine Lee ◽

Maxime Cannesson ◽

...

Keyword(s):

Logistic Regression ◽

Random Forest ◽

Hospital Mortality ◽

Risk Score ◽

Surgical Mortality ◽

Physical Status ◽

Preoperative Risk ◽

Physical Status Classification ◽

Postoperative Risk ◽

Status Classification

Background: Predicting preoperative in-hospital mortality using readily-available electronic medical record (EMR) data can aid clinicians in accurately and rapidly determining surgical risk. While previous work has shown that the American Society of Anesthesiologists (ASA) Physical Status Classification is a useful, though subjective, feature for predicting surgical outcomes, obtaining this classification requires a clinician to review the patient’s medical records. Our goal here is to create an improved risk score using electronic medical records and demonstrate its utility in predicting in-hospital mortality without requiring clinician-derived ASA scores. Methods: Data from 49,513 surgical patients were used to train logistic regression, random forest, and gradient boosted tree classifiers for predicting in-hospital mortality. The features used are readily available before surgery from EMR databases. A gradient boosted tree regression model was trained to impute the ASA Physical Status Classification, and this new, imputed score was included as an additional feature to preoperatively predict in-hospital post-surgical mortality. The preoperative risk prediction was then used as an input feature to a deep neural network (DNN), along with intraoperative features, to predict postoperative in-hospital mortality risk. Performance was measured using the area under the receiver operating characteristic (ROC) curve (AUC). Results: We found that the random forest classifier (AUC 0.921, 95% CI 0.908-0.934) outperforms logistic regression (AUC 0.871, 95% CI 0.841-0.900) and gradient boosted trees (AUC 0.897, 95% CI 0.881-0.912) in predicting in-hospital post-surgical mortality. Using logistic regression, the ASA Physical Status Classification score alone had an AUC of 0.865 (95% CI 0.848-0.882). Adding preoperative features to the ASA Physical Status Classification improved the random forest AUC to 0.929 (95% CI 0.915-0.943). Using only automatically obtained preoperative features with no clinician intervention, we found that the random forest model achieved an AUC of 0.921 (95% CI 0.908-0.934). Integrating the preoperative risk prediction into the DNN for postoperative risk prediction results in an AUC of 0.924 (95% CI 0.905-0.941), and with both a preoperative and postoperative risk score for each patient, we were able to show that the mortality risk changes over time. Conclusions: Features easily extracted from EMR data can be used to preoperatively predict the risk of in-hospital post-surgical mortality in a fully automated fashion, with accuracy comparable to models trained on features that require clinical expertise. This preoperative risk score can then be compared to the postoperative risk score to show that the risk changes, and therefore should be monitored longitudinally over time. Author summary: Rapid, preoperative identification of those patients at highest risk for medical complications is necessary to ensure that limited infrastructure and human resources are directed towards those most likely to benefit. Existing risk scores either lack specificity at the patient level, or utilize the American Society of Anesthesiologists (ASA) physical status classification, which requires a clinician to review the chart. In this manuscript we report on using machine-learning algorithms, specifically random forest, to create a fully automated score that predicts preoperative in-hospital mortality based solely on structured data available at the time of surgery. This score has a higher AUC than both the ASA physical status score and the Charlson comorbidity score. Additionally, we integrate this score with a previously published postoperative score to demonstrate the extent to which patient risk changes during the perioperative period.
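The AUC values with 95% confidence intervals reported in this abstract can be illustrated with a small standard-library sketch. The abstract does not say how the intervals were obtained, so the percentile bootstrap below is an assumption, and the function names and toy data are hypothetical.

```python
import random


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    has the higher score; ties count as half. Quadratic in sample size,
    which is fine for a small illustration."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUC is undefined
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical toy data: 1 = died in hospital, 0 = survived.
y = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
s = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.30, 0.05, 0.60, 0.50]
auc = roc_auc(y, s)
ci_low, ci_high = bootstrap_ci(y, s)
```

With a cohort as large as the one described, an interval computed this way would usually be far narrower than on a toy sample of ten patients.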


Machine Learning-based in-hospital Mortality Prediction Models for Patients With Acute Coronary Syndrome

10.21203/rs.3.rs-134944/v1 ◽

2020 ◽

Author(s):

Jun Ke ◽

Yiwei Chen ◽

Xiaoping Wang ◽

Zhiyong Wu ◽

Qiongyao Zhang ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Hospital Mortality ◽

Operating Characteristic ◽

Prediction Models ◽

Characteristic Curve ◽

Multivariate Logistic Regression Analysis ◽

Hdl Cholesterol ◽

Coronary Syndrome

Background: The purpose of this study is to identify the risk factors of in-hospital mortality in patients with acute coronary syndrome (ACS) and to evaluate the performance of traditional regression and machine learning prediction models. Methods: The data of ACS patients who entered the emergency department of Fujian Provincial Hospital from January 1, 2017 to March 31, 2020 for chest pain were retrospectively collected. The study used univariate and multivariate logistic regression analysis to identify risk factors for in-hospital mortality of ACS patients. The traditional regression and machine learning algorithms were used to develop predictive models, and the sensitivity, specificity, and receiver operating characteristic curve were used to evaluate the performance of each model. Results: A total of 7810 ACS patients were included in the study, and the in-hospital mortality rate was 1.75%. Multivariate logistic regression analysis found that age and levels of D-dimer, cardiac troponin I, N-terminal pro-B-type natriuretic peptide (NT-proBNP), lactate dehydrogenase (LDH), high-density lipoprotein (HDL) cholesterol, and calcium channel blockers were independent predictors of in-hospital mortality. The study found that the area under the receiver operating characteristic curve of the models developed by logistic regression, gradient boosting decision tree (GBDT), random forest, and support vector machine (SVM) for predicting the risk of in-hospital mortality were 0.963, 0.960, 0.963, and 0.959, respectively. Feature importance evaluation found that NT-proBNP, LDH, and HDL cholesterol were the top three variables contributing the most to the prediction performance of the GBDT and random forest models. Conclusions: The predictive models developed using logistic regression, GBDT, random forest, and SVM algorithms can be used to predict the risk of in-hospital death of ACS patients. Based on our findings, we recommend that clinicians focus on monitoring the changes of NT-proBNP, LDH, and HDL cholesterol, as this may improve the clinical outcomes of ACS patients.


Electronic Nose for Bladder Cancer Detection

Chemistry Proceedings ◽

10.3390/csac2021-10438 ◽

2021 ◽

Vol 5 (1) ◽

pp. 22

Author(s):

Heena Tyagi ◽

Emma Daulton ◽

Ayman S. Bannaga ◽

Ramesh P. Arasaradnam ◽

James A. Covington

Keyword(s):

Bladder Cancer ◽

Logistic Regression ◽

Random Forest ◽

Sensitivity And Specificity ◽

Cancer Detection ◽

Electronic Nose ◽

Random Forest Classifier ◽

Urine Samples ◽

Sparse Logistic Regression ◽

High Separation

This study outlines the use of an electronic nose as a method for the detection of volatile organic compounds (VOCs) as biomarkers of bladder cancer. Here, an AlphaMOS FOX 4000 electronic nose was used for the analysis of urine samples from 15 bladder cancer and 41 non-cancerous patients. The FOX 4000 consists of 18 MOS sensors that were used to differentiate the two groups. The results obtained were analysed using MultiSens Analyzer and RStudio. The results showed a high separation with sensitivity and specificity of 0.93 and 0.88, respectively, using a Sparse Logistic Regression and 0.93 and 0.76 using a Random Forest classifier. We conclude that the electronic nose shows potential for discriminating bladder cancer from non-cancer subjects using urine samples.
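The sensitivity and specificity figures above can be related to the group sizes with a short calculation. The abstract does not report a confusion matrix, so the counts below are assumptions chosen only because they round to the reported values for 15 cancer and 41 non-cancer samples; the reported figures may instead be averages over cross-validation folds.

```python
def sensitivity(tp, fn):
    """True positive rate: detected cancers / all cancers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: correctly cleared controls / all controls."""
    return tn / (tn + fp)


# Assumed counts (not given in the abstract), consistent with
# 15 bladder cancer samples and 41 non-cancer samples.
slr_sens = sensitivity(tp=14, fn=1)   # sparse logistic regression
slr_spec = specificity(tn=36, fp=5)
rf_sens = sensitivity(tp=14, fn=1)    # random forest
rf_spec = specificity(tn=31, fp=10)

summary = {
    "sparse_logistic_regression": (round(slr_sens, 2), round(slr_spec, 2)),
    "random_forest": (round(rf_sens, 2), round(rf_spec, 2)),
}
```

Under these assumed counts the two classifiers would miss the same number of cancers and differ only in false positives among the controls.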


Development of an ensemble machine learning prognostic model to predict 60-day risk of major adverse cardiac events in adults with chest pain

10.1101/2021.03.08.21252615 ◽

2021 ◽

Author(s):

Chris J. Kennedy ◽

Dustin G. Mark ◽

Jie Huang ◽

Mark J. van der Laan ◽

Alan E. Hubbard ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Chest Pain ◽

Random Forest ◽

Decision Trees ◽

Low Risk ◽

Major Adverse Cardiac Events ◽

Risk Scores ◽

Cardiac Events ◽

Adverse Cardiac Events

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important to improve resource allocation and reduce over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients that are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS). 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling, and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance to key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients to be below a 0.5% 60-day MACE risk threshold. 
The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: Use of ML algorithms, combined with broad predictor sets, improved MACE risk prediction compared to simpler alternatives, while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information available in other characteristics and combined in nuanced ways via ML.
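The gap between the PR-AUC and ROC-AUC values in this abstract is easier to read with the outcome rate in mind: for a rare outcome, a model with no discrimination has an expected PR-AUC near the prevalence (about 0.019 here), whereas its ROC-AUC is about 0.5. The sketch below computes average precision, one common estimator of PR-AUC; whether the authors used this estimator is not stated, and the function names and toy data are hypothetical.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive,
    after sorting by descending score. Ties get no special handling."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def prevalence(y_true):
    """Share of positives; the approximate PR-AUC of an uninformative model."""
    return sum(y_true) / len(y_true)


# Hypothetical toy data: 1 = major adverse cardiac event within 60 days.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y, s)
baseline = prevalence(y)
```

Read this way, a PR-AUC of 0.148 against a 1.9% event rate is several times the no-skill level even though the absolute number looks small.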


Machine-Learning vs. Expert-Opinion Driven Logistic Regression Modelling for Predicting 30-Day Unplanned Rehospitalisation in Preterm Babies: A Prospective, Population-Based Study (EPIPAGE 2)

Frontiers in Pediatrics ◽

10.3389/fped.2020.585868 ◽

2021 ◽

Vol 8 ◽

Author(s):

Robert A. Reed ◽

Andrei S. Morgan ◽

Jennifer Zeitlin ◽

Pierre-Henri Jarreau ◽

Héloïse Torchin ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Regression Model ◽

Expert Opinion ◽

Logistic Regression Model ◽

Population Based ◽

Regression Modelling ◽

Preterm Babies ◽

Logistic Regression Modelling

Introduction: Preterm babies are a vulnerable population that experience significant short and long-term morbidity. Rehospitalisations constitute an important, potentially modifiable adverse event in this population. Improving the ability of clinicians to identify those patients at the greatest risk of rehospitalisation has the potential to improve outcomes and reduce costs. Machine-learning algorithms can provide potentially advantageous methods of prediction compared to conventional approaches like logistic regression. Objective: To compare two machine-learning methods (least absolute shrinkage and selection operator (LASSO) and random forest) to expert-opinion driven logistic regression modelling for predicting unplanned rehospitalisation within 30 days in a large French cohort of preterm babies. Design, Setting and Participants: This study used data derived exclusively from the population-based prospective cohort study of French preterm babies, EPIPAGE 2. Only those babies discharged home alive and whose parents completed the 1-year survey were eligible for inclusion in our study. All predictive models used a binary outcome, denoting a baby's status for an unplanned rehospitalisation within 30 days of discharge. Predictors included those quantifying clinical, treatment, maternal and socio-demographic factors. The predictive abilities of models constructed using LASSO and random forest algorithms were compared with a traditional logistic regression model. The logistic regression model comprised 10 predictors, selected by expert clinicians, while the LASSO and random forest included 75 predictors. Performance measures were derived using 10-fold cross-validation. Performance was quantified using area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, Tjur's coefficient of determination and calibration measures. Results: The rate of 30-day unplanned rehospitalisation in the eligible population used to construct the models was 9.1% (95% CI 8.2–10.1) (350/3,841). The random forest model demonstrated both an improved AUROC (0.65; 95% CI 0.59–0.7; p = 0.03) and specificity vs. logistic regression (AUROC 0.57; 95% CI 0.51–0.62, p = 0.04). The LASSO performed similarly (AUROC 0.59; 95% CI 0.53–0.65; p = 0.68) to logistic regression. Conclusions: Compared to an expert-specified logistic regression model, random forest offered improved prediction of 30-day unplanned rehospitalisation in preterm babies. However, all models offered relatively low levels of predictive ability, regardless of modelling method.
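Of the performance measures listed in this abstract, Tjur's coefficient of determination is probably the least familiar. It is the mean predicted probability among cases with the outcome minus the mean predicted probability among cases without it. A minimal sketch follows, with a hypothetical function name and toy predictions; it is not taken from the study's code.

```python
def tjur_r2(y_true, probs):
    """Tjur's coefficient of determination for a binary outcome:
    mean predicted probability among positives minus that among negatives."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Hypothetical toy data: 1 = rehospitalised within 30 days of discharge.
y = [1, 1, 0, 0]
p = [0.9, 0.7, 0.2, 0.4]
d = tjur_r2(y, p)
```

A value near 0 means the model assigns similar probabilities to both groups, and a value near 1 means it separates them almost completely.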


What drives forest fire in Fujian, China? Evidence from logistic regression and Random Forests

International Journal of Wildland Fire ◽

10.1071/wf15121 ◽

2016 ◽

Vol 25 (5) ◽

pp. 505 ◽

Cited By ~ 27

Author(s):

Futao Guo ◽

Guangyu Wang ◽

Zhangwen Su ◽

Huiling Liang ◽

Wenhui Wang ◽

...

Keyword(s):

Logistic Regression ◽

Random Forest ◽

Regional Scale ◽

Fire Risk ◽

Driving Factors ◽

Fire Season ◽

Fire Occurrence ◽

Climate Factors ◽

Local Factors ◽

Risk Zones

We applied logistic regression and Random Forest to evaluate drivers of fire occurrence on a provincial scale. Potential driving factors were divided into two groups according to scale of influence: ‘climate factors’, which operate on a regional scale, and ‘local factors’, which include infrastructure, vegetation, topographic and socioeconomic data. The groups of factors were analysed separately and then significant factors from both groups were analysed together. Both models identified significant driving factors, which were ranked in terms of relative importance. Results show that climate factors are the main drivers of fire occurrence in the forests of Fujian, China. In particular, sunshine hours, relative humidity (fire seasonal and daily), precipitation (fire season) and temperature (fire seasonal and daily) were seen to play a crucial role in fire ignition. Of the local factors, elevation, distance to railway and per capita GDP were found to be most significant. Random Forest demonstrated a higher predictive ability than logistic regression across all groups of factors (climate, local, and climate and local combined). Maps of the likelihood of fire occurrence in Fujian illustrate that the high fire-risk zones are distributed across administrative divisions; consequently, fire management strategies should be devised based on fire-risk zones, rather than on separate administrative divisions.
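A fitted logistic regression of the kind described here turns driving factors into a likelihood of fire occurrence by passing a weighted sum through the logistic function. The sketch below shows only that mechanism; the coefficients, the intercept, and the two predictor names are hypothetical values invented for illustration and are not estimates from the study.

```python
import math


def fire_probability(intercept, coefs, features):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum of coef * value)))."""
    z = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (not from the paper): more sunshine raises the
# probability, higher relative humidity lowers it.
coefs = {"sunshine_hours": 0.30, "relative_humidity": -0.05}
grid_cell = {"sunshine_hours": 8.0, "relative_humidity": 60.0}

p_fire = fire_probability(intercept=0.0, coefs=coefs, features=grid_cell)
```

Evaluating such a function over every grid cell of a region is one way a likelihood map like the one mentioned in the abstract could be produced.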


Heart Disease Prediction with Logistic Regression and Random Forest Model

European Journal of Biomedical and Life Sciences ◽

10.29013/elbls-21-1.2-24-33 ◽

2021 ◽

pp. 24-33

Author(s):

D. Tang

Keyword(s):

Logistic Regression ◽

Heart Disease ◽

Random Forest ◽

Random Forest Model ◽

Disease Prediction ◽

Forest Model
