A Radiogenomics Ensemble to Predict EGFR and KRAS Mutations in NSCLC

Silvia Moreno; Mario Bonfante; Eduardo Zurek; Dmitry Cherezov; Dmitry Goldgof; Lawrence Hall; Matthew Schabath

doi:10.3390/tomography7020014

Tomography ◽

10.3390/tomography7020014 ◽

2021 ◽

Vol 7 (2) ◽

pp. 154-168

Author(s):

Silvia Moreno ◽

Mario Bonfante ◽

Eduardo Zurek ◽

Dmitry Cherezov ◽

Dmitry Goldgof ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Kras Mutation ◽

Learning Approach ◽

Learning Models ◽

Kras Mutations ◽

Machine Learning Approach ◽

Class Average ◽

Public Datasets ◽

Machine Learning Models

Lung cancer causes more deaths globally than any other type of cancer. Detecting EGFR and KRAS mutations helps determine the best treatment, but non-invasive ways to obtain this information are not available. Furthermore, relevant public datasets are often too small, so single classifiers perform modestly. In this paper, an ensemble approach is applied to improve EGFR and KRAS mutation prediction from a small dataset. A new voting scheme, Selective Class Average Voting (SCAV), is proposed, and its performance is assessed for both machine learning models and CNNs. For the EGFR mutation, the machine learning approach showed an increase in sensitivity from 0.66 to 0.75 and in AUC from 0.68 to 0.70. With the deep learning approach, an AUC of 0.846 was obtained, and SCAV increased the accuracy of the model from 0.80 to 0.857. For the KRAS mutation, a significant increase in performance was found both in the machine learning models (AUC 0.65 to 0.71) and in the deep learning models (AUC 0.739 to 0.778). The results show how to learn effectively from small image datasets to predict EGFR and KRAS mutations, and that ensembles with SCAV increase the performance of machine learning classifiers and CNNs. They also provide confidence that, as large datasets become available, tools to augment clinical capabilities can be fielded.
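
The abstract does not spell out how SCAV selects which classifier outputs to average, so the sketch below shows only the plain class-average (soft) voting that the name suggests it builds on: each classifier's per-class probabilities are averaged and the class with the highest mean wins. The function name `average_vote` and the example probabilities are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def average_vote(probabilities: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Average per-class probabilities across classifiers and pick the top class.

    probabilities[i][c] is classifier i's probability for class c.
    This is plain averaging; it does NOT include SCAV's selection step.
    """
    if not probabilities:
        raise ValueError("need at least one classifier output")
    n_classes = len(probabilities[0])
    if any(len(p) != n_classes for p in probabilities):
        raise ValueError("all classifiers must score the same classes")
    n_models = len(probabilities)
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    # max() returns the first maximal element, so ties go to the lowest class index.
    winner = max(range(n_classes), key=lambda c: mean[c])
    return winner, mean


# Three hypothetical classifiers scoring one patient as [P(wild type), P(mutant)].
outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
label, mean = average_vote(outputs)
print(label, [round(m, 3) for m in mean])
```

Averaging probabilities rather than counting hard votes lets a confident classifier outweigh two uncertain ones, which can matter when only a handful of models is trained on a small dataset.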

Telugu News Data Classification Using Machine Learning Approach

10.4018/978-1-7998-7685-4.ch014 ◽

2022 ◽

pp. 181-194

Author(s):

Bala Krishna Priya G. ◽

Jabeen Sultana ◽

Usha Rani M.

Keyword(s):

Machine Learning ◽

Social Media ◽

Research Work ◽

Learning Approach ◽

Fake News ◽

Learning Models ◽

Machine Learning Classifiers ◽

Proposed Model ◽

Machine Learning Approach ◽

Machine Learning Models

Mining Telugu news data and categorizing it by public sentiment is important because a great deal of fake news has emerged with the rise of social media. This research work covers identifying whether news text is positive, negative, or neutral and then classifying it into areas such as business, editorial, entertainment, nation, and sports. It proposes an efficient model that uses machine learning classifiers to classify Telugu news data. The results obtained by various machine learning models are compared, and the proposed model is observed to outperform the others in accuracy, precision, recall, and F1-score.

Development and Validation of a Machine Learning Approach for Automated Severity Assessment of COVID-19 Based on Clinical and Imaging Data: Retrospective Study

JMIR Medical Informatics ◽

10.2196/24572 ◽

2021 ◽

Vol 9 (2) ◽

pp. e24572

Author(s):

Juan Carlos Quiroz ◽

You-Zhen Feng ◽

Zhong-Yuan Cheng ◽

Dana Rezazadegan ◽

Ping-Kang Chen ◽

...

Keyword(s):

Machine Learning ◽

Predictive Power ◽

Care Delivery ◽

Learning Approach ◽

Imaging Features ◽

Severity Assessment ◽

Imaging Data ◽

Learning Models ◽

Machine Learning Approach ◽

Machine Learning Models

Background: COVID-19 has overwhelmed health systems worldwide. It is important to identify severe cases as early as possible, such that resources can be mobilized and treatment can be escalated. Objective: This study aims to develop a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data. Methods: Clinical data (including demographics, signs, symptoms, comorbidities, and blood test results) and chest computed tomography scans of 346 patients from 2 hospitals in the Hubei Province, China, were used to develop machine learning models for automated severity assessment in diagnosed COVID-19 cases. We compared the predictive power of the clinical and imaging data from multiple machine learning models and further explored the use of four oversampling methods to address the imbalanced classification issue. Features with the highest predictive power were identified using the Shapley Additive Explanations framework. Results: Imaging features had the strongest impact on the model output, while a combination of clinical and imaging features yielded the best performance overall. The identified predictive features were consistent with those reported previously. Although oversampling yielded mixed results, it achieved the best model performance in our study. Logistic regression models differentiating between mild and severe cases achieved the best performance for clinical features (area under the curve [AUC] 0.848; sensitivity 0.455; specificity 0.906), imaging features (AUC 0.926; sensitivity 0.818; specificity 0.901), and a combination of clinical and imaging features (AUC 0.950; sensitivity 0.764; specificity 0.919). The synthetic minority oversampling method further improved the performance of the model using combined features (AUC 0.960; sensitivity 0.845; specificity 0.929). Conclusions: Clinical and imaging features can be used for automated severity assessment of COVID-19 and can potentially help triage patients with COVID-19 and prioritize care delivery to those at a higher risk of severe disease.
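
The gap between sensitivity and specificity reported for the clinical-feature model is easier to read with the definitions in hand. The following is a minimal sketch of how the two figures are computed from binary labels, assuming 1 marks a severe case and 0 a mild one; the function name and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels where 1 = severe, 0 = mild."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # severe case correctly flagged
        elif truth == 1:
            fn += 1  # severe case missed
        elif pred == 0:
            tn += 1  # mild case correctly cleared
        else:
            fp += 1  # mild case wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 4 severe and 6 mild patients (illustrative only).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, preds)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

With imbalanced classes, a model can reach high specificity while missing many severe cases, which is the situation oversampling is meant to address.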

Development and Validation of a Machine Learning Approach for Automated Severity Assessment of COVID-19 Based on Clinical and Imaging Data: Retrospective Study (Preprint)

10.2196/preprints.24572 ◽

2020 ◽

Author(s):

Juan Carlos Quiroz ◽

You-Zhen Feng ◽

Zhong-Yuan Cheng ◽

Dana Rezazadegan ◽

Ping-Kang Chen ◽

...

Keyword(s):

Machine Learning ◽

Predictive Power ◽

Care Delivery ◽

Learning Approach ◽

Imaging Features ◽

Severity Assessment ◽

Imaging Data ◽

Learning Models ◽

Machine Learning Approach ◽

Machine Learning Models

Background: COVID-19 has overwhelmed health systems worldwide. It is important to identify severe cases as early as possible, such that resources can be mobilized and treatment can be escalated. Objective: This study aims to develop a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data. Methods: Clinical data (including demographics, signs, symptoms, comorbidities, and blood test results) and chest computed tomography scans of 346 patients from 2 hospitals in the Hubei Province, China, were used to develop machine learning models for automated severity assessment in diagnosed COVID-19 cases. We compared the predictive power of the clinical and imaging data from multiple machine learning models and further explored the use of four oversampling methods to address the imbalanced classification issue. Features with the highest predictive power were identified using the Shapley Additive Explanations framework. Results: Imaging features had the strongest impact on the model output, while a combination of clinical and imaging features yielded the best performance overall. The identified predictive features were consistent with those reported previously. Although oversampling yielded mixed results, it achieved the best model performance in our study. Logistic regression models differentiating between mild and severe cases achieved the best performance for clinical features (area under the curve [AUC] 0.848; sensitivity 0.455; specificity 0.906), imaging features (AUC 0.926; sensitivity 0.818; specificity 0.901), and a combination of clinical and imaging features (AUC 0.950; sensitivity 0.764; specificity 0.919). The synthetic minority oversampling method further improved the performance of the model using combined features (AUC 0.960; sensitivity 0.845; specificity 0.929). Conclusions: Clinical and imaging features can be used for automated severity assessment of COVID-19 and can potentially help triage patients with COVID-19 and prioritize care delivery to those at a higher risk of severe disease.

A Deep Learning Approach with Feature Derivation and Selection for Overdue Repayment Forecasting

Applied Sciences ◽

10.3390/app10238491 ◽

2020 ◽

Vol 10 (23) ◽

pp. 8491

Author(s):

Bin Liu ◽

Zhexi Zhang ◽

Junchi Yan ◽

Ning Zhang ◽

Hongyuan Zha ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Short Term Memory ◽

Critical Time ◽

Learning Approach ◽

Learning Models ◽

Comparison Results ◽

Online Lending ◽

Machine Learning Models

Risk control has always been a major challenge in finance. Overdue repayment is a frequently encountered discreditable behavior in online lending. Motivated by the powerful capabilities of deep neural networks, we propose a fusion deep learning approach, namely AD-MBLSTM, based on the deep neural network (DNN), multi-layer bi-directional long short-term memory (LSTM) (BiLSTM) and the attention mechanism for overdue repayment behavior forecasting according to historical repayment records. Furthermore, we present a novel feature derivation and selection method for the procedure of data preprocessing. Visualization and interpretability improvement work is also implemented to explore the critical time points and causes of overdue repayment behavior. In addition, we present a new dataset originating from a practical application scenario in online lending. We evaluate our proposed framework on the dataset and compare the performance with various general machine learning models and neural network models. Comparison results and the ablation study demonstrate that our proposed model outperforms many effective general machine learning models by a large margin, and each indispensable sub-component takes an active role.

Identification of Key Influencers for Secondary Distribution of HIV Self-Testing among Chinese MSM: A Machine Learning Approach

10.1101/2021.04.19.21255584 ◽

2021 ◽

Author(s):

Fengshi JING ◽

Yang Ye ◽

Yi Zhou ◽

Yuxin Ni ◽

Xumeng Yan ◽

...

Keyword(s):

Machine Learning ◽

Human Identification ◽

Support Vector ◽

Learning Approach ◽

Learning Models ◽

Self Testing ◽

Machine Learning Approach ◽

Secondary Distribution ◽

First Time ◽

Machine Learning Models

Background. HIV self-testing (HIVST) has been rapidly scaled up, and additional strategies further expand testing uptake. In secondary distribution, people (indexes) apply for multiple kits and pass these kits to people (alters) in their social networks. However, identifying key influencers is difficult. This study aimed to develop an innovative ensemble machine learning approach to identify key influencers among Chinese men who have sex with men (MSM) for HIVST secondary distribution. Method. We defined three types of key influencers: 1) key distributors who can distribute more kits; 2) key promoters who can contribute to finding first-time testing alters; 3) key detectors who can help to find positive alters. Four machine learning models (logistic regression, support vector machine, decision tree, random forest) were trained to identify key influencers. An ensemble learning algorithm was adopted to combine these four models. Simulation experiments were run to validate our approach. Results. 309 indexes distributed kits to 269 alters. Our approach outperformed human identification (self-reported scales cut-off), with average accuracy higher by 11.0%, and could distribute 18.2% (95%CI: 9.9%-26.5%) more kits and find 13.6% (95%CI: 1.9%-25.3%) more first-time testing alters and 12.0% (95%CI: -14.7%-38.7%) more positive-testing alters. Our approach could also increase simulated intervention efficiency by 17.7% (95%CI: -3.5%-38.8%) compared with human identification. Conclusion. We built machine learning models to identify key influencers among Chinese MSM who were more likely to engage in HIVST secondary distribution.

Direct Comparison of the Prediction of the Unbound Brain-to-Plasma Partitioning Utilizing Machine Learning Approach and Mechanistic Neuropharmacokinetic Model

The AAPS Journal ◽

10.1208/s12248-021-00604-x ◽

2021 ◽

Vol 23 (4) ◽

Author(s):

Yohei Kosugi ◽

Kunihiko Mizuno ◽

Cipriano Santos ◽

Sho Sato ◽

Natalie Hosea ◽

...

Keyword(s):

Machine Learning ◽

Multiple Drug Resistance ◽

Predictive Performance ◽

Training Dataset ◽

Multiple Drug ◽

Learning Approach ◽

Cancer Resistance ◽

Learning Models ◽

Machine Learning Approach ◽

Machine Learning Models

The mechanistic neuropharmacokinetic (neuroPK) model was established to predict unbound brain-to-plasma partitioning (Kp,uu,brain) by considering in vitro efflux activities of multiple drug resistance 1 (MDR1) and breast cancer resistance protein (BCRP). Herein, we directly compare this model to a computational machine learning approach utilizing physicochemical descriptors and efflux ratios of MDR1 and BCRP-expressing cells for predicting Kp,uu,brain in rats. Two different types of machine learning techniques, Gaussian processes (GP) and random forest regression (RF), were assessed by the time and cluster-split validation methods using 640 internal compounds. The predictivity of machine learning models based on only molecular descriptors in the time-split dataset performed worse than the cluster-split dataset, whereas the models incorporating MDR1 and BCRP efflux ratios showed similar predictivity between time and cluster-split datasets. The GP incorporating MDR1 and BCRP in the time-split dataset achieved the highest correlation (R² = 0.602). These results suggested that incorporation of MDR1 and BCRP in machine learning is beneficial for robust and accurate prediction. Kp,uu,brain prediction utilizing the neuroPK model was significantly worse compared to machine learning approaches for the same dataset. We also investigated the predictivity of Kp,uu,brain using an external independent test set of 34 marketed drugs. Compared to machine learning models, the neuroPK model showed better predictive performance with R² of 0.577. This work demonstrates that the machine learning model for Kp,uu,brain achieves maximum predictive performance within the chemical applicability domain, whereas the neuroPK model is applicable more widely beyond the chemical space covered in the training dataset.
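
The two validation schemes named in the abstract differ in what the held-out compounds have in common. The sketch below is one plausible reading, not the authors' procedure: a time split holds out the most recently registered compounds, and a cluster split holds out whole structural clusters. The record fields (`registered`, `cluster`), the function names, and the example data are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def time_split(records: List[Record], test_fraction: float = 0.2,
               date_key: str = "registered") -> Tuple[List[Record], List[Record]]:
    """Train on older compounds, test on the newest ones (ISO dates sort as text)."""
    ordered = sorted(records, key=lambda r: r[date_key])
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    return ordered[:-n_test], ordered[-n_test:]


def cluster_split(records: List[Record], test_clusters: Iterable[int],
                  cluster_key: str = "cluster") -> Tuple[List[Record], List[Record]]:
    """Hold out entire clusters so no test compound shares a cluster with training data."""
    held_out = set(test_clusters)
    train = [r for r in records if r[cluster_key] not in held_out]
    test = [r for r in records if r[cluster_key] in held_out]
    return train, test


# Hypothetical compounds with a registration date and a precomputed cluster id.
compounds = [
    {"id": "c1", "registered": "2018-03-01", "cluster": 0},
    {"id": "c2", "registered": "2019-07-15", "cluster": 1},
    {"id": "c3", "registered": "2017-11-20", "cluster": 0},
    {"id": "c4", "registered": "2020-01-05", "cluster": 2},
    {"id": "c5", "registered": "2016-05-30", "cluster": 1},
]
_, newest = time_split(compounds)
_, unseen = cluster_split(compounds, test_clusters=[1])
print([r["id"] for r in newest], [r["id"] for r in unseen])
```

A time split mimics prospective use, where later compounds may drift away from earlier chemistry, which is one possible reason descriptor-only models fared worse under it.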

Efficiencies of Feature Engineering in the Machine Learning approach for Fake News Classification

10.20944/preprints202111.0024.v1 ◽

2021 ◽

Author(s):

Katrin Donetski

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Global Network ◽

Feature Engineering ◽

Learning Approach ◽

Fake News ◽

Learning Models ◽

Substantial Impact ◽

Machine Learning Approach ◽

Classification Tasks

The rapid infiltration of fake news is a flaw in the otherwise valuable internet, a virtually global network that allows for the simultaneous exchange of information. While a common, and normally effective, approach to such classification tasks is designing a deep learning-based model, the subjectivity behind the writing and production of misleading news invalidates this technique. Deep learning models are unexplainable in nature, which makes contextualizing their results impossible because they lack the explicit features used in traditional machine learning. This paper emphasizes the need for feature engineering to effectively address this problem: containing the spread of fake news at the source, not after it has become globally prevalent. Insights from extracted features were used to manipulate the text, which was then tested on deep learning models. The previously unknown yet substantial impact that the original features had on deep learning models was successfully depicted in this study.

Machine Learning-Based Malicious X.509 Certificates’ Detection

Applied Sciences ◽

10.3390/app11052164 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2164

Author(s):

Jiaxin Li ◽

Zhaoxin Zhang ◽

Changyong Guo

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Ensemble Learning ◽

Traffic Analysis ◽

Learning Models ◽

Detection Model ◽

Analysis Tools ◽

Average Accuracy ◽

Machine Learning Models

X.509 certificates play an important role in encrypting the transmission of data on both sides under HTTPS. With the popularization of X.509 certificates, more and more criminals leverage certificates to prevent their communications from being exposed by malicious traffic analysis tools. Phishing sites and malware are good examples. Those X.509 certificates found in phishing sites or malware are called malicious X.509 certificates. This paper applies different machine learning models, including classical machine learning models, ensemble learning models, and deep learning models, to distinguish between malicious certificates and benign certificates with Verification for Extraction (VFE). The VFE is a system we design and implement for obtaining plentiful characteristics of certificates. The result shows that ensemble learning models are the most stable and efficient models with an average accuracy of 95.9%, which outperforms many previous works. In addition, we obtain an SVM-based detection model with an accuracy of 98.2%, which is the highest accuracy. The outcome indicates the VFE is capable of capturing essential and crucial characteristics of malicious X.509 certificates.

A Physics-Infused Deep Learning Model for the Prediction of Refractive Indices and Its Use for the Large-Scale Screening of Organic Compound Space

10.26434/chemrxiv.8796950 ◽

2019 ◽

Author(s):

Mojtaba Haghighatlari ◽

Gaurav Vishwakarma ◽

Mohammad Atif Faiz Afzal ◽

Johannes Hachmann

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Large Scale ◽

Organic Molecules ◽

Learning Model ◽

Training Data ◽

Refractive Indices ◽

Learning Models ◽

Deep Learning Model ◽

Machine Learning Models

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high-RI for applications in opto-electronics.
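
The abstract does not name the ranking metric it borrows from web search. One widely used candidate is normalized discounted cumulative gain (NDCG), sketched here under the assumption that the true RI value serves directly as the gain; the function names and the toy values are illustrative only and may not match what the authors used.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: gains at lower ranks are discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(true_values: Sequence[float], predicted: Sequence[float], k: int) -> float:
    """Score how well the predicted ordering recovers the top-k items by true value."""
    # Rank items by predicted score, highest first, and read off their true values.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    ranked_gains = [true_values[i] for i in order[:k]]
    # The ideal ordering sorts by the true values themselves.
    ideal_gains = sorted(true_values, reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Toy refractive indices: the predictions are off in value but preserve the ranking.
true_ri = [1.5, 1.9, 1.7, 1.6]
pred_ri = [1.4, 1.8, 1.75, 1.5]
print(ndcg_at_k(true_ri, pred_ri, k=3))
```

A rank-based score of this kind rewards placing the highest-RI molecules first even when the predicted values themselves are biased, which matches the stated goal of ranking top candidates.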

Application of Bioactivity Profile Based Fingerprints for Building Machine Learning Models

10.26434/chemrxiv.6969584 ◽

2018 ◽

Cited By ~ 1

Author(s):

Noé Sturm ◽

Jiangming Sun ◽

Yves Vandriessche ◽

Andreas Mayr ◽

Günter Klambauer ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

High Throughput ◽

Scaffold Hopping ◽

Learning Models ◽

Industrial Data ◽

Structural Descriptors ◽

Bioactivity Profile ◽

Machine Learning Models

This article describes an application of high-throughput fingerprints (HTSFP) built upon industrial data accumulated over the years. The fingerprint was used to build machine learning models (multi-task deep learning + SVM) for compound activity predictions towards a panel of 131 targets. Quality of the predictions and the scaffold hopping potential of the HTSFP were systematically compared to traditional structural descriptors ECFP.
