Machine Learning Data Augmentation as a Tool to Enhance Quantitative Composition–Activity Relationships of Complex Mixtures. A New Application to Dissect the Role of Main Chemical Components in Bioactive Essential Oils

AbstractFour species of grass generate half of all human-consumed calories1. However, abundant biological data on species that produce our food remains largely inaccessible, imposing direct barriers to understanding crop yield and fitness traits. Here, we assemble and analyse a continent-wide database of field experiments spanning ten years and hundreds of thousands of machine-phenotyped populations of ten major crop species. Training an ensemble of machine learning models, using thousands of variables capturing weather, ground-sensor, soil, chemical and fertiliser dosage, management, and satellite data, produces robust cross-continent yield models exceeding R2 = 0.8 prediction accuracy. In contrast to ‘black box’ analytics, detailed interrogation of these models reveals fundamental drivers of crop behaviour and complex interactions predicting yield and agronomic traits. These results demonstrate the capacity of machine learning models to build unified, interpretable, and explainable models of crop behaviour, and highlight the powerful role of data in the future of food.

Download Full-text

Artificial Intelligence: A New Tool in Operating Room Management. Role of Machine Learning Models in Operating Room Optimization

Journal of Medical Systems ◽

10.1007/s10916-019-1512-1 ◽

2019 ◽

Vol 44 (1) ◽

Cited By ~ 2

Author(s):

Valentina Bellini ◽

Marco Guzzon ◽

Barbara Bigliardi ◽

Monica Mordonini ◽

Serena Filippelli ◽

...

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Operating Room ◽

Operating Room Management ◽

Learning Models ◽

Machine Learning Models

Download Full-text

Counterfactual Examples for Data Augmentation: A Case Study

The International FLAIRS Conference Proceedings ◽

10.32473/flairs.v34i1.128503 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Md Golam Moula Mehedi Hasan ◽

Douglas A. Talbert

Keyword(s):

Machine Learning ◽

Potential Application ◽

Data Augmentation ◽

Generative Adversarial Networks ◽

Application Area ◽

Learning Models ◽

Adversarial Networks ◽

Feature Values ◽

Machine Learning Models

Counterfactual explanations are gaining in popularity as a way of explaining machine learning models. Counterfactual examples are generally created to help interpret the decision of a model. In this case, if a model makes a certain decision for an instance, the counterfactual examples of that instance reverse the decision of the model. The counterfactual examples can be created by craftily changing particular feature values of the instance. Though counterfactual examples are generated to explain the decision of machine learning models, in this work, we explore another potential application area of counterfactual examples, whether counterfactual examples are useful for data augmentation. We demonstrate the efficacy of this approach on the widely used “Adult-Income” dataset. We consider several scenarios where we do not have enough data and use counterfactual examples to augment the dataset. We compare our approach with Generative Adversarial Networks approach for dataset augmentation. The experimental results show that our proposed approach can be an effective way to augment a dataset.

Download Full-text

Assessing the performance of a suite of machine learning models for daily river water temperature prediction

PeerJ ◽

10.7717/peerj.7065 ◽

2019 ◽

Vol 7 ◽

pp. e7065 ◽

Cited By ~ 9

Author(s):

Senlin Zhu ◽

Emmanuel Karlo Nyarko ◽

Marijana Hadzima-Nyarko ◽

Salim Heddam ◽

Shiqiang Wu

Keyword(s):

Machine Learning ◽

Water Temperature ◽

River Water ◽

Minor Role ◽

Learning Models ◽

Temperature Prediction ◽

Flow Discharge ◽

River Water Temperature ◽

Machine Learning Models

In this study, different versions of feedforward neural network (FFNN), Gaussian process regression (GPR), and decision tree (DT) models were developed to estimate daily river water temperature using air temperature (Ta), flow discharge (Q), and the day of year (DOY) as predictors. The proposed models were assessed using observed data from eight river stations, and modelling results were compared with the air2stream model. Model performances were evaluated using four indicators in this study: the coefficient of correlation (R), the Willmott index of agreement (d), the root mean squared error (RMSE), and the mean absolute error (MAE). Results indicated that the three machine learning models had similar performance when only Ta was used as the predictor. When the day of year was included as model input, the performances of the three machine learning models dramatically improved. Including flow discharge instead of day of year, as an additional predictor, provided a lower gain in model accuracy, thereby showing the relatively minor role of flow discharge in river water temperature prediction. However, an increase in the relative importance of flow discharge was noticed for stations with high altitude catchments (Rhône, Dischmabach and Cedar) which are influenced by cold water releases from hydropower or snow melting, suggesting the dependence of the role of flow discharge on the hydrological characteristics of such rivers. The air2stream model outperformed the three machine learning models for most of the studied rivers except for the cases where including flow discharge as a predictor provided the highest benefits. The DT model outperformed the FFNN and GPR models in the calibration phase, however in the validation phase, its performance slightly decreased. In general, the FFNN model performed slightly better than GPR model. In summary, the overall modelling results showed that the three machine learning models performed well for river water temperature modelling.

Download Full-text

Detecting Arsenic Contamination Using Satellite Imagery and Machine Learning

Toxics ◽

10.3390/toxics9120333 ◽

2021 ◽

Vol 9 (12) ◽

pp. 333

Author(s):

Ayush Agrawal ◽

Mark R. Petersen

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Mean Squared Error ◽

Binary Classification ◽

Arsenic Concentration ◽

Arsenic Contamination ◽

Hyperspectral Data ◽

Detection Methods ◽

Learning Models ◽

Machine Learning Models

Arsenic, a potent carcinogen and neurotoxin, affects over 200 million people globally. Current detection methods are laborious, expensive, and unscalable, being difficult to implement in developing regions and during crises such as COVID-19. This study attempts to determine if a relationship exists between soil’s hyperspectral data and arsenic concentration using NASA’s Hyperion satellite. It is the first arsenic study to use satellite-based hyperspectral data and apply a classification approach. Four regression machine learning models are tested to determine this correlation in soil with bare land cover. Raw data are converted to reflectance, problematic atmospheric influences are removed, characteristic wavelengths are selected, and four noise reduction algorithms are tested. The combination of data augmentation, Genetic Algorithm, Second Derivative Transformation, and Random Forest regression (R2=0.840 and normalized root mean squared error (re-scaled to [0,1]) = 0.122) shows strong correlation, performing better than past models despite using noisier satellite data (versus lab-processed samples). Three binary classification machine learning models are then applied to identify high-risk shrub-covered regions in ten U.S. states, achieving strong accuracy (=0.693) and F1-score (=0.728). Overall, these results suggest that such a methodology is practical and can provide a sustainable alternative to arsenic contamination detection.

Download Full-text

Data Augmentation and Pretraining for Template-Based Retrosynthetic Prediction in Computer-Aided Synthesis Planning

10.26434/chemrxiv.11811564.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Michael Fortunato ◽

Connor W. Coley ◽

Brian Barnes ◽

Klavs F. Jensen

Keyword(s):

Neural Network ◽

Machine Learning ◽

Data Augmentation ◽

Machine Learning Algorithms ◽

Learning Models ◽

The Neural Network ◽

Computer Aided ◽

Synthesis Planning ◽

The One ◽

Machine Learning Models

This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to perform the task of prioritizing reaction templates or molecular transformations are focused on reporting high accuracy metrics for the one-to-one mapping of product molecules in reaction databases to the template extracted from the recorded reaction. The available templates that get selected for inclusion in these machine learning models have been previously limited to those that appear frequently in the reaction databases and exclude potentially useful transformations. By augmenting open-access datasets of organic reactions with artificially calculated template applicability and pretraining a template relevance neural network on this augmented applicability dataset, we report an increase in the template applicability recall and an increase in the diversity of predicted precursors. The augmentation and pretraining effectively teaches the neural network an increased set of templates that could theoretically lead to successful reactions for a given target. Even on a small dataset of well curated reactions, the data augmentation and pretraining methods resulted in an increase in top-1 accuracy, especially for rare templates, indicating these strategies can be very useful for small datasets.

Download Full-text

Machine learning for prediction of immunotherapy efficacy in non-small cell lung cancer from simple clinical and biological data

10.1101/2021.11.30.21267064 ◽

2021 ◽

Author(s):

Sébastien Benzekry ◽

Mathieu Grangeon ◽

Mélanie Karlsen ◽

Maria Alexa ◽

Isabella Bicalho-Frazeto ◽

...

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Random Forest ◽

Predictive Value ◽

Performance Status ◽

Small Cell ◽

Biological Data ◽

Learning Models ◽

Small Cell Lung ◽

Machine Learning Models

ABSTRACTBackgroundImmune checkpoint inhibitors (ICIs) are now a therapeutic standard in advanced non-small cell lung cancer (NSCLC), but strong predictive markers for ICIs efficacy are still lacking. We evaluated machine learning models built on simple clinical and biological data to individually predict response to ICIs.MethodsPatients with metastatic NSCLC who received ICI in second line or later were included. We collected clinical and hematological data and studied the association of this data with disease control rate (DCR), progression free survival (PFS) and overall survival (OS). Multiple machine learning (ML) algorithms were assessed for their ability to predict response.ResultsOverall, 298 patients were enrolled. The overall response rate and DCR were 15.3 % and 53%, respectively. Median PFS and OS were 3.3 and 11.4 months, respectively. In multivariable analysis, DCR was significantly associated with performance status (PS) and hemoglobin level (OR 0.58, p<0.0001; OR 1.8, p<0.001). These variables were also associated with PFS and OS and ranked top in random forest-based feature importance. Neutrophils-to-lymphocytes ratio was also associated with DCR, PFS and OS. The best ML algorithm was a random forest. It could predict DCR with satisfactory efficacy based on these three variables. Ten-fold cross-validated performances were: accuracy 0.68 ± 0.04, sensitivity 0.58 ± 0.08; specificity 0.78 ± 0.06; positive predictive value 0.70 ± 0.08; negative predictive value 0.68 ± 0.06; AUC 0.74 ± 0.03.ConclusionCombination of simple clinical and biological data could accurately predict disease control rate at the individual level.Highlights-Machine learning applied to a large set of NSCLC patients could predict efficacy of immunotherapy with a 69% accuracy using simple routine data-Hemoglobin levels and performance status were the strongest predictors and significantly associated with DCR, PFS and OS-Neutrophils-to-lymphocyte ratio was also associated with outcome-Benchmark of 8 machine learning models

Download Full-text

DeepHE: Accurately Predicting Human Essential Genes based on Deep Learning

10.1101/2020.02.14.950048 ◽

2020 ◽

Cited By ~ 2

Author(s):

Xue Zhang ◽

Wangxin Xiao ◽

Weijia Xiao

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Essential Gene ◽

Gene Prediction ◽

Biological Data ◽

Essential Genes ◽

Learning Models ◽

Imbalanced Learning ◽

Ppi Network ◽

Machine Learning Models

AbstractMotivationAccurately predicting essential genes using computational methods can greatly reduce the effort in finding them via wet experiments at both time and resource scales, and further accelerate the process of drug discovery. Several computational methods have been proposed for predicting essential genes in model organisms by integrating multiple biological data sources either via centrality measures or machine learning based methods. However, the methods aiming to predict human essential genes are still limited and the performance still need improve. In addition, most of the machine learning based essential gene prediction methods are lack of skills to handle the imbalanced learning issue inherent in the essential gene prediction problem, which might be one factor affecting their performance.ResultsWe proposed a deep learning based method, DeepHE, to predict human essential genes by integrating features derived from sequence data and protein-protein interaction (PPI) network. A deep learning based network embedding method was utilized to automatically learn features from PPI network. In addition, 89 sequence features were derived from DNA sequence and protein sequence for each gene. These two types of features were integrated to train a multilayer neural network. A cost-sensitive technique was used to address the imbalanced learning problem when training the deep neural network. The experimental results for predicting human essential genes showed that our proposed method, DeepHE, can accurately predict human gene essentiality with an average AUC higher than 94%, the area under precision-recall curve (AP) higher than 90%, and the accuracy higher than 90%. We also compared DeepHE with several widely used traditional machine learning models (SVM, Naïve Bayes, Random Forest, Adaboost). The experimental results showed that DeepHE greatly outperformed the compared machine learning models.ConclusionsWe demonstrated that human essential genes can be accurately predicted by designing effective machine learning algorithm and integrating representative features captured from available biological data. The proposed deep learning framework is effective for such task.Availability and ImplementationThe python code will be freely available upon the acceptance of this manuscript at https://github.com/xzhang2016/[email protected]

Download Full-text

Data Augmentation and Pretraining for Template-Based Retrosynthetic Prediction in Computer-Aided Synthesis Planning

10.26434/chemrxiv.11811564 ◽

2020 ◽

Author(s):

Michael Fortunato ◽

Connor W. Coley ◽

Brian Barnes ◽

Klavs F. Jensen

Keyword(s):

Neural Network ◽

Machine Learning ◽

Data Augmentation ◽

Machine Learning Algorithms ◽

Learning Models ◽

The Neural Network ◽

Computer Aided ◽

Synthesis Planning ◽

The One ◽

Machine Learning Models

This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to perform the task of prioritizing reaction templates or molecular transformations are focused on reporting high accuracy metrics for the one-to-one mapping of product molecules in reaction databases to the template extracted from the recorded reaction. The available templates that get selected for inclusion in these machine learning models have been previously limited to those that appear frequently in the reaction databases and exclude potentially useful transformations. By augmenting open-access datasets of organic reactions with artificially calculated template applicability and pretraining a template relevance neural network on this augmented applicability dataset, we report an increase in the template applicability recall and an increase in the diversity of predicted precursors. The augmentation and pretraining effectively teaches the neural network an increased set of templates that could theoretically lead to successful reactions for a given target. Even on a small dataset of well curated reactions, the data augmentation and pretraining methods resulted in an increase in top-1 accuracy, especially for rare templates, indicating these strategies can be very useful for small datasets.

Download Full-text

Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction

Briefings in Bioinformatics ◽

10.1093/bib/bbab127 ◽

2021 ◽

Author(s):

Xiang Liu ◽

Huitao Feng ◽

Jie Wu ◽

Kelin Xia

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Binding Affinity ◽

Molecular Descriptors ◽

Biological Data ◽

Learning Models ◽

Filtration Process ◽

Binding Affinity Prediction ◽

Affinity Prediction ◽

Machine Learning Models

Abstract Molecular descriptors are essential to not only quantitative structure activity/property relationship (QSAR/QSPR) models, but also machine learning based chemical and biological data analysis. In this paper, we propose persistent spectral hypergraph (PSH) based molecular descriptors or fingerprints for the first time. Our PSH-based molecular descriptors are used in the characterization of molecular structures and interactions, and further combined with machine learning models, in particular gradient boosting tree (GBT), for protein-ligand binding affinity prediction. Different from traditional molecular descriptors, which are usually based on molecular graph models, a hypergraph-based topological representation is proposed for protein–ligand interaction characterization. Moreover, a filtration process is introduced to generate a series of nested hypergraphs in different scales. For each of these hypergraphs, its eigen spectrum information can be obtained from the corresponding (Hodge) Laplacain matrix. PSH studies the persistence and variation of the eigen spectrum of the nested hypergraphs during the filtration process. Molecular descriptors or fingerprints can be generated from persistent attributes, which are statistical or combinatorial functions of PSH, and combined with machine learning models, in particular, GBT. We test our PSH-GBT model on three most commonly used datasets, including PDBbind-2007, PDBbind-2013 and PDBbind-2016. Our results, for all these databases, are better than all existing machine learning models with traditional molecular descriptors, as far as we know.

Download Full-text