BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules

Mapping Intimacies ◽

10.1101/426692 ◽

2018 ◽

Author(s):

Rudraksh Tuwani ◽

Somin Wadhwa ◽

Ganesh Bagler

Keyword(s):

Machine Learning ◽

Molecular Descriptors ◽

Wide Spectrum ◽

Predictive Performance ◽

Sweet Taste ◽

Future Research ◽

Gustatory System ◽

Learning Models ◽

Integrative Framework ◽

Machine Learning Models

The dichotomy of sweet and bitter tastes is a salient evolutionary feature of the human gustatory system, with an innate attraction to sweet taste and aversion to bitterness. A better understanding of the molecular correlates of the bitter-sweet taste gradient is crucial for identifying natural as well as synthetic compounds of desirable taste on this axis. While previous studies have advanced our understanding of the molecular basis of bitter-sweet taste and contributed models for their identification, there is ample scope to enhance these models through meticulous compilation of bitter-sweet molecules and use of a wide spectrum of molecular descriptors. Towards these goals, and based on structured data compilation, our study provides an integrative framework with state-of-the-art machine learning models for bitter-sweet taste prediction (BitterSweet). We compare different sets of molecular descriptors for their predictive performance and further identify important features as well as feature blocks. The utility of BitterSweet models is demonstrated by taste prediction on large specialized chemical sets such as FlavorDB, FooDB, SuperSweet, Super Natural II, DSSTox, and DrugBank. To facilitate future research in this direction, we make all datasets and BitterSweet models publicly available, and also present end-to-end software for bitter-sweet taste prediction based on freely available chemical descriptors.


Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2

Current Medicinal Chemistry ◽

10.2174/2213275912666191102162959 ◽

2020 ◽

Vol 28 (2) ◽

pp. 253-265 ◽

Cited By ~ 3

Author(s):

Gabriela Bitencourt-Ferreira ◽

Amauri Duarte da Silva ◽

Walter Filgueira de Azevedo

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Cyclin Dependent Kinase ◽

Learning Models ◽

Learning Techniques ◽

Machine Learning Models

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


Temporal and Spatial Autocorrelation as Determinants of Regional AOD-PM2.5 Model Performance in the Middle East

Remote Sensing ◽

10.3390/rs13183790 ◽

2021 ◽

Vol 13 (18) ◽

pp. 3790

Author(s):

Khang Chau ◽

Meredith Franklin ◽

Huikyo Lee ◽

Michael Garay ◽

Olga Kalashnikova

Keyword(s):

Machine Learning ◽

Middle East ◽

United Arab Emirates ◽

Atmospheric Correction ◽

Predictive Performance ◽

Variable Importance ◽

Learning Models ◽

Median Test ◽

Temporal And Spatial ◽

Machine Learning Models

Exposure to fine particulate matter (PM2.5) air pollution has been shown in numerous studies to be associated with detrimental health effects. However, the ability to conduct epidemiological assessments can be limited due to challenges in generating reliable PM2.5 estimates, particularly in parts of the world such as the Middle East where measurements are scarce and extreme meteorological events such as sandstorms are frequent. In order to supplement exposure modeling efforts under such conditions, satellite-retrieved aerosol optical depth (AOD) has proven to be useful due to its global coverage. By using AODs from the Multiangle Implementation of Atmospheric Correction (MAIAC) of the MODerate Resolution Imaging Spectroradiometer (MODIS) and the Multiangle Imaging Spectroradiometer (MISR) combined with meteorological and assimilated aerosol information from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), we constructed machine learning models to predict PM2.5 in the area surrounding the Persian Gulf, including Kuwait, Bahrain, and the United Arab Emirates (U.A.E.). Our models showed regional differences in predictive performance, with better results in the U.A.E. (median test R² = 0.66) than Kuwait (median test R² = 0.51). Variable importance also differed by region, where satellite-retrieved AOD variables were more important for predicting PM2.5 in Kuwait than in the U.A.E. Divergent trends in the temporal and spatial autocorrelations of PM2.5 and AOD in the two regions offered possible explanations for differences in predictive performance and variable importance. In a test of model transferability, we found that models trained in one region and applied to another did not predict PM2.5 well, even if the transferred model had better performance. Overall, the results of our study suggest that models developed over large geographic areas could generate PM2.5 estimates with greater uncertainty than could be obtained by taking a regional modeling approach. Furthermore, development of methods to better incorporate spatial and temporal autocorrelations in machine learning models warrants further examination.


A predictive performance comparison of machine learning models for judicial cases

2017 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci.2017.8285436 ◽

2017 ◽

Cited By ~ 4

Author(s):

Zhenyu Liu ◽

Huanhuan Chen

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

Performance Comparison ◽

Learning Models ◽

Machine Learning Models


Development of Combined Heavy Rain Damage Prediction Models with Machine Learning

Water ◽

10.3390/w11122516 ◽

2019 ◽

Vol 11 (12) ◽

pp. 2516 ◽

Cited By ~ 1

Author(s):

Changhyun Choi ◽

Jeonghwan Kim ◽

Jungwook Kim ◽

Hung Soo Kim

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Prediction Model ◽

Prediction Models ◽

Predictive Performance ◽

Heavy Rain ◽

Learning Models ◽

Damage Prediction ◽

Natural Disaster Management ◽

Machine Learning Models

Adequate forecasting and preparation for heavy rain can minimize life and property damage. Some studies have been conducted on the heavy rain damage prediction model (HDPM); however, most of these models are limited to linear regression, which captures only the linear relation between rainfall data and damage. This study develops the combined heavy rain damage prediction model (CHDPM), in which a residual prediction model (RPM) is added to the HDPM. The predictive performance of the CHDPM is found to be 4–14% higher than that of the HDPM. This confirms that the predictive performance of the model is improved by combining the HDPM with a machine learning RPM that compensates for the HDPM's linearity. The results of this study can be used as basic data beneficial for natural disaster management.
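The idea of adding a residual prediction model to a linear baseline can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-neighbour residual model, the function names, and the toy rainfall and damage numbers are all assumptions chosen only to show the structure "linear prediction plus predicted residual". The errors are measured in-sample, so the improvement shown is illustrative rather than evidence of out-of-sample gains.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_residual_knn(xs, residuals, k=2):
    """Hypothetical residual model: mean residual of the k nearest training points."""
    pairs = list(zip(xs, residuals))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean([r for _, r in nearest])
    return predict

def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return mean([(p - o) ** 2 for p, o in zip(pred, obs)]) ** 0.5

# Invented toy data: damage grows nonlinearly with rainfall.
rain = [10, 20, 30, 40, 50, 60, 70, 80]
damage = [1, 2, 4, 7, 12, 20, 33, 54]

a, b = fit_line(rain, damage)                       # linear baseline
baseline = [a + b * x for x in rain]
residuals = [y - p for y, p in zip(damage, baseline)]
rpm = fit_residual_knn(rain, residuals, k=2)        # residual model
combined = [p + rpm(x) for p, x in zip(baseline, rain)]

baseline_rmse = rmse(baseline, damage)
combined_rmse = rmse(combined, damage)
print(f"baseline RMSE: {baseline_rmse:.2f}, combined RMSE: {combined_rmse:.2f}")
```

In practice the residual model would be evaluated on held-out events, since a flexible residual learner can easily overfit the training residuals.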


A Universal Screening Tool for Dyslexia by a Web-Game and Machine Learning

Frontiers in Computer Science ◽

10.3389/fcomp.2021.628634 ◽

2022 ◽

Vol 3 ◽

Author(s):

Maria Rauschenberger ◽

Ricardo Baeza-Yates ◽

Luz Rello

Keyword(s):

Machine Learning ◽

Early Intervention ◽

Universal Screening ◽

User Study ◽

Future Research ◽

Learning Models ◽

Early Screening ◽

Web Based ◽

Negative Side Effects ◽

Machine Learning Models

Children with dyslexia have difficulties learning how to read and write. They are often diagnosed after they fail school, even though dyslexia is not related to general intelligence. Early screening of dyslexia can prevent the negative side effects of late detection and enables early intervention. In this context, we present an approach for universal screening of dyslexia using machine learning models with data gathered from a web-based language-independent game. We designed the game content taking into consideration the analysis of mistakes of people with dyslexia in different languages and other parameters related to dyslexia, such as auditory and visual perception. We conducted a user study with 313 children (116 with dyslexia) and trained predictive machine learning models with the collected data. Our method yields an accuracy of 0.74 for German and 0.69 for Spanish as well as an F1-score of 0.75 for German and 0.75 for Spanish, using Random Forests and Extra Trees, respectively. We also present the game content design, potential new auditory input, and knowledge about the design approach for future research to explore universal screening of dyslexia. Universal screening with language-independent content can be used for the screening of pre-readers who do not have any language skills, facilitating a potential early intervention.
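Accuracy and F1-score, the two metrics reported above, can be computed from binary labels as in the following sketch. The labels and predictions are invented for illustration and are unrelated to the study's data; the function names are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: 1 = screened positive, 0 = screened negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Reporting both is useful when classes are imbalanced, as in a sample with 116 positive cases out of 313, because accuracy alone can look acceptable even when the positive class is poorly detected.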


Risk Assessment in Energy Infrastructure Installations by Horizontal Directional Drilling Using Machine Learning

Energies ◽

10.3390/en14020289 ◽

2021 ◽

Vol 14 (2) ◽

pp. 289

Author(s):

Maria Krechowicz ◽

Adam Krechowicz

Keyword(s):

Machine Learning ◽

Risk Assessment ◽

Assessment Process ◽

Future Research ◽

Directional Drilling ◽

Ann Model ◽

Learning Models ◽

Horizontal Directional Drilling ◽

Risk Assessment Process ◽

Machine Learning Models

There is a growing demand for installations of new gas pipelines in Europe. A large number of them are installed using trenchless Horizontal Directional Drilling (HDD) technology. The aim of this work was to develop and compare new machine learning models dedicated to risk assessment in HDD projects. Data from 133 HDD projects in eight countries were gathered, profiled, and preprocessed. Three machine learning models, logistic regression, random forests, and an Artificial Neural Network (ANN), were developed to predict the overall HDD project outcome (failure-free installation or installation likely to fail) and the occurrence of identified unwanted events. The best performance in terms of recall and accuracy was achieved by the developed ANN model, which proved to be efficient, fast, and robust in predicting risks in HDD projects. Applying machine learning in the proposed models eliminated the need to involve a group of experts in the risk assessment process and therefore significantly lowered the associated costs. Future research may be oriented towards developing a comprehensive risk management system, which will enable dynamic risk assessment taking into account various combinations of risk mitigation actions.


Machine Learning to Predict 10-year Cardiovascular Mortality from the Electrocardiogram: Analysis of the Third National Health and Nutrition Examination Survey (NHANES III)

10.1101/2021.09.09.21263327 ◽

2021 ◽

Author(s):

Chang H Kim ◽

Sadeer Al-Kindi ◽

Yasir Tarabichi ◽

Suril Gohel ◽

Riddhi Vyas ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Mortality ◽

Predictive Performance ◽

Nhanes Iii ◽

Nutrition Examination Survey ◽

Learning Models ◽

The Third ◽

Health And Nutrition ◽

Machine Learning Models ◽

Ecg Data

Background: The value of the electrocardiogram (ECG) for predicting long-term cardiovascular outcomes is not well defined. Machine learning methods are well suited for analysis of highly correlated data such as that from the ECG. Methods: Using demographic, clinical, and 12-lead ECG data from the Third National Health and Nutrition Examination Survey (NHANES III), machine learning models were trained to predict 10-year cardiovascular mortality in ambulatory U.S. adults. Predictive performance of each model was assessed using area under receiver operating characteristic curve (AUROC), area under precision-recall curve (AUPRC), sensitivity, and specificity. These were compared to the 2013 American College of Cardiology/American Heart Association Pooled Cohort Equations (PCE). Results: 7,067 study participants (mean age: 59.2 +/- 13.4 years, female: 52.5%, white: 73.9%, black: 23.3%) were included. At 10 years of follow up, 338 (4.8%) had died from cardiac causes. Compared to the PCE (AUROC: 0.668, AUPRC: 0.125, sensitivity: 0.492, specificity: 0.859), machine learning models only required demographic and ECG data to achieve comparable performance: logistic regression (AUROC: 0.754, AUPRC: 0.141, sensitivity: 0.747, specificity: 0.759), neural network (AUROC: 0.764, AUPRC: 0.149, sensitivity: 0.722, specificity: 0.787), and ensemble model (AUROC: 0.695, AUPRC: 0.166, sensitivity: 0.468, specificity: 0.912). Additional clinical data did not improve the predictive performance of machine learning models. In variable importance analysis, important ECG features clustered in inferior and lateral leads. Conclusions: Machine learning can be applied to demographic and ECG data to predict 10-year cardiovascular mortality in ambulatory adults, with potentially important implications for primary prevention.
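The metrics named in this abstract can be sketched in a few lines. The example below computes AUROC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half), and sensitivity and specificity at a fixed threshold. The scores, labels, function names, and the threshold of 0.5 are invented for illustration and are not taken from the study.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(y_true, scores, threshold):
    """Classify score >= threshold as positive and return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Invented example: 1 = event within follow-up, scores are model risk estimates.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

auc = auroc(y_true, scores)
sens, spec = sensitivity_specificity(y_true, scores, threshold=0.5)
print(f"AUROC={auc:.3f}, sensitivity={sens:.3f}, specificity={spec:.3f}")
```

AUROC is threshold-free, whereas sensitivity and specificity depend on the chosen cut-off, which is why two models with similar AUROC can report quite different sensitivity and specificity.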


Robust Performance of Potentially Functional SNPs in Machine Learning Models for the Prediction of Atorvastatin-Induced Myalgia

Frontiers in Pharmacology ◽

10.3389/fphar.2021.605764 ◽

2021 ◽

Vol 12 ◽

Author(s):

Brandon N. S. Ooi ◽

Raechell ◽

Ariel F. Ying ◽

Yong Zher Koh ◽

Yu Jin ◽

...

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

Whole Genome ◽

Learning Models ◽

Association Analyses ◽

Functional Snps ◽

Individual Snps ◽

Genome Association ◽

Whole Genome Association ◽

Machine Learning Models

Background: Statins can cause muscle symptoms resulting in poor adherence to therapy and increased cardiovascular risk. We hypothesize that combinations of potentially functional SNPs (pfSNPs), rather than individual SNPs, better predict myalgia in patients on atorvastatin. This study assesses the value of potentially functional single nucleotide polymorphisms (pfSNPs) and employs six machine learning algorithms to identify the combination of SNPs that best predicts myalgia. Methods: Whole genome sequencing of 183 Chinese, Malay and Indian patients from Singapore was conducted to identify genetic variants associated with atorvastatin-induced myalgia. To adjust for confounding factors, demographic and clinical characteristics were also examined for their association with myalgia. The top factor, sex, was then used as a covariate in the whole genome association analyses. Variants that were highly associated with myalgia from this and previous studies were extracted, assessed for potential functionality (pfSNPs) and incorporated into six machine learning models. The predictive performance of different combinations of models and inputs was compared using the average cross-validation area under the ROC curve (AUC). We then determined the minimum combination of SNPs that achieves maximum sensitivity and specificity, as measured by AUC, in predicting atorvastatin-induced myalgia in most, if not all, of the six machine learning models. Results: Through whole genome association analyses using sex as a covariate, a larger proportion of pfSNPs compared to non-pf SNPs were found to be highly associated with myalgia. Although none of the individual SNPs achieved genome-wide significance in univariate analyses, machine learning models identified a combination of 15 SNPs that predict myalgia with good predictive performance (AUC >0.9). SNPs within genes identified in this study significantly outperformed SNPs within genes previously reported to be associated with myalgia. pfSNPs were found to be more robust in predicting myalgia, outperforming non-pf SNPs in the majority of machine learning models tested. Conclusion: Combinations of pfSNPs that were consistently identified by different machine learning models to have high predictive performance have good potential to be clinically useful for predicting atorvastatin-induced myalgia once validated against an independent cohort of patients.


An Attempt to Boost Molecular Descriptors with Quantum-Derived Features in Prediction of Maximum Emission Wavelengths of Chromophores

10.26434/chemrxiv.14534136.v1 ◽

2021 ◽

Author(s):

Bartłomiej Fliszkiewicz

Keyword(s):

Machine Learning ◽

Experimental Data ◽

Optical Properties ◽

Molecular Descriptors ◽

Gradient Boosting ◽

Learning Models ◽

Linear Gradient ◽

Maximum Emission ◽

Improve Accuracy ◽

Machine Learning Models

This research assesses the capability of machine learning to predict the maximum emission wavelength of organic compounds. The predictions are based on structure descriptors and fingerprints widely applied in cheminformatics. In an attempt to further improve accuracy, the developed machine learning models were enriched with features derived from quantum mechanics. Multiple linear, gradient boosting, and random forest regressions were applied. The models were trained and tested on a database of experimental optical-property data.


Empirical asset pricing via machine learning: evidence from the European stock market

Journal of Asset Management ◽

10.1057/s41260-021-00237-x ◽

2021 ◽

Author(s):

Wolfgang Drobetz ◽

Tizian Otto

Keyword(s):

Machine Learning ◽

Stock Returns ◽

Network Architecture ◽

Risk Measures ◽

Predictive Performance ◽

Support Vector ◽

Learning Models ◽

Learning Methods ◽

Machine Learning Methods ◽

Machine Learning Models

This paper evaluates the predictive performance of machine learning methods in forecasting European stock returns. Compared to a linear benchmark model, interactions and nonlinear effects help improve the predictive performance. But machine learning models must be adequately trained and tuned to overcome the high dimensionality problem and to avoid overfitting. Across all machine learning methods, the most important predictors are based on price trends and fundamental signals from valuation ratios. However, the models exhibit substantial variation in statistical predictive performance that translates into pronounced differences in economic profitability. The return and risk measures of long-only trading strategies indicate that machine learning models produce sizeable gains relative to our benchmark. Neural networks perform best, even after accounting for transaction costs. A classification-based portfolio formation, utilizing a support vector machine that avoids estimating stock-level expected returns, performs even better than the neural network architecture.
