Using machine learning to predict quantitative phenotypes from protein and nucleic acid sequences

Mapping Intimacies ◽

10.1101/677328 ◽

2019 ◽

Author(s):

David B. Sauer ◽

Da-Neng Wang

Keyword(s):

Machine Learning ◽

Nucleic Acid ◽

Molecular Mechanisms ◽

Mean Squared Error ◽

Fluorescent Proteins ◽

Optimal Growth ◽

Multilayer Perceptrons ◽

Learning Models ◽

The Relationship ◽

Machine Learning Models

The link between sequence and phenotype is essential to understanding the molecular mechanisms of evolution, and to the design of proteins and genes with specific properties. However, it is difficult to describe the relationship between sequence and protein or organismal phenotypes, due to the complex relationship between sequence, protein folding and activity, and organismal physiology. Here, we use machine learning models trained on individual families of proteins or nucleic acids to predict the originating species’ optimal growth temperatures or other quantitative phenotypes. Trained multilayer perceptrons (MLPs) outperformed linear regressions in predicting the originating species’ growth temperature from protein sequences, achieving a root mean squared error of 3.6 °C. Similar machine learning models were able to predict the binding affinity of mutant WW domain sequences, brightness of fluorescent proteins, and enzymatic activity of ribozymes. Notably, the trained models are protein or nucleic acid family specific and therefore useful in the design of biopolymers with particular properties. This method provides a new tool for the in silico prediction of quantitative biophysical and organismal phenotypes directly from sequence.
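
The root mean squared error used above to compare models can be computed as in the following minimal sketch. The function name `rmse` and the temperature values are illustrative assumptions, not data or code from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical optimal growth temperatures (in °C) and model predictions.
observed = [30.0, 37.0, 55.0, 80.0]
predicted = [33.0, 37.0, 51.0, 80.0]

print(rmse(observed, predicted))  # 2.5
```

Because the error is expressed in the units of the target, an RMSE of 3.6 °C can be read directly as a typical prediction error in degrees.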

What Can Machine Learning Approaches in Genomics Tell Us about the Molecular Basis of Amyotrophic Lateral Sclerosis?

Journal of Personalized Medicine ◽

10.3390/jpm10040247 ◽

2020 ◽

Vol 10 (4) ◽

pp. 247

Author(s):

Christina Vasilopoulou ◽

Andrew P. Morris ◽

George Giannakopoulos ◽

Stephanie Duguez ◽

William Duddy

Keyword(s):

Machine Learning ◽

Amyotrophic Lateral Sclerosis ◽

Genetic Architecture ◽

Molecular Mechanisms ◽

Current Knowledge ◽

Regulatory Elements ◽

Specific Information ◽

Learning Models ◽

Lateral Sclerosis ◽

Machine Learning Models

Amyotrophic Lateral Sclerosis (ALS) is the most common late-onset motor neuron disorder, but the molecular mechanisms and pathways underlying this disease remain elusive. This review (1) systematically identifies machine learning studies aimed at the understanding of the genetic architecture of ALS, (2) outlines the main challenges faced and compares the different approaches that have been used to confront them, and (3) compares the experimental designs and results produced by those approaches and describes their reproducibility in terms of biological results and the performances of the machine learning models. The majority of the collected studies incorporated prior knowledge of ALS into their feature selection approaches, and trained their machine learning models using genomic data combined with other types of mined knowledge including functional associations, protein-protein interactions, disease/tissue-specific information, epigenetic data, and known ALS phenotype-genotype associations. The importance of incorporating gene-gene interactions and cis-regulatory elements into the experimental design of future ALS machine learning studies is highlighted. Lastly, it is suggested that future advances in the genomic and machine learning fields will bring about a better understanding of ALS genetic architecture, and enable improved personalized approaches to this and other devastating and complex diseases.

The Relationship Between Reductionism and Prediction in Psychiatry: A Survey

10.31234/osf.io/pgryu ◽

2021 ◽

Author(s):

Eren Asena ◽

Henk Cremers

Keyword(s):

Machine Learning ◽

Mental Disorders ◽

Literature Review ◽

Significant Role ◽

Biological Psychiatry ◽

Learning Models ◽

Different Types ◽

Survey Results ◽

The Relationship ◽

Machine Learning Models

Introduction. Biological psychiatry has yet to find clinically useful biomarkers despite much effort. Is this because the field needs better methods and more data, or are current conceptualizations of mental disorders too reductionistic? Although this is an important question, there seems to be no consensus on what it means to be a “reductionist”. Aims. This paper aims (a) to clarify the views of researchers on different types of reductionism; (b) to examine the relationship between these views and the degree to which researchers believe mental disorders can be predicted from biomarkers; and (c) to compare these predictability estimates with the performance of machine learning models that have used biomarkers to distinguish cases from controls. Methods. We created a survey on reductionism and the predictability of mental disorders from biomarkers, and shared it with researchers in biological psychiatry. Furthermore, a literature review was conducted on the performance of machine learning models in predicting mental disorders from biomarkers. Results. The survey results showed that 9% of the sample were dualists and 57% were explanatory reductionists. There was no relationship between reductionism and perceived predictability. The estimated predictability of 11 mental disorders using currently available methods ranged between 65% and 80%, which was comparable to the results from the literature review. However, the participants were highly optimistic about the ability of future methods to distinguish cases from controls. Moreover, although behavioral data were rated as the most effective data type in predicting mental disorders, the participants expected biomarkers to play a significant role in not just predicting, but also defining mental disorders in the future.

Permutation-based identification of important biomarkers for complex diseases via machine learning models

Nature Communications ◽

10.1038/s41467-021-22756-2 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Xinlei Mi ◽

Baiming Zou ◽

Fei Zou ◽

Jianhua Hu

Keyword(s):

Machine Learning ◽

Human Disease ◽

Molecular Mechanisms ◽

The Cancer Genome Atlas ◽

Support Vector ◽

Individual Feature ◽

Learning Models ◽

Efficient Manner ◽

Feature Importance ◽

Machine Learning Models

Study of human disease remains challenging due to convoluted disease etiologies and complex molecular mechanisms at genetic, genomic, and proteomic levels. Many machine learning-based methods have been developed and widely used to alleviate some analytic challenges in complex human disease studies. While enjoying the modeling flexibility and robustness, these model frameworks suffer from non-transparency and difficulty in interpreting each individual feature due to their sophisticated algorithms. However, identifying important biomarkers is a critical pursuit towards assisting researchers to establish novel hypotheses regarding prevention, diagnosis and treatment of complex human diseases. Herein, we propose a Permutation-based Feature Importance Test (PermFIT) for estimating and testing the feature importance, and for assisting interpretation of individual features in complex frameworks, including deep neural networks, random forests, and support vector machines. PermFIT (available at https://github.com/SkadiEye/deepTL) is implemented in a computationally efficient manner, without model refitting. We conduct extensive numerical studies under various scenarios, and show that PermFIT not only yields valid statistical inference, but also improves the prediction accuracy of machine learning models. With the application to the Cancer Genome Atlas kidney tumor data and the HITChip atlas data, PermFIT demonstrates its practical usage in identifying important biomarkers and boosting model prediction performance.
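
The general idea behind permutation-based importance can be illustrated with a short sketch. This is plain permutation importance (the mean increase in squared error after shuffling one feature), not the PermFIT test statistic or the deepTL API; `permutation_importance`, `toy_model`, and the data are hypothetical names and values chosen for illustration. As in the abstract, the fitted model is only queried for predictions and never refitted.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn.

    `predict` maps one row (a list of floats) to a prediction; the model
    behind it is assumed to be already fitted and is never refitted here.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model that uses feature 0 and ignores feature 1."""
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

In this toy setup the ignored feature receives an importance of exactly zero, while shuffling the informative feature increases the error.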

Detecting Arsenic Contamination Using Satellite Imagery and Machine Learning

Toxics ◽

10.3390/toxics9120333 ◽

2021 ◽

Vol 9 (12) ◽

pp. 333

Author(s):

Ayush Agrawal ◽

Mark R. Petersen

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Mean Squared Error ◽

Binary Classification ◽

Arsenic Concentration ◽

Arsenic Contamination ◽

Hyperspectral Data ◽

Detection Methods ◽

Learning Models ◽

Machine Learning Models

Arsenic, a potent carcinogen and neurotoxin, affects over 200 million people globally. Current detection methods are laborious, expensive, and unscalable, being difficult to implement in developing regions and during crises such as COVID-19. This study attempts to determine if a relationship exists between soil’s hyperspectral data and arsenic concentration using NASA’s Hyperion satellite. It is the first arsenic study to use satellite-based hyperspectral data and apply a classification approach. Four regression machine learning models are tested to determine this correlation in soil with bare land cover. Raw data are converted to reflectance, problematic atmospheric influences are removed, characteristic wavelengths are selected, and four noise reduction algorithms are tested. The combination of data augmentation, Genetic Algorithm, Second Derivative Transformation, and Random Forest regression (R² = 0.840 and normalized root mean squared error (re-scaled to [0,1]) = 0.122) shows strong correlation, performing better than past models despite using noisier satellite data (versus lab-processed samples). Three binary classification machine learning models are then applied to identify high-risk shrub-covered regions in ten U.S. states, achieving strong accuracy (= 0.693) and F1-score (= 0.728). Overall, these results suggest that such a methodology is practical and can provide a sustainable alternative to arsenic contamination detection.
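
The two regression metrics quoted above can be sketched as follows. The abstract does not state how the RMSE was normalized, so dividing by the range of the observed values is an assumption made here for illustration; the function names and the sample values are likewise hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the observed range (assumed normalization)."""
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq) / (max(y_true) - min(y_true))


# Hypothetical arsenic concentrations and predictions (arbitrary units).
observed = [0.0, 10.0, 20.0, 30.0]
predicted = [2.0, 8.0, 22.0, 28.0]

print(r_squared(observed, predicted))
print(normalized_rmse(observed, predicted))
```

Other normalizations (for example, dividing by the mean of the observations) would give different values, so the denominator should always be reported alongside the metric.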

Optimization of Machine Learning in Various Situations Using ICT-Based TVOC Sensors

Micromachines ◽

10.3390/mi11121092 ◽

2020 ◽

Vol 11 (12) ◽

pp. 1092

Author(s):

Jae Hyuk Cho ◽

Hayoun Lee

Keyword(s):

Machine Learning ◽

Short Term Memory ◽

Ground Truth ◽

Information And Communications Technology ◽

Learning Models ◽

Computational Framework ◽

Sensory Data ◽

Total Volatile Organic Compounds ◽

The Relationship ◽

Machine Learning Models

Computational frameworks using artificial intelligence (AI) have been proposed in numerous fields, such as medicine, robotics, meteorology, and chemistry. However, the specific strengths of each AI model, and the relationship between data characteristics and ground truth that would guide model selection in a given situation, have not been established. Since TVOCs (total volatile organic compounds) cause serious harm to human health and plants, preventing such damage by reducing their occurrence is not optional but essential in manufacturing, as well as in chemical industries and laboratories. In this study, with consideration of the characteristics of the machine learning technique and ICT (information and communications technology), TVOC sensors are explored as a function of grounded data analysis and the selection of machine learning models, determining their performance in real situations. For representative scenarios, considering features from an ICT semiconductor sensor and one targeting TVOC gas, we investigated suitable analysis methods and machine learning models such as LSTM (long short-term memory), GRU (gated recurrent unit), and RNN (recurrent neural network). Detailed factors for these machine learning models with respect to the concentration of TVOC gas in the atmosphere are compared with original sensory data to obtain their accuracy. From this work, we expect to significantly reduce risk in empirical applications, i.e., maintaining homeostasis or predicting abnormal situations to construct an opportune response.

What if Social Robots Look for Productive Engagement?

International Journal of Social Robotics ◽

10.1007/s12369-021-00766-w ◽

2021 ◽

Author(s):

Jauwairia Nasir ◽

Barbara Bruno ◽

Mohamed Chetouani ◽

Pierre Dillenbourg

Keyword(s):

Machine Learning ◽

Learning Outcomes ◽

State Of The Art ◽

Learning Models ◽

Collaborative Activity ◽

Data Set ◽

Performance Metric ◽

The Relationship ◽

Machine Learning Models ◽

Productive Engagement

In educational HRI, it is generally believed that a robot’s behavior has a direct effect on the engagement of a user with the robot, the task at hand and also their partner in the case of a collaborative activity. Increasing this engagement is then held responsible for increased learning and productivity. The state of the art usually investigates the relationship between the behaviors of the robot and the engagement state of the user while assuming a linear relationship between engagement and the end goal: learning. However, is it correct to assume that to maximise learning, one needs to maximise engagement? Furthermore, conventional supervised models of engagement require human annotators to get labels. This is not only laborious but also introduces further subjectivity in an already subjective construct of engagement. Can we have machine-learning models for engagement detection where annotations do not rely on human annotators? Looking deeper at the behavioral patterns, the learning outcomes and a performance metric in a multi-modal data set collected in an educational human–human–robot setup with 68 students, we observe a hidden link that we term Productive Engagement. We theorize a robot incorporating this knowledge will (1) distinguish teams based on engagement that is conducive to learning; and (2) adopt behaviors that eventually lead the users to increased learning by means of being productively engaged. Furthermore, this seminal link paves the way for machine-learning models in educational HRI with automatic labelling based on the data.

Exploring the Relationship Between Mandatory Helmet Use Regulations and Adult Cyclists’ Behavior in California Using Hybrid Machine Learning Models

10.31979/mti.2021.2024 ◽

2021 ◽

Author(s):

Fatemeh Davoudi Kakhki ◽

Maria Chierichetti

Keyword(s):

Machine Learning ◽

Sociodemographic Characteristics ◽

Bicycle Helmet ◽

Learning Models ◽

Helmet Use ◽

Bicycle Helmets ◽

Hybrid Machine ◽

The Impact ◽

The Relationship ◽

Machine Learning Models

In California, bike fatalities increased by 8.1% from 2015 to 2016. Even though the benefits of wearing helmets in protecting cyclists against trauma in cycling crashes have been determined, the use of helmets is still limited, and there is opposition to mandatory helmet use, particularly for adults. Therefore, exploring perceptions of adult cyclists regarding mandatory helmet use is a key element in understanding cyclists’ behavior, and determining the impact of mandatory helmet use on their cycling rate. The goal of this research is to identify sociodemographic characteristics and cycling behaviors that are associated with the use and non-use of bicycle helmets among adults, and to assess whether the enforcement of a bicycle helmet law will result in a change in cycling rates. This research develops hybrid machine learning models to pinpoint the driving factors that explain adult cyclists’ behavior regarding helmet use laws.

Tracking Major Sources of Water Contamination Using Machine Learning

Frontiers in Microbiology ◽

10.3389/fmicb.2020.616692 ◽

2021 ◽

Vol 11 ◽

Author(s):

Jianyong Wu ◽

Conghe Song ◽

Eric A. Dubinsky ◽

Jill R. Stewart

Keyword(s):

Machine Learning ◽

Random Forest ◽

Land Cover ◽

Microbial Contamination ◽

Naïve Bayes ◽

Learning Models ◽

The Relationship ◽

Microbial Sources ◽

Machine Learning Models

Current microbial source tracking techniques that rely on grab samples analyzed by individual endpoint assays are inadequate to explain microbial sources across space and time. Modeling and predicting host sources of microbial contamination could add a useful tool for watershed management. In this study, we tested and evaluated machine learning models to predict the major sources of microbial contamination in a watershed. We examined the relationship between microbial sources, land cover, weather, and hydrologic variables in a watershed in Northern California, United States. Six models, including K-nearest neighbors (KNN), Naïve Bayes, support vector machine (SVM), simple neural network (NN), Random Forest, and XGBoost, were built to predict major microbial sources using land cover, weather and hydrologic variables. The results showed that these models successfully predicted microbial sources classified into two categories (human and non-human), with the average accuracy ranging from 69% (Naïve Bayes) to 88% (XGBoost). The area under the curve (AUC) of the receiver operating characteristic (ROC) showed that XGBoost had the best performance (average AUC = 0.88), followed by Random Forest (average AUC = 0.84), and KNN (average AUC = 0.74). The importance index obtained from Random Forest indicated that precipitation and temperature were the two most important factors in predicting the dominant microbial source. These results suggest that machine learning models, particularly XGBoost, can predict the dominant sources of microbial contamination based on the relationship of microbial contaminants with daily weather and land cover, providing a powerful tool to understand microbial sources in water.
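
For a two-class problem such as human versus non-human sources, the ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name `roc_auc`, the label coding (1 = human, 0 = non-human), and the scores are illustrative assumptions rather than values from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC by pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = human source) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

print(roc_auc(labels, scores))  # 8 of 9 pairs ranked correctly
```

The pairwise form is quadratic in the number of samples; it is used here for clarity, and rank-based implementations give the same value more efficiently.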

A Gated Recurrent Unit Approach to Bitcoin Price Prediction

Journal of Risk and Financial Management ◽

10.3390/jrfm13020023 ◽

2020 ◽

Vol 13 (2) ◽

pp. 23 ◽

Cited By ~ 9

Author(s):

Aniruddha Dutta ◽

Saket Kumar ◽

Meheli Basu

Keyword(s):

Machine Learning ◽

Short Term Memory ◽

Mean Squared Error ◽

Learning Models ◽

Endogenous Factors ◽

Price Prediction ◽

Financial Gain ◽

Fixed Set ◽

Machine Learning Models ◽

Better Than

In today’s era of big data, deep learning and artificial intelligence have formed the backbone for cryptocurrency portfolio optimization. Researchers have investigated various state-of-the-art machine learning models to predict Bitcoin price and volatility. Machine learning models like recurrent neural network (RNN) and long short-term memory (LSTM) have been shown to perform better than traditional time series models in cryptocurrency price prediction. However, very few studies have applied sequence models with robust feature engineering to predict future pricing. In this study, we investigate a framework with a set of advanced machine learning forecasting methods with a fixed set of exogenous and endogenous factors to predict daily Bitcoin prices. We study and compare different approaches using the root mean squared error (RMSE). Experimental results show that the gated recurrent unit (GRU) model with recurrent dropout performs better than popular existing models. We also show that simple trading strategies, when implemented with our proposed GRU model and with proper learning, can lead to financial gain.
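
To make the GRU mentioned above concrete, the following sketch implements a single GRU step with scalar weights. It is a textbook form of the cell, not the authors' model: recurrent dropout is omitted, the parameter names (`w_z`, `u_z`, and so on) and input values are hypothetical, and libraries differ on whether the update gate weights the previous state or the candidate state.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update for scalar input x and scalar hidden state h_prev."""
    # Update gate: how much of the candidate state to mix in.
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])
    # Candidate hidden state.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Convention assumed here: z weights the candidate state.
    return (1.0 - z) * h_prev + z * h_cand


# Hypothetical parameters; in practice these are learned from data.
params = {
    "w_z": 0.5, "u_z": -0.3, "b_z": 0.0,
    "w_r": 0.8, "u_r": 0.2, "b_r": 0.1,
    "w_h": 1.2, "u_h": 0.7, "b_h": -0.1,
}

# Run the cell over a short, made-up sequence of scaled daily inputs.
hidden = 0.0
for value in [0.10, 0.15, 0.12, 0.20]:
    hidden = gru_step(value, hidden, params)
print(hidden)
```

Because the new state is a convex combination of the previous state and a tanh output, a hidden state that starts at zero stays strictly between -1 and 1.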

Predicting Bone Metastasis Using Gene Expression-Based Machine Learning Models

Frontiers in Genetics ◽

10.3389/fgene.2021.771092 ◽

2021 ◽

Vol 12 ◽

Author(s):

Somayah Albaradei ◽

Mahmut Uludag ◽

Maha A. Thafar ◽

Takashi Gojobori ◽

Magbubah Essack ◽

...

Keyword(s):

Machine Learning ◽

Adverse Effects ◽

Predictive Value ◽

Molecular Mechanisms ◽

Malignant Tumors ◽

Life Quality ◽

Cord Compression ◽

Learning Models ◽

Tcga Dataset ◽

Machine Learning Models

Bone is the most common site of distant metastasis from malignant tumors, with the highest prevalence observed in breast and prostate cancers. Such bone metastases (BM) cause many painful skeletal-related events, such as severe bone pain, pathological fractures, spinal cord compression, and hypercalcemia, with adverse effects on life quality. Many bone-targeting agents developed based on the current understanding of BM onset’s molecular mechanisms dull these adverse effects. However, only a few studies have investigated potential predictors of high risk for developing BM, despite such knowledge being critical for early interventions to prevent or delay BM. This work proposes a computational network-based pipeline that incorporates an ML/DL component to predict BM development. Based on the proposed pipeline we constructed several machine learning models. The deep neural network (DNN) model exhibited the highest prediction accuracy (AUC of 92.11%) using the top 34 featured genes ranked by betweenness centrality scores. We further used an entirely separate, “external” TCGA dataset to evaluate the robustness of this DNN model and achieved sensitivity of 85%, specificity of 80%, positive predictive value of 78.10%, negative predictive value of 80%, and AUC of 85.78%. The results show that the model’s way of learning allowed it to focus on the featured genes, with the added benefit that the model generalizes, that is, it can predict BM for samples from different primary sites. Furthermore, existing experimental evidence provides confidence that about 50% of the 34 hub genes have BM-related functionality, which suggests that these common genetic markers provide vital insight about BM drivers. These findings may prompt the transformation of such a method into an artificial intelligence (AI) diagnostic tool and direct us towards mechanisms that underlie metastasis to bone events.
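
The four diagnostic metrics reported for the external dataset all derive from a 2×2 confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not the counts behind the paper's results.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts: 50 samples with BM, 50 without.
metrics = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the classifier itself, whereas PPV and NPV also depend on how common BM is in the evaluated cohort.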
