Interpretable and Accurate Prediction Models for Metagenomics Data

10.1101/409144 ◽

2018 ◽

Author(s):

Edi Prifti ◽

Yann Chevaleyre ◽

Blaise Hanczar ◽

Eugeni Belda ◽

Antoine Danchin ◽

...

Keyword(s):

Prediction Models ◽

Biomarker Discovery ◽

Metabolic Diseases ◽

Biological Information ◽

Metagenomic Data ◽

Disease States ◽

Microbial Taxon ◽

Black Boxes ◽

Metagenomics Data ◽

Legal Pressure

ABSTRACT Biomarker discovery using metagenomic data is becoming more prevalent for patient diagnosis, prognosis and risk evaluation. Selected groups of microbial features provide signatures that characterize host disease states such as cancer or cardio-metabolic diseases. Yet, the current predictive models stemming from machine learning still behave as black boxes. Moreover, they seldom generalize well when learned on small datasets. Here, we introduce an original approach that focuses on three models inspired by microbial ecosystem interactions: the addition, subtraction, and ratio of microbial taxon abundances. Although extremely simple, these models perform surprisingly well, matching or exceeding Random Forest, SVM or Elastic Net. Besides being interpretable, such models allow biological information to be distilled from the predictive core variables. Collectively, this approach supports reliable and trustworthy diagnostic decisions while agreeing with the societal and legal pressures that require explainable AI models in the medical domain.

Interpretable and accurate prediction models for metagenomics data

GigaScience ◽

10.1093/gigascience/giaa010 ◽

2020 ◽

Vol 9 (3) ◽

Cited By ~ 6

Author(s):

Edi Prifti ◽

Yann Chevaleyre ◽

Blaise Hanczar ◽

Eugeni Belda ◽

Antoine Danchin ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Biomarker Discovery ◽

Metabolic Diseases ◽

Biological Information ◽

Patient Decision Making ◽

Disease States ◽

Biological Insight ◽

Metagenomics Data ◽

Microbiome Data

Abstract Background Microbiome biomarker discovery for patient diagnosis, prognosis, and risk evaluation is attracting broad interest. Selected groups of microbial features provide signatures that characterize host disease states such as cancer or cardio-metabolic diseases. Yet, the current predictive models stemming from machine learning still behave as black boxes and seldom generalize well. Their interpretation is challenging for physicians and biologists, which makes them difficult to trust and use routinely in the physician–patient decision-making process. Novel methods that provide interpretability and biological insight are needed. Here, we introduce “predomics”, an original machine learning approach inspired by microbial ecosystem interactions that is tailored for metagenomics data. It discovers accurate predictive signatures and provides unprecedented interpretability. The decision provided by the predictive model is based on a simple, yet powerful score computed by adding, subtracting, or dividing cumulative abundance of microbiome measurements. Results Tested on >100 datasets, we demonstrate that predomics models are simple and highly interpretable. Even with such simplicity, they are at least as accurate as state-of-the-art methods. The family of best models, discovered during the learning process, offers the ability to distil biological information and to decipher the predictability signatures of the studied condition. In a proof-of-concept experiment, we successfully predicted body corpulence and metabolic improvement after bariatric surgery using pre-surgery microbiome data. Conclusions Predomics is a new algorithm that helps in providing reliable and trustworthy diagnostic decisions in the microbiome field. Predomics is in accord with societal and legal requirements that plead for an explainable artificial intelligence approach in the medical field.
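
The score described in this abstract (adding, subtracting, or dividing cumulative abundances) can be illustrated with a minimal sketch. This is not the predomics implementation: the function names, the dictionary layout of a sample, the `eps` guard against division by zero, and the thresholding rule are all assumptions made for illustration.

```python
from typing import Dict, Iterable

# Hypothetical layout: one sample maps taxon name -> relative abundance.
Sample = Dict[str, float]


def cumulative_abundance(sample: Sample, taxa: Iterable[str]) -> float:
    """Sum the abundances of the listed taxa; taxa absent from the sample count as 0."""
    return sum(sample.get(taxon, 0.0) for taxon in taxa)


def signature_score(
    sample: Sample,
    positive: Iterable[str],
    negative: Iterable[str] = (),
    kind: str = "subtraction",
    eps: float = 1e-9,
) -> float:
    """Combine two groups of taxa into one interpretable score.

    addition:    cumulative abundance of the positive group only
    subtraction: positive group minus negative group
    ratio:       positive group divided by negative group (eps avoids 0-division)
    """
    pos = cumulative_abundance(sample, positive)
    neg = cumulative_abundance(sample, negative)
    if kind == "addition":
        return pos
    if kind == "subtraction":
        return pos - neg
    if kind == "ratio":
        return pos / (neg + eps)
    raise ValueError(f"unknown model kind: {kind!r}")


def predict(
    sample: Sample,
    positive: Iterable[str],
    negative: Iterable[str] = (),
    kind: str = "subtraction",
    threshold: float = 0.0,
) -> int:
    """Assumed decision rule: class 1 if the score exceeds the threshold, else 0."""
    return int(signature_score(sample, positive, negative, kind) > threshold)
```

Because the score is a sum, difference, or quotient of named taxa, a reader can see directly which taxa push a sample toward each class. In practice the taxa groups and the threshold would be learned from data, which this sketch does not attempt.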

A machine learning framework to determine geolocations from metagenomic profiling

Biology Direct ◽

10.1186/s13062-020-00278-z ◽

2020 ◽

Vol 15 (1) ◽

Cited By ~ 1

Author(s):

Lihong Huang ◽

Canqiang Xu ◽

Wenxian Yang ◽

Rongshan Yu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Geographic Origin ◽

Training Data ◽

Metagenomic Data ◽

Training Dataset ◽

Kriging Interpolation ◽

Learning Framework ◽

Testing Data ◽

Microbial Samples

Abstract Background Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomics profiling of microbial samples. Results Our method was applied to the multi-source microbiome data from MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets. The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities that were not sampled before, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed an affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples.
Conclusion Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.
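
The evaluation protocol in this abstract (train on a random split, score on the held-out part, average over many splits) can be sketched as follows. This is a simplified stand-in, not the authors' code: it is binary rather than multi-city, it reads "L2 normalization" as an L2 penalty on the weights (an assumption), it uses plain gradient descent, and every name and default value is hypothetical.

```python
import math
import random


def train_logreg_l2(X, y, lam=0.01, lr=0.5, epochs=200):
    """Fit binary logistic regression with an L2 penalty by batch gradient descent."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return class 1 when the linear score is non-negative, else class 0."""
    return int(b + sum(wj * xj for wj, xj in zip(w, x)) >= 0.0)


def mean_split_accuracy(X, y, n_splits=100, val_fraction=0.3, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_val = max(1, int(len(X) * val_fraction))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        val, train = idx[:n_val], idx[n_val:]
        w, b = train_logreg_l2([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(w, b, X[i]) == y[i] for i in val)
        accuracies.append(correct / n_val)
    return sum(accuracies) / len(accuracies)
```

Averaging over many splits reduces the dependence of the reported accuracy on any single lucky or unlucky partition, which matters when the number of samples per class is small.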

Evaluating Metagenomic Prediction of the Metaproteome in a 4.5-Year Study of a Patient with Crohn's Disease

mSystems ◽

10.1128/msystems.00337-18 ◽

2019 ◽

Vol 4 (1) ◽

Cited By ~ 18

Author(s):

Robert H. Mills ◽

Yoshiki Vázquez-Baeza ◽

Qiyun Zhu ◽

Lingjing Jiang ◽

James Gaffney ◽

...

Keyword(s):

Crohn's Disease ◽

Dna Analysis ◽

Gene Copy Number ◽

Metagenomic Data ◽

Gene Copy ◽

Metagenomic Sequencing ◽

Data Types ◽

Fecal Microbiome ◽

Disease States

ABSTRACT Although genetic approaches are the standard in microbiome analysis, proteome-level information is largely absent. This discrepancy warrants a better understanding of the relationship between gene copy number and protein abundance, as this is crucial information for inferring protein-level changes from metagenomic data. As it remains unknown how metaproteomic systems evolve during dynamic disease states, we leveraged a 4.5-year fecal time series using samples from a single patient with colonic Crohn’s disease. Utilizing multiplexed quantitative proteomics and shotgun metagenomic sequencing of eight time points in technical triplicate, we quantified over 29,000 protein groups and 110,000 genes and compared them to five protein biomarkers of disease activity. Broad-scale observations were consistent between data types, including overall clustering by principal-coordinate analysis and fluctuations in Gene Ontology terms related to Crohn’s disease. Through linear regression, we determined genes and proteins fluctuating in conjunction with inflammatory metrics. We discovered conserved taxonomic differences relevant to Crohn’s disease, including a negative association of Faecalibacterium and a positive association of Escherichia with calprotectin. Despite concordant associations of genera, the specific genes correlated with these metrics were drastically different between metagenomic and metaproteomic data sets. This resulted in the generation of unique functional interpretations dependent on the data type, with metaproteome evidence for previously investigated mechanisms of dysbiosis. An example of one such mechanism was a connection between urease enzymes, amino acid metabolism, and the local inflammation state within the patient. This proof-of-concept approach prompts further investigation of the metaproteome and its relationship with the metagenome in biologically complex systems such as the microbiome. 
IMPORTANCE A majority of current microbiome research relies heavily on DNA analysis. However, as the field moves toward understanding the microbial functions related to healthy and disease states, it is critical to evaluate how changes in DNA relate to changes in proteins, which are functional units of the genome. This study tracked the abundance of genes and proteins as they fluctuated during various inflammatory states in a 4.5-year study of a patient with colonic Crohn’s disease. Our results indicate that despite a low level of correlation, taxonomic associations were consistent in the two data types. While there was overlap of the data types, several associations were uniquely discovered by analyzing the metaproteome component. This case study provides unique and important insights into the fundamental relationship between the genes and proteins of a single individual’s fecal microbiome associated with clinical consequences.

Metabolite biomarker discovery for metabolic diseases by flux analysis

2012 IEEE 6th International Conference on Systems Biology (ISB) ◽

10.1109/isb.2012.6314103 ◽

2012 ◽

Cited By ~ 1

Author(s):

Limin Li ◽

Hao Jiang ◽

Wai-Ki Ching ◽

Vassilis S. Vassiliadis

Keyword(s):

Biomarker Discovery ◽

Metabolic Diseases ◽

Flux Analysis

Reliable Biomarker discovery from Metagenomic data via RegLRSD algorithm

BMC Bioinformatics ◽

10.1186/s12859-017-1738-1 ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 2

Author(s):

Mustafa Alshawaqfeh ◽

Ahmad Bashaireh ◽

Erchin Serpedin ◽

Jan Suchodolski

Keyword(s):

Biomarker Discovery ◽

Metagenomic Data

Leveraging Transcriptomics Data for Genomic Prediction Models in Cassava

10.1101/208181 ◽

2017 ◽

Cited By ~ 4

Author(s):

Roberto Lozano ◽

Dunia Pino del Carpio ◽

Teddy Amuge ◽

Ismail Siraj Kayondo ◽

Alfred Ozimati Adebo ◽

...

Keyword(s):

Genome Sequence ◽

Genomic Prediction ◽

Prediction Models ◽

Breeding Population ◽

Biological Information ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sources Of Information ◽

Transcriptomics Data ◽

The Individual

Abstract Background Genomic prediction models were, in principle, developed to include all the available marker information; with this approach, these models have shown moderate to high predictive accuracies in various crops. Previous studies in cassava have demonstrated that, even with relatively small training populations and low-density GBS markers, prediction models are feasible for genomic selection. In the present study, we prioritized SNPs in close proximity to genome regions with biological importance for a given trait. We used a number of strategies to select variants that were then included in single and multiple kernel GBLUP models. Specifically, our sources of information were transcriptomics, GWAS, and immunity-related genes, with the ultimate goal of increasing predictive accuracies for Cassava Brown Streak Disease (CBSD) severity. Results We used single and multi-kernel GBLUP models with markers imputed to whole genome sequence level to accommodate various sources of biological information; fitting more than one kinship matrix allowed for differential weighting of the individual marker relationships. We applied these GBLUP approaches to CBSD phenotypes (i.e., root infection and leaf severity three and six months after planting) in a Ugandan Breeding Population (n = 955). Three means of exploiting an established RNAseq experiment of CBSD-infected cassava plants were used. Compared to the biology-agnostic GBLUP model, the informed multi-kernel models increased prediction accuracy only marginally (1.78% to 2.52%). Conclusions Our results show that markers imputed to whole genome sequence level do not provide enhanced prediction accuracies compared to using standard GBS marker data in cassava. The use of transcriptomics data and other sources of biological information resulted in prediction accuracies that were nominally superior to those obtained from traditional prediction models.

Creating a Metabolic Syndrome Research Resource using the National Health and Nutrition Examination Survey

Database ◽

10.1093/database/baaa103 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Willysha S Jenkins ◽

Christian Richardson ◽

Ariel Williams ◽

Clarlynda R Williams-DeVane

Keyword(s):

Metabolic Syndrome ◽

National Health ◽

Biomarker Discovery ◽

Metabolic Diseases ◽

Categorical Variables ◽

Nutrition Examination Survey ◽

Health And Nutrition ◽

The Metabolic Syndrome ◽

Potential Biomarker ◽

Increased Risk

Abstract Metabolic syndrome (MetS) is multifaceted. Risk factors include visceral adiposity, dyslipidemia, hyperglycemia, hypertension and environmental stimuli. MetS leads to an increased risk of cardiovascular disease, type 2 diabetes and stroke. Comparative studies, however, have identified heterogeneity in the pathology of MetS across groups though the etiology of these differences has yet to be elucidated. The Metabolic Syndrome Research Resource (MetSRR) described in this report is a curated database that provides access to MetS-associated biological and ancillary data and pools current and potential biomarkers of MetS extracted from relevant National Health and Nutrition Examination Survey (NHANES) data from 1999–2016. Each potential biomarker was selected following the review of over 100 peer-reviewed articles. MetSRR includes 28 demographics, survey and known MetS-related variables, including 9 curated categorical variables and 42 potentially novel biomarkers. All measures are captured from over 90 000 individuals. This biocuration effort provides increased access to curated MetS-related data and will serve as a hypothesis-generating tool to aid in novel biomarker discovery. In addition, MetSRR provides the ability to generate and export ethnic group-/race-, sex- and age-specific curated datasets, thus broadening participation in research efforts to identify clinically evaluative MetS biomarkers for disparate populations. Although there are other databases, such as BioM2MetDisease, designed to explore metabolic diseases through analysis of miRNAs and disease phenotypes, MetSRR is the only MetS-specific database designed to explore etiology of MetS across groups, through the biocuration of demographic, biological samples and biometric data. Database URL: http://www.healthdisparityinformatics.com/MetSRR

Artificial Intelligence in Lung Cancer: Bridging the Gap Between Computational Power and Clinical Decision-Making

Canadian Association of Radiologists Journal ◽

10.1177/0846537120941434 ◽

2020 ◽

pp. 084653712094143

Author(s):

Jaryd R. Christie ◽

Pencilla Lang ◽

Lauren M. Zelko ◽

David A. Palma ◽

Mohamed Abdelrazek ◽

...

Keyword(s):

Artificial Intelligence ◽

Lung Cancer ◽

Decision Making ◽

Clinical Decision Making ◽

Prediction Models ◽

Biomarker Discovery ◽

Surgical Techniques ◽

Clinical Decision ◽

Clinical Implementation ◽

Cancer Management

Lung cancer remains the most common cause of cancer death worldwide. Recent advances in lung cancer screening, radiotherapy, surgical techniques, and systemic therapy have led to increasing complexity in diagnosis, treatment decision-making, and assessment of recurrence. Artificial intelligence (AI)–based prediction models are being developed to address these issues and may have a future role in screening, diagnosis, treatment selection, and decision-making around salvage therapy. Imaging plays an essential role in all components of lung cancer management and has the potential to play a key role in AI applications. Artificial intelligence has demonstrated value in prognostic biomarker discovery in lung cancer diagnosis, treatment, and response assessment, putting it at the forefront of the next phase of personalized medicine. However, although exploratory studies demonstrate potential utility, there is a need for rigorous validation and standardization before AI can be utilized in clinical decision-making. In this review, we will provide a summary of the current literature implementing AI for outcome prediction in lung cancer. We will describe the anticipated impact of AI on the management of patients with lung cancer and discuss the challenges of clinical implementation of these techniques.

Current Status of Metabolomic Biomarker Discovery: Impact of Study Design and Demographic Characteristics

Metabolites ◽

10.3390/metabo10060224 ◽

2020 ◽

Vol 10 (6) ◽

pp. 224 ◽

Cited By ~ 2

Author(s):

Vladimir Tolstikov ◽

A. James Moser ◽

Rangaprasad Sarangarajan ◽

Niven R. Narain ◽

Michael A. Kiebish

Keyword(s):

Study Design ◽

Biomarker Discovery ◽

Therapeutic Interventions ◽

Current Status ◽

Metabolomics Data ◽

Widespread Application ◽

Human Phenotype ◽

Demographic Groups ◽

Disease States ◽

Challenges And Opportunities

Widespread application of omic technologies is evolving our understanding of population health and holds promise in providing precise guidance for selection of therapeutic interventions based on patient biology. The opportunity to use hundreds of analytes for diagnostic assessment of human health compared to the current use of 10–20 analytes will provide greater accuracy in deconstructing the complexity of human biology in disease states. Conventional biochemical measurements like cholesterol, creatinine, and urea nitrogen are currently used to assess health status; however, metabolomics captures a comprehensive set of analytes characterizing the human phenotype and its complex metabolic processes in real-time. Unlike conventional clinical analytes, metabolomic profiles are dramatically influenced by demographic and environmental factors that affect the range of normal values and increase the risk of false biomarker discovery. This review addresses the challenges and opportunities created by the evolving field of clinical metabolomics and highlights features of study design and bioinformatics necessary to maximize the utility of metabolomics data across demographic groups.

MicroRNA in biofluids—Robust biomarkers for disease, toxicology, or injury studies: The case of minimally invasive colorectal cancer detection.

Journal of Clinical Oncology ◽

10.1200/jco.2012.30.30_suppl.20 ◽

2012 ◽

Vol 30 (30_suppl) ◽

pp. 20-20

Author(s):

Peter Mouritzen ◽

Søren Jensby Nielsen ◽

Maria Wrang Teilum ◽

Thorarinn Blondal ◽

Ditte Andreasen ◽

...

Keyword(s):

Colorectal Cancer ◽

Early Detection ◽

Biomarker Discovery ◽

Expression Profiles ◽

Minimal Invasive ◽

Reference Ranges ◽

Disease States ◽

A Genome ◽

New Biomarkers ◽

Clinical Source

20 Background: MicroRNAs function as post-transcriptional regulators of gene expression. Their high relative stability in common clinical source materials (FFPE blocks, plasma, serum, urine, saliva, etc.) and the ability of microRNA expression profiles to accurately classify discrete tissue types and specific disease states have positioned microRNAs as promising new biomarkers for diagnostic application. Furthermore, microRNAs have been shown to be rapidly released from tissues into the circulation with the development of pathology. Methods: Thousands of biofluid samples, including blood-derived plasma/serum and urine, were profiled using a genome-wide LNA-based microRNA qPCR platform, which has unparalleled sensitivity and robustness even in biofluids with extremely low microRNA levels. Only a single RT reaction is required to conduct full miRNome profiling, thereby facilitating high-throughput profiling without the need for pre-amplification. Results: Normal reference ranges for circulating microRNAs were determined in several biofluids, allowing development of qPCR arrays containing only relevant microRNA subsets present in various biofluids together with tissue-specific microRNA markers. Procedures were developed to control pre-analytical variables and to quality-check and qualify biofluid samples, in particular serum and plasma but also urine and other biofluids. An extensive QC system was implemented in order to secure technical excellence and reveal any unwanted bias in the dataset. We currently screen and validate microRNA biomarkers for cancer with the aim of developing minimally invasive tests to be applied in early detection population screens. Conclusions: The qPCR panels support development of robust biomarkers in disease, toxicology, and injury studies. We will demonstrate how panels may be quickly and robustly applied in biomarker discovery/validation projects using the specific case of early detection of colorectal cancer in blood.
Close attention to pre-analytical parameters is required. Hemolysis and cellular contamination affect miRNA profiles in biofluids and must be controlled.