Managing batch effects in microbiome data

2019 ◽  
Vol 21 (6) ◽  
pp. 1954-1970 ◽  
Author(s):  
Yiwen Wang ◽  
Kim-Anh Lê Cao

Abstract Microbial communities have been increasingly studied in recent years to investigate their role in ecological habitats. However, microbiome studies are difficult to reproduce or replicate as they may suffer from confounding factors that are unavoidable in practice and originate from biological, technical or computational sources. In this review, we define batch effects as unwanted variation introduced by confounding factors that are not related to any factors of interest. Computational and analytical methods are required to remove or account for batch effects. However, inherent microbiome data characteristics (e.g. sparse, compositional and multivariate) challenge the development and application of batch effect adjustment methods that either account for or correct batch effects. We present commonly encountered sources of batch effects that we illustrate in several case studies. We discuss the limitations of current methods, which often have assumptions that are not met due to the peculiarities of microbiome data. We provide practical guidelines for assessing the efficiency of the methods based on visual and numerical outputs and a thorough tutorial to reproduce the analyses conducted in this review.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Verónica Lloréns-Rico ◽  
Sara Vieira-Silva ◽  
Pedro J. Gonçalves ◽  
Gwen Falony ◽  
Jeroen Raes

Abstract While metagenomic sequencing has become the tool of preference to study host-associated microbial communities, downstream analyses and clinical interpretation of microbiome data remain challenging due to the sparsity and compositionality of sequence matrices. Here, we evaluate both computational and experimental approaches proposed to mitigate the impact of these outstanding issues. Generating fecal metagenomes from simulated microbial communities, we benchmark the performance of thirteen commonly used analytical approaches in terms of diversity estimation, identification of taxon-taxon associations, and assessment of taxon-metadata correlations under the challenge of varying microbial ecosystem loads. We find that quantitative approaches, including experimental procedures that incorporate microbial load variation in downstream analyses, perform significantly better than computational strategies designed to mitigate data compositionality and sparsity, not only improving the identification of true positive associations, but also reducing false positive detection. When analyzing simulated scenarios of low microbial load dysbiosis as observed in inflammatory pathologies, quantitative methods correcting for sampling depth show higher precision compared to uncorrected scaling. Overall, our findings advocate for a wider adoption of experimental quantitative approaches in microbiome research, yet also suggest preferred transformations for specific cases where determination of the microbial load of samples is not feasible.
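A common computational strategy for the compositionality issue described above is a log-ratio transformation. Below is a minimal sketch of the centered log-ratio (CLR) transform, one member of the family of approaches such benchmarks compare; the function name and pseudocount handling are illustrative, not taken from the paper:

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples-x-taxa count matrix.

    A pseudocount avoids log(0) in sparse data; each sample's log
    counts are centered on their mean, i.e. divided by the sample's
    geometric mean, which removes the arbitrary sequencing-depth scale.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Two samples with the same relative abundances but 10x different depth
a = np.array([[10, 20, 70],
              [100, 200, 700]])
z = clr(a)
```

After the transform, each sample's values sum to zero by construction, so downstream analyses operate on ratios rather than raw, depth-dependent counts.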


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Charlie M. Carpenter ◽  
Daniel N. Frank ◽  
Kayla Williamson ◽  
Jaron Arbet ◽  
Brandie D. Wagner ◽  
...  

Abstract Background The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities typically are of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. Results We developed the R package “tidyMicro” to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools) supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts’ control over workflow. A complete vignette is provided to aid new users in analysis workflow. Conclusions tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions of standard analyses to improve the interpretability of results while maintaining object malleability to encourage open source collaboration.
The simple examples and full workflow from the package are reproducible and applicable to external data sets.


2018 ◽  
Vol 35 (13) ◽  
pp. 2348-2348 ◽  
Author(s):  
Zhenwei Dai ◽  
Sunny H Wong ◽  
Jun Yu ◽  
Yingying Wei

2018 ◽  
Author(s):  
Uri Shaham

Abstract Biological measurements often contain systematic errors, also known as “batch effects”, which may invalidate downstream analysis when not handled correctly. The problem of removing batch effects is of major importance in the biological community. Despite recent advances in this direction via deep learning techniques, most current methods may not fully preserve the true biological patterns the data contains. In this work we propose a deep learning approach for batch effect removal. The crux of our approach is learning a batch-free encoding of the data, representing its intrinsic biological properties, but not batch effects. In addition, we also encode the systematic factors through a decoding mechanism and require accurate reconstruction of the data. Altogether, this allows us to fully preserve the true biological patterns represented in the data. Experimental results are reported on data obtained from two high throughput technologies, mass cytometry and single-cell RNA-seq. Beyond good performance on training data, we also observe that our system performs well on test data obtained from new patients, which was not available at training time. Our method is easy to use; publicly available code can be found at https://github.com/ushaham/BatchEffectRemoval2018.


2007 ◽  
Vol 15 (02) ◽  
pp. 187-211
Author(s):  
DAVID Y CHOI ◽  
SUSAN ELKINAWY ◽  
STACEY H WANG

U.S.-based stock exchanges (e.g., NASDAQ) continue to be the financial markets of choice for IPOs among Asian entrepreneurs, although many Asian firms, upon listing, have performed poorly. This paper surveys the existing literature and summarizes the main advantages and disadvantages associated with a U.S. listing by an Asia-based venture. The paper also examines the post-IPO experiences of four Asian companies, which illuminate issues that are particularly relevant for Asian entrepreneurs. Our findings indicate that a U.S. listing can provide Asian companies with increased liquidity, visibility, and business opportunities. But being public in the U.S. brings disadvantages that include disclosure and reporting requirements, risk of lawsuits, and additional expenses. Our case studies reveal that successful U.S. listings require Asian firms to fully commit to transparency and investor relations programs. Based on our findings, we develop practical guidelines as to when a U.S. IPO may be sensible for an Asian venture.


2016 ◽  
Vol 42 (1-2) ◽  
pp. 1-90 ◽  
Author(s):  
Philippe Schlenker ◽  
Emmanuel Chemla ◽  
Anne M. Schel ◽  
James Fuller ◽  
Jean-Pierre Gautier ◽  
...  

Abstract We argue that rich data gathered in experimental primatology in the last 40 years can benefit from analytical methods used in contemporary linguistics. Focusing on the syntactic and especially semantic side, we suggest that these methods could help clarify five questions: (i) what morphology and syntax, if any, do monkey calls have? (ii) what is the ‘lexical meaning’ of individual calls? (iii) how are the meanings of individual calls combined? (iv) how do calls or call sequences compete with each other when several are appropriate in a given situation? (v) how did the form and meaning of calls evolve? We address these questions in five case studies pertaining to cercopithecines (Putty-nosed monkeys, Blue monkeys, and Campbell’s monkeys), colobinae (Guereza monkeys and King Colobus monkeys), and New World monkeys (Titi monkeys).


2020 ◽  
Author(s):  
Tiansheng Zhu ◽  
Guo-Bo Chen ◽  
Chunhui Yuan ◽  
Rui Sun ◽  
Fangfei Zhang ◽  
...  

Abstract Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. In order to facilitate the evaluation and correction of batch effects, here we present BatchServer, an open-source R/Shiny based user-friendly interactive graphical web platform for batch effects analysis. In BatchServer we introduced autoComBat, a modified version of ComBat, which is the most widely adopted tool for batch effect correction. BatchServer uses PVCA (Principal Variance Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) for the evaluation and visualization of batch effects. We demonstrate its application in multiple proteomics and transcriptomic data sets. BatchServer is provided at https://lifeinfo.shinyapps.io/batchserver/ as a web server. The source codes are freely available at https://github.com/guomics-lab/batch_server.
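ComBat, which autoComBat modifies, adjusts each feature's per-batch location and scale, with empirical Bayes shrinkage of the batch parameters. The sketch below illustrates only the underlying location/scale idea, without the shrinkage step; all names and the toy data are illustrative, not BatchServer's implementation:

```python
import numpy as np

def location_scale_adjust(X, batches):
    """Align each feature's per-batch mean and variance to the pooled
    mean and variance (the location/scale core of ComBat, minus the
    empirical Bayes shrinkage of batch parameters)."""
    X = np.asarray(X, dtype=float)
    grand_mean, grand_std = X.mean(axis=0), X.std(axis=0)
    out = np.empty_like(X)
    for b in np.unique(batches):
        idx = batches == b
        mu, sd = X[idx].mean(axis=0), X[idx].std(axis=0)
        sd = np.where(sd == 0, 1.0, sd)  # guard against constant features
        out[idx] = (X[idx] - mu) / sd * grand_std + grand_mean
    return out

# Two batches of the same signal, the second shifted by a constant offset
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X[20:] += 5.0                            # batch 2 offset
batches = np.array([1] * 20 + [2] * 20)
Xc = location_scale_adjust(X, batches)
```

After adjustment both batches share the pooled per-feature mean, so the artificial offset no longer dominates downstream analysis; the empirical Bayes step in real ComBat additionally stabilizes these per-batch estimates when batches are small.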


2018 ◽  
Author(s):  
Chenhao Li ◽  
Lisa Tucker-Kellogg ◽  
Niranjan Nagarajan

Abstract A growing body of literature points to the important roles that different microbial communities play in diverse natural environments and the human body. The dynamics of these communities are driven by a range of microbial interactions from symbiosis to predator-prey relationships, the majority of which are poorly understood, making it hard to predict the response of the community to different perturbations. With the increasing availability of high-throughput sequencing-based community composition data, it is now conceivable to directly learn models that explicitly define microbial interactions and explain community dynamics. The applicability of these approaches is, however, affected by several experimental limitations, particularly the compositional nature of sequencing data. We present a new computational approach (BEEM) that addresses this key limitation in the inference of generalised Lotka-Volterra models (gLVMs) by coupling biomass estimation and model inference in an expectation-maximization-like algorithm. Surprisingly, BEEM outperforms state-of-the-art methods for inferring gLVMs, while simultaneously eliminating the need for additional experimental biomass data as input. BEEM’s application to previously inaccessible public datasets (due to the lack of biomass data) allowed us for the first time to analyse microbial communities in the human gut on a per individual basis, revealing personalised dynamics and keystone species.
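The generalised Lotka-Volterra models that BEEM infers describe each taxon's growth as dx_i/dt = x_i (mu_i + sum_j A_ij x_j), with growth rates mu and an interaction matrix A. A minimal forward-Euler simulation of such a model is sketched below; the parameter values are invented for illustration, and this is the forward model only, not BEEM's inference procedure:

```python
import numpy as np

def simulate_glv(x0, mu, A, dt=0.01, steps=2000):
    """Forward-Euler integration of generalised Lotka-Volterra dynamics:
    dx_i/dt = x_i * (mu_i + sum_j A_ij * x_j)."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = np.maximum(x + dt * x * (mu + A @ x), 0.0)  # abundances stay non-negative
        traj.append(x.copy())
    return np.array(traj)

# Two competing taxa with equal growth rates and weak cross-competition;
# solving mu + A x = 0 gives the coexistence equilibrium x = (2/3, 2/3)
mu = np.array([1.0, 1.0])
A = np.array([[-1.0, -0.5],
              [-0.5, -1.0]])
traj = simulate_glv([0.1, 0.9], mu, A)
```

BEEM's contribution is the inverse problem: estimating mu, A, and per-sample biomass jointly from compositional sequencing data, alternating between biomass estimation and model fitting.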


2019 ◽  
Author(s):  
Yuchen Yang ◽  
Gang Li ◽  
Huijun Qian ◽  
Kirk C. Wilhelmsen ◽  
Yin Shen ◽  
...  

Abstract Batch effect correction has been recognized to be indispensable when integrating single-cell RNA sequencing (scRNA-seq) data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve the effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, we propose SMNN for batch effect correction of scRNA-seq data via supervised mutual nearest neighbor detection. Our extensive evaluations in simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches over MNN, Seurat v3, and LIGER. Furthermore, SMNN retains more cell type-specific features, partially manifested by differentially expressed genes identified between cell types after SMNN correction being biologically more relevant, with precision improving by up to 841%.
Key Points
- Batch effect correction has been recognized to be critical when integrating scRNA-seq data from multiple batches due to systematic differences in time points, generating laboratory and/or handling technician(s), experimental protocol, and/or sequencing platform.
- Existing batch effect correction methods that leverage information from mutual nearest neighbors across batches (for example, implemented in SC3 or Seurat) ignore cell type information and risk mismatching single cells from different cell types across batches, which would lead to undesired correction results, especially when variation from batch effects is non-negligible compared with biological effects.
- To address this critical issue, here we present SMNN, a supervised machine learning method that first takes cluster/cell-type label information from users, or infers it from scRNA-seq clustering, and then searches for mutual nearest neighbors within each cell type instead of searching globally.
- Our SMNN method shows clear advantages over three state-of-the-art batch effect correction methods: it better mixes cells of the same cell type across batches and more effectively recovers cell-type-specific features, in both simulations and real datasets.
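The mutual nearest neighbor search at the core of MNN-based correction (which SMNN restricts to cells sharing a type label) can be sketched as follows: a pair (i, j) is kept only when cell j is among i's nearest neighbors in the other batch and vice versa. This brute-force version is illustrative only and not SMNN's implementation:

```python
import numpy as np

def mutual_nearest_neighbors(X, Y, k=3):
    """Return (i, j) index pairs such that Y[j] is among the k nearest
    neighbors of X[i] and X[i] is among the k nearest neighbors of Y[j],
    using brute-force Euclidean distances (rows are cells)."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise distances
    nn_xy = np.argsort(d, axis=1)[:, :k]    # k NNs in Y for each cell of X
    nn_yx = np.argsort(d.T, axis=1)[:, :k]  # k NNs in X for each cell of Y
    return [(i, int(j)) for i in range(len(X))
            for j in nn_xy[i] if i in nn_yx[j]]

# Two tiny "batches" with an obvious one-to-one correspondence
X = np.array([[0.0, 0.0], [10.0, 0.0]])
Y = np.array([[0.1, 0.0], [10.1, 0.0]])
pairs = mutual_nearest_neighbors(X, Y, k=1)
```

SMNN's supervision step amounts to running this search separately per cell type, which prevents the mismatched cross-type pairs the abstract warns about.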


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effect in single-cell RNA sequencing (scRNA-seq) data remains a big challenge when integrating different datasets. Here, we proposed deepMNN, a novel deep learning-based method to correct batch effect in scRNA-seq data. We first searched mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network was constructed by stacking two residual blocks and further applied for the removal of batch effects. The loss function of deepMNN was defined as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss encouraged the output of the network to remain similar to the input. The experimental results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets as well. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used methods of Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods of MMD-ResNet and scGen. The results demonstrated that deepMNN achieved a better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed for integrating scRNA-seq datasets with multiple batches in one step. Furthermore, deepMNN ran much faster than the other methods for large-scale datasets. These characteristics make deepMNN a promising choice for large-scale single-cell gene expression data analysis.

