Covariance adjustment for batch effect in gene expression data

Jung Ae Lee; Kevin K. Dobbin; Jeongyoun Ahn

doi:10.1002/sim.6157

CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect

Briefings in Bioinformatics ◽

10.1093/bib/bbz116 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1818-1824 ◽

Cited By ~ 1

Author(s):

Qi Zhao ◽

Yu Sun ◽

Zekun Liu ◽

Hongwan Zhang ◽

Xingyang Li ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Signature ◽

Unsupervised Clustering ◽

Batch Effect ◽

Consensus Clustering ◽

Expression Data ◽

Personalized Care ◽

Cancer Subtypes ◽

Multiple Datasets

Abstract Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains difficult because of batch effects. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without requiring batch effect adjustment. CrossICC uses an iterative strategy to derive the optimal gene signature and cluster numbers from a consensus similarity matrix generated by consensus clustering. The package also provides functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC can be used to discover robust cancer subtypes with significant translational implications for personalized care of cancer patients. Availability and Implementation: The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.
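
The consensus similarity matrix mentioned in the abstract can be illustrated with a minimal sketch. This is not CrossICC's implementation (the package is written in R, and its iterative gene-signature selection is not shown); it only assumes the generic consensus clustering idea: cluster many random subsamples and record how often each pair of samples lands in the same cluster. The function names, the plain k-means base clusterer, and the toy data are all assumptions made for illustration.

```python
import random


def kmeans_labels(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(samples, k, n_runs=50, frac=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair of samples shares a cluster."""
    rng = random.Random(seed)
    n = len(samples)
    together = [[0] * n for _ in range(n)]
    co_sampled = [[0] * n for _ in range(n)]
    size = max(k, int(frac * n))
    for _ in range(n_runs):
        idx = rng.sample(range(n), size)  # random subsample of sample indices
        labels = kmeans_labels([samples[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                co_sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / co_sampled[i][j] if co_sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy input (hypothetical): six samples, each described by one signature score.
toy = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
matrix = consensus_matrix(toy, k=2)
```

Entries near 1 mark pairs that cluster together consistently across subsamples, and entries near 0 mark pairs that rarely do. A final clustering would typically be read off this matrix, for example by hierarchical clustering on 1 minus the consensus value.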

A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data

The Pharmacogenomics Journal ◽

10.1038/tpj.2010.57 ◽

2010 ◽

Vol 10 (4) ◽

pp. 278-291 ◽

Cited By ~ 164

Author(s):

J Luo ◽

M Schumacher ◽

A Scherer ◽

D Sanoudou ◽

D Megherbi ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Prediction Performance ◽

Batch Effect ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Batch effect removal methods for microarray gene expression data integration: a survey

Briefings in Bioinformatics ◽

10.1093/bib/bbs037 ◽

2012 ◽

Vol 14 (4) ◽

pp. 469-490 ◽

Cited By ~ 153

Author(s):

C. Lazar ◽

S. Meganck ◽

J. Taminau ◽

D. Steenhoff ◽

A. Coletta ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Batch Effect ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

10.1101/2020.10.08.332080 ◽

2020 ◽

Author(s):

Shaoheng Liang ◽

Jinzhuang Dou ◽

Ramiz Iqbal ◽

Ken Chen

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Simulated Data ◽

Batch Effect ◽

Mouse Retina ◽

Expression Data ◽

Retina Development ◽

Cell Gene Expression ◽

Cell Gene

Abstract Clustering and visualization are essential parts of single-cell gene expression data analysis. The Euclidean distance used in most distance-based methods is not optimal. Batch effect, i.e., the variability among samples gathered at different times and from different tissues and patients, introduces large between-group distances and obscures the true identities of cells. To solve this problem, we introduce Batch-Corrected Distance (BCD), a metric that uses the temporal/spatial locality of the batch effect to control for such factors. We validate BCD on simulated data and apply it to a mouse retina development dataset and a lung dataset. We also found our approach useful in understanding the progression of Coronavirus Disease 2019 (COVID-19). BCD achieves more accurate clusters and better visualizations than state-of-the-art batch correction methods on longitudinal datasets. BCD can be directly integrated with most clustering and visualization methods to enable more scientific findings.

Scaling Method for Batch Effect Correction to Gene Expression Data Based on Spectral Clustering

Current Bioinformatics ◽

10.2174/1574893615999200818093540 ◽

2020 ◽

Vol 15 ◽

Author(s):

Momo Matsuda ◽

Xiucai Ye ◽

Tetsuya Sakurai

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Spectral Clustering ◽

Optimization Problem ◽

Genomic Analysis ◽

Batch Effect ◽

Expression Data ◽

Batch Effects ◽

Scaling Method ◽

S Model

Background: Batch effects are commonly introduced in gene expression data and can dramatically reduce the accuracy of statistical inference in genomic analysis, since samples in different batches are not directly comparable. Objective: To accurately measure biological variability and obtain correct statistical inference, we aimed to correct or remove batch effects so that samples from different batches can be merged into a comparable dataset for high-throughput genomic analysis. Methods: The existing L/S model uses empirical Bayes methods to find the constant values for multiplication/addition for each gene. Unlike the L/S model, we used a dimensionality reduction method. We proposed an effective scaling method that scales each gene by multiplying it by a constant value, formulated as an optimization problem based on spectral clustering. The data samples from different batches can then be merged into a comparable dataset with batch effect correction. Furthermore, we proposed an approximate solution to the optimization problem for the scaling adjustment values. Results: We evaluated the proposed method on both artificial and gene expression datasets by comparing it with existing well-established batch effect correction methods. Numerical experiments show that the proposed method projects the data samples from different batches to resemble each other and outperforms the other methods on both microarray and single-cell RNA-seq datasets. Conclusion: The scaling adjustment for genes and dimensionality reduction improved the accuracy and removed the batch effects, thereby making the proposed method more robust to interfering genes.
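
For orientation, the per-gene "multiplication/addition" constants of an L/S (location and scale) adjustment can be sketched as follows. This is a simplified illustration, not the authors' method and not the empirical Bayes estimator the abstract refers to: it assumes each gene in each batch is simply shifted to the grand mean and rescaled to the pooled within-batch standard deviation, with no shrinkage across genes. The function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def location_scale_adjust(expr, batches):
    """Return a batch-adjusted copy of expr (rows = samples, columns = genes)."""
    n_samples, n_genes = len(expr), len(expr[0])
    groups = {}
    for i, label in enumerate(batches):
        groups.setdefault(label, []).append(i)
    adjusted = [list(row) for row in expr]
    for g in range(n_genes):
        column = [row[g] for row in expr]
        grand_mean = mean(column)
        # Per-batch location (mean) and squared scale (population variance).
        stats = {
            b: (mean([column[i] for i in idx]), pvariance([column[i] for i in idx]))
            for b, idx in groups.items()
        }
        # Pooled within-batch variance: batch variances weighted by batch size.
        pooled_sd = sqrt(
            sum(len(groups[b]) * var for b, (_, var) in stats.items()) / n_samples
        )
        for b, idx in groups.items():
            b_mean, b_var = stats[b]
            # Multiplicative constant; fall back to 1 when a batch has no spread.
            scale = pooled_sd / sqrt(b_var) if b_var > 0 else 1.0
            for i in idx:
                # Additive part: remove the batch mean, restore the grand mean.
                adjusted[i][g] = (column[i] - b_mean) * scale + grand_mean
    return adjusted


# Toy data (hypothetical): 6 samples x 2 genes, batch B shifted and stretched.
expr = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0],
        [11.0, 50.0], [14.0, 60.0], [17.0, 70.0]]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted = location_scale_adjust(expr, batches)
```

After adjustment, every gene has the same mean and spread in both batches. Note that this also erases any genuine biological difference that is confounded with batch, which is one reason more careful estimators are used in practice.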

Cancer Classification from Gene Expression data using Fuzzy-Rough techniques: An Empirical Study

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i6.415420 ◽

2018 ◽

Vol 6 (6) ◽

pp. 415-420

Author(s):

Ansuman Kumar ◽

Anindya Halder

Keyword(s):

Gene Expression ◽

Empirical Study ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data

Statistical methods for analysis of time course gene expression data

Frontiers in Bioscience ◽

10.2741/a743 ◽

2002 ◽

Vol 7 (1) ◽

pp. a90-98 ◽

Cited By ~ 5

Author(s):

Hongzhe Li

Keyword(s):

Gene Expression ◽

Statistical Methods ◽

Gene Expression Data ◽

Time Course ◽

Expression Data

Faculty Opinions recommendation of A new type of stochastic dependence revealed in gene expression data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1032265.370760 ◽

2006 ◽

Author(s):

Arcady Mushegian

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Data ◽

Stochastic Dependence ◽

New Type

Faculty Opinions recommendation of A systematic comparison and evaluation of biclustering methods for gene expression data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1087930.540878 ◽

2007 ◽

Author(s):

Daniel Chamovitz

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Data ◽

Systematic Comparison

Faculty Opinions recommendation of CAERUS: predicting CAncER oUtcomeS using relationship between protein structural information, protein networks, gene expression data, and mutation data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.11029956.11977055 ◽

2011 ◽

Author(s):

Yuanpeng Janet Huang

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Structural Information ◽

Protein Networks ◽

Expression Data ◽

Cancer Outcomes ◽

Mutation Data
