Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data

2017 ◽  
Vol 27 (11) ◽  
pp. 1930-1938 ◽  
Author(s):  
Mingxiang Teng ◽  
Rafael A. Irizarry

Abstract The main application of ChIP-seq technology is the detection of genomic regions that bind a protein of interest. A large part of public functional genomics catalogs is based on ChIP-seq data. These catalogs rely on peak-calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance, given that the experimental protocol lacks perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage of ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning, the GC effect varies across experiments, and it is strong enough that a substantial number of peaks are called differently when different laboratories perform experiments on the same cell line. However, accounting for GC content in ChIP-seq is challenging because the binding sites of interest tend to be more common in high-GC regions, which confounds real biological signal with the unwanted variability. To address this challenge, we introduce a statistical approach that accounts for GC effects on both the non-specific background and the signal induced by binding. The method can be used to account for this bias when quantifying binding, as well as to improve existing peak-calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
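The core idea can be illustrated with a minimal Python sketch (not the authors' implementation): stratify coverage by each bin's GC content, estimate a GC-dependent background expectation, and rescale observed counts before peak calling. The array names `counts` and `gc` and both helper functions are hypothetical.

```python
# Minimal sketch: estimate a GC-dependent background curve and rescale bin
# counts so the expected background coverage is flat across GC content.
# `counts` = mapped reads per fixed-width genomic bin, `gc` = GC fraction per bin.
import numpy as np

def gc_background_curve(counts, gc, n_strata=50):
    """Median coverage within each GC stratum (hypothetical helper)."""
    edges = np.linspace(0.0, 1.0, n_strata + 1)
    strata = np.clip(np.digitize(gc, edges) - 1, 0, n_strata - 1)
    curve = np.full(n_strata, np.nan)
    for s in range(n_strata):
        in_s = strata == s
        if in_s.any():
            curve[s] = np.median(counts[in_s])
    # Fill empty strata with the global median so every bin has an expectation.
    curve = np.where(np.isnan(curve), np.nanmedian(curve), curve)
    return strata, curve

def gc_adjusted_counts(counts, gc, n_strata=50):
    """Rescale counts toward a GC-flat background before enrichment testing."""
    strata, curve = gc_background_curve(counts, gc, n_strata)
    expected = curve[strata]
    return counts * (np.median(counts) / np.maximum(expected, 1e-8))

# Toy example: bins whose high coverage is explained purely by GC bias are
# shrunk toward the genome-wide background, reducing false-positive peaks.
rng = np.random.default_rng(0)
gc = rng.uniform(0.3, 0.7, size=10_000)
counts = rng.poisson(5 + 20 * gc)   # background whose level depends only on GC
adjusted = gc_adjusted_counts(counts, gc)
```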


2017 ◽  
Author(s):  
Fabrizio Mafessoni ◽  
Rashmi B Prasad ◽  
Leif Groop ◽  
Ola Hansson ◽  
Kay Prüfer

Abstract It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors, where about 1% of the higher-frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset. Our results show that batch effects can be turned into a virtue by using the resulting variation in large-scale datasets to detect systematic errors.
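As an illustration of the co-occurrence idea (a simplified sketch, not the published method), one can compare how often carriers of cross-chromosome variant pairs overlap against the overlap expected under independence; pairs with a large excess across many partners are candidates for shared batch artifacts. The `genotypes` structure and the function name below are hypothetical.

```python
# Sketch: observed vs expected carrier overlap for variant pairs that sit on
# different chromosomes (same-chromosome pairs are skipped to avoid linkage).
from itertools import combinations

def cooccurrence_excess(genotypes, n_individuals):
    """genotypes: dict mapping (chrom, pos) -> set of carrier sample IDs."""
    results = []
    for (v1, carriers1), (v2, carriers2) in combinations(genotypes.items(), 2):
        if v1[0] == v2[0]:
            continue
        observed = len(carriers1 & carriers2)
        expected = len(carriers1) * len(carriers2) / n_individuals
        results.append(((v1, v2), observed, expected))
    return results

# Toy example: two variants on different chromosomes carried by exactly the
# same individuals co-occur far more often than independence predicts.
genotypes = {("chr1", 100): {"s1", "s2", "s3"},
             ("chr2", 200): {"s1", "s2", "s3"},
             ("chr3", 300): {"s7"}}
print(cooccurrence_excess(genotypes, n_individuals=100))
```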


2012 ◽  
Vol 40 (10) ◽  
pp. e72-e72 ◽  
Author(s):  
Yuval Benjamini ◽  
Terence P. Speed

2018 ◽  
Author(s):  
Christopher M. Ward ◽  
Hein To ◽  
Stephen M Pederson

Abstract
Motivation: High-throughput next-generation sequencing (NGS) has become exceedingly cheap, facilitating studies with large sample numbers. Quality control (QC) is an essential stage in analytic pipelines and is supported by the outputs of popular bioinformatics tools such as FastQC and Picard. Although these tools provide considerable power when carrying out QC, large sample numbers can make identification of systematic bias a challenge.
Results: We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import into R of FastQC output as well as output from aligners such as HISAT2, STAR and Bowtie2. Visualization can be carried out across many samples using heatmaps rendered with ggplot2 and plotly, and these can be displayed in an interactive shiny app or an HTML report. We also provide methods to assess observed GC content in an organism-dependent manner for both transcriptomic and genomic datasets. Importantly, hierarchical clustering can be carried out on heatmaps with large sample sizes to quickly identify outliers and batch effects.
Availability and Implementation: ngsReports is available at https://github.com/UofABioinformaticsHub/ngsReports.
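ngsReports itself is an R package; the sketch below only illustrates the general aggregate-and-cluster idea in Python, assuming FastQC's usual tab-separated summary.txt layout ("status<TAB>module<TAB>filename"). Directory layout and function names are hypothetical, and the clustering step mirrors the hierarchical clustering of heatmap rows described above.

```python
# Sketch: encode per-sample FastQC PASS/WARN/FAIL flags as a matrix and
# hierarchically cluster samples to surface outliers and batch structure.
from pathlib import Path
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

STATUS = {"PASS": 0, "WARN": 1, "FAIL": 2}

def load_fastqc_flags(fastqc_dirs):
    """Read summary.txt from each FastQC output directory into a flag matrix."""
    samples, modules, rows = [], None, []
    for d in map(Path, fastqc_dirs):
        flags = {}
        for line in (d / "summary.txt").read_text().splitlines():
            status, module, _ = line.split("\t")
            flags[module] = STATUS[status]
        if modules is None:
            modules = sorted(flags)
        samples.append(d.name)
        rows.append([flags.get(m, 0) for m in modules])
    return samples, modules, np.array(rows, dtype=float)

def cluster_samples(matrix, n_clusters=2):
    """Ward clustering on QC-flag profiles; outliers tend to form small clusters."""
    z = linkage(matrix, method="ward")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```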


2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Mingxiang Teng ◽  
Dongliang Du ◽  
Danfeng Chen ◽  
Rafael A Irizarry

Abstract Multiple sources of variability can bias the transcription factor (TF) binding profiles inferred from ChIP-seq data. As ChIP-seq datasets grow in public repositories, it is now possible and necessary to account for complex sources of variability in ChIP-seq data analysis. We find that two types of variability, batch effects introduced by sequencing laboratories and differences between biological replicates that are not associated with changes in condition or state, vary across genomic sites. This implies that observed differences between samples from different conditions or states, such as cell type, must be assessed statistically, with an understanding of the distribution of the obscuring noise. We present a statistical approach that characterizes both the differences of interest and these sources of variability through the parameters of a mixed-effects model. We demonstrate the utility of our approach on a CTCF binding dataset composed of 211 samples representing 90 different cell types measured across three different laboratories. The results revealed that sites exhibiting large variability were associated with sequence characteristics such as GC content and low complexity. Finally, we identified TFs associated with high-variance CTCF sites using TF motifs documented in public databases, pointing to the possibility that these sites are false positives if the sources of variability are not properly accounted for.
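A minimal per-site sketch of this kind of mixed-effects model, assuming a tidy table with hypothetical columns `signal`, `cell_type` and `lab` (this is not the authors' implementation), might look as follows with statsmodels:

```python
# Sketch: for one binding site, model normalized ChIP-seq signal with cell type
# as a fixed effect and sequencing laboratory as a random effect, so cross-lab
# batch variance is separated from the biological differences of interest.
import pandas as pd
import statsmodels.formula.api as smf

def fit_site_model(df: pd.DataFrame):
    """df columns (assumed): 'signal', 'cell_type', 'lab' for a single site."""
    model = smf.mixedlm("signal ~ C(cell_type)", data=df, groups=df["lab"])
    result = model.fit(reml=True)
    # result.cov_re holds the estimated lab-level (batch) variance;
    # result.scale holds the residual (replicate-level) variance.
    return result
```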


2015 ◽  
Author(s):  
Stephanie C Hicks ◽  
F. William Townes ◽  
Mingxiang Teng ◽  
Rafael A Irizarry

Until recently, high-throughput gene expression technology, such as RNA-sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-seq (scRNA-seq) is the most widely used of these technologies, and numerous publications are based on data it produces. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike in RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a level sufficient to be detected by the sequencing technology). Another difference is that the proportion of genes reporting zero expression varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation, by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower-expressed genes. In addition, this missing-data problem is exacerbated by the fact that the technical variation varies cell to cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
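A simple diagnostic in the spirit of this analysis (a simplification, not the paper's exact procedure) compares each gene's observed zero fraction across cells to the zero fraction expected under a Poisson model with that gene's mean count; the matrix name `counts` is hypothetical.

```python
# Sketch: quantify excess zeros per gene relative to a Poisson expectation.
# `counts` is a genes-by-cells integer matrix of raw scRNA-seq counts.
import numpy as np

def zero_inflation_per_gene(counts):
    counts = np.asarray(counts)
    mean_per_gene = counts.mean(axis=1)
    observed_zero_frac = (counts == 0).mean(axis=1)
    expected_zero_frac = np.exp(-mean_per_gene)      # P(X = 0) under Poisson(mean)
    return observed_zero_frac - expected_zero_frac   # > 0 means excess zeros

# Genes with large positive values and low mean expression are candidates for
# technically driven (rather than biologically driven) zeros.
```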


1978 ◽  
Vol 48 ◽  
pp. 7-29
Author(s):  
T. E. Lutz

This review paper deals with the use of statistical methods to evaluate systematic and random errors associated with trigonometric parallaxes. First, systematic errors which arise when using trigonometric parallaxes to calibrate luminosity systems are discussed. Next, determination of the external errors of parallax measurement is reviewed. Observatory corrections are discussed. Schilt's point, that because the causes of these systematic differences between observatories are not known the computed corrections cannot be applied appropriately, is emphasized. However, modern parallax work is sufficiently accurate that it is necessary to determine observatory corrections if full use is to be made of the potential precision of the data. To this end, it is suggested that a prior experimental design is required. Past experience has shown that accidental overlap of observing programs will not suffice to determine observatory corrections that are meaningful.


1988 ◽  
Vol 102 ◽  
pp. 215
Author(s):  
R.M. More ◽  
G.B. Zimmerman ◽  
Z. Zinamon

Autoionization and dielectronic attachment are usually omitted from rate equations for the non-LTE average-atom model, causing systematic errors in predicted ionization states and electronic populations for atoms in hot dense plasmas produced by laser irradiation of solid targets. We formulate a method by which dielectronic recombination can be included in average-atom calculations without conflict with the principle of detailed balance. The essential new feature in this extended average-atom model is a treatment of strong correlations of electron populations induced by the dielectronic attachment process.
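For orientation, the detailed-balance constraint can be written in the standard Saha-type textbook form (this expression is not quoted from the paper): the rate coefficient for dielectronic capture into a doubly excited state d from ion state i is fixed by the autoionization rate A_a of d.

```latex
% Standard detailed-balance (Saha-type) relation between dielectronic capture
% and autoionization, shown schematically; not taken from the paper itself.
\[
  \alpha_{\mathrm{cap}}(T_e)
  = \frac{g_d}{2 g_i}
    \left(\frac{h^2}{2\pi m_e k_B T_e}\right)^{3/2}
    A_a \, e^{-E_c / k_B T_e}
\]
% Here E_c is the kinetic energy of the captured electron, g_d and g_i are the
% statistical weights of the doubly excited and initial ion states, and T_e is
% the electron temperature. Imposing a relation of this form keeps capture and
% autoionization mutually consistent in the rate equations.
```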

