Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data

2017 ◽  
Vol 27 (11) ◽  
pp. 1930-1938 ◽  
Author(s):  
Mingxiang Teng ◽  
Rafael A. Irizarry

Abstract The main application of ChIP-seq technology is the detection of genomic regions that bind a protein of interest. A large part of public functional genomics catalogs is based on ChIP-seq data. These catalogs rely on peak-calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance, given that the experimental protocol lacks perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage of ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning, the GC effect varies across experiments, and it is strong enough that a substantial number of peaks are called differently when different laboratories perform experiments on the same cell line. However, accounting for GC content in ChIP-seq is challenging because the binding sites of interest tend to be more common in high-GC regions, which confounds real biological signal with the unwanted variability. To address this challenge, we introduce a statistical approach that accounts for GC effects on both the non-specific background and the signal induced by binding. The method can be used to account for this bias when quantifying binding, as well as to improve existing peak-calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
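The core idea can be illustrated with a minimal Python sketch (not the authors' implementation): stratify coverage by each bin's GC content, estimate a GC-dependent background expectation, and rescale observed counts before peak calling. The array names `counts` and `gc` and both helper functions are hypothetical.

```python
# Minimal sketch: estimate a GC-dependent background curve and rescale bin
# counts so the expected background coverage is flat across GC content.
# `counts` = mapped reads per fixed-width genomic bin, `gc` = GC fraction per bin.
import numpy as np

def gc_background_curve(counts, gc, n_strata=50):
    """Median coverage within each GC stratum (hypothetical helper)."""
    edges = np.linspace(0.0, 1.0, n_strata + 1)
    strata = np.clip(np.digitize(gc, edges) - 1, 0, n_strata - 1)
    curve = np.full(n_strata, np.nan)
    for s in range(n_strata):
        in_s = strata == s
        if in_s.any():
            curve[s] = np.median(counts[in_s])
    # Fill empty strata with the global median so every bin has an expectation.
    curve = np.where(np.isnan(curve), np.nanmedian(curve), curve)
    return strata, curve

def gc_adjusted_counts(counts, gc, n_strata=50):
    """Rescale counts toward a GC-flat background before enrichment testing."""
    strata, curve = gc_background_curve(counts, gc, n_strata)
    expected = curve[strata]
    return counts * (np.median(counts) / np.maximum(expected, 1e-8))

# Toy example: bins whose high coverage is explained purely by GC bias are
# shrunk toward the genome-wide background, reducing false-positive peaks.
rng = np.random.default_rng(0)
gc = rng.uniform(0.3, 0.7, size=10_000)
counts = rng.poisson(5 + 20 * gc)   # background whose level depends only on GC
adjusted = gc_adjusted_counts(counts, gc)
```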


2017 ◽  
Author(s):  
Fabrizio Mafessoni ◽  
Rashmi B Prasad ◽  
Leif Groop ◽  
Ola Hansson ◽  
Kay Prüfer

Abstract It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors, where about 1% of the higher-frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset. Our results show that batch effects can be turned into a virtue by using the resulting variation in large-scale datasets to detect systematic errors.
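As an illustration of the co-occurrence idea (a simplified sketch, not the published method), one can compare how often carriers of cross-chromosome variant pairs overlap against the overlap expected under independence; pairs with a large excess across many partners are candidates for shared batch artifacts. The `genotypes` structure and the function name below are hypothetical.

```python
# Sketch: observed vs expected carrier overlap for variant pairs that sit on
# different chromosomes (same-chromosome pairs are skipped to avoid linkage).
from itertools import combinations

def cooccurrence_excess(genotypes, n_individuals):
    """genotypes: dict mapping (chrom, pos) -> set of carrier sample IDs."""
    results = []
    for (v1, carriers1), (v2, carriers2) in combinations(genotypes.items(), 2):
        if v1[0] == v2[0]:
            continue
        observed = len(carriers1 & carriers2)
        expected = len(carriers1) * len(carriers2) / n_individuals
        results.append(((v1, v2), observed, expected))
    return results

# Toy example: two variants on different chromosomes carried by exactly the
# same individuals co-occur far more often than independence predicts.
genotypes = {("chr1", 100): {"s1", "s2", "s3"},
             ("chr2", 200): {"s1", "s2", "s3"},
             ("chr3", 300): {"s7"}}
print(cooccurrence_excess(genotypes, n_individuals=100))
```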


2012 ◽  
Vol 40 (10) ◽  
pp. e72-e72 ◽  
Author(s):  
Yuval Benjamini ◽  
Terence P. Speed

2018 ◽  
Author(s):  
Christopher M. Ward ◽  
Hein To ◽  
Stephen M Pederson

Abstract
Motivation: High-throughput next-generation sequencing (NGS) has become exceedingly cheap, facilitating studies with large sample numbers. Quality control (QC) is an essential stage in analytic pipelines and is supported by the outputs of popular bioinformatics tools such as FastQC and Picard. Although these tools provide considerable power when carrying out QC, large sample numbers can make identification of systematic bias a challenge.
Results: We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import into R of FastQC output as well as output from aligners such as HISAT2, STAR and Bowtie2. Visualization can be carried out across many samples using heatmaps rendered with ggplot2 and plotly, and these can be displayed in an interactive shiny app or an HTML report. We also provide methods to assess observed GC content in an organism-dependent manner for both transcriptomic and genomic datasets. Importantly, hierarchical clustering can be carried out on heatmaps with large sample sizes to quickly identify outliers and batch effects.
Availability and Implementation: ngsReports is available at https://github.com/UofABioinformaticsHub/ngsReports.
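ngsReports itself is an R package; the sketch below only illustrates the general aggregate-and-cluster idea in Python, assuming FastQC's usual tab-separated summary.txt layout ("status<TAB>module<TAB>filename"). Directory layout and function names are hypothetical, and the clustering step mirrors the hierarchical clustering of heatmap rows described above.

```python
# Sketch: encode per-sample FastQC PASS/WARN/FAIL flags as a matrix and
# hierarchically cluster samples to surface outliers and batch structure.
from pathlib import Path
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

STATUS = {"PASS": 0, "WARN": 1, "FAIL": 2}

def load_fastqc_flags(fastqc_dirs):
    """Read summary.txt from each FastQC output directory into a flag matrix."""
    samples, modules, rows = [], None, []
    for d in map(Path, fastqc_dirs):
        flags = {}
        for line in (d / "summary.txt").read_text().splitlines():
            status, module, _ = line.split("\t")
            flags[module] = STATUS[status]
        if modules is None:
            modules = sorted(flags)
        samples.append(d.name)
        rows.append([flags.get(m, 0) for m in modules])
    return samples, modules, np.array(rows, dtype=float)

def cluster_samples(matrix, n_clusters=2):
    """Ward clustering on QC-flag profiles; outliers tend to form small clusters."""
    z = linkage(matrix, method="ward")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```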


2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Mingxiang Teng ◽  
Dongliang Du ◽  
Danfeng Chen ◽  
Rafael A Irizarry

Abstract Multiple sources of variability can bias the transcription factor (TF) binding profiles inferred from ChIP-seq data. As ChIP-seq datasets grow in public repositories, it is now possible and necessary to account for complex sources of variability in ChIP-seq data analysis. We find that two types of variability, batch effects introduced by sequencing laboratories and differences between biological replicates that are not associated with changes in condition or state, vary across genomic sites. This implies that observed differences between samples from different conditions or states, such as cell type, must be assessed statistically, with an understanding of the distribution of the obscuring noise. We present a statistical approach that characterizes both the differences of interest and these sources of variability through the parameters of a mixed-effects model. We demonstrate the utility of our approach on a CTCF binding dataset composed of 211 samples representing 90 different cell types measured across three different laboratories. The results revealed that sites exhibiting large variability were associated with sequence characteristics such as GC content and low complexity. Finally, we identified TFs associated with high-variance CTCF sites using TF motifs documented in public databases, pointing to the possibility that these sites are false positives if the sources of variability are not properly accounted for.
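A minimal per-site sketch of this kind of mixed-effects model, assuming a tidy table with hypothetical columns `signal`, `cell_type` and `lab` (this is not the authors' implementation), might look as follows with statsmodels:

```python
# Sketch: for one binding site, model normalized ChIP-seq signal with cell type
# as a fixed effect and sequencing laboratory as a random effect, so cross-lab
# batch variance is separated from the biological differences of interest.
import pandas as pd
import statsmodels.formula.api as smf

def fit_site_model(df: pd.DataFrame):
    """df columns (assumed): 'signal', 'cell_type', 'lab' for a single site."""
    model = smf.mixedlm("signal ~ C(cell_type)", data=df, groups=df["lab"])
    result = model.fit(reml=True)
    # result.cov_re holds the estimated lab-level (batch) variance;
    # result.scale holds the residual (replicate-level) variance.
    return result
```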


2015 ◽  
Author(s):  
Stephanie C Hicks ◽  
F. William Townes ◽  
Mingxiang Teng ◽  
Rafael A Irizarry

Until recently, high-throughput gene expression technology, such as RNA-sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-seq (scRNA-seq) is the most widely used of these technologies, and numerous publications are based on data it produces. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike in RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a level sufficient to be detected by the sequencing technology). Another difference is that the proportion of genes reporting zero expression varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation, by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower-expressed genes. In addition, this missing-data problem is exacerbated by the fact that the technical variation varies cell to cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
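A simple diagnostic in the spirit of this analysis (a simplification, not the paper's exact procedure) compares each gene's observed zero fraction across cells to the zero fraction expected under a Poisson model with that gene's mean count; the matrix name `counts` is hypothetical.

```python
# Sketch: quantify excess zeros per gene relative to a Poisson expectation.
# `counts` is a genes-by-cells integer matrix of raw scRNA-seq counts.
import numpy as np

def zero_inflation_per_gene(counts):
    counts = np.asarray(counts)
    mean_per_gene = counts.mean(axis=1)
    observed_zero_frac = (counts == 0).mean(axis=1)
    expected_zero_frac = np.exp(-mean_per_gene)      # P(X = 0) under Poisson(mean)
    return observed_zero_frac - expected_zero_frac   # > 0 means excess zeros

# Genes with large positive values and low mean expression are candidates for
# technically driven (rather than biologically driven) zeros.
```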


1978 ◽  
Vol 48 ◽  
pp. 7-29
Author(s):  
T. E. Lutz

This review paper deals with the use of statistical methods to evaluate systematic and random errors associated with trigonometric parallaxes. First, systematic errors which arise when using trigonometric parallaxes to calibrate luminosity systems are discussed. Next, determination of the external errors of parallax measurement is reviewed. Observatory corrections are discussed. Schilt's point, that because the causes of these systematic differences between observatories are not known the computed corrections cannot be applied appropriately, is emphasized. However, modern parallax work is sufficiently accurate that it is necessary to determine observatory corrections if full use is to be made of the potential precision of the data. To this end, it is suggested that a prior experimental design is required. Past experience has shown that accidental overlap of observing programs will not suffice to determine observatory corrections that are meaningful.


1988 ◽  
Vol 102 ◽  
pp. 215
Author(s):  
R.M. More ◽  
G.B. Zimmerman ◽  
Z. Zinamon

Autoionization and dielectronic attachment are usually omitted from rate equations for the non-LTE average-atom model, causing systematic errors in predicted ionization states and electronic populations for atoms in hot dense plasmas produced by laser irradiation of solid targets. We formulate a method by which dielectronic recombination can be included in average-atom calculations without conflict with the principle of detailed balance. The essential new feature in this extended average-atom model is a treatment of strong correlations of electron populations induced by the dielectronic attachment process.
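For orientation, the detailed-balance constraint can be written in the standard Saha-type textbook form (this expression is not quoted from the paper): the rate coefficient for dielectronic capture into a doubly excited state d from ion state i is fixed by the autoionization rate A_a of d.

```latex
% Standard detailed-balance (Saha-type) relation between dielectronic capture
% and autoionization, shown schematically; not taken from the paper itself.
\[
  \alpha_{\mathrm{cap}}(T_e)
  = \frac{g_d}{2 g_i}
    \left(\frac{h^2}{2\pi m_e k_B T_e}\right)^{3/2}
    A_a \, e^{-E_c / k_B T_e}
\]
% Here E_c is the kinetic energy of the captured electron, g_d and g_i are the
% statistical weights of the doubly excited and initial ion states, and T_e is
% the electron temperature. Imposing a relation of this form keeps capture and
% autoionization mutually consistent in the rate equations.
```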

