Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-Seq data

2016 ◽  
Author(s):  
Mingxiang Teng ◽  
Rafael A. Irizarry

Abstract: The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of public functional genomics catalogs is based on ChIP-seq data. These catalogs rely on peak calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance as a result of the experimental protocol's lack of perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-Seq experiments and that this variability leads to false-positive peak calls. More concerning is that the GC-effect varies across experiments, with the effect strong enough to result in a substantial number of peaks called differently when different laboratories perform experiments on the same cell line. However, accounting for GC-content in ChIP-Seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signal with the unwanted variability. To address this challenge, we introduce a statistical approach that accounts for GC-effects on both non-specific noise and signal induced by the binding site. The method can be used to account for this bias in binding quantification as well as to improve existing peak calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
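To make the notion of a GC-effect on coverage concrete, the following is a minimal Python sketch, not the authors' statistical method. It computes the GC fraction of each genomic bin and averages read counts within GC strata, which is one simple way to see whether coverage trends with GC-content. The function names, the number of strata, and the toy bins are all assumptions made for illustration.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G or C among unambiguous (A/C/G/T) bases; 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def mean_coverage_by_gc(bins, n_strata=5):
    """Average read counts within equal-width GC strata.

    `bins` is an iterable of (sequence, read_count) pairs (hypothetical input format).
    Returns {stratum_index: mean_read_count} for the strata that contain bins.
    """
    totals = defaultdict(lambda: [0, 0])  # stratum -> [sum of counts, number of bins]
    for seq, count in bins:
        # A GC fraction of exactly 1.0 is folded into the top stratum.
        stratum = min(int(gc_fraction(seq) * n_strata), n_strata - 1)
        totals[stratum][0] += count
        totals[stratum][1] += 1
    return {s: total / n for s, (total, n) in sorted(totals.items())}


# Toy data (invented for illustration): read counts rise with GC-content.
toy_bins = [("ATATATAT", 2), ("ATGCATAT", 3), ("GCGCGCAT", 9), ("GGCCGGCC", 12)]
profile = mean_coverage_by_gc(toy_bins)
```

A stratified profile like this only describes the trend; as the abstract notes, true binding sites are themselves more common in high-GC regions, so separating bias from signal needs a model that treats both.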

eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Jun Yao ◽  
Douglas C Wu ◽  
Ryan M Nottingham ◽  
Alan M Lambowitz

Human plasma contains > 40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from > 19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised intron RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons.


2018 ◽  
Author(s):  
Ahrim Youn ◽  
Eladio J. Marquez ◽  
Nathan Lawlor ◽  
Michael L. Stitzel ◽  
Duygu Ucar

ABSTRACT: Transcription factor (TF) footprinting uncovers putative protein-DNA binding via combined analyses of chromatin accessibility patterns and their underlying TF sequence motifs. TF footprints are frequently used to identify TFs that regulate activities of cell/condition-specific genomic regions (target loci) in comparison to control regions (background loci) using standard enrichment tests. However, there is a strong association between the chromatin accessibility level and the GC content of a locus and the number and types of TF footprints that can be detected at this site. Traditional enrichment tests (e.g., hypergeometric) do not account for this bias and inflate false positive associations. Therefore, we developed a novel method, Bias-free Footprint Enrichment Test (BiFET), that corrects for the biases arising from the differences in chromatin accessibility levels and GC contents between target and background loci in footprint enrichment analyses. We applied BiFET to TF footprint calls obtained from human EndoC-βH1 ATAC-seq samples using three different algorithms (CENTIPEDE, HINT-BC, and PIQ) and showed BiFET’s ability to increase power and reduce the false positive rate when compared to the hypergeometric test. Furthermore, we used BiFET to study TF footprints from human PBMC and pancreatic islet ATAC-seq samples to show its utility in identifying putative TFs associated with cell-type-specific loci.
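For reference, the traditional hypergeometric enrichment test mentioned above can be sketched in a few lines of Python. This is the baseline test, not BiFET: it treats every locus as exchangeable and ignores accessibility and GC content, which is the limitation the abstract describes. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_loci, n_with_footprint, n_target, n_target_with_footprint):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N = n_loci                  : all loci (target + background)
    K = n_with_footprint        : loci carrying the TF footprint
    n = n_target                : target loci
    k = n_target_with_footprint : target loci carrying the footprint
    """
    N, K, n, k = n_loci, n_with_footprint, n_target, n_target_with_footprint
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper, so impossible terms vanish.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy example (invented numbers): 10 loci, 5 with a footprint, 5 targets, all 5 targets hit.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
```

Because the test conditions only on counts, any systematic difference in accessibility or GC content between target and background loci feeds straight into the p-value.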


2020 ◽  
Author(s):  
Jun Yao ◽  
Douglas C. Wu ◽  
Ryan M. Nottingham ◽  
Alan M. Lambowitz

Summary: Human plasma contains >40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from >19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised intron RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons.


2017 ◽  
Vol 27 (11) ◽  
pp. 1930-1938 ◽  
Author(s):  
Mingxiang Teng ◽  
Rafael A. Irizarry

1971 ◽  
Vol 68 (1_Suppl) ◽  
pp. S223-S246 ◽  
Author(s):  
C. R. Wira ◽  
H. Rochefort ◽  
E. E. Baulieu

ABSTRACT The definition of a RECEPTOR* in terms of a receptive site, an executive site and a coupling mechanism, is followed by a general consideration of four binding criteria, which include hormone specificity, tissue specificity, high affinity and saturation, essential for distinguishing between specific and nonspecific binding. Experimental approaches are proposed for choosing an experimental system (either organized or soluble) and detecting the presence of protein binding sites. Techniques are then presented for evaluating the specific protein binding sites (receptors) in terms of the four criteria. This is followed by a brief consideration of how receptors may be located in cells and characterized when extracted. Finally various examples of oestrogen, androgen, progestagen, glucocorticoid and mineralocorticoid binding to their respective target tissues are presented, to illustrate how researchers have identified specific corticoid and mineralocorticoid binding in their respective target tissue receptors.


2019 ◽  
Author(s):  
Andrea N. Bootsma ◽  
Analise C. Doney ◽  
Steven Wheeler

Despite the ubiquity of stacking interactions between heterocycles and aromatic amino acids in biological systems, our ability to predict their strength, even qualitatively, is limited. Based on rigorous ab initio data, we have devised a simple predictive model of the strength of stacking interactions between heterocycles commonly found in biologically active molecules and the amino acid side chains Phe, Tyr, and Trp. This model provides rapid predictions of the stacking ability of a given heterocycle based on readily computed heterocycle descriptors. We show that the values of these descriptors, and therefore the strength of stacking interactions with aromatic amino acid side chains, follow simple predictable trends and can be modulated by changing the number and distribution of heteroatoms within the heterocycle. This provides a simple conceptual model for understanding stacking interactions in protein binding sites and optimizing inhibitor binding in drug design.


1989 ◽  
Vol 264 (31) ◽  
pp. 18707-18713 ◽  
Author(s):  
K Matsuno ◽  
C C Hui ◽  
S Takiya ◽  
T Suzuki ◽  
K Ueno ◽  
...  
