Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-Seq data

Mapping Intimacies ◽

10.1101/090704 ◽

2016 ◽

Author(s):

Mingxiang Teng ◽

Rafael A. Irizarry

Keyword(s):

Binding Sites ◽

False Positive ◽

Statistical Approach ◽

Gc Content ◽

Batch Effects ◽

Peak Calling ◽

Protein Binding Sites ◽

Content Bias ◽

On Chip ◽

Genomic Regions

Abstract: The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of public functional genomics catalogs is based on ChIP-seq data. These catalogs rely on peak calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance as a result of the experimental protocol's lack of perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning is that the GC effect varies across experiments, with the effect strong enough to result in a substantial number of peaks being called differently when different laboratories perform experiments on the same cell line. However, accounting for GC content in ChIP-seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signal with the unwanted variability. To address this challenge, we introduce a statistical approach that accounts for GC effects on both non-specific noise and signal induced by the binding site. The method can be used to account for this bias in binding quantification as well as to improve existing peak calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
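As a rough illustration of the kind of GC-dependent coverage variability the abstract describes, the sketch below computes the GC fraction of genomic windows and averages read counts within GC strata. It is not the authors' statistical model; the function names, the equal-width binning, and the toy inputs are assumptions made only for this example.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C bases among unambiguous A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Windows made only of ambiguous bases (e.g. N) get 0.0 by convention here.
    return gc / total if total else 0.0


def mean_coverage_by_gc(windows, counts, n_bins=10):
    """Average read count per GC stratum (hypothetical helper, equal-width bins)."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for seq, count in zip(windows, counts):
        # Map GC fraction in [0, 1] to a bin index; GC == 1.0 goes in the last bin.
        b = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        sums[b] += count
        sizes[b] += 1
    return {b: sums[b] / sizes[b] for b in sorted(sums)}


if __name__ == "__main__":
    # Toy windows and read counts, invented for illustration only.
    toy_windows = ["AAAA", "GGCC", "ATGC", "GCGC"]
    toy_counts = [2, 10, 4, 14]
    print(mean_coverage_by_gc(toy_windows, toy_counts, n_bins=2))
```

A stratum-wise average that rises or falls with GC content would point to a GC effect in background coverage. As the abstract notes, true binding sites are themselves enriched in GC-rich regions, so a naive correction based on such averages would risk removing real signal.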

Identification of protein-protected mRNA fragments and structured excised intron RNAs in human plasma by TGIRT-seq peak calling

eLife ◽

10.7554/elife.60743 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 1

Author(s):

Jun Yao ◽

Douglas C Wu ◽

Ryan M Nottingham ◽

Alan M Lambowitz

Keyword(s):

Human Plasma ◽

Binding Sites ◽

Full Length ◽

Peak Calling ◽

Healthy Individuals ◽

Protein Coding ◽

Protein Binding Sites ◽

Protein Coding Genes ◽

Non Coding Rnas ◽

Potential Biomarkers

Human plasma contains > 40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from > 19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised intron RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons.

BiFET: A Bias-free Transcription Factor Footprint Enrichment Test

10.1101/324277 ◽

2018 ◽

Author(s):

Ahrim Youn ◽

Eladio J. Marquez ◽

Nathan Lawlor ◽

Michael L. Stitzel ◽

Duygu Ucar

Keyword(s):

Transcription Factor ◽

False Positive ◽

False Positive Rate ◽

Strong Association ◽

Gc Content ◽

Chromatin Accessibility ◽

Hypergeometric Test ◽

Sequence Motifs ◽

Positive Rate ◽

Genomic Regions

ABSTRACT Transcription factor (TF) footprinting uncovers putative protein-DNA binding via combined analyses of chromatin accessibility patterns and their underlying TF sequence motifs. TF footprints are frequently used to identify TFs that regulate activities of cell/condition-specific genomic regions (target loci) in comparison to control regions (background loci) using standard enrichment tests. However, there is a strong association between the chromatin accessibility level and the GC content of a locus and the number and types of TF footprints that can be detected at this site. Traditional enrichment tests (e.g., hypergeometric) do not account for this bias and inflate false positive associations. Therefore, we developed a novel method, Bias-free Footprint Enrichment Test (BiFET), that corrects for the biases arising from the differences in chromatin accessibility levels and GC contents between target and background loci in footprint enrichment analyses. We applied BiFET to TF footprint calls obtained from human EndoC-βH1 ATAC-seq samples using three different algorithms (CENTIPEDE, HINT-BC, and PIQ) and showed BiFET's ability to increase power and reduce the false positive rate when compared to the hypergeometric test. Furthermore, we used BiFET to study TF footprints from human PBMC and pancreatic islet ATAC-seq samples to show its utility in identifying putative TFs associated with cell-type-specific loci.
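For orientation, the traditional hypergeometric enrichment test that the abstract uses as its baseline can be sketched as follows. This is the uncorrected test, not BiFET itself; the function name, argument names, and toy numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_target, n_background, k_target, k_background):
    """Upper-tail hypergeometric p-value, P(X >= k_target).

    n_target / n_background: number of target and background loci.
    k_target / k_background: loci in each set carrying a given TF footprint.
    """
    total_loci = n_target + n_background
    total_hits = k_target + k_background
    denom = comb(total_loci, n_target)
    upper = min(total_hits, n_target)
    # Sum the probability of seeing k_target or more footprint-bearing loci
    # when n_target loci are drawn at random from the pooled set.
    tail = sum(
        comb(total_hits, i) * comb(total_loci - total_hits, n_target - i)
        for i in range(k_target, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    print(hypergeom_enrichment_pvalue(n_target=50, n_background=950,
                                      k_target=12, k_background=60))
```

This test treats every locus as equally likely to carry a footprint. It therefore ignores the accessibility and GC-content differences between target and background loci that, according to the abstract, inflate false positive associations.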

Identification of protein-protected mRNA fragments and structured excised intron RNAs in human plasma by TGIRT-seq peak calling

10.1101/2020.06.25.171439 ◽

2020 ◽

Author(s):

Jun Yao ◽

Douglas C. Wu ◽

Ryan M. Nottingham ◽

Alan M. Lambowitz

Keyword(s):

Human Plasma ◽

Binding Sites ◽

Full Length ◽

Peak Calling ◽

Healthy Individuals ◽

Protein Coding ◽

Protein Binding Sites ◽

Protein Coding Genes ◽

Non Coding Rnas ◽

Potential Biomarkers

Summary: Human plasma contains >40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from >19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised intron RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons.

Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data

Genome Research ◽

10.1101/gr.220673.117 ◽

2017 ◽

Vol 27 (11) ◽

pp. 1930-1938 ◽

Cited By ~ 11

Author(s):

Mingxiang Teng ◽

Rafael A. Irizarry

Keyword(s):

Gc Content ◽

Systematic Errors ◽

Batch Effects ◽

Content Bias

EVALUATION OF TISSUE STEROID BINDING IN VITRO

Acta Endocrinologica ◽

10.1530/acta.0.068s223 ◽

1971 ◽

Vol 68 (1_Suppl) ◽

pp. S223-S246 ◽

Cited By ~ 2

Author(s):

C. R. Wira ◽

H. Rochefort ◽

E. E. Baulieu

Keyword(s):

Protein Binding ◽

Binding Sites ◽

Experimental System ◽

Specific Protein ◽

Coupling Mechanism ◽

Target Tissue ◽

Protein Binding Sites ◽

Definition Of ◽

Hormone Specificity

ABSTRACT The definition of a RECEPTOR* in terms of a receptive site, an executive site and a coupling mechanism, is followed by a general consideration of four binding criteria, which include hormone specificity, tissue specificity, high affinity and saturation, essential for distinguishing between specific and nonspecific binding. Experimental approaches are proposed for choosing an experimental system (either organized or soluble) and detecting the presence of protein binding sites. Techniques are then presented for evaluating the specific protein binding sites (receptors) in terms of the four criteria. This is followed by a brief consideration of how receptors may be located in cells and characterized when extracted. Finally various examples of oestrogen, androgen, progestagen, glucocorticoid and mineralocorticoid binding to their respective target tissues are presented, to illustrate how researchers have identified specific corticoid and mineralocorticoid binding in their respective target tissue receptors.

Predicting the Strength of Stacking Interactions Between Heterocycles and Aromatic Amino Acid Side Chains

10.26434/chemrxiv.7628939.v2 ◽

2019 ◽

Author(s):

Andrea N. Bootsma ◽

Analise C. Doney ◽

Steven Wheeler

Keyword(s):

Amino Acid ◽

Binding Sites ◽

Aromatic Amino Acid ◽

Aromatic Amino Acids ◽

Biologically Active ◽

Side Chains ◽

Stacking Interactions ◽

Inhibitor Binding ◽

Protein Binding Sites ◽

Amino Acid Side Chains

Despite the ubiquity of stacking interactions between heterocycles and aromatic amino acids in biological systems, our ability to predict their strength, even qualitatively, is limited. Based on rigorous ab initio data, we have devised a simple predictive model of the strength of stacking interactions between heterocycles commonly found in biologically active molecules and the amino acid side chains Phe, Tyr, and Trp. This model provides rapid predictions of the stacking ability of a given heterocycle based on readily computed heterocycle descriptors. We show that the values of these descriptors, and therefore the strength of stacking interactions with aromatic amino acid side chains, follow simple predictable trends and can be modulated by changing the number and distribution of heteroatoms within the heterocycle. This provides a simple conceptual model for understanding stacking interactions in protein binding sites and optimizing inhibitor binding in drug design.

Faculty Opinions recommendation of Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718061745.793481765 ◽

2013 ◽

Author(s):

Hans Lehrach

Keyword(s):

Binding Sites ◽

Related Factors ◽

Human Genomic ◽

Genomic Regions

Rational and Random Strategies for the Mimicry of Discontinuous Protein Binding Sites

Protein and Peptide Letters ◽

10.2174/0929866043406931 ◽

2004 ◽

Vol 11 (4) ◽

pp. 281-290 ◽

Cited By ~ 13

Author(s):

Jutta Eichler

Keyword(s):

Protein Binding ◽

Binding Sites ◽

Protein Binding Sites

ProBiS: a web server for detection of structurally similar protein binding sites

Nucleic Acids Research ◽

10.1093/nar/gkq479 ◽

2010 ◽

Vol 38 (Web Server) ◽

pp. W436-W440 ◽

Cited By ~ 59

Author(s):

J. Konc ◽

D. Janezic

Keyword(s):

Protein Binding ◽

Binding Sites ◽

Web Server ◽

Protein Binding Sites ◽

Similar Protein

Transcription signals and protein binding sites for sericin gene transcription in vitro

Journal of Biological Chemistry ◽

10.1016/s0021-9258(18)51525-8 ◽

1989 ◽

Vol 264 (31) ◽

pp. 18707-18713 ◽

Cited By ~ 3

Author(s):

K Matsuno ◽

C C Hui ◽

S Takiya ◽

T Suzuki ◽

K Ueno ◽

...

Keyword(s):

Protein Binding ◽

Gene Transcription ◽

Binding Sites ◽

Protein Binding Sites ◽

Transcription In Vitro ◽

Transcription Signals
