STARRPeaker: Uniform processing and accurate identification of STARR-seq active regions

2019 ◽  
Author(s):  
Donghoon Lee ◽  
Manman Shi ◽  
Jennifer Moran ◽  
Martha Wall ◽  
Jing Zhang ◽  
...  

Abstract Background: High-throughput reporter assays, such as self-transcribing active regulatory region sequencing (STARR-seq), allow for unbiased and quantitative assessment of enhancers at a genome-wide scale. Recent advances in STARR-seq technology have employed progressively more complex genomic libraries and increased sequencing depths to assay larger regions, up to the entire human genome. These advances necessitate a reliable processing pipeline and peak-calling algorithm. Results: Most STARR-seq studies have relied on chromatin immunoprecipitation sequencing (ChIP-seq) processing pipelines. However, there are key differences between STARR-seq and ChIP-seq. First, STARR-seq uses transcribed RNA to measure the activity of an enhancer, making an accurate determination of the basal transcription rate important. Second, STARR-seq coverage is highly non-uniform, overdispersed, and often confounded by sequencing biases, such as GC content and mappability. Lastly, we observed a clear correlation between RNA thermodynamic stability and STARR-seq readout, suggesting that STARR-seq may be sensitive to RNA secondary structure and stability. Considering these findings, we developed a negative-binomial regression framework for uniformly processing STARR-seq data, called STARRPeaker. In support of this, we generated whole-genome STARR-seq data from the HepG2 and K562 human cell lines and applied STARRPeaker to call enhancers. Conclusions: We show that STARRPeaker can unbiasedly detect active enhancers from both captured and whole-genome STARR-seq data. Specifically, we report ∼33,000 and ∼20,000 candidate enhancers from HepG2 and K562, respectively. Moreover, we show that STARRPeaker outperforms other peak callers in terms of identifying known enhancers with fewer false positives. Overall, we demonstrate that an optimized processing framework for STARR-seq experiments can identify putative enhancers while addressing potential confounders.
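The negative-binomial framework described in this abstract can be illustrated with a minimal sketch. It is not STARRPeaker's implementation: the function names, the choice of covariates, the log-linear link, and every coefficient value below are assumptions made for illustration only. The sketch models an expected read count from covariates and then scores an observed count against a negative binomial background.

```python
import math


def expected_count(input_cov, gc, fold_energy, coef):
    """Hypothetical log-linear mean model: covariates -> expected RNA count."""
    eta = (coef["intercept"]
           + coef["input"] * math.log(input_cov + 1.0)  # input (DNA) coverage
           + coef["gc"] * gc                            # GC fraction of the bin
           + coef["fold"] * fold_energy)                # RNA folding energy
    return math.exp(eta)


def nb_logpmf(k, mu, size):
    """Log-probability of count k under a negative binomial with mean mu.

    `size` is the dispersion parameter; the variance is mu + mu**2 / size.
    """
    p = size / (size + mu)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log1p(-p))


def nb_upper_tail(k, mu, size):
    """P(X >= k): chance of seeing at least k reads from background alone."""
    lower = sum(math.exp(nb_logpmf(i, mu, size)) for i in range(k))
    return max(0.0, 1.0 - lower)


# Illustrative coefficients only; a real analysis would fit them to data.
coef = {"intercept": 0.5, "input": 1.0, "gc": 0.8, "fold": -0.01}
mu = expected_count(input_cov=20, gc=0.45, fold_energy=-30.0, coef=coef)
p_value = nb_upper_tail(120, mu, size=5.0)
print(f"expected count = {mu:.1f}, upper-tail p-value = {p_value:.3g}")
```

In this sketch a small upper-tail probability marks a bin whose RNA output exceeds what the covariates alone would predict, which is the general idea behind calling a candidate enhancer against a modeled background.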

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Donghoon Lee ◽  
Manman Shi ◽  
Jennifer Moran ◽  
Martha Wall ◽  
Jing Zhang ◽  
...  

Abstract: STARR-seq technology has employed progressively more complex genomic libraries and increased sequencing depths. An issue with the increased complexity and depth is that the coverage in STARR-seq experiments is non-uniform, overdispersed, and often confounded by sequencing biases, such as GC content. Furthermore, STARR-seq readout is confounded by RNA secondary structure and thermodynamic stability. To address these potential confounders, we developed a negative binomial regression framework for uniformly processing STARR-seq data, called STARRPeaker. Moreover, to aid our effort, we generated whole-genome STARR-seq data from the HepG2 and K562 human cell lines and applied STARRPeaker to comprehensively and unbiasedly call enhancers in them.


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Suhua Feng ◽  
Zhenhui Zhong ◽  
Ming Wang ◽  
Steven E. Jacobsen

Abstract Background 5′ methylation of cytosines in DNA molecules is an important epigenetic mark in eukaryotes. Bisulfite sequencing is the gold standard of DNA methylation detection, and whole-genome bisulfite sequencing (WGBS) has been widely used to detect methylation at single-nucleotide resolution on a genome-wide scale. However, sodium bisulfite is known to severely degrade DNA, which, in combination with biases introduced during PCR amplification, leads to unbalanced base representation in the final sequencing libraries. Enzymatic conversion of unmethylated cytosines to uracils can achieve the same end product for sequencing as does bisulfite treatment and does not affect the integrity of the DNA; enzymatic methylation sequencing may, thus, provide advantages over bisulfite sequencing. Results Using an enzymatic methyl-seq (EM-seq) technique to selectively deaminate unmethylated cytosines to uracils, we generated and sequenced libraries based on different amounts of Arabidopsis input DNA and different numbers of PCR cycles, and compared these data to results from traditional whole-genome bisulfite sequencing. We found that EM-seq libraries were more consistent between replicates and had higher mapping and lower duplication rates, lower background noise, higher average coverage, and higher coverage of total cytosines. Differential methylation region (DMR) analysis showed that WGBS tended to over-estimate methylation levels especially in CHG and CHH contexts, whereas EM-seq detected higher CG methylation levels in certain highly methylated areas. These phenomena can be mostly explained by a correlation of WGBS methylation estimation with GC content and methylated cytosine density. We used EM-seq to compare methylation between leaves and flowers, and found that CHG methylation level is greatly elevated in flowers, especially in pericentromeric regions. Conclusion We suggest that EM-seq is a more accurate and reliable approach than WGBS to detect methylation. 
Compared to WGBS, the results of EM-seq are less affected by differences in library preparation conditions or by the skewed base composition in the converted DNA. It may therefore be more desirable to use EM-seq in methylation studies.
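The CG, CHG, and CHH contexts and the per-site methylation levels mentioned in this abstract can be made concrete with a short sketch. The function names are hypothetical and the code only handles the forward strand; it is an illustration of the standard definitions, not the authors' pipeline. In both WGBS and EM-seq, a cytosine that still reads as C after conversion counts as methylated, and one that reads as T counts as unmethylated.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index i as 'CG', 'CHG', or 'CHH'.

    H stands for A, C, or T. Returns None when the downstream bases are
    missing or ambiguous. Forward strand only.
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[i + 1:i + 3]
    if downstream[:1] == "G":
        return "CG"
    if len(downstream) < 2 or any(base not in "ACGT" for base in downstream):
        return None
    return "CHG" if downstream[1] == "G" else "CHH"


def methylation_level(methylated_reads, unmethylated_reads):
    """Fraction of reads supporting methylation; None without coverage."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total


print(cytosine_context("ACGTT", 1), methylation_level(18, 2))
```

Averaging `methylation_level` over all covered cytosines of one context gives the kind of per-context methylation estimate that the two methods are compared on.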


2020 ◽  
Author(s):  
Suhua Feng ◽  
Zhenhui Zhong ◽  
Ming Wang ◽  
Steven E. Jacobsen

Abstract Background: 5’ methylation of cytosines in DNA molecules is an important epigenetic mark in eukaryotes. Bisulfite sequencing is the gold standard of DNA methylation detection, and whole-genome bisulfite sequencing (WGBS) has been widely used to detect methylation at single-nucleotide resolution on a genome-wide scale. However, sodium bisulfite is known to severely degrade DNA, which, in combination with biases introduced during PCR amplification, leads to unbalanced base representation in the final sequencing libraries. Enzymatic conversion of unmethylated cytosines to uracils can achieve the same end product for sequencing as does bisulfite treatment and does not affect the integrity of the DNA; enzymatic methylation sequencing may thus provide advantages over bisulfite sequencing. Results: Using an enzymatic methyl-seq (EM-seq) technique to selectively deaminate unmethylated cytosines to uracils, we generated and sequenced libraries based on different amounts of Arabidopsis input DNA and different numbers of PCR cycles, and compared these data to results from traditional whole-genome bisulfite sequencing. We found that EM-seq libraries were more consistent between replicates and had higher mapping and lower duplication rates, lower background noise, higher average coverage, and higher coverage of total cytosines. Differential methylation region (DMR) analysis showed that WGBS tended to over-estimate methylation levels especially in CHG and CHH contexts, whereas EM-seq detected higher CG methylation levels in certain highly methylated areas. These phenomena can be mostly explained by a correlation of WGBS methylation estimation with GC content and methylated cytosine density. We used EM-seq to compare methylation between leaves and flowers, and found that CHG methylation level is greatly elevated in flowers, especially in pericentromeric regions. Conclusion: We suggest that EM-seq is a more accurate and reliable approach than WGBS to detect methylation.
Compared to WGBS, the results of EM-seq are less affected by differences in library preparation conditions or by the skewed base composition in the converted DNA. It may therefore be more desirable to use EM-seq in methylation studies.


Plants ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 1620
Author(s):  
Xiao-Long Yuan ◽  
Cheng-Sheng Zhang ◽  
Fan-Yu Kong ◽  
Zhong-Feng Zhang ◽  
Feng-Long Wang

Phytophthora nicotianae is a widely distributed plant pathogen that causes serious disease and significant economic losses in various crops, including tomatoes, tobacco, onions, and strawberries. To understand its pathogenic mechanisms and explore strategies for controlling diseases caused by this pathogen, we sequenced and analyzed the whole genome of Ph. nicotianae JM01. The Ph. nicotianae JM01 genome was assembled using a combination of approaches including shotgun sequencing, single-molecule sequencing, and the Hi-C technique. The assembled Ph. nicotianae JM01 genome is about 95.32 Mb, with contig and scaffold N50 values of 54.23 kb and 113.15 kb, respectively. The average GC content of the whole genome is about 49.02%, and the genome encodes 23,275 genes. In addition, we identified 19.15% of interspersed elements and 0.95% of tandem elements in the whole genome. A genome-wide phylogenetic tree indicated that Phytophthora diverged from Pythium approximately 156.32 Ma. Meanwhile, we found that 252 and 285 gene families showed expansion and contraction, respectively, in Phytophthora when compared to gene families in Pythium. To determine the pathogenic mechanisms of Ph. nicotianae JM01, we analyzed a suite of proteins involved in plant–pathogen interactions. The results revealed that gene duplication contributed to the expansion of Cell Wall Degrading Enzymes (CWDEs) such as glycoside hydrolases, and effectors such as Arg-Xaa-Leu-Arg (RXLR) effectors. In addition, transient expression was performed on Nicotiana benthamiana by infiltrating with Agrobacterium tumefaciens cells containing a cysteine-rich (SCR) protein. The results indicated that SCR can cause symptoms of hypersensitive response. Moreover, we also conducted comparative genome analysis among four Ph. nicotianae genomes. The completed Ph. nicotianae JM01 genome can help us understand its genomic characteristics, discover genes involved in infection, and thereby understand its pathogenic mechanisms.
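The contig and scaffold N50 values quoted in this abstract follow a standard definition: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that definition, with a hypothetical function name and toy lengths rather than the JM01 data:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
print(n50(toy_contigs))  # prints 8, since 10 + 9 + 8 = 27 reaches half of 54
```

Because scaffolds join contigs, a scaffold N50 is never smaller than the contig N50 of the same assembly, which matches the 113.15 kb and 54.23 kb figures reported above.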


Author(s):  
Hamid Alinejad-Rokny ◽  
Rassa Ghavami ◽  
Hamid R. Rabiee ◽  
Narges Rezaei ◽  
Kin Tung Tam ◽  
...  

Abstract: Hi-C is a genome-wide chromosome conformation capture technology that detects interactions between pairs of genomic regions, and exploits higher order chromatin structures. Conceptually, Hi-C data counts interaction frequencies between every position in the genome and every other position. Biologically functional interactions are expected to occur more frequently than random (background) interactions. To identify biologically relevant interactions, several background models that take biases such as distance, GC content and mappability into account have been proposed. Here we introduce MaxHiC, a background correction tool that deals with these complex biases and robustly identifies statistically significant interactions in both Hi-C and capture Hi-C experiments. MaxHiC uses a negative binomial distribution model and a maximum likelihood technique to correct biases in both Hi-C and capture Hi-C libraries. We systematically benchmark MaxHiC against major Hi-C background correction tools and demonstrate using published Hi-C and capture Hi-C datasets that 1) interacting regions identified by MaxHiC have significantly greater levels of overlap with known regulatory features (e.g. active chromatin histone marks, CTCF binding sites, DNase sensitivity) and also disease-associated genome-wide association SNPs than those identified by currently existing models, and 2) the pairs of interacting regions are more likely to be linked by eQTL pairs and more likely to identify known enhancer-promoter pairs than any of the existing methods. We also demonstrate that interactions between different genomic region types have distinct distance distributions only revealed by MaxHiC. MaxHiC is publicly available as a Python package for the analysis of Hi-C and capture Hi-C data.


2020 ◽  
Vol 66 (9) ◽  
pp. 505-520 ◽  
Author(s):  
Yingying Xiang ◽  
Wenyu Li ◽  
Fei Song ◽  
Xianghong Yang ◽  
Jing Zhou ◽  
...  

Enterococcus faecalis is a common pathogen causing refractory periapical periodontitis and secondary intraradicular infections. In this study, E. faecalis YN771 isolated from a re-treated root canal at a stomatology department was used as the host bacterium and was co-cultured with wastewater from the same department and patient samples to isolate a phage that lyses E. faecalis. We studied the biological and genomic characteristics of this phage. Transmission electron microscopy showed that the phage has an icosahedral head about 98.4 nm in diameter and a contractile tail about 228.5 nm long and 17.3 nm in diameter. The phage was identified as a member of the Myoviridae family and named PEf771. It is sensitive to proteinase K but resistant to chloroform and Triton X-100. Its lytic cycle is 45 min, its burst size is 78, its optimal multiplicity of infection is 0.1, its lysis spectrum is narrow, and its host strain specificity is strong. Its optimal growth temperature is 37 °C, its most suitable pH is 6.0, and it is sensitive to ultraviolet radiation. Whole-genome sequencing of PEf771 indicated it has a genome size of 151 052 bp, with a GC content of 36.97%, and encodes 197 proteins plus 26 tRNAs. PEf771 is most closely related to E. faecalis phage EFDG1. Phage PEf771 has strong host specificity and lytic ability, so it is important to further characterize this phage and its interaction with E. faecalis.


2021 ◽  
Vol 10 (23) ◽  
Author(s):  
Zhen Wang ◽  
Ji-xing Feng ◽  
Xue-peng Li ◽  
Jian Zhang

Micrococcus luteus MT1691313 is a Gram-positive bacterium isolated from the deep-sea sediment located at a −4,448-m depth in the Mariana Trench. Here, we report the complete genome sequence of this strain, which has a genome size of 2.32 Mb with a GC content of 72.04%.
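GC content, reported here and in several of the abstracts above, is the percentage of G and C bases among the unambiguous bases of a sequence. A minimal sketch of that calculation follows; the function name and the choice to ignore ambiguous bases such as N are assumptions for illustration, and the input is a toy string rather than the reported genome.

```python
def gc_content(seq):
    """Percentage of G and C among the A, C, G, and T bases of seq."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "AT")
    if gc + at == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * gc / (gc + at)


print(gc_content("ATGCGCNN"))
```

For a whole genome, the same ratio would be computed over the concatenated chromosome or contig sequences.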


2021 ◽  
Vol 9 (8) ◽  
pp. 1570
Author(s):  
Chien-Hsun Huang ◽  
Chih-Chieh Chen ◽  
Yu-Chun Lin ◽  
Chia-Hsuan Chen ◽  
Ai-Yun Lee ◽  
...  

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that are indistinguishable from each other by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out to explore highly discriminating target genes whose interspecific sequence identity is significantly lower than that of 16S rRNA or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and the discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited a clearly lower, and therefore more discriminating, sequence identity (61.6–85.6%; average: 66.6%) than the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), which could be used to separate tested strains into various species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only relatively low mutL sequence identities (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, signifying that it exhibits potential for reclassification as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.


Forests ◽  
2018 ◽  
Vol 9 (8) ◽  
pp. 444
Author(s):  
Fumio Nakazawa ◽  
Yoshihisa Suyama ◽  
Satoshi Imura ◽  
Hideaki Motoyama

Pollen taxa in sediment samples can be identified based on morphology. However, closely related species do not differ substantially in pollen morphology, and accurate identification is generally limited to genera or families. Because many pollen grains in glaciers contain protoplasm, genetic information obtained from pollen grains should enable the identification of plant taxa at the species level. In the present study, species identification of Pinus pollen grains was attempted using whole-genome amplification (WGA). We used pollen grains extracted from surface snow (depth, 1.8–1.9 m) from the Belukha glacier in the summer of 2003. WGA was performed using a single pollen grain. Some regions of the chloroplast genome were amplified by PCR, and the DNA products were sequenced to identify the pollen grain. Pinus includes approximately 111 recognized species in two subgenera, four sections, and 11 subsections. The tree species Pinus sibirica and P. sylvestris are currently found at the periphery of the glacier. We identified the pollen grains from the Belukha glacier to the level of section or subsection to which P. sibirica and P. sylvestris belong. Moreover, we specifically identified two pollen grains as P. sibirica or P. cembra. Fifteen species, including P. sibirica, were candidates for the remaining pollen grain.


2016 ◽  
Author(s):  
Owen J.L. Rackham ◽  
Sarah R. Langley ◽  
Thomas Oates ◽  
Eleni Vradi ◽  
Nathan Harmston ◽  
...  

Abstract: DNA methylation is a key epigenetic modification involved in gene regulation whose contribution to disease susceptibility remains to be fully understood. Here, we present a novel Bayesian smoothing approach (called ABBA) to detect differentially methylated regions (DMRs) from whole-genome bisulphite sequencing (WGBS). We also show how this approach can be leveraged to identify disease-associated changes in DNA methylation, suggesting mechanisms through which these alterations might affect disease. From a data modeling perspective, ABBA has the distinctive feature of automatically adapting to different correlation structures in CpG methylation levels across the genome whilst taking into account the distance between CpG sites as a covariate. Our simulation study shows that ABBA has greater power to detect DMRs than existing methods, providing an accurate identification of DMRs in the large majority of simulated cases. To empirically demonstrate the method’s efficacy in generating biological hypotheses, we performed WGBS of primary macrophages derived from an experimental rat system of glomerulonephritis and used ABBA to identify >1,000 disease-associated DMRs. Investigation of these DMRs revealed differential DNA methylation localized to a 600 bp region in the promoter of the Ifitm3 gene. This was confirmed by ChIP-seq and RNA-seq analyses, showing differential transcription factor binding at the Ifitm3 promoter by JunD (an established determinant of glomerulonephritis) and a consistent change in Ifitm3 expression. Our ABBA analysis allowed us to propose a new role for Ifitm3 in the pathogenesis of glomerulonephritis via a mechanism involving promoter hypermethylation that is associated with Ifitm3 repression in the rat strain susceptible to glomerulonephritis.

