scholarly journals Graph Peak Caller: calling ChIP-Seq Peaks on Graph-based Reference Genomes

2018 ◽  
Author(s):  
Ivar Grytten ◽  
Knut D. Rand ◽  
Alexander J. Nederbragt ◽  
Geir O. Storvik ◽  
Ingrid K. Glad ◽  
...  

AbstractGraph-based representations are considered to be the future for reference genomes, as they allow integrated representation of the steadily increasing data on individual variation. Currently available tools allow de novo assembly of graph-based reference genomes, alignment of new read sets to the graph representation as well as certain analyses like variant calling and haplotyping. We here present a first method for calling ChIP-Seq peaks on read data aligned to a graph-based reference genome. The method is a graph generalization of the peak caller MACS2, and is implemented in an open source tool, Graph Peak Caller. By using the existing tool vg to build a pan-genome of Arabidopsis thaliana, we validate our approach by showing that Graph Peak Caller with a pan-genome reference graph can trace variants within peaks that are not part of the linear reference genome, and find peaks that in general are more motif-enriched than those found by MACS2.

2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity, however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify much more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Gokhan Yavas ◽  
Huixiao Hong ◽  
Wenming Xiao

Abstract Background Accurate de novo genome assembly has become reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly. Results To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET) that generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and increasing average contig length and coverage, as expected. For genome builds, dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies. Conclusions The dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. The dnQAET can help researchers to identify the most suitable assembly tools and to select high quality assemblies generated.


2020 ◽  
Vol 10 (10) ◽  
pp. 3489-3495
Author(s):  
Natascha van Lieshout ◽  
Ate van der Burgt ◽  
Michiel E. de Vries ◽  
Menno ter Maat ◽  
David Eickholt ◽  
...  

With the rapid expansion of the application of genomics and sequencing in plant breeding, there is a constant drive for better reference genomes. In potato (Solanum tuberosum), the third largest food crop in the world, the related species S. phureja, designated “DM”, has been used as the most popular reference genome for the last 10 years. Here, we introduce the de novo sequenced genome of Solyntus as the next standard reference in potato genome studies. A true Solanum tuberosum made up of 116 contigs that is also highly homozygous, diploid, vigorous and self-compatible, Solyntus provides a more direct and contiguous reference then ever before available. It was constructed by sequencing with state-of-the-art long and short read technology and assembled with Canu. The 116 contigs were assembled into scaffolds to form each pseudochromosome, with three contigs to 17 contigs per chromosome. This assembly contains 93.7% of the single-copy gene orthologs from the Solanaceae set and has an N50 of 63.7 Mbp. The genome and related files can be found at https://www.plantbreeding.wur.nl/Solyntus/. With the release of this research line and its draft genome we anticipate many exciting developments in (diploid) potato research.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Michael Alonge ◽  
Sebastian Soyk ◽  
Srividya Ramakrishnan ◽  
Xingang Wang ◽  
Sara Goodwin ◽  
...  

Abstract We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Qiuhui Li ◽  
Shilin Tian ◽  
Bin Yan ◽  
Chi Man Liu ◽  
Tak-Wah Lam ◽  
...  

AbstractPan-genome sequence analysis of human population ancestry is critical for expanding and better defining human genome sequence diversity. However, the amount of genetic variation still missing from current human reference sequences is still unknown. Here, we used 486 deep-sequenced Han Chinese genomes to identify 276 Mbp of DNA sequences that, to our knowledge, are absent in the current human reference. We classified these sequences into individual-specific and common sequences, and propose that the common sequence size is uncapped with a growing population. The 46.646 Mbp common sequences obtained from the 486 individuals improved the accuracy of variant calling and mapping rate when added to the reference genome. We also analyzed the genomic positions of these common sequences and found that they came from genomic regions characterized by high mutation rate and low pathogenicity. Our study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population missing from the GRCh38 reference genome and establishes the newly defined common sequences as candidates to supplement the current human reference.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Richard J. Edwards ◽  
Matt A. Field ◽  
James M. Ferguson ◽  
Olga Dudchenko ◽  
Jens Keilwagen ◽  
...  

Abstract Background Basenjis are considered an ancient dog breed of central African origins that still live and hunt with tribesmen in the African Congo. Nicknamed the barkless dog, Basenjis possess unique phylogeny, geographical origins and traits, making their genome structure of great interest. The increasing number of available canid reference genomes allows us to examine the impact the choice of reference genome makes with regard to reference genome quality and breed relatedness. Results Here, we report two high quality de novo Basenji genome assemblies: a female, China (CanFam_Bas), and a male, Wags. We conduct pairwise comparisons and report structural variations between assembled genomes of three dog breeds: Basenji (CanFam_Bas), Boxer (CanFam3.1) and German Shepherd Dog (GSD) (CanFam_GSD). CanFam_Bas is superior to CanFam3.1 in terms of genome contiguity and comparable overall to the high quality CanFam_GSD assembly. By aligning short read data from 58 representative dog breeds to three reference genomes, we demonstrate how the choice of reference genome significantly impacts both read mapping and variant detection. Conclusions The growing number of high-quality canid reference genomes means the choice of reference genome is an increasingly critical decision in subsequent canid variant analyses. The basal position of the Basenji makes it suitable for variant analysis for targeted applications of specific dog breeds. However, we believe more comprehensive analyses across the entire family of canids is more suited to a pangenome approach. Collectively this work highlights the importance the choice of reference genome makes in all variation studies.


2021 ◽  
Vol 7 (3) ◽  
Author(s):  
David R. Greig ◽  
Claire Jenkins ◽  
Saheer E. Gharbia ◽  
Timothy J. Dallman

Compared to short-read sequencing data, long-read sequencing facilitates single contiguous de novo assemblies and characterization of the prophage region of the genome. Here, we describe our methodological approach to using Oxford Nanopore Technology (ONT) sequencing data to quantify genetic relatedness and to look for microevolutionary events in the core and accessory genomes to assess the within-outbreak variation of four genetically and epidemiologically linked isolates. Analysis of both Illumina and ONT sequencing data detected one SNP between the four sequences of the outbreak isolates. The variant calling procedure highlighted the importance of masking homologous sequences in the reference genome regardless of the sequencing technology used. Variant calling also highlighted the systemic errors in ONT base-calling and ambiguous mapping of Illumina reads that results in variations in the genetic distance when comparing one technology to the other. The prophage component of the outbreak strain was analysed, and nine of the 16 prophages showed some similarity to the prophage in the Sakai reference genome, including the stx2a-encoding phage. Prophage comparison between the outbreak isolates identified minor genome rearrangements in one of the isolates, including an inversion and a deletion event. The ability to characterize the accessory genome in this way is the first step to understanding the significance of these microevolutionary events and their impact on the evolutionary history, virulence and potentially the likely source and transmission of this zoonotic, foodborne pathogen.


2017 ◽  
Author(s):  
Erik Garrison ◽  
Jouni Sirén ◽  
Adam M. Novak ◽  
Glenn Hickey ◽  
Jordan M. Eizenga ◽  
...  

AbstractReference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications.1 Equivalent structures are produced by de novo genome assemblers.2,3 Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays,4 with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.


2020 ◽  
Author(s):  
Pablo Duchen ◽  
Nicolas Salamin

AbstractNext-generation-sequencing haplotype callers are commonly used in studies to call variants from newly-sequenced species. However, due to the current availability of genomic resources, it is still common practice to use only one reference genome for a given genus, or even one reference for an entire clade of a higher taxon. The problem with traditional haplotype callers such as the one from GATK, is that they are optimized for variant calling at the population level, but not at the phylogenetic level. Thus, the consequences for downstream analyses can be substantial. Here, through simulations, we compare the performance between the haplotype callers of GATK and ATLAS, and present their differences at various phylogenetic scales. We show how the haplotype caller of GATK substantially underestimates the number of variants at the phylogenetic level, but not at the population level. We also quantified the level at which the accuracy of heterozygote calls declines with increasing distance to the reference genome. Such decrease is very sharp in GATK, while ATLAS maintains a high accuracy in variant calling, even at moderately-divergent species from the reference. We further suggest that efforts should be taken towards acquiring more reference genomes per species, before pursuing high-scale phylogenomic studies.


2019 ◽  
Vol 36 (8) ◽  
pp. 2569-2571 ◽  
Author(s):  
Cinta Pegueroles ◽  
Verónica Mixão ◽  
Laia Carreté ◽  
Manu Molina ◽  
Toni Gabaldón

Abstract Summary An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, the most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome. Availability and implementation HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document