A cautionary note on the use of genotype callers in Phylogenomics

Systematic Biology ◽

10.1093/sysbio/syaa081 ◽

2020 ◽

Author(s):

Pablo Duchen ◽

Nicolas Salamin

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Population Level ◽

Cautionary Note ◽

High Scale ◽

Phylogenetic Level ◽

Current Availability ◽

The One ◽

Reference Genomes ◽

Generation Sequencing

Abstract Next-generation-sequencing genotype callers are commonly used in studies to call variants from newly-sequenced species. However, due to the current availability of genomic resources, it is still common practice to use only one reference genome for a given genus, or even one reference for an entire clade of a higher taxon. The problem with traditional genotype callers, such as the one from GATK, is that they are optimized for variant calling at the population level. However, when these callers are used at the phylogenetic level, the consequences for downstream analyses can be substantial. Here, we performed simulations to compare the performance between the genotype callers of GATK and ATLAS, and present their differences at various phylogenetic scales. We show that the genotype caller of GATK substantially underestimates the number of variants at the phylogenetic level, but not at the population level. We also found that the accuracy of heterozygote calls declines with increasing distance to the reference genome. We quantified this decline, and found that it is very sharp in GATK, while ATLAS maintains a high accuracy even at moderately-divergent species from the reference. We further suggest that efforts should be taken towards acquiring more reference genomes per species, before pursuing high-scale phylogenomic studies.


A cautionary note on the use of haplotype callers in Phylogenomics

10.1101/2020.06.10.145011 ◽

2020 ◽

Author(s):

Pablo Duchen ◽

Nicolas Salamin

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Population Level ◽

Cautionary Note ◽

High Scale ◽

Phylogenetic Level ◽

Current Availability ◽

The One ◽

Reference Genomes ◽

Generation Sequencing

Abstract Next-generation-sequencing haplotype callers are commonly used in studies to call variants from newly-sequenced species. However, due to the current availability of genomic resources, it is still common practice to use only one reference genome for a given genus, or even one reference for an entire clade of a higher taxon. The problem with traditional haplotype callers, such as the one from GATK, is that they are optimized for variant calling at the population level, but not at the phylogenetic level. Thus, the consequences for downstream analyses can be substantial. Here, through simulations, we compare the performance of the haplotype callers of GATK and ATLAS, and present their differences at various phylogenetic scales. We show how the haplotype caller of GATK substantially underestimates the number of variants at the phylogenetic level, but not at the population level. We also quantified the level at which the accuracy of heterozygote calls declines with increasing distance to the reference genome. This decrease is very sharp in GATK, while ATLAS maintains a high accuracy in variant calling, even for species moderately divergent from the reference. We further suggest that efforts should be taken towards acquiring more reference genomes per species, before pursuing high-scale phylogenomic studies.


benchNGS : An approach to benchmark short reads alignment tools

10.7287/peerj.preprints.1007v1 ◽

2015 ◽

Author(s):

Farzana Rahman ◽

Mehedi Hassan ◽

Alona Kryshchenko ◽

Inna Dubchak ◽

Tatiana V Tatarinova ◽

...

Keyword(s):

Reference Genome ◽

Global Alignment ◽

Short Report ◽

Related Genome ◽

Genome Sequences ◽

Short Reads ◽

Relevant Reference ◽

Next Generation Sequencing Ngs ◽

Reference Genomes ◽

Generation Sequencing

In the last decade a number of algorithms and associated software were developed to align next generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads contain mutations not found in the reference genome. Yet there is no standard way to compare these programs and assess their biological relevance. We propose a benchmark to assess the accuracy of short-read mapping based on the pre-computed global alignment of closely related genome sequences. In this paper we outline the method and also present a short report of an experiment performed on five popular alignment tools.


Optimization of the “in-silico” mate-pair method improved contiguity and accuracy of genome assembly

10.22541/au.163257605.53808833/v1 ◽

2021 ◽

Author(s):

Tao Zhou ◽

Liang Lu ◽

Chenhong Li

Keyword(s):

Genome Assembly ◽

In Silico ◽

Reference Genome ◽

Third Generation ◽

High Molecular Weight Dna ◽

The Third ◽

Mate Pair ◽

Pair Method ◽

Reference Genomes ◽

Generation Sequencing

A combination of next-generation sequencing technologies and mate-pair libraries of large insert sizes is used as a standard method to generate genome assemblies with high contiguity. Third-generation sequencing techniques are also used to improve the quality of assembled genomes. However, both mate-pair libraries and third-generation libraries require high-molecular-weight DNA, making the use of these libraries inappropriate for samples with only degraded DNA. An in silico method that generates mate-pair libraries using a reference genome was devised for the task of assembling target genomes. Although the contiguity and completeness of assembled genomes were significantly improved by this method, a high level of errors manifested in the assembly, and the methods for using reference genomes were not optimized. Here, we tested different strategies for using reference genomes to generate in silico mate-pairs. The results showed that using a closely related reference genome from the same genus was more effective than using divergent references. Conservation of in silico mate-pairs by comparing two references and using those to guide genome assembly reduced the number of misassemblies (18.6% – 46.1%) and increased the contiguity of assembled genomes (9.7% – 70.7%), while maintaining gene completeness at a level that was either similar or marginally lower than that obtained via the current method. Finally, we compared the optimized method with another reference-guided assembler, RaGOO. We found that RaGOO produced longer scaffolds (17.8 Mbp vs 3.0 Mbp), but resulted in a much higher misassembly rate (85.68%) than our optimized in silico mate-pair method.


HaploTypo: a variant-calling pipeline for phased genomes

Bioinformatics ◽

10.1093/bioinformatics/btz933 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2569-2571 ◽

Cited By ~ 3

Author(s):

Cinta Pegueroles ◽

Verónica Mixão ◽

Laia Carreté ◽

Manu Molina ◽

Toni Gabaldón

Keyword(s):

Genetic Variation ◽

Genetic Variant ◽

Reference Genome ◽

Variant Calling ◽

Supplementary Information ◽

Haplotype Structure ◽

Supplementary Data ◽

Heterozygous Variant ◽

Reference Genomes

Abstract Summary: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome. Availability and implementation: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image. Supplementary information: Supplementary data are available at Bioinformatics online.


benchNGS : An approach to benchmark short reads alignment tools

10.1101/018234 ◽

2015 ◽

Cited By ~ 2

Author(s):

Farzana Rahman ◽

Mehedi Hassan ◽

Alona Martin Kryshchenko ◽

Inna Dubchak ◽

Nikolai Nickolai Alexandrov ◽

...

Keyword(s):

Reference Genome ◽

Global Alignment ◽

Short Report ◽

Related Genome ◽

Genome Sequences ◽

Short Reads ◽

Relevant Reference ◽

Next Generation Sequencing Ngs ◽

Reference Genomes ◽

Generation Sequencing

In the last decade a number of algorithms and associated software were developed to align next generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads contain mutations not found in the reference genome. Yet there is no standard way to compare these programs and assess their biological relevance. We propose a benchmark to assess the accuracy of short-read mapping based on the precomputed global alignment of closely related genome sequences. In this paper we outline the method and also present a short report of an experiment performed on five popular alignment tools.


benchNGS : An approach to benchmark short reads alignment tools

10.7287/peerj.preprints.1007 ◽

2015 ◽

Author(s):

Farzana Rahman ◽

Mehedi Hassan ◽

Alona Kryshchenko ◽

Inna Dubchak ◽

Tatiana V Tatarinova ◽

...

Keyword(s):

Reference Genome ◽

Global Alignment ◽

Short Report ◽

Related Genome ◽

Genome Sequences ◽

Short Reads ◽

Relevant Reference ◽

Next Generation Sequencing Ngs ◽

Reference Genomes ◽

Generation Sequencing


AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

10.1101/2021.02.16.431517 ◽

2021 ◽

Author(s):

Jeremie S. Kim ◽

Can Firtina ◽

Meryem Banu Cavlak ◽

Damla Senol Cali ◽

Nastaran Hajinazar ◽

...

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Variant Calling ◽

Ground Truth ◽

Data Set ◽

C Elegans ◽

A Genome ◽

Downstream Analysis ◽

Similar Accuracy ◽

Reference Genomes

Abstract As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set. Code Availability: AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
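The two-step remapping idea that the abstract attributes to existing tools (find regions shared by both references, then shift read positions that fall inside them) can be illustrated with a minimal sketch. This is not AirLift's implementation or any real tool's API: the `Block` layout, the `remap_position` helper, and the example coordinates are all hypothetical, and real tools also handle strands, indels, and multiple chromosomes.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical block layout (an assumption, not a real file format):
# (old_start, old_end, new_start) means the half-open interval
# [old_start, old_end) of the old reference appears unchanged in the
# new reference beginning at new_start.
Block = Tuple[int, int, int]


def remap_position(blocks: List[Block], old_pos: int) -> Optional[int]:
    """Translate a position from the old reference to the new one.

    `blocks` must be sorted by old_start and must not overlap.
    Returns None when old_pos lies outside every shared block, which is
    the case the abstract describes as a read that cannot be remapped.
    """
    starts = [block[0] for block in blocks]
    # Index of the last block whose old_start is <= old_pos.
    i = bisect_right(starts, old_pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if old_pos >= old_end:
        return None  # falls in a gap between shared blocks
    return new_start + (old_pos - old_start)


# Two shared regions; positions 100..149 of the old reference have no
# counterpart in the new reference.
blocks = [(0, 100, 0), (150, 300, 120)]
remapped = [remap_position(blocks, p) for p in (42, 120, 200)]
```

Positions that return `None` here correspond to the reads that, per the abstract, must be fully mapped to the new reference instead of being shifted.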


Mapping whole genome shotgun sequence and variant calling in mammalian species without their reference genomes

F1000Research ◽

10.12688/f1000research.2-244.v2 ◽

2014 ◽

Vol 2 ◽

pp. 244 ◽

Cited By ~ 5

Author(s):

Ted Kalbfleisch ◽

Michael P. Heaton

Keyword(s):

Gene Function ◽

Reference Genome ◽

Sequence Data ◽

Association Studies ◽

Mammalian Species ◽

Variant Calling ◽

Ovis Aries ◽

Genome Sequences ◽

Mammalian Gene ◽

Reference Genomes

Genomics research in mammals has produced reference genome sequences that are essential for identifying variation associated with disease. High quality reference genome sequences are now available for humans, model species, and economically important agricultural animals. Comparisons between these species have provided unique insights into mammalian gene function. However, the number of species with reference genomes is small compared to those needed for studying molecular evolutionary relationships in the tree of life. For example, among the even-toed ungulates there are approximately 300 species whose phylogenetic relationships have been calculated in the 10k trees project. Only six of these have reference genomes: cattle, swine, sheep, goat, water buffalo, and bison. Although reference sequences will eventually be developed for additional hoof stock, the resources in terms of time, money, infrastructure and expertise required to develop a quality reference genome may be unattainable for most species for at least another decade. In this work we mapped 35 Gb of next generation sequence data of a Katahdin sheep to its own species’ reference genome (Ovis aries Oar3.1) and to that of a species that diverged 15 to 30 million years ago (Bos taurus UMD3.1). In total, 56% of reads covered 76% of UMD3.1 to an average depth of 6.8 reads per site; 83 million variants were identified, of which 78 million were homozygous and likely represent interspecies nucleotide differences. Excluding repeat regions and sex chromosomes, nearly 3.7 million heterozygous sites were identified in this animal vs. bovine UMD3.1, representing polymorphisms occurring in sheep. Of these, 41% could be readily mapped to orthologous positions in ovine Oar3.1, with 80% corroborated as heterozygous. These variant sites, identified via interspecies mapping, could be used for comparative genomics, disease association studies, and ultimately to understand mammalian gene function.


TEfinder: A Bioinformatics Pipeline for Detecting New Transposable Element Insertion Events in Next-Generation Sequencing Data

10.20944/preprints202012.0473.v1 ◽

2020 ◽

Author(s):

Vista Sohrab ◽

Cristina López-Díaz ◽

Antonio Di Pietro ◽

Li-Jun Ma ◽

Dilay Hazal Ayhan

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Short Term ◽

Bioinformatics Pipeline ◽

Bioinformatics Software ◽

External Software ◽

Short Term Adaptation ◽

Generation Sequencing

Transposable elements (TEs) are mobile genetic elements capable of rapidly altering the genome through their movements. The importance of TE activity has been documented in many biological processes, such as introducing genetic instability, altering patterns of gene expression, and accelerating genome evolution. Increasing appreciation of TEs has led to a growing number of bioinformatics tools for identifying insertion events. However, the application of existing TE finding tools is limited by either narrow-focused design of the package, too many dependencies on other tools, or prior knowledge required as input files that may not be readily available to all users. Here, we report a simple pipeline, TEfinder, developed for the detection of new TE insertions with minimal software dependencies using four inputs that can be easily generated with popular variant calling pipelines. The external software requirements are BEDTools, SAMtools, and Picard. Necessary inputs include TEs present in the reference genome, binary paired-end alignment, reference genome index, and a list of TE names. We tested the TEfinder pipeline on several evolving populations of Fusarium oxysporum generated through a short-term adaptation study. Our results demonstrate that this easy-to-use tool can effectively detect new TE insertion events, making it accessible and practical for TE analysis.


Graph Peak Caller: calling ChIP-Seq Peaks on Graph-based Reference Genomes

10.1101/286823 ◽

2018 ◽

Cited By ~ 3

Author(s):

Ivar Grytten ◽

Knut D. Rand ◽

Alexander J. Nederbragt ◽

Geir O. Storvik ◽

Ingrid K. Glad ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Individual Variation ◽

Reference Genome ◽

De Novo ◽

Variant Calling ◽

Graph Representation ◽

Pan Genome ◽

Reference Genomes ◽

Peak Caller ◽

Reference Graph

Abstract Graph-based representations are considered to be the future for reference genomes, as they allow integrated representation of the steadily increasing data on individual variation. Currently available tools allow de novo assembly of graph-based reference genomes, alignment of new read sets to the graph representation as well as certain analyses like variant calling and haplotyping. We here present a first method for calling ChIP-Seq peaks on read data aligned to a graph-based reference genome. The method is a graph generalization of the peak caller MACS2, and is implemented in an open source tool, Graph Peak Caller. By using the existing tool vg to build a pan-genome of Arabidopsis thaliana, we validate our approach by showing that Graph Peak Caller with a pan-genome reference graph can trace variants within peaks that are not part of the linear reference genome, and find peaks that in general are more motif-enriched than those found by MACS2.
