Scaffolding Contigs Using Multiple Reference Genomes

Computational Biology and Chemistry ◽

10.5772/intechopen.93456 ◽

2020 ◽

Author(s):

Yi-Kung Shieh ◽

Shu-Cheng Liu ◽

Chin Lung Lu

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

State Of The Art ◽

Draft Genome ◽

Evolutionary Relationship ◽

The State ◽

Target Genome ◽

Multiple Reference ◽

Reference Genomes

Scaffolding is an important step in genome assembly: it orders and orients the contigs of a draft genome assembly into larger scaffolds. Several single reference-based scaffolders have been proposed. However, a single reference genome may not be sufficient for a scaffolder to correctly scaffold a target draft genome, especially when the target and reference genomes are evolutionarily distant or differ by rearrangements. This has motivated the development of multiple reference-based scaffolders, which use several reference genomes that may provide different but complementary scaffolding information to scaffold the target draft genome. In this chapter, we review some of the state-of-the-art multiple reference-based scaffolders, such as Ragout, MeDuSa and Multi-CAR, and give a complete introduction to Multi-CSAR, an improved extension of Multi-CAR.


GraphAligner: rapid and versatile sequence-to-graph alignment

Genome Biology ◽

10.1186/s13059-020-02157-2 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

The State ◽

Graph Alignment ◽

Link Type ◽

Long Reads

Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


dnAQET: a framework to compute a consolidated metric for benchmarking quality of de novo assemblies

BMC Genomics ◽

10.1186/s12864-019-6070-x ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Gokhan Yavas ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Quality Assessment ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Quality Score ◽

De Novo Genome Assembly ◽

Genome Assemblies ◽

Reference Genomes ◽

Better Than

Background: Accurate de novo genome assembly has become reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly.

Results: To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET), which generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and with increasing average contig length and coverage, as expected. For genome builds, the dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies.

Conclusions: dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that the dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. dnAQET can help researchers to identify the most suitable assembly tools and to select the high-quality assemblies generated.


Optimization of the “in-silico” mate-pair method improved contiguity and accuracy of genome assembly

10.22541/au.163257605.53808833/v1 ◽

2021 ◽

Author(s):

Tao Zhou ◽

Liang Lu ◽

Chenhong Li

Keyword(s):

Genome Assembly ◽

In Silico ◽

Reference Genome ◽

Third Generation ◽

High Molecular Weight Dna ◽

The Third ◽

Mate Pair ◽

Pair Method ◽

Reference Genomes ◽

Generation Sequencing

A combination of next-generation sequencing technologies and mate-pair libraries of large insert sizes is used as a standard method to generate genome assemblies with high contiguity. Third-generation sequencing techniques are also used to improve the quality of assembled genomes. However, both mate-pair libraries and third-generation libraries require high-molecular-weight DNA, making them inappropriate for samples with only degraded DNA. An in silico method that generates mate-pair libraries from a reference genome was devised for assembling target genomes. Although this method significantly improved the contiguity and completeness of assembled genomes, the assemblies contained a high level of errors, and the methods for using reference genomes were not optimized. Here, we tested different strategies for using reference genomes to generate in silico mate-pairs. The results showed that using a closely related reference genome from the same genus was more effective than using divergent references. Keeping only the in silico mate-pairs conserved between two references and using those to guide genome assembly reduced the number of misassemblies (18.6%–46.1%) and increased the contiguity of assembled genomes (9.7%–70.7%), while maintaining gene completeness at a level that was either similar to or marginally lower than that obtained via the current method. Finally, we compared the optimized method with another reference-guided assembler, RaGOO. We found that RaGOO produced longer scaffolds (17.8 Mbp vs 3.0 Mbp), but resulted in a much higher misassembly rate (85.68%) than our optimized in silico mate-pair method.


A draft genome assembly of the eastern banjo frog Limnodynastes dumerilii dumerilii (Anura: Limnodynastidae)

10.1101/2020.03.03.971721 ◽

2020 ◽

Cited By ~ 1

Author(s):

Qiye Li ◽

Qunfei Guo ◽

Yang Zhou ◽

Huishuang Tan ◽

Terry Bertozzi ◽

...

Keyword(s):

Genome Assembly ◽

Draft Genome ◽

Protein Coding ◽

Large Genome ◽

Draft Genome Assembly ◽

Protein Coding Genes ◽

Repeat Content ◽

Australian Continent ◽

Large Genome Size ◽

Reference Genomes

Amphibian genomes are usually challenging to assemble due to large genome size and high repeat content. The Limnodynastidae is a family of frogs native to Australia, Tasmania and New Guinea. As an anuran lineage that successfully diversified on the Australian continent, it represents an important lineage in the amphibian tree of life but lacks reference genomes. Here we sequenced and annotated the genome of the eastern banjo frog Limnodynastes dumerilii dumerilii to fill this gap. The total length of the genome assembly is 2.38 Gb with a scaffold N50 of 285.9 kb. We identified 1.21 Gb of non-redundant sequences as repetitive elements and annotated 24,548 protein-coding genes in the assembly. BUSCO assessment indicated that more than 94% of the expected vertebrate genes were present in the genome assembly and the gene set. We anticipate that this annotated genome assembly will advance the future study of anuran phylogeny and amphibian genome evolution.


Kmer2SNP: reference-free SNP calling from raw reads based on matching

10.1101/2020.05.17.100305 ◽

2020 ◽

Author(s):

Yanbo Li ◽

Yu Lin

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Fundamental Problem ◽

Disease Diagnosis ◽

Hybrid Assembly ◽

Snp Calling ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Maximum Weight Matching ◽

Reference Genomes

The development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations, especially when high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using the reference genome, they have not been widely applied on large and complex genomes because existing approaches suffer from low recall/precision or high runtime. We develop a reference-free algorithm, Kmer2SNP, to call SNPs directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers which only appear in one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and connecting edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching and finally outputs SNPs as edges between k-mer pairs in the matching. We benchmark Kmer2SNP against reference-free methods including hybrid (assembly-based) and assembly-free methods on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs only using k-mers from raw reads without assembly. The source code is freely available at https://github.com/yanboANU/Kmer2SNP.
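The first two steps of the pipeline described in this abstract (counting k-mers, then linking heterozygous k-mers that might correspond to a SNP) can be illustrated with a minimal Python sketch. It is not the Kmer2SNP implementation: the function names, the `low`/`high` coverage band, and the rule that a candidate pair differs only at the middle base are assumptions made for illustration, and the edge weighting and maximum weight matching steps are omitted.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (reverse complements ignored)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def candidate_snp_pairs(counts, low, high):
    """Return pairs of k-mers that differ only at the middle base.

    Assumption: k-mers whose count lies in [low, high] are treated as
    potentially heterozygous; this band is a hypothetical stand-in for
    whatever the real tool derives from the k-mer frequency distribution.
    """
    het = [kmer for kmer, c in counts.items() if low <= c <= high]
    groups = {}
    for kmer in het:
        mid = len(kmer) // 2
        # k-mers sharing both flanks can differ only at the middle base
        groups.setdefault((kmer[:mid], kmer[mid + 1:]), []).append(kmer)
    pairs = []
    for kmers in groups.values():
        kmers.sort()
        for i in range(len(kmers)):
            for j in range(i + 1, len(kmers)):
                pairs.append((kmers[i], kmers[j]))
    return sorted(pairs)


reads = ["ACGTA", "ACGTA", "ACTTA", "ACTTA"]
pairs = candidate_snp_pairs(count_kmers(reads, 5), low=1, high=3)
```

Each returned pair corresponds to one candidate edge in the graph; a full pipeline would then weight these edges and select a matching among them.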


Solyntus, the New Highly Contiguous Reference Genome for Potato (Solanum tuberosum)

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401550 ◽

2020 ◽

Vol 10 (10) ◽

pp. 3489-3495

Author(s):

Natascha van Lieshout ◽

Ate van der Burgt ◽

Michiel E. de Vries ◽

Menno ter Maat ◽

David Eickholt ◽

...

Keyword(s):

Solanum Tuberosum ◽

Reference Genome ◽

De Novo ◽

Draft Genome ◽

Single Copy ◽

Rapid Expansion ◽

Potato Genome ◽

Homozygous Diploid ◽

Gene Orthologs ◽

Reference Genomes

With the rapid expansion of the application of genomics and sequencing in plant breeding, there is a constant drive for better reference genomes. In potato (Solanum tuberosum), the third largest food crop in the world, the related species S. phureja, designated “DM”, has been used as the most popular reference genome for the last 10 years. Here, we introduce the de novo sequenced genome of Solyntus as the next standard reference in potato genome studies. A true Solanum tuberosum made up of 116 contigs that is also highly homozygous, diploid, vigorous and self-compatible, Solyntus provides a more direct and contiguous reference than ever before available. It was constructed by sequencing with state-of-the-art long and short read technology and assembled with Canu. The 116 contigs were assembled into scaffolds to form each pseudochromosome, with three to 17 contigs per chromosome. This assembly contains 93.7% of the single-copy gene orthologs from the Solanaceae set and has an N50 of 63.7 Mbp. The genome and related files can be found at https://www.plantbreeding.wur.nl/Solyntus/. With the release of this research line and its draft genome we anticipate many exciting developments in (diploid) potato research.


GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment

10.1101/810812 ◽

2019 ◽

Cited By ~ 9

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

Graph Alignment ◽

Link Type ◽

Long Reads ◽

Reference Genomes ◽

Genome Graph

Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pan-genome graph. Yet, so far this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to state-of-the-art tools, GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employing GraphAligner for error correction, we find it to be almost 3x more accurate and over 15x faster than extant tools. Availability: package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


ONT-based draft genome assembly and annotation of Alternaria atra

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-01-21-0016-a ◽

2021 ◽

Author(s):

Bhawna Bonthala ◽

Corinn Sophia Small ◽

Maximilian Anton Lutz ◽

Alexander Graf ◽

Stefan Krebs ◽

...

Keyword(s):

Genome Assembly ◽

Plant Pathogens ◽

Biocontrol Agent ◽

Reference Genome ◽

Gene Prediction ◽

Draft Genome ◽

Protein Coding ◽

Draft Genome Assembly ◽

Reference Genome Assembly ◽

Wide Range

Species of Alternaria (phylum Ascomycota, family Pleosporaceae) are known as serious plant pathogens, causing major losses on a wide range of crops. Alternaria atra (Preuss) Woudenb. & Crous (previously known as Ulocladium atrum) can grow as a saprophyte on many hosts and causes Ulocladium blight on potato. It has been reported that it can also be used as a biocontrol agent against, among others, Botrytis cinerea. Here we present a scaffold-level reference genome assembly for A. atra. The assembly contains 43 scaffolds with a total length of 39.62 Mbp, with a scaffold N50 of 3,893,166 bp, an L50 of 4 and the longest 10 scaffolds containing 89.9% of the assembled data. RNA-Seq-guided gene prediction using BRAKER resulted in 12,173 protein-coding genes with their functional annotation. This first high-quality reference genome assembly and annotation for A. atra can be used as a resource for studying evolution in the highly complicated Alternaria genus and might help understand the mechanisms defining its role as pathogen or biocontrol agent.
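Several abstracts in this listing report N50 and L50 values. As a reminder of how these statistics are conventionally defined, here is a minimal Python sketch; the function name `n50_l50` and the toy lengths are illustrative and are not taken from any of the listed papers.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    assembly length; L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example with made-up lengths (total 30, half 15)
n50, l50 = n50_l50([10, 8, 6, 4, 2])
```

In the toy example the two longest sequences (10 + 8 = 18) are the first to reach half of the total, so the N50 is 8 and the L50 is 2.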


An Algorithm to Build a Multi-genome Reference

10.1101/2020.04.11.036871 ◽

2020 ◽

Cited By ~ 2

Author(s):

Leily Rabbani ◽

Jonas Müller ◽

Detlef Weigel

Keyword(s):

Reference Genome ◽

Single Species ◽

High Quality ◽

Sequencing Technologies ◽

Single Genome ◽

Mapping Sequence ◽

A Genome ◽

Shared Information ◽

Multiple Reference ◽

Reference Genomes

Motivation: New DNA sequencing technologies have enabled the rapid analysis of many thousands of genomes from a single species. At the same time, the conventional approach of mapping sequencing reads against a single reference genome sequence is no longer adequate. However, even where multiple high-quality reference genomes are available, the problem remains how one would integrate results from pairwise analyses.

Result: To overcome the limits imposed by mapping sequence reads against a single reference genome, or serially mapping them against multiple reference genomes, we have developed the MGR method that allows simultaneous comparison against multiple high-quality reference genomes, in order to remove the bias that comes from using only a single-genome reference and to simplify downstream analyses. To this end, we present the MGR algorithm that creates a graph (MGR graph) as a multi-genome reference. To reduce the size and complexity of the multi-genome reference, highly similar orthologous and paralogous regions are collapsed while more substantial differences are retained. To evaluate the performance of our model, we have developed a genome compression tool, which can be used to estimate the amount of shared information between genomes.

Availability: https://github.com/LeilyR/[email protected]


AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

10.1101/2021.02.16.431517 ◽

2021 ◽

Author(s):

Jeremie S. Kim ◽

Can Firtina ◽

Meryem Banu Cavlak ◽

Damla Senol Cali ◽

Nastaran Hajinazar ◽

...

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Variant Calling ◽

Ground Truth ◽

Data Set ◽

C Elegans ◽

A Genome ◽

Downstream Analysis ◽

Similar Accuracy ◽

Reference Genomes

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set.

Code availability: AirLift source code and a readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
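The region-based remapping that this abstract attributes to existing tools, including the drawback that reads outside any shared region cannot be remapped, can be sketched in a few lines of Python. This is not AirLift's algorithm: the block format `(old_start, old_end, new_start)` and the function name `remap_position` are assumptions for illustration only.

```python
from bisect import bisect_right


def remap_position(blocks, pos):
    """Translate a position from the old reference to the new one.

    `blocks` is a hypothetical list of (old_start, old_end, new_start)
    tuples describing regions shared by both references. Intervals are
    half-open, non-overlapping and sorted by old_start. Returns None
    when `pos` falls outside every shared region, which is the case
    where a region-based tool cannot remap the read.
    """
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos >= old_end:
        return None
    return new_start + (pos - old_start)


# Two shared regions with a gap between them (made-up coordinates)
blocks = [(0, 100, 50), (200, 300, 400)]
inside = remap_position(blocks, 250)
in_gap = remap_position(blocks, 150)
```

Positions that return None here are the ones a technique such as AirLift would have to handle by other means, for example by fully mapping the affected reads to the new reference.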
