TGS-GapCloser: fast and accurately passing through the Bermuda in large genome using error-prone third-generation long reads

Mapping Intimacies ◽

10.1101/831248 ◽

2019 ◽

Cited By ~ 2

Author(s):

Mengyang Xu ◽

Lidong Guo ◽

Shengqiang Gu ◽

Ou Wang ◽

Rui Zhang ◽

...

Keyword(s):

Single Molecule ◽

Gene Annotation ◽

Variant Calling ◽

Software Tool ◽

Error Rates ◽

Third Generation ◽

Large Genome ◽

Long Reads ◽

Genome Assemblies ◽

Gap Closing

AbstractThe completeness and accuracy of genome assemblies determine the quality of subsequent bioinformatics analysis. Despite benefiting from the medium/long-range information of third-generation sequencing techniques, current gap-closing tools to enhance assemblies suffer multi-alignments and high error rates, resulting in huge time and money costs.We developed a software tool, TGS-GapCloser that uses the low depth (>=10X) single molecule sequencing long reads without any error correction to close gaps. The algorithm distinguishes gap regions from the alignments of long reads against original scaffolds, corrects only the candidate regions, and assigns the best sequences to each gap. We demonstrate that TGS-GapCloser improves the contig N50 value of draft assembly by 25-fold on average, updating above 90% gaps with 93.96% positive predictive value. Despite of high error rate of raw long reads, improved assemblies archive Q50 (99.999%) single-base accuracy with only 11.8% decrement to inputs. Besides it could complete more gaps, and is also ∼29-fold faster than mainstream gap-closing tools. BUSCO analysis revealed that 3.4%-13.1% more expected genes were complete. TGS-GapCloser also shows its power to fill gaps for ultra large genome assembly of ginkgo (∼12Gb) with 71.6% of gaps closed. The validation of inserted or merged gap sequences was conducted with NGS reads and reference genomes, respectively. The updated genome assemblies may promote the gene annotation, structure variant calling and thus improving the downstream analysis of ontogeny, phylogeny, and evolution.

Download Full-text

TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

GigaScience ◽

10.1093/gigascience/giaa094 ◽

2020 ◽

Vol 9 (9) ◽

Cited By ~ 3

Author(s):

Mengyang Xu ◽

Lidong Guo ◽

Shengqiang Gu ◽

Ou Wang ◽

Rui Zhang ◽

...

Keyword(s):

Single Molecule ◽

Sequence Data ◽

Software Tool ◽

Fold Increase ◽

Substantial Improvement ◽

Large Genome ◽

Long Reads ◽

Close Sequence ◽

Genome Assemblies ◽

Large Genomes

Abstract Background Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years single-molecule sequencing techniques generating long-read information have become available and enabled substantial improvement in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited. Findings We developed a software tool to close sequence gaps in genome assemblies, TGS-GapCloser, that uses low-depth (∼10×) long single-molecule reads. The algorithm extracts reads that bridge gap regions between 2 contigs within a scaffold, error corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of 3 human genome assemblies by 24-fold on average with only ∼10× coverage of Oxford Nanopore or Pacific Biosciences reads, covering with sequence data up to 94.8% gaps with 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy with final inserted sequences having 99.97% (Q35) accuracy, despite the high raw error rate of single-molecule reads, enabling high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as the ginkgo (∼12 Gb), TGS-GapCloser can cover 71.6% of gaps with sequence data. Conclusions TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.

Download Full-text

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Download Full-text

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

Download Full-text

Quality of Third Generation Sequencing

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9630 ◽

2020 ◽

Vol 17 (12) ◽

pp. 5205-5209

Author(s):

Ali Elbialy ◽

M. A. El-Dosuky ◽

Ibrahim M. El-Henawy

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Gc Content ◽

Error Rates ◽

Third Generation ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing

Third generation sequencing (TGS) relates to long reads but with relatively high error rates. Quality of TGS is a hot topic, dealing with errors. This paper combines and investigates three quality related metrics. They are basecalling accuracy, Phred Quality Scores, and GC content. For basecalling accuracy, a deep neural network is adopted. The measured loss does not exceed 5.42.

Download Full-text

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm

10.1101/066100 ◽

2016 ◽

Cited By ~ 6

Author(s):

Aleksey V. Zimin ◽

Daniela Puiu ◽

Ming-Cheng Luo ◽

Tingting Zhu ◽

Sergey Koren ◽

...

Keyword(s):

Single Molecule ◽

Large Scale ◽

Aegilops Tauschii ◽

Large Data ◽

Artificial Chromosome ◽

Error Rates ◽

Plant Genome ◽

Hybrid Assembly ◽

Data Set ◽

Long Reads

AbstractLong sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and highly repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.

Download Full-text

GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads

10.1101/125534 ◽

2017 ◽

Author(s):

Chong Chu ◽

Xin Li ◽

Yufeng Wu

Keyword(s):

Sequence Data ◽

Bacterial Genome ◽

Software Tool ◽

Sea Bass ◽

Short Sequence ◽

Asian Sea Bass ◽

Long Reads ◽

Local Assembly ◽

Genomic Repeats ◽

Gap Closing

AbstractBackgroundClosing gaps in draft genomes is an important post processing step in genome assembly. It leads to more complete genomes, which benefits downstream genome analysis such as annotation and genotyping. Several tools have been developed for gap closing. However, these tools don’t fully utilize the information contained in the sequence data. For example, while it is known that many gaps are caused by genomic repeats, existing tools often ignore many sequence reads that originate from a repeat-related gap.ResultsIn this paper, we propose a new approach called GAPPadder for gap closing. The main advantage of GAPPadder is that it uses more information in sequence data for gap closing. In particular, GAPPadder finds and uses reads that originate from repeate-related gaps. We show that these repeat-associated reads are useful for gap closing, even though they are ignored by all existing tools. Other main features of GAPPadder include utilizing the information in sequence reads with different insert sizes and performing two-stage local assembly of gap sequences. We compare GAPPadder with GapCloser, GapFiller and Sealer on one bacterial genome, human chromosome 14 and the human whole genome with paired-end and mate-paired reads with both short and long insert sizes. Empirical results show that GAPPadder can close more gaps than these existing tools. Besides closing gaps on draft genomes assembled only from short sequence reads, GAPPadder can also be used to close gaps for draft genomes assembled with long reads. We show GAPPadder can close gaps on the bed bug genome and the Asian sea bass genome that are assembled partially and fully with long reads respectively. We also show GAPPadder is efficient in both time and memory usage. The software tool, GAPPadder, is available for download at https://github.com/Reedwarbler/GAPPadder.

Download Full-text

Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes

10.1101/064352 ◽

2016 ◽

Cited By ~ 10

Author(s):

Derek M. Bickhart ◽

Benjamin D. Rosen ◽

Sergey Koren ◽

Brian L. Sayre ◽

Alex R. Hastie ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Chromosome Length ◽

Chromatin Interaction ◽

Capra Hircus ◽

Immune Gene ◽

Ruminant Species ◽

Long Reads ◽

Assembly Algorithms ◽

Genome Assemblies

AbstractThe decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms has resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most contiguous de novo mammalian assembly to date, with chromosome-length scaffolds and only 663 gaps. Our assembly represents a >250-fold improvement in contiguity compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, supporting the most complete repeat family and immune gene complex representation ever produced for a ruminant species.

Download Full-text

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa075 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Abstract Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue commonly exists, but few tools are dedicated for computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public uses.

Download Full-text

LINKS: Scaffolding genome assemblies with kilobase-long nanopore reads

10.1101/016519 ◽

2015 ◽

Cited By ~ 4

Author(s):

Rene L Warren ◽

Benjamin P Vandervalk ◽

Steven JM Jones ◽

Inanc Birol

Keyword(s):

Genome Assembly ◽

Error Rates ◽

Great Promise ◽

Genomic Information ◽

E Coli ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

K 12 ◽

Genome Assemblies

Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. Established and emerging long read technologies show great promise in this regard, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they could be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a solution that makes use of the information in error-rich long reads, without the need for read alignment or base correction. We show how the contiguity of an ABySS E. coli K-12 genome assembly could be increased over five-fold by the use of beta-released Oxford Nanopore Ltd. (ONT) long reads and how LINKS leverages long-range information in S. cerevisiae W303 ONT reads to yield an assembly with less than half the errors of competing applications. Re-scaffolding the colossal white spruce assembly draft (PG29, 20 Gbp) and how LINKS scales to larger genomes is also presented. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts. Availability: http://www.bcgsc.ca/bioinfo/software/links

Download Full-text

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

Nature Communications ◽

10.1038/s41467-019-13355-3 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 10

Author(s):

Huilong Du ◽

Chengzhi Liang

Keyword(s):

Single Molecule ◽

Repetitive Sequences ◽

Tartary Buckwheat ◽

Gene Sequences ◽

Assembly Method ◽

Long Reads ◽

A Genome ◽

Genome Assemblies ◽

Reference Genomes

AbstractThe abundant repetitive sequences in complex eukaryotic genomes cause fragmented assemblies, which lose value as reference genomes, often due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes. Here we report a genome assembly method HERA, which resolves repeats efficiently by constructing a connection graph from an overlap graph. We test HERA on the genomes of rice, maize, human, and Tartary buckwheat with single-molecule sequencing and mapping data. HERA correctly assembles most of the previously unassembled regions, resulting in dramatically improved, highly contiguous genome assemblies with newly assembled gene sequences. For example, the maize contig N50 size reaches 61.2 Mb and the Tartary buckwheat genome comprises only 20 contigs. HERA can also be used to fill gaps and fix errors in reference genomes. The application of HERA will greatly improve the quality of new or existing assemblies of complex genomes.

Download Full-text