Unicycler: resolving bacterial genome assemblies from short and long sequencing reads

Mapping Intimacies ◽

10.1101/096412 ◽

2016 ◽

Cited By ~ 1

Author(s):

Ryan R. Wick ◽

Louise M. Judd ◽

Claire L. Gorrie ◽

Kathryn E. Holt

Keyword(s):

Dna Sequencing ◽

De Novo ◽

Bacterial Genome ◽

Read Depth ◽

Resolving Power ◽

Short Reads ◽

Sequencing Platform ◽

Long Reads ◽

Combining Data ◽

Genome Assemblies

Abstract: The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce more complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate “hybrid” assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.
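The abstract above mentions semi-global alignment of long reads. As a rough illustration of the general idea only (this is a textbook dynamic-programming sketch, not Unicycler's aligner, and it aligns to a linear sequence rather than an assembly graph), the following scores an alignment in which gaps at either end of either sequence are free. The function name and the scoring parameters are assumptions chosen for the example.

```python
def semi_global_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Best semi-global alignment score: end gaps in either sequence are free.

    Illustrative scoring values; real aligners tune these to the error profile.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    # First row and first column stay at zero, so leading end gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align two bases
                score[i - 1][j] + gap,       # gap in the reference
                score[i][j - 1] + gap,       # gap in the read
            )
    # Trailing end gaps are free: take the best cell in the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


print(semi_global_score("ACGT", "TTACGTTT"))
```

A read contained entirely within the reference scores as a full match here, whereas a global alignment would penalise the unaligned flanks.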

Improved long read correction for de novo assembly using an FM-index

10.1101/067272 ◽

2016 ◽

Cited By ~ 1

Author(s):

James M. Holt ◽

Jeremy R. Wang ◽

Corbin D. Jones ◽

Leonard McMillan

Keyword(s):

De Novo ◽

State Of The Art ◽

Genomic Research ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

The Cost ◽

Genome Assemblies ◽

Hybrid Assemblies ◽

Burrows Wheeler Transform

Abstract: Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging “hybrid” assemblies that use long reads for scaffolding and short reads for accuracy. To this end, we describe a novel application of a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We show that our method efficiently produces significantly higher quality corrected sequence than existing hybrid error-correction methods. We demonstrate the effectiveness of our method compared to state-of-the-art hybrid and long-read only de novo assembly methods.
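For readers unfamiliar with the data structure named above, here is a minimal sketch of a Burrows-Wheeler Transform and the FM-index backward search that counts occurrences of a pattern. It is a naive, single-string teaching version under stated assumptions (sorted rotations, no sampled occurrence table, `$` as terminator); it does not reproduce the multi-string index or the correction method described in the abstract, and the function names are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Naive BWT via sorted rotations; '$' is assumed absent from text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real FM-index precomputes this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("banana")
print(index, count_occurrences(index, "ana"))
```

Counting how often a k-mer occurs in a short-read set is the kind of query such an index answers quickly, which is plausibly why it suits error correction; the details of how counts drive corrections are in the paper itself.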

High-Quality Genome Assembly of Peronospora destructor, the Causal Agent of Onion Downy Mildew

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-10-19-0280-a ◽

2020 ◽

Vol 33 (5) ◽

pp. 718-720

Author(s):

Karthi Natesan ◽

Ji Yeon Park ◽

Cheol-Woo Kim ◽

Dong Suk Park ◽

Young-Seok Kwon ◽

...

Keyword(s):

Downy Mildew ◽

De Novo ◽

Gc Content ◽

Comparative Genomic ◽

High Quality ◽

Sequencing Platform ◽

Peronospora Destructor ◽

Genomic Studies ◽

Genome Assemblies ◽

High Quality Genome

Peronospora destructor is an obligate biotrophic oomycete that causes downy mildew on onion (Allium cepa). Onion is an important crop worldwide, but its production is affected by this pathogen. We sequenced the genome of P. destructor using the PacBio sequencing platform, and de novo assembly resulted in 74 contigs with a total contig size of 29.3 Mb and 48.48% GC content. Here, we report the first high-quality genome sequence of P. destructor and its comparison with the genome assemblies of other oomycetes. The genome is a very useful resource to serve as a reference for analysis of P. destructor isolates and for comparative genomic studies of the biotrophic oomycetes.

riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions

10.1101/159798 ◽

2017 ◽

Author(s):

Nicholas R. Waters ◽

Florence Abram ◽

Fiona Brennan ◽

Ashleigh Holmes ◽

Leighton Pritchard

Keyword(s):

De Novo ◽

Bacterial Genome ◽

Genomic Context ◽

Inherent Difficulty ◽

Short Reads ◽

Genomic Architecture ◽

A Genome ◽

Ribosomal Operons ◽

Flanking Regions ◽

Bacterial Genome Sequencing

The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ≈10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit conservation in the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly-sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.

Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes

10.1101/064352 ◽

2016 ◽

Cited By ~ 10

Author(s):

Derek M. Bickhart ◽

Benjamin D. Rosen ◽

Sergey Koren ◽

Brian L. Sayre ◽

Alex R. Hastie ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Chromosome Length ◽

Chromatin Interaction ◽

Capra Hircus ◽

Immune Gene ◽

Ruminant Species ◽

Long Reads ◽

Assembly Algorithms ◽

Genome Assemblies

Abstract: The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms have resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate the current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most contiguous de novo mammalian assembly to date, with chromosome-length scaffolds and only 663 gaps. Our assembly represents a >250-fold improvement in contiguity compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, supporting the most complete repeat family and immune gene complex representation ever produced for a ruminant species.

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa075 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Abstract: Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
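To make the comparison described above concrete, the sketch below counts k-mers and contrasts a copy number inferred from short reads with the copy number seen in an assembly. The log-ratio with a pseudocount of one is an assumption chosen for illustration; the exact KAD definition and its category thresholds should be taken from the paper and its software. The modal read depth is passed in as a parameter rather than estimated, and reverse complements are ignored for brevity.

```python
from collections import Counter
from math import log2


def kmer_counts(sequences, k):
    """Count every k-mer across the sequences (forward strand only)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def copy_number_log_ratio(read_count, modal_depth, assembly_copies):
    """Log2 ratio of inferred to observed copy number, with a pseudocount.

    Around 0 when reads and assembly agree, positive when the assembly has
    fewer copies than the reads imply, negative when it has more.
    """
    inferred_copies = read_count / modal_depth
    return log2((inferred_copies + 1) / (assembly_copies + 1))


# A k-mer seen 30 times in reads at a modal depth of 30, once in the assembly.
print(copy_number_log_ratio(30, 30, 1))
# A k-mer present in the assembly but absent from the reads (possible error).
print(copy_number_log_ratio(0, 30, 1))
```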

LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
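The minimizer mappings mentioned above rest on a simple sampling idea: from every window of w consecutive k-mers, keep only the smallest one, so that two sequences sharing a long enough stretch also share sampled k-mers. The sketch below is a generic illustration, not ntLink's implementation; it orders k-mers lexicographically for readability, whereas practical tools typically order by a hash, and the function name and parameters are assumptions.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer of each window.

    A window is w consecutive k-mers; ties go to the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return sorted(chosen)


contig = minimizers("GATTACA", k=3, w=3)
read = minimizers("TTGATTACA", k=3, w=3)
print(contig)
# K-mers sampled from both sequences hint that the read overlaps the contig.
print({kmer for _, kmer in contig} & {kmer for _, kmer in read})
```

Because only a fraction of k-mers is kept, comparing minimizer sets is far cheaper than base-level alignment, which fits the "lightweight" description in the abstract.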

Polypolish: short-read polishing of long-read bacterial genome assemblies

10.1101/2021.10.14.464465 ◽

2021 ◽

Author(s):

Ryan R Wick ◽

Kathryn E Holt

Keyword(s):

Bacterial Genome ◽

Short Read ◽

Read Alignment ◽

Short Reads ◽

Repeat Sequences ◽

Short Read Alignment ◽

Long Read ◽

Genome Assemblies ◽

Residual Errors

Long-read-only bacterial genome assemblies usually contain residual errors, most commonly homopolymer-length errors. Short-read polishing tools can use short reads to fix these errors, but most rely on short-read alignment which is unreliable in repeat regions. Errors in such regions are therefore challenging to fix and often remain after short-read polishing. Here we introduce Polypolish, a new short-read polisher which uses all-per-read alignments to repair errors in repeat sequences that other polishers cannot. In benchmarking tests using both simulated and real reads, we find that Polypolish performs well, and the best results are achieved by using Polypolish in combination with other short-read polishers.

Highly-accurate long-read sequencing improves variant detection and assembly of a human genome

10.1101/519025 ◽

2019 ◽

Cited By ~ 27

Author(s):

Aaron M. Wenger ◽

Paul Peluso ◽

William J. Rowell ◽

Pi-Chuan Chang ◽

Richard J. Hall ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Short Reads ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Variant Detection ◽

High Quality Genome ◽

Circular Consensus Sequencing

Abstract: The major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized existing tools to comprehensively detect variants, achieving precision and recall above 99.91% for SNVs, 95.98% for indels, and 95.99% for structural variants. We estimate that 2,434 discordances are correctable mistakes in the high-quality Genome in a Bottle benchmark. Nearly all (99.64%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of 99.998%. CCS reads match short reads for small variant detection, while enabling structural variant detection and de novo assembly at similar contiguity and markedly higher concordance than noisy long reads.

Chromosomal-scale De novo Genome Assemblies of Cynomolgus Macaque and Common Marmoset

10.1101/2020.12.04.411207 ◽

2020 ◽

Author(s):

Vasanthan Jayakumar ◽

Osamu Nishimura ◽

Mitsutaka Kadota ◽

Naoki Hirose ◽

Hiromi Sano ◽

...

Keyword(s):

Human Genome ◽

Biomedical Research ◽

De Novo ◽

Massively Parallel Sequencing ◽

Cynomolgus Macaque ◽

Common Marmoset ◽

Contact Maps ◽

Long Reads ◽

Genome Assemblies ◽

Human Primate

Abstract: Cynomolgus macaque (Macaca fascicularis) and common marmoset (Callithrix jacchus) have been widely used in human biomedical research. Their genomes were sequenced and assembled initially using short-read sequences, with the advent of massively parallel sequencing. However, the resulting contig sequences tended to remain fragmentary, and long-standing primate genome assemblies used the human genome as a reference for ordering and orienting the assembled fragments into chromosomes. Here we performed de novo genome assembly of these two species without any human genome-based bias observed in the genome assemblies released earlier. First, we assembled PacBio long reads, and the resultant contigs were scaffolded with Hi-C data. The scaffolded sequences obtained were further refined based on assembly results of alternate de novo assemblies and Hi-C contact maps by resolving identified inconsistencies. The final assemblies achieved N50 lengths of 149 Mb and 137 Mb for cynomolgus macaque and common marmoset, respectively, and the numbers of scaffolds longer than 10 Mb are equal to their chromosome numbers. The high fidelity of our assembly is ascertained by concordance to the BAC-end read pairs observed for common marmoset, as well as a high resemblance of their karyotypic organization. Our assembly of cynomolgus macaque outperformed all the available assemblies of this species in terms of contiguity. The chromosome-scale genome assemblies produced in this study are valuable resources for non-human primate models and provide an important baseline in human biomedical research.
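Several abstracts in this list report contiguity as an N50 length. As a reminder of what that number means, N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The sketch below uses made-up lengths, and the function name is an assumption.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Total is 54; 10 + 9 + 8 = 27 reaches half, so N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 rewards a few long sequences regardless of their correctness, which is why the abstracts pair it with concordance or misassembly checks.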

Contamination as a major factor in poor Illumina assembly of microbial isolate genomes

10.1101/081885 ◽

2016 ◽

Cited By ~ 5

Author(s):

Haeyoung Jeong ◽

Jae-Goo Pan ◽

Seung-Hwan Park

Keyword(s):

Illumina Sequencing ◽

De Novo ◽

Repetitive Sequences ◽

Low Frequency ◽

Read Depth ◽

16S Rrna Genes ◽

Rrna Genes ◽

Sequencing Error ◽

Sequencing Data ◽

Long Reads

Abstract: The nonhybrid hierarchical assembly of PacBio long reads is becoming the most preferred method for obtaining genomes for microbial isolates. On the other hand, among massive numbers of Illumina sequencing reads produced, there is a slim chance of re-evaluating failed microbial genome assembly (high contig number, large total contig size, and/or the presence of low-depth contigs). We generated Illumina-type test datasets with various levels of sequencing error, pretreatment (trimming and error correction), repetitive sequences, contamination, and ploidy from both simulated and real sequencing data and applied k-mer abundance analysis to quickly detect possible diagnostic signatures of poor assemblies. Contamination was the only factor leading to poor assemblies for the test dataset derived from haploid microbial genomes, resulting in an extraordinary peak within low-frequency k-mer range. When thirteen Illumina sequencing reads of microbes belonging to genera Bacillus or Paenibacillus from a single multiplexed run were subjected to a k-mer abundance analysis, all three samples leading to poor assemblies showed peculiar patterns of contamination. Read depth distribution along the contig length indicated that all problematic assemblies suffered from too many contigs with low average read coverage, where 1% to 15% of total reads were mapped to low-coverage contigs. We found that subsampling or filtering out reads having rare k-mers could efficiently remove low-level contaminants and greatly improve the de novo assemblies. An analysis of 16S rRNA genes recruited from reads or contigs and the application of read classification tools originally designed for metagenome analyses can help identify the source of a contamination. The unexpected presence of proteobacterial reads across multiple samples, which had no relevance to our lab environment, implies that such prevalent contamination might have occurred after the DNA preparation step, probably at the place where sequencing service was provided.
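The filtering step described above can be pictured with a small sketch: count k-mers across all reads, then drop any read that contains a k-mer seen fewer than a threshold number of times. This is an assumption-laden toy, not the authors' pipeline. The function name, `k`, and `min_count` are hypothetical; it ignores reverse complements; and on real data rare k-mers also arise from sequencing errors, so a threshold would need to be chosen from the k-mer abundance histogram.

```python
from collections import Counter


def filter_reads_with_rare_kmers(reads, k, min_count):
    """Keep reads whose k-mers all occur at least min_count times overall.

    Reads shorter than k contain no k-mers and are kept unchanged.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return [
        read
        for read in reads
        if all(
            counts[read[i:i + k]] >= min_count
            for i in range(len(read) - k + 1)
        )
    ]


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTTGGG"]  # last read mimics a contaminant
print(filter_reads_with_rare_kmers(reads, k=3, min_count=2))
```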
