Comparison of De Novo Assembly Strategies for Bacterial Genomes

Pengfei Zhang; Dike Jiang; Yin Wang; Xueping Yao; Yan Luo; Zexiao Yang

doi:10.3390/ijms22147668


International Journal of Molecular Sciences ◽

10.3390/ijms22147668 ◽

2021 ◽

Vol 22 (14) ◽

pp. 7668

Author(s):

Pengfei Zhang ◽

Dike Jiang ◽

Yin Wang ◽

Xueping Yao ◽

Yan Luo ◽

...

Keyword(s):

De Novo ◽

Bacterial Genome ◽

Sequencing Data ◽

Bacterial Genomes ◽

Accurate Analysis ◽

Short Read ◽

Sequencing Errors ◽

Assembly Accuracy ◽

Protein Prediction ◽

Long Read

(1) Background: Short-read sequencing allows for the rapid and accurate analysis of the whole bacterial genome but does not usually enable complete genome assembly. Long-read sequencing greatly assists with the resolution of complex bacterial genomes, particularly when combined with short-read Illumina data. However, it is not clear how different assembly strategies affect genomic accuracy, completeness, and protein prediction. (2) Methods: We compare different assembly strategies for Haemophilus parasuis, which causes Glässer’s disease in swine (characterized by fibrinous polyserositis and arthritis), using Illumina sequencing and long reads from either the Oxford Nanopore Technologies (ONT) or the SMRT Pacific Biosciences (PacBio) sequencing platform. (3) Results: Assembly with either PacBio or ONT reads, followed by polishing with Illumina reads, facilitated high-quality genome reconstruction and was superior to the long-read-only assembly and hybrid-assembly strategies when evaluated in terms of accuracy and completeness. An equally excellent method was correction with Homopolish after the ONT-only assembly, which had the advantage of avoiding hybrid sequencing with Illumina. Furthermore, aligning transcripts to the assembled genomes and their predicted CDSs showed that the sequencing errors of the ONT assembly were mainly indels generated when homopolymer regions were sequenced, which critically affected protein prediction. Polishing can fill indels and correct mistakes. (4) Conclusions: The assembly of bacterial genomes can be directly achieved by using long-read sequencing techniques. To maximize assembly accuracy, it is essential to polish the assembly with homologous sequences of related genomes or sequencing data from short-read technology.
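The abstract attributes most ONT assembly errors to indels in homopolymer regions. As a rough illustration only (not the authors' method), the sketch below lists homopolymer runs in a sequence so that such regions can be inspected. The function name `homopolymer_runs` and the default threshold `min_len=4` are assumptions chosen for this example.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of a single repeated base
    that is at least min_len long. min_len=4 is an arbitrary illustrative
    threshold, not a value taken from the paper."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Positions are 0-based offsets into the input string.
    print(homopolymer_runs("ACGTTTTTAAC", min_len=3))
```

Regions reported this way are candidates for closer checking after polishing, since a single missing or extra base in a coding homopolymer shifts the reading frame.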


Do You Want to Build a Genome? Benchmarking Hybrid Bacterial Genome Assembly Methods

10.1101/2021.11.07.467652 ◽

2021 ◽

Author(s):

Georgia L Breckell ◽

Olin K Silander

Keyword(s):

Genome Assembly ◽

Bacterial Genome ◽

Ground Truth ◽

Bacterial Genomes ◽

Short Read ◽

Assembly Method ◽

Assembly Accuracy ◽

A Genome ◽

Lower Accuracy ◽

Long Read

Long read sequencing technologies now allow routine highly contiguous assembly of bacterial genomes. However, because of the lower accuracy of some long read data, it is often combined with short read data (e.g. Illumina), to improve assembly quality. There are a number of methods available for producing such hybrid assemblies. Here we use Illumina and Oxford Nanopore (ONT) data from 49 natural isolates of Escherichia coli to characterise differences in assembly accuracy for five assembly methods (Canu, Unicycler, Raven, Flye, and Redbean). We evaluate assembly accuracy using five metrics designed to measure structural accuracy and sequence accuracy (indel and substitution frequency). We assess structural accuracy by quantifying (1) the contiguity of chromosomes and plasmids; (2) the fraction of concordantly mapped Illumina reads withheld from the assembly; and (3) whether rRNA operons are correctly oriented. We assess indel and substitution frequency by quantifying (1) the fraction of open reading frames that appear truncated and (2) the number of variants that are called using Illumina reads only. Applying these assembly metrics to a large number of E. coli strains, we find that different assembly methods offer different advantages. In particular, we find that Unicycler assemblies have the highest sequence accuracy in non-repetitive regions, while Flye and Raven tend to be the most structurally accurate. In addition, we find that there are unidentified strain-specific characteristics that affect ONT consensus accuracy, despite individual reads having similar levels of accuracy. The differences in consensus accuracy of the ONT reads can preclude accurate assembly regardless of assembly method. These results provide quantitative insight into the best approaches for hybrid assembly of bacterial genomes and the expected levels of structural and sequence accuracy. 
They also show that there are intrinsic idiosyncratic strain-level differences that inhibit accurate long read bacterial genome assembly. However, we also show it is possible to diagnose problematic assemblies, even in the absence of ground truth, by comparing long-read first and short-read first assemblies.
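One of the sequence-accuracy metrics described above is the fraction of open reading frames that appear truncated. A minimal sketch of that idea follows, assuming each predicted ORF has already been paired with the length of a reference protein or gene. The pairing step, the function name `truncated_fraction`, and the 0.9 length-ratio cutoff are all hypothetical and are not taken from the study.

```python
def truncated_fraction(length_pairs, min_ratio=0.9):
    """Fraction of ORFs whose predicted length falls below min_ratio times
    the length of the matched reference.

    length_pairs: iterable of (predicted_length, reference_length) tuples.
    min_ratio: hypothetical cutoff; 0.9 is an illustrative choice.
    """
    pairs = list(length_pairs)
    if not pairs:
        return 0.0
    truncated = sum(
        1 for predicted, reference in pairs if predicted < min_ratio * reference
    )
    return truncated / len(pairs)


if __name__ == "__main__":
    # Three ORFs: one full length, one half length, one at 95% of reference.
    print(truncated_fraction([(100, 100), (50, 100), (95, 100)]))
```

A higher fraction suggests more frameshifting indels or premature stops in the assembly, which is why the metric can serve as an accuracy proxy when no ground truth is available.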


Draft genome assemblies using sequencing reads from Oxford Nanopore Technology and Illumina platforms for four species of North American Fundulus killifish

GigaScience ◽

10.1093/gigascience/giaa067 ◽

2020 ◽

Vol 9 (6) ◽

Cited By ~ 3

Author(s):

Lisa K Johnson ◽

Ruta Sahasrabudhe ◽

James Anthony Gill ◽

Jennifer L Roach ◽

Lutz Froenicke ◽

...

Keyword(s):

North American ◽

De Novo ◽

Draft Genome ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Sequence Coverage ◽

Short Read ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Background: Whole-genome sequencing data from wild-caught individuals of closely related North American killifish species (Fundulus xenicus, Fundulus catenatus, Fundulus nottii, and Fundulus olivaceus) were obtained using long-read Oxford Nanopore Technology (ONT) PromethION and short-read Illumina platforms. Findings: Draft de novo reference genome assemblies were generated using a combination of long and short sequencing reads. For each species, the PromethION platform was used to generate 30–45× sequence coverage, and the Illumina platform was used to generate 50–160× sequence coverage. Illumina-only assemblies were fragmented with high numbers of contigs, while ONT-only assemblies were error prone with low BUSCO scores. The highest N50 values, ranging from 0.4 to 2.7 Mb, were from assemblies generated using a combination of short- and long-read data. BUSCO scores were consistently >90% complete using the Eukaryota database. Conclusions: High-quality genomes can be obtained from a combination of using short-read Illumina data to polish assemblies generated with long-read ONT data. Draft assemblies and raw sequencing data are available for public use. We encourage use and reuse of these data for assembly benchmarking and other analyses.
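The N50 values quoted above summarize assembly contiguity. As a minimal sketch of the standard definition (the length of the contig at which the cumulative sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length), the function below computes N50 from a list of contig lengths. The function name `n50` and the example lengths are illustrative and do not come from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, total 32; the two longest contigs sum to 16.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on contig lengths, it says nothing about base-level accuracy, which is why the abstract reports it alongside BUSCO completeness.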


DACCOR - Detection, charACterization, and reconstruction of Repetitive regions in bacterial genomes

10.7287/peerj.preprints.3480v1 ◽

2017 ◽

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases of the genotypers. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterwards the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR, which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.


Comparative analysis of alignment tools for application on Nanopore sequencing data

Current Directions in Biomedical Engineering ◽

10.1515/cdbme-2021-2212 ◽

2021 ◽

Vol 7 (2) ◽

pp. 831-834

Author(s):

Chiara Becht ◽

Jonas Schmidt ◽

Frithjof Blessing ◽

Folker Wenzel

Keyword(s):

Error Rate ◽

De Novo ◽

Performance Criteria ◽

Computational Time ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Accurate Analysis ◽

Match Rate ◽

Long Read ◽

And Performance

INTRODUCTION: Long-read sequencing techniques, such as Oxford Nanopore sequencing, represent a promising novel approach in molecular-biological methodology that may facilitate mapping and de novo assembly. In comparison to conventional sequencing methods, novel alignment tools are required to compensate for differing data structures (especially the high error rate) and to achieve acceptably accurate analysis results. METHODS: In this study, the long-read aligners BLASR, GraphMap, LAST, minimap2 and NGMLR and the short-read aligner BWA MEM were benchmarked on three experimental datasets. The alignment results were compared on various quality and performance criteria, such as match rate, mismatch rate, error rate, working memory usage and computational time. RESULTS: The comparison revealed differences in alignment quality and performance among the tools under test. LAST showed the largest differences among all tools. Minimap2 achieved constant quality with good performance. BLASR, GraphMap, BWA MEM and NGMLR showed only slight differences. CONCLUSION: Differences among the tools could be explained by dataset characteristics and the algorithmic approaches of the individual tools. All tools except BLASR seem applicable to Nanopore sequencing data. Therefore, the tool should be selected with the experimental design and the further downstream analysis in mind.


DACCOR–Detection, characterization, and reconstruction of repetitive regions in bacterial genomes

PeerJ ◽

10.7717/peerj.4742 ◽

2018 ◽

Vol 6 ◽

pp. e4742 ◽

Cited By ~ 1

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterward, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR (detection, characterization, and reconstruction of repetitive regions), which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.


Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes

10.1101/530824 ◽

2019 ◽

Cited By ~ 3

Author(s):

Nicola De Maio ◽

Liam P. Shaw ◽

Alasdair Hubbard ◽

Sophie George ◽

Nick Sanderson ◽

...

Keyword(s):

Bacterial Genome ◽

Hybrid Assembly ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Genome Reconstruction ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Sequencing Platforms

ABSTRACT: Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly. IMPACT STATEMENT: Illumina short-read sequencing is frequently used for tasks in bacterial genomics, such as assessing which species are present within samples, checking if specific genes of interest are present within individual isolates, and reconstructing the evolutionary relationships between strains.
However, while short-read sequencing can reveal significant detail about the genomic content of bacterial isolates, it is often insufficient for assessing genomic structure: how different genes are arranged within genomes, and particularly which genes are on plasmids – potentially highly mobile components of the genome frequently carrying antimicrobial resistance elements. This is because Illumina short reads are typically too short to span repetitive structures in the genome, making it impossible to accurately reconstruct these repetitive regions. One solution is to complement Illumina short reads with long reads generated with SMRT Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) sequencing platforms. Using this approach, called ‘hybrid assembly’, we show that we can automatically fully reconstruct complex bacterial genomes of Enterobacteriaceae isolates in the majority of cases (best-performing method: 17/20 isolates). In particular, by comparing different methods we find that using the assembler Unicycler with Illumina and ONT reads represents a low-cost, high-quality approach for reconstructing bacterial genomes using publicly available software. DATA SUMMARY: Raw sequencing data and assemblies have been deposited in NCBI under BioProject Accession PRJNA422511 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422511). We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.


DACCOR - Detection, charACterization, and reconstruction of Repetitive regions in bacterial genomes

10.7287/peerj.preprints.3480 ◽

2017 ◽

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing


A high-throughput multiplexing and selection strategy to complete bacterial genomes

GigaScience ◽

10.1093/gigascience/giab079 ◽

2021 ◽

Vol 10 (12) ◽

Cited By ~ 1

Author(s):

Sergio Arredondo-Alonso ◽

Anna K Pöntinen ◽

François Cléon ◽

Rebecca A Gladstone ◽

Anita C Schürch ◽

...

Keyword(s):

High Throughput ◽

Selection Strategy ◽

Bacterial Isolates ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Read Sequencing ◽

Long Read ◽

Isolate Selection ◽

Selection Of

Background: Bacterial whole-genome sequencing based on short-read technologies often results in a draft assembly formed by contiguous sequences. The introduction of long-read sequencing technologies permits those contiguous sequences to be unambiguously bridged into complete genomes. However, the elevated costs associated with long-read sequencing frequently limit the number of bacterial isolates that can be long-read sequenced. Here we evaluated the recently released 96 barcoding kit from Oxford Nanopore Technologies (ONT) to generate complete genomes on a high-throughput basis. In addition, we propose an isolate selection strategy that optimizes a representative selection of isolates for long-read sequencing considering as input large-scale bacterial collections. Results: Despite an uneven distribution of long reads per barcode, near-complete chromosomal sequences (assembly contiguity = 0.89) were generated for 96 Escherichia coli isolates with associated short-read sequencing data. The assembly contiguity of the plasmid replicons was even higher (0.98), which indicated the suitability of the multiplexing strategy for studies focused on resolving plasmid sequences. We benchmarked hybrid and ONT-only assemblies and showed that the combination of ONT sequencing data with short-read sequencing data is still highly desirable (i) to perform an unbiased selection of isolates for long-read sequencing, (ii) to achieve an optimal genome accuracy and completeness, and (iii) to include small plasmids underrepresented in the ONT library. Conclusions: The proposed long-read isolate selection ensures the completion of bacterial genomes that span the genome diversity inherent in large collections of bacterial isolates. We show the potential of using this multiplexing approach to close bacterial genomes on a high-throughput basis.


Complete, closed bacterial genomes from microbiomes using nanopore sequencing

Nature Biotechnology ◽

10.1038/s41587-020-0422-6 ◽

2020 ◽

Vol 38 (6) ◽

pp. 701-707 ◽

Cited By ~ 34

Author(s):

Eli L. Moss ◽

Dylan G. Maghini ◽

Ami S. Bhatt

Keyword(s):

Bacterial Species ◽

Genome Structure ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Repeat Elements ◽

Long Read ◽

Metagenomics Data ◽

Read Error Correction

Microbial genomes can be assembled from short-read sequencing data, but the assembly contiguity of these metagenome-assembled genomes is constrained by repeat elements. Correct assignment of genomic positions of repeats is crucial for understanding the effect of genome structure on genome function. We applied nanopore sequencing and our workflow, named Lathe, which incorporates long-read assembly and short-read error correction, to assemble closed bacterial genomes from complex microbiomes. We validated our approach with a synthetic mixture of 12 bacterial species. Seven genomes were completely assembled into single contigs and three genomes were assembled into four or fewer contigs. Next, we used our methods to analyze metagenomics data from 13 human stool samples. We assembled 20 circular genomes, including genomes of Prevotella copri and a candidate Cibiobacter sp. Despite the decreased nucleotide accuracy compared with alternative sequencing and assembly approaches, our methods improved assembly contiguity, allowing for investigation of the role of repeat elements in microbial function and adaptation.


On the (Im)possibility to Reconstruct Plasmids from Whole Genome Short-Read Sequencing Data

10.1101/086744 ◽

2016 ◽

Cited By ~ 11

Author(s):

Sergio Arredondo-Alonso ◽

Willem van Schaik ◽

Rob J. Willems ◽

Anita C. Schürch

Keyword(s):

Complete Genome ◽

De Novo ◽

Bacterial Genome ◽

Composition Analysis ◽

Bacterial Survival ◽

Sequencing Data ◽

Genome Sequences ◽

High Throughput Analysis ◽

Short Read ◽

Short Read Sequencing

Abstract: Plasmids are autonomous extra-chromosomal elements in bacterial cells that can carry genes that are important for bacterial survival. To benchmark algorithms for automated plasmid sequence reconstruction from short read sequencing data, we selected 42 publicly available complete bacterial genome sequences which were assembled by a combination of long- and short-read data. The selected bacterial genome sequence projects span 12 genera, containing 148 plasmids. We predicted plasmids from short-read data with four different programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences. PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall = 0.82) but approximately a quarter of the predicted plasmid contigs were false positives (precision = 0.76). PlasmidSPAdes merged 83 % of the predictions from genomes with multiple plasmids in a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences and correctly predicted small plasmids but failed with long plasmids (recall = 0.12, precision = 0.30). cBar, which applies pentamer frequency composition analysis to detect plasmid-derived contigs, showed an overall recall and precision of 0.78 and 0.64. However, cBar only categorizes contigs as plasmid-derived and does not bin the different plasmids correctly within a bacterial isolate. PlasmidFinder, which searches for matches in a replicon database, had the highest precision (1.0) but was restricted by the contents of its database and the contig length obtained from de novo assembly (recall = 0.36). Surprisingly, PlasmidSPAdes and Recycler detected single isolated components corresponding to putative novel small plasmids (<10 kbp) which were also predicted as plasmids by cBar. This study shows that it is possible to automatically predict plasmid sequences, but only for small plasmids.
The reconstruction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of WGS data. Author Summary: Short read sequencing of the DNA of bacteria is often used to understand characteristics such as antibiotic resistance. However the assembly of short read sequencing data with the goal of reconstructing a complete genome is often fragmented and leaves gaps. Therefore independently replicating DNA fragments called plasmids cannot easily be identified from an assembly. Lately a number of programs have been developed to enable the automated prediction of the sequences of plasmids. Here we tested these programs by comparing their outcomes with complete genome sequences. None of the tested programs were able to fully and unambiguously predict distinct plasmid sequences. All programs performed best with the prediction of plasmids smaller than 50 kbp. Larger plasmids were only correctly predicted if they were present as a single contig in the assembly. While predictions by PlasmidSPAdes and cBar contained most of the plasmids, they were merged with or indistinguishable from other plasmids and sometimes chromosome sequences. PlasmidFinder missed most plasmids but all its predictions were correct. Without manual steps or long-read sequencing information, plasmid reconstruction from short read sequencing data remains challenging.
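The benchmark above reports each tool's performance as recall and precision. As a reminder of how the two figures relate to raw counts, here is a minimal sketch using the standard definitions. The function name `precision_recall` and the counts in the usage example are made up for illustration and are not data from the study, which may have counted at the level of contigs or bases.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute (precision, recall) from raw counts.

    precision = TP / (TP + FP): share of predictions that are correct.
    recall    = TP / (TP + FN): share of true items that were predicted.
    Returns 0.0 for a ratio whose denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 3 correct predictions, 1 false positive, 2 misses.
    precision, recall = precision_recall(true_pos=3, false_pos=1, false_neg=2)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Read this way, a tool with precision 1.0 but low recall, as reported for PlasmidFinder, makes no false calls but misses many true plasmids.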
