Recovering individual haplotypes and a contiguous genome assembly from pooled long read sequencing of the diamondback moth (Lepidoptera: Plutellidae)

Mapping Intimacies ◽

10.1101/867879 ◽

2019 ◽

Author(s):

Samuel Whiteford ◽

Arjen E. van’t Hof ◽

Ritesh Krishna ◽

Thea Marubbi ◽

Stephanie Widdison ◽

...

Keyword(s):

Linkage Map ◽

Diamondback Moth ◽

Reference Sequence ◽

Final Assembly ◽

Long Reads ◽

Long Read ◽

Divergent Haplotype ◽

Map Integration ◽

Related Individuals ◽

Methodological Approaches

Abstract Background Recent advances in genomics have addressed the challenge that divergent haplotypes pose to the reconstruction of haploid genomes. However, for many organisms, the sequencing of either field-caught individuals or a pool of heterogeneous individuals is still the only practical option. Here we present methodological approaches to achieve three outcomes from pooled long read sequencing: the generation of a contiguous haploid reference sequence, the sequences of heterozygous haplotypes, and reconstructed genomic sequences of individuals related to the pooled material. Results PacBio long read sequencing, Dovetail Hi-C scaffolding and linkage map integration yielded a haploid chromosome-level assembly for the diamondback moth (Plutella xylostella), a global pest of Brassica crops, from a pool of related individuals. The final assembly consisted of 573 scaffolds, with a total assembly size of 343.6 Mbp, a scaffold N50 value of 11.3 Mbp (limited by chromosome size) and a maximum scaffold size of 14.4 Mbp. This assembly was then integrated with an existing RAD-seq linkage map, anchoring 95% of the assembled sequence to defined chromosomal positions. Conclusions We describe an approach to resolve divergent haplotype sequences and describe multiple validation approaches. We also reconstruct individual genomes from pooled long reads by applying a recently developed k-mer binning method.
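The scaffold N50 quoted in the abstract above is a standard contiguity statistic: the length of the scaffold at which the cumulative length, counted from the longest scaffold down, first reaches half of the total assembly size. A minimal sketch of that calculation follows. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated; the N50 is
    the length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy scaffold lengths (hypothetical, in arbitrary units).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

For the toy input the total is 20, and the two longest scaffolds (6 and 5) are the first to cover at least half of it, so the reported N50 is 5.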

Improved chromosome-level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) integrating Pacific Biosciences long reads and a high-density linkage map

GigaScience ◽

10.1093/gigascience/giab097 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Olli-Pekka Smolander ◽

Daniel Blande ◽

Virpi Ahola ◽

Pasi Rastas ◽

Jaakko Tanskanen ◽

...

Keyword(s):

Linkage Map ◽

Metapopulation Dynamics ◽

Melitaea Cinxia ◽

Pacific Biosciences ◽

Final Assembly ◽

Long Reads ◽

Glanville Fritillary Butterfly ◽

Gene Models ◽

High Density Linkage Map ◽

Chromosome Level

Abstract Background The Glanville fritillary (Melitaea cinxia) butterfly is a model system for metapopulation dynamics research in fragmented landscapes. Here, we provide a chromosome-level assembly of the butterfly's genome produced from Pacific Biosciences sequencing of a pool of males, combined with a linkage map from population crosses. Results The final assembly size of 484 Mb is an increase of 94 Mb on the previously published genome. Estimation of the completeness of the genome with BUSCO indicates that the genome contains 92–94% of the BUSCO genes in complete and single copies. We predicted 14,810 genes using the MAKER pipeline and manually curated 1,232 of these gene models. Conclusions The genome and its annotated gene models are a valuable resource for future comparative genomics, molecular biology, transcriptome, and genetics studies on this species.

Accelerating long-read analysis on modern CPUs

10.1101/2021.07.21.453294 ◽

2021 ◽

Author(s):

Saurabh Kalikar ◽

Chirag Jain ◽

Vasimuddin Md ◽

Sanchit Misra

Keyword(s):

Data Structure ◽

Sequence Alignment ◽

Genome Assembly ◽

Draft Genome ◽

Reference Sequence ◽

Pairwise Sequence Alignment ◽

Draft Genome Assembly ◽

Index Data ◽

Long Reads ◽

Long Read

Long read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here, we present techniques to accelerate minimap2, a widely used mapping tool. We present multiple optimizations using SIMD parallelization, efficient cache utilization and a learned index data structure to accelerate its three main computational modules: seeding, chaining and pairwise sequence alignment. Together, these reduce the end-to-end mapping time of minimap2 by up to 3.5x while maintaining identical output.
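To make the seeding step more concrete: minimap2 seeds alignments with minimizers, that is, one representative k-mer chosen from each window of consecutive k-mers. The sketch below illustrates only that general idea. It orders k-mers lexicographically and ignores reverse complements, whereas minimap2 uses hashed, strand-aware k-mers, and it does not reproduce the SIMD, cache or learned-index optimizations described in the abstract. The function name `minimizers` and the parameters `k` and `w` are illustrative assumptions.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs for the minimizers of seq.

    A minimizer is the smallest k-mer in a window of w consecutive k-mers.
    Ordering is plain lexicographic here (an assumption made for clarity);
    ties within a window are broken by taking the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


# Toy sequence (hypothetical): adjacent windows often share a minimizer,
# so fewer seeds are kept than there are k-mers.
print(minimizers("ACGTACGT", k=3, w=2))
```

Because neighbouring windows frequently select the same k-mer, the set of seeds is a subsample of all k-mers, which is what keeps the index small and the lookup stage fast.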

Ultra-accurate microbial amplicon sequencing with synthetic long reads

Microbiome ◽

10.1186/s40168-021-01072-3 ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Benjamin J. Callahan ◽

Dmitry Grinevich ◽

Siddhartha Thakur ◽

Michael A. Balamotis ◽

Tuval Ben Yehezkel

Keyword(s):

Microbial Community ◽

16S Rrna ◽

Amplicon Sequencing ◽

Species Level ◽

Full Length ◽

16S Rrna Genes ◽

Rrna Genes ◽

Strain Identification ◽

Long Reads ◽

Long Read

Abstract Background Out of the many pathogenic bacterial species that are known, only a fraction are readily identifiable directly from a complex microbial community using standard next generation DNA sequencing. Long-read sequencing offers the potential to identify a wider range of species and to differentiate between strains within a species, but attaining sufficient accuracy in complex metagenomes remains a challenge. Methods Here, we describe and analytically validate LoopSeq, a commercially available synthetic long-read (SLR) sequencing technology that generates highly accurate long reads from standard short reads. Results LoopSeq reads are sufficiently long and accurate to identify microbial genes and species directly from complex samples. LoopSeq perfectly recovered the full diversity of 16S rRNA genes from known strains in a synthetic microbial community. Full-length LoopSeq reads had a per-base error rate of 0.005%, which exceeds the accuracy reported for other long-read sequencing technologies. 18S-ITS and genomic sequencing of fungal and bacterial isolates confirmed that LoopSeq sequencing maintains that accuracy for reads up to 6 kb in length. LoopSeq full-length 16S rRNA reads could accurately classify organisms down to the species level in rinsate from retail meat samples, and could differentiate strains within species identified by the CDC as potential foodborne pathogens. Conclusions The order-of-magnitude improvement in length and accuracy over standard Illumina amplicon sequencing achieved with LoopSeq enables accurate species-level and strain identification from complex- to low-biomass microbiome samples. The ability to generate accurate and long microbiome sequencing reads using standard short read sequencers will accelerate the building of quality microbial sequence databases and removes a significant hurdle on the path to precision microbial genomics.
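The per-base error rate of 0.005% reported above can be put on the familiar Phred scale, where Q = -10 * log10(p) for an error probability p. The short sketch below does that conversion and also estimates the expected number of errors in a single read. The read length of 1,500 bases used for a full-length 16S rRNA read is an approximate, assumed figure and does not come from the abstract; the function names are illustrative.

```python
import math


def phred_quality(error_rate):
    """Convert a per-base error probability to a Phred quality score."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def expected_errors(error_rate, read_length):
    """Expected number of erroneous bases in a read of the given length."""
    return error_rate * read_length


rate = 0.005 / 100  # 0.005% expressed as a probability
print(round(phred_quality(rate), 1))
# 1500 is an assumed, approximate full-length 16S read length.
print(expected_errors(rate, 1500))
```

Under these assumptions the error rate corresponds to roughly Q43, and most full-length 16S reads would be expected to contain no errors at all.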

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads

BMC Genomics ◽

10.1186/s12864-021-07702-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Padmini Ramachandran ◽

Niranjan Nagarajan ◽

Denis Bertrand ◽

...

Keyword(s):

Public Health ◽

Public Health Response ◽

High Quality ◽

Short Read ◽

Short Reads ◽

The Core ◽

Long Reads ◽

Health Response ◽

Long Read ◽

Core Genes

Abstract Background Whole genome sequencing of cultured pathogens is the state of the art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high quality genome can be recovered. Highly accurate short read data is analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains but cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads. Results We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally-contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment, but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h enrichment but with less fidelity for the core genes than the short read assemblies. 
Conclusion The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes, which can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Here we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

Short and long-read genome sequencing methodologies for somatic variant detection; genomic analysis of a patient with diffuse large B-cell lymphoma

Scientific Reports ◽

10.1038/s41598-021-85354-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hannah E. Roberts ◽

Maria Lopopolo ◽

Alistair T. Pagnamenta ◽

Eshita Sharma ◽

Duncan Parkes ◽

...

Keyword(s):

B Cell ◽

Genome Sequencing ◽

Cell Lymphoma ◽

B Cell Lymphoma ◽

Somatic Variation ◽

Single Nucleotide Variants ◽

Germline Variants ◽

Specificity And Sensitivity ◽

Long Reads ◽

Long Read

Abstract Recent advances in throughput and accuracy mean that the Oxford Nanopore Technologies PromethION platform is now a viable solution for genome sequencing. Much of the validation of bioinformatic tools for this long-read data has focussed on calling germline variants (including structural variants). Somatic variants are outnumbered many-fold by germline variants and their detection is further complicated by the effects of tumour purity/subclonality. Here, we evaluate the extent to which Nanopore sequencing enables detection and analysis of somatic variation. We do this through sequencing tumour and germline genomes for a patient with diffuse B-cell lymphoma and comparing results with 150 bp short-read sequencing of the same samples. Calling germline single nucleotide variants (SNVs) from specific chromosomes of the long-read data achieved good specificity and sensitivity. However, results of somatic SNV calling highlight the need for the development of specialised joint calling algorithms. We find the comparative genome-wide performance of different tools varies significantly between structural variant types, and suggest long reads are especially advantageous for calling large somatic deletions and duplications. Finally, we highlight the utility of long reads for phasing clinically relevant variants, confirming that a somatic 1.6 Mb deletion and a p.(Arg249Met) mutation involving TP53 are oriented in trans.

Long read sequencing reveals sequential complex rearrangements driven by Hepatitis B virus integration

10.1101/2021.12.09.471697 ◽

2021 ◽

Author(s):

Songbo Wang ◽

Jiadong Lin ◽

Xiaofei Yang ◽

Zihang Li ◽

Tun Xu ◽

...

Keyword(s):

Hepatitis B ◽

Clinical Samples ◽

Metabolic Dysfunction ◽

Cellular Functions ◽

Human Genomes ◽

Long Reads ◽

B Virus ◽

Long Read ◽

Genetic Structures ◽

Virus Integration

Integration of hepatitis B virus (HBV) into the human genome disrupts genetic structures and cellular functions. Here, we conducted multiplatform long read sequencing on two cell lines and five clinical samples of HBV-induced hepatocellular carcinomas (HCC). We resolved two types of complex genome rearrangements induced by viral integration and established a Time-phased Integration and Rearrangement Model (TIRM) to depict their formation process by differentiating inserted HBV copies with HiFi long reads. We showed that the two complex types were initiated by focal replacements and that the fragile virus-human junctions triggered subsequent rearrangements. We further revealed that these rearrangements promoted a prevalent loss of heterozygosity at chr4q, accounting for 19.5% of HCC samples in the ICGC cohort and contributing to immune and metabolic dysfunction. Overall, our long read based analysis reveals a novel sequential rearrangement process driven by HBV integration, hinting at its structural and functional implications for human genomes.

Contiguity: Contig adjacency graph construction and visualisation

10.7287/peerj.preprints.1037v1 ◽

2015 ◽

Cited By ~ 8

Author(s):

Mitchell J Sullivan ◽

Nouri L Ben Zakour ◽

Brian M Forde ◽

Mitchell Stanton-Cook ◽

Scott A Beatson

Keyword(s):

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Interactive Software ◽

Graph Exploration ◽

Adjacency Graph ◽

Highly Sensitive ◽

Long Read ◽

Genome Assemblies ◽

Adjacency Graphs

Contiguity is an interactive software for the visualization and manipulation of de novo genome assemblies. Contiguity creates and displays information on contig adjacency which is contextualized by the simultaneous display of a comparison between assembled contigs and reference sequence. Where scaffolders allow unambiguous connections between contigs to be resolved into a single scaffold, Contiguity allows the user to create all potential scaffolds in ambiguous regions of the genome. This enables the resolution of novel sequence or structural variants from the assembly. In addition, Contiguity provides a sequencing and assembly agnostic approach for the creation of contig adjacency graphs. To maximize the number of contig adjacencies determined, Contiguity combines information from read pair mappings, sequence overlap and De Bruijn graph exploration. We demonstrate how highly sensitive graphs can be achieved using this method. Contig adjacency graphs allow the user to visualize potential arrangements of contigs in unresolvable areas of the genome. By combining adjacency information with comparative genomics, Contiguity provides an intuitive approach for exploring and improving sequence assemblies. It is also useful in guiding manual closure of long read sequence assemblies. Contiguity is an open source application, implemented using Python and the Tkinter GUI package that can run on any Unix, OSX and Windows operating system. It has been designed and optimized for bacterial assemblies. Contiguity is available at http://mjsull.github.io/Contiguity .
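The idea of a contig adjacency graph, and of listing every potential scaffold through an ambiguous region, can be illustrated with a small sketch. This is not Contiguity's implementation or API: the function names, the undirected edge model (which ignores contig orientation) and the toy contig names are all assumptions made for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency graph from (contig, contig) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def simple_paths(graph, start, end, path=None):
    """Enumerate every path from start to end that visits no contig twice."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    # Sort neighbours so the output order is deterministic.
    for nxt in sorted(graph.get(start, ())):
        if nxt not in path:
            found.extend(simple_paths(graph, nxt, end, path))
    return found


# Hypothetical ambiguous region: c1 and c4 are joined via either c2 or c3.
edges = [("c1", "c2"), ("c2", "c4"), ("c1", "c3"), ("c3", "c4")]
graph = build_adjacency(edges)
for candidate in simple_paths(graph, "c1", "c4"):
    print(" -> ".join(candidate))
```

Each printed path is one candidate ordering of contigs across the ambiguous region. A real tool would also track contig orientation and the evidence supporting each edge, such as read pair mappings or sequence overlaps.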

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

Complete Genome Sequences of 12 Quinolone-Resistant Escherichia coli Strains Containing qnrS1 Based on Hybrid Assemblies

Microbiology Resource Announcements ◽

10.1128/mra.01190-20 ◽

2021 ◽

Vol 10 (4) ◽

Author(s):

Håkon Kaspersen ◽

Thomas H. A. Haverkamp ◽

Hanna Karin Ilag ◽

Øivind Øines ◽

Camilla Sekse ◽

...

Keyword(s):

Escherichia Coli ◽

Complete Genome ◽

Flow Cell ◽

Hybrid Assembly ◽

Genome Sequences ◽

Content Type ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

Hybrid Assemblies

ABSTRACT In total, 12 quinolone-resistant Escherichia coli (QREC) strains containing qnrS1 were submitted to long-read sequencing using a FLO-MIN106 flow cell on a MinION device. The long reads were assembled with short reads (Illumina) and analyzed using the MOB-suite pipeline. Six of these QREC genome sequences were closed after hybrid assembly.