Fast-SG: An alignment-free algorithm for hybrid assembly

Mapping Intimacies ◽

10.1101/209122 ◽

2017 ◽

Author(s):

Alex Di Genova ◽

Gonzalo A. Ruz ◽

Marie-France Sagot ◽

Alejandro Maass

Keyword(s):

De Novo ◽

Reference Level ◽

Hybrid Assembly ◽

Short Read ◽

Sequencing Technologies ◽

Alignment Free ◽

Long Reads ◽

Long Read ◽

Definition Of ◽

Large Genomes

ABSTRACTLong read sequencing technologies are the ultimate solution for genome repeats, allowing near reference level reconstructions of large genomes. However, long read de novo assembly pipelines are computationally intense and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods which combine short and long read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. In this paper, we propose a new method, called FAST-SG, which uses a new ultra-fast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. FAST-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how FAST-SG outperforms the state-of-the-art short read aligners when building the scaffolding graph, and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using FAST-SG with shallow long read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878).

Download Full-text

I-CONVEX: Fast and Accurate de Novo Transcriptome Recovery from Long Reads

10.1101/2020.09.28.317594 ◽

2020 ◽

Author(s):

Sina Baharlouei ◽

Meisam Razaviyayn ◽

Elizabeth Tseng ◽

David Tse

Keyword(s):

Computational Complexity ◽

De Novo ◽

Error Rates ◽

De Novo Transcriptome ◽

Transcript Isoforms ◽

Significant Challenge ◽

Sequencing Technologies ◽

Alignment Free ◽

Long Reads ◽

Long Read

Long-read sequencing technologies demonstrate high potential for de novo discovery of complex transcript isoforms, but high error rates pose a significant challenge. Existing error correction methods rely on clustering reads based on isoform-level alignment and cannot be efficiently scaled. We propose a new method, I-CONVEX, that performs fast, alignment-free isoform clustering with almost linear computational complexity, and leads to better consensus accuracy on simulated, synthetic, and real datasets.

Download Full-text

Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

AbstractStructural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline - NanoSV - to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs, revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

Download Full-text

Complete Genome Resequencing of Thermus thermophilus Strain TMY by Hybrid Assembly of Long- and Short-Read Sequencing Technologies

Microbiology Resource Announcements ◽

10.1128/mra.00979-21 ◽

2021 ◽

Vol 10 (46) ◽

Author(s):

Kentaro Miyazaki ◽

Natsuko Tokito

Keyword(s):

Complete Genome ◽

Thermus Thermophilus ◽

Genomic Analysis ◽

Comparative Genomic ◽

Hybrid Assembly ◽

Genome Resequencing ◽

Short Read ◽

Content Type ◽

Sequencing Technologies ◽

Long Read

Complete genome resequencing was conducted for Thermus thermophilus strain TMY by hybrid assembly of Oxford Nanopore Technologies long-read and MGI short-read data. Errors in the previously reported genome sequence determined by PacBio technology alone were corrected, allowing for high-quality comparative genomic analysis of closely related T. thermophilus genomes.

Download Full-text

A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping

10.1101/327817 ◽

2018 ◽

Cited By ~ 2

Author(s):

Stáphane Deschamps ◽

Yun Zhang ◽

Victor Llaca ◽

Liang Ye ◽

Gregory May ◽

...

Keyword(s):

Sorghum Bicolor ◽

De Novo ◽

Hybrid Assembly ◽

The Public ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Optical Maps ◽

Eukaryotic Genomes ◽

Direct Label

The advent of long-read sequencing technologies has greatly facilitated assemblies of large eukaryotic genomes. In this paper, Oxford Nanopore sequences generated on a MinION sequencer were combined with BioNano Genomics Direct Label and Stain (DLS) optical maps to generate a chromosome-scale de novo assembly of the repeat-rich Sorghum bicolor Tx430 genome. The final hybrid assembly consists of 29 scaffolds, encompassing in most cases entire chromosome arms. It has a scaffold N50 value of 33.28Mbps and covers >90% of Sorghum bicolor expected genome length. A sequence accuracy of 99.67% was obtained in unique regions after aligning contigs against Illumina Tx430 data. Alignments showed that 99.4% of the 34,211 public gene models are present in the assembly, including 94.2% mapping end-to-end. Comparisons of the DLS optical maps against the public Sorghum Bicolor v3.0.1 BTx623 genome assembly suggest the presence of substantial genomic rearrangements whose origin remains to be determined.

Download Full-text

Impact of Chimera-less Long Reads on Metagenomics of Human Gut Viromes Treated With Multiple Displacement Amplification

10.21203/rs.3.rs-58640/v1 ◽

2020 ◽

Author(s):

Yuya Kiguchi ◽

Suguru Nishijima ◽

Naveen Kumar ◽

Masahira Hattori ◽

Wataru Suda

Keyword(s):

De Novo ◽

Multiple Displacement Amplification ◽

Read Length ◽

Human Gut ◽

Average Read Length ◽

Sequencing Technologies ◽

Genomic Complexity ◽

Long Reads ◽

Long Read ◽

The Impact

Abstract Background: The ecological and biological features of the indigenous phage community (virome) in the human gut microbiome are poorly understood, possibly due to many fragmented contigs and fewer complete genomes based on conventional short-read metagenomics. Long-read sequencing technologies have attracted attention as an alternative approach to reconstruct long and accurate contigs from microbial communities. However, the impact of long-read metagenomics on human gut virome analysis has not been well evaluated. Results: Here we present chimera-less PacBio long-read metagenomics of multiple displacement amplification (MDA)-treated human gut virome DNA. The method included the development of a novel bioinformatics tool, SACRA (Split Amplified Chimeric Read Algorithm), which efficiently detects and splits numerous chimeric reads in PacBio reads from the MDA-treated virome samples. SACRA treatment of PacBio reads from five samples markedly reduced the average chimera ratio from 72 to 1.5%, generating chimera-less PacBio reads with an average read-length of 1.8 kb. De novo assembly of the chimera-less long reads generated contigs with an average N50 length of 11.1 kb, whereas those of MiSeq short reads from the same samples were 0.7 kb, dramatically improving contig extension. Alignment of both contig sets generated 378 high-quality merged contigs (MCs) composed of the minimum scaffolds of 434 MiSeq and 637 PacBio contigs, respectively, and also identified numerous MiSeq short fragmented contigs ≤500 bp additionally aligned to MCs, which possibly originated from a small fraction of MiSeq chimeric reads. The alignment also revealed that fragmentations of the scaffolded MiSeq contigs were caused primarily by genomic complexity of the community, including local repeats, hypervariable regions, and highly conserved sequences in and between the phage genomes. We identified 142 complete and near-complete phage genomes including 108 novel genomes, varying from 5 to 185 kb in length, the majority of which were predicted to be Microviridae phages including several variants with homologous but distinct genomes, which were fragmented in MiSeq contigs. Conclusions: Long-read metagenomics coupled with SACRA provides an improved method to reconstruct accurate and extended phage genomes from MDA-treated virome samples of the human gut, and potentially from other environmental virome samples.

Download Full-text

CaBagE: A Cas9-based Background Elimination strategy for targeted, long-read DNA sequencing

PLoS ONE ◽

10.1371/journal.pone.0241253 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0241253

Author(s):

Amelia D. Wallace ◽

Thomas A. Sasani ◽

Jordan Swanier ◽

Brooke L. Gates ◽

Jeff Greenland ◽

...

Keyword(s):

Dna Sequencing ◽

Tandem Repeats ◽

Short Read ◽

Background Elimination ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Repeat Expansions ◽

Sequencing Platforms ◽

Als Patients

A substantial fraction of the human genome is difficult to interrogate with short-read DNA sequencing technologies due to paralogy, complex haplotype structures, or tandem repeats. Long-read sequencing technologies, such as Oxford Nanopore’s MinION, enable direct measurement of complex loci without introducing many of the biases inherent to short-read methods, though they suffer from relatively lower throughput. This limitation has motivated recent efforts to develop amplification-free strategies to target and enrich loci of interest for subsequent sequencing with long reads. Here, we present CaBagE, a method for target enrichment that is efficient and useful for sequencing large, structurally complex targets. The CaBagE method leverages the stable binding of Cas9 to its DNA target to protect desired fragments from digestion with exonuclease. Enriched DNA fragments are then sequenced with Oxford Nanopore’s MinION long-read sequencing technology. Enrichment with CaBagE resulted in a median of 116X coverage (range 39–416) of target loci when tested on five genomic targets ranging from 4-20kb in length using healthy donor DNA. Four cancer gene targets were enriched in a single reaction and multiplexed on a single MinION flow cell. We further demonstrate the utility of CaBagE in two ALS patients with C9orf72 short tandem repeat expansions to produce genotype estimates commensurate with genotypes derived from repeat-primed PCR for each individual. With CaBagE there is a physical enrichment of on-target DNA in a given sample prior to sequencing. This feature allows adaptability across sequencing platforms and potential use as an enrichment strategy for applications beyond sequencing. CaBagE is a rapid enrichment method that can illuminate regions of the ‘hidden genome’ underlying human disease.

Download Full-text

The draft genome sequence of Eucalyptus polybractea based on hybrid assembly with short- and long-reads reads

10.1101/2021.05.18.444652 ◽

2021 ◽

Author(s):

Teng Li ◽

David Kainer ◽

William J Foley ◽

Allen Rodrigo ◽

Carsten Kuelheim

Keyword(s):

Population Genomics ◽

De Novo ◽

Draft Genome ◽

Hybrid Assembly ◽

Illumina Hiseq ◽

Protein Coding ◽

Genome Coverage ◽

Protein Coding Genes ◽

Long Reads ◽

Long Read

Eucalyptus polybractea is a small, multi-stemmed tree, which is widely cultivated in Australia for the production of Eucalyptus oil. We report the hybrid assembly of the E. polybractea genome utilizing both short- and long-read technology. We generated 44 Gb of Illumina HiSeq short reads and 8 Gb of Nanopore long reads, representing approximately 83 and 15 times genome coverage, respectively. The hybrid-assembled genome, after polishing, contained 24,864 scaffolds with an accumulated length of 523 Mb (N50 = 40.3 kb; BUSCO-calculated genome completeness of 94.3%). The genome contained 35,385 predicted protein-coding genes detected by combining homology-based and de novo approaches. We have provided the first assembled genome based on hybrid sequences from the highly diverse Eucalyptus subgenus Symphyomyrtus, and revealed the value of including long-reads from Nanopore technology for enhancing the contiguity of the assembled genome, as well as for improving its completeness. We anticipate that the E. polybractea genome will be an invaluable resource supporting a range of studies in genetics, population genomics and evolution of related species in Eucalyptus.

Download Full-text

SVJedi: Genotyping structural variations with long reads

10.1101/849208 ◽

2019 ◽

Author(s):

Lolita Lecompte ◽

Pierre Peterlongo ◽

Dominique Lavenier ◽

Claire Lemaitre

Keyword(s):

Sequencing Data ◽

Structural Variations ◽

Short Read ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Clinical Diagnoses ◽

Long Read ◽

Reference Sequences ◽

The One

AbstractMotivationStudies on structural variants (SV) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of such dedicated tool to assess whether known SVs are present or not in a new long read sequenced sample, such as the one produced by Pacific Biosciences or Oxford Nanopore Technologies.ResultsWe present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of reference sequences that represent the two alleles of each structural variant. Long reads are aligned to these reference sequences. Alignments are then analyzed and filtered out to keep only informative ones, to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype insertions and deletions with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches.Availabilityhttps://github.com/llecompte/[email protected]

Download Full-text

Highly-accurate long-read sequencing improves variant detection and assembly of a human genome

10.1101/519025 ◽

2019 ◽

Cited By ~ 27

Author(s):

Aaron M. Wenger ◽

Paul Peluso ◽

William J. Rowell ◽

Pi-Chuan Chang ◽

Richard J. Hall ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Short Reads ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Variant Detection ◽

High Quality Genome ◽

Circular Consensus Sequencing

AbstractThe major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized existing tools to comprehensively detect variants, achieving precision and recall above 99.91% for SNVs, 95.98% for indels, and 95.99% for structural variants. We estimate that 2,434 discordances are correctable mistakes in the high-quality Genome in a Bottle benchmark. Nearly all (99.64%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of 99.998%. CCS reads match short reads for small variant detection, while enabling structural variant detection and de novo assembly at similar contiguity and markedly higher concordance than noisy long reads.

Download Full-text

Robust Benchmark Structural Variant Calls of An Asian Using the State-of-Art Long Fragment Sequencing Technologies

10.1101/2020.08.10.245308 ◽

2020 ◽

Author(s):

Xiao Du ◽

Lili Li ◽

Fan Liang ◽

Sanyang Liu ◽

Wenxin Zhang ◽

...

Keyword(s):

B Lymphocyte ◽

De Novo ◽

Structural Variants ◽

High Confidence ◽

False Negatives ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Circular Consensus Sequencing

AbstractThe importance of structural variants (SVs) on phenotypes and human diseases is now recognized. Although a variety of SV detection platforms and strategies that vary in sensitivity and specificity have been developed, few benchmarking procedures are available to confidently assess their performances in biological and clinical research. To facilitate the validation and application of those approaches, our work established an Asian reference material comprising identified benchmark regions and high-confidence SV calls. We established a high-confidence SV callset with 8,938 SVs in an EBV immortalized B lymphocyte line, by integrating four alignment-based SV callers [from 109× PacBio continuous long read (CLR), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore long reads, and 114× optical mapping platform (Bionano)] and one de novo assembly-based SV caller using CCS reads. A total of 544 randomly selected SVs were validated by PCR and Sanger sequencing, proofing the robustness of our SV calls. Combining trio-binning based haplotype assemblies, we established an SV benchmark for identification of false negatives and false positives by constructing the continuous high confident regions (CHCRs), which cover 1.46Gb and 6,882 SVs supported by at least one diploid haplotype assembly. Establishing high-confidence SV calls for a benchmark sample that has been characterized by multiple technologies provides a valuable resource for investigating SVs in human biology, disease, and clinical diagnosis.

Download Full-text