A recurrence based approach for validating structural variation using long-read sequencing technology

Mapping Intimacies ◽

10.1101/105817 ◽

2017 ◽

Author(s):

Xuefang Zhao ◽

Alexandra M. Weber ◽

Ryan E. Mills

Keyword(s):

Efficient Algorithm ◽

Structural Variation ◽

Direct Evidence ◽

High Fidelity ◽

Sequencing Data ◽

Sequencing Technology ◽

Sequencing Technologies ◽

Manual Inspection ◽

Long Read ◽

Computational Resources

Abstract: Although numerous algorithms have been developed to identify structural variants (SVs) in genomic sequences, there is a dearth of approaches that can be used to evaluate their results. The emergence of new sequencing technologies that generate longer sequence reads can, in theory, provide direct evidence for all types of SVs regardless of the length of the region they span. However, current efforts to use these data in this manner require large computational resources to assemble the sequences as well as manual inspection of each region. Here, we present VaPoR, a highly efficient algorithm that autonomously validates large SV sets using long-read sequencing data. We assess the performance of VaPoR on both simulated and real SVs and report a high-fidelity rate for various features including overall accuracy, sensitivity of breakpoint precision, and predicted genotype.

A benchmark of structural variation detection by long reads through a realistic simulated model

10.1101/2020.12.25.424397 ◽

2020 ◽

Author(s):

Nicolas Dierckxsens ◽

Tong Li ◽

Joris R. Vermeesch ◽

Zhi Xie

Keyword(s):

Structural Variation ◽

Rapid Evolution ◽

Detection Methods ◽

Sequencing Data ◽

Simulated Model ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Sequencing Platforms ◽

The Impact

Abstract: Despite the rapid evolution of new sequencing technologies, structural variation detection remains poorly ascertained. The high discrepancy between the results of structural variant analysis programs makes it difficult to assess their performance on real datasets. Accurate simulations of structural variation distributions and sequencing data of the human genome are crucial for the development and benchmarking of new tools. To gain better insight into the detection of structural variation with long sequencing reads, we created a realistic simulated model to thoroughly compare SV detection methods and the impact of the chosen sequencing technology and sequencing depth. To achieve this, we developed Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it revealed the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms. Our findings were also supported by the latest structural variation benchmark set developed by the GIAB Consortium. With these findings, we developed a new method (combiSV) that can combine the results from five different SV callers into a superior call set with increased recall and precision. Both Sim-it and combiSV are open source and can be downloaded at https://github.com/ndierckx/.

Comparative Analysis for the Performance of Long-Read-Based Structural Variation Detection Pipelines in Tandem Repeat Regions

Frontiers in Pharmacology ◽

10.3389/fphar.2021.658072 ◽

2021 ◽

Vol 12 ◽

Author(s):

Mingkun Guo ◽

Shihai Li ◽

Yifan Zhou ◽

Menglong Li ◽

Zhining Wen

Keyword(s):

Tandem Repeat ◽

Structural Variation ◽

Contextual Information ◽

Sequencing Data ◽

Genome Research ◽

Structural Variations ◽

Sequencing Technologies ◽

Long Read ◽

Real Practice ◽

Deep Exploration

There has been growing recognition of the vital links between structural variations (SVs) and diverse diseases. Research suggests that, with much longer DNA fragments and abundant contextual information, long-read technologies have advantages in SV detection even in complex repetitive regions. So far, several pipelines for calling SVs from long-read sequencing data have been proposed and used in human genome research. However, the performance of these pipelines still lacks in-depth exploration and adequate comparison. In this study, we comprehensively evaluated the performance of three commonly used long-read SV detection pipelines, namely PBSV, Sniffles and PBHoney, with particular attention to their performance in detecting SVs in tandem repeat regions (TRRs). Using a robust benchmark for germline SV detection as the gold standard, we thoroughly estimated the precision, recall and F1 score of insertions and deletions detected by the pipelines. Our results revealed that all these pipelines clearly performed better outside TRRs than within them. The F1 scores of Sniffles in and outside TRRs were 0.60 and 0.76, respectively. The performance of PBSV was similar to that of Sniffles, and was generally higher than that of PBHoney. In conclusion, our findings can help in choosing appropriate pipelines in real practice and complement the application of long-read sequencing technologies in the research of rare diseases.
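For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch of how precision, recall and F1 are conventionally computed from counts of true positives, false positives and false negatives. The function name and the example counts are hypothetical and are not taken from the paper; how a call is matched to the benchmark is a separate step that is not shown.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from call counts against a benchmark.

    tp: calls that match a benchmark SV
    fp: calls with no matching benchmark SV
    fn: benchmark SVs with no matching call
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts chosen only for illustration (not from the study).
precision, recall, f1 = precision_recall_f1(tp=60, fp=40, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a pipeline cannot reach a high F1 by doing well on only one of the two.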

SVants – A long-read based method for structural variation detection in bacterial genomes

10.1101/822312 ◽

2019 ◽

Cited By ~ 1

Author(s):

BM Hanson ◽

JS Johnson ◽

SR Leopold ◽

E Sodergren ◽

GM Weinstock

Keyword(s):

Structural Variation ◽

Tandem Repeats ◽

Bacterial Genome ◽

Genetic Material ◽

Bacterial Cells ◽

Sequencing Data ◽

E Coli ◽

Sequencing Technologies ◽

Long Read ◽

New Locations

Motivation: Mobile genetic elements (MGEs) are genetic material that can transfer between bacterial cells and move to new locations within a single bacterial genome. These elements range from several hundred to tens of thousands of bases, and are often bordered by repeat regions, which makes resolving these elements difficult with short-read sequencing data. The development and availability of long-read sequencing technologies has opened up new opportunities in the study of structural variation but there is a lack of bioinformatics tools designed to take advantage of these longer reads. Results: We present an assembly-free method for identifying the location of these MGEs when compared to any reference genome (including draft genomes). Using an artificially constructed Escherichia coli genome containing single and tandem-repeats of a Tn9 transposon, we demonstrate the ability of SVants to accurately identify multiple insertion sites as well as count the number of repeats of this MGE. Additionally, we show that SVants accurately identifies the transposon of interest, Tn9, but does not erroneously identify existing IS1 regions present within the chromosome of the E. coli artificial reference. Availability and Implementation: SVants is available as open-source software at https://github.com/EpiBlake/SVants

Combined use of Oxford Nanopore and Illumina sequencing yields insights into soybean structural variation biology

10.1101/2021.08.26.457816 ◽

2021 ◽

Author(s):

Marc-André Lemay ◽

Jonas A. Sibbesen ◽

Davoud Torkamaneh ◽

Jérémie Hamel ◽

Roger C. Levesque ◽

...

Keyword(s):

Structural Variation ◽

Pfam Domain ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Short Read ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Population Structure Analysis ◽

Illumina Data ◽

Long Read

Background: Structural variant (SV) discovery based on short reads is challenging due to their complex signatures and tendency to occur in repeated regions. The increasing availability of long-read technologies has greatly facilitated SV discovery; however, these technologies remain too costly to apply routinely to population-level studies. Here, we combined short-read and long-read sequencing technologies to provide a comprehensive population-scale assessment of structural variation in a panel of Canadian soybean cultivars. Results: We used Oxford Nanopore sequencing data (~12X mean coverage) for 17 samples to both benchmark SV calls made from the Illumina data and predict SVs that were subsequently genotyped in a population of 102 samples using Illumina data. Benchmarking results show that variants discovered using Oxford Nanopore can be accurately genotyped from the Illumina data. We first use the genotyped SVs for population structure analysis and show that results are comparable to those based on single-nucleotide variants. We observe that the population frequency and distribution within the genome of SVs are constrained by the location of genes. Gene Ontology and PFAM domain enrichment analyses also confirm previous reports that genes harboring high-frequency SVs are enriched for functions in defense response. Finally, we discover polymorphic transposable elements from the SVs and report evidence of the recent activity of a Stowaway MITE. Conclusions: Our results demonstrate that long-read and short-read sequencing technologies can be efficiently combined to enhance SV analysis in large populations, providing a reusable framework for their study in a wider range of samples and non-model species.

Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract: Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.

Defining Blood Group Gene Reference Alleles by Long-Read Sequencing: Proof of Concept in the ACKR1 Gene Encoding the Duffy Antigens

Transfusion Medicine and Hemotherapy ◽

10.1159/000504584 ◽

2019 ◽

Vol 47 (1) ◽

pp. 23-32 ◽

Cited By ~ 2

Author(s):

Yann Fichou ◽

Isabelle Berlivet ◽

Gaëlle Richard ◽

Christophe Tournamille ◽

Lilian Castilho ◽

...

Keyword(s):

Blood Group ◽

Single Molecule ◽

Pcr Amplification ◽

Null Alleles ◽

Sequencing Technology ◽

Gene Encoding ◽

Next Generation Sequencing Technology ◽

Sequencing Technologies ◽

Long Read ◽

Long Range Pcr

Background: In the new era of blood group genomics, (re-)defining reference gene/allele sequences of blood group genes has become an important goal, both for diagnostic and research purposes. As powerful new sequencing technologies are available, we sought to investigate the variability encountered in the three most common alleles of ACKR1, the gene encoding the clinically relevant Duffy antigens, at the haplotype level by a long-read sequencing approach. Materials and Methods: After long-range PCR amplification spanning the whole ACKR1 gene locus (∼2.5 kilobases), amplicons generated from 81 samples with known genotypes were sequenced in a single read by using the Pacific Biosciences (PacBio) single molecule, real-time (SMRT) sequencing technology. Results: High-quality sequencing reads were obtained for the 162 alleles (accuracy >0.999). Twenty-two nucleotide variations reported in databases were identified, defining 19 haplotypes: four, eight, and seven haplotypes in 46 ACKR1*01, 63 ACKR1*02, and 53 ACKR1*02N.01 alleles, respectively. Discussion: Overall, we have defined a subset of reference alleles by third-generation (long-read) sequencing. This technology, which provides a “longitudinal” overview of the loci of interest (several thousand base pairs) and is complementary to the second-generation (short-read) next-generation sequencing technology, is of critical interest for resolving novel, rare, and null alleles.

SPRING: a next-generation compressor for FASTQ data

Bioinformatics ◽

10.1093/bioinformatics/bty1015 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2674-2676 ◽

Cited By ~ 18

Author(s):

Shubham Chandak ◽

Kedar Tatwawadi ◽

Idoia Ochoa ◽

Mikel Hernaez ◽

Tsachy Weissman

Keyword(s):

High Throughput Sequencing ◽

Random Access ◽

Lossless Compression ◽

General Purpose ◽

Supplementary Information ◽

High Coverage ◽

Sequencing Technologies ◽

Long Read ◽

Previous State ◽

Computational Resources

Motivation: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information: Supplementary data are available at Bioinformatics online.
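The compression figures quoted in this abstract can be put in perspective with a short calculation. This is only a back-of-the-envelope sketch: because the abstract says "less than 7 GB" and "around 1.6×", the values below are bounds and approximations, and the function and variable names are hypothetical.

```python
def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """Ratio of original size to compressed size (higher is better)."""
    if compressed_gb <= 0:
        raise ValueError("compressed size must be positive")
    return original_gb / compressed_gb


ORIGINAL_GB = 195.0      # uncompressed FASTQ size stated in the abstract
SPRING_MAX_GB = 7.0      # abstract says "less than 7 GB", so this is an upper bound
PRIOR_FACTOR = 1.6       # "around 1.6x smaller" than previous compressors

# Lower bound on the SPRING ratio, since the true output is below 7 GB.
spring_ratio = compression_ratio(ORIGINAL_GB, SPRING_MAX_GB)

# Rough size implied for the previous best tools (an approximation, not a reported value).
implied_prior_gb = SPRING_MAX_GB * PRIOR_FACTOR

print(f"SPRING ratio: at least {spring_ratio:.1f}x")
print(f"Implied prior output size: roughly {implied_prior_gb:.1f} GB or less")
```

Under these assumptions the stated example corresponds to a compression ratio of at least about 28 to 1.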

A recurrence-based approach for validating structural variation using long-read sequencing technology

GigaScience ◽

10.1093/gigascience/gix061 ◽

2017 ◽

Vol 6 (8) ◽

Cited By ~ 4

Author(s):

Xuefang Zhao ◽

Alexandra M. Weber ◽

Ryan E. Mills

Keyword(s):

Structural Variation ◽

Sequencing Technology ◽

Long Read

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

Abstract: Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performance of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges, and provide insights into the precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
