SVants – A long-read based method for structural variation detection in bacterial genomes

Mapping Intimacies ◽

10.1101/822312 ◽

2019 ◽

Cited By ~ 1

Author(s):

BM Hanson ◽

JS Johnson ◽

SR Leopold ◽

E Sodergren ◽

GM Weinstock

Keyword(s):

Structural Variation ◽

Tandem Repeats ◽

Bacterial Genome ◽

Genetic Material ◽

Bacterial Cells ◽

Sequencing Data ◽

E Coli ◽

Sequencing Technologies ◽

Long Read ◽

New Locations

Abstract Motivation: Mobile genetic elements (MGEs) are genetic material that can transfer between bacterial cells and move to new locations within a single bacterial genome. These elements range from several hundred to tens of thousands of bases, and are often bordered by repeat regions, which makes resolving these elements difficult with short-read sequencing data. The development and availability of long-read sequencing technologies has opened up new opportunities in the study of structural variation but there is a lack of bioinformatics tools designed to take advantage of these longer reads. Results: We present an assembly-free method for identifying the location of these MGEs when compared to any reference genome (including draft genomes). Using an artificially constructed Escherichia coli genome containing single and tandem-repeats of a Tn9 transposon, we demonstrate the ability of SVants to accurately identify multiple insertion sites as well as count the number of repeats of this MGE. Additionally, we show that SVants accurately identifies the transposon of interest, Tn9, but does not erroneously identify existing IS1 regions present within the chromosome of the E. coli artificial reference. Availability and Implementation: SVants is available as open-source software at https://github.com/EpiBlake/SVants
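To make the idea of counting tandem copies of an element on a single long read concrete, here is a minimal sketch. It is not SVants' actual algorithm, which the abstract does not describe. The function name `max_tandem_copies`, the `(start, end)` hit format, and the `max_gap` threshold are all assumptions chosen for illustration.

```python
def max_tandem_copies(hits, max_gap=50):
    """Return the longest run of back-to-back element hits on one read.

    hits    : list of (start, end) read coordinates where the element
              aligned (hypothetical input format, not SVants output).
    max_gap : largest distance in bases between the end of one hit and
              the start of the next for them to count as tandem copies
              (an illustrative threshold).
    """
    best = 0
    run = 0
    prev_end = None
    for start, end in sorted(hits):
        if prev_end is not None and start - prev_end <= max_gap:
            run += 1   # this hit continues the current tandem run
        else:
            run = 1    # this hit starts a new run
        best = max(best, run)
        prev_end = end
    return best


# Made-up coordinates: three adjacent copies, then one distant copy.
example_hits = [(100, 900), (905, 1700), (1710, 2500), (6000, 6800)]
tandem_count = max_tandem_copies(example_hits)
```

A read that spans the whole array plus flanking sequence is what makes such a count possible; short reads that fall entirely inside one copy cannot separate one copy from several.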


Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab083 ◽

2021 ◽

Author(s):

Eric S Tvedte ◽

Mark Gasser ◽

Benjamin C Sparklin ◽

Jane Michalski ◽

Carl E Hjelmen ◽

...

Keyword(s):

Bacterial Genome ◽

Hybrid Approach ◽

Cost Effective ◽

Fruit Fly ◽

Drosophila Ananassae ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

E Coli ◽

Hybrid Approaches ◽

Long Read

Abstract The newest generation of DNA sequencing technology is highlighted by the ability to generate sequence reads hundreds of kilobases in length. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have pioneered competitive long read platforms, with more recent work focused on improving sequencing throughput and per-base accuracy. We used whole-genome sequencing data produced by three PacBio protocols (Sequel II CLR, Sequel II HiFi, RS II) and two ONT protocols (Rapid Sequencing and Ligation Sequencing) to compare assemblies of the bacterium Escherichia coli and the fruit fly Drosophila ananassae. In both organisms tested, Sequel II assemblies had the highest consensus accuracy, even after accounting for differences in sequencing throughput. ONT and PacBio CLR had the longest reads sequenced compared to PacBio RS II and HiFi, and genome contiguity was highest when assembling these datasets. ONT Rapid Sequencing libraries had the fewest chimeric reads in addition to superior quantification of E. coli plasmids versus ligation-based libraries. The quality of assemblies can be enhanced by adopting hybrid approaches using Illumina libraries for bacterial genome assembly or polishing eukaryotic genome assemblies, and an ONT-Illumina hybrid approach would be more cost-effective for many users. Genome-wide DNA methylation could be detected using both technologies; however, ONT libraries enabled the identification of a broader range of known E. coli methyltransferase recognition motifs in addition to undocumented D. ananassae motifs. The ideal choice of long read technology may depend on several factors including the question or hypothesis under examination. No single technology outperformed others in all metrics examined.


A benchmark of structural variation detection by long reads through a realistic simulated model

10.1101/2020.12.25.424397 ◽

2020 ◽

Author(s):

Nicolas Dierckxsens ◽

Tong Li ◽

Joris R. Vermeesch ◽

Zhi Xie

Keyword(s):

Structural Variation ◽

Rapid Evolution ◽

Detection Methods ◽

Sequencing Data ◽

Simulated Model ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Sequencing Platforms ◽

The Impact

Abstract Despite the rapid evolution of new sequencing technologies, structural variation detection remains poorly ascertained. The high discrepancy between the results of structural variant analysis programs makes it difficult to assess their performance on real datasets. Accurate simulations of structural variation distributions and sequencing data of the human genome are crucial for the development and benchmarking of new tools. In order to gain a better insight into the detection of structural variation with long sequencing reads, we created a realistic simulated model to thoroughly compare SV detection methods and the impact of the chosen sequencing technology and sequencing depth. To achieve this, we developed Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it revealed the strengths and weaknesses of currently available structural variation callers and long read sequencing platforms. Our findings were also supported by the latest structural variation benchmark set developed by the GIAB Consortium. With these findings, we developed a new method (combiSV) that can combine the results from five different SV callers into a superior call set with increased recall and precision. Both Sim-it and combiSV are open source and can be downloaded at https://github.com/ndierckx/.
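Combining the output of several SV callers into one call set can be illustrated with a simple support-based merge. The sketch below is not combiSV's method, which the abstract does not specify. The tuple layout, the `window` distance, the `min_callers` threshold, and the function names are all hypothetical.

```python
from collections import defaultdict


def _emit(chrom, svtype, cluster, min_callers):
    """Turn one cluster of nearby calls into zero or one consensus call."""
    callers = {caller for _, caller in cluster}
    if len(callers) < min_callers:
        return []
    positions = sorted(pos for pos, _ in cluster)
    middle = positions[len(positions) // 2]  # middle (upper-middle if even)
    return [(chrom, middle, svtype, len(callers))]


def consensus_calls(calls, window=100, min_callers=2):
    """Merge calls of the same type that lie within `window` bases.

    calls: iterable of (caller, chrom, pos, svtype) tuples
           (an assumed, simplified representation of SV calls).
    Returns sorted (chrom, pos, svtype, n_supporting_callers) tuples.
    """
    groups = defaultdict(list)
    for caller, chrom, pos, svtype in calls:
        groups[(chrom, svtype)].append((pos, caller))

    merged = []
    for (chrom, svtype), items in groups.items():
        items.sort()
        cluster = [items[0]]
        for pos, caller in items[1:]:
            if pos - cluster[-1][0] <= window:
                cluster.append((pos, caller))   # close enough: same event
            else:
                merged.extend(_emit(chrom, svtype, cluster, min_callers))
                cluster = [(pos, caller)]       # start a new cluster
        merged.extend(_emit(chrom, svtype, cluster, min_callers))
    return sorted(merged)


# Made-up calls from three hypothetical callers A, B and C.
example = [
    ("A", "chr1", 1000, "DEL"), ("B", "chr1", 1040, "DEL"),
    ("C", "chr1", 5000, "INS"),
    ("A", "chr2", 200, "INS"), ("B", "chr2", 250, "INS"), ("C", "chr2", 260, "INS"),
]
merged_example = consensus_calls(example)
```

Requiring support from more callers generally trades recall for precision, so the threshold is a tuning choice rather than a fixed rule.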


Comparison of long read sequencing technologies in resolving bacteria and fly genomes

10.1101/2020.07.21.213975 ◽

2020 ◽

Author(s):

Eric S. Tvedte ◽

Mark Gasser ◽

Benjamin C. Sparklin ◽

Jane Michalski ◽

Xuechu Zhao ◽

...

Keyword(s):

Genome Sequencing ◽

Bacterial Genome ◽

Fruit Fly ◽

Drosophila Ananassae ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

E Coli ◽

Hybrid Approaches ◽

Long Read ◽

Genome Assemblies

Abstract Background: The newest generation of DNA sequencing technology is highlighted by the ability to sequence reads hundreds of kilobases in length, and the increased availability of long read data has democratized the genome sequencing and assembly process. PacBio and Oxford Nanopore Technologies (ONT) have pioneered competitive long read platforms, with more recent work focused on improving sequencing throughput and per-base accuracy. Released in 2019, the PacBio Sequel II platform advertises substantial enhancements over previous PacBio systems. Results: We used whole-genome sequencing data produced by two PacBio platforms (Sequel II and RS II) and two ONT protocols (Rapid Sequencing and Ligation Sequencing) to compare assemblies of the bacterium Escherichia coli and the fruit fly Drosophila ananassae. Sequel II assemblies had higher contiguity and consensus accuracy relative to other methods, even after accounting for differences in sequencing throughput. ONT RAPID libraries had the fewest chimeric reads in addition to superior quantification of E. coli plasmids versus ligation-based libraries. The quality of assemblies can be enhanced by adopting hybrid approaches using Illumina libraries for bacterial genome assemblies or combined ONT and Sequel II libraries for eukaryotic genome assemblies. Genome-wide DNA methylation could be detected using both technologies; however, ONT libraries enabled the identification of a broader range of known E. coli methyltransferase recognition motifs in addition to undocumented D. ananassae motifs. Conclusions: The ideal choice of long read technology may depend on several factors including the question or hypothesis under examination. No single technology outperformed others in all metrics examined.


A recurrence based approach for validating structural variation using long-read sequencing technology

10.1101/105817 ◽

2017 ◽

Author(s):

Xuefang Zhao ◽

Alexandra M. Weber ◽

Ryan E. Mills

Keyword(s):

Efficient Algorithm ◽

Structural Variation ◽

Direct Evidence ◽

High Fidelity ◽

Sequencing Data ◽

Sequencing Technology ◽

Sequencing Technologies ◽

Manual Inspection ◽

Long Read ◽

Computational Resources

Abstract Although numerous algorithms have been developed to identify structural variants (SVs) in genomic sequences, there is a dearth of approaches that can be used to evaluate their results. The emergence of new sequencing technologies that generate longer sequence reads can, in theory, provide direct evidence for all types of SVs regardless of the length of the region they span. However, current efforts to use these data in this manner require the use of large computational resources to assemble these sequences as well as manual inspection of each region. Here, we present VaPoR, a highly efficient algorithm that autonomously validates large SV sets using long read sequencing data. We assess the performance of VaPoR on both simulated and real SVs and report a high-fidelity rate for various features including overall accuracy, sensitivity of breakpoint precision, and predicted genotype.


Comparative Analysis for the Performance of Long-Read-Based Structural Variation Detection Pipelines in Tandem Repeat Regions

Frontiers in Pharmacology ◽

10.3389/fphar.2021.658072 ◽

2021 ◽

Vol 12 ◽

Author(s):

Mingkun Guo ◽

Shihai Li ◽

Yifan Zhou ◽

Menglong Li ◽

Zhining Wen

Keyword(s):

Tandem Repeat ◽

Structural Variation ◽

Contextual Information ◽

Sequencing Data ◽

Genome Research ◽

Structural Variations ◽

Sequencing Technologies ◽

Long Read ◽

Real Practice ◽

Deep Exploration

There has been growing recognition of the vital links between structural variations (SVs) and diverse diseases. Research suggests that, with much longer DNA fragments and abundant contextual information, long-read technologies have advantages in SV detection even in complex repetitive regions. So far, several pipelines for calling SVs from long-read sequencing data have been proposed and used in human genome research. However, the performance of these pipelines still lacks in-depth exploration and adequate comparison. In this study, we comprehensively evaluated the performance of three commonly used long-read SV detection pipelines, namely PBSV, Sniffles and PBHoney, especially their performance in detecting SVs in tandem repeat regions (TRRs). Using a robust benchmark for germline SV detection as the gold standard, we thoroughly estimated the precision, recall and F1 score of insertions and deletions detected by the pipelines. Our results revealed that all these pipelines clearly performed better outside TRRs than within TRRs. The F1 scores of Sniffles in and outside TRRs were 0.60 and 0.76, respectively. The performance of PBSV was similar to that of Sniffles, and was generally higher than that of PBHoney. In conclusion, our findings can help in choosing appropriate pipelines in real practice and complement the application of long-read sequencing technologies in research on rare diseases.
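The precision, recall and F1 metrics named in this abstract follow their standard definitions, sketched below. The counts in the example are made up for illustration and are not taken from the paper. The function name `precision_recall_f1` is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true-positive, false-positive and
    false-negative counts; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 calls match the benchmark, 2 calls are spurious,
# and 8 benchmark SVs are missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a caller cannot reach a high F1 by being strong on precision alone or on recall alone.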


Combined use of Oxford Nanopore and Illumina sequencing yields insights into soybean structural variation biology

10.1101/2021.08.26.457816 ◽

2021 ◽

Author(s):

Marc-André Lemay ◽

Jonas A. Sibbesen ◽

Davoud Torkamaneh ◽

Jérémie Hamel ◽

Roger C. Levesque ◽

...

Keyword(s):

Structural Variation ◽

Pfam Domain ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Short Read ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Population Structure Analysis ◽

Illumina Data ◽

Long Read

Background: Structural variant (SV) discovery based on short reads is challenging due to their complex signatures and tendency to occur in repeated regions. The increasing availability of long-read technologies has greatly facilitated SV discovery; however, these technologies remain too costly to apply routinely to population-level studies. Here, we combined short-read and long-read sequencing technologies to provide a comprehensive population-scale assessment of structural variation in a panel of Canadian soybean cultivars. Results: We used Oxford Nanopore sequencing data (~12X mean coverage) for 17 samples to both benchmark SV calls made from the Illumina data and predict SVs that were subsequently genotyped in a population of 102 samples using Illumina data. Benchmarking results show that variants discovered using Oxford Nanopore can be accurately genotyped from the Illumina data. We first use the genotyped SVs for population structure analysis and show that results are comparable to those based on single-nucleotide variants. We observe that the population frequency and distribution within the genome of SVs are constrained by the location of genes. Gene Ontology and PFAM domain enrichment analyses also confirm previous reports that genes harboring high-frequency SVs are enriched for functions in defense response. Finally, we discover polymorphic transposable elements from the SVs and report evidence of the recent activity of a Stowaway MITE. Conclusions: Our results demonstrate that long-read and short-read sequencing technologies can be efficiently combined to enhance SV analysis in large populations, providing a reusable framework for their study in a wider range of samples and non-model species.


Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


A Novel Weissella cibaria Strain UTNGt21O Isolated from Wild Solanum quitoense Fruit: Genome Sequence and Characterization of a Peptide with Highly Inhibitory Potential toward Gram-Negative Bacteria

Foods ◽

10.3390/foods9091242 ◽

2020 ◽

Vol 9 (9) ◽

pp. 1242

Author(s):

Gabriela N. Tenea ◽

Pamela Hurtado ◽

Clara Ortega

Keyword(s):

Cell Death ◽

Ethylenediaminetetraacetic Acid ◽

Sequence Similarity ◽

Genome Mining ◽

Chelating Agent ◽

Target Cells ◽

Bacterial Cells ◽

Sequencing Data ◽

E Coli ◽

Weissella Cibaria

A novel Weissella cibaria strain UTNGt21O from the fruit of the Solanum quitoense (naranjilla) shrub produces a peptide that inhibits the growth of both Salmonella enterica subsp. enterica ATCC51741 and Escherichia coli ATCC25922 at different stages. A total of 31 contigs were assembled, with a total length of 1,924,087 bases; 20 contigs matched the core genome of different groups within Weissella, while no match was found in the database for the remaining 11. The GC content was 39.53%, and repeat sequences constitute around 186,760 bases of the assembly. The UTNGt21O genome matches the W. cibaria genome with 83% identity and no gaps. The sequencing data were deposited in the NCBI Database (BioProject accession: PRJNA639289). The antibacterial activity and interaction mechanism of the peptide UTNGt21O on target bacteria were investigated by analyzing the growth, integrity, and morphology of the bacterial cells following treatment with different concentrations (1×, 1.5× and 2× MIC) of the peptide applied alone or in combination with the chelating agent ethylenediaminetetraacetic acid (EDTA) at 20 mM. The results indicated a bacteriolytic effect at both early and late target growth at 3 h of incubation and total cell death at 6 h when EDTA was co-inoculated with the peptide. Based on BAGEL 4 (Bacteriocin Genome Mining Tool), a putative bacteriocin with 33.4% sequence similarity to enterolysin A was detected within contig 12. The interaction between the peptide UTNGt21O and the target strains caused permeability in a dose- and time-dependent manner, with Salmonella (3200 AU/mL) more susceptible than E. coli (6400 AU/mL). The results indicated that UTNGt21O may damage the integrity of the target cell, leading to release of cytoplasmic components followed by cell death. Differences in membrane shape changes in target cells treated with different doses of peptide were observed by transmission electron microscopy (TEM). Spheroplasts with spherical shapes were detected in Salmonella, while larger spheroplasts with thicker and deformed membranes, along with filamentous cells, were observed in E. coli upon treatment with the UTNGt21O peptide. These results indicate the promising potential of the putative bacteriocin released by the novel W. cibaria strain UTNGt21O to be further tested as a new antimicrobial substance.


LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

Abstract Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.


Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
