MitoFinder: efficient automated large-scale extraction of mitogenomic data in target enrichment phylogenomics

Mapping Intimacies ◽

10.1101/685412 ◽

2019 ◽

Cited By ~ 2

Author(s):

Rémi Allio ◽

Alex Schomaker-Bastos ◽

Jonathan Romiguier ◽

Francisco Prosdocimi ◽

Benoit Nabholz ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Target Enrichment ◽

Sequencing Technologies ◽

Coi Barcoding ◽

Genomic Markers ◽

Order Of Magnitude ◽

Dna Elements ◽

User Friendly

Abstract: Thanks to the development of high-throughput sequencing technologies, target enrichment sequencing of nuclear ultraconserved DNA elements (UCEs) now routinely allows phylogenetic relationships to be inferred from thousands of genomic markers. Recently, it has been shown that mitochondrial DNA (mtDNA) is frequently sequenced alongside the targeted loci in such capture experiments. Despite its broad evolutionary interest, mtDNA is rarely assembled and used in conjunction with nuclear markers in capture-based studies. Here, we developed MitoFinder, a user-friendly bioinformatic pipeline, to efficiently assemble and annotate mitogenomic data from hundreds of UCE libraries. As a case study, we used ants (Formicidae), for which 501 UCE libraries have been sequenced whereas only 29 mitogenomes are available. We compared the efficiency of four different assemblers (IDBA-UD, MEGAHIT, MetaSPAdes, and Trinity) for assembling both UCE and mtDNA loci. Using MitoFinder, we show that metagenomic assemblers, in particular MetaSPAdes, are well suited to assemble both UCEs and mtDNA. Mitogenomic signal was successfully extracted from all 501 UCE libraries, allowing species identification to be confirmed using COI barcoding. Moreover, our automated procedure retrieved 296 cases in which the mitochondrial genome was assembled in a single contig, thus increasing the number of available ant mitogenomes by an order of magnitude. By leveraging the power of metagenomic assemblers, MitoFinder provides an efficient tool to extract complementary mitogenomic data from UCE libraries, allowing potential mito-nuclear discordance to be tested. Our approach is potentially applicable to other sequence capture methods, transcriptomic data, and whole genome shotgun sequencing in diverse taxa.

Nebula: ultra-efficient mapping-free structural variant genotyper

Nucleic Acids Research ◽

10.1093/nar/gkab025 ◽

2021 ◽

Author(s):

Parsoa Khorsand ◽

Fereydoun Hormozdiari

Keyword(s):

Large Scale ◽

Structural Variants ◽

Sequencing Technologies ◽

Generic Framework ◽

Common Genetic Variants ◽

Order Of Magnitude ◽

Complex Events ◽

Comparable Accuracy ◽

Using Data ◽

Computational Resources

Abstract: Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, genotyping these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, uses changes in the counts of k-mers to predict the genotype of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
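
The idea of genotyping from k-mer counts, without mapping reads, can be illustrated with a toy sketch. This is not Nebula's actual model, which the abstract does not describe in detail; the function names, the allele-specific k-mer sets, and the 0.2/0.8 thresholds are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def genotype_from_kmers(reads, ref_kmers, alt_kmers, k, low=0.2, high=0.8):
    """Call a toy genotype from the share of allele-specific k-mer hits.

    ref_kmers and alt_kmers are k-mers assumed (hypothetically) to occur
    only on the reference allele or only on the variant allele.
    The thresholds low and high are illustrative, not taken from Nebula.
    """
    counts = count_kmers(reads, k)
    ref_hits = sum(counts[kmer] for kmer in ref_kmers)
    alt_hits = sum(counts[kmer] for kmer in alt_kmers)
    total = ref_hits + alt_hits
    if total == 0:
        return "./."  # no informative k-mers observed
    alt_fraction = alt_hits / total
    if alt_fraction < low:
        return "0/0"
    if alt_fraction > high:
        return "1/1"
    return "0/1"


reads = ["ACGTACGGA", "TTACGGACC", "ACGTTTGGA"]
call = genotype_from_kmers(reads, ref_kmers={"ACGGA"}, alt_kmers={"TTTGG"}, k=5)
print(call)
```

Because the reads are only scanned for k-mers and never aligned, the cost is a single pass over the data, which is the intuition behind the speed-up the abstract reports.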

A New Paralog Removal Pipeline Resolves Conflict between RAD-seq and Enrichment

10.1101/2020.10.26.355248 ◽

2020 ◽

Author(s):

Wenbin Zhou ◽

John Soghigian ◽

Qiu-yun (Jenny) Xiang

Keyword(s):

High Throughput Sequencing ◽

Sequence Similarity ◽

Phylogenetic Analyses ◽

Disjunct Distribution ◽

Divergence Times ◽

Target Enrichment ◽

Sequencing Technologies ◽

Duplication Events ◽

The Witch ◽

Phylogenomic Analyses

Abstract: Target enrichment and RAD-seq are well-established high-throughput sequencing technologies that have been increasingly used for phylogenomic studies, and the choice between methods is a practical issue for plant systematists studying the evolutionary histories of biodiversity of relatively recent origins. However, few studies have compared the congruence and conflict between results from the two methods within the same group of organisms, especially in plants, where extensive genome duplication events may complicate phylogenomic analyses. Unfortunately, currently widely used pipelines for target enrichment data analysis do not have a vigorous procedure for removing paralogs in Hyb-Seq data. In this study, we employed RAD-seq and Hyb-Seq of Angiosperm 353 genes in phylogenomic and biogeographic studies of Hamamelis (the witch-hazels) and Castanea (chestnuts), two classic examples exhibiting the well-known eastern Asian-eastern North American disjunct distribution. We compared these two methods side by side and developed a new pipeline (PPD) with a more vigorous removal of putative paralogs from Hyb-Seq data. The new pipeline considers both sequence similarity and heterozygous sites at each locus in the identification of paralogs. We used our pipeline to construct robust datasets for comparison between methods and downstream analyses on the two genera. Our results demonstrated that PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed significant differences between data from HybPiper and data from our new PPD pipeline, likely due to the error signals from the paralogous genes undetected by HybPiper but trimmed by PPD. We found that phylogenies and divergence times estimated from our RAD-seq and Hyb-Seq-PPD data were largely congruent. We highlight the importance of removing paralogs in enrichment data, and discuss the merits of RAD-seq and Hyb-Seq. Finally, phylogenetic analyses of RAD-seq and Hyb-Seq resulted in well-resolved species relationships, and revealed ancient introgression in both genera. Biogeographic analyses including fossil data revealed a complicated history of each genus involving multiple intercontinental dispersals and local extinctions in areas outside of the taxa’s modern ranges in both the Paleogene and Neogene. Our study demonstrates the value of additional steps for filtering paralogous gene content from Angiosperm 353 data, such as our new PPD pipeline described in this study. [RAD-seq, Hyb-Seq, paralogs, Castanea, Hamamelis, eastern Asia-eastern North America disjunction, biogeography, ancient introgression]

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but for many organisms no specific loci have yet been assigned for identification. This problem particularly impedes viral screening because of the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we additionally applied other approaches and showed that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity and therefore necessitates the use of combined approaches, including those based on machine learning methods.

Rapid detection of germline mutations for hereditary gastrointestinal polyposis/cancers using HaloPlex target enrichment and high-throughput sequencing technologies

Familial Cancer ◽

10.1007/s10689-016-9872-x ◽

2016 ◽

Vol 15 (4) ◽

pp. 553-562 ◽

Cited By ~ 12

Author(s):

Masakazu Kohda ◽

Kensuke Kumamoto ◽

Hidetaka Eguchi ◽

Tomoko Hirata ◽

Yuhki Tada ◽

...

Keyword(s):

High Throughput ◽

Rapid Detection ◽

High Throughput Sequencing ◽

Germline Mutations ◽

Target Enrichment ◽

Gastrointestinal Polyposis ◽

Sequencing Technologies

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract: Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
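
To make the memory problem concrete, the sketch below builds a plain k-mer frequency vector, whose length is 4^k for the DNA alphabet, and compares two sequences with cosine distance. This is the uncompressed baseline that motivates the work, not CRAFT's embedding; the function names and the choice of cosine distance are assumptions for illustration, since the abstract does not name the measures it compares against.

```python
from itertools import product
import math


def kmer_frequency_vector(seq, k):
    """Return the normalized k-mer frequency vector of seq (length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0.0] * len(index)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing N or other symbols
            vec[j] += 1
            total += 1
    return [v / total for v in vec] if total else vec


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (1.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


a = kmer_frequency_vector("ACGTACGTACGT", 3)
b = kmer_frequency_vector("ACGTTTGTACGA", 3)
print(len(a), cosine_distance(a, b))
```

The vector has 64 entries for k = 3 but 4**12 = 16,777,216 entries for k = 12, which is why dense frequency vectors become impractical at the values of k the abstract calls "of practical interest".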

Natrix: A Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads

10.1101/2020.09.23.309864 ◽

2020 ◽

Author(s):

Marius Welzel ◽

Anja Lange ◽

Dominik Heider ◽

Michael Schwarz ◽

Bernd Freisleben ◽

...

Keyword(s):

High Throughput Sequencing ◽

Workflow Management ◽

Amplicon Sequencing ◽

Version Control ◽

Marker Genes ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Ecological Processes ◽

Sequencing Technologies ◽

User Friendly

Abstract: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high-throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, and sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix).
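
Of the steps listed, dereplication is the simplest to show in isolation: identical reads are collapsed into one representative with an abundance count. The sketch below is a generic illustration of that step, not Natrix's own code; the abstract does not say which tool the workflow uses for it, and the function name is hypothetical.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences into (sequence, abundance) pairs.

    Sequences are compared case-insensitively. The result is sorted by
    decreasing abundance, with ties broken alphabetically.
    """
    counts = Counter(seq.upper() for seq in sequences)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


reads = ["ACGT", "acgt", "TTGA", "ACGT", "TTGA", "GGCC"]
unique = dereplicate(reads)
print(unique)
```

Downstream steps such as chimera detection then operate on the much smaller list of unique sequences, with the abundance kept alongside each one.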

USER FRIENDLY OPEN GIS TOOL FOR LARGE SCALE DATA ASSIMILATION – A CASE STUDY OF HYDROLOGICAL MODELLING

ISPRS - International Archives of the Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprsarchives-xxxix-b4-427-2012 ◽

2012 ◽

Vol XXXIX-B4 ◽

pp. 427-430 ◽

Cited By ~ 2

Author(s):

P. K. Gupta

Keyword(s):

Data Assimilation ◽

Large Scale ◽

Hydrological Modelling ◽

Large Scale Data ◽

User Friendly ◽

Scale Data ◽

Open Gis

Multi-platform discovery of haplotype-resolved structural variation in human genomes

10.1101/193144 ◽

2017 ◽

Cited By ~ 32

Author(s):

Mark J.P. Chaisson ◽

Ashley D. Sanders ◽

Xuefang Zhao ◽

Ankit Malhotra ◽

David Porubsky ◽

...

Keyword(s):

Genome Sequencing ◽

Large Scale ◽

Structural Variation ◽

High Throughput Sequencing ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Full Spectrum ◽

Variant Discovery ◽

Sequencing Technologies ◽

Sequencing Studies

Abstract: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome, most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.

Database Systems in Biology

Enterprise Business Modeling, Optimization Techniques, and Flexible Information Systems ◽

10.4018/978-1-4666-3946-1.ch007 ◽

2013 ◽

pp. 80-96

Author(s):

Elisa Pappalardo ◽

Domenico Cantone

Keyword(s):

Data Structures ◽

Large Scale ◽

High Throughput Sequencing ◽

Database Systems ◽

Efficient Algorithms ◽

Biological Data ◽

Biological Databases ◽

Proteomic Data ◽

Sequencing Technologies ◽

Object Relational

The successful sequencing of the genomes of various species has produced a great amount of data that need to be managed and analyzed. With the increasing popularity of high-throughput sequencing technologies, such data require flexible, scalable, efficient algorithms and enterprise data structures that can be used by both biologists and computational scientists. This chapter focuses on the design of large-scale database-driven applications for genomic and proteomic data. It is widely believed that biological databases are similar to any standard database-driven application; however, a number of different and increasingly complex challenges arise. In particular, while standard databases are used just to manage information, in biology they represent a main source for further computational analysis, which frequently focuses on the identification of relations and properties of a network of entities. The analysis starts from the first text-based storage approach and ends with new insights on object-relational mapping for biological data.

Molecular Heterogeneity in Large-Scale Biological Data: Techniques and Applications

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-072018-021339 ◽

2019 ◽

Vol 2 (1) ◽

pp. 39-67

Author(s):

Chao Deng ◽

Timothy Daley ◽

Guilherme De Sena Brandine ◽

Andrew D. Smith

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Biological Data ◽

Sequencing Error ◽

Molecular Heterogeneity ◽

Modern Biology ◽

Sequencing Technologies ◽

Large Scale Data ◽

Statistical Ecology ◽

Scale Data

High-throughput sequencing technologies have evolved at a stellar pace for almost a decade and have greatly advanced our understanding of genome biology. In these sampling-based technologies, there is an important detail that is often overlooked in the analysis of the data and the design of the experiments, specifically that the sampled observations often do not give a representative picture of the underlying population. This has long been recognized as a problem in statistical ecology and in the broader statistics literature. In this review, we discuss the connections between these fields, methodological advances that parallel both the needs and opportunities of large-scale data analysis, and specific applications in modern biology. In the process we describe unique aspects of applying these approaches to sequencing technologies, including sequencing error, population and individual heterogeneity, and the design of experiments.
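
One classical way to quantify how unrepresentative a sample is comes from the statistics literature: the Good-Turing estimate, which approximates the probability mass of classes not yet observed by the fraction of observations that are singletons. The abstract does not name the estimators the review covers, so the choice of Good-Turing here is an assumption made for illustration, as is the function name.

```python
from collections import Counter


def good_turing_unseen_mass(observations):
    """Estimate the probability that the next observation is a new class.

    Uses the Good-Turing estimate n1 / N, where n1 is the number of classes
    seen exactly once and N is the total number of observations.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


# Hypothetical sample: each letter stands for a distinct molecule or species.
sample = ["a", "a", "b", "c", "c", "c", "d"]
unseen = good_turing_unseen_mass(sample)
print(unseen)
```

In a sequencing setting the observations could be distinct molecules in a library: a high singleton fraction suggests that deeper sequencing would still reveal many molecules the current sample has missed.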
