DiscoSnp++: de novo detection of small variants from raw unassembled read set(s)

2017 ◽  
Author(s):  
Pierre Peterlongo ◽  
Chloé Riou ◽  
Erwan Drezen ◽  
Claire Lemaitre

Abstract Motivation: Next Generation Sequencing (NGS) data provide unprecedented access to life mechanisms. In particular, these data make it possible to detect polymorphisms such as SNPs and indels. As these polymorphisms represent a fundamental source of information in agronomy, environment or medicine, their detection in NGS data is now a routine task. The main methods for their prediction usually need a reference genome. However, non-model organisms and highly divergent genomes such as in cancer studies are extensively investigated. Results: We propose DiscoSnp++, in which we revisit the DiscoSnp algorithm. DiscoSnp++ is designed for detecting and ranking all kinds of SNPs and small indels from raw read set(s). It outputs files in fasta and VCF formats. In particular, predicted variants can be automatically localized afterwards on a reference genome if available. Its usage is extremely simple and its low resource requirements make it usable on common desktop computers. Results show that DiscoSnp++ performs better than state-of-the-art methods in terms of computational resources and in terms of results quality. An important novelty is the de novo detection of indels, for which we obtained 99% precision when calling indels on simulated human datasets and 90% recall on high-confidence indels from the Platinum dataset. License: GNU Affero General Public License. Availability: https://github.com/GATB/[email protected]

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Adrián Casanova ◽  
Francesco Maroso ◽  
Andrés Blanco ◽  
Miguel Hermida ◽  
Néstor Ríos ◽  
...  

Abstract Background The irruption of Next-generation sequencing (NGS) and restriction site-associated DNA sequencing (RAD-seq) in the last decade has led to the identification of thousands of molecular markers and their genotyping for refined genomic screening. This approach has been especially useful for non-model organisms with limited genomic resources. Many building-loci pipelines have been developed to obtain robust single nucleotide polymorphism (SNP) genotyping datasets using a de novo RAD-seq approach, i.e. without reference genomes. Here, the performances of two building-loci pipelines, STACKS 2 and Meyer’s 2b-RAD v2.1 pipeline, were compared using a diverse set of aquatic species representing different genomic and/or population structure scenarios. Two bivalve species (Manila clam and common edible cockle) and three fish species (brown trout, silver catfish and small-spotted catshark) were studied. Four SNP panels were evaluated in each species to test both different building-loci pipelines and criteria for SNP selection. Furthermore, for Manila clam and brown trout, a reference genome approach was used as control. Results Although different outcomes were observed between pipelines and species with the diverse SNP calling and filtering steps tested, no remarkable differences were found in genetic diversity and differentiation within species with the SNP panels obtained with a de novo approach. The main differences were found in brown trout between the de novo and reference genome approaches. Genotyped vs missing data mismatches were the main genotyping difference detected between the two building-loci pipelines or between the de novo and reference genome comparisons. Conclusions The tested building-loci pipelines for selection of SNP panels seem to have low influence on population genetics inference across the diverse case-study scenarios studied here. However, preliminary trials with different bioinformatic pipelines are suggested to evaluate their influence on population parameters according to the specific goals of each study.


2020 ◽  
Author(s):  
Adrian Casanova ◽  
Francesco Maroso ◽  
Andrés Blanco ◽  
Miguel Hermida ◽  
Nestor Rios ◽  
...  

Abstract Background The irruption of Next-generation sequencing (NGS) and restriction site-associated DNA sequencing (RAD-seq) in the last decade has led to the identification of thousands of molecular markers and their genotyping for refined genomic screening. This approach has been especially useful for non-model organisms with limited genomic resources. Many building-loci pipelines have been developed to obtain robust single nucleotide polymorphism (SNP) genotyping datasets using a de novo RAD-seq approach, i.e. without reference genomes. Here, the performances of two building-loci pipelines, STACKS 2 and Meyer’s 2b-RAD v2.1 pipeline, were compared using a diverse set of aquatic species representing different genomic and/or population structure scenarios. Two bivalve species (Manila clam and common edible cockle) and three fish species (brown trout, silver catfish and small-spotted catshark) were studied. Four SNP panels were evaluated in each species to test both different building-loci pipelines and criteria for SNP selection. Furthermore, for Manila clam and brown trout, a reference genome approach was used as control. Results Although different outcomes were observed between pipelines and species with the diverse SNP calling and filtering steps tested, no remarkable differences were found in genetic diversity and differentiation within species with the SNP panels obtained with a de novo approach. The main differences were found in brown trout between the de novo and reference genome approaches. Genotyped vs missing data mismatches were the main genotyping difference detected between the two building-loci pipelines or between the de novo and reference genome comparisons. Conclusions Building-loci pipelines do not seem to have a substantial influence on population genetics inference. Nevertheless, we recommend being careful with certain building-loci pipeline parameters and SNP filtering steps, especially when a de novo approach is used. Preliminary trials with subsets of data should be performed for comparison of genetic diversity and differentiation, but always considering the specific goals of the study.


2020 ◽  
Author(s):  
Cristian Salinas-Restrepo ◽  
Elizabeth Misas ◽  
Sebastian Estrada-Gomez ◽  
Juan Quintana ◽  
Fanny Guzman ◽  
...  

Abstract Background: Spiders are among the most venomous animals in nature. Their venom constitutes a source of novel and innovative peptides and proteins with medicinal and biotechnological interest. However, their potential as antimicrobial, anti-cancerous, anti-hypertensive and even in the modulation of nociception is under-studied, mainly because handling the venom is technically challenging and there is a paucity of next-generation-sequencing (NGS) data. Due to the increasing evidence of underestimation of the number of genes by the use of a single transcriptome assembler, we re-assembled and optimized the de novo transcriptome of the venom gland of the recently described Colombian spider P. verdolaga, by using three free access algorithms: Trinity, SOAPdenovo and SPAdes. All the assemblies were evaluated by statistical parameters (e.g. contigs, GC%, max and min length and N50) and by applying BUSCO's terms retrieval against the arthropod data set to determine the best assembly for each software. Results: Our analyses show that while approximately 54% of all the assembled and structurally annotated sequences could be found in all three algorithms, around 23% of these were unique for Trinity and 21% were unique for SPAdes. The non-redundant merge of all three assemblies’ output permitted the annotation of 8640 sequences; at least 15% more when compared to each software separately, and an increase of 20% when compared to a previous P. verdolaga assembly. Analysis of the annotated genes allowed the identification of unreported lectins, kinins and over 200 peptides and proteins with potential antimicrobial and protease inhibition activities. Furthermore, homology search against the Arachnoserver database and the EROP knowledgebase allowed the identification of 135 novel theraphotoxins of biotechnological interest. Conclusion: Transcriptomic data is of utmost importance for spiders, as it is one of the more feasible and scalable ways to characterize these organisms. However, the use of a single de novo assembler implies an under-representation of the expressed sequences, as has been shown here. In the generation of data for non-model organisms as well as in the search for novel peptides and proteins with biotechnological interest, it is highly recommended that at least two different assemblers are employed.
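The abstract above lists N50 among the statistics used to compare assemblies. As a reminder of what that number means, the following is a minimal sketch of the usual definition: the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly size. The function name `n50` and the toy contig lengths are illustrative assumptions, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches at least half of the summed length; the length of
    the contig that crosses that point is the N50.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths): total = 1500, half = 750.
# 500 + 400 = 900 >= 750, so the N50 is 400.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))  # prints 400
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why it is commonly reported alongside contig counts and length extremes; it says nothing about correctness, which is one reason a completeness check such as BUSCO is used as well.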


2020 ◽  
Author(s):  
Adrian Casanova ◽  
Francesco Maroso ◽  
Andrés Blanco ◽  
Miguel Hermida ◽  
Nestor Rios ◽  
...  

Abstract Background: The irruption of Next-generation sequencing (NGS) and restriction site-associated DNA sequencing (RAD-seq) in the last decade has led to the identification of thousands of molecular markers and their genotyping for refined genomic screening. This approach has been especially useful for non-model organisms with limited genomic resources. Many building-loci pipelines have been developed to obtain robust single nucleotide polymorphism (SNP) genotyping datasets using a de novo RAD-seq approach, i.e. without reference genomes. Here, the performances of two building-loci pipelines, STACKS 2 and Meyer’s 2b-RAD v2.1 pipeline, were compared using a diverse set of aquatic species representing different genomic and/or population structure scenarios. Two bivalve species (Manila clam and common edible cockle) and three fish species (brown trout, silver catfish and small-spotted catshark) were studied. Four SNP panels were evaluated in each species to test both different building-loci pipelines and criteria for SNP selection. Furthermore, for Manila clam and brown trout, a reference genome approach was used as control. Results: Although different outcomes were observed between pipelines and species with the diverse SNP calling and filtering steps tested, no remarkable differences were found in genetic diversity and differentiation within species with the SNP panels obtained with a de novo approach. The main differences were found in brown trout between the de novo and reference genome approaches. Genotyped vs missing data mismatches were the main genotyping difference detected between the two building-loci pipelines or between the de novo and reference genome comparisons. Conclusions: The tested building-loci pipelines do not seem to have a substantial influence on population genetics inference. Preliminary trials with bioinformatic pipelines are suggested to evaluate their influence on population parameters related to the specific goals of the study.


Animals ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 2226
Author(s):  
Sazia Kunvar ◽  
Sylwia Czarnomska ◽  
Cino Pertoldi ◽  
Małgorzata Tokarska

The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without reference genome) and the other is a reference pipeline (with reference genome). Moreover, we used a reference pipeline with two different genomes, i.e., Bos taurus and European bison. GBS is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline in comparison to the de novo pipeline. The decreased number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. It has been confirmed that all the SNPs obtained with the de novo/Bos taurus and Bos taurus reference pipelines were unique and not included in the 800 K BovineHD BeadChip.


2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Xin Zhou ◽  
Serafim Batzoglou ◽  
Arend Sidow ◽  
Lu Zhang

Abstract Background: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software tools such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. Results: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes, followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80% to 99% of false positives regardless of how large the candidate DNM set is. Conclusions: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
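The filtering idea described in this abstract can be illustrated with a small sketch: once reads at a candidate site are split by haplotype, each haplotype should show a single allele, so a haplotype that still looks heterozygous marks the candidate as a likely false positive. This is not HAPDeNovo's code. The function names, the representation of reads as strings of base calls, and the 10% minor-allele threshold are all assumptions made for illustration, as is the choice to treat a haplotype with no reads as uninformative.

```python
from collections import Counter


def looks_heterozygous(bases, max_minor_fraction=0.1):
    """Return True if one haplotype's base calls support more than one allele.

    `bases` holds the base observed at the candidate site in each read
    assigned to this haplotype. The threshold is an assumed tolerance for
    sequencing errors, not a value taken from the paper.
    """
    if not bases:
        # Assumption: no reads means no evidence of heterozygosity.
        return False
    top_count = Counter(bases).most_common(1)[0][1]
    minor_fraction = 1 - top_count / len(bases)
    return minor_fraction > max_minor_fraction


def keep_candidate(hap1_bases, hap2_bases, max_minor_fraction=0.1):
    """Keep a candidate DNM only if neither haplotype looks heterozygous."""
    return not (
        looks_heterozygous(hap1_bases, max_minor_fraction)
        or looks_heterozygous(hap2_bases, max_minor_fraction)
    )


# Hypothetical candidates: haplotype 1 reads, haplotype 2 reads.
print(keep_candidate("AAAAAAAAAA", "GGGGGGGGGG"))  # clean on both haplotypes
print(keep_candidate("AAAAAGGGGG", "AAAAAAAAAA"))  # mixed on haplotype 1
```

The first candidate is kept because each haplotype carries one allele, which is what a true heterozygous site looks like after phasing; the second is discarded because a single haplotype cannot genuinely carry two alleles.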


2020 ◽  
Author(s):  
Nan Dong ◽  
Julia Bandura ◽  
Zhaolei Zhang ◽  
Yan Wang ◽  
Karine Labadie ◽  
...  

Abstract Background. The pond snail Lymnaea stagnalis (L. stagnalis) has been widely used as a model organism in neurobiology, ecotoxicology, and parasitology due to the relative simplicity of its CNS. However, its usefulness is restricted by a limited availability of transcriptome data. While sequence information for the L. stagnalis CNS transcripts has been obtained from an EST library and a de novo RNA-seq assembly, the quality of these assemblies is limited by a combination of low coverage of EST libraries, the fragmented nature of de novo assemblies, and the lack of a reference genome. Results. In this study, taking advantage of the recent availability of the L. stagnalis reference genome, we generated an RNA-seq library from the adult L. stagnalis CNS, using a combination of genome-guided and de novo assembly programs to identify 17,832 protein-coding L. stagnalis transcripts. We combined our library with existing resources to produce a transcript set with greater sequence length, completeness, and diversity than previously available ones. Using our assembly and functional domain analysis, we profiled L. stagnalis CNS transcripts encoding ion channels and ionotropic receptors, which are key proteins for CNS function, and compared their sequences to those of other vertebrate and invertebrate model organisms. Interestingly, L. stagnalis transcripts encoding numerous putative Ca2+ channels showed the most sequence similarity to those of mouse, zebrafish, Xenopus tropicalis, fruit fly, and C. elegans, suggesting that many calcium channel-related signaling pathways may be evolutionarily conserved. Conclusions. Our study provides the most thorough characterization to date of the L. stagnalis transcriptome and provides insights into differences between vertebrates and invertebrates in CNS transcript diversity, according to function and protein class. Furthermore, this study is, to the best of our knowledge, the first to provide a complete characterization of the ion channels of a single species, opening new avenues for future research on fundamental neurobiological processes.


Viruses ◽  
2020 ◽  
Vol 12 (7) ◽  
pp. 758 ◽  
Author(s):  
Keylie M. Gibson ◽  
Margaret C. Steiner ◽  
Uzma Rentia ◽  
Matthew L. Bendall ◽  
Marcos Pérez-Losada ◽  
...  

Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.


2017 ◽  
Vol 35 (4_suppl) ◽  
pp. 123-123
Author(s):  
Ciara Marie Kelly ◽  
Yelena Yuriy Janjigian ◽  
David Paul Kelsen ◽  
Marinela Capanu ◽  
Joanne F. Chou ◽  
...  

123 Background: FOLFOX is a preferred 1st-line tx for advanced EGA. We sought to characterize outcomes on subsequent tx and to see if MSK-IMPACT, a 410-gene next generation sequencing (NGS) platform, increases tx options. Methods: We retrospectively identified patients (pts) with advanced, Her2-negative EGA treated with 1st-line FOLFOX between Jan 2012 and Dec 2014. Clinicopathologic, tx and outcome data were analyzed. Overall survival (OS) was calculated from start of FOLFOX using Kaplan-Meier methods. Landmark analysis was used to compare OS and response status. Results: 185 pts were identified. The majority were Caucasian (82%), male (76%), ECOG PS 1 (67%), with poorly differentiated histology (72%) and de novo metastatic disease (84%). Median age was 64 years. The disease-control rate (DCR, partial response + stable disease) of FOLFOX was 80% [95%CI: 74%-85%]; 19% were FOLFOX primary refractory (FR). Median time-to-progression (TTP) on FOLFOX was 7 and 2 months (mo) for FOLFOX sensitive (FS) and FR pts, respectively. There was a higher proportion of females (26% vs. 14%, P = 0.18), gastric (43% vs. 23%, P = 0.051) and moderately differentiated tumors (26% vs. 12%, P = 0.113) in the FS vs. FR group. Six mo survival from the landmark time of 2 mo after initiation of FOLFOX was 83% [95%CI: 76%-89%], and 38% [95%CI: 20%-56%] for FS and FR pts, respectively (P < 0.01). A similar proportion of FS and FR pts received 2nd-line tx (65% vs. 69%). The DCR was similar in both groups (31% vs. 29%). 2nd-line tx included: irinotecan- (51%) and taxane-based regimens (32%) or a clinical trial (CT) (13%). The median TTP on 2nd-line tx was similar in FS and FR groups (2.5 vs. 2 mo). Ramucirumab was given in 14% of 2nd-line regimens. 3rd-line chemo use was similar in both groups (37% vs. 31%) but the DCR was lower in FR pts (18% vs. 9%). 51 pts had IMPACT; 1 pt (2%) enrolled onto a genotype-matched CT. 14 pts received immunotherapy; 1 FS pt has ongoing complete response 1+ year. Conclusions: Surprisingly, FS and FR pts derive similar, marginal benefit from 2nd-line tx, emphasizing the appropriateness of CT options in this setting. NGS rarely expanded tx options. Updated and in-depth NGS data will be presented.

