Full-length de novo viral quasispecies assembly through variation graph construction

2018 ◽  
Author(s):  
Jasmijn A. Baaijens ◽  
Bastiaan Van der Roest ◽  
Johannes Köster ◽  
Leen Stougie ◽  
Alexander Schönhuth

Abstract Motivation Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly refers to reconstructing the strain-specific haplotypes from read data and predicting their relative abundances within the mix of strains, an important step for various treatment-related reasons. Reference-genome-independent (“de novo”) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. It remains to reconstruct full-length haplotypes together with their abundances from such contigs. Method We first construct a variation graph, a recently popular structure suitable for arranging and integrating several related genomes, from the short input contigs, without making use of a reference genome. To obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of being compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Results Benchmarking experiments on challenging simulated data sets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates. As a consequence, our method outperforms all state-of-the-art viral quasispecies assemblers that aim at the construction of full-length haplotypes, in terms of various relevant assembly measures. Our tool, Virus-VG, is publicly available at https://bitbucket.org/jbaaijens/virus-vg.

2019 ◽  
Vol 35 (24) ◽  
pp. 5086-5094 ◽  
Author(s):  
Jasmijn A Baaijens ◽  
Bastiaan Van der Roest ◽  
Johannes Köster ◽  
Leen Stougie ◽  
Alexander Schönhuth

Abstract Motivation Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly is the reconstruction of strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains is an important step for various treatment-related reasons. Reference-genome-independent (‘de novo’) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. The remaining challenge is to reconstruct full-length haplotypes together with their abundances from such contigs. Results We present Virus-VG as a de novo approach to viral haplotype reconstruction from preassembled contigs. Our method constructs a variation graph from the short input contigs without making use of a reference genome. Then, to obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of being compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Benchmarking experiments on challenging simulated and real datasets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates compared to the state-of-the-art viral quasispecies assemblers. Availability and implementation Virus-VG is freely available at https://bitbucket.org/jbaaijens/virus-vg. Supplementary information Supplementary data are available at Bioinformatics online.
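The coverage-compatibility objective described above can be illustrated on a toy instance: given candidate maximal-length paths through a variation graph and per-node read coverages, choose non-negative path abundances that minimize the deviation between predicted and observed coverage. A minimal Python sketch follows; the graph, the candidate paths, the absolute-deviation objective, and the brute-force grid search are illustrative simplifications, not the paper's actual formulation or solver:

```python
from itertools import product

# Toy variation graph: four nodes with observed read coverages.
coverage = {"A": 30, "B": 10, "C": 20, "D": 30}

# Candidate maximal-length paths (hypothetical haplotypes).
paths = {
    "hap1": ["A", "B", "D"],
    "hap2": ["A", "C", "D"],
}

def deviation(abundances):
    """Sum over nodes of |observed - predicted| coverage, where the
    prediction is the total abundance of all paths using the node."""
    total = 0
    for node, obs in coverage.items():
        pred = sum(ab for name, ab in abundances.items()
                   if node in paths[name])
        total += abs(obs - pred)
    return total

# Exhaustive search over a coarse abundance grid; fine for a toy
# instance, whereas the paper solves a proper optimization problem.
grid = range(0, 41, 5)
best = min(
    ({"hap1": a1, "hap2": a2} for a1, a2 in product(grid, grid)),
    key=deviation,
)
print(best)            # {'hap1': 10, 'hap2': 20}
print(deviation(best)) # 0
```

With these coverages the solution is unique: hap1 must explain all of B (10) and hap2 all of C (20), which together also account for A and D exactly.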


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Francesc López-Giráldez ◽  
Andrew H. Moeller ◽  
Jeffrey P. Townsend

Phylogenetic research is often stymied by selection of a marker that leads to poor phylogenetic resolution despite considerable cost and effort. Profiles of phylogenetic informativeness provide a quantitative measure for prioritizing gene sampling to resolve branching order in a particular epoch. To evaluate the utility of these profiles, we analyzed phylogenomic data sets from metazoans, fungi, and mammals, thus encompassing diverse time scales and taxonomic groups. We also evaluated the utility of profiles created based on simulated data sets. We found that genes selected via their informativeness dramatically outperformed haphazard sampling of markers. Furthermore, our analyses demonstrate that the original phylogenetic informativeness method can be extended to trees with more than four taxa. Thus, although the method currently predicts phylogenetic signal without specifically accounting for the misleading effects of stochastic noise, it is robust to the effects of homoplasy. The phylogenetic informativeness rankings obtained will allow other researchers to select advantageous genes for future studies within these clades, maximizing return on effort and investment. Genes identified might also yield efficient experimental designs for phylogenetic inference for many sister clades and outgroup taxa that are closely related to the diverse groups of organisms analyzed.
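The notion of a phylogenetic informativeness profile can be made concrete. A commonly cited quartet approximation models per-site informativeness at epoch depth t as 16*lambda^2*t*exp(-4*lambda*t) for a site evolving at rate lambda; summing over sites yields a gene's profile. The rates and time points below are hypothetical, and the constant and functional form are simplified for illustration:

```python
import math

def site_informativeness(rate, t):
    """Per-site phylogenetic informativeness at epoch depth t
    (quartet approximation; constants simplified for illustration)."""
    return 16 * rate**2 * t * math.exp(-4 * rate * t)

def gene_profile(site_rates, times):
    """Informativeness profile of a gene: sum of per-site curves."""
    return [sum(site_informativeness(r, t) for r in site_rates)
            for t in times]

fast = [2.0] * 100   # hypothetical fast-evolving marker
slow = [0.2] * 100   # hypothetical slow-evolving marker
times = [0.1, 0.5, 1.0, 2.0]

pf = gene_profile(fast, times)
ps = gene_profile(slow, times)
print(pf[0] > ps[0])   # True: fast marker is better near the present
print(pf[-1] < ps[-1]) # True: slow marker is better for deep divergences
```

This captures why ranking genes by informativeness for a target epoch outperforms haphazard marker sampling: fast-evolving sites saturate at deep time scales while slow sites carry too little signal for recent splits.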


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5852
Author(s):  
Yu-Yu Lin ◽  
Ping Chun Wu ◽  
Pei-Lung Chen ◽  
Yen-Jen Oyang ◽  
Chien-Yu Chen

Background The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary trend for resolving haplotypes by reducing conflicts in a single-nucleotide polymorphism (SNP)-fragment matrix. However, it is frequently observed that the solution with the optimal MEC might not be the real haplotypes, because MEC methods consider all positions together, and conflicts in noisy regions can sometimes mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. Results This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confidence variant pairs to build haplotypes progressively. Compared to other MEC-based methods, HAHap achieved lower phasing error rates on both real and simulated data when constructing haplotypes from short whole-genome sequencing reads. We compared the number of error corrections (ECs) on real data with other methods, which revealed the ability of HAHap to predict haplotypes with a lower number of ECs. We also used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting the applicability of HAHap in certain situations.
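The MEC objective that HAHap is contrasted with can be made concrete: given a SNP-fragment matrix and a candidate haplotype pair, the score is the total number of matrix entries that must be corrected so that every fragment agrees with one of the two haplotypes. A minimal sketch follows; the matrix and haplotypes are toy values, and real MEC solvers search over haplotype pairs rather than scoring a fixed one:

```python
# Toy SNP-fragment matrix: rows are read fragments, columns are variant
# positions; '0'/'1' are alleles, '-' means the read misses the site.
fragments = [
    "011-",
    "01-0",
    "100-",
    "-011",
]

def mec_score(h1, h2, fragments):
    """Minimum error correction: assign each fragment to whichever
    haplotype it mismatches least, and sum those mismatches."""
    def mismatches(frag, hap):
        return sum(1 for f, h in zip(frag, hap) if f != "-" and f != h)
    return sum(min(mismatches(f, h1), mismatches(f, h2))
               for f in fragments)

h1, h2 = "0110", "1001"  # complementary candidate haplotypes
print(mec_score(h1, h2, fragments))  # 1
```

Here only the last fragment conflicts with both haplotypes (at one position), so one correction suffices; this single number is what MEC-based phasers minimize globally, which is exactly where noisy regions can mislead them.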


2015 ◽  
Author(s):  
Thomas F Cooke ◽  
Muh-Ching Yee ◽  
Marina Muzzio ◽  
Alexandra Sockell ◽  
Ryan Bell ◽  
...  

Reduced representation sequencing methods such as genotyping-by-sequencing (GBS) enable low-cost measurement of genetic variation without the need for a reference genome assembly. These methods are widely used in genetic mapping and population genetics studies, especially with non-model organisms. Variant calling error rates, however, are higher in GBS than in standard sequencing, in particular due to restriction site polymorphisms, and few computational tools exist that specifically model and correct these errors. We developed a statistical method to remove errors caused by restriction site polymorphisms, implemented in the software package GBStools. We evaluated it in several simulated data sets, varying in number of samples, mean coverage and population mutation rate, and in two empirical human data sets (N = 8 and N = 63 samples). In our simulations, GBStools improved genotype accuracy more than commonly used filters such as Hardy-Weinberg equilibrium p-values. GBStools is most effective at removing genotype errors in data sets over 100 samples when coverage is 40X or higher, and the improvement is most pronounced in species with high genomic diversity. We also demonstrate the utility of GBS and GBStools for human population genetic inference in Argentine populations and reveal widely varying individual ancestry proportions and an excess of singletons, consistent with recent population growth.
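The Hardy-Weinberg equilibrium filter used as a baseline above can be sketched with a simple chi-square test on genotype counts; allele dropout caused by a restriction site polymorphism typically shows up as a heterozygote deficit and a tiny p-value. This is an illustrative baseline, not GBStools' statistical model (which explicitly models restriction site polymorphisms), and real pipelines often prefer an exact test:

```python
import math

def hwe_chi2_p(n_AA, n_Aa, n_aa):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium on
    genotype counts; returns the p-value."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency of A
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip([n_AA, n_Aa, n_aa], expected))
    # Survival function of chi-square with 1 df: P(X > chi2)
    return math.erfc(math.sqrt(chi2 / 2))

# Allele dropout at a restriction site mimics a heterozygote deficit:
print(hwe_chi2_p(40, 10, 40))  # tiny p: strong HWE violation
print(hwe_chi2_p(25, 50, 25))  # 1.0: perfect HWE proportions
```

A filter would discard sites whose p-value falls below some threshold; GBStools aims to do better than this by modeling the dropout mechanism directly.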


2020 ◽  
Author(s):  
Janine Egert ◽  
Bettina Warscheid ◽  
Clemens Kreutz

Abstract Motivation Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of a suitable imputation algorithm (DIMA). Results The performance and broad applicability of DIMA is demonstrated on 121 quantitative proteomics data sets from the PRIDE database and on simulated data consisting of 5–50% MVs with different proportions of missing-not-at-random and missing-completely-at-random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 84% of the cases. Availability and Implementation Source code is freely available for download at github.com/clemenskreutz/OmicsData.
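The data-driven selection idea can be sketched as follows: hide additional, simulated MVs among the observed entries, run each candidate imputation method, and keep the method with the lowest RMSE on the hidden entries. The two candidate imputers below (global mean and global minimum) are illustrative stand-ins, not DIMA's actual algorithm library:

```python
import math
import random

random.seed(0)

# Toy "proteomics" matrix of observed intensities (None = missing).
data = [[random.gauss(20, 2) for _ in range(6)] for _ in range(8)]

def plant_mvs(matrix, frac=0.2):
    """Hide a fraction of observed entries, remembering their values."""
    hidden, out = {}, [row[:] for row in matrix]
    cells = [(i, j) for i in range(len(matrix))
             for j in range(len(matrix[0]))]
    for i, j in random.sample(cells, int(frac * len(cells))):
        hidden[(i, j)] = out[i][j]
        out[i][j] = None
    return out, hidden

def impute_mean(matrix):
    obs = [v for row in matrix for v in row if v is not None]
    m = sum(obs) / len(obs)
    return [[m if v is None else v for v in row] for row in matrix]

def impute_min(matrix):
    obs = [v for row in matrix for v in row if v is not None]
    lo = min(obs)
    return [[lo if v is None else v for v in row] for row in matrix]

def rmse(imputed, hidden):
    return math.sqrt(sum((imputed[i][j] - v) ** 2
                         for (i, j), v in hidden.items()) / len(hidden))

masked, hidden = plant_mvs(data)
scores = {name: rmse(fn(masked), hidden)
          for name, fn in [("mean", impute_mean), ("min", impute_min)]}
best = min(scores, key=scores.get)  # pick the lowest-RMSE method
print(best, scores)
```

On these missing-completely-at-random values, mean imputation wins; minimum imputation is a common choice precisely when values are missing not at random (below detection limit), which is why a data-driven choice matters.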


Author(s):  
Nico Borgsmüller ◽  
Jose Bonet ◽  
Francesco Marass ◽  
Abel Gonzalez-Perez ◽  
Nuria Lopez-Bigas ◽  
...  

Abstract The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intra-tumor heterogeneity by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq data sets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Here we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. BnpC employs a Dirichlet process mixture model coupled with a Markov chain Monte Carlo sampling scheme, including a modified split-merge move and a novel posterior estimator to predict clones and genotypes. We benchmarked our method comprehensively against state-of-the-art methods on simulated data of various sizes, and applied it to three cancer scDNA-seq data sets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime, and scalability. Its inferred genotypes were the most accurate, and it was the only method able to run and produce results on data sets with 10,000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever-growing scDNA-seq data sets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve intra-tumor heterogeneity but also as a pre-processing step to reduce data size. BnpC is freely available under the MIT license at https://github.com/cbg-ethz/BnpC.
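The non-parametric ingredient of BnpC can be sketched via the Chinese restaurant process, the standard constructive view of the partition prior induced by a Dirichlet process: each cell joins an existing clone with probability proportional to its size, or opens a new clone with probability proportional to a concentration parameter. This is a prior-only draw for illustration; BnpC couples such a prior with a mutation-profile likelihood and split-merge MCMC moves:

```python
import random

random.seed(1)

def crp_partition(n_cells, alpha=1.0):
    """Draw a partition of cells into clones from the Chinese
    restaurant process (the Dirichlet process partition prior)."""
    sizes = []        # sizes[k] = number of cells in clone k
    assignment = []   # clone index per cell
    for _ in range(n_cells):
        weights = sizes + [alpha]      # existing clones, or a new one
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(sizes):            # opened a new clone
            sizes.append(0)
        sizes[k] += 1
        assignment.append(k)
    return assignment, sizes

assignment, sizes = crp_partition(100)
print(len(sizes))   # number of clones is random, not fixed in advance
print(sum(sizes))   # 100: every cell is assigned
```

The "rich get richer" weighting concentrates cells into a few large clones while still allowing new clones to appear, which is what lets the number of clusters grow with the data instead of being set beforehand.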


2017 ◽  
Vol 43 (1) ◽  
pp. 115-131 ◽  
Author(s):  
Marc J. Lanovaz ◽  
Patrick Cardinal ◽  
Mary Francis

Although visual inspection remains common in the analysis of single-case designs, the lack of agreement between raters is an issue that may seriously compromise its validity. Thus, the purpose of our study was to develop and examine the properties of a simple structured criterion to supplement the visual analysis of alternating-treatment designs. To this end, we generated simulated data sets with varying numbers of points and conditions, effect sizes, and autocorrelations, and then measured the Type I error rates and power produced by the visual structured criterion (VSC) and by permutation analyses. We also validated the Type I error rate results using nonsimulated data. Overall, our results indicate that using the VSC as a supplement for the analysis of systematically alternating-treatment designs with at least five points per condition generally provides adequate control over Type I error rates and sufficient power to detect most behavior changes.
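The permutation analyses used for comparison can be sketched for a two-condition alternating-treatment data set: shuffle the pooled observations many times and count how often the shuffled mean difference is at least as extreme as the observed one. The data values below are hypothetical:

```python
import random

random.seed(2)

# Alternating-treatment data: outcome per session under conditions A, B.
a = [3, 4, 3, 5, 4]
b = [7, 8, 6, 7, 9]

def perm_test(x, y, n_perm=5000):
    """Monte Carlo permutation test on the difference of condition means."""
    obs = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = x + y
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= obs:
            extreme += 1
    return extreme / n_perm

p = perm_test(a, b)
print(p < 0.05)  # True: the behavior change is detected
```

With five points per condition and no overlap between conditions, the exact permutation p-value is 2/252 (only the observed split and its mirror are this extreme), so the Monte Carlo estimate lands well below 0.05.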


2015 ◽  
Author(s):  
Mahul Chakraborty ◽  
James G. Baldwin-Brown ◽  
Anthony D. Long ◽  
J.J. Emerson

Abstract Genome assemblies that are accurate, complete, and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long-molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements, and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum-standard Drosophila melanogaster reference genome, we analyze recently published long-molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long-molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a “missing manual” that guides key decisions in building high-quality de novo genome assemblies, from DNA isolation to polishing the assembly.


2019 ◽  
Author(s):  
Jasmijn A. Baaijens ◽  
Leen Stougie ◽  
Alexander Schönhuth

Abstract The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains. Given that the use of a reference genome can introduce significant biases, de novo approaches are most suitable for this task. So far, reference-genome-independent assemblers have been shown to reconstruct haplotypes for mixed samples of limited complexity and genomes not exceeding 10,000 bp in length. Here, we present VG-Flow, a de novo approach that enables full-length haplotype reconstruction from pre-assembled contigs of complex mixed samples. Our method increases the contiguity of the input assembly and, at the same time, performs haplotype abundance estimation. VG-Flow is the first approach whose runtime is polynomial, rather than exponential, in the size of the underlying graphs. Since runtime increases only linearly with genome length in practice, it also enables the reconstruction of genomes that are orders of magnitude longer, thereby establishing the first de novo solution to strain-aware full-length genome assembly applicable to bacterial-sized genomes. VG-Flow is based on the flow variation graph, a novel concept that both captures all diversity present in the sample and enables casting the central contig abundance estimation problem as a flow-like, polynomial-time-solvable optimization problem. As a consequence, we are in a position to compute maximal-length haplotypes by decomposing the resulting flow efficiently using a greedy algorithm, and to obtain accurate frequency estimates for the reconstructed haplotypes through linear programming techniques. Benchmarking experiments show that our method outperforms state-of-the-art approaches on mixed samples from short genomes in terms of assembly accuracy as well as abundance estimation. Experiments on longer, bacterial-sized genomes demonstrate that VG-Flow is the only current approach that can reconstruct full-length haplotypes from mixed samples at the strain level in human-affordable runtime.
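The greedy flow decomposition step can be illustrated on a toy flow network: repeatedly extract the source-to-sink path with the widest bottleneck, subtract its flow, and report each extracted path with its bottleneck value as a haplotype plus abundance estimate. The graph below and the DFS-based widest-path search are illustrative simplifications of the paper's method:

```python
# Toy flow on a DAG (source 's', sink 't'); edge -> flow value.
flow = {
    ("s", "a"): 30, ("a", "b"): 10, ("a", "c"): 20,
    ("b", "t"): 10, ("c", "t"): 20,
}

def widest_path(flow, src, snk):
    """Source-to-sink path maximizing the minimum edge flow
    (plain DFS; fine for a toy DAG)."""
    best = (0, None)
    def dfs(node, path, bottleneck):
        nonlocal best
        if node == snk:
            if bottleneck > best[0]:
                best = (bottleneck, path)
            return
        for (u, v), f in flow.items():
            if u == node and f > 0:
                dfs(v, path + [v], min(bottleneck, f))
    dfs(src, [src], float("inf"))
    return best

def greedy_decompose(flow, src="s", snk="t"):
    """Peel off widest paths until no flow remains; each path is a
    candidate haplotype, its bottleneck an abundance estimate."""
    flow = dict(flow)
    haplotypes = []
    while True:
        width, path = widest_path(flow, src, snk)
        if path is None:
            break
        for u, v in zip(path, path[1:]):
            flow[(u, v)] -= width
        haplotypes.append((path, width))
    return haplotypes

print(greedy_decompose(flow))
```

On this instance the decomposition recovers two haplotypes, s-a-c-t with abundance 20 and s-a-b-t with abundance 10, exactly consuming the flow; VG-Flow additionally refines frequencies via linear programming.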


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e2988 ◽  
Author(s):  
Cédric Cabau ◽  
Frédéric Escudié ◽  
Anis Djari ◽  
Yann Guiguen ◽  
Julien Bobe ◽  
...  

Background De novo transcriptome assembly of short reads is now a common step in expression analysis of organisms lacking a reference genome sequence. Several software packages are available to perform this task. Even if their results are of good quality, it is still possible to improve them in several ways, including redundancy reduction and error correction. Trinity and Oases are two commonly used de novo transcriptome assemblers. The contig sets they produce are of good quality. Still, their compaction (the number of contigs needed to represent the transcriptome) and their quality (chimera and nucleotide error rates) can be improved. Results We built a de novo RNA-Seq Assembly Pipeline (DRAP), which wraps these two assemblers (Trinity and Oases) in order to improve their results with regard to the above-mentioned criteria. DRAP reduces the number of resulting contigs 1.3- to 15-fold, depending on the read set and the assembler used. This article presents seven assembly comparisons showing, in some cases, drastic improvements when using DRAP. DRAP does not significantly impair assembly quality metrics such as read realignment rate or protein reconstruction counts. Conclusion Transcriptome assembly is a challenging computational task. Even though good solutions are already available to end-users, these solutions can still be improved while conserving the overall representation and quality of the assembly. The de novo RNA-Seq Assembly Pipeline (DRAP) is an easy-to-use software package that produces a compact and corrected transcript set. DRAP is free, open-source and available under the GPL v3 license at http://www.sigenae.org/drap.
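Redundancy reduction, one of the improvements mentioned above, can be sketched in its simplest form: drop any contig that is fully contained in a longer contig or in that contig's reverse complement. This is an illustrative containment filter, not DRAP's actual procedure:

```python
def reduce_redundancy(contigs):
    """Keep only contigs not contained in a longer retained contig
    (checking the reverse complement as well)."""
    comp = str.maketrans("ACGT", "TGCA")
    def rc(s):
        return s.translate(comp)[::-1]
    keep = []
    for c in sorted(contigs, key=len, reverse=True):
        if not any(c in k or rc(c) in k for k in keep):
            keep.append(c)
    return keep

contigs = ["ATCGGATC", "TCGGA", "GATCCGAT", "TTTT"]
# TCGGA is a substring of ATCGGATC; GATCCGAT is its reverse complement.
print(reduce_redundancy(contigs))  # ['ATCGGATC', 'TTTT']
```

Real pipelines use overlap-aware clustering rather than exact containment, but the principle is the same: fewer contigs representing the same transcriptome improves compaction without losing sequence content.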

