Solving the scaffolding problem with repeats

2018 ◽  
Author(s):  
Igor Mandric ◽  
Alex Zelikovsky

Abstract One of the most important steps in genome assembly is scaffolding. Increasing the length of sequencing reads makes it possible to assemble short genomes, but assembling long, repeat-rich genomes remains one of the most interesting and challenging problems in bioinformatics, and there is high demand for computational approaches to repeat-aware scaffolding. In this paper, we propose BATISCAF, a novel repeat-aware scaffolder based on an optimization formulation for filtering out repeated and short contigs. Our experiments with five benchmarking datasets show that BATISCAF outperforms state-of-the-art tools. BATISCAF is freely available on GitHub: https://github.com/mandricigor/batiscaf.
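As a rough illustration of the contig-filtering idea (not BATISCAF's actual optimization formulation, which the paper defines precisely), a naive heuristic might flag contigs that are very short or whose coverage far exceeds the assembly median, a common signature of collapsed repeats. All names and thresholds below are hypothetical:

```python
# Hypothetical toy filter: drop contigs that are very short or whose
# coverage is far above the assembly median (a common signature of
# collapsed repeats). Illustrative only; not BATISCAF's formulation.
from statistics import median

def filter_contigs(contigs, min_len=500, repeat_factor=1.5):
    """contigs: list of (name, length, coverage) tuples."""
    med_cov = median(cov for _, _, cov in contigs)
    kept, dropped = [], []
    for name, length, cov in contigs:
        if length < min_len or cov > repeat_factor * med_cov:
            dropped.append(name)  # likely short or repeated
        else:
            kept.append(name)
    return kept, dropped

kept, dropped = filter_contigs([
    ("c1", 12000, 30.0),  # typical contig
    ("c2", 300, 28.0),    # too short to place reliably
    ("c3", 8000, 95.0),   # coverage spike: likely a collapsed repeat
])
```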

Author(s):  
Chengxin Zhang ◽  
Wei Zheng ◽  
Xiaoqiang Huang ◽  
Eric W. Bell ◽  
Xiaogen Zhou ◽  
...  

Abstract As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions shared a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data of coronaviruses, we concluded that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the “novel insertions” observed in the spike protein are naturally evolved from bat coronaviruses.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Mikko Rautiainen ◽  
Tobias Marschall

Abstract Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools.
Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


2018 ◽  
Author(s):  
Jasmijn A. Baaijens ◽  
Alexander Schönhuth

Abstract Haplotype-aware genome assembly plays an important role in genetics, medicine, and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference-independent haplotig computation has not yet reached maturity. We present POLYTE (POLYploid genome fitTEr) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.
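The iterative join-and-grow scheme can be illustrated with a toy overlap assembler that repeatedly merges two sequences sharing an exact suffix-prefix overlap. POLYTE's actual method is haplotype-aware and statistical, so the logic and threshold below are purely hypothetical, assuming error-free toy reads:

```python
# Toy iterative overlap assembler: repeatedly merge two sequences that
# share an exact suffix-prefix overlap of at least min_ovl bases.
# POLYTE's actual scheme is haplotype-aware; this only shows the
# iterate-and-join idea.
def merge_once(seqs, min_ovl):
    for i, a in enumerate(seqs):
        for j, b in enumerate(seqs):
            if i == j:
                continue
            # try the longest overlap first
            for k in range(min(len(a), len(b)), min_ovl - 1, -1):
                if a.endswith(b[:k]):
                    rest = [s for t, s in enumerate(seqs) if t not in (i, j)]
                    return rest + [a + b[k:]], True
    return seqs, False

def assemble(seqs, min_ovl=3):
    changed = True
    while changed:  # iterate until no pair can be joined
        seqs, changed = merge_once(seqs, min_ovl)
    return sorted(seqs)

# "ATTAGAC" and "GACCTT" overlap on "GAC" -> "ATTAGACCTT"
contigs = assemble(["GACCTT", "ATTAGAC"])
```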


2019 ◽  
Author(s):  
Mikko Rautiainen ◽  
Tobias Marschall

Abstract Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pan-genome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to state-of-the-art tools, GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employing GraphAligner for error correction, we find it to be almost 3x more accurate and over 15x faster than extant tools.
Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner
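The core computational task, aligning a read to a graph rather than a linear reference, can be sketched as dynamic programming over a DAG where each node holds one base and the best score propagates along edges. This gap-free toy DP is far from GraphAligner's bit-parallel algorithm (which also handles cyclic graphs); it is only meant to show the shape of the problem:

```python
# Toy sequence-to-graph alignment on a DAG: each node carries one base,
# and dp[v][j] is the best score for aligning the first j read characters
# along some path ending at node v (match/mismatch only, no gaps).
# Purely illustrative; not GraphAligner's algorithm.
def align_to_dag(nodes, edges, read, match=1, mismatch=-1):
    """nodes: {id: base}, inserted in topological order;
    edges: {id: [successor ids]}. Returns the best full-read score."""
    preds = {v: [] for v in nodes}
    for u, vs in edges.items():
        for v in vs:
            preds[v].append(u)
    NEG = float("-inf")
    dp = {v: [NEG] * (len(read) + 1) for v in nodes}
    best = NEG
    for v, base in nodes.items():  # relies on topological insertion order
        for j in range(1, len(read) + 1):
            s = match if base == read[j - 1] else mismatch
            if j == 1:
                dp[v][j] = s  # an alignment may start at any node
            else:
                prev = max((dp[u][j - 1] for u in preds[v]), default=NEG)
                dp[v][j] = prev + s if prev != NEG else NEG
        best = max(best, dp[v][len(read)])
    return best

# Diamond graph spelling "ACC" (via b) or "AGC" (via c):
nodes = {"a": "A", "b": "C", "c": "G", "d": "C"}
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
score = align_to_dag(nodes, edges, "ACC")
```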


2017 ◽  
Vol 3 ◽  
pp. e137 ◽  
Author(s):  
Mona Alshahrani ◽  
Othman Soufan ◽  
Arturo Magana-Mora ◽  
Vladimir B. Bajic

Background Artificial neural networks (ANNs) are a robust class of machine learning models and are a frequent choice for solving classification problems. However, determining the structure of an ANN is not trivial, as a large number of weights (connection links) may lead to overfitting the training data. Although several ANN pruning algorithms have been proposed for the simplification of ANNs, these algorithms cannot efficiently cope with the intricate ANN structures required for complex classification problems.

Methods We developed DANNP, a web-based tool that implements parallelized versions of several ANN pruning algorithms. The DANNP tool uses a modified version of the Fast Compressed Neural Network software, implemented in C++, to considerably reduce the running time of the ANN pruning algorithms we implemented. In addition to evaluating the performance of the pruned ANNs, we systematically compared the set of features that remained in the pruned ANN with those obtained by different state-of-the-art feature selection (FS) methods.

Results Although the ANN pruning algorithms are not entirely parallelizable, DANNP sped up ANN pruning by up to eight times on a 32-core machine compared to the serial implementations. To assess the impact of pruning by the DANNP tool, we used 16 datasets from different domains. In eight of the 16 datasets, DANNP significantly reduced the number of weights by 70%–99% while maintaining competitive or better model performance compared to the unpruned ANN. Finally, we used a naïve Bayes classifier trained on the features selected as a byproduct of the ANN pruning and demonstrated that its accuracy is comparable to those obtained by classifiers trained with the features selected by several state-of-the-art FS methods. The FS ranking methodology proposed in this study allows users to identify the most discriminant features of the problem at hand.
To the best of our knowledge, DANNP (publicly available at www.cbrc.kaust.edu.sa/dannp) is the only available, online-accessible tool that provides multiple parallelized ANN pruning options. Datasets and the DANNP code can be obtained at www.cbrc.kaust.edu.sa/dannp/data.php and https://doi.org/10.5281/zenodo.1001086.
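For intuition, the simplest member of the pruning family is magnitude-based pruning, which zeroes the smallest-magnitude weights. DANNP implements several more refined algorithms and parallelizes them, so the sketch below, with hypothetical names and thresholds, is only the baseline idea:

```python
# Minimal magnitude-based pruning: keep only the largest-magnitude
# weights and zero the rest. DANNP's algorithms are more sophisticated;
# this is just the simplest form of removing low-salience links.
def prune_by_magnitude(weights, keep_fraction=0.3):
    """weights: flat list of floats; returns (pruned copy, kept count)."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # threshold = magnitude of the n_keep-th largest weight
    cutoff = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    pruned = [w if abs(w) >= cutoff else 0.0 for w in weights]
    return pruned, sum(1 for w in pruned if w != 0.0)

pruned, n_kept = prune_by_magnitude([0.9, -0.05, 0.4, 0.01, -0.7],
                                    keep_fraction=0.4)
```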


2021 ◽  
Author(s):  
Yu Wang ◽  
Fang-Yuan Shi ◽  
Yu Liang ◽  
Ge Gao

Abstract More than 80% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database of over 11.8 million experimentally tested noncoding variants with expression-modulating potential. We provided 2424 functional annotations that could be used to pinpoint the plausible regulatory mechanisms of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality, experimentally tested expression-modulating variants with extensive functional annotations, which will be useful to the noncoding variant community. REVA is available at http://reva.gao-lab.org.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Georgios D. Koutsovoulos ◽  
Marine Poullet ◽  
Abdelnaser Elashry ◽  
Djampa K. L. Kozlowski ◽  
Erika Sallet ◽  
...  

A Correction to this paper has been published: https://doi.org/10.1038/s41597-020-00747-0


2015 ◽  
Vol 17 (8) ◽  
pp. 6016-6027 ◽  
Author(s):  
Shaun T. Mutter ◽  
François Zielinski ◽  
James R. Cheeseman ◽  
Christian Johannessen ◽  
Paul L. A. Popelier ◽  
...  

Raman optical activity combined with state-of-the-art computational approaches successfully probes the conformational space of two important carbohydrates.


2019 ◽  
Vol 21 (7) ◽  
pp. 3431-3439 ◽  
Author(s):  
Cristina Puzzarini ◽  
Nicola Tasinato ◽  
Julien Bloino ◽  
Lorenzo Spada ◽  
Vincenzo Barone

A route toward the detection of the methyl-cyclopropenyl cation in space: a spectroscopic characterization by state-of-the-art computational approaches.


2019 ◽  
Author(s):  
Priyanka Ghosh ◽  
Sriram Krishnamoorthy ◽  
Ananth Kalyanaraman

Abstract De novo genome assembly is a fundamental problem in bioinformatics that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting assembly at scale remains a challenging problem because of the inherent complexities associated with data movement and the irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed-memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies that reduce the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases of the full genome assembly pipeline, namely k-mer counting and contig generation.
A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed-memory and shared-memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.
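K-mer counting, the first of the two phases PaKman targets, has a simple serial baseline: slide a window of length k over each read and tally occurrences. PaKman's contribution is doing this efficiently on distributed memory; the sketch below is only the textbook version for orientation:

```python
# Serial baseline for k-mer counting: slide a window of length k over
# each read and tally occurrences. Real assemblers like PaKman
# distribute and compress this step; this is only the textbook version.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

counts = count_kmers(["ATCGA", "TCGAT"], 3)  # "TCG" appears in both reads
```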

