UPS-indel: a Universal Positioning System for Indels

Mapping Intimacies ◽

10.1101/133553 ◽

2017 ◽

Cited By ~ 3

Author(s):

Mohammad Shabbir Hasan ◽

Xiaowei Wu ◽

Layne T. Watson ◽

Zhiyi Li ◽

Liqing Zhang

Keyword(s):

State Of The Art ◽

Online Version ◽

Positioning System ◽

Command Line ◽

Human Chromosomes ◽

Link Type ◽

Indel Calling ◽

Downstream Analysis ◽

Command Line Version ◽

New System

Abstract

Background: Indels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequences. Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and may mislead downstream analysis and interpretations. About 10% of the human indels stored in dbSNP are redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publicly available databases. A unified system is also desirable for comparing the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which can also be used to compare indel calling results produced by different tools.

Results: UPS-indel identifies nearly 15% of indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% of indels as redundant, respectively. Comparing the performance of UPS-indel with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP, 2,118 more in the COSMIC coding dataset, and 553 more in the COSMIC noncoding dataset, in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to other state-of-the-art approaches for indel call set comparison demonstrates that UPS-indel is clearly superior to other approaches in finding indels in common among call sets.

Conclusions: UPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and the command line version is freely available to download at http://ups-indel.sourceforge.net. The online version of UPS-indel is available at http://bench.cs.vt.edu/ups-indel/.
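
The definition of equivalence in the abstract (two indels are equivalent when they lead to the same altered sequence) can be illustrated with a minimal sketch. This is a brute-force check of the definition only, not the UPS-indel algorithm; the function names, the 0-based coordinate convention, and the toy reference sequence are assumptions made for illustration.

```python
def apply_indel(reference, pos, ref_allele, alt_allele):
    """Return the sequence obtained by replacing ref_allele at 0-based pos with alt_allele."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the reference sequence")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


def are_equivalent(reference, indel_a, indel_b):
    """Two indels are equivalent when they produce the same altered sequence."""
    return apply_indel(reference, *indel_a) == apply_indel(reference, *indel_b)


# Toy reference with a short CA repeat (hypothetical data).
reference = "GCACACAT"

# Each indel is (0-based position, reference allele, alternate allele).
deletion_a = (1, "CA", "")  # delete the first CA unit
deletion_b = (3, "CA", "")  # delete the second CA unit
deletion_c = (2, "AC", "")  # same length, shifted by one base
deletion_d = (6, "AT", "")  # a different deletion outside the repeat

same_ab = are_equivalent(reference, deletion_a, deletion_b)
same_ac = are_equivalent(reference, deletion_a, deletion_c)
same_ad = are_equivalent(reference, deletion_a, deletion_d)
```

In this toy case the first three deletions differ in position or allele but all yield the same altered sequence, so a database storing them separately would hold redundant entries; the fourth deletion yields a different sequence.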

Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification

Microbial Genomics ◽

10.1099/mgen.0.000685 ◽

2021 ◽

Vol 7 (11) ◽

Author(s):

Oliver Schwengers ◽

Lukas Jelonek ◽

Marius Alfred Dieckmann ◽

Sebastian Beyvers ◽

Jochen Blom ◽

...

Keyword(s):

Software Tool ◽

Software Tools ◽

Command Line ◽

Bacterial Genomes ◽

Functional Annotations ◽

Link Type ◽

Small Proteins ◽

Alignment Free ◽

Sequence Identification ◽

Downstream Analysis

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.
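
One general way to identify a sequence without alignment is an exact lookup keyed by a hash digest of the sequence. The abstract does not spell out Bakta's mechanism, so the sketch below is only an illustration of that general idea under stated assumptions: the digest function, the table layout, the example protein, and the cross-reference identifier are all hypothetical and are not taken from Bakta.

```python
import hashlib


def sequence_digest(protein_sequence):
    """Hash a protein sequence so identical sequences map to the same key."""
    normalized = protein_sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


# Hypothetical reference table: digest -> annotation with a database cross-reference.
known_proteins = {
    sequence_digest("MKTAYIAKQR"): {
        "product": "example protein (hypothetical)",
        "xref": "EXAMPLE:0001",
    },
}


def identify(protein_sequence, table):
    """Return the stored annotation for an exact sequence match, or None."""
    return table.get(sequence_digest(protein_sequence))


hit = identify("mktayiakqr", known_proteins)   # exact match after normalization
miss = identify("MKTAYIAKQK", known_proteins)  # one residue differs, so no match
```

An exact lookup like this costs a single hash and dictionary access per sequence, which is why it can be much faster than alignment; the trade-off is that any sequence differing by even one residue falls through and needs another method.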

TRTools: a toolkit for genome-wide analysis of tandem repeats

10.1101/2020.03.17.996033 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nima Mousavi ◽

Jonathan Margoliash ◽

Neha Pusarla ◽

Shubham Saini ◽

Richard Yanicky ◽

...

Keyword(s):

Quality Control ◽

Tandem Repeats ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Genome Wide Analysis ◽

Link Type ◽

Genome Wide ◽

Wide Range ◽

Downstream Analysis

Abstract

Summary: A rich set of tools has recently been developed for performing genome-wide genotyping of tandem repeats (TRs). However, standardized tools for downstream analysis of these results are lacking. To facilitate TR analysis applications, we present TRTools, a Python library and a suite of command-line tools for filtering, merging, and quality control of TR genotype files. TRTools utilizes an internal harmonization module making it compatible with outputs from a wide range of TR genotypers.

Availability: TRTools is freely available at https://github.com/gymreklab/[email protected]

Supplementary information: Supplementary data are available at bioRxiv.

BuddySuite: Command-line toolkits for manipulating sequences, alignments, and phylogenetic trees

10.1101/040675 ◽

2016 ◽

Author(s):

Stephen R. Bond ◽

Karl E. Keat ◽

Sofia N. Barreira ◽

Andreas D. Baxevanis

Keyword(s):

Sequence Alignment ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

General Purpose ◽

Command Line ◽

Link Type ◽

File Formats ◽

Downstream Analysis ◽

Python Package ◽

Common Sequence

Abstract: The ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools, is written in the popular programming language Python, and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite_wiki. All software is open source and freely available through http://research.nhgri.nih.gov/software/BuddySuite.

Fortran90 sources of the subroutines of UMWEG. III. The UMWEG-specific subroutines

Journal of Applied Crystallography ◽

10.1107/s0021889806051211 ◽

2007 ◽

Vol 40 (1) ◽

pp. 185-187

Author(s):

Elisabeth Rossmanith

Keyword(s):

Command Line ◽

Main Program ◽

Command Line Version

The Fortran90 sources of the UMWEG-specific subroutines of the program UMWEG are presented and deposited, together with the PostScript-plot software subroutines and the simple and short main program of the command-line version of the program UMWEG.

Erratum: Martínez-López et al. (2017)

Journal of Teaching in Physical Education ◽

10.1123/jtpe.2017-0147 ◽

2017 ◽

Vol 36 (3) ◽

pp. 371

Keyword(s):

Physical Education ◽

Self Efficacy ◽

Online Version ◽

Physical Education Teachers ◽

Link Type ◽

Efficacy Expectations

In the article by Martínez-López, E.J., Zamora-Aguilera, N., Grao-Cruces, A., and De la Torre-Cruz, M.J., “The Association Between Spanish Physical Education Teachers’ Self-Efficacy Expectations and Their Attitudes Toward Overweight and Obese Students,” in Journal of Teaching in Physical Education, 36, 2, https://doi.org/10.1123/jtpe.2014-0125, the author order was incorrectly listed. The online version of this article has been corrected.

Automated Dental Identification with Lowest Cost Path-Based Teeth and Jaw Separation

Scandinavian Journal of Forensic Science ◽

10.1515/sjfs-2016-0008 ◽

2016 ◽

Vol 22 (2) ◽

pp. 44-56 ◽

Cited By ~ 3

Author(s):

Jan-Vidar Ølberg ◽

Morten Goodwin

Keyword(s):

High Stability ◽

Distance Measure ◽

State Of The Art ◽

Test Set ◽

X Ray ◽

Dental Identification ◽

Verification Process ◽

Stability And Accuracy ◽

New System ◽

Dental Work

Abstract: Teeth are some of the most resilient tissues of the human body. Because of their placement, teeth often yield intact indicators even when other metrics, such as fingerprints and DNA, are missing. Forensic dental identification is currently mostly manual work, which is time and resource intensive. Systems for automated human identification from dental X-ray images have the potential to greatly reduce the effort spent on dental identification, but this requires a system with high stability and accuracy so that the results can be trusted. This paper proposes a new system for automated dental X-ray identification. The scheme extracts tooth and dental work contours from the X-ray images and uses the Hausdorff distance measure for ranking persons. This combination of state-of-the-art approaches with a novel lowest cost path-based method for separating a dental X-ray image into individual teeth achieves results comparable to or better than those available in the literature. The proposed scheme is fully functional and is used to accurately identify people within a real dental database. The system is able to perfectly separate 88.7% of the teeth in the test set. Further, in the verification process, the system ranks the correct person at the top in 86% of the cases, and among the top five in an astonishing 94% of the cases. The approach has compelling potential to significantly reduce the time spent on dental identification.
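
The ranking step described in the abstract relies on the Hausdorff distance between contours. A minimal sketch of that distance and of ranking candidates by it is given below, assuming each contour is a small list of 2D points; the contour coordinates and person labels are made up for illustration, and the paper's actual contour extraction and matching pipeline is not reproduced here.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in points_a to its nearest point in points_b."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


# Hypothetical contours as lists of (x, y) points.
query_contour = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
database = {
    "person_1": [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)],
    "person_2": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.5)],
}

# Rank candidates from most to least similar (smaller distance is better).
ranking = sorted(database, key=lambda name: hausdorff(query_contour, database[name]))
```

Because the Hausdorff distance is driven by the single worst-matching point, it is sensitive to outliers in a contour, which is one reason accurate tooth separation upstream matters for the ranking.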

DANNP: an efficient artificial neural network pruning tool

PeerJ Computer Science ◽

10.7717/peerj-cs.137 ◽

2017 ◽

Vol 3 ◽

pp. e137 ◽

Cited By ~ 7

Author(s):

Mona Alshahrani ◽

Othman Soufan ◽

Arturo Magana-Mora ◽

Vladimir B. Bajic

Keyword(s):

Neural Network ◽

State Of The Art ◽

Model Performance ◽

Training Data ◽

Classification Problems ◽

Link Type ◽

On Line ◽

Pruning Algorithms ◽

Artificial Neural ◽

The Impact

Background: Artificial neural networks (ANNs) are a robust class of machine learning models and are a frequent choice for solving classification problems. However, determining the structure of the ANNs is not trivial as a large number of weights (connection links) may lead to overfitting the training data. Although several ANN pruning algorithms have been proposed for the simplification of ANNs, these algorithms are not able to efficiently cope with intricate ANN structures required for complex classification problems.

Methods: We developed DANNP, a web-based tool that implements parallelized versions of several ANN pruning algorithms. The DANNP tool uses a modified version of the Fast Compressed Neural Network software implemented in C++ to considerably enhance the running time of the ANN pruning algorithms we implemented. In addition to the performance evaluation of the pruned ANNs, we systematically compared the set of features that remained in the pruned ANN with those obtained by different state-of-the-art feature selection (FS) methods.

Results: Although the ANN pruning algorithms are not entirely parallelizable, DANNP was able to speed up the ANN pruning up to eight times on a 32-core machine, compared to the serial implementations. To assess the impact of the ANN pruning by the DANNP tool, we used 16 datasets from different domains. In eight out of the 16 datasets, DANNP significantly reduced the number of weights by 70%–99%, while maintaining a competitive or better model performance compared to the unpruned ANN. Finally, we used a naïve Bayes classifier derived with the features selected as a byproduct of the ANN pruning and demonstrated that its accuracy is comparable to those obtained by the classifiers trained with the features selected by several state-of-the-art FS methods. The FS ranking methodology proposed in this study allows the users to identify the most discriminant features of the problem at hand.

To the best of our knowledge, DANNP (publicly available at www.cbrc.kaust.edu.sa/dannp) is the only available and on-line accessible tool that provides multiple parallelized ANN pruning options. Datasets and DANNP code can be obtained at www.cbrc.kaust.edu.sa/dannp/data.php and https://doi.org/10.5281/zenodo.1001086.

idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽

2020 ◽

Author(s):

Xun Zhu ◽

Ti-Cheng Chang ◽

Richard Webby ◽

Gang Wu

Keyword(s):

Personal Computer ◽

Source Code ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Public Dataset ◽

Virus Isolates

Abstract: idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV makes calls equivalent to those annotated by Nextstrain.org on all three common clade systems, working directly from user-uploaded FastQ files. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including personal computers, HPC systems, and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.
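
The idea of calling a clade from a clade-defining marker list can be sketched as follows. This is an illustration only, not idCOV's implementation: it assumes the bases at marker positions have already been called from the reads, and the clade names, positions, and bases are hypothetical placeholders rather than real SARS-CoV-2 markers.

```python
def call_clade(observed_bases, clade_markers):
    """Return the most specific clade whose markers all match, or None.

    observed_bases: dict mapping genome position -> called base
    clade_markers:  dict mapping clade name -> {position: expected base}
    """
    matching = [
        clade
        for clade, markers in clade_markers.items()
        if all(observed_bases.get(pos) == base for pos, base in markers.items())
    ]
    if not matching:
        return None
    # Prefer the clade defined by the largest number of markers.
    return max(matching, key=lambda clade: len(clade_markers[clade]))


# Hypothetical marker list: clade_Y is nested inside clade_X.
clade_markers = {
    "clade_X": {100: "T"},
    "clade_Y": {100: "T", 250: "G"},
}

call_nested = call_clade({100: "T", 250: "G"}, clade_markers)
call_parent = call_clade({100: "T", 250: "A"}, clade_markers)
call_none = call_clade({100: "C", 250: "A"}, clade_markers)
```

Preferring the clade with the most matched markers is one simple way to handle nested clade definitions; a real pipeline would also need to deal with missing coverage and ambiguous base calls.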

Protein structure and sequence re-analysis of 2019-nCoV genome does not indicate snakes as its intermediate host or the unique similarity between its spike protein insertions and HIV-1

10.1101/2020.02.04.933135 ◽

2020 ◽

Cited By ~ 8

Author(s):

Chengxin Zhang ◽

Wei Zheng ◽

Xiaoqiang Huang ◽

Eric W. Bell ◽

Xiaogen Zhou ◽

...

Keyword(s):

State Of The Art ◽

Careful Analysis ◽

Spike Protein ◽

The Novel ◽

Intermediate Hosts ◽

Cellular Mechanisms ◽

Computational Approaches ◽

Link Type ◽

Hiv 1 ◽

Existing Data

Abstract: As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions shared a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data of coronaviruses, we concluded that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the “novel insertions” observed in the spike protein are naturally evolved from bat coronaviruses.

REVA as a Well-curated Database for Human Expression-modulating Variants

10.1101/2021.02.24.432622 ◽

2021 ◽

Author(s):

Yu Wang ◽

Fang-Yuan Shi ◽

Yu Liang ◽

Ge Gao

Keyword(s):

Large Scale ◽

Regulatory Mechanism ◽

State Of The Art ◽

Scale Analysis ◽

Computational Tools ◽

Functional Annotations ◽

Link Type ◽

Large Scale Analysis ◽

Multiple State ◽

Limited Sensitivity

Abstract: More than 80% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database of over 11.8 million experimentally tested noncoding variants with expression-modulating potential. We provide 2424 functional annotations that can be used to pinpoint plausible regulatory mechanisms of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality, experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for users in the noncoding variants community. REVA is available at http://reva.gao-lab.org.
