BuddySuite: Command-line toolkits for manipulating sequences, alignments, and phylogenetic trees

10.1101/040675 ◽

2016 ◽

Author(s):

Stephen R. Bond ◽

Karl E. Keat ◽

Sofia N. Barreira ◽

Andreas D. Baxevanis

Keyword(s):

Sequence Alignment ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

General Purpose ◽

Command Line ◽

Link Type ◽

File Formats ◽

Downstream Analysis ◽

Python Package ◽

Common Sequence

Abstract: The ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools, is written in the popular programming language Python, and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite_wiki. All software is open source and freely available through http://research.nhgri.nih.gov/software/BuddySuite.
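
As a rough illustration of what automatic format detection can involve, the sketch below guesses a file format from its first non-blank line. The function name `guess_format` and the heuristics are hypothetical and chosen for this example only; they are not taken from BuddySuite, whose actual detection logic is not described in the abstract.

```python
def guess_format(text: str) -> str:
    """Guess a sequence/tree file format from the first non-blank line.

    Hypothetical heuristic for illustration; real files need stricter checks.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.upper().startswith("#NEXUS"):
            return "nexus"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("# STOCKHOLM"):
            return "stockholm"
        if line.startswith("(") and text.rstrip().endswith(";"):
            return "newick"
        parts = line.split()
        # A PHYLIP header is two integers: number of taxa and alignment length.
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            return "phylip"
        return "unknown"
    return "unknown"
```

Only the first informative line is inspected, which keeps the check cheap but means a malformed file can still be misclassified; a production tool would typically confirm the guess by attempting a full parse.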


Pydigree: a python library for manipulation and forward-time simulation of genetic datasets

10.1101/213413 ◽

2017 ◽

Author(s):

James E. Hicks

Keyword(s):

Population Genetics ◽

Data Structures ◽

Genetic Epidemiology ◽

Genetic Data ◽

Link Type ◽

File Formats ◽

Time Simulation ◽

Cross Platform ◽

User Friendly ◽

Python Package

Abstract: The development of software for working with data from population genetics or genetic epidemiology often requires substantial time spent implementing common procedures. Pydigree is a cross-platform Python 3 library that contains efficient, user-friendly implementations of many of these common functions, and support for input from common file formats. Developers can combine the functions and data structures to rapidly implement programs handling genetic data. Pydigree presents a useful environment for development of applications for genetic data or rapid prototyping before reimplementation in a higher-performance language. Pydigree is freely available under an open source license. Stable sources can be found in the Python Package Index at https://pypi.python.org/pypi/pydigree/, and development sources can be downloaded at https://github.com/jameshicks/pydigree/


AGEpy: a Python package for computational biology

10.1101/450890 ◽

2018 ◽

Cited By ~ 1

Author(s):

Franziska Metge ◽

Robert Sehlke ◽

Jorge Boucas

Keyword(s):

Computational Biology ◽

Open Source ◽

High Throughput ◽

Biological Data ◽

Command Line ◽

High Throughput Analysis ◽

Throughput Analysis ◽

Link Type ◽

Biological Meaning ◽

Python Package

Abstract. Summary: AGEpy is a Python package focused on the transformation of interpretable data into biological meaning. It is designed to support high-throughput analysis of pre-processed biological data using either local Python-based processing or Python-based API calls to local or remote servers. In this application note we describe its different Python modules as well as its command-line accessible tools aDiff, abed, blasto, david, and obo2tsv. Availability: The open source AGEpy Python package is freely available at https://github.com/mpg-age-bioinformatics/AGEpy. Contact: [email protected]


A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees

Molecular Biology and Evolution ◽

10.1093/molbev/msab264 ◽

2021 ◽

Author(s):

Jakob McBroome ◽

Bryan Thornlow ◽

Angie S Hinrichs ◽

Alexander Kramer ◽

Nicola De Maio ◽

...

Keyword(s):

Phylogenetic Trees ◽

Evolutionary History ◽

Command Line ◽

Sequencing Data ◽

Comprehensive View ◽

File Formats ◽

Public Data

Abstract: The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
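
The idea of a tree whose branches carry mutations can be sketched with a small in-memory structure. The `Node` class, the `sample_mutations` function, the sample names, and the mutation labels below are illustrative assumptions; this is not the MAT file format or the matUtils interface, only a minimal model of how a sample's mutations follow from the branches between the root and its leaf.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    # Mutations on the branch leading to this node (illustrative labels).
    mutations: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def sample_mutations(node: Node, sample: str,
                     path: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Collect branch mutations from the root down to the named sample.

    Returns None if the sample is not in the tree.
    """
    path = path + tuple(node.mutations)
    if node.name == sample:
        return list(path)
    for child in node.children:
        found = sample_mutations(child, sample, path)
        if found is not None:
            return found
    return None


# A toy tree: two samples share an ancestral mutation, a third does not.
tree = Node("root", [], [
    Node("internal_1", ["C241T"], [
        Node("sample_A", ["A23403G"]),
        Node("sample_B"),
    ]),
    Node("sample_C", ["G100A"]),
])
```

Because shared mutations are stored once on an ancestral branch rather than once per sequence, the structure stays compact even when many samples descend from the same node.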


Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification

Microbial Genomics ◽

10.1099/mgen.0.000685 ◽

2021 ◽

Vol 7 (11) ◽

Author(s):

Oliver Schwengers ◽

Lukas Jelonek ◽

Marius Alfred Dieckmann ◽

Sebastian Beyvers ◽

Jochen Blom ◽

...

Keyword(s):

Software Tool ◽

Software Tools ◽

Command Line ◽

Bacterial Genomes ◽

Functional Annotations ◽

Link Type ◽

Small Proteins ◽

Alignment Free ◽

Sequence Identification ◽

Downstream Analysis

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.
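
One general way to identify a sequence without aligning it is to look up a digest of the exact sequence in a precomputed table. The sketch below assumes that exact-match hashing approach; the abstract does not spell out the mechanism, and the table contents, the identifier, and the function names are made up for illustration rather than taken from Bakta's database or API.

```python
import hashlib
from typing import Dict, Optional


def digest(protein: str) -> str:
    """Return a hex digest of a protein sequence, ignoring letter case."""
    return hashlib.md5(protein.upper().encode("ascii")).hexdigest()


# Hypothetical lookup table mapping sequence digests to database identifiers.
known: Dict[str, str] = {digest("MKTAYIAKQR"): "hypothetical-id-0001"}


def identify(protein: str, table: Dict[str, str]) -> Optional[str]:
    """Look up a protein by exact sequence; None means no identical entry."""
    return table.get(digest(protein.rstrip("*")))  # drop a trailing stop symbol
```

A dictionary lookup takes constant time per sequence, which is where the speed-up over alignment comes from, at the cost of finding only identical sequences; anything that misses would still need a slower similarity search.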


Phylommand - a command line software package for phylogenetics

F1000Research ◽

10.12688/f1000research.10446.1 ◽

2016 ◽

Vol 5 ◽

pp. 2903

Author(s):

Martin Ryberg

Keyword(s):

Software Package ◽

Evolutionary Biology ◽

Phylogenetic Trees ◽

Phylogenetic Analyses ◽

Command Line ◽

File Formats

Phylogenetics is an intrinsic part of many analyses in evolutionary biology and ecology, and as the amount of data available for these analyses is increasing rapidly the need for automated pipelines to deal with the data also increases. Phylommand is a package of four programs to create, manipulate, and/or analyze phylogenetic trees or pairwise alignments. It is built to be easily implemented in software workflows, both directly on the command prompt, and executed using scripts. Inputs can be taken from standard input or a file, and the behavior of the programs can be changed through switches. By using standard file formats for phylogenetic analyses, such as newick, nexus, phylip, and fasta, phylommand is widely compatible with other software.
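
The pattern of accepting a tree from either a file or standard input can be sketched as follows. The function names are hypothetical, the convention that a missing path or `-` means standard input is an assumption of this example, and the regular expression handles only simple Newick strings without quoted labels or bracketed comments; none of this is Phylommand's own code.

```python
import re
import sys
from typing import List, Optional


def read_tree_text(path: Optional[str] = None) -> str:
    """Read tree text from a file, or from standard input if no path is given."""
    if path is None or path == "-":
        return sys.stdin.read()
    with open(path) as handle:
        return handle.read()


def tip_labels(newick: str) -> List[str]:
    """Return tip names from a simple Newick string.

    A tip label directly follows "(" or ","; labels after ")" belong to
    internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
```

In a script, `tip_labels(read_tree_text(sys.argv[1] if len(sys.argv) > 1 else None))` would let the same program sit at the end of a shell pipe or be pointed at a file.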


Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale

10.1101/146514 ◽

2017 ◽

Author(s):

Alex V. Kotlar ◽

Cristina E. Trevino ◽

Michael E. Zwick ◽

David J. Cutler ◽

Thomas S. Wingo

Keyword(s):

Natural Language ◽

Search Engine ◽

Processing Time ◽

General Purpose ◽

Whole Genome ◽

Command Line ◽

Variant Annotation ◽

Link Type ◽

Key Innovation ◽

Genome Scale

Abstract: Accurately selecting relevant alleles in large sequencing experiments remains technically challenging. Bystro (https://bystro.io/) is the first online, cloud-based application that makes variant annotation and filtering accessible to all researchers for terabyte-sized whole-genome experiments containing thousands of samples. Its key innovation is a general-purpose, natural-language search engine that enables users to identify and export alleles and samples of interest in milliseconds. The search engine dramatically simplifies complex filtering tasks that previously required programming experience or specialty command-line programs. Critically, Bystro’s annotation and filtering capabilities are orders of magnitude faster than previous solutions, saving weeks of processing time for large experiments.


TRTools: a toolkit for genome-wide analysis of tandem repeats

10.1101/2020.03.17.996033 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nima Mousavi ◽

Jonathan Margoliash ◽

Neha Pusarla ◽

Shubham Saini ◽

Richard Yanicky ◽

...

Keyword(s):

Quality Control ◽

Tandem Repeats ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Genome Wide Analysis ◽

Link Type ◽

Genome Wide ◽

Wide Range ◽

Downstream Analysis

Abstract. Summary: A rich set of tools has recently been developed for performing genome-wide genotyping of tandem repeats (TRs). However, standardized tools for downstream analysis of these results are lacking. To facilitate TR analysis applications, we present TRTools, a Python library and a suite of command-line tools for filtering, merging, and quality control of TR genotype files. TRTools utilizes an internal harmonization module, making it compatible with outputs from a wide range of TR genotypers. Availability: TRTools is freely available at https://github.com/gymreklab/[email protected] Supplementary information: Supplementary data are available at bioRxiv.


UPS-indel: a Universal Positioning System for Indels

10.1101/133553 ◽

2017 ◽

Cited By ~ 3

Author(s):

Mohammad Shabbir Hasan ◽

Xiaowei Wu ◽

Layne T. Watson ◽

Zhiyi Li ◽

Liqing Zhang

Keyword(s):

State Of The Art ◽

Online Version ◽

Positioning System ◽

Command Line ◽

Human Chromosomes ◽

Link Type ◽

Indel Calling ◽

Downstream Analysis ◽

Command Line Version ◽

New System

Abstract. Background: Indels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequences. Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and may mislead downstream analysis and interpretations. About 10% of the human indels stored in dbSNP are redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publicly available databases. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare indel calling results produced by different tools. Results: UPS-indel identifies nearly 15% of indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% of indels as redundant, respectively. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP, 2,118 more in the COSMIC coding dataset, and 553 more in the COSMIC noncoding indel dataset, in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to other state-of-the-art approaches for indel call set comparison demonstrates that UPS-indel is clearly superior to other approaches in finding indels in common among call sets. Conclusions: UPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and the command line version is freely available to download at http://ups-indel.sourceforge.net. The online version of UPS-indel is available at http://bench.cs.vt.edu/ups-indel/.
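
The notion of equivalent indels can be made concrete with a brute-force check: apply each indel to the same reference and compare the resulting sequences. This is only a sketch of the definition under simplified assumptions (0-based positions, a toy reference, hypothetical function names); it is not the coordinate system that UPS-indel constructs.

```python
def apply_deletion(ref: str, pos: int, length: int) -> str:
    """Delete `length` bases from `ref`, starting at 0-based index `pos`."""
    return ref[:pos] + ref[pos + length:]


def apply_insertion(ref: str, pos: int, seq: str) -> str:
    """Insert `seq` into `ref` before 0-based index `pos`."""
    return ref[:pos] + seq + ref[pos:]


ref = "GCACACAT"

# Three deletions inside the CA repeat, differing in position and in the
# deleted allele ("CA", "AC", "CA"), all produce the same altered sequence.
del_a = apply_deletion(ref, 1, 2)
del_b = apply_deletion(ref, 2, 2)
del_c = apply_deletion(ref, 3, 2)

# A deletion outside the repeat produces a different sequence.
del_d = apply_deletion(ref, 6, 2)

# Two insertions of different alleles at different positions are likewise equivalent.
ins_a = apply_insertion(ref, 1, "CA")
ins_b = apply_insertion(ref, 2, "AC")
```

Comparing full altered sequences is impractical at genome scale, which is why a coordinate representation that assigns equivalent indels the same position is attractive.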


Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data

10.1101/647958 ◽

2019 ◽

Cited By ~ 3

Author(s):

Lucas Czech ◽

Pierre Barbera ◽

Alexandros Stamatakis

Keyword(s):

Phylogenetic Trees ◽

Command Line ◽

Computationally Efficient ◽

Data Types ◽

Low Level ◽

Phylogenetic Placement ◽

Link Type ◽

Phylogenetic Data ◽

Command Line Tool ◽

High Level

Summary: We present GENESIS, a library for working with phylogenetic data, and GAPPA, an accompanying command line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies, and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested, and field-proven. Availability and Implementation: Both GENESIS and GAPPA are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/[email protected] and [email protected].


BiSulfite Bolt: A bisulfite sequencing analysis platform

GigaScience ◽

10.1093/gigascience/giab033 ◽

2021 ◽

Vol 10 (5) ◽

Author(s):

Colin Farrell ◽

Michael Thompson ◽

Anela Tosevska ◽

Adewale Oyetunde ◽

Matteo Pellegrini

Keyword(s):

Data Aggregation ◽

Bisulfite Sequencing ◽

Low Complexity ◽

Sequencing Analysis ◽

Command Line ◽

Sequencing Data ◽

Bisulfite Sequencing Data ◽

Analysis Platform ◽

Python Package ◽

Bisulfite Sequencing Analysis

Abstract. Background: Bisulfite sequencing is commonly used to measure DNA methylation. Processing bisulfite sequencing data is often challenging owing to the computational demands of mapping a low-complexity, asymmetrical library and the lack of a unified processing toolset to produce an analysis-ready methylation matrix from read alignments. To address these shortcomings, we have developed BiSulfite Bolt (BSBolt), a fast and scalable bisulfite sequencing analysis platform. BSBolt performs a pre-alignment sequencing read assessment step to improve efficiency when handling asymmetrical bisulfite sequencing libraries. Findings: We evaluated BSBolt against simulated and real bisulfite sequencing libraries. We found that BSBolt provides accurate and fast bisulfite sequencing alignments and methylation calls. We also compared BSBolt to several existing bisulfite alignment tools and found BSBolt outperforms Bismark, BSSeeker2, BISCUIT, and BWA-Meth based on alignment accuracy and methylation calling accuracy. Conclusion: BSBolt offers streamlined processing of bisulfite sequencing data through an integrated toolset that offers support for simulation, alignment, methylation calling, and data aggregation. BSBolt is implemented as a Python package and command line utility for flexibility when building informatics pipelines. BSBolt is available at https://github.com/NuttyLogic/BSBolt under an MIT license.
