scholarly journals BuddySuite: Command-line toolkits for manipulating sequences, alignments, and phylogenetic trees

2016 ◽  
Author(s):  
Stephen R. Bond ◽  
Karl E. Keat ◽  
Sofia N. Barreira ◽  
Andreas D. Baxevanis

AbstractThe ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools, it is written in the popular programming language Python, and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite_wiki. All software is open source and freely available through http://research.nhgri.nih.gov/software/BuddySuite.

2017 ◽  
Author(s):  
James E. Hicks

AbstractThe development of software for working with data from population genetics or genetic epidemiology often requires substantial time spent implementing common procedures. Pydigree is a cross-platform Python 3 library that contains efficient, user friendly implementations for many of these common functions, and support for input from common file formats. Developers can combine the functions and data structures to rapidly implement programs handling genetic data. Pydigree presents a useful environment for development of applications for genetic data or rapid prototyping before reimplementation in a higher-performance language.Pydigree is freely available under an open source license. Stable sources can be found in the Python Package Index at https://pypi.python.org/pypi/pydigree/, and development sources can be downloaded at https://github.com/jameshicks/pydigree/


2018 ◽  
Author(s):  
Franziska Metge ◽  
Robert Sehlke ◽  
Jorge Boucas

AbstractSummary:AGEpy is a Python package focused on the transformation of interpretable data into biological meaning. It is designed to support high-throughput analysis of pre-processed biological data using either local Python based processing or Python based API calls to local or remote servers. In this application note we describe its different Python modules as well as its command line accessible toolsaDiff,abed,blasto,david, andobo2tsv.Availability:The open source AGEpy Python package is freely available at:https://github.com/mpg-age-bioinformatics/AGEpy.Contact:[email protected]


Author(s):  
Jakob McBroome ◽  
Bryan Thornlow ◽  
Angie S Hinrichs ◽  
Alexander Kramer ◽  
Nicola De Maio ◽  
...  

Abstract The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently-proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils—a command-line utility for rapidly querying, interpreting and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.


2021 ◽  
Vol 7 (11) ◽  
Author(s):  
Oliver Schwengers ◽  
Lukas Jelonek ◽  
Marius Alfred Dieckmann ◽  
Sebastian Beyvers ◽  
Jochen Blom ◽  
...  

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2903
Author(s):  
Martin Ryberg

Phylogenetics is an intrinsic part of many analyses in evolutionary biology and ecology, and as the amount of data available for these analyses is increasing rapidly the need for automated pipelines to deal with the data also increases. Phylommand is a package of four programs to create, manipulate, and/or analyze phylogenetic trees or pairwise alignments. It is built to be easily implemented in software workflows, both directly on the command prompt, and executed using scripts. Inputs can be taken from standard input or a file, and the behavior of the programs can be changed through switches. By using standard file formats for phylogenetic analyses, such as newick, nexus, phylip, and fasta, phylommand is widely compatible with other software.


2017 ◽  
Author(s):  
Alex V. Kotlar ◽  
Cristina E. Trevino ◽  
Michael E. Zwick ◽  
David J. Cutler ◽  
Thomas S. Wingo

AbstractAccurately selecting relevant alleles in large sequencing experiments remains technically challenging. Bystro (https://bystro.io/) is the first online, cloud-based application that makes variant annotation and filtering accessible to all researchers for terabyte-sized whole-genome experiments containing thousands of samples. Its key innovation is a general-purpose, natural-language search engine that enables users to identify and export alleles and samples of interest in milliseconds. The search engine dramatically simplifies complex filtering tasks that previously required programming experience or specialty command-line programs. Critically, Bystro’s annotation and filtering capabilities are orders of magnitude faster than previous solutions, saving weeks of processing time for large experiments.


Author(s):  
Nima Mousavi ◽  
Jonathan Margoliash ◽  
Neha Pusarla ◽  
Shubham Saini ◽  
Richard Yanicky ◽  
...  

AbstractSummaryA rich set of tools have recently been developed for performing genome-wide genotyping of tandem repeats (TRs). However, standardized tools for downstream analysis of these results are lacking. To facilitate TR analysis applications, we present TRTools, a Python library and a suite of command-line tools for filtering, merging, and quality control of TR genotype files. TRTools utilizes an internal harmonization module making it compatible with outputs from a wide range of TR genotypers.AvailabilityTRTools is freely available at https://github.com/gymreklab/[email protected] informationSupplementary data are available at bioRxiv.


2017 ◽  
Author(s):  
Mohammad Shabbir Hasan ◽  
Xiaowei Wu ◽  
Layne T. Watson ◽  
Zhiyi Li ◽  
Liqing Zhang

AbstractBackgroundIndels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequences. Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and may mislead downstream analysis and interpretations. About 10% of the human indels stored in dbSNP are redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publically available databases. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare indel calling results produced by different tools.ResultsUPS-indel identifies nearly 15% indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% indels as redundant, respectively. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP; 2,118 more in COSMIC coding, and 553 more in COSMIC noncoding indel dataset in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to other state-of-the-art approaches for indel call set comparison demonstrates that UPS-indel is clearly superior to other approaches in finding indels in common among call sets.ConclusionsUPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and the command line version is freely available to download at http://ups-indel.sourceforge.net. The online version of UPS-indel is available at http://bench.cs.vt.edu/ups-indel/.


2019 ◽  
Author(s):  
Lucas Czech ◽  
Pierre Barbera ◽  
Alexandros Stamatakis

SummaryWe present GENESIS, a library for working with phylogenetic data, and GAPPA, an accompanying command line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies, and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested, and field-proven.Availability and ImplementationBoth GENESIS and GAPPA are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/[email protected] and [email protected].


GigaScience ◽  
2021 ◽  
Vol 10 (5) ◽  
Author(s):  
Colin Farrell ◽  
Michael Thompson ◽  
Anela Tosevska ◽  
Adewale Oyetunde ◽  
Matteo Pellegrini

Abstract Background Bisulfite sequencing is commonly used to measure DNA methylation. Processing bisulfite sequencing data is often challenging owing to the computational demands of mapping a low-complexity, asymmetrical library and the lack of a unified processing toolset to produce an analysis-ready methylation matrix from read alignments. To address these shortcomings, we have developed BiSulfite Bolt (BSBolt), a fast and scalable bisulfite sequencing analysis platform. BSBolt performs a pre-alignment sequencing read assessment step to improve efficiency when handling asymmetrical bisulfite sequencing libraries. Findings We evaluated BSBolt against simulated and real bisulfite sequencing libraries. We found that BSBolt provides accurate and fast bisulfite sequencing alignments and methylation calls. We also compared BSBolt to several existing bisulfite alignment tools and found BSBolt outperforms Bismark, BSSeeker2, BISCUIT, and BWA-Meth based on alignment accuracy and methylation calling accuracy. Conclusion BSBolt offers streamlined processing of bisulfite sequencing data through an integrated toolset that offers support for simulation, alignment, methylation calling, and data aggregation. BSBolt is implemented as a Python package and command line utility for flexibility when building informatics pipelines. BSBolt is available at https://github.com/NuttyLogic/BSBolt under an MIT license.


Sign in / Sign up

Export Citation Format

Share Document