GToTree: a user-friendly workflow for phylogenomics

10.1101/512491 ◽

2019 ◽

Cited By ~ 8

Author(s):

Michael D. Lee

Keyword(s):

Markov Models ◽

Link Type ◽

File Formats ◽

Evolutionary Inference ◽

Computational Work ◽

Command Line Tool ◽

Genome Level ◽

User Friendly ◽

Reference Genomes

Abstract Summary: Genome-level evolutionary inference (i.e., phylogenomics) is becoming an increasingly essential step in many biologists’ work - such as in the characterization of newly recovered genomes, or in leveraging available reference genomes to guide evolutionary questions. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required - such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering, stitching different tools together, etc. - can be prohibitive. Here I introduce GToTree, a command-line tool that can take any combination of fasta files, GenBank files, and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on the specified single-copy gene (SCG) set. While GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ~12,000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees. Availability: GToTree is open-source and freely available for download from github.com/AstrobioMike/GToTree. Documentation: github.com/AstrobioMike/GToTree/wiki. Implementation: GToTree is implemented primarily in bash, with helper scripts written in python.

GToTree: a user-friendly workflow for phylogenomics

Bioinformatics ◽

10.1093/bioinformatics/btz188 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4162-4164 ◽

Cited By ~ 38

Author(s):

Michael D Lee

Keyword(s):

Markov Models ◽

Single Copy ◽

Supplementary Information ◽

File Formats ◽

Evolutionary Inference ◽

Copy Gene ◽

Computational Work ◽

Command Line Tool ◽

Genome Level ◽

User Friendly

Abstract Summary Genome-level evolutionary inference (i.e. phylogenomics) is becoming an increasingly essential step in many biologists’ work. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required—such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering, stitching different tools together etc.—can be prohibitive. Here I introduce GToTree, a command-line tool that can take any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set. Although GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ∼12 000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees. Availability and implementation GToTree is open-source and freely available for download from: github.com/AstrobioMike/GToTree. It is implemented primarily in bash with helper scripts written in python. Supplementary information Supplementary data are available at Bioinformatics online.
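The abstract mentions estimates of genome completeness and redundancy derived from a single-copy gene set. A minimal sketch of one common way to compute such estimates follows. It assumes completeness is the share of target SCGs with at least one hit and redundancy is the share with more than one hit; the function name and the input shape are hypothetical, and this is not necessarily the exact calculation GToTree performs.

```python
from collections import Counter


def completeness_and_redundancy(hit_gene_ids, scg_set):
    """Return (percent completeness, percent redundancy) for one genome.

    hit_gene_ids: iterable of SCG identifiers, one entry per hit in the genome
                  (hypothetical input; a real workflow would parse HMM search output).
    scg_set:      collection of all SCG identifiers in the target set.
    """
    targets = set(scg_set)
    if not targets:
        raise ValueError("scg_set must not be empty")

    # Count hits per target gene; hits to genes outside the set are ignored.
    counts = Counter(g for g in hit_gene_ids if g in targets)

    found = sum(1 for g in targets if counts[g] >= 1)      # present at least once
    duplicated = sum(1 for g in targets if counts[g] > 1)  # present more than once

    n = len(targets)
    return 100.0 * found / n, 100.0 * duplicated / n


if __name__ == "__main__":
    comp, redund = completeness_and_redundancy(
        ["rpoB", "rpoB", "gyrA", "unrelated"],
        ["rpoB", "gyrA", "recA", "dnaK"],
    )
    print(f"completeness={comp:.1f}% redundancy={redund:.1f}%")
```

Under these assumptions, a genome with low completeness or high redundancy would be a candidate for filtering before tree building.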

WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes

10.1101/2021.04.29.441969 ◽

2021 ◽

Author(s):

Pengchuan Sun ◽

Beibei Jiao ◽

Yongzhi Yang ◽

Lanxing Shan ◽

Ting Li ◽

...

Keyword(s):

Detection Algorithm ◽

Integrated Analysis ◽

Whole Genome ◽

Whole Genome Duplications ◽

Genome Duplications ◽

Command Line Tool ◽

Genome Analyses ◽

Life On Earth ◽

User Friendly

Evidence of whole-genome duplications (WGDs) and subsequent karyotype changes has been detected in most major lineages of life on Earth. To clarify the complex, multi-layered patterns of gene collinearity that result, genome analyses need convenient and accurate toolkits. To meet this need, we introduce here WGDI (Whole-Genome Duplication Integrated analysis), a Python-based command-line tool that facilitates comprehensive analysis of recursive polyploidizations and cross-species genome alignments. WGDI supports three main workflows (polyploid inference, hierarchical inference of genomic homology, and ancestral chromosomal karyotyping) that can improve detection of WGD and characterization of related events. It incorporates a more sensitive and accurate collinearity detection algorithm than previous software, and can accelerate WGD-related karyotype research. As a freely available toolkit at GitHub (https://github.com/SunPengChuan/wgdi), WGDI outperforms similar tools in terms of efficiency, flexibility and scalability. In an illustrative example of its application, WGDI convincingly clarified karyotype evolution in Aquilegia coerulea and Vitis vinifera following WGDs and rejected the hypothesis that Aquilegia contributed as a parental lineage to the allopolyploid origin of core dicots.

Pydigree: a python library for manipulation and forward-time simulation of genetic datasets

10.1101/213413 ◽

2017 ◽

Author(s):

James E. Hicks

Keyword(s):

Population Genetics ◽

Data Structures ◽

Genetic Epidemiology ◽

Genetic Data ◽

Link Type ◽

File Formats ◽

Time Simulation ◽

Cross Platform ◽

User Friendly ◽

Python Package

Abstract The development of software for working with data from population genetics or genetic epidemiology often requires substantial time spent implementing common procedures. Pydigree is a cross-platform Python 3 library that contains efficient, user-friendly implementations for many of these common functions, and support for input from common file formats. Developers can combine the functions and data structures to rapidly implement programs handling genetic data. Pydigree presents a useful environment for development of applications for genetic data or rapid prototyping before reimplementation in a higher-performance language. Pydigree is freely available under an open source license. Stable sources can be found in the Python Package Index at https://pypi.python.org/pypi/pydigree/, and development sources can be downloaded at https://github.com/jameshicks/pydigree/

SeqAcademy: an educational pipeline for RNA-Seq and ChIP-Seq analysis

F1000Research ◽

10.12688/f1000research.14880.4 ◽

2020 ◽

Vol 7 ◽

pp. 628

Author(s):

Syed Hussain Ather ◽

Olaitan Igbagbo Awe ◽

Thomas J. Butler ◽

Tamiru Denka ◽

Stephen Andrew Semick ◽

...

Keyword(s):

Chromatin Immunoprecipitation ◽

Gene Transcript ◽

Rna Seq ◽

Link Type ◽

Educational Pipeline ◽

Biological Insight ◽

Chromatin Immunoprecipitation Sequencing ◽

Critical Components ◽

User Friendly

Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (source code: https://github.com/NCBI-Hackathons/seqacademy, webpage: http://www.seqacademy.org/). This user-friendly pipeline, fully written in markdown language, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.

CancerVar: a web server for improved evidence-based clinical interpretation of cancer somatic mutations and copy number abnormalities

10.1101/2020.10.06.323162 ◽

2020 ◽

Author(s):

Quan Li ◽

Zilin Ren ◽

Yunyun Zhou ◽

Kai Wang

Keyword(s):

Clinical Significance ◽

Copy Number ◽

Somatic Mutations ◽

Relevant Information ◽

Web Interface ◽

Clinical Interpretation ◽

Link Type ◽

Command Line Tool ◽

Hotspot Mutations ◽

User Friendly

ABSTRACT Several knowledgebases, such as CIViC, CGI and OncoKB, have been manually curated to support clinical interpretations of somatic mutations and copy number abnormalities (CNAs) in cancer. However, these resources focus on known hotspot mutations, and discrepancies or even conflicting interpretations have been observed between these knowledgebases. To standardize clinical interpretation, AMP/ASCO/CAP/ACMG/CGC jointly published consensus guidelines for the interpretations of somatic mutations and CNAs in 2017 and 2019, respectively. Based on these guidelines, we developed a standardized, semi-automated interpretation tool called CancerVar (Cancer Variants interpretation), with a user-friendly web interface to assess the clinical impacts of somatic variants. Using a semi-supervised method, CancerVar interprets the clinical impacts of cancer variants in four tiers: strong clinical significance, potential clinical significance, unknown clinical significance, and benign/likely benign. CancerVar also allows users to specify criteria or adjust scoring weights as a customized interpretation strategy, and allows phenotype-driven scoring for specific types of cancer. Importantly, CancerVar generates automated texts to summarize clinical evidence on somatic variants, which greatly reduces the manual workload of writing interpretations that include relevant information from harmonized knowledgebases. CancerVar can be accessed at http://cancervar.wglab.org and it is open to all users without login requirements. The command line tool is also available at https://github.com/WGLab/CancerVar.

MegaGO: a fast yet powerful approach to assess functional similarity across meta-omics data sets

10.1101/2020.11.16.384834 ◽

2020 ◽

Author(s):

Pieter Verschaffelt ◽

Tim Van Den Bossche ◽

Wassim Gabriel ◽

Michał Burdukiewicz ◽

Alessio Soggiu ◽

...

Keyword(s):

Web Application ◽

Functional Similarity ◽

Data Sets ◽

Link Type ◽

Large Sets ◽

Powerful Approach ◽

Command Line Tool ◽

Complete Set ◽

User Friendly ◽

Go Terms

Abstract The study of microbiomes has gained in importance over the past few years, and has led to the fields of metagenomics, metatranscriptomics and metaproteomics. While initially focused on the study of biodiversity within these communities, the emphasis has increasingly shifted to the study of (changes in) the complete set of functions available in these communities. A key tool to study this functional complement of a microbiome is Gene Ontology (GO) term analysis. However, comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching. To solve this problem, we here present MegaGO, a user-friendly tool that relies on semantic similarity between GO terms to compute functional similarity between two data sets. MegaGO is highly performant: each set can contain thousands of GO terms, and results are calculated in a matter of seconds. MegaGO is available as a web application at https://megago.ugent.be and installable via pip as a standalone command line tool and reusable software library. All code is open source under the MIT license, and is available at https://github.com/MEGA-GO/.

SeqAcademy: an educational pipeline for RNA-Seq and ChIP-Seq analysis

F1000Research ◽

10.12688/f1000research.14880.2 ◽

2018 ◽

Vol 7 ◽

pp. 628

Author(s):

Syed Hussain Ather ◽

Olaitan Igbagbo Awe ◽

Thomas J. Butler ◽

Tamiru Denka ◽

Stephen Andrew Semick ◽

...

Keyword(s):

Chromatin Immunoprecipitation ◽

Gene Transcript ◽

Rna Seq ◽

Link Type ◽

Educational Pipeline ◽

Biological Insight ◽

Chromatin Immunoprecipitation Sequencing ◽

Critical Components ◽

User Friendly

Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (source code: https://github.com/NCBI-Hackathons/seqacademy, webpage: http://www.seqacademy.org/). This user-friendly pipeline, fully written in Jupyter Notebook, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.

GRAFIMO: Variant and haplotype aware motif scanning on pangenome graphs

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009444 ◽

2021 ◽

Vol 17 (9) ◽

pp. e1009444

Author(s):

Manuel Tognon ◽

Vincenzo Bonnici ◽

Erik Garrison ◽

Rosalba Giugno ◽

Luca Pinello

Keyword(s):

Dna Sequences ◽

Binding Sites ◽

Specific Binding ◽

Genomic Variation ◽

Link Type ◽

Scanning Procedure ◽

Expression Of Genes ◽

Potential Binding ◽

Command Line Tool ◽

Reference Genomes

Transcription factors (TFs) are proteins that promote or reduce the expression of genes by binding short genomic DNA sequences known as transcription factor binding sites (TFBS). While several tools have been developed to scan for potential occurrences of TFBS in linear DNA sequences or reference genomes, no tool exists to find them in pangenome variation graphs (VGs). VGs are sequence-labelled graphs that can efficiently encode collections of genomes and their variants in a single, compact data structure. Because VGs can losslessly compress large pangenomes, TFBS scanning in VGs can efficiently capture how genomic variation affects the potential binding landscape of TFs in a population of individuals. Here we present GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs. GRAFIMO extends the standard PWM scanning procedure by considering variations and alternative haplotypes encoded in a VG. Using GRAFIMO on a VG based on individuals from the 1000 Genomes project we recover several potential binding sites that are enhanced, weakened or missed when scanning only the reference genome, and which could constitute individual-specific binding events. GRAFIMO is available as an open-source tool, under the MIT license, at https://github.com/pinellolab/GRAFIMO and https://github.com/InfOmics/GRAFIMO.
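For readers unfamiliar with the "standard PWM scanning procedure" that the abstract says GRAFIMO extends, the following is a minimal sketch of that baseline on a plain linear sequence. It does not traverse a variation graph and is not GRAFIMO's implementation. The uniform background frequency, the pseudocount, forward-strand-only scanning, and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts, one per motif position, mapping each base to a count.
    A uniform background and a fixed pseudocount are assumed here.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + len(BASES) * pseudocount
        matrix.append(
            {b: math.log2(((column[b] + pseudocount) / total) / background) for b in BASES}
        )
    return matrix


def scan(sequence, matrix):
    """Score every motif-width window on the forward strand.

    Returns a list of (start, window, score); windows containing
    characters other than A, C, G, T are skipped.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    # Toy count matrix for a TATA-like motif (made-up numbers).
    toy_counts = [
        {"A": 0, "C": 0, "G": 0, "T": 8},
        {"A": 8, "C": 0, "G": 0, "T": 0},
        {"A": 0, "C": 0, "G": 0, "T": 8},
        {"A": 8, "C": 0, "G": 0, "T": 0},
    ]
    pwm = log_odds_matrix(toy_counts)
    best = max(scan("GGTATAGG", pwm), key=lambda hit: hit[2])
    print(best)
```

In a graph setting, the same window scoring would be applied to sequences spelled out by paths through the graph rather than to a single reference string, which is how variant-dependent gains or losses of a binding site can surface.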

SeqAcademy: an educational pipeline for RNA-Seq and ChIP-Seq analysis

F1000Research ◽

10.12688/f1000research.14880.1 ◽

2018 ◽

Vol 7 ◽

pp. 628

Author(s):

Syed Hussain Ather ◽

Olaitan Igbagbo Awe ◽

Thomas J. Butler ◽

Tamiru Denka ◽

Stephen Andrew Semick ◽

...

Keyword(s):

Chromatin Immunoprecipitation ◽

Gene Transcript ◽

Rna Seq ◽

Link Type ◽

Educational Pipeline ◽

Biological Insight ◽

Chromatin Immunoprecipitation Sequencing ◽

Critical Components ◽

User Friendly

Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (source code: https://github.com/NCBI-Hackathons/seqacademy, webpage: http://www.seqacademy.org/). This user-friendly pipeline, fully written in Jupyter Notebook, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.

metabolisHMM: Phylogenomic analysis for exploration of microbial phylogenies and metabolic pathways

10.1101/2019.12.20.884627 ◽

2019 ◽

Cited By ~ 1

Author(s):

E.A. McDaniel ◽

K. Anantharaman ◽

K.D. McMahon

Keyword(s):

High Throughput Sequencing ◽

Markov Models ◽

Marker Gene ◽

Phylogenomic Analysis ◽

Metagenomic Sequencing ◽

Metabolic Characteristics ◽

Link Type ◽

Sequencing Technologies ◽

Single Marker ◽

User Friendly

Abstract Summary: Advances in high-throughput sequencing technologies and bioinformatic pipelines have exponentially increased the amount of data that can be obtained from uncultivated microbial lineages inhabiting diverse ecosystems. Various annotation tools and databases currently exist for predicting the functional potential of sequenced genomes or microbial communities based upon sequence identity. However, intuitive, reproducible, and user-friendly tools for further exploring and visualizing functional guilds of microbial community metagenomic sequencing datasets remain lacking. Here, we present metabolisHMM, a series of workflows for visualizing the distribution of curated and user-provided Hidden Markov Models (HMMs) to understand metabolic characteristics and evolutionary histories of microbial lineages. metabolisHMM performs functional annotations with a set of curated or user-defined HMMs to 1) construct ribosomal protein and single marker gene phylogenies, 2) summarize the presence/absence of metabolic pathway markers, and 3) create heatmap visualizations of presence/absence summaries. Availability and Implementation: metabolisHMM is freely available on GitHub at https://github.com/elizabethmcd/metabolisHMM and on PyPI at https://pypi.org/project/metabolisHMM/ under the GNU General Public License v3.0.
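The presence/absence summary described in step 2 can be pictured as a genome-by-marker table of 0s and 1s. The sketch below builds such a table from a list of (genome, marker) hits. The function name, input shape, and example marker names are hypothetical, and this is not metabolisHMM's own code or output format.

```python
def presence_absence(hits, genomes, markers):
    """Build a genome -> row mapping of 0/1 flags, one flag per marker.

    hits:    iterable of (genome, marker) pairs, e.g. parsed from HMM search results
             (hypothetical input shape).
    genomes: ordered list of genome names (table rows).
    markers: ordered list of marker names (table columns).
    """
    found = set(hits)  # repeated hits to the same marker collapse to a single presence
    return {g: [1 if (g, m) in found else 0 for m in markers] for g in genomes}


if __name__ == "__main__":
    table = presence_absence(
        hits=[("genomeA", "nifH"), ("genomeB", "dsrA"), ("genomeB", "dsrA")],
        genomes=["genomeA", "genomeB"],
        markers=["nifH", "dsrA"],
    )
    for genome, row in table.items():
        print(genome, row)
```

A table in this shape can be passed directly to a plotting library to draw the kind of heatmap mentioned in step 3.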
