GToTree: a user-friendly workflow for phylogenomics

2019 ◽  
Author(s):  
Michael D. Lee

Abstract. Summary: Genome-level evolutionary inference (i.e., phylogenomics) is becoming an increasingly essential step in many biologists’ work - such as in the characterization of newly recovered genomes, or in leveraging available reference genomes to guide evolutionary questions. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required - such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering, stitching different tools together, etc. - can be prohibitive. Here I introduce GToTree, a command-line tool that can take any combination of fasta files, GenBank files, and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on the specified single-copy gene (SCG) set. While GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ~12,000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees. Availability: GToTree is open-source and freely available for download from: github.com/AstrobioMike/GToTree. Documentation: github.com/AstrobioMike/GToTree/wiki. Implementation: GToTree is implemented primarily in bash, with helper scripts written in python.

2019 ◽  
Vol 35 (20) ◽  
pp. 4162-4164 ◽  
Author(s):  
Michael D Lee

Abstract Summary Genome-level evolutionary inference (i.e. phylogenomics) is becoming an increasingly essential step in many biologists’ work. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required—such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering, stitching different tools together etc.—can be prohibitive. Here I introduce GToTree, a command-line tool that can take any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set. Although GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ∼12 000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees. Availability and implementation GToTree is open-source and freely available for download from: github.com/AstrobioMike/GToTree. It is implemented primarily in bash with helper scripts written in python. Supplementary information Supplementary data are available at Bioinformatics online.
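
As a rough illustration of the completeness and redundancy estimates mentioned in the abstract, the sketch below derives both values from a list of single-copy gene hits in one genome. The function name, the gene names and the formulas themselves (share of target genes found at least once, and share found more than once) are assumptions made for illustration; they are not taken from the GToTree source code.

```python
from collections import Counter


def scg_summary(hits, scg_set):
    """Return (completeness %, redundancy %) for one genome.

    hits    : iterable of SCG names, one entry per hit found in the genome
    scg_set : collection of target SCG names (hypothetical gene set)
    """
    targets = set(scg_set)
    if not targets:
        raise ValueError("scg_set must not be empty")

    # Count hits per target gene; hits outside the target set are ignored.
    counts = Counter(h for h in hits if h in targets)

    # Assumed definitions: a gene counts toward completeness if seen at
    # least once, and toward redundancy if seen more than once.
    found = sum(1 for gene in targets if counts[gene] >= 1)
    multi = sum(1 for gene in targets if counts[gene] > 1)

    completeness = 100.0 * found / len(targets)
    redundancy = 100.0 * multi / len(targets)
    return completeness, redundancy


# Hypothetical genome: four target genes, one missing, one found twice.
example = scg_summary(
    ["rpsB", "rpsB", "rplA", "recA"],
    ["rpsB", "rplA", "recA", "gyrB"],
)
```

Under these assumed definitions, a genome missing many target genes would score low on completeness, while a genome with many duplicated target genes would score high on redundancy.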


2021 ◽  
Author(s):  
Pengchuan Sun ◽  
Beibei Jiao ◽  
Yongzhi Yang ◽  
Lanxing Shan ◽  
Ting Li ◽  
...  

Evidence of whole-genome duplications (WGDs) and subsequent karyotype changes has been detected in most major lineages of life on Earth. To clarify the complex, multi-layered patterns of gene collinearity that result in genome analyses, there is a need for convenient and accurate toolkits. To meet this need, we introduce here WGDI (Whole-Genome Duplication Integrated analysis), a Python-based command-line tool that facilitates comprehensive analysis of recursive polyploidizations and cross-species genome alignments. WGDI supports three main workflows (polyploid inference, hierarchical inference of genomic homology, and ancestral chromosomal karyotyping) that can improve detection of WGD and characterization of related events. It incorporates a more sensitive and accurate collinearity detection algorithm than previous software, and can accelerate WGD-related karyotype research. As a freely available toolkit at GitHub (https://github.com/SunPengChuan/wgdi), WGDI outperforms similar tools in terms of efficiency, flexibility and scalability. In an illustrative example of its application, WGDI convincingly clarified karyotype evolution in Aquilegia coerulea and Vitis vinifera following WGDs and rejected the hypothesis that Aquilegia contributed as a parental lineage to the allopolyploid origin of core dicots.


2017 ◽  
Author(s):  
James E. Hicks

Abstract: The development of software for working with data from population genetics or genetic epidemiology often requires substantial time spent implementing common procedures. Pydigree is a cross-platform Python 3 library that contains efficient, user-friendly implementations for many of these common functions, and support for input from common file formats. Developers can combine the functions and data structures to rapidly implement programs handling genetic data. Pydigree presents a useful environment for development of applications for genetic data or rapid prototyping before reimplementation in a higher-performance language. Pydigree is freely available under an open source license. Stable sources can be found in the Python Package Index at https://pypi.python.org/pypi/pydigree/, and development sources can be downloaded at https://github.com/jameshicks/pydigree/


F1000Research ◽  
2020 ◽  
Vol 7 ◽  
pp. 628
Author(s):  
Syed Hussain Ather ◽  
Olaitan Igbagbo Awe ◽  
Thomas J. Butler ◽  
Tamiru Denka ◽  
Stephen Andrew Semick ◽  
...  

Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (source code: https://github.com/NCBI-Hackathons/seqacademy, webpage: http://www.seqacademy.org/). This user-friendly pipeline, fully written in markdown language, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.


2020 ◽  
Author(s):  
Quan Li ◽  
Zilin Ren ◽  
Yunyun Zhou ◽  
Kai Wang

Abstract: Several knowledgebases, such as CIViC, CGI and OncoKB, have been manually curated to support clinical interpretations of somatic mutations and copy number abnormalities (CNAs) in cancer. However, these resources focus on known hotspot mutations, and discrepancies or even conflicting interpretations have been observed between these knowledgebases. To standardize clinical interpretation, AMP/ASCO/CAP/ACMG/CGC jointly published consensus guidelines for the interpretations of somatic mutations and CNAs in 2017 and 2019, respectively. Based on these guidelines, we developed a standardized, semi-automated interpretation tool called CancerVar (Cancer Variants interpretation), with a user-friendly web interface to assess the clinical impacts of somatic variants. Using a semi-supervised method, CancerVar interprets the clinical impacts of cancer variants as four tiers: strong clinical significance, potential clinical significance, unknown clinical significance, and benign/likely benign. CancerVar also allows users to specify criteria or adjust scoring weights as a customized interpretation strategy, and allows phenotype-driven scoring for specific types of cancer. Importantly, CancerVar generates automated texts to summarize clinical evidence on somatic variants, which greatly reduces the manual workload of writing interpretations that include relevant information from harmonized knowledgebases. CancerVar can be accessed at http://cancervar.wglab.org and it is open to all users without login requirements. The command line tool is also available at https://github.com/WGLab/CancerVar.


2020 ◽  
Author(s):  
Pieter Verschaffelt ◽  
Tim Van Den Bossche ◽  
Wassim Gabriel ◽  
Michał Burdukiewicz ◽  
Alessio Soggiu ◽  
...  

Abstract: The study of microbiomes has gained in importance over the past few years, and has led to the fields of metagenomics, metatranscriptomics and metaproteomics. While initially focused on the study of biodiversity within these communities, the emphasis has increasingly shifted to the study of (changes in) the complete set of functions available in these communities. A key tool to study this functional complement of a microbiome is Gene Ontology (GO) term analysis. However, comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching. To solve this problem, we here present MegaGO, a user-friendly tool that relies on semantic similarity between GO terms to compute functional similarity between two data sets. MegaGO is highly performant: each set can contain thousands of GO terms, and results are calculated in a matter of seconds. MegaGO is available as a web application at https://megago.ugent.be and installable via pip as a standalone command line tool and reusable software library. All code is open source under the MIT license, and is available at https://github.com/MEGA-GO/.


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 628
Author(s):  
Syed Hussain Ather ◽  
Olaitan Igbagbo Awe ◽  
Thomas J. Butler ◽  
Tamiru Denka ◽  
Stephen Andrew Semick ◽  
...  

Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (source code: https://github.com/NCBI-Hackathons/seqacademy, webpage: http://www.seqacademy.org/). This user-friendly pipeline, fully written in Jupyter Notebook, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.


2021 ◽  
Vol 17 (9) ◽  
pp. e1009444
Author(s):  
Manuel Tognon ◽  
Vincenzo Bonnici ◽  
Erik Garrison ◽  
Rosalba Giugno ◽  
Luca Pinello

Transcription factors (TFs) are proteins that promote or reduce the expression of genes by binding short genomic DNA sequences known as transcription factor binding sites (TFBS). While several tools have been developed to scan for potential occurrences of TFBS in linear DNA sequences or reference genomes, no tool exists to find them in pangenome variation graphs (VGs). VGs are sequence-labelled graphs that can efficiently encode collections of genomes and their variants in a single, compact data structure. Because VGs can losslessly compress large pangenomes, TFBS scanning in VGs can efficiently capture how genomic variation affects the potential binding landscape of TFs in a population of individuals. Here we present GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs. GRAFIMO extends the standard PWM scanning procedure by considering variations and alternative haplotypes encoded in a VG. Using GRAFIMO on a VG based on individuals from the 1000 Genomes project we recover several potential binding sites that are enhanced, weakened or missed when scanning only the reference genome, and which could constitute individual-specific binding events. GRAFIMO is available as an open-source tool, under the MIT license, at https://github.com/pinellolab/GRAFIMO and https://github.com/InfOmics/GRAFIMO.
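
For readers unfamiliar with the "standard PWM scanning procedure" that the abstract says GRAFIMO extends, the following is a minimal sketch of that baseline on a plain linear sequence. It is not GRAFIMO's implementation and does not touch variation graphs; the function names, the uniform background of 0.25, the pseudocount and the toy motif are all assumptions chosen for illustration, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Convert a position frequency matrix into a log-odds PWM.

    pfm is a list of dicts (one per motif position) mapping base -> count.
    A uniform background and a simple pseudocount are assumed.
    """
    pwm = []
    for column in pfm:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Slide the PWM along the forward strand of a linear sequence.

    Returns (start, kmer, score) for every window scoring >= threshold.
    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        kmer = seq[start:start + width]
        if any(b not in BASES for b in kmer):
            continue
        score = sum(pwm[i][b] for i, b in enumerate(kmer))
        if score >= threshold:
            hits.append((start, kmer, score))
    return hits


# Toy motif "TAT" built from made-up counts (eight observations per position).
toy_pwm = pfm_to_pwm([{"T": 8}, {"A": 8}, {"T": 8}])
toy_hits = scan("GGTATCC", toy_pwm, threshold=4.0)
```

A graph-aware scanner would have to score every k-mer spelled by paths through the graph rather than every window of one string, which is the part this sketch leaves out.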


Author(s):  
E.A. McDaniel ◽  
K. Anantharaman ◽  
K.D. McMahon

Abstract. Summary: Advances in high-throughput sequencing technologies and bioinformatic pipelines have exponentially increased the amount of data that can be obtained from uncultivated microbial lineages inhabiting diverse ecosystems. Various annotation tools and databases currently exist for predicting the functional potential of sequenced genomes or microbial communities based upon sequence identity. However, intuitive, reproducible, and user-friendly tools for further exploring and visualizing functional guilds of microbial community metagenomic sequencing datasets remain lacking. Here, we present metabolisHMM, a series of workflows for visualizing the distribution of curated and user-provided Hidden Markov Models (HMMs) to understand metabolic characteristics and evolutionary histories of microbial lineages. metabolisHMM performs functional annotations with a set of curated or user-defined HMMs to 1) construct ribosomal protein and single marker gene phylogenies, 2) summarize the presence/absence of metabolic pathway markers, and 3) create heatmap visualizations of presence/absence summaries. Availability and Implementation: metabolisHMM is freely available on GitHub at https://github.com/elizabethmcd/metabolisHMM and on PyPI at https://pypi.org/project/metabolisHMM/ under the GNU General Public License v3.0.

