PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria

Mapping Intimacies ◽

10.1101/598391 ◽

2019 ◽

Author(s):

Sion C. Bayliss ◽

Harry A. Thorpe ◽

Nicola M. Coyle ◽

Samuel K. Sheppard ◽

Edward J. Feil

Keyword(s):

Allelic Variation ◽

Sequence Similarity ◽

Sequence Divergence ◽

Orthologous Gene ◽

Gene Families ◽

Iterative Refinement ◽

Supplementary Information ◽

Bacterial Populations ◽

Wide Range ◽

Or Gene

Abstract Cataloguing the distribution of genes within natural bacterial populations is essential for understanding evolutionary processes and the genetic basis of adaptation. Here we present a pangenomics toolbox, PIRATE (Pangenome Iterative Refinement And Threshold Evaluation), which identifies and classifies orthologous gene families in bacterial pangenomes over a wide range of sequence similarity thresholds. PIRATE builds upon recent scalable software developments to allow for the rapid interrogation of thousands of isolates. PIRATE clusters genes (or other annotated features) over a wide range of amino-acid or nucleotide identity thresholds and uses the clustering information to rapidly classify paralogous gene families into either putative fission/fusion events or gene duplications. Furthermore, PIRATE orders the pangenome using a directed graph, provides a measure of allelic variation and estimates sequence divergence for each gene family. We demonstrate that PIRATE scales linearly with both number of samples and computation resources, allowing for analysis of large genomic datasets, and compares favorably to other popular tools. PIRATE provides a robust framework for analysing bacterial pangenomes, from largely clonal to panmictic species. Availability PIRATE is implemented in Perl and is freely available under a GNU GPL 3 open source license from https://github.com/SionBayliss/PIRATE. Supplementary Information Supplementary data is available online.


PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria

GigaScience ◽

10.1093/gigascience/giz119 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 18

Author(s):

Sion C Bayliss ◽

Harry A Thorpe ◽

Nicola M Coyle ◽

Samuel K Sheppard ◽

Edward J Feil

Keyword(s):

Allelic Variation ◽

Sequence Similarity ◽

Sequence Divergence ◽

Orthologous Gene ◽

Gene Families ◽

Iterative Refinement ◽

Bacterial Populations ◽

Sequencing Technologies ◽

Wide Range ◽

Similarity Thresholds

Abstract Background Cataloguing the distribution of genes within natural bacterial populations is essential for understanding evolutionary processes and the genetic basis of adaptation. Advances in whole genome sequencing technologies have led to a vast expansion in the number of bacterial genomes deposited in public databases. There is a pressing need for software solutions that can cluster, catalogue and characterise genes, or other features, in increasingly large genomic datasets. Results Here we present a pangenomics toolbox, PIRATE (Pangenome Iterative Refinement and Threshold Evaluation), which identifies and classifies orthologous gene families in bacterial pangenomes over a wide range of sequence similarity thresholds. PIRATE builds upon recent scalable software developments to allow for the rapid interrogation of thousands of isolates. PIRATE clusters genes (or other annotated features) over a wide range of amino acid or nucleotide identity thresholds and uses the clustering information to rapidly identify paralogous gene families and putative fission/fusion events. Furthermore, PIRATE orders the pangenome using a directed graph, provides a measure of allelic variation, and estimates sequence divergence for each gene family. Conclusions We demonstrate that PIRATE scales linearly with both number of samples and computation resources, allowing for analysis of large genomic datasets, and compares favorably to other popular tools. PIRATE provides a robust framework for analysing bacterial pangenomes, from largely clonal to panmictic species.
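
The idea of clustering genes over a range of identity thresholds can be illustrated with a small sketch. This is not PIRATE's implementation: it swaps in a toy single-linkage clustering over a hand-written table of pairwise percent identities, and the function names (`cluster_at_threshold`, `clusters_over_thresholds`), gene labels and identity values are all hypothetical. It only shows how a gene family that holds together at a low threshold can split into several clusters as the threshold rises.

```python
from itertools import combinations


def cluster_at_threshold(genes, identity, threshold):
    """Toy single-linkage clustering: link two genes when identity >= threshold.

    `identity` maps frozenset({gene_a, gene_b}) to percent identity;
    pairs missing from the table are treated as unrelated (0.0).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genes, 2):
        if identity.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), set()).add(g)
    # Sort clusters by their smallest member so the output is deterministic.
    return sorted((frozenset(c) for c in clusters.values()), key=min)


def clusters_over_thresholds(genes, identity, thresholds):
    """Repeat the clustering at every threshold, from lowest to highest."""
    return {t: cluster_at_threshold(genes, identity, t) for t in sorted(thresholds)}


# Hypothetical example data: g1 and g2 are near-identical, g3 is a diverged
# relative of both, and g4 is unrelated to everything.
genes = ["g1", "g2", "g3", "g4"]
identity = {
    frozenset(("g1", "g2")): 98.0,
    frozenset(("g1", "g3")): 72.0,
    frozenset(("g2", "g3")): 70.0,
}

result = clusters_over_thresholds(genes, identity, [50, 70, 90])
for threshold, clusters in result.items():
    print(threshold, [sorted(c) for c in clusters])
```

In this toy data, g1, g2 and g3 form one family at the 50% and 70% thresholds, and g3 separates from it at 90%. The number of higher-threshold clusters inside a low-threshold family gives a rough feel for the kind of per-family variation signal the abstract mentions.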


CAARS: comparative assembly and annotation of RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/bty903 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2199-2207 ◽

Cited By ~ 1

Author(s):

Carine Rey ◽

Philippe Veber ◽

Bastien Boussau ◽

Marie Sémon

Keyword(s):

Gene Family ◽

De Novo ◽

Sequence Similarity ◽

Gene Families ◽

Supplementary Information ◽

Model Organisms ◽

Difficult Case ◽

Rna Seq ◽

Comparative Analyses ◽

Family Reconstruction

Abstract Motivation RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction. Results We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assemblies. Then, CAARS incorporates transcripts into gene families, builds gene alignments and trees and uses phylogenetic information to classify the genes as orthologs and paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as reference, a difficult case for this kind of analysis. We showed that CAARS assemblies are more complete and accurate than those assembled by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity on a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, directly usable for downstream comparative analyses. Availability and implementation CAARS is implemented in Python and OCaml and is freely available at https://github.com/carinerey/caars. Supplementary information Supplementary data are available at Bioinformatics online.


PyMod 3: a complete suite for structural bioinformatics in PyMOL

Bioinformatics ◽

10.1093/bioinformatics/btaa849 ◽

2020 ◽

Author(s):

Giacomo Janson ◽

Alessandro Paiardini

Keyword(s):

Phylogenetic Trees ◽

Model Building ◽

Sequence Similarity ◽

Structural Bioinformatics ◽

Structure Alignment ◽

Supplementary Information ◽

Large Set ◽

Loop Modeling ◽

Multiple Sequence ◽

Wide Range

Abstract Summary The PyMod project is designed to act as a fully integrated interface between the popular molecular graphics viewer PyMOL and some of the most frequently used tools for structural bioinformatics, e.g. BLAST, HMMER, Clustal, MUSCLE, PSIPRED, DOPE and MODELLER. Here we report its latest release, PyMod 3, which has been completely renewed with a graphical interface written in PyQt, to make it compatible with the most recent PyMOL versions, and has been extended with a large set of new functionalities compared to its predecessor, i.e. PyMod 2. Starting from the amino acid sequence of a target protein, users can take advantage of PyMod 3 to carry out all the steps of the homology modeling process (i.e. template searching, target–template sequence alignment, model building and quality assessment). Additionally, the integrated tools in PyMod 3 may also be used alone, in order to extend PyMOL with a wide range of capabilities. Sequence similarity searches, multiple sequence/structure alignment building, phylogenetic trees and evolutionary conservation analyses, domain parsing, single/multiple chains and loop modeling can be performed in the PyMod 3/PyMOL environment. Availability and implementation A cross-platform PyMod 3 installer package for Windows, Linux and Mac OS X, and a complete user guide with tutorials, are available at https://github.com/pymodproject/pymod. Supplementary information Supplementary data are available at Bioinformatics online.


DeepGOPlus: improved protein function prediction from sequence

Bioinformatics ◽

10.1093/bioinformatics/btz595 ◽

2019 ◽

Cited By ~ 17

Author(s):

Maxat Kulmanov ◽

Robert Hoehndorf

Keyword(s):

Protein Function ◽

Drug Targets ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Function Prediction ◽

Supplementary Information ◽

Protein Protein Interaction ◽

Wide Range ◽

Protein Functions ◽

Novel Method

Abstract Motivation Protein function prediction is one of the major tasks of bioinformatics that can help with a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein–protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time. Results We developed a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence-similarity-based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. Availability and implementation http://deepgoplus.bio2vec.net/. Supplementary information Supplementary data are available at Bioinformatics online.
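
For readers unfamiliar with the Fmax measure quoted above, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined in the CAFA challenges: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins, and Fmax is the best resulting F-measure over all thresholds. The function name `fmax`, the GO-term labels and the scores are hypothetical, and the sketch skips details of the official evaluation such as propagating annotations through the ontology.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax over a grid of score thresholds.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []      # only proteins with >= 1 prediction at this threshold
        recall_sum = 0.0     # summed over every benchmark protein
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)

        if not precisions:
            continue  # nothing predicted at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical toy benchmark with two proteins.
truth = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:c"}}
predictions = {
    "P1": {"GO:a": 0.9, "GO:b": 0.4, "GO:x": 0.3},
    "P2": {"GO:c": 0.8, "GO:y": 0.6},
}
print(round(fmax(predictions, truth), 3))
```

Because the threshold is chosen to maximise the F-measure, Fmax summarises the best achievable balance of precision and recall rather than performance at one fixed cutoff.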


Recombination of ecologically and evolutionarily significant loci maintains genetic cohesion in the Pseudomonas syringae species complex

10.1101/227413 ◽

2017 ◽

Cited By ~ 8

Author(s):

Marcus M. Dillon ◽

Shalabh Thakur ◽

Renan N.D. Almeida ◽

David S. Guttman

Keyword(s):

Pseudomonas Syringae ◽

Species Complex ◽

Bacterial Species ◽

Orthologous Gene ◽

Gene Families ◽

Biological Species ◽

Genetic Exchange ◽

Crop Species ◽

Pan Genome ◽

Wide Range

ABSTRACT Pseudomonas syringae is a highly diverse bacterial species complex capable of causing a wide range of serious diseases on numerous agronomically important crop species. Here, we examine the evolutionary relationships of 391 agricultural and environmental strains from the P. syringae species complex using whole-genome sequencing and evolutionary genomic analyses. Our collection includes strains from 11 of the 13 previously described phylogroups isolated from over 90 hosts. We describe the phylogenetic distribution of all orthologous gene families in the P. syringae pan-genome, reconstruct the phylogeny of P. syringae using a core genome alignment and a hierarchical clustering analysis of pan-genome content, predict ecologically and evolutionarily relevant loci, and establish the forces of molecular evolution operating on each gene family. We find that the common ancestor of the species complex likely carried a Rhizobium-like type III secretion system (TTSS) and later acquired the canonical TTSS. The phylogenetic analysis also showed that the species complex is subdivided into primary and secondary phylogroups based on genetic diversity and rates of genetic exchange. The primary phylogroups, which largely consist of agricultural isolates, are no more divergent than a number of other bacterial species, while the secondary phylogroups, which largely consist of environmental isolates, have levels of diversity more in line with multiple distinct species within a genus. An analysis of rates of recombination within and between phylogroups revealed a higher rate of recombination within primary phylogroups than between primary and secondary phylogroups. We also found that “ecologically significant” virulence-associated loci and “evolutionarily significant” loci under positive selection are over-represented among loci that undergo inter-phylogroup genetic exchange.
These results indicate that while inter-phylogroup recombination occurs relatively rarely in the species complex, it is an important force of genetic cohesion, particularly among the strains in the primary phylogroups. This level of genetic cohesion and the shared plant-associated niche argue for considering the primary phylogroups as a true biological species.


A comparative study of odorant binding protein genes: differential expression of the PBP1-GOBP2 gene cluster in Manduca sexta (Lepidoptera) and the organization of OBP genes in Drosophila melanogaster (Diptera)

Journal of Experimental Biology ◽

10.1242/jeb.205.6.719 ◽

2002 ◽

Vol 205 (6) ◽

pp. 719-744 ◽

Cited By ~ 4

Author(s):

Richard G. Vogt ◽

Matthew E. Rogers ◽

Marie-dominique Franco ◽

Ming Sun

Keyword(s):

Gene Family ◽

Gene Cluster ◽

Sequence Similarity ◽

Cell Communication ◽

Gene Families ◽

Cell Types ◽

Amino Acid Sequence Similarity ◽

Olfactory Sensilla ◽

Wide Range ◽

Odorant Binding

SUMMARY Insects discriminate odors using sensory organs called olfactory sensilla, which display a wide range of phenotypes. Sensilla express ensembles of proteins, including odorant binding proteins (OBPs), olfactory receptors (ORs) and odor degrading enzymes (ODEs); odors are thought to be transported to ORs by OBPs and subsequently degraded by ODEs. These proteins belong to multigene families. The unique combinatorial expression of specific members of each of these gene families determines, in part, the phenotype of a sensillum and what odors it can detect. Furthermore, OBPs, ORs and ODEs are expressed in different cell types, suggesting the need for cell–cell communication to coordinate their expression. This report examines the OBP gene family. In Manduca sexta, the genes encoding PBP1Msex and GOBP2Msex are sequenced, shown to be adjacent to one another, and characterized together with OBP gene structures of other Lepidoptera and Drosophila melanogaster. Expression of PBP1Msex, GOBP1Msex and GOBP2Msex is characterized in adult male and female antenna and in larval antenna and maxilla. The genomic organization of 25 D. melanogaster OBPs is characterized with respect to gene locus, gene cluster, amino acid sequence similarity, exon conservation and proximity to OR loci, and their sequences are compared with 14 M. sexta OBPs. Sensilla serve as portals of important behavioral information, and genes supporting sensilla function are presumably under significant evolutionary selective pressures. This study provides a basis for studying the evolution of the OBP gene family, the regulatory mechanisms governing the coordinated expression of OBPs, ORs and ODEs, and the processes that determine specific sensillum phenotypes.


BHap: a novel approach for bacterial haplotype reconstruction

Bioinformatics ◽

10.1093/bioinformatics/btz280 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4624-4631 ◽

Cited By ~ 1

Author(s):

Xin Li ◽

Samaneh Saadat ◽

Haiyan Hu ◽

Xiaoman Li

Keyword(s):

Supplementary Information ◽

Next Generation Sequencing Data ◽

Accurate Estimation ◽

Haplotype Reconstruction ◽

Sequencing Data ◽

Bacterial Populations ◽

Sequencing Errors ◽

Novel Approach ◽

Wide Range ◽

Low Coverage

Abstract Motivation Bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task, because they are usually developed for viral instead of bacterial populations. Results In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next generation sequencing data. In tests on simulated and experimental datasets, we showed that BHap was capable of reconstructing haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap had a low susceptibility to sequencing errors, was capable of reconstructing haplotypes with low coverage and could handle a wide range of mutation rates. Compared with existing approaches, BHap achieved higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes. Availability and implementation The BHap tool is available at http://www.cs.ucf.edu/~xiaoman/BHap/. Supplementary information Supplementary data are available at Bioinformatics online.
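
The precision, recall and F1 figures above can be made concrete with a short sketch. It assumes, purely for illustration, that a reconstructed haplotype counts as correct only when it exactly matches a true haplotype sequence; the abstract does not state the matching criterion BHap's evaluation used, and the function name `precision_recall_f1` and the sequences are hypothetical.

```python
def precision_recall_f1(reconstructed, true):
    """Score a set of reconstructed haplotypes against the true set.

    Assumption for this sketch: a reconstruction is a true positive only
    if it is an exact string match to one of the true haplotypes.
    """
    reconstructed, true = set(reconstructed), set(true)
    true_positives = len(reconstructed & true)

    precision = true_positives / len(reconstructed) if reconstructed else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical example: three true haplotypes, four reconstructed ones,
# two of which are exact matches.
true_haplotypes = {"ACGT", "ACGA", "TCGA"}
reconstructed_haplotypes = {"ACGT", "ACGA", "ACCA", "TCGT"}

precision, recall, f1 = precision_recall_f1(reconstructed_haplotypes, true_haplotypes)
print(precision, recall, f1)
```

Note that an F1 score averaged over many datasets need not equal the F1 computed from the averaged precision and recall, so the three averages reported in the abstract are not expected to satisfy the harmonic-mean formula exactly.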


Microsatellite and Chromosome Evolution of Parthenogenetic Sitobion Aphids in Australia

Genetics ◽

10.1093/genetics/144.2.747 ◽

1996 ◽

Vol 144 (2) ◽

pp. 747-756 ◽

Cited By ~ 4

Author(s):

Paul Sunnucks ◽

Phillip R England ◽

Andrea C Taylor ◽

Dinah F Hales

Keyword(s):

Sexual Reproduction ◽

Chromosome Evolution ◽

Genetic Recombination ◽

Allelic Variation ◽

Repeat Unit ◽

Chromosomal Evolution ◽

Microsatellite Variation ◽

Wide Range ◽

Sexual Recombination ◽

Minimum Numbers

Abstract Single-locus microsatellite variation correlated perfectly with chromosome number in Sitobion miscanthi aphids. The microsatellites were highly heterozygous, with up to 10 alleles per locus in this species. Despite this considerable allelic variation, only seven different S. miscanthi genotypes were discovered in 555 individuals collected from a wide range of locations, hosts and sampling periods. Relatedness between genotypes suggests only two successful colonizations of Australia. There was no evidence for genetic recombination in 555 S. miscanthi so the occurrence of recent sexual reproduction must be near zero. Thus diversification is by mutation and chromosomal rearrangement alone. Since the aphids showed no sexual recombination, microsatellites can mutate without meiosis. Five of seven microsatellite differences were a single repeat unit, and one larger jump is likely. The minimum numbers of changes between karyotypes corresponded roughly one-to-one with microsatellite allele changes, which suggests very rapid chromosomal evolution. A chromosomal fission occurred in a cultured line, and a previously unknown chromosomal race was detected. All 121 diverse S. near fragariae were heterozygous but revealed only one genotype. This species too must have a low rate of sexual reproduction and few colonizations of Australia.


Epidemiological modeling in StochSS Live!

Bioinformatics ◽

10.1093/bioinformatics/btab061 ◽

2021 ◽

Author(s):

Richard Jiang ◽

Bruno Jacob ◽

Matthew Geiger ◽

Sean Matthew ◽

Bryan Rumsey ◽

...

Keyword(s):

Stochastic Model ◽

Epidemiological Model ◽

Supplementary Information ◽

Supplementary Data ◽

Web Based ◽

Epidemiological Modeling ◽

Modeling Simulation ◽

Wide Range ◽

Biochemical Systems

Abstract Summary We present StochSS Live!, a web-based service for modeling, simulation and analysis of a wide range of mathematical, biological and biochemical systems. Using an epidemiological model of COVID-19, we demonstrate the power of StochSS Live! to enable researchers to quickly develop a deterministic or a discrete stochastic model, infer its parameters and analyze the results. Availability and implementation StochSS Live! is freely available at https://live.stochss.org/ Supplementary information Supplementary data are available at Bioinformatics online.


BloodGen3Module: Blood transcriptional module repertoire analysis and visualization using R

Bioinformatics ◽

10.1093/bioinformatics/btab121 ◽

2021 ◽

Author(s):

Darawan Rinchai ◽

Jessica Roelands ◽

Mohammed Toufiq ◽

Wouter Hendrickx ◽

Matthew C Altman ◽

...

Keyword(s):

Transcript Abundance ◽

R Package ◽

Supplementary Information ◽

Illustrative Case ◽

Bioinformatic Tools ◽

Transcriptional Module ◽

Wide Range ◽

Downstream Analysis ◽

Computing Module ◽

Parallel Workflow

Abstract Motivation We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires. Results We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module level and the results to be displayed as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states. Availability The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module. Supplementary information Supplementary data are available at Bioinformatics online.
