Piggy: A Rapid, Large-Scale Pan-Genome Analysis Tool for Intergenic Regions in Bacteria

Mapping Intimacies ◽

10.1101/179515 ◽

2017 ◽

Cited By ~ 3

Author(s):

Harry A. Thorpe ◽

Sion C. Bayliss ◽

Samuel K. Sheppard ◽

Edward J. Feil

Keyword(s):

Large Scale ◽

Reference Database ◽

Analysis Tool ◽

Protein Coding ◽

Coding Sequences ◽

Large Genome ◽

Pan Genome ◽

Overwhelming Evidence ◽

Intergenic Regions ◽

Genome Analyses

Abstract: Despite overwhelming evidence that variation in intergenic regions (IGRs) in bacteria impacts on phenotypes, most current approaches for analysing pan-genomes focus exclusively on protein-coding sequences. To address this we present Piggy, a novel pipeline that emulates Roary except that it is based only on IGRs. We demonstrate the use of Piggy for pan-genome analyses of Staphylococcus aureus and Escherichia coli using large genome datasets. For S. aureus, we show that highly divergent (“switched”) IGRs are associated with differences in gene expression, and we establish a multi-locus reference database of IGR alleles (igMLST; implemented in BIGSdb). Piggy is available at https://github.com/harry-thorpe/piggy.

Piggy: a rapid, large-scale pan-genome analysis tool for intergenic regions in bacteria

GigaScience ◽

10.1093/gigascience/giy015 ◽

2018 ◽

Vol 7 (4) ◽

Cited By ~ 21

Author(s):

Harry A Thorpe ◽

Sion C Bayliss ◽

Samuel K Sheppard ◽

Edward J Feil

Keyword(s):

Genome Analysis ◽

Large Scale ◽

Analysis Tool ◽

Pan Genome ◽

Intergenic Regions

SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier

GigaScience ◽

10.1093/gigascience/giz118 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 7

Author(s):

Xiao Hu ◽

Iddo Friedberg

Keyword(s):

Protein Function ◽

Large Scale ◽

Homology Search ◽

Comparative Genomic ◽

Data Sets ◽

Analysis Tool ◽

Memory Usage ◽

Spaced Seeds ◽

Speed Up ◽

Genome Analyses

Abstract Background Gene homology type classification is required for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. Consequently, a large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic data sets, these tools require high memory and CPU usage, typically available only in computational clusters. Findings Here we present a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. SwiftOrtho uses long k-mers to speed up homology search, while using a reduced amino acid alphabet and spaced seeds to compensate for the loss of sensitivity due to long k-mers. In addition, it uses an affinity propagation algorithm to reduce the memory usage when clustering large-scale orthology relationships into orthologous groups. In our tests, SwiftOrtho was the only tool that completed orthology analysis of proteins from 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology data sets, we also show that SwiftOrtho has a high accuracy. Conclusions SwiftOrtho enables the accurate comparative genomic analyses of thousands of genomes using low-memory computers. SwiftOrtho is available at https://github.com/Rinoahu/SwiftOrtho
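
The seeding idea described in this abstract (long k-mers, made tolerant through a reduced amino acid alphabet and spaced seeds) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the residue grouping, the seed pattern `1101011`, and the function names are hypothetical and are not taken from SwiftOrtho's source code.

```python
# Hypothetical reduced alphabet: residues grouped by rough physicochemical
# similarity. This grouping is an illustrative assumption, not SwiftOrtho's.
GROUPS = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(protein: str) -> str:
    """Map each residue to its group representative; unknown residues become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in protein.upper())


def spaced_seeds(sequence: str, pattern: str = "1101011") -> list[tuple[int, str]]:
    """Return (offset, seed) pairs, keeping only positions where pattern is '1'.

    Positions marked '0' are "don't care" positions, so a mismatch there
    does not prevent two windows from producing the same seed.
    """
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    return [
        (start, "".join(sequence[start + i] for i in keep))
        for start in range(len(sequence) - span + 1)
    ]


def shared_seeds(a: str, b: str, pattern: str = "1101011") -> set[str]:
    """Seeds found in both proteins after alphabet reduction."""
    seeds_a = {seed for _, seed in spaced_seeds(reduce_sequence(a), pattern)}
    seeds_b = {seed for _, seed in spaced_seeds(reduce_sequence(b), pattern)}
    return seeds_a & seeds_b


if __name__ == "__main__":
    # The two toy peptides differ at every position, yet every substitution
    # stays inside one residue group, so they still share a seed.
    print(shared_seeds("MKVLAAG", "LRIMSTA"))
```

In this toy case an exact k-mer lookup on the raw sequences would find nothing in common, while the reduced, spaced representation still yields a shared seed. That is the kind of sensitivity recovery the abstract attributes to the two techniques.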

SwiftOrtho: a Fast, Memory-Efficient, Multiple Genome Orthology Classifier

10.1101/543223 ◽

2019 ◽

Author(s):

Xiao Hu ◽

Iddo Friedberg

Keyword(s):

Protein Function ◽

Large Scale ◽

Comparative Genomic ◽

Analysis Tool ◽

Bacterial Genomes ◽

Function Annotation ◽

Large Scale Data ◽

Protein Function Annotation ◽

Genome Analyses ◽

Memory Efficient

Abstract Introduction: Gene homology type classification is a requisite for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. A large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic datasets, these tools require high memory and CPU usage, typically available only in costly computational clusters. To address this problem, we developed a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. Results: In our tests, SwiftOrtho is the only tool that completed orthology analysis of 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology datasets, we also show that SwiftOrtho has a high accuracy. SwiftOrtho enables the accurate comparative genomic analyses of thousands of genomes using low-memory computers. Availability: https://github.com/Rinoahu/SwiftOrtho

De novo emergence of adaptive membrane proteins from thymine-rich intergenic sequences

10.1101/621532 ◽

2019 ◽

Author(s):

Nikolaos Vakirlis ◽

Omer Acar ◽

Brian Hsu ◽

Nelson Castilho Coelho ◽

S. Branden Van Oss ◽

...

Keyword(s):

De Novo ◽

Transmembrane Proteins ◽

Protein Coding ◽

Coding Sequences ◽

Beneficial Effects ◽

Protein Coding Genes ◽

Evolutionary Innovation ◽

Intergenic Sequences ◽

Intergenic Regions ◽

Novel Protein

Summary: Recent evidence demonstrates that novel protein-coding genes can arise de novo from intergenic loci. This evolutionary innovation is thought to be facilitated by the pervasive translation of intergenic transcripts, which exposes a reservoir of variable polypeptides to natural selection. Do intergenic translation events yield polypeptides with useful biochemical capacities? The answer to this question remains controversial. Here, we systematically characterized how de novo emerging coding sequences impact fitness. In budding yeast, overexpression of these sequences was enriched in beneficial effects, while their disruption was generally inconsequential. We found that beneficial emerging sequences have a strong tendency to encode putative transmembrane proteins, which appears to stem from a cryptic propensity for transmembrane signals throughout thymine-rich intergenic regions of the genome. These findings suggest that novel genes with useful biochemical capacities, such as transmembrane domains, tend to evolve de novo within intergenic loci that already harbored a blueprint for these capacities.

MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics

10.1101/851964 ◽

2019 ◽

Cited By ~ 1

Author(s):

Eli Levy Karin ◽

Milot Mirdita ◽

Johannes Söding

Keyword(s):

High Throughput ◽

Large Scale ◽

Sequence Similarity ◽

Direct Sequencing ◽

Metagenomic Data ◽

Reference Database ◽

Protein Coding ◽

Protein Coding Genes ◽

Highly Sensitive ◽

Computational Procedures

Abstract Background: Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics. Results: MetaEuk is a toolkit for high-throughput, reference-based discovery and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk’s power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in eight days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups. Conclusion: The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation.
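
The "6-frame-translated fragments" step mentioned in this abstract can be sketched as follows. This is a simplified illustration, not MetaEuk's implementation: it assumes the standard genetic code, cuts fragments at stop codons, and uses a hypothetical `min_len` cut-off; the function names are likewise invented for the example.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_fragments(dna: str, min_len: int = 2) -> list[tuple[str, int, str]]:
    """Return (strand, frame offset, peptide) for stop-free stretches in all six frames.

    min_len is a hypothetical cut-off chosen for this example.
    """
    dna = dna.upper()
    fragments = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for piece in translate(seq[offset:]).split("*"):
                if len(piece) >= min_len:
                    fragments.append((strand, offset, piece))
    return fragments


if __name__ == "__main__":
    for fragment in six_frame_fragments("ATGAAATAG"):
        print(fragment)
```

Each fragment is a candidate exon-like peptide that could then be searched against a reference protein database. Combining matches into multi-exon proteins, as the abstract describes, is outside the scope of this sketch.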

Rapid protein sequence evolution via compensatory frameshift is widespread in RNA virus genomes

BMC Bioinformatics ◽

10.1186/s12859-021-04182-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dongbin Park ◽

Yoonsoo Hahn

Keyword(s):

Amino Acid ◽

Large Scale ◽

Rna Viruses ◽

Rna Virus ◽

Phylogenetic Analyses ◽

Sequence Evolution ◽

Protein Coding ◽

Coding Sequences ◽

Reading Frame ◽

Nucleotide Insertions

Abstract Background RNA viruses possess remarkable evolutionary versatility driven by the high mutability of their genomes. Frameshifting nucleotide insertions or deletions (indels), which cause the premature termination of proteins, are frequently observed in the coding sequences of various viral genomes. When a secondary indel occurs near the primary indel site, the open reading frame can be restored to produce functional proteins, a phenomenon known as the compensatory frameshift. Results In this study, we systematically analyzed publicly available viral genome sequences and identified compensatory frameshift events in hundreds of viral protein-coding sequences. Compensatory frameshift events resulted in large-scale amino acid differences between the compensatory frameshift form and the wild type even though their nucleotide sequences were almost identical. Phylogenetic analyses revealed that the evolutionary distances between proteins with and without a compensatory frameshift were significantly overestimated because amino acid mismatches caused by compensatory frameshifts were counted as substitutions. Further, this could cause compensatory frameshift forms to branch in different locations in the protein and nucleotide trees, which may obscure the correct interpretation of phylogenetic relationships between variant viruses. Conclusions Our results imply that the compensatory frameshift is one of the mechanisms driving the rapid protein evolution of RNA viruses and potentially assisting their host-range expansion and adaptation.
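
The compensatory frameshift described in this abstract can be made concrete with a toy example. The sequence, the indel positions, and the function names below are invented for illustration and do not come from the paper's data. The sketch inserts one base, deletes another a few codons downstream, and compares the translated proteins.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons only, reading from the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def compensatory_frameshift(seq: str, insert_at: int, base: str, delete_at: int) -> str:
    """Insert `base` before index insert_at, then delete the base at index
    delete_at (counted in the post-insertion sequence), restoring the frame."""
    shifted = seq[:insert_at] + base + seq[insert_at:]
    return shifted[:delete_at] + shifted[delete_at + 1:]


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    wild_type = "ATGGCTGAAAAACTGGTTCCGGGTTAA"  # hypothetical toy ORF
    variant = compensatory_frameshift(wild_type, insert_at=6, base="C", delete_at=18)
    print(translate(wild_type))  # MAEKLVPG*
    print(translate(variant))    # MARKTGPG*
    print(mismatches(translate(wild_type), translate(variant)))  # 3
```

The two nucleotide sequences differ by only two indel events, yet three of the four codons between them translate to different residues. A protein-level distance that scores each mismatch as an independent substitution would therefore overstate the divergence, which is the effect the abstract reports.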

VESPA: Very large-scale Evolutionary and Selective Pressure Analyses

PeerJ Computer Science ◽

10.7717/peerj-cs.118 ◽

2017 ◽

Vol 3 ◽

pp. e118 ◽

Cited By ~ 10

Author(s):

Andrew E. Webb ◽

Thomas A. Walsh ◽

Mary J. O’Connell

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Selective Pressure ◽

Gene Families ◽

Pressure Variation ◽

Phylogeny Reconstruction ◽

Protein Coding ◽

Coding Sequences ◽

A Genome ◽

Pressure Analysis

Background Large-scale molecular evolutionary analyses of protein coding sequences require a number of preparatory inter-related steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can represent significant challenges, particularly when working with entire proteomes (all protein coding sequences in a genome) from a large number of species. Methods We present VESPA, software capable of automating a selective pressure analysis using codeML in addition to the preparatory analyses and summary statistics. VESPA is written in Python and Perl and is designed to run within a UNIX environment. Results We have benchmarked VESPA and our results show that the method is consistent, performs well on both large-scale and smaller-scale datasets, and produces results in line with previously published datasets. Discussion Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to assist the user in better understanding the results. VESPA may be found at the following website: http://www.mol-evol.org/VESPA.

Faculty Opinions recommendation of Role of low-complexity sequences in the formation of novel protein coding sequences.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718030532.793494763 ◽

2014 ◽

Author(s):

Erich Bornberg-Bauer ◽

Magdalena Heberlein

Keyword(s):

Low Complexity ◽

Protein Coding ◽

Coding Sequences ◽

Novel Protein

PlncRNADB: A Repository of Plant lncRNAs and lncRNA-RBP Protein Interactions

Current Bioinformatics ◽

10.2174/1574893614666190131161002 ◽

2019 ◽

Vol 14 (7) ◽

pp. 621-627 ◽

Cited By ~ 3

Author(s):

Youhuang Bai ◽

Xiaozhuan Dai ◽

Tiantian Ye ◽

Peijing Zhang ◽

Xu Yan ◽

...

Keyword(s):

Protein Interactions ◽

Binding Proteins ◽

Rna Binding ◽

Rna Binding Proteins ◽

Populus Trichocarpa ◽

Noncoding Rnas ◽

Reference Database ◽

Protein Coding ◽

Arabidopsis Lyrata ◽

User Friendly

Background: Long noncoding RNAs (lncRNAs) are endogenous noncoding RNAs, arbitrarily longer than 200 nucleotides, that play critical roles in diverse biological processes. LncRNAs exist in different genomes ranging from animals to plants. Objective: PlncRNADB is a searchable database of lncRNA sequences and annotation in plants. Methods: We built a pipeline for lncRNA prediction in plants, providing a convenient utility for users to quickly distinguish potential noncoding RNAs from protein-coding transcripts. Results: More than five thousand lncRNAs are collected from four plant species (Arabidopsis thaliana, Arabidopsis lyrata, Populus trichocarpa and Zea mays) in PlncRNADB. Moreover, our database provides the relationship between lncRNAs and various RNA-binding proteins (RBPs), which can be displayed through a user-friendly web interface. Conclusion: PlncRNADB can serve as a reference database to investigate the lncRNAs and their interaction with RNA-binding proteins in plants. The PlncRNADB is freely available at http://bis.zju.edu.cn/PlncRNADB/.

Draft Genome Sequence of Urease-Producing Pseudorhodobacter sp. Strain E13, Isolated from the Yellow Sea in Gunsan, South Korea

Microbiology Resource Announcements ◽

10.1128/mra.00189-19 ◽

2019 ◽

Vol 8 (23) ◽

Author(s):

Si Chul Kim ◽

Hyo Jung Lee

Keyword(s):

South Korea ◽

Genome Sequence ◽

Yellow Sea ◽

Draft Genome ◽

The Yellow Sea ◽

Draft Genome Sequence ◽

Protein Coding ◽

Coding Sequences ◽

Gram Negative ◽

Content Type

Here, we report the draft genome sequence of Pseudorhodobacter sp. strain E13, a Gram-negative, aerobic, nonflagellated, rod-shaped bacterium that was isolated from the Yellow Sea in South Korea. The assembled genome sequence is 3,878,578 bp long with 3,646 protein-coding sequences in 159 contigs.