VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families

Bioinformatics ◽

10.1093/bioinformatics/btab026 ◽

2021 ◽

Author(s):

Joan Carles Pons ◽

David Paez-Espino ◽

Gabriel Riera ◽

Natalia Ivanova ◽

Nikos C Kyrpides ◽

...

Keyword(s):

Viral Protein ◽

Confidence Score ◽

Taxonomic Classification ◽

Global Ocean ◽

Supplementary Information ◽

Protein Families ◽

Taxonomic Assignment ◽

Genus Level ◽

Key Steps ◽

Viral Sequences

Abstract

Motivation: Two key steps in the analysis of uncultured viruses recovered from metagenomes are the taxonomic classification of the viral sequences and the identification of putative host(s). Both steps rely mainly on the assignment of viral proteins to orthologs in cultivated viruses. Viral Protein Families (VPFs) can be used for the robust identification of new viral sequences in large metagenomics datasets. Despite the importance of VPF information for viral discovery, VPFs have not yet been explored for determining viral taxonomy and host targets.

Results: In this work, we classified the set of VPFs from the IMG/VR database and developed VPF-Class. VPF-Class is a tool that automates the taxonomic classification and host prediction of viral contigs based on the assignment of their proteins to a set of classified VPFs. Applying VPF-Class to 731K uncultivated virus contigs from the IMG/VR database, we were able to classify 363K contigs at the genus level and predict the host of over 461K contigs. In the RefSeq database, VPF-Class achieved an accuracy of nearly 100% in classifying dsDNA, ssDNA and retroviruses at the genus level, considering a membership ratio and a confidence score of 0.2. The accuracy in host prediction was 86.4%, also at the genus level, considering a membership ratio of 0.3 and a confidence score of 0.5. In the prophage dataset, the accuracy in host prediction was 86%, considering a membership ratio of 0.6 and a confidence score of 0.8. Moreover, from the Global Ocean Virome dataset, over 817K viral contigs out of 1 million were classified.

Availability and implementation: The implementation of VPF-Class can be downloaded from https://github.com/biocom-uib/vpf-tools.

Supplementary information: Supplementary data are available at Bioinformatics online.
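The membership ratio and confidence score quoted in this abstract act as cut-offs on individual predictions. The sketch below shows one way such a filter could look in Python. The `Prediction` record, its field names and the example values are hypothetical and are not VPF-Class's actual output format or API; the thresholds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    # Hypothetical record layout, not the tool's real output schema.
    contig: str
    label: str
    membership_ratio: float
    confidence_score: float


def filter_predictions(preds, min_membership, min_confidence):
    """Keep predictions that meet both thresholds (assumed inclusive)."""
    return [
        p for p in preds
        if p.membership_ratio >= min_membership
        and p.confidence_score >= min_confidence
    ]


# Made-up example values, for illustration only.
example = [
    Prediction("contig_1", "GenusA", 0.45, 0.90),
    Prediction("contig_2", "GenusB", 0.10, 0.95),
    Prediction("contig_3", "GenusC", 0.35, 0.40),
]
kept = filter_predictions(example, min_membership=0.3, min_confidence=0.5)
```

With the host-prediction settings quoted above (0.3 and 0.5), only the first example record passes both cut-offs.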

efam: an expanded, metaproteome-supported HMM profile database of viral protein families

Bioinformatics ◽

10.1093/bioinformatics/btab451 ◽

2021 ◽

Author(s):

Ahmed A Zayed ◽

Dominik Lücking ◽

Mohamed Mohssen ◽

Dylan Cronin ◽

Ben Bolduc ◽

...

Keyword(s):

Viral Protein ◽

Global Ocean ◽

Supplementary Information ◽

Protein Families ◽

Marine Habitats ◽

Elemental Cycling ◽

Protein Functions ◽

Minimum Zone ◽

Viral Sequences ◽

Oxygen Minimum

Abstract

Motivation: Viruses infect, reprogram, and kill microbes, leading to profound ecosystem consequences, from elemental cycling in oceans and soils to microbiome-modulated diseases in plants and animals. Although metagenomic datasets are increasingly available, identifying viruses in them is challenging due to poor representation and annotation of viral sequences in databases.

Results: Here we establish efam, an expanded collection of Hidden Markov Model (HMM) profiles that represent viral protein families conservatively identified from the Global Ocean Virome 2.0 dataset. This resulted in 240,311 HMM profiles, each with at least 2 protein sequences, making efam >7-fold larger than the next largest, pan-ecosystem viral HMM profile database. Adjusting the criteria for viral contig confidence from “conservative” to “eXtremely Conservative” resulted in 37,841 HMM profiles in our efam-XC database. To assess the value of this resource, we integrated efam-XC into VirSorter viral discovery software to discover viruses from less-studied, ecologically distinct oxygen minimum zone (OMZ) marine habitats. This expanded database led to an increase in viruses recovered from every tested OMZ virome by ∼24% on average (up to ∼42%) and especially improved the recovery of often-missed shorter contigs (<5 kb). Additionally, to help elucidate lesser-known viral protein functions, we annotated the profiles using multiple databases from the DRAM pipeline and virion-associated metaproteomic data, which doubled the number of annotations obtainable by standard, single-database annotation approaches. Together, these marine resources (efam and efam-XC) are provided as searchable, compressed HMM databases that will be updated bi-annually to help maximize viral sequence discovery and study from any ecosystem.

Availability: The resources are available on the iVirus platform at doi.org/10.25739/9vze-4143.

Supplementary information: Supplementary data are available at Bioinformatics online.

Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics

BMC Biology ◽

10.1186/s12915-020-00938-6 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Congyu Lu ◽

Zheng Zhang ◽

Zena Cai ◽

Zhaozhong Zhu ◽

Ye Qiu ◽

...

Keyword(s):

Prediction Accuracy ◽

Functional Characterization ◽

Gaussian Model ◽

Software Tool ◽

Biological Properties ◽

Rapid Identification ◽

Taxonomic Assignment ◽

Genus Level ◽

Alignment Free ◽

Archaeal Viruses

Abstract

Background: Viruses are ubiquitous biological entities, estimated to be the largest reservoirs of unexplored genetic diversity on Earth. Full functional characterization and annotation of newly discovered viruses requires tools that determine the taxonomic assignment, host range, and biological properties of the virus. Here we focus on prokaryotic viruses, which include phages and archaeal viruses, and for which identifying the viral host is an essential step in characterizing the virus, as the virus relies on the host for survival. Currently, the viral host is determined either by culturing the virus, which is low-throughput, time-consuming, and expensive, or by computational prediction, which needs improvement in both accuracy and usability. Here we develop a Gaussian model that predicts hosts for prokaryotic viruses with better performance than previous computational methods.

Results: We present Prokaryotic virus Host Predictor (PHP), a software tool that uses a Gaussian model to predict hosts for prokaryotic viruses, with the differences in k-mer frequencies between viral and host genomic sequences as features. PHP gave a host prediction accuracy of 34% (genus level) on the VirHostMatcher benchmark dataset and a host prediction accuracy of 35% (genus level) on a new dataset containing 671 viruses and 60,105 prokaryotic genomes. The prediction accuracy exceeded that of two alignment-free methods (VirHostMatcher and WIsH, 28–34%, genus level). PHP also substantially outperformed these two alignment-free methods (24–38% vs 18–20%, genus level) when predicting hosts for prokaryotic viruses that cannot be predicted by the BLAST-based or the CRISPR-spacer-based methods alone. Requiring a minimal score for making predictions (thresholding) and taking the consensus of the top 30 predictions further improved the host prediction accuracy of PHP.

Conclusions: The Prokaryotic virus Host Predictor software tool provides an intuitive and user-friendly API for the Gaussian model described herein. This work will facilitate the rapid identification of hosts for newly identified prokaryotic viruses in metagenomic studies.
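The feature construction and the post-processing named in this abstract (k-mer frequency differences, score thresholding, consensus of the top 30 predictions) can be illustrated with a short Python sketch. It does not reproduce PHP's Gaussian model or its API. The function names, the default k of 4, and the choice to apply the minimum score to the best-ranked candidate are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Relative frequency of every DNA k-mer, in fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def frequency_difference(virus_seq, host_seq, k=4):
    """Feature vector: per-k-mer frequency difference, virus minus host."""
    v = kmer_frequencies(virus_seq, k)
    h = kmer_frequencies(host_seq, k)
    return [a - b for a, b in zip(v, h)]


def consensus_host(scored_hosts, top_n=30, min_score=None):
    """Majority genus among the top_n (genus, score) pairs; higher is better.

    Returns None when there are no candidates or when the best score is
    below min_score (an assumed reading of "thresholding").
    """
    ranked = sorted(scored_hosts, key=lambda pair: pair[1], reverse=True)[:top_n]
    if not ranked or (min_score is not None and ranked[0][1] < min_score):
        return None
    votes = Counter(genus for genus, _ in ranked)
    return votes.most_common(1)[0][0]
```

In a full pipeline the difference vector would be passed to a trained scoring model, and `consensus_host` would be applied to that model's per-host scores.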

Detecting and correcting misclassified sequences in the large-scale public databases

Bioinformatics ◽

10.1093/bioinformatics/btaa586 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4699-4705

Author(s):

Hamid Bagheri ◽

Andrew J Severin ◽

Hridesh Rajan

Keyword(s):

Large Scale ◽

Sequence Similarity ◽

Heuristic Method ◽

Simulated Data ◽

Supplementary Information ◽

Small Subset ◽

Taxonomic Assignment ◽

User Input ◽

Public Repositories ◽

Taxonomic Assignments

Abstract

Motivation: As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on users to provide the metadata for each submission, which is prone to error. Unfortunately, most public databases, such as non-redundant (NR), have no methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity.

Results: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.

Availability and implementation: Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.

Supplementary information: Supplementary data are available at Bioinformatics online.
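As a rough illustration of how annotation frequency within a similarity cluster can flag suspect entries, and of how precision and recall are computed on simulated data, consider the Python sketch below. It is a deliberately simplified majority-vote stand-in, not the authors' heuristic (which also weighs provenance). The function names and the toy cluster are hypothetical.

```python
from collections import Counter


def flag_minority_annotations(cluster):
    """Flag proteins whose taxon differs from the cluster's most frequent taxon.

    cluster: dict mapping protein id -> annotated taxon, for one cluster of
    highly similar sequences.
    """
    if not cluster:
        return set()
    majority, _ = Counter(cluster.values()).most_common(1)[0]
    return {pid for pid, taxon in cluster.items() if taxon != majority}


def precision_recall(flagged, truly_misclassified):
    """Precision and recall of a flagged set against a known ground truth."""
    true_positives = len(flagged & truly_misclassified)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_misclassified) if truly_misclassified else 0.0
    return precision, recall


# Toy cluster: three proteins agree, one carries a different annotation.
toy_cluster = {
    "p1": "Escherichia",
    "p2": "Escherichia",
    "p3": "Homo",
    "p4": "Escherichia",
}
suspects = flag_minority_annotations(toy_cluster)
```

On simulated data, where the injected misclassifications are known, `precision_recall` compares the flagged set with that ground truth.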

Taxonomic classification methods reveal a new subgenus in the paramyxovirus subfamily Orthoparamyxovirinae

10.1101/2021.10.12.464153 ◽

2021 ◽

Author(s):

Heather L Wells ◽

Elizabeth Loh ◽

Alessandra Nava ◽

Mei Ho Lee ◽

Jimmy Lee ◽

...

Keyword(s):

Full Length ◽

Taxonomic Classification ◽

Sequence Length ◽

South American ◽

Phenotypic Data ◽

Vast Number ◽

New Subgenus ◽

Public Repositories ◽

Viral Sequences

As part of a broad One Health surveillance effort to detect novel viruses in wildlife and people, we report several paramyxoviruses sequenced primarily from bats during 2013 and 2014 in Brazil and Malaysia, including seven from which we recovered full-length genomes. Of these, six represent the first full-length paramyxovirus genomes sequenced from the Americas, including two sequences which are the first full-length bat morbillivirus genomes published to date. Our findings add to the vast number of viral sequences in public repositories that have been increasing considerably in recent years due to the rising accessibility of metagenomics. Taxonomic classification of these sequences in the absence of phenotypic data has been a significant challenge, particularly in the paramyxovirus subfamily Orthoparamyxovirinae, where the rate of discovery of novel sequences has been substantial. Using pairwise amino acid sequence classification (PASC), we describe a novel genus within this subfamily tentatively named Jeishaanvirus, which we propose should include as subgenera Jeilongvirus, Shaanvirus, and a novel South American subgenus Cadivirus. We also highlight inconsistencies in the classification of Tupaia virus and Mojiang virus using the same demarcation criteria and show that members of the proposed subgenus Shaanvirus are paraphyletic. Importantly, this study underscores the critical importance of sequence length in PASC analysis as well as the importance of biological characteristics such as genome organization in the taxonomic classification of viral sequences.
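PASC-style demarcation rests on pairwise identity between protein sequences. A minimal Python sketch of percent identity over two already-aligned sequences follows. The gap handling (columns where both sequences have a gap are skipped, and a gap opposite a residue counts as a mismatch) is an assumption of this sketch and not necessarily the convention of the PASC tool.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

Because the denominator is the number of compared columns, a short fragment and a full-length sequence from the same genome can give different identities against the same reference, which is one way sequence length can affect such an analysis.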

Can We Explain RNA-Mediated Virus Resistance by Homology-Dependent Gene Silencing?

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi.1998.11.7.717 ◽

1998 ◽

Vol 11 (7) ◽

pp. 717-723 ◽

Cited By ~ 40

Author(s):

Tom van den Boogaart ◽

George P. Lomonossoff ◽

Jeffrey W. Davies

Keyword(s):

Gene Silencing ◽

Virus Resistance ◽

Viral Protein ◽

Transgene Expression ◽

Standard Technique ◽

Protein S ◽

The Other ◽

Critical View ◽

Homologous Sequences ◽

Viral Sequences

The use of viral sequences to produce virus-resistant plants is now almost a standard technique. A variety of sequences from a large number of viruses have been used but the mechanisms remain largely unknown. There are probably at least two distinct types of mechanisms operating: one requiring the expression of the viral protein(s) and the other dependent only on the presence of transgene-derived RNA. In this review, we will discuss this RNA-mediated resistance and its similarities with cosuppression, a recently described phenomenon leading to suppression of transgene expression and homologous sequences. We present a critical view of the current models available to explain this type of resistance.

The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

PLoS Biology ◽

10.1371/journal.pbio.0050016 ◽

2007 ◽

Vol 5 (3) ◽

pp. e16 ◽

Cited By ~ 580

Author(s):

Shibu Yooseph ◽

Granger Sutton ◽

Douglas B Rusch ◽

Aaron L Halpern ◽

Shannon J Williamson ◽

...

Keyword(s):

Global Ocean ◽

Protein Families ◽

Global Ocean Sampling ◽

Ocean Sampling ◽

The Universe

A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity

Bioinformatics ◽

10.1093/bioinformatics/btab130 ◽

2021 ◽

Author(s):

Raghvendra Mall ◽

Abdurrahman Elbasir ◽

Hossam Almeer ◽

Zeyaul Islam ◽

Prasanna R Kolatkar ◽

...

Keyword(s):

Viral Protein ◽

Mean Squared Error ◽

De Novo ◽

Pearson Correlation ◽

Drug Repurposing ◽

Viral Proteins ◽

Binding Energies ◽

Supplementary Information ◽

Protein Activity ◽

Docking Simulations

Abstract

Motivation: A global effort is underway to identify compounds for the treatment of COVID-19. Since de novo compound design is an extremely long, time-consuming, and expensive process, efforts are underway to discover existing compounds that can be repurposed for COVID-19 and new viral diseases.

Model: We propose a machine learning representation framework that uses deep-learning-induced vector embeddings of compounds and viral proteins as features to predict compound-viral protein activity. The prediction model in turn uses a consensus framework to rank approved compounds against viral proteins of interest.

Results: Our consensus framework achieves a high mean Pearson correlation of 0.916, a mean R2 of 0.840 and a low mean squared error of 0.313 for the task of compound-viral protein activity prediction on an independent test set. As a use case, we identify a ranked list of 47 compounds common to three main proteins of the SARS-CoV-2 virus (PL-PRO, 3CL-PRO and Spike protein) as potential targets, including 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds. We perform additional molecular docking simulations to demonstrate that the majority of these compounds have low binding energies and thus high binding affinity, with the potential to be effective against the SARS-CoV-2 virus.

Availability: All the source code and data are available at: https://github.com/raghvendra5688/Drug-Repurposing and https://dx.doi.org/10.17632/8rrwnbcgmx.3. We also implemented a web-server at: https://machinelearning-protein.qcri.org/index.html.

Supplementary information: Supplementary data are available at Bioinformatics online.
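The three evaluation metrics reported in this abstract (Pearson correlation, R2 and mean squared error) have standard definitions, sketched below in plain Python. The function names are illustrative and this is not code from the authors' repository.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A high Pearson correlation shows that predictions track the observed activities linearly, while R2 and the mean squared error also penalize systematic offsets in the predicted values.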

CHEER: hierarCHical taxonomic classification for viral mEtagEnomic data via deep leaRning

10.1101/2020.03.26.009001 ◽

2020 ◽

Author(s):

Jiayu Shang ◽

Yanni Sun

Keyword(s):

New Species ◽

Rna Virus ◽

Hierarchical Classification ◽

Large Data ◽

Taxonomic Classification ◽

Classification Model ◽

Metagenomic Data ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Taxonomic Analysis

Abstract

The fast accumulation of viral metagenomic data has contributed significantly to new RNA virus discovery. However, the short read size, complex composition, and large data size can all make taxonomic analysis difficult. In particular, commonly used alignment-based methods are not ideal choices for detecting new viral species. In this work, we present a novel hierarchical classification model named CHEER, which can conduct read-level taxonomic classification from order to genus for new species. By combining k-mer embedding-based encoding, hierarchically organized CNNs, and a carefully trained rejection layer, CHEER is able to assign correct taxonomic labels for reads from new species. We tested CHEER on both simulated and real sequencing data. The results show that CHEER can achieve higher accuracy than popular alignment-based and alignment-free taxonomic assignment tools. The source code, scripts, and pre-trained parameters for CHEER are available via GitHub: https://github.com/KennthShang/CHEER.
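The top-down flow described in this abstract, where a read is split into k-mers and then passed through per-level classifiers that can stop early, can be sketched in Python as follows. The classifiers here are hypothetical callables returning label probabilities, and a plain probability threshold stands in for CHEER's trained rejection layer. None of these names come from the CHEER code base.

```python
def kmer_tokens(read, k=3):
    """Split a read into overlapping k-mers, the units an embedding would encode."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def classify_top_down(features, classifiers, root="root", min_prob=0.6):
    """Walk a taxonomy from the root, one classifier per internal node.

    classifiers: dict mapping node name -> callable(features) -> dict of
    child label -> probability. The walk stops when the best child falls
    below min_prob or when the current node has no classifier.
    Returns the list of accepted labels, from highest rank downwards.
    """
    path = []
    node = root
    while node in classifiers:
        probs = classifiers[node](features)
        child, prob = max(probs.items(), key=lambda item: item[1])
        if prob < min_prob:
            break  # reject: do not assign a label at this rank or below
        path.append(child)
        node = child
    return path


# Toy classifiers with fixed outputs, for illustration only.
toy_classifiers = {
    "root": lambda f: {"OrderA": 0.9, "OrderB": 0.1},
    "OrderA": lambda f: {"FamilyX": 0.55, "FamilyY": 0.45},
}
```

Stopping early yields a partial label (for example, order only), which is one way to handle reads from taxa that the lower-level classifiers were not trained on.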

CheckV assesses the quality and completeness of metagenome-assembled viral genomes

Nature Biotechnology ◽

10.1038/s41587-020-00774-7 ◽

2020 ◽

Cited By ~ 1

Author(s):

Stephen Nayfach ◽

Antonio Pedro Camargo ◽

Frederik Schulz ◽

Emiley Eloe-Fadrosh ◽

Simon Roux ◽

...

Keyword(s):

Global Ocean ◽

Large Database ◽

High Quality ◽

Accurate Identification ◽

Short Read ◽

Viral Genomes ◽

Metabolic Genes ◽

Automated Pipeline ◽

Auxiliary Metabolic Genes ◽

Viral Sequences

Abstract

Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions.
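One simple way to picture a reference-based completeness estimate is as the ratio of a contig's length to the length of its closest complete reference genome. The Python sketch below uses that simplification together with the >90% high-quality cut-off quoted in the abstract. The length-ratio formula and the function names are assumptions of this sketch and do not describe CheckV's actual algorithm or interface.

```python
def estimate_completeness(contig_length, reference_length):
    """Completeness in percent, capped at 100 (assumed length-ratio estimate)."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return min(100.0, 100.0 * contig_length / reference_length)


def is_high_quality(completeness, threshold=90.0):
    """High quality means strictly more than the threshold, as in '>90% complete'."""
    return completeness > threshold
```

For example, a 48,000 bp contig matched to a 50,000 bp reference would be estimated at 96% and counted as high quality, while a 45,000 bp contig would sit exactly at 90% and fall just short.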

TreeSAPP: the Tree-based Sensitive and Accurate Phylogenetic Profiler

Bioinformatics ◽

10.1093/bioinformatics/btaa588 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4706-4713

Author(s):

Connor Morgan-Lang ◽

Ryan McLaughlin ◽

Zachary Armstrong ◽

Grace Zhang ◽

Kevin Chan ◽

...

Keyword(s):

Biogeochemical Cycles ◽

Supplementary Information ◽

Biological Sequence ◽

Missing Information ◽

Taxonomic Assignment ◽

Taxonomic Rank ◽

Coding Sequences ◽

Global Biogeochemical Cycles ◽

Taxonomic Groups ◽

Python Package

Abstract

Motivation: Microbial communities drive matter and energy transformations integral to global biogeochemical cycles, yet many taxonomic groups facilitating these processes remain poorly represented in biological sequence databases. Due to this missing information, taxonomic assignment of sequences from environmental genomes remains inaccurate.

Results: We present the Tree-based Sensitive and Accurate Phylogenetic Profiler (TreeSAPP) software for functionally and taxonomically classifying genes, reactions and pathways from genomes of cultivated and uncultivated microorganisms using reference packages representing coding sequences mediating multiple globally relevant biogeochemical cycles. TreeSAPP uses linear regression of evolutionary distance on taxonomic rank to improve classifications, assigning both closely related and divergent query sequences at the appropriate taxonomic rank. TreeSAPP is able to provide quantitative functional and taxonomic classifications for both assembled and unassembled sequences and files supporting interactive tree of life visualizations.

Availability and implementation: TreeSAPP was developed in Python 3 as an open-source Python package and is available on GitHub at https://github.com/hallamlab/TreeSAPP.

Supplementary information: Supplementary data are available at Bioinformatics online.
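The idea of regressing evolutionary distance on taxonomic rank, then using the fitted line to decide how deep a query can safely be classified, can be sketched in Python as below. Ranks are encoded as integer depths (larger means more specific), which is an assumption of this sketch. The function names and training values are illustrative and are not taken from the TreeSAPP package.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deepest_supported_rank(distance, slope, intercept, max_rank):
    """Invert the fitted line and round down to a conservative rank depth."""
    estimate = (distance - intercept) / slope
    return max(0, min(max_rank, math.floor(estimate)))


# Made-up training data: distance shrinks as shared rank gets more specific.
rank_depths = [1, 2, 3, 4, 5]
distances = [1.0, 0.8, 0.6, 0.4, 0.2]
slope, intercept = fit_line(rank_depths, distances)
```

A divergent query (large distance) maps to a shallow rank and a close relative maps to a deep one, so each sequence is labeled only as specifically as its distance to the reference supports.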
