TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding

Bioinformatics ◽

10.1093/bioinformatics/btab198 ◽

2021 ◽

Author(s):

Yue Cao ◽

Yang Shen

Keyword(s):

High Throughput ◽

Protein Function ◽

Sequence Data ◽

Sequence Similarity ◽

Directed Graphs ◽

Training Data ◽

Supplementary Information ◽

Sequence Information ◽

Function Annotation ◽

Protein Function Annotation

Abstract Motivation Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions. Results To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences we use self attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability. Availability The data, source codes and models are available at https://github.com/Shen-Lab/TALE Supplementary information Supplementary data are available at Bioinformatics online.


TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding

10.1101/2020.09.27.315937 ◽

2020 ◽

Author(s):

Yue Cao ◽

Yang Shen

Keyword(s):

High Throughput ◽

Protein Function ◽

Sequence Data ◽

Sequence Similarity ◽

Directed Graphs ◽

Training Data ◽

Supplementary Information ◽

Function Annotation ◽

Source Codes ◽

Protein Function Annotation

Abstract Motivation Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on data besides sequences, or lack generalizability to novel sequences, species and functions. Results To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences we use self attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions, we also embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low homology and never/rarely annotated novel species or functions compared to training data, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability. Availability The data, source codes and models are available at https://github.com/Shen-Lab/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.


SOME INTRIGUING HIGH-THROUGHPUT DNA SEQUENCE VARIANTS PREDICTION OVER PROTEIN FUNCTIONALITY

Jurnal Teknologi ◽

10.11113/jt.v78.8967 ◽

2016 ◽

Vol 78 (6-4) ◽

Author(s):

Atabak Kheirkhah ◽

Salwani Mohd Daud ◽

Noor Azurati Ahmad @ Salleh ◽

Suriani Mohd Sam ◽

Hafiza Abas ◽

...

Keyword(s):

Genetic Algorithm ◽

Dna Sequence ◽

High Throughput ◽

Protein Function ◽

Sequence Data ◽

Fundamental Problem ◽

Sequence Information ◽

Sequence Variants ◽

Sequence Prediction ◽

Dna Sequence Variants

This paper reviews computational methods and high-throughput automated tools for precisely predicting the various functionalities of uncharacterized proteins from their DNA sequence information alone. It then proposes a hybrid of a weighted network and a genetic algorithm to improve prediction. The main advantage of the method is that protein function and DNA sequence prediction can be computed precisely using the best-fitness parent in the genetic algorithm. With the completion of human genome sequencing, the number of sequence-known proteins has increased exponentially, while the pace of determining their biological attributes is much slower. The gap between DNA sequence variants and their functionalities has become increasingly large. However, detection of sequences based on the Protein Data Bank has become a benchmark for many researchers. As the amount of DNA sequence data continues to increase, this fundamental problem stays at the forefront of genome analysis. In developing these methods, the following matters often need to be considered: benchmark dataset construction, gene sequence prediction, operating algorithm, anticipated accuracy, gene recommender and functional integrations. This review discusses each of them, with a particular focus on operational algorithms and how to increase the accuracy of DNA sequence variant prediction.


PIPA: A High-Throughput Pipeline for Protein Function Annotation

2008 DoD HPCMP Users Group Conference ◽

10.1109/dod.hpcmp.ugc.2008.24 ◽

2008 ◽

Author(s):

Chenggang Yu ◽

Valmik Desai ◽

Nela Zavaljevski ◽

Jaques Reifman

Keyword(s):

High Throughput ◽

Protein Function ◽

Function Annotation ◽

Protein Function Annotation


A combined approach for genome wide protein function annotation/prediction

Proteome Science ◽

10.1186/1477-5956-11-s1-s1 ◽

2013 ◽

Vol 11 (Suppl 1) ◽

pp. S1 ◽

Cited By ~ 17

Author(s):

Alfredo Benso ◽

Stefano Di Carlo ◽

Hafeez ur Rehman ◽

Gianfranco Politano ◽

Alessandro Savino ◽

...

Keyword(s):

Protein Function ◽

Combined Approach ◽

Function Annotation ◽

Protein Function Annotation ◽

Genome Wide


Protein function annotation using protein domain family resources

Methods ◽

10.1016/j.ymeth.2015.09.029 ◽

2016 ◽

Vol 93 ◽

pp. 24-34 ◽

Cited By ~ 17

Author(s):

Sayoni Das ◽

Christine A. Orengo

Keyword(s):

Protein Function ◽

Protein Domain ◽

Domain Family ◽

Family Resources ◽

Function Annotation ◽

Protein Function Annotation ◽

Protein Domain Family


TE-greedy-nester: structure-based detection of LTR retrotransposons and their nesting

Bioinformatics ◽

10.1093/bioinformatics/btaa632 ◽

2020 ◽

Vol 36 (20) ◽

pp. 4991-4999

Author(s):

Matej Lexa ◽

Pavel Jedlicka ◽

Ivan Vanat ◽

Michal Cervenansky ◽

Eduard Kejnovsky

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Recursive Algorithm ◽

Complex Mixture ◽

Computation Time ◽

Full Length ◽

Supplementary Information ◽

Ltr Retrotransposons ◽

Process Error ◽

Transposon Evolution

Abstract Motivation Transposable elements (TEs) in eukaryotes often get inserted into one another, forming sequences that become a complex mixture of full-length elements and their fragments. The reconstruction of full-length elements and the order in which they have been inserted is important for genome and transposon evolution studies. However, the accumulation of mutations and genome rearrangements over evolutionary time makes this process error-prone and decreases the efficiency of software aiming to recover all nested full-length TEs. Results We created software that uses a greedy recursive algorithm to mine increasingly fragmented copies of full-length LTR retrotransposons in assembled genomes and other sequence data. The software called TE-greedy-nester considers not only sequence similarity but also the structure of elements. This new tool was tested on a set of natural and synthetic sequences and its accuracy was compared to similar software. We found TE-greedy-nester to be superior in a number of parameters, namely computation time and full-length TE recovery in highly nested regions. Availability and implementation http://gitlab.fi.muni.cz/lexa/nested. Supplementary information Supplementary data are available at Bioinformatics online.


Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Bioinformatics ◽

10.1093/bioinformatics/btaa701 ◽

2020 ◽

Cited By ~ 1

Author(s):

Amelia Villegas-Morcillo ◽

Stavros Makrodimitris ◽

Roeland C H J van Ham ◽

Angel M Gomez ◽

Victoria Sanchez ◽

...

Keyword(s):

Protein Function ◽

Prediction Models ◽

Protein Function Prediction ◽

3D Structure ◽

Function Prediction ◽

Feature Representation ◽

Training Data ◽

Supplementary Information ◽

Molecular Function ◽

Structure Information

Abstract Motivation Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information Supplementary data are available at Bioinformatics online.


Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences

Bioinformatics ◽

10.1093/bioinformatics/bty704 ◽

2018 ◽

Vol 35 (5) ◽

pp. 753-759 ◽

Cited By ~ 8

Author(s):

Aashish Jain ◽

Daisuke Kihara

Keyword(s):

Protein Function ◽

Transfer Functions ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Prediction Method ◽

Query Protein ◽

Function Prediction ◽

Homology Search ◽

Supplementary Information ◽

Phylogenetic Distance

Abstract Motivation Function annotation of proteins is fundamental in contemporary biology across fields including genomics, molecular biology, biochemistry, systems biology and bioinformatics. Function prediction is indispensable in providing clues for interpreting omics-scale data as well as in assisting biologists to build hypotheses for designing experiments. As sequencing genomes is now routine due to the rapid advancement of sequencing technologies, computational protein function prediction methods have become increasingly important. A conventional method of annotating a protein sequence is to transfer functions from top hits of a homology search; however, this approach has substantial shortcomings including a low coverage in genome annotation. Results Here we have developed Phylo-PFP, a new sequence-based protein function prediction method, which mines functional information from a broad range of similar sequences, including those with a low sequence similarity identified by a PSI-BLAST search. To evaluate functional similarity between identified sequences and the query protein more accurately, Phylo-PFP reranks retrieved sequences by considering their phylogenetic distance. Compared to Phylo-PFP’s predecessor, PFP, which was among the top ranked methods in the second round of the Critical Assessment of Functional Annotation (CAFA2), Phylo-PFP demonstrated substantial improvement in prediction accuracy. Phylo-PFP was further shown to outperform prediction programs to date that were ranked top in CAFA2. Availability and implementation Phylo-PFP web server is available at http://kiharalab.org/phylo_pfp.php. Supplementary information Supplementary data are available at Bioinformatics online.


The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

BMC Genomics ◽

10.1186/1471-2164-9-s2-s2 ◽

2008 ◽

Vol 9 (Suppl 2) ◽

pp. S2 ◽

Cited By ~ 24

Author(s):

Inbal Halperin ◽

Dariya S Glazer ◽

Shirley Wu ◽

Russ B Altman

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation ◽

Novel Applications


The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

BMC Bioinformatics ◽

10.1186/1471-2105-9-52 ◽

2008 ◽

Vol 9 (1) ◽

pp. 52 ◽

Cited By ~ 23

Author(s):

Chenggang Yu ◽

Nela Zavaljevski ◽

Valmik Desai ◽

Seth Johnson ◽

Fred J Stevens ◽

...

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation ◽

Genome Wide ◽

Automated Pipeline
