MUSTER: Improving protein sequence profile-profile alignments by using multiple sources of structure information

Sitao Wu; Yang Zhang

doi:10.1002/prot.21945

Identification of ligand-binding residues using protein sequence profile alignment and query-specific support vector machine model

Analytical Biochemistry ◽

10.1016/j.ab.2020.113799 ◽

2020 ◽

Vol 604 ◽

pp. 113799

Author(s):

Jun Hu ◽

Liang Rao ◽

Xueqiang Fan ◽

Guijun Zhang

Keyword(s):

Support Vector Machine ◽

Protein Sequence ◽

Support Vector Machine Model ◽

Support Vector ◽

Sequence Profile ◽

Machine Model ◽

Specific Support ◽

Binding Residues ◽

Protein Sequence Profile ◽

Profile Alignment


Protein sequence profile prediction using ProtAlbert transformer

10.1101/2021.09.23.461475 ◽

2021 ◽

Author(s):

Fatemeh Zare-Mirakabad ◽

Armin Behjati ◽

Seyed Shahriar Arab ◽

Abbas Nowzari-Dalini

Keyword(s):

Amino Acids ◽

Protein Sequence ◽

Nearest Neighbor ◽

Tertiary Structure ◽

Query Sequence ◽

Protein Secondary Structure ◽

Protein Sequences ◽

Family Characteristics ◽

Sequence Profile ◽

Protein Sequence Profile

Protein sequences can be viewed as a language; we can therefore benefit from models originally developed for natural languages, such as transformers. ProtAlbert is one of the best pre-trained transformers on protein sequences, and its efficiency enables us to run the model on longer sequences with less computational power while performing similarly to other pre-trained transformers. This paper has two main parts: transformer analysis and profile prediction. In the first part, we propose five algorithms to assess the attention heads in different layers of ProtAlbert for five protein characteristics: nearest-neighbor interactions, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure. These algorithms are applied to 55 proteins extracted from CASP13 and three case-study proteins whose sequences, experimental tertiary structures, and HSSP profiles are available. This assessment shows that, although the model is pre-trained only on protein sequences, attention heads in the layers of ProtAlbert are representative of some protein family characteristics. This conclusion leads to the second part of our work. We propose an algorithm called PA_SPP for protein sequence profile prediction with pre-trained ProtAlbert using masked-language modeling. PA_SPP can help researchers predict an HSSP profile when the database contains no sequences similar to the query sequence from which to build one.
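The abstract does not spell out how PA_SPP turns masked-language modeling into a profile. One plausible reading, offered here only as a sketch, is to mask each position in turn and use the model's predicted distribution over the 20 amino acids as that profile column. In the Python below, `toy_masked_predictor` is a hypothetical stand-in for a real model call (it is not ProtAlbert and not the authors' code), and `predict_profile` is an illustrative name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_masked_predictor(sequence, position):
    """Hypothetical stand-in for a masked language model.

    Returns a probability distribution over the 20 amino acids for the
    masked position. A real predictor would query a pre-trained model;
    this toy version only up-weights the residues found next to the mask.
    """
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    for neighbor in (position - 1, position + 1):
        if 0 <= neighbor < len(sequence) and sequence[neighbor] in weights:
            weights[sequence[neighbor]] += 2.0
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def predict_profile(sequence, predictor):
    """Build an L x 20 profile by masking one position at a time."""
    return [predictor(sequence, i) for i in range(len(sequence))]


if __name__ == "__main__":
    profile = predict_profile("MKVLA", toy_masked_predictor)
    for residue, column in zip("MKVLA", profile):
        best = max(column, key=column.get)
        print(residue, best, round(column[best], 3))
```

Passing the predictor in as a callable keeps the masking loop independent of any particular model, so a real masked-language-model wrapper could be substituted without changing `predict_profile`.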


A comparison of scoring functions for protein sequence profile alignment

Bioinformatics ◽

10.1093/bioinformatics/bth090 ◽

2004 ◽

Vol 20 (8) ◽

pp. 1301-1308 ◽

Cited By ~ 76

Author(s):

R. C. Edgar ◽

K. Sjolander

Keyword(s):

Protein Sequence ◽

Scoring Functions ◽

Sequence Profile ◽

Protein Sequence Profile ◽

Profile Alignment


To Improve Protein Sequence Profile Prediction through Image Captioning on Pairwise Residue Distance Map

Journal of Chemical Information and Modeling ◽

10.1021/acs.jcim.9b00438 ◽

2019 ◽

Vol 60 (1) ◽

pp. 391-399 ◽

Cited By ~ 3

Author(s):

Sheng Chen ◽

Zhe Sun ◽

Lihua Lin ◽

Zifeng Liu ◽

Xun Liu ◽

...

Keyword(s):

Protein Sequence ◽

Image Captioning ◽

Sequence Profile ◽

Distance Map ◽

Protein Sequence Profile


To Improve Protein Sequence Profile Prediction through Image Captioning on Pairwise Residue Distance Map

10.1101/628917 ◽

2019 ◽

Author(s):

Sheng Chen ◽

Zhe Sun ◽

Zifeng Liu ◽

Xun Liu ◽

Yutian Chong ◽

...

Keyword(s):

Protein Sequence ◽

Network Architecture ◽

Structural Information ◽

3D Structure ◽

Previous Method ◽

Image Captioning ◽

Sequence Profile ◽

Distance Map ◽

3D Structures ◽

Protein Sequence Profile

Abstract: Protein sequence profile prediction aims to generate multiple sequences from structural information to advance protein design. A protein sequence profile can be computationally predicted by energy-based or fragment-based methods. By integrating these methods with neural networks, our previous method, SPIN2, achieved a sequence recovery rate of 34%. However, SPIN2 employed only one-dimensional (1D) structural properties, which are not sufficient to represent 3D structures. In this study, we represented 3D structures by 2D maps of pairwise residue distances and developed a new method (SPROF) to predict protein sequence profiles based on an image-captioning learning framework. To the best of our knowledge, this is the first method to employ a 2D distance map for predicting protein properties. SPROF achieved 39.8% sequence recovery of residues on the independent test set, representing a 5.2% improvement over SPIN2. We also found that sequence recovery increased with the number of neighboring residues in 3D structural space, indicating that our method can effectively learn long-range information from the 2D distance map. Thus, a network architecture using a 2D distance map is expected to be useful for other 3D structure-based applications, such as binding site prediction, protein function prediction, and protein interaction prediction.
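Two quantities in this abstract are simple to state in code: the 2D map of pairwise residue distances used as input, and the sequence recovery rate used as the metric. The Python sketch below assumes one representative 3D coordinate per residue (for example a C-alpha atom, which the abstract does not specify) and defines recovery as the fraction of positions where the predicted residue matches the native one. The function names are illustrative and are not taken from SPROF.

```python
import math


def distance_map(coords):
    """Return the symmetric L x L matrix of Euclidean distances
    between residue coordinates (one (x, y, z) tuple per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def sequence_recovery(native, predicted):
    """Fraction of positions where the predicted residue equals the native one."""
    if len(native) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(n == p for n, p in zip(native, predicted))
    return matches / len(native)


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    for row in distance_map(coords):
        print([round(d, 1) for d in row])
    print(sequence_recovery("ACDE", "ACDF"))
```

`math.dist` requires Python 3.8 or later.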


Molecular insights on CALX-CBD12 inter-domain dynamics from MD simulations, RDCs and SAXS

10.1101/2020.12.18.423531 ◽

2020 ◽

Author(s):

Maximilia F. de Souza Degenhardt ◽

Phelipe A. M. Vitale ◽

Layara A. Abiko ◽

Martin Zacharias ◽

Michael Sattler ◽

...

Keyword(s):

Transmembrane Domain ◽

MD Simulations ◽

Bound State ◽

Conformational Ensemble ◽

Regulation Mechanism ◽

Multiple Sources ◽

Intracellular Loop ◽

SAXS Data ◽

Structure Information ◽

Apo State

Abstract: Na+/Ca2+ exchangers (NCX) are secondary active transporters that couple the translocation of Na+ with the transport of Ca2+ in the opposite direction. The exchanger is an essential Ca2+ extrusion mechanism in excitable cells. It consists of a transmembrane domain and a large intracellular loop that contains two Ca2+-binding domains, CBD1 and CBD2. The two CBDs are adjacent to each other and form a two-domain Ca2+-sensor called CBD12. Binding of intracellular Ca2+ to CBD12 activates the NCX but inhibits the Na+/Ca2+ exchanger of Drosophila, CALX. NMR spectroscopy and SAXS studies showed that CALX and NCX CBD12 constructs display significant inter-domain flexibility in the Apo state, but assume rigid inter-domain arrangements in the Ca2+-bound state. However, detailed structure information on CBD12 in the Apo state is missing. Structural characterization of proteins formed by two or more domains connected by flexible linkers is notoriously challenging and requires the combination of orthogonal information from multiple sources. As an attempt to characterize the conformational ensemble of CALX-CBD12 in the Apo state, we applied molecular dynamics (MD) simulations, NMR (1H-15N RDCs) and Small-Angle X-Ray Scattering (SAXS) data in a combined modelling strategy that generated atomistic information on the most representative conformations. This joint approach demonstrated that CALX-CBD12 preferentially samples closed conformations, while the wide-open inter-domain arrangement characteristic of the Ca2+-bound state is less frequently sampled. These results are consistent with the view that Ca2+ binding shifts the CBD12 conformational ensemble towards extended conformers, which could be a key step in the Na+/Ca2+ exchangers’ allosteric regulation mechanism. The present strategy, combining MD with NMR and SAXS, provides a powerful approach to select representative structures from ensembles of conformations, which could be applied to other flexible multi-domain systems.

Significance: The conformational ensemble of CALX-CBD12, the main Ca2+-sensor of Drosophila’s Na+/Ca2+ exchanger, was characterized by a combination of MD simulations with SAXS and NMR data using the EOM approach. This analysis showed that this two-domain construct experiences opening-closing motions, providing molecular information about CALX-CBD12 in the Apo state. Ca2+-binding shifts the conformational ensemble towards extended conformers. These findings are consistent with a model according to which Ca2+ modulation of CBD12 plasticity is a key step in the Ca2+-regulation mechanism of the full-length exchanger.


fRMSDAlign: PROTEIN SEQUENCE ALIGNMENT USING PREDICTED LOCAL STRUCTURE INFORMATION FOR PAIRS WITH LOW SEQUENCE IDENTITY

Proceedings of the 6th Asia-Pacific Bioinformatics Conference ◽

10.1142/9781848161092_0014 ◽

2007 ◽

Cited By ~ 1

Author(s):

HUZEFA RANGWALA ◽

GEORGE KARYPIS

Keyword(s):

Sequence Alignment ◽

Protein Sequence ◽

Local Structure ◽

Structure Information ◽

Protein Sequence Alignment ◽

Sequence Identity


A homology identification method that combines protein sequence and structure information

Protein Science ◽

10.1002/pro.5560071203 ◽

1998 ◽

Vol 7 (12) ◽

pp. 2499-2510 ◽

Cited By ~ 18

Author(s):

Lihua Yu ◽

James V. White ◽

Temple F. Smith

Keyword(s):

Protein Sequence ◽

Structure Information ◽

Identification Method


FIK Model: Novel Efficient Granular Computing Model for Protein Sequence Motifs and Structure Information Discovery

Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) ◽

10.1109/bibe.2006.253311 ◽

2006 ◽

Cited By ~ 11

Author(s):

Bernard Chen ◽

Phang Tai ◽

Robert Harrison ◽

Yi Pan

Keyword(s):

Protein Sequence ◽

Granular Computing ◽

Sequence Motifs ◽

Information Discovery ◽

Structure Information ◽

Computing Model ◽

Protein Sequence Motifs


Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants

10.21203/rs.3.rs-884099/v1 ◽

2021 ◽

Author(s):

Emidio Capriotti ◽

Piero Fariselli

Keyword(s):

Protein Sequence ◽

Prediction Models ◽

Evolutionary Information ◽

Large Set ◽

Primary Role ◽

Sequence Profile ◽

Sequence Alignments ◽

Multiple Sequence ◽

Missense Variants ◽

The Impact

Abstract: Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, the most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at the DNA and protein levels. We show that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than information obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. This observation indicates that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
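The abstract names wild-type and mutant residue frequencies as part of the encoding but does not list the three input features in full. As a rough illustration of the frequency part only, the Python sketch below reads both values from one column of a multiple sequence alignment. The function names, the gap character `-`, and the choice to ignore gaps when normalising are all assumptions made for this example, not details from the paper.

```python
from collections import Counter


def column_frequencies(msa, position):
    """Relative residue frequencies in one alignment column, ignoring gaps."""
    column = [seq[position] for seq in msa if seq[position] != "-"]
    if not column:
        return {}
    counts = Counter(column)
    return {aa: n / len(column) for aa, n in counts.items()}


def variant_frequency_features(msa, position, wild_type, mutant):
    """Return (wild-type frequency, mutant frequency) at the given column.

    Residues never observed in the column get a frequency of 0.0.
    """
    freqs = column_frequencies(msa, position)
    return freqs.get(wild_type, 0.0), freqs.get(mutant, 0.0)


if __name__ == "__main__":
    # Toy alignment of equal-length sequences; position 1 is the variant site.
    msa = ["ACD", "ACE", "AGD", "A-D"]
    print(variant_frequency_features(msa, 1, "C", "G"))
```

A high wild-type frequency combined with a low mutant frequency marks a substitution at a conserved site, which is the intuition behind using both values rather than conservation of the reference alone.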
