Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models

Mapping Intimacies ◽

10.1101/028936 ◽

2015 ◽

Cited By ~ 2

Author(s):

Hugo Jacquin ◽

Amy Gilson ◽

Eugene Shakhnovich ◽

Simona Cocco ◽

Rémi Monasson

Keyword(s):

Protein Structure ◽

Structural Information ◽

Sequence Data ◽

Careful Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Pairwise Models ◽

Statistical Approaches ◽

And Function

Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions about the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use lattice protein (LP) models to benchmark those inverse statistical approaches. We build MSAs of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSAs. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of the native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models, used as protein Hamiltonians for the design of new sequences, are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. These are remarkable results, as the effective LP Hamiltonians used to generate the MSAs are not simple pairwise models, owing to the competition between folds. Our findings elucidate the reasons for the power of inverse approaches to the modelling of proteins from sequence data, and their limitations; we show, in particular, that their success crucially depends on the accurate inference of the Potts pairwise couplings.
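To make the notion of a pairwise Potts Hamiltonian concrete, the following minimal sketch scores a sequence with site fields and pairwise couplings. It is not the authors' code: the function name `potts_energy`, the array layout of `h` and `J`, the integer encoding of residues, and the sign convention (lower energy means more favourable) are all assumptions made for illustration. Setting every coupling to zero reduces it to an independent-site model.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Energy of an integer-encoded sequence under a pairwise Potts model.

    Assumed convention (hypothetical, for illustration only):
        E(a) = - sum_i h[i, a_i] - sum_{i<j} J[i, j, a_i, a_j]
    h has shape (L, q); J has shape (L, L, q, q) and only i < j entries are read.
    """
    L = len(seq)
    # Site (field) contribution: one term per position.
    energy = -sum(h[i, seq[i]] for i in range(L))
    # Pairwise (coupling) contribution: one term per pair of positions i < j.
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, seq[i], seq[j]]
    return float(energy)

# Toy parameters: 3 positions, 2 residue types, one field and one coupling.
h = np.zeros((3, 2))
h[0, 1] = 1.0
J = np.zeros((3, 3, 2, 2))
J[0, 2, 1, 1] = 2.0

print(potts_energy([1, 0, 1], h, J))
print(potts_energy([0, 0, 1], h, J))
```

In this toy example the coupling between positions 0 and 2 only contributes when both carry residue type 1, which is the kind of dependence an independent-site model cannot express.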


fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

10.1101/2021.12.20.473431 ◽

2021 ◽

Author(s):

Liang Hong ◽

Siqi Sun ◽

Liangzhen Zheng ◽

Qingxiong Tan ◽

Yu Li

Keyword(s):

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Structure And Function ◽

Sequence Alignments ◽

Protein Structure And Function ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

And Function

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which includes homolog searching in large databases and sequence alignment, is an effective way to extract this information and to assist protein structure and function prediction, as demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs in the entire UniProt is still time-consuming. Given the success of AlphaFold, large-scale multiple sequence alignments against massive databases will foreseeably become a trend in the field, so accelerating this step is highly desirable. Here, we propose a novel method, fastMSA, that improves the speed significantly. Our idea is orthogonal to all previous acceleration methods. Taking advantage of a protein language model based on BERT, we propose a novel dual encoder architecture that embeds protein sequences into a low-dimensional space and efficiently filters out unrelated sequences before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks, such as structure prediction using AlphaFold. Using multiple sequence alignments generated by our method, we observe little loss in protein structure prediction performance with much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
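The filtering step described above can be pictured as a nearest-neighbour search in embedding space. The sketch below is not the fastMSA implementation: it assumes the embeddings have already been produced by some encoder, and the function name `top_k_candidates` and the use of cosine similarity as the score are illustrative choices that the abstract does not specify.

```python
import numpy as np

def top_k_candidates(query_vec, db_vecs, k):
    """Return indices and scores of the k database embeddings closest to the query.

    Cosine similarity is an assumed scoring choice; the embeddings themselves
    are taken as given (the encoder is out of scope for this sketch).
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    scores = d @ q                       # one cosine similarity per database entry
    k = min(k, len(scores))
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]

# Toy 2-dimensional embeddings for four database sequences.
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

idx, sims = top_k_candidates(query, db, k=2)
print(idx, sims)
```

Only the retained candidates would then be passed to the slower alignment stage, which is where a speed-up of this kind would come from.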


Multiple Sequence Alignments as Tools for Protein Structure and Function Prediction

Comparative and Functional Genomics ◽

10.1002/cfg.313 ◽

2003 ◽

Vol 4 (4) ◽

pp. 424-427 ◽

Cited By ~ 4

Author(s):

Alfonso Valencia

Keyword(s):

Protein Structure ◽

Protein Interactions ◽

Protein Interaction Networks ◽

Interaction Networks ◽

Structure Evolution ◽

Protein Protein Interactions ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

And Function

Multiple sequence alignments have much to offer to the understanding of protein structure, evolution and function. We are developing approaches to use this information in predicting protein-binding specificity, intra-protein and protein-protein interactions, and in reconstructing protein interaction networks.


VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽

10.1093/bioinformatics/btz689 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jun Wang ◽

Pu-Feng Du ◽

Xin-Yu Xue ◽

Guang-Ping Li ◽

Yuan-Ke Zhou ◽

...

Keyword(s):

Sequence Data ◽

Software Tool ◽

Data Retrieval ◽

Supplementary Information ◽

Statistical Features ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Source Codes ◽

Multiple Sequence Alignments

Abstract. Summary: Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and to understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows the users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation: VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source codes of VisFeature are freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.


Epistatic contributions promote the unification of incompatible models of neutral molecular evolution

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1913071117 ◽

2020 ◽

Vol 117 (11) ◽

pp. 5873-5882 ◽

Cited By ~ 1

Author(s):

Jose Alberto de la Paz ◽

Charisse M. Nartey ◽

Monisha Yuvaraj ◽

Faruck Morcos

Keyword(s):

Structural Information ◽

Stokes Shift ◽

Neutral Evolution ◽

Emergent Properties ◽

Sequence Evolution ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Analysis Methodology ◽

The Relationship

We introduce a model of amino acid sequence evolution that accounts for the statistical behavior of real sequences induced by epistatic interactions. We base the model dynamics on parameters derived from multiple sequence alignments analyzed by using direct coupling analysis methodology. Known statistical properties such as overdispersion, heterotachy, and gamma-distributed rate-across-sites are shown to be emergent properties of this model while being consistent with neutral evolution theory, thereby unifying observations from previously disjointed evolutionary models of sequences. The relationship between site restriction and heterotachy is characterized by tracking the effective alphabet dynamics of sites. We also observe an evolutionary Stokes shift in the fitness of sequences that have undergone evolution under our simulation. By analyzing the structural information of some proteins, we corroborate that the strongest Stokes shifts derive from sites that physically interact in networks near biochemically important regions. Perspectives on the implementation of our model in the context of the molecular clock are discussed.


A minimum reporting standard for multiple sequence alignments

10.1101/2020.01.15.907733 ◽

2020 ◽

Author(s):

Thomas KF Wong ◽

Subha Kalyaanamoorthy ◽

Karen Meusemann ◽

David K Yeates ◽

Bernhard Misof ◽

...

Keyword(s):

Amino Acids ◽

Sequence Data ◽

Pivotal Role ◽

Sequence Alignments ◽

Reporting Standard ◽

Multiple Sequence ◽

Molecular Sequence Data ◽

Molecular Sequence ◽

Multiple Sequence Alignments

Abstract Multiple sequence alignments (MSAs) play a pivotal role in studies of molecular sequence data, but nobody has developed a minimum reporting standard (MRS) to quantify the completeness of MSAs in terms of completely-specified nucleotides or amino acids. We present an MRS that relies on four simple completeness metrics. The metrics are implemented in AliStat, a program developed to support the MRS. A survey of published MSAs illustrates the benefits and unprecedented transparency offered by the MRS.


A minimum reporting standard for multiple sequence alignments

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa024 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 8

Author(s):

Thomas K F Wong ◽

Subha Kalyaanamoorthy ◽

Karen Meusemann ◽

David K Yeates ◽

Bernhard Misof ◽

...

Keyword(s):

Amino Acids ◽

Sequence Data ◽

Pivotal Role ◽

Sequence Alignments ◽

Reporting Standard ◽

Multiple Sequence ◽

Molecular Sequence Data ◽

Molecular Sequence ◽

Multiple Sequence Alignments

Abstract Multiple sequence alignments (MSAs) play a pivotal role in studies of molecular sequence data, but nobody has developed a minimum reporting standard (MRS) to quantify the completeness of MSAs in terms of completely specified nucleotides or amino acids. We present an MRS that relies on four simple completeness metrics. The metrics are implemented in AliStat, a program developed to support the MRS. A survey of published MSAs illustrates the benefits and unprecedented transparency offered by the MRS.
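The abstract does not define the four completeness metrics, so the following is only one plausible reading, not AliStat's implementation: the fraction of completely specified characters in the whole alignment, in each sequence, in each site, and shared by a pair of sequences. The function names, the default nucleotide alphabet, and the treatment of gaps and ambiguity codes as "not completely specified" are assumptions.

```python
def specified_mask(msa, alphabet="ACGT"):
    """Mark each character as completely specified (True) or not (False).

    Anything outside the alphabet (gaps, N, other ambiguity codes) counts as
    not completely specified; this is an assumption of the sketch.
    """
    return [[ch.upper() in alphabet for ch in seq] for seq in msa]

def completeness(msa, alphabet="ACGT"):
    """Return (alignment, per-sequence, per-site) completeness fractions."""
    mask = specified_mask(msa, alphabet)
    n_seq, n_site = len(mask), len(mask[0])
    c_alignment = sum(sum(row) for row in mask) / (n_seq * n_site)
    c_sequences = [sum(row) / n_site for row in mask]
    c_sites = [sum(mask[i][j] for i in range(n_seq)) / n_seq for j in range(n_site)]
    return c_alignment, c_sequences, c_sites

def pair_completeness(msa, i, j, alphabet="ACGT"):
    """Fraction of sites completely specified in both sequence i and sequence j."""
    mask = specified_mask(msa, alphabet)
    return sum(a and b for a, b in zip(mask[i], mask[j])) / len(mask[i])

# Toy alignment of three nucleotide sequences with gaps and one ambiguity code.
msa = ["ACGT", "A-GN", "AC--"]
c_a, c_seq, c_site = completeness(msa)
print(c_a, c_seq, c_site, pair_completeness(msa, 1, 2))
```

For amino-acid alignments the `alphabet` argument would be replaced by the 20 standard residue letters.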


Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns

Bioinformatics ◽

10.1093/bioinformatics/btt360 ◽

2013 ◽

Vol 29 (17) ◽

pp. 2112-2121 ◽

Cited By ~ 31

Author(s):

Francisco M. Ortuño ◽

Olga Valenzuela ◽

Fernando Rojas ◽

Hector Pomares ◽

Javier P. Florido ◽

...

Keyword(s):

Genetic Algorithm ◽

Structural Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments


The Haemophilus influenzae hFbpABC Fe3+ Transporter: Analysis of the Membrane Permease and Development of a Gallium-Based Screen for Mutants

Journal of Bacteriology ◽

10.1128/jb.00145-07 ◽

2007 ◽

Vol 189 (14) ◽

pp. 5130-5141 ◽

Cited By ~ 12

Author(s):

Damon S. Anderson ◽

Pratima Adhikari ◽

Katherine D. Weaver ◽

Alvin L. Crumbliss ◽

Timothy A. Mietzner

Keyword(s):

Haemophilus Influenzae ◽

Binding Protein ◽

Essential Element ◽

Transport Activity ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Discernible Effect ◽

Iron Binding Protein ◽

And Function

ABSTRACT The obligate human pathogen Haemophilus influenzae utilizes a siderophore-independent (free) Fe3+ transport system to obtain this essential element from the host iron-binding protein transferrin. The hFbpABC transporter is a binding protein-dependent ABC transporter that functions to shuttle (free) Fe3+ through the periplasm and across the inner membrane of H. influenzae. This investigation focuses on the structure and function of the hFbpB membrane permease component of the transporter, a protein that has eluded prior characterization. Based on multiple-sequence alignments between permease orthologs, a series of site-directed mutations targeted at residues within the two conserved permease motifs were generated. The hFbpABC transporter was expressed in a siderophore-deficient Escherichia coli background, and effects of mutations were analyzed using growth rescue and radiolabeled 55Fe3+ transport assays. Results demonstrate that mutation of the invariant glycine (G418A) within motif 2 led to attenuated transport activity, while mutation of the invariant glycine (G155A/V/E) within motif 1 had no discernible effect on activity. Individual mutations of well-conserved leucines (L154D and L417D) led to attenuated and null transport activities, respectively. As a complement to site-directed methods, a mutant screen based on resistance to the toxic iron analog gallium, an hFbpABC inhibitor, was devised. The screen led to the identification of several significant hFbpB mutations; V497I, I174F, and S475I led to null transport activities, while S146Y resulted in attenuated activity. Significant residues were mapped to a topological model of the hFbpB permease, and the implications of mutations are discussed in light of structural and functional data from related ABC transporters.


Bioinformatic analysis of riboswitch structures uncovers variant classes with altered ligand specificity

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1619581114 ◽

2017 ◽

Vol 114 (11) ◽

pp. E2077-E2085 ◽

Cited By ~ 42

Author(s):

Zasha Weinberg ◽

James W. Nelson ◽

Christina E. Lünse ◽

Madeline E. Sherlock ◽

Ronald R. Breaker

Keyword(s):

Structural Information ◽

Bioinformatic Analysis ◽

Flavin Mononucleotide ◽

Ligand Specificity ◽

Sequence Alignments ◽

Multiple Sequence ◽

Form Complex ◽

Multiple Sequence Alignments ◽

Bacterial Signaling ◽

Folded Structures

Riboswitches are RNAs that form complex, folded structures that selectively bind small molecules or ions. As with certain groups of protein enzymes and receptors, some riboswitch classes have evolved to change their ligand specificity. We developed a procedure to systematically analyze known riboswitch classes to find additional variants that have altered their ligand specificity. This approach uses multiple-sequence alignments, atomic-resolution structural information, and riboswitch gene associations. Among the discoveries are unique variants of the guanine riboswitch class that most tightly bind the nucleoside 2′-deoxyguanosine. In addition, we identified variants of the glycine riboswitch class that no longer recognize this amino acid, additional members of a rare flavin mononucleotide (FMN) variant class, and also variants of c-di-GMP-I and -II riboswitches that might recognize different bacterial signaling molecules. These findings further reveal the diverse molecular sensing capabilities of RNA, which highlights the potential for discovering a large number of additional natural riboswitch classes.


Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽

10.1186/s12859-021-04362-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edward J. Martin ◽

Thomas R. Meagher ◽

Daniel Barker

Keyword(s):

Focus Group ◽

User Experience ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Future Directions ◽

Multiple Sequence Alignments ◽

Protein Sequence Data

Abstract. Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.
