Link Your Sites (LYS) Scripts: Automated search of protein structures and mapping of sites under positive selection detected by PAML

Mapping Intimacies ◽

10.1101/540229 ◽

2019 ◽

Author(s):

Lys Sanz Moreta ◽

Rute R. da Fonseca

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Positive Selection ◽

Protein Structures ◽

Comparative Genomic ◽

Functional Domain ◽

Homologous Proteins ◽

Functional Impact ◽

Codon Substitution ◽

The Impact

Abstract: The visualization of the molecular context of an amino acid mutation in a protein structure is crucial for assessing its functional impact and understanding its evolutionary implications. Currently, searches for fast-evolving amino acid positions using codon substitution models such as those implemented in PAML [1] are run on almost complete proteomes, generating large numbers of candidate proteins that require individual structural analyses. Here we present two Python wrapper scripts as the package Link Your Sites (LYS). The first one i) mines the RCSB database [10] using the BLAST alignment tool to find the best-matching homologous sequences, ii) fetches their domain positions using Prosites [3,8,9], iii) parses the output of PAML, extracting the positional information of fast-evolving sites and transforming it into the coordinate system of the protein structure, and iv) outputs one file per gene with the position correspondences to its homologous sequence. The second script uses the output of the first to generate a graphical representation of the protein. LYS can therefore generate publication-ready figures highlighting the positively selected sites mapped onto regions of known functional relevance, and/or reduce the number of targets to be analyzed further by providing a list of those for which structural information can be retrieved.

Motivation: Automating the search for protein structures to assess the functional impact of sites found to be under positive selection by codeml, implemented in PAML [1], and building publication-quality figures highlighting the sites on a protein structure model that are within and outside functional domains, reduces the workload associated with selecting proteins for which a functional assessment of the impact of mutations can be done using a protein structure. This is especially relevant when analyzing almost complete proteomes, as is the case in large comparative genomic studies.

Software: LYS scripts are executed from the command line. They automatically search for homologous proteins in the RCSB database [10], determine the functional domain locations, correlate the positions identified by the M8 model [1], and output a data frame that PyMOL [7] can use as input to generate a visualization of the results.

Availability: LYS is easy to install and use, and is available at https://github.com/LysSanzMoreta/LYSAutomaticSearch
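Step iii) of the abstract, transforming site positions into the coordinate system of the protein structure, can be illustrated with a minimal Python sketch. This is not the LYS implementation: the function name `map_sites_to_structure`, the gapped-string input format, and the `target_start` offset are all assumptions made for illustration, and the pairwise alignment is assumed to have been computed already (for example by BLAST).

```python
def map_sites_to_structure(aligned_query, aligned_target, sites, target_start=1):
    """Map 1-based positions in the query to 1-based positions in the target.

    aligned_query / aligned_target: two rows of a pairwise alignment,
    equal length, with '-' marking gaps (hypothetical input format).
    sites: query positions of interest, e.g. sites reported by codeml.
    target_start: numbering of the first target residue in the alignment.
    Returns a dict site -> target position, or None when the site falls
    in a gap of the target or outside the aligned region.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")

    mapping = {}
    q_pos = 0                  # residues of the query consumed so far
    t_pos = target_start - 1   # residues of the target consumed so far
    for q_char, t_char in zip(aligned_query, aligned_target):
        if q_char != "-":
            q_pos += 1
        if t_char != "-":
            t_pos += 1
        if q_char != "-":
            # A query residue facing a target gap has no structural position.
            mapping[q_pos] = t_pos if t_char != "-" else None
    return {site: mapping.get(site) for site in sites}


if __name__ == "__main__":
    query = "MK-TAYIA"   # sequence analysed for selection (toy example)
    target = "MKQT-YIA"  # sequence of the homologous structure (toy example)
    print(map_sites_to_structure(query, target, [3, 4, 5]))
```

In this toy alignment, query site 3 maps to target residue 4, site 4 has no counterpart because it faces a gap, and site 5 maps to residue 5. Real structure files often use residue numbering that does not start at 1 or that skips residues, which the single `target_start` offset only partly covers.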


Comprehensive characterization of amino acid positions in protein structures reveals molecular effect of missense variants

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2002660117 ◽

2020 ◽

Vol 117 (45) ◽

pp. 28201-28211

Author(s):

Sumaiya Iqbal ◽

Eduardo Pérez-Palma ◽

Jakob B. Jespersen ◽

Patrick May ◽

David Hoksza ◽

...

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Molecular Mechanisms ◽

Amino Acid Level ◽

Protein Structures ◽

Point Mutations ◽

Independent Set ◽

Clinical Genetics ◽

Missense Variants ◽

The Impact

Interpretation of the colossal number of genetic variants identified from sequencing applications is one of the major bottlenecks in clinical genetics, with the inference of the effect of amino acid-substituting missense variations on protein structure and function being especially challenging. Here we characterize the three-dimensional (3D) amino acid positions affected in pathogenic and population variants from 1,330 disease-associated genes using over 14,000 experimentally solved human protein structures. By measuring the statistical burden of variations (i.e., point mutations) from all genes on 40 3D protein features, accounting for the structural, chemical, and functional context of the variations’ positions, we identify features that are generally associated with pathogenic and population missense variants. We then perform the same amino acid-level analysis individually for 24 protein functional classes, which reveals unique characteristics of the positions of the altered amino acids: We observe up to 46% divergence of the class-specific features from the general characteristics obtained by the analysis on all genes, which is consistent with the structural diversity of essential regions across different protein classes. We demonstrate that the function-specific 3D features of the variants match the readouts of mutagenesis experiments for BRCA1 and PTEN, and positively correlate with an independent set of clinically interpreted pathogenic and benign missense variants. Finally, we make our results available through a web server to foster accessibility and downstream research. Our findings represent a crucial step toward translational genetics, from highlighting the impact of mutations on protein structure to rationalizing the variants’ pathogenicity in terms of the perturbed molecular mechanisms.


Link Your Sites (LYS.py): Coupling your PAML codeml results and homologous protein structures in PyMOL

10.1101/380394 ◽

2018 ◽

Author(s):

Lys Sanz Moreta ◽

Rute Andreia Rodrigues da Fonseca

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Protein Structures ◽

Visualization Tool ◽

Amino Acid Mutation ◽

Homologous Protein ◽

Codon Substitution ◽

Large Numbers ◽

Molecular Context ◽

Positively Selected Sites


CoRINs: A tool to compare residue interaction networks from homologous proteins and conformers

10.1101/2020.06.29.178541 ◽

2020 ◽

Author(s):

Felipe V. da Fonseca ◽

Romildo O. Souza Júnior ◽

Marília V. A. de Almeida ◽

Thiago D. Soares ◽

Diego A. A. Morais ◽

...

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Conformational Changes ◽

Protein Function ◽

Protein Structures ◽

Software Tool ◽

Interaction Networks ◽

Homologous Proteins ◽

Residue Interaction ◽

And Function

Abstract

Motivation: A useful approach to evaluate protein structure and quickly visualize crucial physicochemical interactions related to protein function is to construct Residue Interaction Networks (RINs). In this application of graph theory, the amino acid residues constitute the nodes, and the edges represent their interactions with other structural elements. Although several tools that construct RINs are available, many of them do not compare RINs from distinct protein structures. This comparison can give valuable insights into conformational changes and the effects of amino acid substitutions on protein structure and function. With that in mind, we present CoRINs (Comparator of Residue Interaction Networks), a software tool that extensively compares RINs. The program has an accessible and user-friendly web interface, which summarizes the differences in several network parameters using interactive plots and tables. As a usage example of CoRINs, we compared RINs from conformers of two cancer-associated proteins.

Availability: The program is available at https://github.com/LasisUFRN/CoRINs.
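The idea of building and comparing residue interaction networks can be sketched in a few lines of Python. This is an illustrative simplification, not the CoRINs method: it assumes one representative coordinate per residue and defines an edge by a plain distance cutoff (the 8.0 value is an arbitrary assumption), whereas the abstract refers to physicochemical interactions and several network parameters. The names `build_rin`, `compare_rins`, and `degree` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def build_rin(coords, cutoff=8.0):
    """Return the set of residue pairs whose coordinates lie within `cutoff`.

    coords: dict residue_number -> (x, y, z), one point per residue
    (an assumption; real RINs use typed atomic interactions).
    """
    edges = set()
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(coords.items()), 2):
        if dist(xyz_a, xyz_b) <= cutoff:
            edges.add((res_a, res_b))  # res_a < res_b because items are sorted
    return edges


def compare_rins(edges_a, edges_b):
    """Split the edges of two networks into shared and network-specific sets."""
    return {
        "shared": edges_a & edges_b,
        "only_a": edges_a - edges_b,
        "only_b": edges_b - edges_a,
    }


def degree(edges):
    """Count how many edges touch each residue (node degree)."""
    counts = Counter()
    for res_a, res_b in edges:
        counts[res_a] += 1
        counts[res_b] += 1
    return dict(counts)


if __name__ == "__main__":
    conformer_a = {1: (0, 0, 0), 2: (5, 0, 0), 3: (10, 0, 0)}
    conformer_b = {1: (0, 0, 0), 2: (5, 0, 0), 3: (5, 6, 0)}
    diff = compare_rins(build_rin(conformer_a), build_rin(conformer_b))
    print(diff["only_b"])
```

In the toy data, residue 3 moves closer to residue 1 in the second conformer, so the edge between residues 1 and 3 appears only in that network. Edge-set differences and per-node degree changes are among the simplest quantities such a comparison can report.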


Codon harmonization reduces amino acid misincorporation in bacterially expressed P. falciparum proteins and improves their immunogenicity

AMB Express ◽

10.1186/s13568-019-0890-6 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Neeraja Punde ◽

Jennifer Kooken ◽

Dagmar Leary ◽

Patricia M. Legler ◽

Evelina Angov

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Codon Usage ◽

Dna Sequences ◽

Structural Integrity ◽

Host Cells ◽

Loss Of Function ◽

Species Specific ◽

And Function ◽

The Impact

Abstract: Codon usage frequency influences protein structure and function. The frequency with which codons are used potentially impacts primary, secondary and tertiary protein structure. Poor expression, loss of function, insolubility, or truncation can result from species-specific differences in codon usage. “Codon harmonization” more closely aligns native codon usage frequencies with those of the expression host, particularly within putative inter-domain segments where slower rates of translation may play a role in protein folding. Heterologous expression of Plasmodium falciparum genes in Escherichia coli has been a challenge due to their AT-rich codon bias and highly repetitive DNA sequences. Here, codon harmonization was applied to the malarial antigen CelTOS (Cell-traversal protein for ookinetes and sporozoites). CelTOS is a highly conserved P. falciparum protein involved in cellular traversal through mosquito and vertebrate host cells. It reversibly refolds after thermal denaturation, making it a desirable malarial vaccine candidate. Protein expressed in E. coli from a codon-harmonized sequence of P. falciparum CelTOS (CH-PfCelTOS) was compared with protein expressed from the native codon sequence (N-PfCelTOS) to assess the impact of codon usage on protein expression levels, solubility, yield, stability, structural integrity, recognition by CelTOS-specific mAbs, and immunogenicity in mice. While the translated proteins were expected to be identical, the products translated from the codon-harmonized sequence differed in helical content and showed a smaller distribution of polypeptides in mass spectra, indicating lower heterogeneity of the codon-harmonized version and fewer amino acid misincorporations. Hydrophobic-to-hydrophobic amino acid substitutions were observed more commonly than any other. CH-PfCelTOS induced significantly higher antibody levels compared with N-PfCelTOS; however, no significant differences in either IFN-γ or IL-4 cellular responses were detected between the two antigens.


Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites

Genetics ◽

10.1093/genetics/155.1.431 ◽

2000 ◽

Vol 155 (1) ◽

pp. 431-449 ◽

Cited By ~ 41

Author(s):

Ziheng Yang ◽

Rasmus Nielsen ◽

Nick Goldman ◽

Anne-Mette Krabbe Pedersen

Keyword(s):

Amino Acid ◽

Positive Selection ◽

Selective Pressure ◽

Acid Sites ◽

Data Sets ◽

Protein Coding ◽

Important Indicator ◽

Diversifying Selection ◽

Codon Substitution ◽

Neutral Mutations

Abstract: Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (ω = dN/dS) is an important indicator of selective pressure at the protein level, with ω = 1 meaning neutral mutations, ω < 1 purifying selection, and ω > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying ω ratios. We develop models that account for heterogeneous ω ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of ω among sites. In all data sets analyzed, the selective pressure indicated by the ω ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average ω ratio across sites is <1, but in which some sites are clearly under diversifying selection with ω > 1. Genes undergoing positive selection include the β-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for ω and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of ω among sites from real data sets.
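The interpretation of ω given in the abstract can be written down directly. The Python sketch below only encodes the three regimes (ω = 1, ω < 1, ω > 1). It does not reproduce the likelihood-based site models of the paper, and the function names and the numerical tolerance `tol` are assumptions made for illustration.

```python
def omega(dn, ds):
    """Nonsynonymous/synonymous rate ratio, omega = dN / dS."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    return dn / ds


def selection_regime(w, tol=1e-9):
    """Label a ratio using the interpretation stated in the abstract.

    tol is a hypothetical tolerance so that floating-point values very
    close to 1 are reported as neutral.
    """
    if abs(w - 1.0) <= tol:
        return "neutral"
    return "purifying" if w < 1.0 else "diversifying"


if __name__ == "__main__":
    for dn, ds in [(0.02, 0.10), (0.10, 0.10), (0.30, 0.10)]:
        w = omega(dn, ds)
        print(f"dN={dn} dS={ds} omega={w:.2f} -> {selection_regime(w)}")
```

A gene-wide average ratio below 1 does not rule out individual sites with a ratio above 1, which is the heterogeneity the models in this paper are designed to detect. In practice a point estimate compared against 1 is not a statistical test; the paper relies on model comparison for that.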


Modeling to Understand Plant Protein Structure-Function Relationships—Implications for Seed Storage Proteins

Molecules ◽

10.3390/molecules25040873 ◽

2020 ◽

Vol 25 (4) ◽

pp. 873 ◽

Cited By ~ 3

Author(s):

Faiza Rasheed ◽

Joel Markgren ◽

Mikael Hedenqvist ◽

Eva Johansson

Keyword(s):

Protein Structure ◽

Structure Function ◽

Storage Proteins ◽

Protein Structures ◽

Seed Storage ◽

Seed Storage Proteins ◽

Plant Proteins ◽

Modeling Tools ◽

Intrinsically Disordered ◽

The Impact

Proteins are among the most important molecules on Earth. Their structure and aggregation behavior are key to their functionality in living organisms and in protein-rich products. Innovations such as increased computer size and power, together with novel simulation tools, have improved our understanding of protein structure-function relationships. This review focuses on various proteins present in plants and on modeling tools that can be applied to better understand protein structures and their relationship to functionality, with particular emphasis on plant storage proteins. Modeling of plant proteins is increasing, but fewer than 9% of deposits in the Research Collaboratory for Structural Bioinformatics Protein Data Bank come from plant proteins. Although the same tools are applied as for other proteins, modeling of plant proteins is lagging behind and innovative methods are rarely used. Molecular dynamics and molecular docking are commonly used to evaluate differences between forms or mutants and their impact on functionality. Modeling tools have also been used to describe the photosynthetic machinery and its electron transfer reactions. Storage proteins, especially the large and intrinsically disordered prolamins and glutelins, have been significantly less well described using modeling. These proteins aggregate during processing and form large polymers that correlate with functionality. The resulting structure-function relationships are important for processed storage proteins, so modeling and simulation studies using up-to-date models, algorithms, and computer tools are essential for obtaining a better understanding of these relationships.


PanFunPro: PAN-genome analysis based on FUNctional PROfiles

F1000Research ◽

10.12688/f1000research.2-265.v1 ◽

2013 ◽

Vol 2 ◽

pp. 265 ◽

Cited By ~ 15

Author(s):

Oksana Lukjancenko ◽

Martin Christen Thomsen ◽

Mette Voldby Larsen ◽

David Wayne Ussery

Keyword(s):

Genome Analysis ◽

Markov Models ◽

Epidemiological Studies ◽

Comparative Genomic ◽

Functional Domain ◽

Homologous Proteins ◽

Pan Genome ◽

Genomic Study ◽

Comparative Genomic Study ◽

Functional Profiles

PanFunPro is a tool for pan-genome analysis that integrates functional domains from three Hidden Markov Models (HMM) collections, and uses this information to group homologous proteins into families based on functional domain content. We use PanFunPro to compare a set of Lactobacillus and Streptococcus genomes. The example demonstrates that this method can provide analysis of differences and similarities in protein content within user-defined sets of genomes. PanFunPro can find various applications in a comparative genomic study, starting with the basic comparison of newly sequenced isolates to already existing strains, and an estimation of shared and specific genomic content. Furthermore, it can potentially be used in the determination of target sequences for in silico bacterial identification, as well as for epidemiological studies.
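Grouping proteins into families by functional domain content can be sketched as follows. This is an illustrative Python example and not the PanFunPro code: it assumes that domain hits per protein are already available (for instance from HMM searches), treats a family as the unordered set of domain identifiers, and uses made-up protein and domain names.

```python
from collections import defaultdict


def group_by_domain_content(protein_domains):
    """Group proteins that carry exactly the same set of domains.

    protein_domains: dict protein_id -> iterable of domain identifiers.
    Proteins without any domain hit are skipped in this sketch.
    Returns dict frozenset(domains) -> sorted list of protein ids.
    """
    families = defaultdict(list)
    for protein, domains in protein_domains.items():
        if domains:
            families[frozenset(domains)].append(protein)
    return {key: sorted(members) for key, members in families.items()}


def shared_and_specific(families, genome_of):
    """Split families into those found in every genome and the rest.

    genome_of: dict protein_id -> genome name (hypothetical bookkeeping).
    """
    all_genomes = set(genome_of.values())
    shared, specific = [], []
    for key, members in families.items():
        present_in = {genome_of[p] for p in members}
        (shared if present_in == all_genomes else specific).append(key)
    return shared, specific


if __name__ == "__main__":
    hits = {
        "gA_p1": ["D1", "D2"],
        "gB_p7": ["D2", "D1"],
        "gA_p2": ["D3"],
        "gB_p9": [],
    }
    genomes = {"gA_p1": "A", "gB_p7": "B", "gA_p2": "A", "gB_p9": "B"}
    fams = group_by_domain_content(hits)
    print(shared_and_specific(fams, genomes))
```

Using an unordered set means that two proteins with the same domains in a different order fall into one family. Whether order or copy number should matter is a design choice that this sketch does not settle.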


Hermes: an ensemble machine learning architecture for protein secondary structure prediction

10.1101/640656 ◽

2019 ◽

Author(s):

Larry Bliss ◽

Ben Pascoe ◽

Samuel K Sheppard

Keyword(s):

Machine Learning ◽

Protein Structure ◽

Secondary Structure ◽

Structure Prediction ◽

Cross Validation ◽

Secondary Structure Prediction ◽

Protein Structures ◽

Lower Boundary ◽

Protein Secondary Structure ◽

Homologous Proteins

Abstract

Motivation: Protein structure prediction, which combines theoretical chemistry and bioinformatics, is an increasingly important technique in biotechnology and biomedical research, for example in the design of novel enzymes and drugs. Here, we present a new ensemble bi-layered machine learning architecture that builds directly on ten existing pipelines, providing rapid, high-accuracy, 3-state secondary structure prediction of proteins.

Results: After training on 1348 solved protein structures, we evaluated the model with four independent datasets: JPRED4, compiled by the authors of the successful predictor of the same name, and CASP11, CASP12 and CASP13, assembled by the Critical Assessment of protein Structure Prediction consortium, which runs biannual experiments focused on objective testing of predictors. These rigorous, pre-established protocols included 7-fold cross-validation and blind testing. This led to a mean Hermes accuracy of 95.5%, significantly (p<0.05) better than the ten previously published models analysed in this paper. Furthermore, Hermes yielded a reduction in standard deviation, fewer lower-boundary outliers, and reduced dependency on solved structures of homologous proteins, as measured by NEFF score. This architecture provides advantages over other pipelines, while remaining accessible to users at any level of bioinformatics experience.

Availability and Implementation: The source code for Hermes is freely available at: https://github.com/HermesPrediction/Hermes. This page also includes the cross-validation with corresponding models, and all training/testing data presented in this study with predictions and accuracy.


PREMONITION - Preprocessing motifs in protein structures for search acceleration

F1000Research ◽

10.12688/f1000research.5166.1 ◽

2014 ◽

Vol 3 ◽

pp. 217 ◽

Cited By ~ 3

Author(s):

Sandeep Chakraborty ◽

Basuthkar J. Rao ◽

Bjarni Asgeirsson ◽

Ravindra Venkatramani ◽

Abhaya M. Dandekar

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Active Site ◽

Active Sites ◽

Protein Structures ◽

3D Structure ◽

Search Space ◽

Computational Method ◽

Computational Time ◽

Active Site Residues

The remarkable diversity in biological systems is rooted in the ability of the twenty naturally occurring amino acids to perform multifarious catalytic functions by creating unique structural scaffolds known as the active site. Finding such structural motifs within the protein structure is a key aspect of many computational methods. The algorithm for obtaining combinations of motifs of a certain length, although polynomial in complexity, runs in non-trivial computer time. Also, the search space expands considerably if stereochemically equivalent residues are allowed to replace an amino acid in the motif. In the present work, we propose a method to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other (PREMONITION). PREMONITION rolls a sphere of radius R along the protein fold centered at the C atom of each residue, and all possible motifs are extracted within this sphere. The number of residues that can occur within a sphere centered around a residue is bounded by physical constraints, thus setting an upper limit on the processing times. After such a pre-compilation step, the computational time required for querying a protein structure with multiple motifs is considerably reduced. Previously, we had proposed a computational method to estimate the promiscuity of proteins with known active site residues and 3D structure using a database of known active sites in proteins (CSA) by querying each protein with the active site motif of every other residue. The runtime for such a comparison is reduced from days to hours using the PREMONITION methodology.
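The sphere-based precompilation described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the PREMONITION implementation: each residue is reduced to one coordinate, a brute-force distance scan replaces any spatial indexing, and the function name `precompile_motifs` and the `allowed` filter are hypothetical. Note that residues sharing a sphere of radius R can be up to 2R apart from each other, so this follows the sphere description rather than a strict pairwise-distance criterion.

```python
from itertools import combinations
from math import dist


def precompile_motifs(residues, radius, n=4, allowed=None):
    """Enumerate all n-residue motifs found inside a sphere around each residue.

    residues: dict residue_id -> (amino_acid_code, (x, y, z)).
    radius: sphere radius R, in the same units as the coordinates.
    allowed: optional set of amino acid codes that may appear in a motif.
    Returns a set of sorted residue-id tuples, deduplicated across spheres.
    """
    motifs = set()
    for _, (_, center_xyz) in residues.items():
        # Residues inside the sphere centred on the current residue.
        in_sphere = sorted(
            res_id
            for res_id, (aa, xyz) in residues.items()
            if dist(center_xyz, xyz) <= radius
            and (allowed is None or aa in allowed)
        )
        # The sphere bounds how many residues can be combined, which
        # bounds the number of combinations generated per centre.
        for combo in combinations(in_sphere, n):
            motifs.add(combo)
    return motifs


if __name__ == "__main__":
    toy = {
        1: ("H", (0, 0, 0)),
        2: ("D", (3, 0, 0)),
        3: ("S", (0, 3, 0)),
        4: ("G", (3, 3, 0)),
        5: ("H", (20, 0, 0)),
    }
    print(sorted(precompile_motifs(toy, radius=5.0, n=3)))
```

Once the motif set is stored, querying a structure for a given motif becomes a lookup instead of a fresh combinatorial search, which is the source of the speed-up the abstract reports.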


Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution

10.1101/2021.04.13.439703 ◽

2021 ◽

Author(s):

Chris Papadopoulos ◽

Isabelle Callebaut ◽

Jean-Christophe Gelly ◽

Isabelle Hatin ◽

Olivier Namy ◽

...

Keyword(s):

Protein Structure ◽

Amino Acid ◽

De Novo ◽

Protein Structures ◽

Structural Diversity ◽

Building Blocks ◽

Amino Acid Sequences ◽

Novel Genes ◽

Noncoding Sequences ◽

De Novo Gene

The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how the properties of noncoding sequences could promote the birth of novel genes and shape the evolution and structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the diversity of fold potential of the amino acid sequences encoded by all intergenic ORFs (Open Reading Frames) of S. cerevisiae, with the aim of (i) exploring whether the large structural diversity observed in proteomes is already present in noncoding sequences, and (ii) estimating the potential of the noncoding genome to produce novel protein bricks that can either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they encompass the large structural diversity of canonical proteins and, strikingly, the majority are predicted to be foldable. Then, we investigated the early stages of de novo gene birth by identifying intergenic ORFs with a strong translation signal in ribosome profiling experiments and by reconstructing the ancestral sequences of 70 yeast de novo genes. This enabled us to highlight sequence and structural factors determining de novo gene emergence. Finally, we showed a strong correlation between the fold potential of de novo proteins and that of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.
