Inferring Protein Domain Semantic Roles Using word2vec

2019 ◽  
Author(s):  
Daniel Buchan ◽  
David Jones

Abstract In this paper, using word2vec, we demonstrate that protein domains may have semantic “meaning” in the context of multi-domain proteins. Word2vec is a group of models which can be used to produce semantically meaningful embeddings of words or tokens in a vector space. In this work we treat multi-domain proteins as “sentences” in which domain identifiers are tokens that may be considered “words”. Using all InterPro (Finn, Attwood et al. 2017) eukaryotic proteins as a corpus of “sentences”, we demonstrate that word2vec creates functionally meaningful embeddings of protein domains. We additionally show how this can be applied to identifying putative functional roles for Pfam (Finn, Coggill et al. 2016) Domains of Unknown Function.
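The "proteins as sentences" idea can be illustrated with a minimal sketch of how skip-gram training pairs might be generated from domain architectures. This is not the authors' code: the domain identifiers (`DOM_A` and so on), the window size, and the helper name `skipgram_pairs` are all hypothetical, and only the pair-generation step that feeds a word2vec-style model is shown, not the embedding training itself.

```python
from collections import Counter

# Hypothetical domain architectures: each protein is an ordered list of
# placeholder domain identifiers (not real InterPro or Pfam accessions).
proteins = [
    ["DOM_A", "DOM_B", "DOM_C"],
    ["DOM_A", "DOM_B"],
    ["DOM_D", "DOM_B", "DOM_C"],
]


def skipgram_pairs(sentence, window=1):
    """Yield (target, context) pairs for tokens within `window` positions."""
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sentence[j]


# Count how often each (target, context) pair occurs across the corpus.
pair_counts = Counter(
    pair for protein in proteins for pair in skipgram_pairs(protein)
)

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Domains that repeatedly share neighbours (here `DOM_A` and `DOM_D`, which both precede `DOM_B`) are the ones a skip-gram model would tend to place close together in the embedding space.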

2021 ◽  
Author(s):  
A.S.M. Zisanur Rahman ◽  
Lukas Timmerman ◽  
Flyn Gallardo ◽  
Silvia T. Cardona

Abstract A first clue to gene function can be obtained by examining whether a gene is required for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential genes are usually identified by high-density transposon mutagenesis followed by sequencing of insertion sites (Tn-seq). These studies assign the term “essential” to whole genes rather than to the protein domain sequences that confer the essential functions. However, genes can code for multiple protein domains that evolve their functions independently. Therefore, when an essential gene codes for more than one protein domain, it may be that only one of those domains is essential. In this study, we defined this subset of genes as “essential domain-containing” (EDC) genes. Using a Tn-seq data set built in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify EDC genes and the essential protein domains they encode. We found forty candidate EDC genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). This analysis included two knockdowns of genes encoding the protein domains of unknown function DUF2213 and DUF4148. These essential domains are conserved in more than two hundred bacterial species, including human and plant pathogens. Together, our study suggests that essentiality should be assigned to individual protein domains rather than genes, contributing to a first functional characterization of protein domains of unknown function.
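The abstract does not describe the pipeline's internals, so the following is only a rough sketch of one way domain-level essentiality could be flagged from Tn-seq data: count transposon insertions inside each domain's coordinates and report domains with unusually low insertion density. The coordinates, insertion positions, threshold, and function names are all hypothetical and are not taken from the study.

```python
# Hypothetical data: positions are base-pair offsets within a single gene,
# not real Burkholderia cenocepacia K56-2 measurements.
domains = {"domain_1": (1, 300), "domain_2": (301, 900)}
insertions = [350, 420, 515, 610, 799, 880]


def insertion_density(domains, insertions):
    """Return insertions per base pair for each domain."""
    density = {}
    for name, (start, end) in domains.items():
        hits = sum(1 for pos in insertions if start <= pos <= end)
        density[name] = hits / (end - start + 1)
    return density


def candidate_essential_domains(density, threshold=0.001):
    """Domains whose insertion density falls below an assumed threshold."""
    return sorted(name for name, d in density.items() if d < threshold)


density = insertion_density(domains, insertions)
candidates = candidate_essential_domains(density)
print(density)
print(candidates)
```

In this toy gene, `domain_2` tolerates insertions while `domain_1` has none, which is the pattern the abstract associates with an essential domain inside a multi-domain gene. A real analysis would also need a statistical model for insertion-site availability and sequencing depth.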


2018 ◽  
Author(s):  
Stefania Daghino ◽  
Luigi Di Vietro ◽  
Luca Petiti ◽  
Elena Martino ◽  
Cristina Dallabona ◽  
...  

Abstract Protein domains are structurally and functionally distinct units responsible for particular protein functions or interactions. Although protein domains contribute to the overall protein function(s) and can be used for protein classification, about 20% of protein domains are currently annotated as “domains of unknown function” (DUFs). DUF 614, a cysteine-rich domain better known as PLAC8 (Placenta-Specific Gene 8), occurs in proteins found in the majority of Eukaryotes. PLAC8-containing proteins play important yet diverse roles in different organisms, such as control of cell proliferation in animals and plants or heavy metal resistance in plants and fungi. For example, Onzin from Mus musculus is a key regulator of cell proliferation, whereas FCR1 from the ascomycete Oidiodendron maius confers cadmium resistance. Onzin and FCR1 are small, single-domain PLAC8 proteins and we hypothesized that, despite their apparently different roles, a common molecular function of these proteins may be linked to the PLAC8 domain. To address this hypothesis, we compared these two PLAC8-containing proteins by heterologous expression in the PLAC8-free yeast Saccharomyces cerevisiae. When expressed in yeast, both Onzin and FCR1 improved cadmium resistance, reduced cadmium-induced DNA mutagenesis, localized in the nucleus and induced similar transcriptional changes. Our results support the hypothesis of a common ancestral function of the PLAC8 domain that may link some mitochondrial biosynthetic pathways (i.e. leucine biosynthesis and Fe-S cluster biogenesis) with the control of DNA damage, thus opening new perspectives to understand the role of this protein domain in the cellular biology of Eukaryotes.

Author Summary Protein domains are the functional units of proteins and typically have distinct structure and function. However, many widely distributed protein domains are currently annotated as “domains of unknown function” (DUFs). We have focused on DUF 614, a protein domain found in many Eukaryotes and better known as PLAC8 (Placenta-Specific Gene 8). The functional role of DUF 614 is unclear because PLAC8 proteins seem to play important yet different roles in taxonomically distant organisms such as animals, plants and fungi. We used S. cerevisiae to test whether these apparently different functions, namely cell proliferation and metal tolerance, reported respectively for the murine Onzin and the fungal FCR1, are mediated by the same molecular mechanisms. Our data demonstrate that the two PLAC8 proteins induced the same growth phenotype and transcriptional changes in S. cerevisiae. In particular, they both induced the biosynthesis of the amino acid leucine and of the iron-sulfur cluster, one of the most ancient protein cofactors. These similarities support the hypothesis of an ancestral function of the DUF 614 domain, whereas the transcriptomic data open new perspectives to understand the role of PLAC8 proteins in Eukaryotes.


1994 ◽  
Vol 6 (4) ◽  
pp. 487-500 ◽  
Author(s):  
T W McNellis ◽  
A G von Arnim ◽  
T Araki ◽  
Y Komeda ◽  
S Miséra ◽  
...  

2019 ◽  
Vol 17 (2) ◽  
pp. 161-171
Author(s):  
M. Thoihidul Islam ◽  
Mohammad Rashid Arif ◽  
Arif Hasan Khan Robin

Wheat blast is a devastating disease that has baffled scientists since its emergence. This study characterized blast resistance-related protein domains with a view to developing molecular markers that identify wheat genotypes resistant to the blast fungus Magnaporthe oryzae. A genome browse analysis indicated that candidate resistance genes against blast could be located on several different chromosomes. An in silico analysis was conducted with fifty nucleotide-binding site leucine-rich repeat (NBS-LRR), leucine-rich repeat (LRR), pathogenesis and resistance protein-encoding accessions selected on the basis of previous resistance reports. The phylogenetic tree of those putative resistance accessions, bearing resistance-related protein-encoding domains, showed that an NBS-LRR accession JP957107.1 has 67% similarity with the disease resistance protein domain encoding accession of Brazilian resistant cultivar Thatcher. By contrast, the rice blast resistance Pita gene has 72% similarity with 18 pathogenesis protein domain encoding accessions. Among putative protein domains, disease resistance protein of Thatcher has 78% similarity with two NBS-LRR protein domains AAZ99757.1 and AAZ99757.1. Eighteen microsatellite markers were designed from eighteen putative NBS-LRR protein encoding accessions along with the Piz3 marker. The 19 markers were unable to separate resistant and susceptible genotypes. Diffused versus conspicuous bands indicated the presence of either insertion/deletion (InDel) or single nucleotide polymorphism (SNP) among wheat genotypes. Detection of InDel or SNP markers is a subject for further investigation. Additional markers need to be designed using new NBS-LRR, pathogenesis, coiled-coil (CC), translocated intimin receptor (TIR) resistance protein encoding accessions to identify markers specific for blast resistance. J. Bangladesh Agril. Univ. 17(2): 161–171, June 2019


2013 ◽  
Vol 9 (4) ◽  
pp. 20130268 ◽  
Author(s):  
Chia-Hsin Hsu ◽  
Chien-Kuo Chen ◽  
Ming-Jing Hwang

Protein domain architectures (PDAs), in which single domains are linked to form multiple-domain proteins, are a major molecular form used by evolution for the diversification of protein functions. However, the design principles of PDAs remain largely uninvestigated. In this study, we constructed networks to connect domain architectures that had grown out from the same single domain for every single domain in the Pfam-A database and found that there are three main distinctive types of these networks, which suggests that evolution can exploit PDAs in three different ways. Further analysis showed that these three different types of PDA networks are each adopted by different types of protein domains, although many networks exhibit the characteristics of more than one of the three types. Our results shed light on nature's blueprint for protein architecture and provide a framework for understanding architectural design from a network perspective.


2005 ◽  
Vol 59 (1) ◽  
pp. 1-6 ◽  
Author(s):  
Mingzhu Zheng ◽  
Krzysztof Ginalski ◽  
Leszek Rychlewski ◽  
Nick V. Grishin

1999 ◽  
Vol 181 (12) ◽  
pp. 3688-3694 ◽  
Author(s):  
Ralf Koebnik

ABSTRACT The N-terminal domain of the OmpA protein from Escherichia coli, consisting of 170 amino acid residues, is embedded in the outer membrane, in the form of an antiparallel β-barrel whose eight transmembrane β-strands are connected by three short periplasmic turns and four relatively large surface-exposed hydrophilic loops. This protein domain serves as a paradigm for the study of membrane assembly of integral β-structured membrane proteins. In order to dissect the structural and functional roles of the surface-exposed loops, they were shortened separately and in all possible combinations. All 16 loop deletion mutants assembled into the outer membrane with high efficiency and adopted the wild-type membrane topology. This systematic approach proves the absence of topogenic signals (e.g., in the form of loop sizes or charge distributions) in these loops. The shortening of surface-exposed loops did not reduce the thermal stability of the protein. However, none of the mutant proteins, with the exception of the variant with the fourth loop shortened, served as a receptor for the OmpA-specific bacteriophage K3. Furthermore, all loops were necessary for the OmpA protein to function in the stabilization of mating aggregates during F conjugation. An OmpA deletion variant with all four loops shortened, consisting of only 135 amino acid residues, constitutes the smallest β-structured integral membrane protein known to date. These results represent a further step toward the development of artificial outer membrane proteins.


2015 ◽  
Author(s):  
Martin L Miller ◽  
Ed Reznik ◽  
Nicholas P Gauthier ◽  
Bülent Arman Aksoy ◽  
Anil Korkut ◽  
...  

In cancer genomics, frequent recurrence of mutations in independent tumor samples is a strong indication of functional impact. However, rare functional mutations can escape detection by recurrence analysis for lack of statistical power. We address this problem by extending the notion of recurrence of mutations from single genes to gene families that share homologous protein domains. In addition to lowering the threshold of detection, this sharpens the functional interpretation of the impact of mutations, as protein domains more succinctly embody function than entire genes. Mapping mutations in 22 different tumor types to equivalent positions in multiple sequence alignments of protein domains, we confirm well-known functional mutation hotspots and make two types of discoveries: 1) identification and functional interpretation of uncharacterized rare variants in one gene that are equivalent to well-characterized mutations in canonical cancer genes, such as uncharacterized ERBB4 (S303F) mutations that are analogous to canonical ERBB2 (S310F) mutations in the furin-like domain, and 2) detection of previously unknown mutation hotspots with novel functional implications. With the rapid expansion of cancer genomics projects, protein domain hotspot analysis is likely to provide many more leads linking mutations in proteins to the cancer phenotype.
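The mapping step described above, in which mutations at different residue numbers in different genes are pooled at the same alignment column, can be sketched as follows. This is an illustrative toy and not the authors' method: the gene names, aligned sequences, mutation list, and the helper `residue_to_column` are all hypothetical.

```python
from collections import Counter

# Toy alignment of one domain in two hypothetical genes; '-' marks a gap.
alignment = {
    "GENE_A": "MK-LSF",
    "GENE_B": "MKTLSF",
}

# Hypothetical mutations as (gene, 1-based residue number within the domain).
mutations = [("GENE_A", 4), ("GENE_B", 5), ("GENE_A", 2)]


def residue_to_column(aligned_seq):
    """Map each 1-based residue number to its 0-based alignment column."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_seq):
        if char != "-":
            residue += 1
            mapping[residue] = column
    return mapping


# Precompute the residue-to-column map for every gene in the alignment.
column_maps = {gene: residue_to_column(seq) for gene, seq in alignment.items()}

# Pool mutations from all genes by alignment column.
column_counts = Counter(column_maps[gene][pos] for gene, pos in mutations)
print(column_counts)
```

Residue 4 of `GENE_A` and residue 5 of `GENE_B` fall in the same column, so they count as one recurrent position even though neither gene alone shows recurrence there. This mirrors how the abstract treats ERBB4 (S303F) and ERBB2 (S310F) as equivalent positions.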


2017 ◽  
Author(s):  
Arli A. Parikesit ◽  
Peter F. Stadler ◽  
Sonja J. Prohaska

Abstract The genomic inventory of protein domains is an important indicator of an organism’s regulatory and metabolic capabilities. Existing gene annotations, however, can be plagued by substantial ascertainment biases that make it difficult to obtain and compare quantitative domain data. We find that quantitative trends across the Eukarya can be investigated based on a combination of gene prediction and standard domain annotation pipelines. Species-specific training is required, however, to account for the genomic peculiarities in many lineages. In contrast to earlier studies we find widespread statistically significant avoidance of protein domains associated with distinct functional high-level gene-ontology terms.

1998 ACM Subject Classification J.3 Life and Medical Sciences

