PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Rezarta Islamaj ◽  
W John Wilbur ◽  
Natalie Xie ◽  
Noreen R Gonzales ◽  
Narmada Thanki ◽  
...  

Abstract. This study proposes a text similarity model to help biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to specifically identify the referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans the newly published literature and proposes related articles of relevance to the existing CDD records. Our approach to addressing these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as the articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature, which were given to curators for review. Through this exercise, we discovered that we were able to map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset that was reviewed by curators, we were able to successfully provide references for 86% of the curator statements. In addition, we suggested new articles for curator review, which were accepted by curators to be added into the reference list at an acceptance rate of 50%. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitutes the CDD text similarity corpus. This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high), doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
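
The mapping step described in the abstract above (pairing a curator-written statement with its top 10 matching candidate sentences) can be illustrated with a minimal sketch. It assumes plain TF-IDF weighting with cosine similarity, which is a generic stand-in and not the similarity model used in the study; the function names (`tokenize`, `tfidf_vectors`, `cosine`, `top_matches`) and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Document frequency: in how many sentences each token occurs.
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Smoothed IDF keeps every weight positive, even for ubiquitous terms.
        vectors.append({
            t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_matches(statement, candidates, k=10):
    """Return the k candidates most similar to the statement, best first."""
    vectors = tfidf_vectors([statement] + list(candidates))
    query, rest = vectors[0], vectors[1:]
    scored = [(cosine(query, vec), sent) for vec, sent in zip(rest, candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from the CDD corpus.
    candidates = [
        "TBC domain proteins act as a GAP for Rab small GTPases.",
        "Tea plants produce many terpenes.",
        "The insula participates in pain processing.",
    ]
    statement = "The TBC domain functions as a GAP for Rab GTPases."
    for score, sentence in top_matches(statement, candidates, k=2):
        print(f"{score:.3f}  {sentence}")
```

In a curation setting, the ranked list would be handed to a reviewer rather than accepted automatically, which mirrors the review step the abstract describes.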

Author(s):  
Andrew F. Neuwald

Abstract. Hierarchically arranged multiple sequence alignment profiles are useful for modeling protein domains that have functionally diverged into evolutionarily related subgroups. Currently, such alignment hierarchies are largely constructed through manual curation, as for the NCBI Conserved Domain Database (CDD). Recently, however, I developed a Gibbs sampler that uses an approach termed


2011 ◽  
Vol 31 (3) ◽  
pp. 159-168 ◽  
Author(s):  
Mitsunori Fukuda

The TBC (Tre-2/Bub2/Cdc16) domain was originally identified as a conserved domain among the tre-2 oncogene product and the yeast cell cycle regulators Bub2 and Cdc16, and it is now widely recognized as a conserved protein motif that consists of approx. 200 amino acids in all eukaryotes. Since the TBC domain of yeast Gyps [GAP (GTPase-activating protein) for Ypt proteins] has been shown to function as a GAP domain for small GTPase Ypt/Rab, TBC domain-containing proteins (TBC proteins) in other species are also expected to function as a certain Rab-GAP. More than 40 different TBC proteins are present in humans and mice, and recent accumulating evidence has indicated that certain mammalian TBC proteins actually function as a specific Rab-GAP. Some mammalian TBC proteins {e.g. TBC1D1 [TBC (Tre-2/Bub2/Cdc16) domain family, member 1] and TBC1D4/AS160 (Akt substrate of 160 kDa)} play an important role in homoeostasis in mammals, and defects in them are directly associated with mouse and human diseases (e.g. leanness in mice and insulin resistance in humans). The present study reviews the structure and function of mammalian TBC proteins, especially in relation to Rab small GTPases.


2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Xuejing Li ◽  
Ling Wang ◽  
Qian Chen ◽  
Yongsheng Hu ◽  
Jubao Du ◽  
...  

Objective. To investigate the reorganization of insular subregions in individuals suffering from neuropathic pain (NP) after incomplete spinal cord injury (ISCI) and further to disclose the underlying mechanism of NP. Method. The 3D high-resolution T1-weighted structural images and resting-state functional magnetic resonance imaging (rs-fMRI) of all individuals were obtained using a 3.0 Tesla MRI system. A comparative analysis of structure and functional connectivity (FC) with insular subareas as seeds was conducted in 10 ISCI individuals with below-level NP (ISCI-P), 11 ISCI individuals without NP (ISCI-N), and 25 healthy controls (HCs). Associations between the structural and functional alterations of insular subregions and visual analog scale (VAS) scores were analyzed using the Pearson correlation in SPSS 20. Results. Compared with ISCI-N patients, with the left posterior insula as the seed, ISCI-P patients showed increased FC in the right cerebellum VIIb, cerebellum VIII and Brodmann area 37 (BA 37). With the left ventral anterior insula as the seed, ISCI-P patients showed enhanced FC in the right BA 18 compared with ISCI-N patients. These increased FCs positively correlated with VAS scores. Relative to HCs, ISCI-P patients presented increased FC in the left hippocampus when the left dorsal anterior insula was used as the seed. There was no statistical difference in the volume of insular subregions among the three groups. Conclusion. The distinctive patterns of FC in each subregion of the insula suggest that the insular subareas participate in NP processing through different FC following ISCI. Further, insular subregions could serve as a therapeutic target for NP following ISCI.


2021 ◽  
Vol 8 ◽  
Author(s):  
Kai-Lu Zhang ◽  
Jian-Li Zhou ◽  
Jing-Fang Yang ◽  
Yu-Zhen Zhao ◽  
Debatosh Das ◽  
...  

As a pivotal regulator of 5’ splice site recognition, U1 small nuclear ribonucleoprotein (U1 snRNP)-specific protein C (U1C) regulates pre-mRNA splicing by interacting with other components of the U1 snRNP complex. Previous studies have shown that U1 snRNP and its components are linked to a variety of diseases, including cancer. However, the phylogenetic relationships and expression profiles of U1C have not been studied systematically. To this end, we identified a total of 110 animal U1C genes and compared them to homologues from yeast and plants. Bioinformatics analysis shows that the structure and function of U1C proteins are relatively conserved and that a few members of the U1C gene family are present in multiple copies. Furthermore, the expression patterns reveal that U1Cs have potential roles in cancer progression and human development. In summary, our study presents a comprehensive overview of the animal U1C gene family, which can provide fundamental data and potential cues for further research in deciphering the molecular function of this splicing regulator.


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Kyra Erckert ◽  
...  

Abstract. The emergence of SARS-CoV-2 variants stressed the demand for tools that allow interpreting the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
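
The two-state Matthews Correlation Coefficient quoted in the abstract above is computed from the four cells of a binary confusion matrix. The following is a minimal sketch of that standard formula, assuming labels are encoded as 1 (for example, conserved) and 0 (not conserved); the function name and the toy labels are illustrative and are not taken from the VESPA code base.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Two-state MCC for binary labels encoded as 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels for four residues; purely illustrative.
    observed = [1, 1, 0, 0]
    predicted = [1, 0, 0, 0]
    print(round(matthews_corrcoef(observed, predicted), 3))
```

Unlike plain accuracy, the MCC stays near zero for a predictor that always outputs the majority class, which is one reason it suits imbalanced two-state problems.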


2021 ◽  
Author(s):  
Roshan Rao ◽  
Jason Liu ◽  
Robert Verkuil ◽  
Joshua Meier ◽  
John F. Canny ◽  
...  

Abstract. Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.


2018 ◽  
Vol 100-B (4) ◽  
pp. 480-484 ◽  
Author(s):  
B. Kadum ◽  
C. Inngul ◽  
R. Ihrman ◽  
G. O. Sjödén ◽  
A. S. Sayed-Noor

Aims. The aims of this study were to investigate any possible relationship between a preoperative sensitivity to pain and the degree of pain at rest and on exertion with postoperative function in patients who underwent stemless total shoulder arthroplasty (TSA). Patients and Methods. In this prospective study, we included 63 patients who underwent stemless TSA and were available for evaluation one year postoperatively. There were 31 women and 32 men; their mean age was 71 years (53 to 89). The pain threshold, which was measured using a Pain Matcher (PM) unit, the degree of pain (visual analogue scale (VAS) at rest and on exertion), and function, using the short version of the Disabilities of the Arm, Shoulder and Hand questionnaire (QuickDASH), were recorded preoperatively, as well as three and 12 months postoperatively. Results. We found an inverse relationship between both the preoperative PM threshold and pain (VAS) at rest and the 12-month postoperative QuickDASH score (Pearson correlation coefficient (r) ≥ 0.4, p < 0.05). A linear regression analysis showed that the preoperative PM threshold on the affected side and preoperative pain (VAS) at rest were the only factors associated with the QuickDASH score at 12 months. Conclusion. These findings indicate the importance of central sensitization in the restoration of function after TSA. Further studies are required to investigate whether extra analgesia and rehabilitation could influence the outcome in at-risk patients. Cite this article: Bone Joint J 2018;100-B:480–4.


2020 ◽  
Vol 21 (2) ◽  
pp. 655
Author(s):  
Jieyang Jin ◽  
Shangrui Zhang ◽  
Mingyue Zhao ◽  
Tingting Jing ◽  
Na Zhang ◽  
...  

Terpenoids play vital roles in determining tea aroma quality and plant defense performance, yet the gene-to-metabolite relationships of the terpene pathway remain uninvestigated in tea plants. Here, we report the use of an integrated approach combining metabolite, target gene transcript and function analyses to reveal a gene-to-terpene network in tea plants. Forty-one terpenes, including 26 monoterpenes, 14 sesquiterpenes and one triterpene, were detected, and 82 terpene-related genes were identified from five tissues of tea plants. Pearson correlation analysis yielded a gene-to-metabolite network. One terpene synthase whose expression positively correlated with farnesene was selected, and its function was confirmed to be involved in the biosynthesis of α-farnesene, β-ocimene and β-farnesene, a very important and conserved alarm pheromone in response to aphids, by both in vitro enzymatic assay and in planta function analysis. In summary, we provide the first reliable gene-to-terpene network for novel gene discovery.
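
A gene-to-metabolite network of the kind mentioned in the abstract above can be sketched by correlating each gene's expression profile with each metabolite's abundance profile across tissues and keeping the strongly correlated pairs as edges. This is a minimal illustration, not the authors' pipeline: the cutoff of 0.8, the function names and the toy profiles are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(gene_expr, metabolite_levels, threshold=0.8):
    """Return (gene, metabolite, r) for every pair with |r| >= threshold.

    The threshold is an assumed cutoff chosen for illustration.
    """
    edges = []
    for gene, expr in gene_expr.items():
        for metabolite, levels in metabolite_levels.items():
            r = pearson(expr, levels)
            if abs(r) >= threshold:
                edges.append((gene, metabolite, r))
    return edges


if __name__ == "__main__":
    # Hypothetical profiles measured across five tissues.
    gene_expr = {"gene_a": [1, 2, 3, 4, 5], "gene_b": [5, 1, 4, 2, 3]}
    metabolite_levels = {"met_x": [2, 4, 6, 8, 10]}
    for gene, metabolite, r in correlation_edges(gene_expr, metabolite_levels):
        print(gene, metabolite, round(r, 2))
```

With only five tissues, a high correlation is weak evidence on its own, so an edge found this way is a candidate for experimental follow-up rather than a confirmed relationship.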

