Gene ontology concept recognition using named concept: understanding the various presentations of the gene functions in biomedical literature

Automatic consistency assurance for literature-based gene ontology annotation

BMC Bioinformatics ◽

10.1186/s12859-021-04479-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jiyu Chen ◽

Nicholas Geard ◽

Justin Zobel ◽

Karin Verspoor

Keyword(s):

Gene Ontology ◽

High Precision ◽

State Of The Art ◽

Biological Database ◽

Research Papers ◽

Go Annotation ◽

Human In The Loop ◽

Gene Functions ◽

Different Types ◽

Biological Literature

Background: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. Results: In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide a detailed error analysis demonstrating that the method achieves high precision on more confident predictions. Conclusions: Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.


Faculty Opinions recommendation of Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1007548.95105 ◽

2002 ◽

Author(s):

Raymond Dingledine

Keyword(s):

Gene Ontology ◽

Maximum Entropy ◽

Biomedical Literature ◽

Entropy Analysis


Prediction of optimal gene functions for osteosarcoma using gene ontology and microarray profiles

Journal of Bone Oncology ◽

10.1016/j.jbo.2017.04.003 ◽

2017 ◽

Vol 7 ◽

pp. 18-22 ◽

Cited By ~ 3

Author(s):

Xinrang Chen

Keyword(s):

Gene Ontology ◽

Gene Functions


Parallel selection on ecologically relevant gene functions in the transcriptomes of highly diversifying salmonids

BMC Genomics ◽

10.1186/s12864-019-6361-2 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Kevin Schneider ◽

Colin E. Adams ◽

Kathryn R. Elmer

Keyword(s):

Gene Ontology ◽

Transcriptional Regulation ◽

Relaxed Selection ◽

Diversifying Selection ◽

Creatine Uptake ◽

Gene Functions ◽

Significant Enrichment ◽

Lipid Metabolism Gene ◽

High Level ◽

Metabolism Gene

Background: Salmonid fishes are characterised by a very high level of variation in trophic, ecological, physiological, and life history adaptations. Some salmonid taxa show exceptional potential for fast, within-lake diversification into morphologically and ecologically distinct variants, often in parallel; these are the lake-resident charr and whitefish (several species in the genera Salvelinus and Coregonus). To identify selection on genes and gene categories associated with such predictable diversifications, we analysed 2702 orthogroups (4.82 Mbp total; average 4.77 genes/orthogroup; average 1783 bp/orthogroup). We did so in two charr and two whitefish species and compared to five other salmonid lineages, which do not evolve in such ecologically predictable ways, and one non-salmonid outgroup. Results: All selection analyses are based on Coregonus and Salvelinus compared to non-diversifying taxa. We found more orthogroups were affected by relaxed selection than intensified selection. Of those, 122 were under significant relaxed selection, with trends of an overrepresentation of serine family amino acid metabolism and transcriptional regulation, and significant enrichment of behaviour-associated gene functions. Seventy-eight orthogroups were under significant intensified selection and were enriched for signalling process and transcriptional regulation gene ontology terms and actin filament and lipid metabolism gene sets. Ninety-two orthogroups were under diversifying/positive selection. These were enriched for signal transduction, transmembrane transport, and pyruvate metabolism gene ontology terms and often contained genes involved in transcriptional regulation and development. Several orthogroups showed signs of multiple types of selection. For example, orthogroups under relaxed and diversifying selection contained genes such as ap1m2, involved in immunity and development, and slc6a8, playing an important role in muscle and brain creatine uptake. Orthogroups under intensified and diversifying selection were also found, such as genes syn3, with a role in neural processes, and ctsk, involved in bone remodelling. Conclusions: Our approach pinpointed relevant genomic targets by distinguishing among different kinds of selection. We found that relaxed, intensified, and diversifying selection affect orthogroups and gene functions of ecological relevance in salmonids. Because they were found consistently and robustly across charr and whitefish and not other salmonid lineages, we propose these genes have a potential role in the replicated ecological diversifications.


Learnability-based further prediction of gene functions in Gene Ontology

Genomics ◽

10.1016/j.ygeno.2004.08.005 ◽

2004 ◽

Vol 84 (6) ◽

pp. 922-928 ◽

Cited By ~ 16

Author(s):

Kang Tu ◽

Hui Yu ◽

Zheng Guo ◽

Xia Li

Keyword(s):

Gene Ontology ◽

Gene Functions


Crowdsourcing biocuration: the Community Assessment of Community Annotation with Ontologies (CACAO)

10.1101/2021.04.30.440339 ◽

2021 ◽

Author(s):

Jolene Ramsey ◽

Brenley McIntosh ◽

Daniel Renfro ◽

Suzanne A Aleksander ◽

Sandra LaBonte ◽

...

Keyword(s):

Gene Ontology ◽

Gene Function ◽

Scientific Literature ◽

Tracking System ◽

Model Organisms ◽

Community Assessment ◽

Standard Format ◽

Primary Literature ◽

Gene Functions ◽

Community Annotation

Experimental data about known gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a ten-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills. Significance Statement: The primary scientific literature catalogs the results from publicly funded scientific research about gene function in human-readable format. Information captured from those studies in a widely adopted, machine-readable standard format comes in the form of Gene Ontology annotations about gene functions from all domains of life. Manual annotations based on inferences directly from the scientific literature, including the evidence used to make such inferences, represent the best return on investment by improving data accessibility across the biological sciences. To supplement professional curation, our CACAO project enabled annotation of the scientific literature by community annotators, in this case undergraduates, which resulted in contribution of thousands of validated entries to public resources. These annotations are now being used by scientists worldwide.


Associating Genes with Gene Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature

Genome Research ◽

10.1101/gr.199701 ◽

2002 ◽

Vol 12 (1) ◽

pp. 203-214 ◽

Cited By ~ 100

Author(s):

S. Raychaudhuri

Keyword(s):

Gene Ontology ◽

Maximum Entropy ◽

Biomedical Literature ◽

Entropy Analysis


Automatic Consistency Assurance for Literature-based Gene Ontology Annotation

10.1101/2021.05.26.445910 ◽

2021 ◽

Author(s):

Jiyu Chen ◽

Nicholas Geard ◽

Justin Zobel ◽

Karin Verspoor

Keyword(s):

Gene Ontology ◽

High Precision ◽

State Of The Art ◽

Biological Database ◽

Research Papers ◽

Go Annotation ◽

Human In The Loop ◽

Gene Functions ◽

Different Types ◽

Biological Literature

Background: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. Method: In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. Results and Conclusion: Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. We provide a detailed error analysis demonstrating that the method achieves high precision on more confident predictions. Our approach demonstrates clear value for human-in-the-loop curation scenarios.


GAIL: An interactive webserver for inference and dynamic visualization of gene-gene associations based on gene ontology guided mining of biomedical literature

PLoS ONE ◽

10.1371/journal.pone.0219195 ◽

2019 ◽

Vol 14 (7) ◽

pp. e0219195 ◽

Cited By ~ 2

Author(s):

Daniel Couch ◽

Zhenning Yu ◽

Jin Hyun Nam ◽

Carter Allen ◽

Paula S. Ramos ◽

...

Keyword(s):

Gene Ontology ◽

Biomedical Literature ◽

Dynamic Visualization ◽

Gene Associations


Comparison of natural language processing tools for automatic gene ontology annotation of scientific literature

10.7287/peerj.preprints.27028v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Lucas Beasley ◽

Prashanti Manda

Keyword(s):

Gene Ontology ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Scientific Literature ◽

Biomedical Literature ◽

Automated Annotation ◽

Manual Curation ◽

Gold Standard Reference

Manual curation of scientific literature for ontology-based knowledge representation has proven infeasible and unscalable given the large and growing volume of scientific literature. Automated annotation solutions that leverage text mining and Natural Language Processing (NLP) have been developed to ameliorate the problem of literature curation. These NLP approaches use parsing, syntactic, and lexical analysis of text to recognize and annotate pieces of text with ontology concepts. Here, we compare four state-of-the-art NLP tools on the task of recognizing Gene Ontology concepts in biomedical literature, using the Colorado Richly Annotated Full-Text (CRAFT) corpus as a gold standard reference. We demonstrate the use of semantic similarity metrics to compare NLP tool annotations to the gold standard.
