scholarly journals The use of semantic similarity measures for optimally integrating heterogeneous Gene Ontology data from large scale annotation pipelines

2014 ◽  
Vol 5 ◽  
Author(s):  
Gaston K. Mazandu ◽  
Nicola J. Mulder
2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Aaron Ayllon-Benitez ◽  
Romain Bourqui ◽  
Patricia Thébault ◽  
Fleur Mougin

Abstract The revolution in new sequencing technologies is greatly leading to new understandings of the relations between genotype and phenotype. To interpret and analyze data that are grouped according to a phenotype of interest, methods based on statistical enrichment became a standard in biology. However, these methods synthesize the biological information by a priori selecting the over-represented terms and may suffer from focusing on the most studied genes that represent a limited coverage of annotated genes within a gene set. Semantic similarity measures have shown great results within the pairwise gene comparison by making advantage of the underlying structure of the Gene Ontology. We developed GSAn, a novel gene set annotation method that uses semantic similarity measures to synthesize a priori Gene Ontology annotation terms. The originality of our approach is to identify the best compromise between the number of retained annotation terms that has to be drastically reduced and the number of related genes that has to be as large as possible. Moreover, GSAn offers interactive visualization facilities dedicated to the multi-scale analysis of gene set annotations. Compared to enrichment analysis tools, GSAn has shown excellent results in terms of maximizing the gene coverage while minimizing the number of terms.


2021 ◽  
Author(s):  
Gaston K. Mazandu ◽  
Kenneth Opap ◽  
Funmilayo Makinde ◽  
Victoria Nembaware ◽  
Francis Agamah ◽  
...  

Abstract During the last decade, we witnessed an exponential rise of datasets from heterogeneous sources. Ontologies are playing an essential role in consistently describing domain concepts, data harmonization and integration to support large-scale integrative analysis and semantic interoperability in knowledge sharing. Several semantic similarity (SS) measures have been suggested to enable the integration of rich ontology structures into automated reasoning and inference. However, there is no tool that exhaustively implements these measures and existing tools are generally Gene Ontology specific, do not implement several models suggested in the WordNet context and are not equipped to properly deal with frequent ontology updates. We introduce a Python SS measure library (PySML), which tackles issues related to current SS tools, providing a portable and expandable tool to a broad computational audience. This empowers users to manipulate SS scores from several applications for any ontology version and file format. PySML is a flexible tool enabling the implementation of all existing semantic similarity models, resolving issues related to computation, reproducibility and re-usability of SS scores.


2012 ◽  
Vol 13 (S4) ◽  
Author(s):  
Marco Falda ◽  
Stefano Toppo ◽  
Alessandro Pescarolo ◽  
Enrico Lavezzo ◽  
Barbara Di Camillo ◽  
...  

2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Gaston K. Mazandu ◽  
Nicola J. Mulder

Several approaches have been proposed for computing term information content (IC) and semantic similarity scores within the gene ontology (GO) directed acyclic graph (DAG). These approaches contributed to improving protein analyses at the functional level. Considering the recent proliferation of these approaches, a unified theory in a well-defined mathematical framework is necessary in order to provide a theoretical basis for validating these approaches. We review the existing IC-based ontological similarity approaches developed in the context of biomedical and bioinformatics fields to propose a general framework and unified description of all these measures. We have conducted an experimental evaluation to assess the impact of IC approaches, different normalization models, and correction factors on the performance of a functional similarity metric. Results reveal that considering only parents or only children of terms when assessing information content or semantic similarity scores negatively impacts the approach under consideration. This study produces a unified framework for current and future GO semantic similarity measures and provides theoretical basics for comparing different approaches. The experimental evaluation of different approaches based on different term information content models paves the way towards a solution to the issue of scoring a term’s specificity in the GO DAG.


2018 ◽  
Author(s):  
Madhusudan Paul ◽  
Ashish Anand

AbstractThe large-scale protein-protein interaction (PPI) data has the potential to play a significant role in the endeavor of understanding cellular processes. However, the presence of a considerable fraction of false positives is a bottleneck in realizing this potential. There have been continuous efforts to utilize complementary resources for scoring confidence of PPIs in a manner that false positive interactions get a low confidence score. Gene Ontology (GO), a taxonomy of biological terms to represent the properties of gene products and their relations, has been widely used for this purpose. We utilize GO to introduce a new set of specificity measures: Relative Depth Specificity (RDS), Relative Node-based Specificity (RNS), and Relative Edge-based Specificity (RES), leading to a new family of similarity measures. We use these similarity measures to obtain a confidence score for each PPI. We evaluate the new measures using four different benchmarks. We show that all the three measures are quite effective. Notably, RNS and RES more effectively distinguish true PPIs from false positives than the existing alternatives. RES also shows a robust set-discriminating power and can be useful for protein functional clustering as well.


Sign in / Sign up

Export Citation Format

Share Document