Integrating image caption information into biomedical document classification in support of biocuration

Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Xiangying Jiang ◽  
Pengyuan Li ◽  
James Kadin ◽  
Judith A Blake ◽  
Martin Ringwald ◽  
...  

Abstract Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation.
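
The reported metrics can be related to one another with a short calculation. The sketch below assumes that "f-measure" in the abstract means the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The confusion-matrix counts passed to the Matthews correlation coefficient function are hypothetical toy values chosen for illustration, not figures from the GXD dataset.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Precision and recall as reported in the abstract.
reported_f = f_measure(0.698, 0.784)

# Hypothetical, imbalanced toy counts (NOT taken from the paper).
toy_mcc = matthews_corrcoef(tp=8, fp=2, fn=2, tn=88)

print(f"F-measure from reported precision/recall: {reported_f:.4f}")
print(f"MCC on toy counts: {toy_mcc:.4f}")
```

Under that assumption, the computed F-measure is roughly 0.7385, which agrees with the reported 0.738 up to rounding of the inputs. Unlike the F-measure, the Matthews correlation coefficient also uses the true-negative count, which is one reason it is often reported for highly imbalanced datasets such as this one.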

2021 ◽  
Author(s):  
Martin Ringwald ◽  
Joel E. Richardson ◽  
Richard M. Baldarelli ◽  
Judith A. Blake ◽  
James A. Kadin ◽  
...  

Abstract The Mouse Genome Informatics (MGI) database system combines multiple expertly curated community data resources into a shared knowledge management ecosystem united by common metadata annotation standards. MGI’s mission is to facilitate the use of the mouse as an experimental model for understanding the genetic and genomic basis of human health and disease. MGI is the authoritative source for mouse gene, allele, and strain nomenclature and is the primary source of mouse phenotype annotations, functional annotations, developmental gene expression information, and annotations of mouse models with human diseases. MGI maintains mouse anatomy and phenotype ontologies and contributes to the development of the Gene Ontology and Disease Ontology and uses these ontologies as standard terminologies for annotation. The Mouse Genome Database (MGD) and the Gene Expression Database (GXD) are MGI’s two major knowledgebases. Here, we highlight some of the recent changes and enhancements to MGD and GXD that have been implemented in response to changing needs of the biomedical research community and to improve the efficiency of expert curation. MGI can be accessed freely at http://www.informatics.jax.org.


2003 ◽  
Vol 4 (2) ◽  
Author(s):  
Yunxia Zhu ◽  
Benjamin L King ◽  
Babak Parvizi ◽  
Brian P Brunk ◽  
Christian J Stoeckert ◽  
...  

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Xiangying Jiang ◽  
Martin Ringwald ◽  
Judith A Blake ◽  
Cecilia Arighi ◽  
Gongbo Zhang ◽  
...  

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Ana Claudia Sima ◽  
Tarcisio Mendes de Farias ◽  
Erich Zbinden ◽  
Maria Anisimova ◽  
Manuel Gil ◽  
...  

Abstract Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: (i) Bgee, a gene expression relational database; (ii) Orthologous Matrix (OMA), a Hierarchical Data Format 5 orthology DS; and (iii) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialized RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.


2014 ◽  
Vol 165 (2) ◽  
pp. 163-193 ◽  
Author(s):  
Felice Dell’Orletta ◽  
Simonetta Montemagni ◽  
Giulia Venturi

In this paper, we tackle three underresearched issues of the automatic readability assessment literature, namely the evaluation of text readability in less resourced languages, with respect to sentences (as opposed to documents) as well as across textual genres. Different solutions to these issues have been tested by using and refining READ‑IT, the first advanced readability assessment tool for Italian, which combines traditional raw text features with lexical, morpho-syntactic and syntactic information. In READ‑IT readability assessment is carried out with respect to both documents and sentences, with the latter constituting an important novelty of the proposed approach: READ‑IT shows a high accuracy in the document classification task and promising results in the sentence classification scenario. By comparing the results of two versions of READ‑IT, adopting a classification‑ versus ranking-based approach, we also show that readability assessment is strongly influenced by textual genre; for this reason a genre-oriented notion of readability is needed. With classification-based approaches, reliable results can only be achieved with genre-specific models: Since this is far from being a workable solution, especially for less resourced languages, a new ranking method for readability assessment is proposed, based on the notion of distance.


2008 ◽  
Vol 33 (3) ◽  
pp. 301-311 ◽  
Author(s):  
Elin Grundberg ◽  
Helena Brändström ◽  
Kevin C. L. Lam ◽  
Scott Gurd ◽  
Bing Ge ◽  
...  

Osteoblasts are key players in bone remodeling. The accessibility of human primary osteoblast-like cells (HObs) from bone explants makes them a lucrative model for studying molecular physiology of bone turnover, for discovering novel anabolic therapeutics, and for mesenchymal cell biology in general. Relatively little is known about resting and dynamic expression profiles of HObs, and to date no studies have been conducted to systematically assess the osteoblast transcriptome. The aim of this study was to characterize HObs and investigate signaling cascades and gene networks with genomewide expression profiling in resting and bone morphogenic protein (BMP)-2- and dexamethasone-induced cells. In addition, we compared HOb gene expression with publicly available samples from the Gene Expression Omnibus. Our data show a vast number of genes and networks expressed predominantly in HObs compared with closely related cells such as fibroblasts or chondrocytes. For instance, genes in the insulin-like growth factor (IGF) signaling pathway were enriched in HObs (P = 0.003) and included the binding proteins (IGFBP-1, -2, -5) and IGF-II and its receptor. Another HOb-specific expression pattern included leptin and its receptor (P < 10⁻⁸). Furthermore, after stimulation of HObs with BMP-2 or dexamethasone, the expression of several interesting genes and pathways was observed. For instance, our data support the role of peripheral leptin signaling in bone cell function. In conclusion, we provide the landscape of tissue-specific and dynamic gene expression in HObs. This resource will allow utilization of osteoblasts as a model to study specific gene networks and gene families related to human bone physiology and diseases.

