Integrating image caption information into biomedical document classification in support of biocuration

Database ◽

10.1093/database/baaa024 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Xiangying Jiang ◽

Pengyuan Li ◽

James Kadin ◽

Judith A Blake ◽

Martin Ringwald ◽

...

Keyword(s):

Gene Expression ◽

Classification Scheme ◽

Mouse Genome Informatics ◽

Document Classification ◽

Publication Rate ◽

Classification Task ◽

Biological Databases ◽

Vast Number ◽

Pertinent Information ◽

Genome Informatics

Abstract Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL:
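
As a reading aid, the evaluation measures quoted in the abstract above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the confusion-matrix counts passed to `mcc` are toy values chosen for illustration, since the abstract does not report the underlying counts. The precision, recall and document counts are the ones stated in the abstract.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty (denominator is zero).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Figures quoted in the abstract.
relevant, irrelevant = 5469, 52866
prevalence = relevant / (relevant + irrelevant)  # share of relevant documents
f1 = f_measure(0.698, 0.784)

# Toy counts for illustration only; these are NOT taken from the paper.
toy_mcc = mcc(tp=8, fp=2, fn=2, tn=88)

print(f"prevalence of relevant class: {prevalence:.4f}")
print(f"F-measure from rounded P and R: {f1:.4f}")
print(f"MCC on toy counts: {toy_mcc:.4f}")
```

Because the inputs are already rounded to three decimals, the recomputed F-measure agrees with the reported 0.738 only to within rounding. The prevalence figure shows that fewer than one document in ten is labeled relevant, which is the imbalance the abstract refers to.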

Mouse Genome Informatics (MGI): latest news from MGD and GXD

Mammalian Genome ◽

10.1007/s00335-021-09921-0 ◽

2021 ◽

Author(s):

Martin Ringwald ◽

Joel E. Richardson ◽

Richard M. Baldarelli ◽

Judith A. Blake ◽

James A. Kadin ◽

...

Keyword(s):

Gene Expression ◽

Mouse Genome ◽

Primary Source ◽

Mouse Genome Database ◽

Mouse Genome Informatics ◽

Genome Database ◽

Functional Annotations ◽

Developmental Gene Expression ◽

Health And Disease ◽

Genome Informatics

Abstract The Mouse Genome Informatics (MGI) database system combines multiple expertly curated community data resources into a shared knowledge management ecosystem united by common metadata annotation standards. MGI’s mission is to facilitate the use of the mouse as an experimental model for understanding the genetic and genomic basis of human health and disease. MGI is the authoritative source for mouse gene, allele, and strain nomenclature and is the primary source of mouse phenotype annotations, functional annotations, developmental gene expression information, and annotations of mouse models with human diseases. MGI maintains mouse anatomy and phenotype ontologies and contributes to the development of the Gene Ontology and Disease Ontology and uses these ontologies as standard terminologies for annotation. The Mouse Genome Database (MGD) and the Gene Expression Database (GXD) are MGI’s two major knowledgebases. Here, we highlight some of the recent changes and enhancements to MGD and GXD that have been implemented in response to changing needs of the biomedical research community and to improve the efficiency of expert curation. MGI can be accessed freely at http://www.informatics.jax.org.

Mouse Genome Informatics (MGI): Resources for Mining Mouse Genetic, Genomic, and Biological Data in Support of Primary and Translational Research

Methods in Molecular Biology - Systems Genetics ◽

10.1007/978-1-4939-6427-7_3 ◽

2016 ◽

pp. 47-73 ◽

Cited By ~ 33

Author(s):

Janan T. Eppig ◽

Cynthia L. Smith ◽

Judith A. Blake ◽

Martin Ringwald ◽

James A. Kadin ◽

...

Keyword(s):

Translational Research ◽

Mouse Genome ◽

Mouse Genome Informatics ◽

Biological Data ◽

Genome Informatics ◽

Mouse Genetic

Integrating computationally assembled mouse transcript sequences with the Mouse Genome Informatics (MGI) database

Genome Biology ◽

10.1186/gb-2003-4-2-r16 ◽

2003 ◽

Vol 4 (2) ◽

Cited By ~ 9

Author(s):

Yunxia Zhu ◽

Benjamin L King ◽

Babak Parvizi ◽

Brian P Brunk ◽

Christian J Stoeckert ◽

...

Keyword(s):

Mouse Genome ◽

Mouse Genome Informatics ◽

Mouse Transcript ◽

Genome Informatics

Searching the Mouse Genome Informatics (MGI) Resources for Information on Mouse Biology from Genotype to Phenotype

Current Protocols in Bioinformatics ◽

10.1002/0471250953.bi0107s05 ◽

2004 ◽

Vol 5 (1) ◽

Cited By ~ 2

Author(s):

David Shaw

Keyword(s):

Mouse Genome ◽

Mouse Genome Informatics ◽

Genome Informatics

An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

Database ◽

10.1093/database/baz045 ◽

2019 ◽

Vol 2019 ◽

Cited By ~ 4

Author(s):

Xiangying Jiang ◽

Martin Ringwald ◽

Judith A Blake ◽

Cecilia Arighi ◽

Gongbo Zhang ◽

...

Keyword(s):

Classification Scheme ◽

Class Imbalance ◽

Document Classification

Enabling semantic queries across federated bioinformatics databases

Database ◽

10.1093/database/baz106 ◽

2019 ◽

Vol 2019 ◽

Cited By ~ 9

Author(s):

Ana Claudia Sima ◽

Tarcisio Mendes de Farias ◽

Erich Zbinden ◽

Maria Anisimova ◽

Manuel Gil ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Heterogeneous Data ◽

Biological Data ◽

Data Sources ◽

Biological Knowledge ◽

Biological Databases ◽

Semantic Level ◽

Sparql Endpoint ◽

Description Framework

Abstract Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: (i) Bgee, a gene expression relational database; (ii) Orthologous Matrix (OMA), a Hierarchical Data Format 5 orthology DS; and (iii) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialized RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.
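
The idea of a joint query across several SPARQL endpoints can be illustrated with the `SERVICE` clause of SPARQL 1.1 Federated Query. The sketch below only assembles a query string; it is not the authors' implementation. The helper name, the `example.org` endpoint URLs and the `ex:` predicates are placeholders invented for illustration and do not correspond to the actual GenEx or Orthology vocabularies or to the real Bgee, OMA or UniProt endpoints.

```python
def build_federated_query(select_vars, service_patterns, prefixes=None):
    """Assemble a SPARQL query with one SERVICE block per remote endpoint.

    service_patterns is a list of (endpoint_url, graph_pattern) pairs.
    Variables shared between patterns act as the join points between sources.
    """
    prefix_lines = [
        f"PREFIX {name}: <{iri}>" for name, iri in (prefixes or {}).items()
    ]
    service_blocks = [
        f"  SERVICE <{endpoint}> {{\n    {pattern}\n  }}"
        for endpoint, pattern in service_patterns
    ]
    select_line = f"SELECT {' '.join(select_vars)} WHERE {{"
    return "\n".join(prefix_lines + [select_line] + service_blocks + ["}"])


# Placeholder endpoints and predicates (hypothetical, for illustration only).
query = build_federated_query(
    select_vars=["?gene", "?tissue", "?ortholog"],
    service_patterns=[
        ("https://example.org/expression/sparql",
         "?gene ex:expressedIn ?tissue ."),
        ("https://example.org/orthology/sparql",
         "?gene ex:orthologOf ?ortholog ."),
    ],
    prefixes={"ex": "https://example.org/vocab#"},
)
print(query)
```

Here the shared variable `?gene` plays the role of the intersection points described in the abstract: it is what lets results from the two services be joined.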

Assessing document and sentence readability in less resourced languages and across textual genres

ITL - International Journal of Applied Linguistics ◽

10.1075/itl.165.2.03del ◽

2014 ◽

Vol 165 (2) ◽

pp. 163-193 ◽

Cited By ~ 4

Author(s):

Felice Dell’Orletta ◽

Simonetta Montemagni ◽

Giulia Venturi

Keyword(s):

Assessment Tool ◽

High Accuracy ◽

Document Classification ◽

Ranking Method ◽

Classification Task ◽

Text Readability ◽

Syntactic Information ◽

Text Features ◽

Sentence Classification ◽

Readability Assessment

In this paper, we tackle three underresearched issues of the automatic readability assessment literature, namely the evaluation of text readability in less resourced languages, with respect to sentences (as opposed to documents) as well as across textual genres. Different solutions to these issues have been tested by using and refining READ‑IT, the first advanced readability assessment tool for Italian, which combines traditional raw text features with lexical, morpho-syntactic and syntactic information. In READ‑IT readability assessment is carried out with respect to both documents and sentences, with the latter constituting an important novelty of the proposed approach: READ‑IT shows a high accuracy in the document classification task and promising results in the sentence classification scenario. By comparing the results of two versions of READ‑IT, adopting a classification‑ versus ranking-based approach, we also show that readability assessment is strongly influenced by textual genre; for this reason a genre-oriented notion of readability is needed. With classification-based approaches, reliable results can only be achieved with genre-specific models: Since this is far from being a workable solution, especially for less resourced languages, a new ranking method for readability assessment is proposed, based on the notion of distance.

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Database ◽

10.1093/database/bax017 ◽

2017 ◽

Vol 2017 ◽

Cited By ~ 4

Author(s):

Xiangying Jiang ◽

Martin Ringwald ◽

Judith Blake ◽

Hagit Shatkay

Keyword(s):

Gene Expression ◽

Document Classification ◽

Mouse Gene ◽

Gene Expression Database ◽

Mouse Gene Expression

Systematic assessment of the human osteoblast transcriptome in resting and induced primary cells

Physiological Genomics ◽

10.1152/physiolgenomics.00028.2008 ◽

2008 ◽

Vol 33 (3) ◽

pp. 301-311 ◽

Cited By ~ 23

Author(s):

Elin Grundberg ◽

Helena Brändström ◽

Kevin C. L. Lam ◽

Scott Gurd ◽

Bing Ge ◽

...

Keyword(s):

Gene Expression ◽

Cell Biology ◽

Gene Networks ◽

Cell Function ◽

Expression Profiles ◽

Bone Cell ◽

Specific Gene ◽

Molecular Physiology ◽

Specific Expression ◽

Vast Number

Osteoblasts are key players in bone remodeling. The accessibility of human primary osteoblast-like cells (HObs) from bone explants makes them a lucrative model for studying molecular physiology of bone turnover, for discovering novel anabolic therapeutics, and for mesenchymal cell biology in general. Relatively little is known about resting and dynamic expression profiles of HObs, and to date no studies have been conducted to systematically assess the osteoblast transcriptome. The aim of this study was to characterize HObs and investigate signaling cascades and gene networks with genomewide expression profiling in resting and bone morphogenic protein (BMP)-2- and dexamethasone-induced cells. In addition, we compared HOb gene expression with publicly available samples from the Gene Expression Omnibus. Our data show a vast number of genes and networks expressed predominantly in HObs compared with closely related cells such as fibroblasts or chondrocytes. For instance, genes in the insulin-like growth factor (IGF) signaling pathway were enriched in HObs (P = 0.003) and included the binding proteins (IGFBP-1, -2, -5) and IGF-II and its receptor. Another HOb-specific expression pattern included leptin and its receptor (P < 10⁻⁸). Furthermore, after stimulation of HObs with BMP-2 or dexamethasone, the expression of several interesting genes and pathways was observed. For instance, our data support the role of peripheral leptin signaling in bone cell function. In conclusion, we provide the landscape of tissue-specific and dynamic gene expression in HObs. This resource will allow utilization of osteoblasts as a model to study specific gene networks and gene families related to human bone physiology and diseases.
