An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Xiangying Jiang ◽  
Martin Ringwald ◽  
Judith A Blake ◽  
Cecilia Arighi ◽  
Gongbo Zhang ◽  
...  
Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Xiangying Jiang ◽  
Pengyuan Li ◽  
James Kadin ◽  
Judith A Blake ◽  
Martin Ringwald ◽  
...  

Abstract Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL:
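The relationship between the reported metrics can be illustrated with a short sketch. The Python below is not the authors' evaluation code: it defines precision, recall, f-measure and the Matthews correlation coefficient from confusion-matrix counts, then checks that the f-measure implied by the reported precision (0.698) and recall (0.784) is close to the reported 0.738. The function names and the toy counts are hypothetical, since the abstract does not give a confusion matrix.

```python
import math


def precision(tp, fp):
    """Fraction of documents predicted relevant that are truly relevant."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truly relevant documents that were retrieved."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; uses all four confusion-matrix
    cells, so it stays informative when one class dominates."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Class balance of the dataset, using the counts given in the abstract.
relevant, irrelevant = 5469, 52866
relevant_share = relevant / (relevant + irrelevant)
print(f"relevant share: {relevant_share:.3f}")  # roughly 9% of documents

# f-measure implied by the reported (rounded) precision and recall.
f = f_measure(0.698, 0.784)
print(f"f-measure: {f:.2f}")  # about 0.74

# Hypothetical toy counts (NOT from the paper) to exercise the MCC formula.
print(f"toy MCC: {mcc(tp=80, fp=20, fn=20, tn=880):.3f}")
```

Because the relevant class is under a tenth of the dataset, a classifier that labels every document irrelevant would reach high accuracy while retrieving nothing, which is why precision, recall, f-measure and MCC are the figures reported here.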


1966 ◽  
Vol 24 ◽  
pp. 188-189
Author(s):  
T. J. Deeming

If we make a set of measurements, such as narrow-band or multicolour photo-electric measurements, which are designed to improve a scheme of classification, and in particular if they are designed to extend the number of dimensions of classification, i.e. the number of classification parameters, then some important problems of analytical procedure arise. First, it is important not to reproduce the errors of the classification scheme which we are trying to improve. Second, when trying to extend the number of dimensions of classification we have little or nothing with which to test the validity of the new parameters. Problems similar to these have occurred in other areas of scientific research (notably psychology and education) and the branch of Statistics called Multivariate Analysis has been developed to deal with them. The techniques of this subject are largely unknown to astronomers, but, if carefully applied, they should at the very least ensure that the astronomer gets the maximum amount of information out of his data and does not waste his time looking for information which is not there. More optimistically, these techniques are potentially capable of indicating the number of classification parameters necessary and giving specific formulas for computing them, as well as pinpointing those particular measurements which are most crucial for determining the classification parameters.
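One multivariate technique that fits this description is principal component analysis, although the abstract does not name a specific method. The sketch below is an assumption-laden illustration rather than the author's procedure: it builds synthetic "measurements" from two hidden parameters, then uses the eigenvalues of the covariance matrix to estimate how many parameters are needed, and the eigenvectors to give linear formulas for them. The data, the mixing matrix, the noise level and the 1% variance threshold are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 stars, 6 measured quantities driven by 2 hidden
# classification parameters plus small measurement noise.
n_stars = 500
latent = rng.normal(size=(n_stars, 2))
mixing = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # parameter 1 drives measurements 0-2
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # parameter 2 drives measurements 3-5
])
data = latent @ mixing + 0.05 * rng.normal(size=(n_stars, 6))

# Covariance of the measurements (np.cov centers the columns itself).
cov = np.cov(data, rowvar=False)

# eigh returns eigenvalues in ascending order; sort them descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Number of parameters: components explaining more than 1% of the variance
# (the threshold is an arbitrary choice for this sketch).
explained = eigvals / eigvals.sum()
n_parameters = int(np.sum(explained > 0.01))

# Each retained eigenvector is a linear formula over the measurements; the
# largest absolute weight marks the measurement that matters most for it.
key_measurements = np.argmax(np.abs(eigvecs[:, :n_parameters]), axis=0)

print("estimated number of classification parameters:", n_parameters)
print("explained variance fractions:", np.round(explained, 4))
print("most influential measurement per parameter:", key_measurements)
```

The retained eigenvectors span the same space as the hidden parameters but need not match them one to one, so interpreting each component physically remains a separate step.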


1966 ◽  
Vol 24 ◽  
pp. 3-5
Author(s):  
W. W. Morgan

1. The definition of “normal” stars in spectral classification changes with time; at the time of the publication of the Yerkes Spectral Atlas the term “normal” was applied to stars whose spectra could be fitted smoothly into a two-dimensional array. Thus, at that time, weak-lined spectra (RR Lyrae and HD 140283) would have been considered peculiar. At the present time we would tend to classify such spectra as “normal”, within a more complicated classification scheme that has a parameter varying with metallic-line intensity within a specific spectral subdivision.


1988 ◽  
Vol 102 ◽  
pp. 343-347
Author(s):  
M. Klapisch

Abstract A formal expansion of the CRM in powers of a small parameter is presented. The terms of the expansion are products of matrices. Inverses are interpreted as effects of cascades. It will be shown that this allows for the separation of the different contributions to the populations, thus providing a natural classification scheme for processes involving atoms in plasmas. Sum rules can be formulated, allowing the population of the levels, in some simple cases, to be related in a transparent way to the quantum numbers.


Author(s):  
J C Walmsley ◽  
A R Lang

Interest in the defects and impurities in natural diamond, which are found in even the most perfect stone, is driven by the fact that diamond growth occurs at a depth of over 120 km. These defects and impurities display characteristics associated with their origin and their journey through the mantle to the surface of the Earth. An optical classification scheme for diamond exists, based largely on the presence and segregation of nitrogen. For example, type Ia diamonds, which make up 98% of all natural diamonds, contain nitrogen aggregated into small non-paramagnetic clusters and usually contain sub-micrometre platelet defects on {100} planes. Numerous transmission electron microscope (TEM) studies of these platelets and associated features have been made. Some diamonds, however, contain imperfections and impurities that place them outside this main classification scheme. Two such types are described. First, coated diamonds, which possess gem-quality cores enclosed by a rind that is rich in sub-micrometre sized mineral inclusions. The transition from core to coat is quite sharp, indicating a sudden change in growth conditions (Figure 1). As part of a TEM study of the inclusions, apatite has been identified as a major constituent of the impurity present in many inclusion cavities (Figure 2).

