Validation for 2D/3D registration I: A new gold standard data set

S. A. Pawiro; P. Markelj; F. Pernuš; C. Gendrin; M. Figl; C. Weber; F. Kainberger; I. Nöbauer-Huhmann; H. Bergmeister; M. Stock; D. Georg; H. Bergmann; W. Birkfellner

doi:10.1118/1.3553402

Validation for 2D/3D registration II: The comparison of intensity- and gradient-based merit functions using a new gold standard data set

Medical Physics ◽

10.1118/1.3553403 ◽

2011 ◽

Vol 38 (3) ◽

pp. 1491-1502 ◽

Cited By ~ 27

Author(s):

Christelle Gendrin ◽

Primož Markelj ◽

Supriyanto Ardjo Pawiro ◽

Jakob Spoerk ◽

Christoph Bloch ◽

...

Keyword(s):

Gold Standard ◽

Merit Functions ◽

3D Registration ◽

Data Set ◽

Standard Data ◽

Gradient Based


Extraction of Traditional Chinese Medicine Entity: Design of a Novel Span-Level Named Entity Recognition Method With Distant Supervision (Preprint)

10.2196/preprints.28219 ◽

2021 ◽

Author(s):

Qi Jia ◽

Dezheng Zhang ◽

Haifeng Xu ◽

Yonghong Xie

Keyword(s):

Chinese Medicine ◽

Gold Standard ◽

False Negative ◽

Named Entity Recognition ◽

Entity Recognition ◽

Data Set ◽

Standard Data ◽

Named Entity ◽

Medical Entity ◽

Clinical Records

BACKGROUND Traditional Chinese medicine (TCM) clinical records contain patients' symptoms, diagnoses, and the subsequent treatments prescribed by doctors. These records are important resources for research and analysis of TCM diagnosis knowledge. However, most TCM clinical records are unstructured text. Therefore, a method to automatically extract medical entities from TCM clinical records is indispensable. OBJECTIVE Training a medical entity extraction model requires a large annotated corpus. Annotation is very costly, and there is a lack of gold-standard data sets for supervised learning methods. Therefore, we used distantly supervised named entity recognition (NER) to address this challenge. METHODS We propose a span-level distantly supervised NER approach to extract TCM medical entities. It uses a pretrained language model and a simple multilayer neural network as a classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model. The strategy randomly selects negative samples in every epoch and periodically filters out possible false-negative samples, which reduces their adverse influence. RESULTS We compared our method with baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, and it markedly outperforms the other baselines. CONCLUSIONS We developed a distantly supervised NER approach to extract medical entities from TCM clinical records. We evaluated our approach on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than the other baselines.


What Proportion of Planned Missing Data Is Allowed for Unbiased Estimates of the Association Between Energy Intake and Body Weight Using Multiple Imputation?

Current Developments in Nutrition ◽

10.1093/cdn/nzaa056_014 ◽

2020 ◽

Vol 4 (Supplement_2) ◽

pp. 1167-1167

Author(s):

Keisuke Ejima ◽

Roger Zoh ◽

Carmen Tekwe ◽

David Allison ◽

Andrew Brown

Keyword(s):

Body Weight ◽

Multiple Imputation ◽

Energy Intake ◽

Gold Standard ◽

Self Report ◽

Full Data ◽

Data Set ◽

Gold Standard Method ◽

Standard Data ◽

Unbiased Estimates

Abstract Objectives A gold standard method to measure energy intake (EI) is doubly labeled water (DLW), but it is expensive and not feasible for large studies. EI from self-report (EISR) is prone to bias, but is still widely used due to convenience; however, estimated associations between EISR and outcomes are biased in many cases. Double sampling with multiple imputation (MI) involves obtaining gold standard (e.g., EIDLW) measurements on a random subsample, and proxy data (e.g., EISR) on the whole sample, and recovering missing gold standard information using MI. However, it is not known what proportion of missingness in EIDLW is acceptable to obtain unbiased estimates of associations between EI and outcomes. Methods We used body weight as an example outcome from the CALERIE Study (N = 218). We performed two regressions on the complete dataset: EIDLW as a predictor and body weight (kg) as an outcome to estimate the ‘true’ coefficient (denoted βDLW), or using EISR as the predictor (βSR). Random subsets of EIDLW were deleted (10% to 90% of full data in 10% increments) to simulate obtaining EIDLW data on only a subset of participants. Regressions were performed using the subset EIDLW data using two different approaches: complete case analysis of only the subset (βDLWsub) and MI informed by EISR on the full data set (βMI). Bias was estimated as the difference between βDLW and βSR, between βDLW and βDLWsub for each EIDLW subset, and between βDLW and βMI for each subset. Resampling was repeated 100 times to assess the uncertainty of the bias. Results Bias of EISR was substantial (∼50%). Bias of βDLWsub was not significantly different from zero for all proportions of missing EIDLW; 95% CIs increased as proportion of missingness increased (as expected). Bias for βMI was not significantly different from zero for missingness of EIDLW up to 80%. βMI was significantly negatively biased toward βSR when the proportion of missingness was 90%. 
95% CIs of βMI estimates were narrower than those of βDLWsub for all amounts of missingness. Conclusions Unbiased, more precise estimates of the association between EI and body weight using MI were obtained with missing EIDLW as high as 80%. Collecting gold standard data on subsets may allow unbiased estimates while relying on self-report data, which are feasible to collect in larger samples. Funding Sources NIH R25HL124208. JSPS KAKENHI 18K18146. Meiji Yasuda Foundation of Health and Welfare 2019.
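
The double-sampling design in this abstract can be illustrated with a small simulation. The sketch below is not the study's analysis: the data are synthetic, all coefficients and noise levels are invented for illustration, and the imputation is a simplified stochastic regression imputation (it does not redraw the imputation-model parameters, as a full MI procedure would). The function names `slope` and `mi_slope` are hypothetical.

```python
import numpy as np


def slope(x, y):
    """OLS slope of y on x (intercept included)."""
    return np.polyfit(x, y, 1)[0]


def mi_slope(gold, proxy, y, missing, rng, m=20):
    """Pooled slope of y on gold after m stochastic regression imputations."""
    obs = ~missing
    # Imputation model fitted on observed cases: gold ~ proxy + outcome.
    X = np.column_stack([np.ones_like(proxy), proxy, y])
    coef, *_ = np.linalg.lstsq(X[obs], gold[obs], rcond=None)
    resid_sd = np.std(gold[obs] - X[obs] @ coef, ddof=X.shape[1])
    estimates = []
    for _ in range(m):
        filled = gold.copy()
        noise = rng.normal(0.0, resid_sd, int(missing.sum()))
        filled[missing] = X[missing] @ coef + noise
        estimates.append(slope(filled, y))
    # Point estimate under Rubin's rules: mean of the per-imputation slopes.
    return float(np.mean(estimates))


rng = np.random.default_rng(0)
n = 218  # sample size taken from the abstract; everything else is synthetic
gold = rng.normal(2500.0, 300.0, n)               # stand-in for EIDLW
proxy = 0.6 * gold + rng.normal(0.0, 200.0, n)    # stand-in for EISR
weight = 40.0 + 0.015 * gold + rng.normal(0.0, 3.0, n)

beta_full = slope(gold, weight)
for frac in (0.2, 0.5, 0.8):
    missing = np.zeros(n, dtype=bool)
    missing[rng.choice(n, int(frac * n), replace=False)] = True
    beta_cc = slope(gold[~missing], weight[~missing])      # complete cases
    beta_mi = mi_slope(gold, proxy, weight, missing, rng)  # imputed
    print(f"missing={frac:.0%}  bias_cc={beta_cc - beta_full:+.4f}  "
          f"bias_mi={beta_mi - beta_full:+.4f}")
```

Repeating the deletion step many times, as the abstract does with 100 resamples, would give a distribution of each bias rather than a single draw.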


Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus

JCO Clinical Cancer Informatics ◽

10.1200/cci.19.00147 ◽

2020 ◽

pp. 383-391 ◽

Cited By ~ 1

Author(s):

Yalun Li ◽

Yung-Hung Luo ◽

Jason A. Wampfler ◽

Samuel M. Rubinstein ◽

Firat Tiryaki ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Data Extraction ◽

Complete Response ◽

Time Interval ◽

Data Set ◽

Clinical Notes ◽

Standard Data

PURPOSE Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analyses for timely, reliable results. Particularly, research dealing with clinical notes relevant to patient care and outcome is seldom conducted, due to the complexity of data extraction and accurate annotation in the past. RECIST is a set of widely accepted research criteria to evaluate tumor response in patients undergoing antineoplastic therapy. The aim for this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for development of natural language processing tools. METHODS We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, and time interval and note-type filters for identifying RECIST information and established a gold standard data set for patient outcome research. RESULTS Key words reduced clinical notes to 37,406, and using four note types within 12 months postdiagnosis further reduced the number of notes to 5,005 that were manually annotated, which covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation purpose), contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors. 
CONCLUSION We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.
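
The note-reduction step described in this abstract (keywords, then note-type and time-interval filters) can be sketched as a simple filter chain. This is an illustration only: the `Note` record, the `select_notes` function, the keyword list, the note-type names, and the 365-day approximation of "12 months postdiagnosis" are all assumptions, not the study's actual definitions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Note:
    patient_id: str
    note_type: str
    written: date
    text: str


def select_notes(notes, diagnosis_dates, keywords, note_types, max_days=365):
    """Keep notes that pass the keyword, note-type, and time-interval filters."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for note in notes:
        text = note.text.lower()
        if not any(k in text for k in lowered):
            continue  # keyword filter
        if note.note_type not in note_types:
            continue  # note-type filter
        days = (note.written - diagnosis_dates[note.patient_id]).days
        if 0 <= days <= max_days:  # time-interval filter (about 12 months)
            kept.append(note)
    return kept


# Toy data; keywords and note types are placeholders, not the study's lists.
notes = [
    Note("p1", "oncology_visit", date(2015, 3, 1), "CT shows partial response."),
    Note("p1", "nursing", date(2015, 3, 2), "Partial response discussed."),
    Note("p1", "oncology_visit", date(2017, 1, 1), "Stable disease."),
    Note("p1", "oncology_visit", date(2015, 4, 1), "Routine follow-up."),
]
diagnosis_dates = {"p1": date(2015, 1, 15)}
selected = select_notes(
    notes,
    diagnosis_dates,
    keywords=["partial response", "stable disease"],
    note_types={"oncology_visit"},
)
print(len(selected))
```

Only the notes that survive all three filters would go on to manual annotation.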


Occupational Injury Surveillance Methods Using Free Text Data and Machine Learning: Creating a Gold Standard Data Set

10.4135/9781529720488 ◽

2020 ◽

Author(s):

Liane Hirabayashi ◽

Erika Scott ◽

Paul Jenkins ◽

Nicole Krupa

Keyword(s):

Machine Learning ◽

Gold Standard ◽

Occupational Injury ◽

Injury Surveillance ◽

Free Text ◽

Text Data ◽

Data Set ◽

Standard Data


Extraction of Traditional Chinese Medicine Entity: Design of a Novel Span-Level Named Entity Recognition Method With Distant Supervision

JMIR Medical Informatics ◽

10.2196/28219 ◽

2021 ◽

Vol 9 (6) ◽

pp. e28219

Author(s):

Qi Jia ◽

Dezheng Zhang ◽

Haifeng Xu ◽

Yonghong Xie

Keyword(s):

Chinese Medicine ◽

Gold Standard ◽

False Negative ◽

Named Entity Recognition ◽

Entity Recognition ◽

Data Set ◽

Standard Data ◽

Named Entity ◽

Medical Entity ◽

Clinical Records

Background Traditional Chinese medicine (TCM) clinical records contain patients' symptoms, diagnoses, and the subsequent treatments prescribed by doctors. These records are important resources for research and analysis of TCM diagnosis knowledge. However, most TCM clinical records are unstructured text. Therefore, a method to automatically extract medical entities from TCM clinical records is indispensable. Objective Training a medical entity extraction model requires a large annotated corpus. Annotation is very costly, and there is a lack of gold-standard data sets for supervised learning methods. Therefore, we used distantly supervised named entity recognition (NER) to address this challenge. Methods We propose a span-level distantly supervised NER approach to extract TCM medical entities. It uses a pretrained language model and a simple multilayer neural network as a classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model. The strategy randomly selects negative samples in every epoch and periodically filters out possible false-negative samples, which reduces their adverse influence. Results We compared our method with baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, and it markedly outperforms the other baselines. Conclusions We developed a distantly supervised NER approach to extract medical entities from TCM clinical records. We evaluated our approach on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than the other baselines.
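
The negative sampling strategy in the Methods (fresh random negatives every epoch, with periodic filtering of likely false negatives) can be sketched roughly as follows. This is not the authors' implementation: the function names, the `max_len` and `threshold` parameters, the filtering schedule, and the confidence-based filter rule are assumptions made for illustration, and the model scores are placeholders.

```python
import random


def enumerate_spans(n_tokens, max_len=4):
    """All (start, end) spans, end exclusive, up to max_len tokens long."""
    return [(i, j)
            for i in range(n_tokens)
            for j in range(i + 1, min(i + max_len, n_tokens) + 1)]


def sample_negatives(n_tokens, positives, suspected, k, rng, max_len=4):
    """Draw up to k unlabeled spans as negatives, skipping suspected false negatives."""
    pool = [s for s in enumerate_spans(n_tokens, max_len)
            if s not in positives and s not in suspected]
    return rng.sample(pool, min(k, len(pool)))


def update_suspected(scores, positives, threshold=0.9):
    """Assumed filter rule: unlabeled spans the model confidently scores as entities."""
    return {s for s, p in scores.items() if s not in positives and p >= threshold}


rng = random.Random(0)
positives = {(0, 2)}   # spans matched by the distant-supervision dictionary
suspected = set()      # spans currently excluded from negative sampling
for epoch in range(4):
    negatives = sample_negatives(6, positives, suspected, k=5, rng=rng)
    # ... one training epoch on positives + negatives would go here ...
    if epoch % 2 == 1:  # periodic filtering; the period is an assumption
        scores = {(3, 5): 0.95, (1, 2): 0.20}  # placeholder model outputs
        suspected = update_suspected(scores, positives)
```

Resampling each epoch means a wrongly unlabeled entity span is only occasionally used as a negative, and the periodic filter removes it from the pool once the model becomes confident about it.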


Predicting drug-target interactions using multi-label learning with community detection method (DTI-MLCD)

10.1101/2020.05.11.087734 ◽

2020 ◽

Cited By ~ 1

Author(s):

Yanyi Chu ◽

Xiaoqi Shan ◽

Dennis R. Salahub ◽

Yi Xiong ◽

Dong-Qing Wei

Keyword(s):

Machine Learning ◽

Community Detection ◽

Gold Standard ◽

Drug Target ◽

Drug Repositioning ◽

Binary Classification ◽

Predictive Performance ◽

Detection Methods ◽

Data Set ◽

Standard Data

Abstract. Identifying drug-target interactions (DTIs) is an important step in drug discovery and drug repositioning. To reduce the heavy cost of experiments, machine learning has been applied to this field, and many computational methods have been developed, especially binary classification methods. However, there is still much room for improvement in the performance of current methods. Multi-label learning can reduce the difficulties faced by binary classification while achieving high predictive performance, but it has not been explored extensively. The key challenge it faces is the exponentially sized output space, which taking label correlations into account can help address. We therefore facilitate multi-label classification by introducing community detection methods for DTI prediction, in a method named DTI-MLCD. In addition, we updated the gold standard data set proposed in 2008, which is still in use today. DTI-MLCD was evaluated on the gold standard data set before and after the update and outperforms other classical machine learning methods and other benchmark methods, which confirms its effectiveness. The data and code for this study can be found at https://github.com/a96123155/DTI-MLCD.


Development and use of a gold-standard data set for subjectivity classifications

Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics ◽

10.3115/1034678.1034721 ◽

1999 ◽

Cited By ~ 135

Author(s):

Janyce M. Wiebe ◽

Rebecca F. Bruce ◽

Thomas P. O'Hara

Keyword(s):

Gold Standard ◽

Data Set ◽

Standard Data


The Study of Multiple Classes Boosting Classification Method Based on Local Similarity

Algorithms ◽

10.3390/a14020037 ◽

2021 ◽

Vol 14 (2) ◽

pp. 37

Author(s):

Shixun Wang ◽

Qiang Chen

Keyword(s):

Image Retrieval ◽

Loss Function ◽

Single Mode ◽

Local Similarity ◽

Text And Image ◽

Data Set ◽

Standard Data ◽

Weak Learner ◽

Great Progress ◽

Data Points

Boosting in ensemble learning has made great progress, but most methods boost a single mode. For this reason, the simple multiclass enhancement framework that uses local similarity as a weak learner is extended to multimodal multiclass enhancement Boosting. First, with local similarity as the weak learner, the loss function is used to find the basic loss, and the logarithmic data points are binarized. Then, we find the optimal local similarity and the corresponding loss; whichever of this and the basic loss is smaller is taken as the best so far. Second, the local similarity of the two points is calculated, and the loss is then calculated from that local similarity. Finally, text and images are retrieved from each other, and the accuracy of text retrieval and of image retrieval is obtained. The multimodal multiclass enhancement framework with local similarity as the weak learner is evaluated on a standard data set and compared with other state-of-the-art methods, and the experimental results demonstrate the effectiveness of the method.


Elemental ratio measurements of organic compounds using aerosol mass spectrometry: characterization, improved calibration, and implications

Atmospheric Chemistry and Physics ◽

10.5194/acp-15-253-2015 ◽

2015 ◽

Vol 15 (1) ◽

pp. 253-272 ◽

Cited By ~ 418

Author(s):

M. R. Canagaratna ◽

J. L. Jimenez ◽

J. H. Kroll ◽

Q. Chen ◽

S. H. Kessler ◽

...

Keyword(s):

High Resolution ◽

Vacuum Ultraviolet ◽

Laboratory Data ◽

Detailed Examination ◽

Relative Difference ◽

Data Set ◽

Elemental Ratios ◽

Standard Data ◽

Aerosol Mass ◽

Relative Errors

Abstract. Elemental compositions of organic aerosol (OA) particles provide useful constraints on OA sources, chemical evolution, and effects. The Aerodyne high-resolution time-of-flight aerosol mass spectrometer (HR-ToF-AMS) is widely used to measure OA elemental composition. This study evaluates AMS measurements of atomic oxygen-to-carbon (O : C), hydrogen-to-carbon (H : C), and organic mass-to-organic carbon (OM : OC) ratios, and of carbon oxidation state (OS C) for a vastly expanded laboratory data set of multifunctional oxidized OA standards. For the expanded standard data set, the method introduced by Aiken et al. (2008), which uses experimentally measured ion intensities at all ions to determine elemental ratios (referred to here as "Aiken-Explicit"), reproduces known O : C and H : C ratio values within 20% (average absolute value of relative errors) and 12%, respectively. The more commonly used method, which uses empirically estimated H2O+ and CO+ ion intensities to avoid gas phase air interferences at these ions (referred to here as "Aiken-Ambient"), reproduces O : C and H : C of multifunctional oxidized species within 28 and 14% of known values. The values from the latter method are systematically biased low, however, with larger biases observed for alcohols and simple diacids. A detailed examination of the H2O+, CO+, and CO2+ fragments in the high-resolution mass spectra of the standard compounds indicates that the Aiken-Ambient method underestimates the CO+ and especially H2O+ produced from many oxidized species. Combined AMS–vacuum ultraviolet (VUV) ionization measurements indicate that these ions are produced by dehydration and decarboxylation on the AMS vaporizer (usually operated at 600 °C). Thermal decomposition is observed to be efficient at vaporizer temperatures down to 200 °C. These results are used together to develop an "Improved-Ambient" elemental analysis method for AMS spectra measured in air. 
The Improved-Ambient method uses specific ion fragments as markers to correct for molecular functionality-dependent systematic biases and reproduces known O : C (H : C) ratios of individual oxidized standards within 28% (13%) of the known molecular values. The error in Improved-Ambient O : C (H : C) values is smaller for theoretical standard mixtures of the oxidized organic standards, which are more representative of the complex mix of species present in ambient OA. For ambient OA, the Improved-Ambient method produces O : C (H : C) values that are 27% (11%) larger than previously published Aiken-Ambient values; a corresponding increase of 9% is observed for OM : OC values. These results imply that ambient OA has a higher relative oxygen content than previously estimated. The OS C values calculated for ambient OA by the two methods agree well, however (average relative difference of 0.06 OS C units). This indicates that OS C is a more robust metric of oxidation than O : C, likely since OS C is not affected by hydration or dehydration, either in the atmosphere or during analysis.
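
The closing claim, that OS C is unaffected by hydration or dehydration while O : C is not, can be checked with a few lines of arithmetic. The sketch below assumes the commonly used approximation OS C ≈ 2 × O : C − H : C, which is not stated in the abstract, and an OM : OC expression that counts only C, H, and O. The `ratios` helper and the choice of malic acid as an example are illustrative, not taken from the paper.

```python
def ratios(c, h, o):
    """Elemental ratios for a compound CcHhOo (C, H, and O only)."""
    oc = o / c
    hc = h / c
    return {
        "O:C": oc,
        "H:C": hc,
        # Organic mass over carbon mass, using standard atomic masses.
        "OM:OC": (12.011 * c + 1.008 * h + 15.999 * o) / (12.011 * c),
        # Assumed approximation for the average carbon oxidation state.
        "OSc": 2 * oc - hc,
    }


malic = ratios(4, 6, 5)        # malic acid, C4H6O5
dehydrated = ratios(4, 4, 4)   # same molecule after losing one H2O

print(malic)
print(dehydrated)
```

Losing one H2O from a molecule with x carbons lowers O : C by 1/x and H : C by 2/x, so 2 × O : C − H : C is unchanged, whereas O : C and OM : OC both drop.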
