GeoLink Data Set: A Complex Alignment Benchmark from Real-world Ontology

2020 ◽  
Vol 2 (3) ◽  
pp. 353-378 ◽  
Author(s):  
Lu Zhou ◽  
Michelle Cheatham ◽  
Adila Krisnadhi ◽  
Pascal Hitzler

Ontology alignment has been studied for over a decade, and over that time many alignment systems and methods have been developed by researchers to find simple 1-to-1 equivalence matches between two ontologies. However, very few alignment systems focus on finding complex correspondences. One reason for this limitation may be that there are no widely accepted alignment benchmarks that contain such complex relationships. In this paper, we propose a real-world data set from the GeoLink project as a potential complex ontology alignment benchmark. The data set consists of two ontologies, the GeoLink Base Ontology (GBO) and the GeoLink Modular Ontology (GMO), as well as a manually created reference alignment that was developed in consultation with domain experts from different institutions. The alignment includes 1:1, 1:n, and m:n equivalence and subsumption correspondences, and is available in both the Expressive and Declarative Ontology Alignment Language (EDOAL) and rule syntax. The benchmark has been expanded from its original version to contain real-world instance data from seven geoscience data providers, published according to both ontologies. This allows it to be used by extensional alignment systems or those that require training data. The benchmark has been incorporated into the Ontology Alignment Evaluation Initiative (OAEI) complex track to help researchers test their automated alignment systems and algorithms. This paper also analyzes the challenges inherent in effectively generating, detecting, and evaluating complex ontology alignments and provides a road map for future work on this topic.
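To illustrate what makes a correspondence "complex", the sketch below contrasts a simple 1:1 equivalence with a 1:n correspondence, written as plain Python structures. The GBO/GMO-style names are illustrative only; in the benchmark itself, correspondences are expressed in EDOAL or rule syntax.

```python
# A minimal sketch (not the benchmark's actual alignment) contrasting a simple
# 1:1 equivalence with a complex 1:n correspondence. The GBO/GMO-style names
# are illustrative; the benchmark expresses these in EDOAL or rule syntax.

simple_correspondence = {
    "source": "gbo:Cruise",        # one class in the GeoLink Base Ontology
    "target": "gmo:Cruise",        # one class in the GeoLink Modular Ontology
    "relation": "equivalence",     # the 1:1 case most aligners can find
}

complex_correspondence = {
    # 1:n: one source property corresponds to a construction over several
    # target terms (role reification on the GMO side).
    "source": "gbo:hasChiefScientist(x, y)",
    "target": ("gmo:providesAgentRole(x, r) AND "
               "gmo:ChiefScientistRole(r) AND "
               "gmo:performedBy(r, y)"),
    "relation": "equivalence",
}

for c in (simple_correspondence, complex_correspondence):
    print(c["source"], "<=>", c["target"])
```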

2018 ◽  
Vol 210 ◽  
pp. 04019 ◽  
Author(s):  
Hyontai SUG

Recent Go matches between a human champion and the artificial intelligence AlphaGo showed a major advance in machine learning technologies. While AlphaGo was trained using real-world data, AlphaGo Zero was trained using massive random data, and the fact that AlphaGo Zero beat AlphaGo decisively revealed that the diversity and size of training data are important for the performance of machine learning algorithms, especially the deep learning algorithms of neural networks. On the other hand, artificial neural networks and decision trees are widely accepted machine learning algorithms because of their robustness to errors and their comprehensibility, respectively. In this paper, in order to show empirically that the diversity and size of data are important factors for the performance of machine learning algorithms, these two representative algorithms are used for the experiment. A real-world data set called breast tissue was chosen because it consists of real-valued attributes, a property well suited to artificial random data generation. The result of the experiment confirms that the diversity and size of data are very important factors for better performance.
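To make the experimental idea concrete, here is a minimal Python/scikit-learn sketch of one way to grow training-set size and diversity with artificial random data. It uses scikit-learn's built-in breast cancer set as a stand-in for the UCI breast tissue data, and the labeling scheme for synthetic points is an assumption, not the paper's procedure.

```python
# A minimal sketch, under assumptions not stated in the abstract: generate
# random synthetic points within the observed feature ranges, label them with
# a model fit on the real data, and compare a decision tree and a neural
# network trained with and without the synthetic data.
import numpy as np
from sklearn.datasets import load_breast_cancer  # stand-in for UCI "breast tissue"
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uniform random points inside the real feature ranges (real-valued attributes
# make this straightforward), labeled by a tree fit on the real data.
rng = np.random.default_rng(0)
labeler = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
X_syn = rng.uniform(X_train.min(axis=0), X_train.max(axis=0),
                    size=(5000, X.shape[1]))
y_syn = labeler.predict(X_syn)

for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                  ("mlp", MLPClassifier(max_iter=1000, random_state=0))]:
    real_only = clf.fit(X_train, y_train).score(X_test, y_test)
    augmented = clf.fit(np.vstack([X_train, X_syn]),
                        np.concatenate([y_train, y_syn])).score(X_test, y_test)
    print(f"{name}: real-only={real_only:.3f}  real+synthetic={augmented:.3f}")
```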


Author(s):  
I. Weber ◽  
J. Bongartz ◽  
R. Roscher

Abstract. Detecting objects in aerial images is an important task in many environmental and infrastructure-related applications. Deep learning object detectors like RetinaNet offer decent detection performance; however, they require a large amount of annotated training data. It is well known that collecting annotated data is a time-consuming and tedious task, and it often cannot be performed sufficiently well for remote sensing tasks, since the required data must cover a wide variety of scenes and objects. In this paper, we analyze the performance of such a network given a limited amount of training data and address the research question of whether artificially generated training data can be used to overcome the challenge of real-world data sets with a small amount of training data. For our experiments, we use the ISPRS 2D Semantic Labeling Contest Potsdam data set for vehicle detection, from which we derive object bounding boxes of vehicles suitable for our task. We generate artificial data based on vehicle blueprints and show that networks trained only on generated data may have lower performance, but are still able to detect most of the vehicles found in the real data set. Moreover, we show that by adding generated data to real-world data sets with a limited amount of training data, performance can be increased significantly and, in some cases, can almost reach baseline performance levels.
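A minimal sketch, assuming a PyTorch-style pipeline, of the data-mixing step the paper evaluates; the toy dataset class and all sizes are placeholders, not the authors' RetinaNet setup.

```python
# A minimal sketch of mixing a small real data set with a large artificially
# generated one before training a detector. ToyDetectionSet and all sizes are
# hypothetical placeholders, not the authors' code.
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class ToyDetectionSet(Dataset):
    """Stand-in for a real or blueprint-generated aerial vehicle data set."""
    def __init__(self, n_images: int):
        self.n_images = n_images
    def __len__(self):
        return self.n_images
    def __getitem__(self, idx):
        image = torch.zeros(3, 256, 256)                  # placeholder tile
        boxes = torch.tensor([[32.0, 32.0, 96.0, 96.0]])  # one dummy vehicle box
        return image, boxes

real = ToyDetectionSet(200)          # limited annotated real imagery
generated = ToyDetectionSet(10_000)  # rendered from vehicle blueprints

# Training on `generated` alone gives a weaker but usable detector; adding it
# to the limited real data is where the paper reports the significant gain.
train_set = ConcatDataset([real, generated])
loader = DataLoader(train_set, batch_size=8, shuffle=True)
images, boxes = next(iter(loader))   # batched tensors ready for a detector
```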


2019 ◽  
Vol 10 (03) ◽  
pp. 409-420 ◽  
Author(s):  
Steven Horng ◽  
Nathaniel R. Greenbaum ◽  
Larry A. Nathanson ◽  
James C. McClay ◽  
Foster R. Goss ◽  
...  

Objective: Numerous attempts have been made to create a standardized "presenting problem" or "chief complaint" list to characterize the nature of an emergency department visit. Previous attempts have failed to gain widespread adoption, as they were not freely shareable or did not contain the right level of specificity, structure, and clinical relevance to gain acceptance by the larger emergency medicine community. Using real-world data, we constructed a presenting problem list that addresses these challenges. Materials and Methods: We prospectively captured the presenting problems for 180,424 consecutive emergency department patient visits at an urban, academic, Level I trauma center in the Boston metro area. No patients were excluded. We used a consensus process to iteratively derive our system using real-world data. We used the first 70% of consecutive visits to derive our ontology, followed by a 6-month washout period, and the remaining 30% for validation. All concepts were mapped to the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT). Results: Our system consists of a polyhierarchical ontology containing 692 unique concepts, 2,118 synonyms, and 30,613 nonvisible descriptions to correct misspellings and nonstandard terminology. Our ontology successfully captured structured data for 95.9% of visits in our validation data set. Discussion and Conclusion: We present the HierArchical Presenting Problem ontologY (HaPPy). This ontology was empirically derived and then iteratively validated by an expert consensus panel. HaPPy contains 692 presenting problem concepts, each mapped to SNOMED CT. This freely shareable ontology can help facilitate presenting problem-based quality metrics, research, and patient care.
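As an illustration of how such an ontology resolves free text, the sketch below shows concept, synonym, and nonvisible-description lookups in Python. The entries are illustrative examples, not actual HaPPy content.

```python
# An illustrative lookup, not actual HaPPy content: visible concepts and
# synonyms map free text to one concept, while "nonvisible descriptions"
# absorb misspellings and nonstandard terminology. The SNOMED CT code is
# shown for illustration only.

concepts = {"abdominal pain": "21522001"}          # concept -> SNOMED CT code
synonyms = {"stomach ache": "abdominal pain"}      # visible synonym
nonvisible = {"abdmoinal pain": "abdominal pain",  # misspelling
              "belly ache": "abdominal pain"}      # nonstandard term

def normalize(text: str):
    """Map a free-text presenting problem to (concept, SNOMED CT code)."""
    t = text.strip().lower()
    concept = t if t in concepts else synonyms.get(t) or nonvisible.get(t)
    return (concept, concepts[concept]) if concept else (None, None)

print(normalize("Stomach ache"))    # ('abdominal pain', '21522001')
print(normalize("abdmoinal pain"))  # ('abdominal pain', '21522001')
```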


2011 ◽  
Vol 2011 ◽  
pp. 1-14 ◽  
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially the density-difference feature. In this paper, we present a clustering algorithm based on density consistency, which applies a filtering process to identify points sharing the same structural feature and classify them into the same cluster. The method is not restricted by cluster shape or high-dimensional data sets, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.


2019 ◽  
Author(s):  
Vincent J Major ◽  
Neil Jethani ◽  
Yindalon Aphinyanaphongs

Abstract. Objective: The main criterion for choosing how models are built is the subsequent effect on future (estimated) model performance. In this work, we evaluate the effects of experimental design choices on both estimated and actual model performance. Materials and Methods: Four years of hospital admissions are used to develop a 1-year end-of-life prediction model. Two common methods to select appropriate prediction timepoints (backwards-from-outcome and forwards-from-admission) are introduced and combined with two ways of separating cohorts for training and testing (internal and temporal). Two models are trained in identical conditions, and their performances are compared. Finally, operating thresholds are selected in each test set and applied in a final, 'real-world' cohort consisting of one year of admissions. Results: Backwards-from-outcome cohort selection discards 75% of candidate admissions (n = 23,579), whereas forwards-from-admission selection includes many more (n = 92,148). Both selection methods produce similar global performances when applied to an internal test set. However, when applied to the temporally defined 'real-world' set, forwards-from-admission yields higher areas under the ROC and precision-recall curves (88.3 and 56.5% vs. 83.2 and 41.6%). Discussion: A backwards-from-outcome experiment effectively transforms the training data such that it no longer resembles real-world data. This results in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance. Conclusion: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.
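A minimal pandas sketch of the two cohort-selection strategies on a toy admissions table; the column names and the labeling rule are assumptions, not the paper's schema.

```python
# A minimal sketch contrasting backwards-from-outcome and forwards-from-
# admission cohort selection on a toy admissions table. Column names and
# the labeling rule are hypothetical.
import pandas as pd

adm = pd.DataFrame({
    "admit_time": pd.to_datetime(["2015-03-01", "2016-07-15", "2017-01-10"]),
    "death_time": pd.to_datetime(["2015-09-01", None, "2017-03-01"]),
})
horizon = pd.Timedelta(days=365)

# Backwards-from-outcome: anchor on the known outcome and keep only admissions
# that allow looking back from it -- this discards most candidates and leaks
# outcome knowledge into cohort construction.
backwards = adm[adm["death_time"].notna()
                & (adm["death_time"] - adm["admit_time"] <= horizon)]

# Forwards-from-admission: anchor on admission time and label each admission
# by whether death occurs within the horizon -- keeps all admissions and
# resembles how the model would be applied in deployment.
forwards = adm.assign(
    label=adm["death_time"].notna()
          & (adm["death_time"] - adm["admit_time"] <= horizon))

print(len(backwards), "backwards rows;", len(forwards), "forwards rows")
```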


Author(s):  
Drew Levin ◽  
Patrick Finley

Objective: To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques.

Introduction: Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular 'surveillance' software package.

Methods: Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similarly to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association.

Results: Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types.

Conclusions: New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here represent the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.

Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ = 1.8. City and link sizes are scaled to reflect their weight.

Figure 2: An example of observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach the city from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.

Keywords: Simulation; Network; Spatial; Synthetic; Data
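The pipeline described above is concrete enough to sketch. Below is a minimal Python approximation (the actual generator is written in R and integrates with the 'surveillance' package): sample expected degrees from a power-law PMF, build a Chung-Lu graph, scale city sizes by the three-quarter power law, and run a stochastic SIR process over the network. All parameter values are illustrative, not the paper's defaults.

```python
# A minimal Python sketch of the generator's pipeline; the real tool is in R
# and all parameter values here are illustrative.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_cities, gamma = 20, 1.8

# Power-law PMF over possible degrees, then sample an expected degree per city.
degrees = np.arange(1, n_cities)
pmf = degrees ** (-gamma)
pmf /= pmf.sum()
expected_deg = rng.choice(degrees, size=n_cities, p=pmf)

# Chung-Lu graph: edge (i, j) appears with probability ~ w_i * w_j / sum(w).
# (The paper additionally joins disconnected components via linear
# preferential attachment, omitted here.)
G = nx.expected_degree_graph(list(expected_deg), selfloops=False, seed=0)

# City size from the observed three-quarter power-law scaling with degree.
population = 1000 * expected_deg ** 0.75

# Stochastic SIR per city with simple flight-based mixing between neighbors.
beta, recovery = 0.3, 0.1
infected = np.zeros(n_cities)
recovered = np.zeros(n_cities)
infected[rng.integers(n_cities)] = 1.0   # seed a single random city
for day in range(60):
    for i in G.nodes:
        pressure = infected[i] + 0.05 * sum(infected[j] for j in G.neighbors(i))
        susceptible = population[i] - infected[i] - recovered[i]
        p_inf = min(1.0, beta * pressure / population[i])
        new_inf = rng.binomial(int(susceptible), p_inf)
        new_rec = rng.binomial(int(infected[i]), recovery)
        infected[i] += new_inf - new_rec
        recovered[i] += new_rec
    # A fixed fraction of cases is emitted as daily clinic-visit counts,
    # to be added onto statically generated baseline data.
    clinic_visits = (0.2 * infected).astype(int)
```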


2019 ◽  
Vol 16 (2) ◽  
pp. 151-169 ◽  
Author(s):  
Jairo Francisco de Souza ◽  
Sean Wolfgand Matsui Siqueira ◽  
Bernardo Nunes

Purpose: Although new ontology matchers are proposed every year to address different aspects of the semantic heterogeneity problem, finding the most suitable alignment approach is still an issue. This study aims to propose a computational solution for ontology meta-matching (OMM) and a framework designed for developers to make use of alignment techniques in their applications. Design/methodology/approach: The framework includes several similarity functions that can be chosen by developers; it then automatically sets weights for each function to obtain better alignments. To evaluate the framework, several simulations were performed with a data set from the Ontology Alignment Evaluation Initiative. Simple similarity functions were used, rather than aligners known in the literature, to demonstrate that the results would be influenced more by the proposed meta-alignment approach than by the functions used. Findings: The results showed that the framework is able to adapt to different test cases. The approach achieved better results when compared with existing ontology meta-matchers. Originality/value: Although approaches for OMM have been proposed, it is not easy to use them during software development. This work, in contrast, presents a framework that can be used by developers to align ontologies. New ontology matchers can be added, and the framework is extensible to new methods. Moreover, this work presents a novel OMM approach, modeled as a linear equation system, which can be easily computed.
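Although the paper's exact equation system is not reproduced here, the sketch below illustrates the general idea of meta-matching as a linear system: solve for similarity-function weights that best reproduce a reference alignment, then threshold the combined score. The scores and data are toy stand-ins.

```python
# A minimal sketch of ontology meta-matching as a linear system: given k
# similarity functions scored over candidate entity pairs, solve for weights
# that best reproduce the reference alignment. Toy data, not the framework's
# actual components or formulation.
import numpy as np

# Rows: candidate entity pairs; columns: scores from k similarity functions
# (e.g., string-, structure-, and instance-based), values in [0, 1].
S = np.array([[0.9, 0.7, 0.8],
              [0.2, 0.3, 0.1],
              [0.8, 0.4, 0.9],
              [0.1, 0.6, 0.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])  # 1 = pair is in the reference alignment

# Least-squares solution of S w = y gives one possible weighting; the paper's
# exact equation system may differ.
w, *_ = np.linalg.lstsq(S, y, rcond=None)
combined = S @ w              # meta-matcher score per candidate pair
matches = combined >= 0.5     # simple threshold for the final alignment
print(np.round(w, 3), matches)
```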


2019 ◽  
Vol 37 (7_suppl) ◽  
pp. 180-180 ◽  
Author(s):  
A. Oliver Sartor ◽  
Sreevalsa Appukkuttan ◽  
Ronald E. Aubert ◽  
Jeffrey Weiss ◽  
Joy Wang ◽  
...  

Background: Radium-223 (RA-223) is the first FDA-approved targeted alpha therapy that significantly improves overall survival (OS) in patients (pts) with metastatic castration-resistant prostate cancer (mCRPC) and symptomatic bone metastases. There are limited real-world data describing current RA-223 use. Methods: A retrospective patient chart review was conducted of men who received at least 1 cycle of RA-223 for mCRPC in 10 centers throughout the US (4 academic, 6 private practices). All pts had a minimum follow-up of 4 months, hospice placement, or death. Descriptive analyses of clinical characteristics and treatment outcomes were performed. Results: Among the 200 pts (mean age 73.6 years; mean Charlson comorbidity index 6.9), RA-223 was initiated on average 1.6 years from mCRPC diagnosis (first-line use (1L) = 38.5%, 2L = 31.5%, and ≥3L = 30%). 78% completed 5-6 cycles of RA-223, with a mean therapy duration of 4.2 months. Among all pts, 43% received RA-223 as monotherapy (no overlap with other mCRPC therapies), while 57% received combination therapy with either abiraterone or enzalutamide. Median OS following RA-223 initiation was 21.2 months (95% CI 19.6-29.2). The table provides RA-223 utilization by type of clinical practice. Conclusions: Utilization of RA-223 in this real-world data set was distinct from clinical trial data. Most patients received RA-223 in combination with abiraterone or enzalutamide, therapies that were unavailable when the pivotal trial was conducted. Median survival was 21.2 months. Real-world use of RA-223 has evolved as newer agents have become FDA approved in bone-metastatic CRPC. Academic and community patterns of practice were more similar than distinct. [Table: see text]

