Query Combinators: Domain Specific Query Languages for Medical Research

2019 ◽  
Author(s):  
Clark C. Evans ◽  
Kyrylo Simonov

Abstract
A new way to conceptualize computations, Query Combinators, can be used to create a data processing environment shared by the entire medical research team. For a given research context, a domain-specific query language can be created that represents data sources, analysis methods, and integrative domain knowledge. Research questions can then take an intuitive, high-level form that can be reasoned about and discussed.
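The combinator idea can be sketched as small, reusable query pieces that compose into a high-level research question. The following Python sketch is purely illustrative; the functions and data are invented here and are not the authors' actual DSL.

```python
# Minimal sketch of combinator-style query composition (illustrative only;
# names and data are hypothetical, not the authors' actual language).

def filter_by(pred):
    """Combinator: keep records satisfying a predicate."""
    return lambda records: [r for r in records if pred(r)]

def project(*fields):
    """Combinator: keep only the named fields of each record."""
    return lambda records: [{f: r[f] for f in fields} for r in records]

def compose(*stages):
    """Chain combinators into a single query."""
    def query(records):
        for stage in stages:
            records = stage(records)
        return records
    return query

patients = [
    {"id": 1, "age": 67, "diagnosis": "diabetes"},
    {"id": 2, "age": 34, "diagnosis": "asthma"},
]

# A research question built from reusable pieces:
elderly_diabetics = compose(
    filter_by(lambda r: r["age"] >= 65),
    filter_by(lambda r: r["diagnosis"] == "diabetes"),
    project("id", "age"),
)

print(elderly_diabetics(patients))  # [{'id': 1, 'age': 67}]
```

Because each stage is an ordinary value, the team can share, name, and recombine the pieces rather than rewriting whole queries.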

10.2196/17687 ◽  
2020 ◽  
Vol 4 (8) ◽  
pp. e17687
Author(s):  
Kristina K Gagalova ◽  
M Angelica Leon Elizalde ◽  
Elodie Portales-Casamar ◽  
Matthias Görges

Background
Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used to integrate several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.
Objective
The architectural choices of major IDRs are highly diverse, and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation.
Methods
We reviewed manuscripts published in the peer-reviewed scientific literature between 2008 and 2020 and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 describing 29 different architectures. The IDRs were analyzed for common features and classified according to their data processing and integration solution choices.
Results
Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in their underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These approaches were driven by features such as the data sources, whether the IDR served a single institution or a collaborative project, the intended primary data user, and the purpose (research only, or also clinical or operational decision making).
Conclusions
IDR implementations are diverse and complex undertakings, which benefit from an evaluation of requirements and a definition of scope in the early planning stage. Factors such as data source diversity and the intended users of the IDR influence data flow and synchronization, both of which are crucial in IDR architecture planning.


2013 ◽  
Vol 9 (2) ◽  
pp. 39-65 ◽  
Author(s):  
Cristina Ciferri ◽  
Ricardo Ciferri ◽  
Leticia Gómez ◽  
Markus Schneider ◽  
Alejandro Vaisman ◽  
...  

The lack of an appropriate conceptual model for data warehouses and OLAP systems has led to the tendency to deploy logical models (for example, star, snowflake, and constellation schemas) for them as conceptual models. ER model extensions, UML extensions, special graphical user interfaces, and dashboards have been proposed as conceptual approaches. However, they introduce their own problems, are somewhat complex and difficult to understand, and are not always user-friendly. They also involve a steep learning curve, and most of them address only structural design, not considering the associated operations. Therefore, they are not really an improvement and, in the end, only represent a reflection of the logical model. The essential drawback of offering this system-centric view as a user concept is that knowledge workers are confronted with the full and overwhelming complexity of these systems as well as complicated and user-unfriendly query languages such as SQL OLAP and MDX. In this article, the authors propose a user-centric conceptual model for data warehouses and OLAP systems, called the Cube Algebra. It takes the cube metaphor literally and provides the knowledge worker with high-level cube objects and related concepts. A novel query language leverages well-known high-level operations such as roll-up, drill-down, slice, and drill-across. As a result, the logical and physical levels are hidden from the unskilled end user.
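The cube metaphor can be made concrete with a toy example. The sketch below implements roll-up and slice over a tiny fact table in Python; the Cube Algebra itself is far richer, and the dimension and measure names here are invented for illustration.

```python
# Toy cube operations over a tiny fact table (illustrative only; the
# Cube Algebra provides these as first-class, typed cube operations).
from collections import defaultdict

facts = [
    {"year": 2012, "region": "EU", "sales": 10},
    {"year": 2012, "region": "US", "sales": 7},
    {"year": 2013, "region": "EU", "sales": 5},
]

def roll_up(facts, dim, measure):
    """Aggregate the measure so only `dim` remains as a dimension."""
    totals = defaultdict(int)
    for f in facts:
        totals[f[dim]] += f[measure]
    return dict(totals)

def slice_(facts, dim, value):
    """Fix one dimension to a single value, reducing dimensionality."""
    return [f for f in facts if f[dim] == value]

print(roll_up(facts, "year", "sales"))   # {2012: 17, 2013: 5}
print(slice_(facts, "region", "EU"))
```

The point of the user-centric model is that the knowledge worker writes only such high-level operations, never the underlying star-schema SQL.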


Author(s):  
Christin Katharina Kreutz ◽  
Michael Wolz ◽  
Jascha Knack ◽  
Benjamin Weyers ◽  
Ralf Schenkel

Abstract
Information access to bibliographic metadata needs to be uncomplicated, as users may not benefit from complex and potentially richer data that may be difficult to obtain. Sophisticated research questions including complex aggregations could be answered with complex SQL queries. However, this comes at the cost of high complexity, which requires a high level of expertise even from trained programmers. A domain-specific query language can provide a straightforward solution to this problem. Although less generic, it can support users unfamiliar with query construction in formulating complex information needs. In this paper, we present and evaluate SchenQL, a simple and applicable query language accompanied by a prototypical GUI. SchenQL focuses on querying bibliographic metadata using the vocabulary of domain experts. The easy-to-learn domain-specific query language is suitable for domain experts as well as casual users while still making it possible to answer complex information demands. Query construction and information exploration are supported by the prototypical GUI. We present an evaluation of the complete system: different variants for executing SchenQL queries are benchmarked, and interviews with domain experts and a bipartite quantitative user study demonstrate SchenQL's suitability and its high level of user acceptance.
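The general mechanism of such a language, mapping domain vocabulary onto SQL, can be sketched in a few lines. The keywords, schema, and translation below are invented for illustration and do not reflect SchenQL's actual grammar or implementation.

```python
# Hypothetical sketch of mapping a domain phrase onto SQL (illustrative
# only; SchenQL's real grammar, keywords, and schema differ).

VOCAB = {
    "PAPERS": "SELECT p.title FROM papers p",
    "AUTHORED BY": "JOIN authorship a ON a.paper_id = p.id "
                   "JOIN persons ps ON ps.id = a.person_id "
                   "WHERE ps.name = ",
}

def translate(query):
    """Turn e.g. \"PAPERS AUTHORED BY 'Jane Doe'\" into one SQL string."""
    head, _, rest = query.partition(" AUTHORED BY ")
    return VOCAB[head] + " " + VOCAB["AUTHORED BY"] + rest

print(translate("PAPERS AUTHORED BY 'Jane Doe'"))
```

The user writes in the vocabulary of the domain ("papers", "authored by") and never sees the joins the translation introduces.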


Author(s):  
Stefan Mengel ◽  
Sebastian Skritek

Abstract
We study the complexity of evaluating well-designed pattern trees, a query language extending conjunctive queries with the possibility to declare parts of the query optional. This possibility of optional parts is important for obtaining meaningful results over incomplete data sources, as is common in semantic web settings. Recently, a structural characterization of the classes of well-designed pattern trees that can be evaluated in polynomial time was shown. However, projection, a central feature of many query languages, was not considered in this study. We work towards closing this gap by giving a characterization of all tractable classes of simple well-designed pattern trees with projection (under some common complexity-theoretic assumptions). Since well-designed pattern trees correspond to the fragment of well-designed {∧, OPT}-SPARQL queries, this gives a complete description of the tractable classes of queries with projection in this fragment that can be characterized by the underlying graph structures of the queries. For non-simple pattern trees the tractability criteria for simple pattern trees do not capture all tractable classes. We thus extend the characterization for the non-simple case in order to capture some additional tractable cases.
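The effect of an optional part over incomplete data can be shown with a toy evaluator, in the spirit of SPARQL's OPTIONAL: records missing the optional field still match, and the field is attached only where present. This is an illustration only, not the paper's formal pattern-tree semantics.

```python
# Toy illustration of a query with an optional part over incomplete data
# (in the spirit of SPARQL OPTIONAL; not the paper's formal pattern trees).
people = [{"name": "ada"}, {"name": "bob", "email": "b@x.org"}]

def with_optional(records, required, optional):
    """Keep records having all required fields; attach optional ones if present."""
    out = []
    for r in records:
        if all(f in r for f in required):
            row = {f: r[f] for f in required}
            row.update({f: r[f] for f in optional if f in r})
            out.append(row)
    return out

print(with_optional(people, ["name"], ["email"]))
# [{'name': 'ada'}, {'name': 'bob', 'email': 'b@x.org'}]
```

A plain conjunctive query requiring both fields would have dropped "ada" entirely; the optional part keeps her partial answer.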


2015 ◽  
Vol 54 (01) ◽  
pp. 41-44 ◽  
Author(s):  
A. Taweel ◽  
S. Miles ◽  
B. C. Delaney ◽  
R. Bache

Summary
Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Managing Interoperability and Complexity in Health Systems”.
Objectives: The increasing availability of electronic clinical data provides great potential for finding eligible patients for clinical research. However, data heterogeneity makes it difficult for clinical researchers to interrogate sources consistently. Existing standard query languages are often not sufficient to query across diverse representations. Thus, a higher-level domain language is needed so that queries become data-representation agnostic. To this end, we define a clinician-readable computational language for querying whether patients meet eligibility criteria (ECs) from clinical trials. This language is capable of implementing the temporal semantics required by many ECs and can be automatically evaluated on heterogeneous data sources.
Methods: By reference to standards and examples of existing ECs, a clinician-readable query language was developed. Using a model-based approach, it was implemented to transform captured ECs into queries that interrogate heterogeneous data warehouses. The query language was evaluated on two types of data sources, each different in structure and content.
Results: The query language abstracts the level of expressivity so that researchers construct their ECs with no prior knowledge of the data sources. It was evaluated on two types of semantically and structurally diverse data warehouses. This query language is now used to express ECs in the EHR4CR project. A survey shows that it was perceived by the majority of users to be useful, easy to understand, and unambiguous.
Discussion: An EC-specific language enables clinical researchers to express their ECs as a query such that the user is isolated from the complexities of heterogeneous clinical data sets. More generally, the approach demonstrates that a domain query language has potential for overcoming the problems of semantic interoperability and is applicable where the nature of the queries is well understood and the data are conceptually similar but differently represented.
Conclusions: Our language provides a strong basis for expressing ECs across different clinical domains by overcoming the heterogeneous nature of electronic clinical data whilst maintaining semantic consistency. It is readily comprehensible by target users. This demonstrates that a domain query language can be both usable and interoperable.
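An eligibility criterion with the kind of temporal semantics described above can be sketched as executable code. The criterion, field names, and thresholds below are invented for illustration; they are not from the EHR4CR language itself.

```python
# Hedged sketch of a temporal eligibility criterion ("age >= 50 AND an
# HbA1c measurement within the last 90 days"); the criterion and field
# names are invented for illustration, not taken from EHR4CR.
from datetime import date, timedelta

def within_last(events, code, days, today):
    """True if an event with `code` occurred within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return any(e["code"] == code and e["date"] >= cutoff for e in events)

patient = {
    "birth_date": date(1958, 4, 2),
    "events": [{"code": "HbA1c", "date": date(2015, 1, 10), "value": 8.1}],
}

def eligible(p, today=date(2015, 3, 1)):
    age = (today - p["birth_date"]).days // 365  # rough age in whole years
    return age >= 50 and within_last(p["events"], "HbA1c", 90, today)

print(eligible(patient))  # True
```

The point of a data-representation-agnostic EC language is that the clinician writes only the criterion; mapping "events" and "HbA1c" onto each warehouse's actual tables and codes is done by the model-based translation layer.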




Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 149
Author(s):  
Petros Zervoudakis ◽  
Haridimos Kondylakis ◽  
Nicolas Spyratos ◽  
Dimitris Plexousakis

HIFUN is a high-level query language for expressing analytic queries over big datasets, offering a clear separation between the conceptual layer, where analytic queries are defined independently of the nature and location of the data, and the physical layer, where queries are evaluated. In this paper, we present a methodology based on the HIFUN language, and the corresponding algorithms, for the incremental evaluation of continuous queries. In essence, our approach is able to process the most recent data batch by exploiting already computed information, without requiring evaluation of the query over the complete dataset. We present the generic algorithm, which we translated to both SQL and MapReduce using SPARK; it implements various query rewriting methods. We demonstrate the effectiveness of our approach in terms of query answering efficiency. Finally, we show that by exploiting the formal query rewriting methods of HIFUN, we can further reduce the computational cost, adding another layer of query optimization to our implementation.
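The core incremental idea, folding each new batch into a running result instead of rescanning all data, can be shown in a few lines. This sketch handles only SUM-style aggregates and invents its own tiny API; HIFUN's rewriting methods are far more general.

```python
# Minimal sketch of incremental (continuous) query evaluation: the running
# result is merged with each new batch's result instead of re-evaluating
# the query over the complete dataset. SUM-style aggregates only;
# the API here is invented for illustration, not HIFUN's.
from collections import defaultdict

def evaluate(batch):
    """Group-by-key SUM over a single batch of (key, value) pairs."""
    out = defaultdict(int)
    for key, value in batch:
        out[key] += value
    return dict(out)

def merge(state, delta):
    """Fold one batch result into the running result."""
    for key, value in delta.items():
        state[key] = state.get(key, 0) + value
    return state

state = {}
for batch in ([("a", 1), ("b", 2)], [("a", 3)]):
    state = merge(state, evaluate(batch))

print(state)  # {'a': 4, 'b': 2}
```

Each batch costs work proportional to its own size, not to the full history, which is what makes continuous evaluation over an unbounded stream feasible.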


Author(s):  
Lichao Xu ◽  
Szu-Yun Lin ◽  
Andrew W. Hlynka ◽  
Hao Lu ◽  
Vineet R. Kamat ◽  
...  

Abstract
There has been a strong need for simulation environments capable of modeling deep interdependencies between complex systems encountered during natural hazards, such as the interactions and coupled effects between civil infrastructure response, human behavior, and social policies, for improved community resilience. Coupling such complex components in an integrated simulation requires continuous data exchange between the different simulators running separate models throughout the simulation. This can be implemented by means of distributed simulation platforms or data-passing tools. To provide a systematic reference for simulation tool choice and to facilitate the development of compatible distributed simulators for studying deep interdependencies in the context of natural hazards, this article focuses on generic tools suitable for integrating simulators from different fields rather than on platforms used mainly within specific fields. With this aim, the article provides a comprehensive review of the most commonly used generic distributed simulation platforms (Distributed Interactive Simulation (DIS), High Level Architecture (HLA), Test and Training Enabling Architecture (TENA), and Distributed Data Services (DDS)) and data-passing tools (Robot Operating System (ROS) and Lightweight Communications and Marshalling (LCM)) and compares their advantages and disadvantages. Three specific limitations of existing platforms are identified from the perspective of natural hazard simulation. To mitigate them, two platform design recommendations are provided, namely message exchange wrappers and hybrid communication, to help improve data-passing capabilities in existing solutions and to guide the design of a new domain-specific distributed simulation framework.
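The "message exchange wrapper" recommendation can be sketched as a common envelope around each simulator's native message. The envelope fields and simulator names below are invented for illustration; the article's recommendation is architectural, not tied to a particular format.

```python
# Hedged sketch of a message exchange wrapper: each simulator's native
# payload is wrapped into one common envelope so heterogeneous simulators
# can exchange data; all field and simulator names here are invented.
import json
import time

def wrap(source, topic, payload):
    """Wrap a simulator-native payload into a common envelope."""
    return json.dumps({
        "source": source,          # which simulator produced the message
        "topic": topic,            # what kind of data it carries
        "timestamp": time.time(),  # for synchronization across simulators
        "payload": payload,        # the native message, unchanged
    })

def unwrap(envelope):
    """Recover the topic and native payload on the receiving side."""
    msg = json.loads(envelope)
    return msg["topic"], msg["payload"]

env = wrap("structural_sim", "building_damage", {"id": 17, "state": "minor"})
topic, payload = unwrap(env)
print(topic, payload["state"])  # building_damage minor
```

Because every simulator only converts to and from the shared envelope, adding a new simulator requires one wrapper rather than a pairwise adapter for every existing component.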


2019 ◽  
Vol 214 ◽  
pp. 05010 ◽  
Author(s):  
Giulio Eulisse ◽  
Piotr Konopka ◽  
Mikolaj Krzewicki ◽  
Matthias Richter ◽  
David Rohr ◽  
...  

ALICE is one of the four major LHC experiments at CERN. When the accelerator enters the Run 3 data-taking period, starting in 2021, ALICE expects almost 100 times more central Pb-Pb collisions than now, resulting in a large increase in data throughput. To cope with this new challenge, the collaboration had to extensively rethink the whole data processing chain, with a tighter integration between the online and offline computing worlds. Such a system, code-named ALICE O2, is being developed in collaboration with the FAIR experiments at GSI. It is based on the ALFA framework, which provides a generalized implementation of the ALICE High Level Trigger approach, designed around distributed software entities coordinating and communicating via message passing. We highlight our efforts to integrate ALFA within the ALICE O2 environment. We analyze the challenges arising from the different running environments for production and development and conclude on the requirements for a flexible and modular software framework. In particular, we present the ALICE O2 Data Processing Layer, which deals with ALICE-specific requirements in terms of the data model. The main goal is to reduce the complexity of developing algorithms and managing a distributed system, thereby significantly simplifying work for the large majority of ALICE users.

