A Data Element-Function Conceptual Model for Data Quality Checks

Author(s):  
James R. Rogers ◽  
Tiffany J. Callahan ◽  
Tian Kang ◽  
Alan Bauck ◽  
Ritu Khare ◽  
...  
Trials ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Sophie Relph ◽  
Maria Elstad ◽  
Bolaji Coker ◽  
Matias C. Vieira ◽  
...  

Abstract Background The use of electronic patient records for assessing outcomes in clinical trials is a methodological strategy intended to drive faster and more cost-efficient acquisition of results. The aim of this manuscript was to outline the data collection and management considerations of a maternity and perinatal clinical trial using data from electronic patient records, exemplifying the DESiGN Trial as a case study. Methods The DESiGN Trial is a cluster randomised controlled trial assessing the effect of a complex intervention versus standard care for identifying small for gestational age foetuses. Data on maternal/perinatal characteristics and outcomes, including infants admitted to neonatal care, parameters from foetal ultrasound and details of hospital activity for health-economic evaluation, were collected at two time points from four types of electronic patient records held in 22 different electronic record systems at the 13 research clusters. Data were pseudonymised on site using a bespoke Microsoft Excel macro and securely transferred to the central data store. Data quality checks were undertaken. Rules for harmonisation of the raw data were developed and a data dictionary was produced, along with rules and assumptions for linkage of the datasets. The dictionary included descriptions of the rationale and assumptions for data harmonisation and quality checks. Results Data were collected on 182,052 babies from 178,350 pregnancies in 165,397 unique women. Data availability and completeness varied across research sites; each of the eight variables key to calculating the primary outcome was completely missing in a median of 3 (range 1–4) clusters at the time of the first data download. This improved by the second data download following clarification of instructions to the research sites (each of the eight key variables was completely missing in a median of 1 (range 0–1) cluster at the second time point).
Common data management challenges included harmonising a single variable from multiple sources and categorising free-text data; solutions to both were developed for this trial. Conclusions Conduct of clinical trials which use electronic patient records for the assessment of outcomes can be time- and cost-effective but still requires appropriate time and resources to maximise data quality. A difficulty for pregnancy and perinatal research in the UK is the wide variety of systems used to collect patient data across maternity units. In this manuscript, we describe how we managed this and provide a detailed data dictionary covering the harmonisation of variable names and values that will be helpful for other researchers working with these data. Trial registration Primary registry and trial identifying number: ISRCTN 67698474. Registered on 02/11/16.
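The on-site pseudonymisation step described above was done with a bespoke Excel macro; as an illustrative sketch only (not the trial's actual method), a comparable approach replaces direct identifiers with a keyed hash before transfer, so the same patient links across datasets without exposing the identifier. The key and field names below are hypothetical.

```python
import hashlib
import hmac

SITE_KEY = b"per-site-secret-key"  # hypothetical key, held only at the research site

def pseudonymise(record, id_fields=("nhs_number", "hospital_number")):
    """Replace identifier fields with a keyed hash; leave clinical fields untouched."""
    out = dict(record)
    for field in id_fields:
        if field in out:
            digest = hmac.new(SITE_KEY, str(out[field]).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]  # deterministic pseudonym for linkage
    return out

rec = {"nhs_number": "9434765919", "gestation_weeks": 36}
pseudo = pseudonymise(rec)
print(pseudo["gestation_weeks"])  # clinical fields pass through unchanged
```

Because the hash is keyed and deterministic, the same identifier always maps to the same pseudonym within a site, which supports linkage of the four record types without reversibility.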


Hydrology ◽  
2021 ◽  
Vol 8 (1) ◽  
pp. 33
Author(s):  
Yiannis Panagopoulos ◽  
Anna Konstantinidou ◽  
Konstantinos Lazogiannis ◽  
Anastasios Papadopoulos ◽  
Elias Dimitriou

The monitoring of surface waters is of fundamental importance for their preservation in good quantitative and qualitative condition, as it can facilitate understanding of the actual status of the water and indicate suitable management actions. Taking advantage of the experience gained from coordinating the national water monitoring program in Greece and of the funding available from two ongoing infrastructure projects, the Institute of Inland Waters of the Hellenic Centre for Marine Research has developed the first homogeneous real-time network of automatic water monitoring across many Greek rivers. In this paper, its installation and maintenance procedures are presented, with emphasis on the data quality checks, based on value-range and variability tests, performed before online publication and dissemination to end-users. Preliminary analyses revealed that the water pH and dissolved oxygen (DO) sensors require more maintenance, and their data more quality checks, than the more reliably recorded water stage, temperature (T) and electrical conductivity (EC). Moreover, the data dissemination platform and selected data visualization options are demonstrated, and the need for both the platform and the monitoring network to be maintained and potentially expanded after the funding projects end is highlighted.
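The value-range and variability tests mentioned above can be sketched as follows. This is an illustrative example, not the network's actual implementation; the ranges and step thresholds are assumptions chosen only to show the two check types.

```python
# Hypothetical plausible ranges and maximum allowed step changes per parameter.
RANGES = {"pH": (0.0, 14.0), "DO": (0.0, 20.0), "T": (-5.0, 40.0), "EC": (0.0, 5000.0)}
MAX_STEP = {"pH": 1.0, "DO": 3.0, "T": 5.0, "EC": 500.0}

def qc_flags(previous, current):
    """Flag each parameter: 'out_of_range' (range test), 'high_variability'
    (change since previous reading exceeds threshold), or 'ok'."""
    flags = {}
    for param, value in current.items():
        lo, hi = RANGES[param]
        if not lo <= value <= hi:
            flags[param] = "out_of_range"
        elif previous is not None and abs(value - previous[param]) > MAX_STEP[param]:
            flags[param] = "high_variability"
        else:
            flags[param] = "ok"
    return flags

prev = {"pH": 7.2, "DO": 8.5, "T": 18.0, "EC": 450.0}
curr = {"pH": 9.9, "DO": 8.4, "T": 24.5, "EC": 460.0}
print(qc_flags(prev, curr))  # pH and T jump too fast between readings
```

Readings failing either test would be withheld or flagged before online publication, which matches the paper's observation that pH and DO sensors generate the most quality-check work.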


2018 ◽  
Vol 60 (1) ◽  
pp. 32-49 ◽  
Author(s):  
Mingnan Liu ◽  
Laura Wronski

This study examines the use of trap questions as indicators of data quality in online surveys. Trap questions are intended to identify respondents who are not paying close attention to survey questions, meaning they are providing sub-optimal responses not only to the trap question itself but to other questions in the survey. We conducted three experiments using an online non-probability panel. In the first experiment, we examine whether there is any difference in responses between surveys with one trap question and those with two trap questions. In the second, we examine responses to surveys with trap questions of varying difficulty. In the third, we test the level of difficulty, the placement of the trap question, and other forms of attention checks. In all studies, we correlate responses to the trap question(s) with other data quality checks, most of which were derived from the literature on satisficing. We also compare responses to several substantive questions by response to the trap questions, which tells us whether participants who failed the trap questions gave consistently different answers from those who passed. We find that the rate of passing various trap questions varies widely, from 27% to 87% among the types we tested. We also find evidence that some types of trap questions are more strongly correlated with other data quality measures.
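The study's pairing of trap-question outcomes with satisficing-based quality checks can be sketched minimally. The functions below are illustrative assumptions, not the study's instruments: one scores a trap question pass/fail, the other computes one common satisficing indicator, straightlining (identical answers across every item in a grid).

```python
def passed_trap(answer, expected="somewhat agree"):
    """A trap question directs the respondent to choose one specific option."""
    return answer == expected

def straightlined(grid_answers):
    """True when every item in a grid received the identical answer."""
    return len(set(grid_answers)) == 1

respondent = {"trap": "strongly agree", "grid": [3, 3, 3, 3, 3]}
print(passed_trap(respondent["trap"]), straightlined(respondent["grid"]))
```

Tabulating these two flags across respondents is the kind of cross-check the abstract describes: a trap question that usefully signals inattention should co-occur with indicators such as straightlining.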


2021 ◽  
Author(s):  
Clair Blacketer ◽  
Frank J Defalco ◽  
Patrick B Ryan ◽  
Peter R Rijnbeek

Advances in standardization of observational healthcare data have enabled methodological breakthroughs, rapid global collaboration, and generation of real-world evidence to improve patient outcomes. Standardizations in data structure, such as use of Common Data Models (CDM), need to be coupled with standardized approaches for data quality assessment. To ensure confidence in real-world evidence generated from the analysis of real-world data, one must first have confidence in the data itself. The Data Quality Dashboard is an open-source R package that reports potential quality issues in an OMOP CDM instance through the systematic execution and summarization of over 3,300 configurable data quality checks. We describe the implementation of check types across a data quality framework of conformance, completeness, and plausibility, with both verification and validation. We illustrate how data quality checks, paired with decision thresholds, can be configured to customize data quality reporting across a range of observational health data sources. We discuss how data quality reporting can become part of the overall real-world evidence generation and dissemination process to promote transparency and build confidence in the resulting output. Transparently communicating how well CDM standardized databases adhere to a set of quality measures adds a crucial piece that is currently missing from observational research. Assessing and improving the quality of our data will inherently improve the quality of the evidence we generate.
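The pairing of checks with decision thresholds described above can be sketched in the abstract. This is a hedged illustration of the idea only: the real Data Quality Dashboard is an R package with its own check definitions, and the check name and numbers below are hypothetical.

```python
def run_check(check_name, num_violated, num_denominator, threshold_pct):
    """A check fails when the percentage of violating rows exceeds its
    configured threshold; otherwise it passes."""
    pct = 100.0 * num_violated / num_denominator if num_denominator else 0.0
    return {
        "check": check_name,
        "pct_violated": round(pct, 2),
        "status": "FAIL" if pct > threshold_pct else "PASS",
    }

# e.g. a plausibility-style check: 12 of 10,000 rows violate, under a 1% threshold
result = run_check("plausibleBirthDate", num_violated=12,
                   num_denominator=10000, threshold_pct=1.0)
print(result)  # 0.12% violating -> PASS
```

Because the threshold is part of the configuration rather than the check itself, the same check suite can be tuned per data source, which is the customization the abstract describes.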


2019 ◽  
Vol 1 ◽  
pp. ed1
Author(s):  
Shaun Yon-Seng Khoo

Almost every open access neuroscience journal is pay-to-publish. This leaves neuroscientists with a choice of submitting to journals that not all of our colleagues can legitimately access and choosing to pay large sums of money to publish open access. Neuroanatomy and Behaviour is a new platinum open access journal published by a non-profit association of scientists. Since we do not charge fees, we will focus entirely on the quality of submitted articles and encourage the adoption of reproducibility-enhancing practices, like open data, preregistration, and data quality checks. We hope that our colleagues will join us in this endeavour so that we can support good neuroscience no matter where it comes from.


2020 ◽  
Vol 14 (1) ◽  
pp. 1-30 ◽  
Author(s):  
Arie Purwanto ◽  
Anneke Zuiderwijk ◽  
Marijn Janssen

Purpose Citizen engagement is key to the success of many Open Government Data (OGD) initiatives. However, not much is known regarding how this type of engagement emerges. This study aims to investigate the necessary conditions for the emergence of citizen-led engagement with OGD and to identify which factors stimulate this type of engagement. Design/methodology/approach First, the authors created a systematic overview of the literature to develop a conceptual model of conditions and factors of OGD citizen engagement at the societal, organizational and individual level. Second, the authors used the conceptual model to systematically study citizens’ engagement in the case of a particular OGD initiative, namely, the digitization of presidential election results data in Indonesia in 2014. The authors used multiple information sources, including interviews and documents, to explore the conditions and factors of OGD citizen-led engagement in this case. Findings From the literature the authors identified five conditions for the emergence of OGD citizen-led engagement as follows: the availability of a legal and political framework that grants a mandate to open up government data, sufficient budgetary resources allocated for OGD provision, the availability of OGD feedback mechanisms, citizens’ perceived ease of engagement and motivated citizens. In the literature, the authors found six factors contributing to OGD engagement as follows: democratic culture, the availability of supporting institutional arrangements, the technical factors of OGD provision, the availability of citizens’ resources, the influence of social relationships and citizens’ perceived data quality. Some of these conditions and factors were found to be less important in the studied case, namely, citizens’ perceived ease of engagement and citizens’ perceived data quality. 
Moreover, the authors found several new conditions that were not mentioned in the studied literature, namely, citizens’ sense of urgency, competition among citizen-led OGD engagement initiatives, the diversity of citizens’ skills and capabilities and the intensive use of social media. The difference between the conditions and factors that played an important role in the case and those derived from the literature review might be because of the type of OGD engagement that the authors studied, namely, citizen-led engagement, without any government involvement. Research limitations/implications The findings are derived using a single case study approach. Future research can investigate multiple cases and compare the conditions and factors for citizen-led engagement with OGD in different contexts. Practical implications The conditions and factors for citizen-led engagement with OGD have been evaluated in practice and discussed with public managers and practitioners through interviews. Governmental organizations should prioritize and stimulate those conditions and factors that enhance OGD citizen engagement to create more value with OGD. Originality/value While some research on government-led engagement with OGD exists, there is hardly any research on citizen-led engagement with OGD. This study is the first to develop a conceptual model of necessary conditions and factors for citizen engagement with OGD. Furthermore, the authors applied the developed multilevel conceptual model to a case study and gathered empirical evidence of OGD engagement and its contributions to solving societal problems, rather than staying at the conceptual level. This research can be used to investigate citizen engagement with OGD in other cases and offers possibilities for systematic cross-case lesson-drawing.


Informatics ◽  
2019 ◽  
Vol 6 (1) ◽  
pp. 10 ◽  
Author(s):  
Otmane Azeroual ◽  
Gunter Saake ◽  
Mohammad Abuosba

The topic of data integration from external data sources or independent IT systems has recently received increasing attention in IT departments as well as at management level, in particular concerning data integration in federated database systems. An example of the latter are commercial research information systems (RIS), which regularly import, cleanse, transform and prepare research information from a variety of institutional databases for analysis. All of these steps must also deliver an assured level of quality. As several internal and external data sources are loaded into the RIS, ensuring information quality is becoming increasingly challenging for research institutions. Before research information is transferred to a RIS, it must be checked and cleaned up. Data quality is therefore always a decisive factor for successful data integration. The removal of data errors (such as duplicates, inconsistent data and outdated data) and the harmonisation of the data structure are essential tasks of data integration using extract, transform, and load (ETL) processes: data are extracted from the source systems, transformed and loaded into the RIS. At this point, conflicts between different data sources are detected and resolved, and data quality issues arising during integration are eliminated. Against this background, our paper presents the process of data transformation in the context of RIS, which gives an overview of the quality of research information in an institution's internal and external data sources during its integration into the RIS. In addition, we address the question of how to control and improve data quality issues during the integration process.
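The transform step described above (harmonising structure, removing duplicates) can be sketched minimally. This is a hypothetical example, not a real RIS schema: the field names and mapping are assumptions used only to show the two operations.

```python
# Hypothetical mapping of source-system field names onto one target schema.
FIELD_MAP = {"Titel": "title", "Title": "title", "AuthorName": "author", "Author": "author"}

def transform(record):
    """Harmonise field names and normalise whitespace in values."""
    out = {}
    for key, value in record.items():
        out[FIELD_MAP.get(key, key.lower())] = " ".join(str(value).split())
    return out

def deduplicate(records):
    """Drop records that are exact duplicates after transformation."""
    seen, result = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result

# Two source systems deliver the same item with different field names/spacing.
raw = [{"Titel": "Data  Quality"}, {"Title": "Data Quality"}]
clean = deduplicate([transform(r) for r in raw])
print(clean)  # the two source rows collapse into one harmonised record
```

In a full ETL pipeline, this transform sits between extraction from the source systems and loading into the RIS, which is where the abstract locates conflict resolution and quality cleanup.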


2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Girum S. Ejigu ◽  
Lakshmi Radhakrishnan ◽  
Paul McMurray ◽  
Roseanne English

Objective: Review the impact of applying regular data quality checks to assess completeness of core data elements that support syndromic surveillance.

Introduction: The National Syndromic Surveillance Program (NSSP) is a community-focused collaboration among federal, state, and local public health agencies and partners for timely exchange of syndromic data. These data, captured in nearly real time, are intended to improve the nation's situational awareness and responsiveness to hazardous events and disease outbreaks. During CDC's previous implementation of a syndromic surveillance system (BioSense 2), there was a reported lack of transparency and sharing of information on the data processing applied to data feeds, encumbering the identification and resolution of data quality issues. The BioSense Governance Group Data Quality Workgroup paved the way to rethink surveillance data flow and quality. Their work and collaboration with state and local partners led to NSSP redesigning the program's data flow. The new data flow provided a ripe opportunity for NSSP analysts to study the data landscape (e.g., capturing of HL7 messages and core data elements), assess end-to-end data flow, and make adjustments to ensure all data being reported were processed, stored, and made accessible to the user community. In addition, NSSP extensively documented the new data flow, providing the transparency the community needed to better understand the disposition of facility data. Even with a new and improved data flow, data quality issues that had gone unreported in the past remained in the new data; however, these issues were now identified, and the redesigned data flow provided opportunities to report and act on them, unlike previous versions. Therefore, an important component of the NSSP data flow was the implementation of regularly scheduled standard data quality checks and the release of standard data quality reports summarizing findings.

Methods: NSSP data were assessed for national-level completeness of chief complaint and discharge diagnosis data. Completeness is the rate of non-null values (Batini et al., 2009). It was defined as the percent of visits (e.g., emergency department, urgent care center) with a non-null value found among the one or more records associated with the visit. National completeness rates for visits in 2016 were compared with completeness rates for visits in 2017 (a partial year including visits through August 2017). In addition, facility-level progress was quantified after scoring each facility based on the percent completeness change between 2016 and 2017. Legacy data processed prior to introducing the new NSSP data flow were not included in this assessment.

Results: Nationally, the percent completeness of chief complaint for visits in 2016 was 82.06% (N=58,192,721), and for visits in 2017 it was 87.15% (N=80,603,991). Of the 2,646 facilities that sent visit data in 2016 and 2017, 114 (4.31%) showed an increase of at least 10% in chief complaint completeness in 2017 compared with 2016. For discharge diagnosis, percent completeness for 2016 visits was 50.83% (N=36,048,334) and for 2017 visits it was 59.23% (N=54,776,310). Of the same 2,646 facilities, 306 (11.56%) showed more than a 10% increase in percent completeness of discharge diagnosis in 2017 compared with 2016.

References: Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3), 1–52.
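The completeness measure defined in the Methods (percent of visits with a non-null value among the one or more records associated with the visit) can be expressed directly. The data structure below is illustrative, not the NSSP schema.

```python
def completeness(visits, field):
    """Percent of visits where at least one associated record has a
    non-null, non-empty value for `field`."""
    if not visits:
        return 0.0
    complete = sum(
        1
        for records in visits.values()
        if any(rec.get(field) not in (None, "") for rec in records)
    )
    return 100.0 * complete / len(visits)

# Visit v1 has two records, one of which carries a chief complaint -> complete.
# Visit v2 has only an empty value -> incomplete.
visits = {
    "v1": [{"chief_complaint": None}, {"chief_complaint": "chest pain"}],
    "v2": [{"chief_complaint": ""}],
}
print(completeness(visits, "chief_complaint"))  # 50.0
```

Counting at the visit level rather than the record level matters here: a visit with several HL7 messages is complete as long as any one of them carries the value.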


2000 ◽  
Vol 1719 (1) ◽  
pp. 140-146 ◽  
Author(s):  
Cesar Quiroga ◽  
Russell Henk ◽  
Marc Jacobson

Described are the results of a pilot application intended to automate the data collection and data reduction phases of roadside origin-destination (O-D) studies. Most techniques used to obtain O-D data are quite labor intensive, during both the data collection and the data reduction phases. Frequently, they result in extensive data quality checks and long turnaround periods between the data collection work and the submittal of the corresponding survey report. The application described automates the data collection and data reduction phases by using portable, handheld data collection devices. These devices can be connected to a desktop or laptop computer to transfer the O-D data to a repository database. Included are a brief background discussion, a description of the hardware and software used and the design and development of the O-D applications, a description of two applications of the handheld data collection devices, and a list of lessons learned.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Harsh Vivek Harkare ◽  
Daniel J. Corsi ◽  
Rockli Kim ◽  
Sebastian Vollmer ◽  
S. V. Subramanian

Abstract The importance of data quality for correctly determining prevalence estimates of child anthropometric failures has been a contentious issue among policymakers and researchers. Our research objective was to ascertain the impact of improved DHS data quality on the prevalence estimates of stunting, wasting, and underweight; the study also looks for the drivers of data quality. Using five data quality indicators based on age, sex, anthropometric measurements, and normality of distribution, we arrive at two datasets of differential data quality and their estimates of anthropometric failures. For this purpose, we use the 2005–2006 and 2015–2016 NFHS data covering 311,182 observations from India. The prevalence estimates of stunting and underweight were virtually unchanged after the application of quality checks. The estimate of wasting fell by 2 percentage points, indicating an overestimation of the true prevalence. However, this differential impact on the estimate of wasting was driven by the sensitivity of the flagging procedure and was in accordance with empirical evidence from the existing literature. We found the DHS data to be of sufficiently high quality that the prevalence estimates of stunting and underweight did not change significantly after further improving data quality. The differential estimate of wasting is attributable to the sensitivity of the flagging procedure.
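The flagging procedure whose sensitivity drives the wasting result works by excluding biologically implausible z-scores before computing prevalence. The sketch below is a hedged illustration of that step, using the commonly cited WHO fixed flag limits; the actual NFHS/DHS pipeline applies additional indicators.

```python
# Commonly used WHO fixed flag limits for z-scores:
# height-for-age (haz, stunting), weight-for-height (whz, wasting),
# weight-for-age (waz, underweight).
FLAG_LIMITS = {"haz": (-6, 6), "whz": (-5, 5), "waz": (-6, 5)}

def prevalence_below(zscores, indicator, cutoff=-2.0):
    """Drop flagged (implausible) z-scores, then return the percent of the
    remaining observations below the cutoff (-2 SD defines failure)."""
    lo, hi = FLAG_LIMITS[indicator]
    valid = [z for z in zscores if lo <= z <= hi]
    if not valid:
        return 0.0
    return 100.0 * sum(1 for z in valid if z < cutoff) / len(valid)

whz = [-5.8, -2.4, -1.0, 0.3, -2.1]  # -5.8 is flagged as implausible
print(prevalence_below(whz, "whz"))  # 2 of the 4 valid values fall below -2
```

Tightening or loosening the flag limits changes which extreme values are excluded, which is exactly how the flagging procedure's sensitivity can move a prevalence estimate such as wasting.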

