An R Script for Assessment of Data Quality in the BioSense Locker Database

2016 ◽  
Vol 8 (1) ◽  
Author(s):  
Serena Rezny ◽  
Stacey Hoferka

Syndromic surveillance requires reliable, accurate, and complete healthcare encounter data. To address the need for quality assessment of emergency department (ED) data, we developed an R script to assess and report on data quality in the BioSense locker database. The script examines identifying variables in the HL7 messages from the locker, aggregates messages into ED visits based on these identifiers, processes the aggregated data to calculate metadata for each visit, and computes various data quality metrics. Facility-level reports are written to HTML files, which can then be shared with hospitals and vendors to support ongoing data quality improvements.
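The aggregation and metric steps described above can be sketched as follows. This is a simplified Python illustration (the original tool is an R script); the record fields and identifier scheme are assumptions for demonstration, not the actual BioSense locker schema.

```python
from collections import defaultdict

# Hypothetical, simplified message records; real HL7 fields differ.
messages = [
    {"facility": "A", "visit_id": "V1", "patient_id": "P1", "chief_complaint": "chest pain"},
    {"facility": "A", "visit_id": "V1", "patient_id": "P1", "chief_complaint": None},
    {"facility": "A", "visit_id": "V2", "patient_id": "P2", "chief_complaint": None},
]

def aggregate_visits(messages):
    """Group messages into visits keyed on identifying variables."""
    visits = defaultdict(list)
    for msg in messages:
        key = (msg["facility"], msg["visit_id"], msg["patient_id"])
        visits[key].append(msg)
    return visits

def chief_complaint_completeness(visits):
    """Share of visits with a chief complaint in at least one message."""
    complete = sum(
        1 for msgs in visits.values()
        if any(m["chief_complaint"] for m in msgs)
    )
    return complete / len(visits)

visits = aggregate_visits(messages)
print(len(visits))                           # 2 visits
print(chief_complaint_completeness(visits))  # 0.5
```

A per-facility version of the same metric is what would feed the HTML reports shared with hospitals and vendors.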

2008 ◽  
pp. 3067-3084
Author(s):  
John Talburt ◽  
Richard Wang ◽  
Kimberly Hess ◽  
Emily Kuo

This chapter introduces abstract algebra as a means of understanding and creating data quality metrics for entity resolution, the process in which records determined to represent the same real-world entity are successively located and merged. Entity resolution is a particular form of data mining that is foundational to a number of applications in both industry and government. Examples include commercial customer recognition systems and information sharing on “persons of interest” across federal intelligence agencies. Despite the importance of these applications, most of the data quality literature focuses on measuring the intrinsic quality of individual records rather than the quality of record grouping or integration. In this chapter, the authors describe current research into the creation and validation of quality metrics for entity resolution, primarily in the context of customer recognition systems. The approach is based on an algebraic view of the system as creating a partition of a set of entity records based on the indicative information for the entities in question. In this view, the relative quality of entity identification between two systems can be measured in terms of the similarity between the partitions they produce. The authors discuss the difficulty of applying statistical cluster analysis to this problem when the datasets are large and propose an alternative index suitable for these situations. They also report some preliminary experimental results, and outline areas and approaches to further research in this area.
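The partition-similarity view can be made concrete with a short sketch. The code below contrasts the classic Rand index (a pairwise cluster-comparison statistic, quadratic in the number of records, which is the scaling problem the chapter raises) with an overlap-counting index of the kind the authors propose, which needs only one pass over the records. This is an illustration, not the chapter's exact formulation; the specific functions and labels are assumptions.

```python
import math
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pairwise agreement between two partitions of the same records.
    Quadratic in the number of records, so it scales poorly."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def overlap_index(labels_a, labels_b):
    """sqrt(|A|*|B|) / |V|, where |V| counts the distinct non-empty
    intersections between classes of the two partitions. Equals 1 when
    the partitions are identical; linear in the number of records."""
    classes_a = set(labels_a)
    classes_b = set(labels_b)
    overlaps = set(zip(labels_a, labels_b))
    return math.sqrt(len(classes_a) * len(classes_b)) / len(overlaps)

# Two ER systems grouping five records: system B splits one of A's clusters.
a = ["x", "x", "x", "y", "y"]
b = ["p", "p", "q", "r", "r"]
print(rand_index(a, b))     # 0.8
print(overlap_index(a, b))  # ~0.816
```

Because the overlap index depends only on class counts and intersections, it stays tractable on the large customer datasets the chapter targets.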


2018 ◽  
Vol 10 (1) ◽  
pp. 1-26 ◽  
Author(s):  
Christian Bors ◽  
Theresia Gschwandtner ◽  
Simone Kriglstein ◽  
Silvia Miksch ◽  
Margit Pohl

2011 ◽  
Vol 11 (2) ◽  
pp. 1412-1419 ◽  
Author(s):  
Christopher R. Kinsinger ◽  
James Apffel ◽  
Mark Baker ◽  
Xiaopeng Bian ◽  
Christoph H. Borchers ◽  
...  

2017 ◽  
Vol 20 (2) ◽  
Author(s):  
Flavia Serra ◽  
Adriana Marotta

The fact that Data Quality (DQ) depends on the context in which data are produced, stored, and used is widely recognized in the research community. Data Warehouse Systems (DWS), whose main goal is to support decision making based on data, have grown considerably in recent years in both research and industry, and DQ in these systems is essential. This work presents a proposal for identifying DQ problems in the domain of DWS, considering the different contexts that exist in each system component. The proposal may serve as a first conceptual framework to guide those responsible for DQ in managing DQ in DWS. The main contributions of this work are a thorough literature review of how contexts are used for evaluating DQ in DWS, and a proposal for assessing DQ in DWS through context-based DQ metrics.
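The idea of a context-based DQ metric can be illustrated with a minimal sketch: the same records score differently depending on which attributes the usage context requires. The contexts, field names, and records below are hypothetical, chosen only to show the mechanism, not taken from the paper.

```python
# Hypothetical contexts: which attributes count as "required"
# depends on how the warehouse data will be used.
contexts = {
    "sales_analysis": ["customer_id", "amount", "date"],
    "shipping":       ["customer_id", "address", "date"],
}

def completeness(records, context):
    """Fraction of records with all context-required attributes present."""
    required = contexts[context]
    ok = sum(
        1 for r in records
        if all(r.get(field) not in (None, "") for field in required)
    )
    return ok / len(records)

records = [
    {"customer_id": 1, "amount": 9.5, "date": "2017-01-01", "address": ""},
    {"customer_id": 2, "amount": None, "date": "2017-01-02", "address": "Main St"},
]
print(completeness(records, "sales_analysis"))  # 0.5
print(completeness(records, "shipping"))        # 0.5
```

Each record satisfies one context but not the other, so a single context-free completeness score would hide which usage is actually affected.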


2011 ◽  
Vol 10 (12) ◽  
pp. O111.015446 ◽  
Author(s):  
Christopher R. Kinsinger ◽  
James Apffel ◽  
Mark Baker ◽  
Xiaopeng Bian ◽  
Christoph H. Borchers ◽  
...  

2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Peter J Rock ◽  
Michael D Singleton

Objective: The aim of this project was to develop a nimble system to both monitor and report on the quality of Kentucky emergency department syndromic surveillance (SyS) data at system-wide and facility levels.

Introduction: In 2016, the CDC funded 12 states, under the Enhanced State Opioid Overdose Surveillance (ESOOS) program, to utilize SyS to increase the timeliness of state data on drug overdose events. To operationalize the objectives of the grant, there was a need to assess and monitor the quality of Kentucky’s SyS data with limited resources. We leveraged the NSSP’s R Studio Server to automate quality assurance (QA) monitoring and reporting to meet these objectives.

Methods: Using the R Server, we pulled data from the processed-messages table, aggregating messages to single patient encounters. In addition to compiling the code on a powerful remote server, the server can access the processed-messages table relatively quickly. We developed an R Markdown document that produces a report with a variety of system- and facility-level metrics highlighting key indicators of system performance and data flows. By using R, we were able to create an auto-generating QA report that runs weekly and is e-mailed for analyst review. Quality metrics included: % completeness of chief complaint and discharge diagnosis codes (overall and by facility) [Fig 1 & Fig 2]; visit trend by day of visit (with interactive sparklines) [Fig 2]; maximum date of message creation, date the message arrived at the NSSP server, date of visit, and total messages [Fig 3]; message arrival trend (interactive sparklines) [Fig 3]; volume and type of error messages failing to process [Fig 4]; message volume by ADT type [Fig 5]; and volume of patient class by type by day [not shown]. Our SyS analyst reviews the report and delivers it to stakeholders with general comments about ongoing and newly emerging data quality concerns.

Results: The report has proven beneficial for ongoing QA monitoring. It is shared weekly with key stakeholders at the Kentucky Department for Public Health, the Kentucky Health Information Exchange, NSSP, and regional ESSENCE users, and findings are reviewed at monthly SyS stakeholder meetings. The report has identified numerous errors, dead feeds, and other system changes in near real time, leading to corrective action and general data quality enhancement. Weekly QA monitoring has improved data feed stability and the communication of identified issues with key stakeholders.

Conclusions: The R Studio Server provides a nimble platform to develop, refine, and automate a QA reporting system that can lead to improved SyS data quality. In Kentucky, in addition to improving overall data quality, these weekly reports and the subsequent communication have helped build relationships among key stakeholders and elevated the importance of syndromic surveillance data locally. Continual monitoring of data is critical to ensure quality and therefore the validity of the data.
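The facility-level completeness metrics at the heart of the weekly report can be sketched as below. This is a simplified Python illustration of the logic (the actual report is built in R Markdown on the NSSP R Studio Server); the encounter fields and facility names are assumptions, not the NSSP processed-message schema.

```python
from collections import defaultdict

# Hypothetical encounters after aggregating messages to patient visits.
encounters = [
    {"facility": "Hosp A", "chief_complaint": "fever", "dx_code": "R50.9"},
    {"facility": "Hosp A", "chief_complaint": None,    "dx_code": "J10.1"},
    {"facility": "Hosp B", "chief_complaint": "fall",  "dx_code": None},
]

def facility_completeness(encounters, field):
    """Percent of encounters per facility with a non-missing value in `field`."""
    totals, filled = defaultdict(int), defaultdict(int)
    for e in encounters:
        totals[e["facility"]] += 1
        if e[field]:
            filled[e["facility"]] += 1
    return {fac: 100.0 * filled[fac] / totals[fac] for fac in totals}

print(facility_completeness(encounters, "chief_complaint"))
# {'Hosp A': 50.0, 'Hosp B': 100.0}
print(facility_completeness(encounters, "dx_code"))
# {'Hosp A': 100.0, 'Hosp B': 0.0}
```

Running the same computation on a schedule and diffing week over week is what surfaces dead feeds and sudden completeness drops for analyst review.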


2018 ◽  
Vol 9 (2) ◽  
pp. 1-32 ◽  
Author(s):  
Bernd Heinrich ◽  
Diana Hristova ◽  
Mathias Klier ◽  
Alexander Schiller ◽  
Michael Szubartowicz
