Federated queries of clinical data repositories: balancing accuracy and privacy

2019 ◽  
Author(s):  
Yun William Yu ◽  
Griffin M Weber

Abstract Researchers use large federated clinical data networks that connect dozens of healthcare organizations to access data on millions of patients. However, because patients often receive care from multiple sites in the network, queries frequently double-count patients. Using the probabilistic streaming algorithm HyperLogLog and adding obfuscation, we developed a scalable method for estimating the number of distinct lives that match a query, which balances accuracy and privacy in a “tunable” way.

10.2196/18735 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e18735
Author(s):  
Yun William Yu ◽  
Griffin M Weber

Background Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks. Objective This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed. Methods We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques. Results In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches, which guaranteed the exact answer. Conclusions Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.
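
The double-counting problem and the HyperLogLog idea named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the register count (`P = 12`), the use of SHA-256 as the hash, and the function names `build_sketch`, `merge`, and `estimate` are assumptions made for illustration, and the masking and homomorphic encryption steps mentioned in the abstract are omitted entirely. Each site builds a small array of registers from its hashed patient identifiers; the hub takes the element-wise maximum across sites, so a patient seen at several sites contributes only once.

```python
import hashlib
import math

P = 12          # assumed number of index bits, giving 2**12 registers
M = 1 << P      # number of registers per sketch


def _hash64(patient_id: str) -> int:
    """Map an identifier to a 64-bit integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_sketch(patient_ids):
    """Build one site's HyperLogLog registers from its matching patients."""
    registers = [0] * M
    for pid in patient_ids:
        h = _hash64(pid)
        idx = h >> (64 - P)                  # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)     # remaining 64 - P bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - P) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    return registers


def merge(sketches):
    """Union of sketches: element-wise maximum, so duplicates collapse."""
    return [max(values) for values in zip(*sketches)]


def estimate(registers):
    """Standard HyperLogLog estimate with the small-range correction."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)       # linear counting for small counts
    return raw


if __name__ == "__main__":
    # Two hypothetical sites that share 2,000 of their patients.
    site_a = [f"patient-{i}" for i in range(6000)]
    site_b = [f"patient-{i}" for i in range(4000, 10000)]
    print("naive sum of site counts:", len(site_a) + len(site_b))
    merged = merge([build_sketch(site_a), build_sketch(site_b)])
    print("estimated distinct patients:", round(estimate(merged)))
```

In this toy setup the naive sum counts the shared patients twice, while the merged sketch estimates the number of distinct patients; the expected relative error of plain HyperLogLog is roughly 1.04 divided by the square root of the register count, so more registers trade extra data sent for accuracy.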


2020 ◽  
Author(s):  
Yun William Yu ◽  
Griffin M Weber

BACKGROUND Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks. OBJECTIVE This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed. METHODS We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques. RESULTS In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches, which guaranteed the exact answer. CONCLUSIONS Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.


Author(s):  
Tak-Ming Chan ◽  
Yuxi Li ◽  
Choo-Chiap Chiau ◽  
Jane Zhu ◽  
Jie Jiang ◽  
...  

2008 ◽  
Vol 47 (06) ◽  
pp. 549-559 ◽  
Author(s):  
K. Ohe ◽  
Y. Kawazoe

Summary Objective: We have been developing a decision support system that uses electronic clinical data and provides alerts to clinicians. However, the inference rules for such a system are difficult to write in terms of representing domain concepts and temporal reasoning. To address this problem, we have developed an ontology-based mediator of clinical information for the decision support system. Methods: Our approach consists of three steps: 1) development of an ontology-based mediator that represents domain concepts and temporal information; 2) mapping of clinical data to corresponding concepts in the mediator; 3) temporal abstraction that creates high-level, interval-based concepts from time-stamped clinical data. As a result, we can write concept-based rule expressions that use domain concepts and interval-based temporal information. The proposed approach was applied to a prototype clinical alert system, and the rules for adverse drug events were executed on data gathered over a 3-month period. Results: The system generated 615 alerts, of which 346 (56%) were considered appropriate and 269 (44%) inappropriate. Of the false alerts, 192 were due to data inaccuracy and 77 were due to insufficiency of the temporal abstraction. Conclusion: Our approach made it possible to write concept-based rule expressions for the prototype clinical alert system. We believe our approach will help narrow the gap in information models between domain concepts and clinical data repositories.
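
The third step in the abstract above, temporal abstraction, can be sketched in Python under stated assumptions. The abstract does not describe the authors' algorithm, so everything here is hypothetical: the `Interval` type, the `abstract_intervals` function, the `classify` callback, and the `max_gap` rule for deciding when two observations belong to the same interval. The sketch only shows the general idea of turning time-stamped values into interval-based states that a rule could then refer to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Tuple


@dataclass
class Interval:
    """A hypothetical interval-based concept, e.g. 'high' from day 2 to day 3."""
    state: str
    start: datetime
    end: datetime


def abstract_intervals(
    observations: Iterable[Tuple[datetime, float]],
    classify: Callable[[float], str],
    max_gap: timedelta,
) -> List[Interval]:
    """Merge consecutive same-state observations into intervals.

    Two observations join the same interval when they map to the same state
    and are no more than `max_gap` apart (an assumed merging rule).
    """
    intervals: List[Interval] = []
    for ts, value in sorted(observations, key=lambda obs: obs[0]):
        state = classify(value)
        last = intervals[-1] if intervals else None
        if last and last.state == state and ts - last.end <= max_gap:
            last.end = ts                     # extend the current interval
        else:
            intervals.append(Interval(state, ts, ts))
    return intervals


if __name__ == "__main__":
    # Hypothetical lab values with an assumed threshold of 2.0 for "high".
    readings = [
        (datetime(2008, 1, 1), 1.0),
        (datetime(2008, 1, 2), 2.5),
        (datetime(2008, 1, 3), 2.8),
        (datetime(2008, 1, 10), 2.6),
    ]
    label = lambda v: "high" if v > 2.0 else "normal"
    for interval in abstract_intervals(readings, label, timedelta(days=2)):
        print(interval.state, interval.start.date(), interval.end.date())
```

A rule written against such intervals can ask whether a state held over a period rather than inspecting individual time-stamped values, which is the kind of concept-based expression the abstract describes.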


2015 ◽  
Vol 22 (3) ◽  
pp. 600-607 ◽  
Author(s):  
Bonnie L Westra ◽  
Gail E Latimer ◽  
Susan A Matney ◽  
Jung In Park ◽  
Joyce Sensmeier ◽  
...  

Abstract Background There is wide recognition that, with the rapid implementation of electronic health records (EHRs), large data sets are available for research. However, essential standardized nursing data are seldom integrated into EHRs and clinical data repositories. Many diverse activities exist to implement standardized nursing languages in EHRs; however, these activities are not coordinated, resulting in duplicate efforts rather than building a shared learning environment and resources. Objective The purpose of this paper is to describe the historical context of nursing terminologies, challenges to the use of nursing data for purposes other than documentation of care, and a national action plan for implementing and using sharable and comparable nursing data for quality reporting and translational research. Methods In 2013 and 2014, the University of Minnesota School of Nursing hosted a diverse group of nurses to participate in the Nursing Knowledge: Big Data and Science to Transform Health Care consensus conferences. These conferences were held to develop a national action plan and harmonize existing and new efforts of multiple individuals and organizations to expedite integration of standardized nursing data within EHRs and ensure their availability in clinical data repositories for secondary use. This harmonization will address the implementation of standardized nursing terminologies and subsequent access to and use of clinical nursing data. Conclusion Foundational to integrating nursing data into clinical data repositories for big data and science is the implementation of standardized nursing terminologies, common data models, and information structures within EHRs. The 2014 National Action Plan for Sharable and Comparable Nursing Data for Transforming Health and Healthcare builds on and leverages existing but separate long-standing efforts of many individuals and organizations. The plan is action focused, with accountability for coordinating and tracking progress designated.


2016 ◽  
Vol 22 (8) ◽  
pp. S72
Author(s):  
Abhishek Khemka ◽  
Richard Kovacs ◽  
Wanzhu Tu ◽  
Ross Hayden ◽  
Abdullah Masud ◽  
...  
