Correlates of record linkage and estimating risks of non-linkage biases in business data sets

2017 ◽  
Vol 181 (4) ◽  
pp. 1211-1230 ◽  
Author(s):  
Jamie C. Moore ◽  
Peter W. F. Smith ◽  
Gabriele B. Durrant


2015 ◽
Vol 14 (03) ◽  
pp. 521-533
Author(s):  
M. Sariyar ◽  
A. Borg

Deterministic record linkage (RL) is frequently regarded as a rival to more sophisticated strategies such as probabilistic RL. We investigate the effect of combining deterministic linkage with other linkage techniques. For this task, we use a simple deterministic linkage strategy as a preceding filter: a data pair is classified as a 'match' if the values of all considered attributes agree exactly, and as a 'non-match' otherwise. This strategy is separately combined with two probabilistic RL methods based on the Fellegi–Sunter model and with two classification tree methods (CART and bagging). An empirical comparison was conducted on two real data sets. We used four different partitions into training and test data to increase the validity of the results. In almost all cases, applying deterministic linkage as a preceding filter leads to better results than omitting such a pre-filter, and overall the classification trees exhibited the best results. On all data sets, probabilistic RL only profited from deterministic linkage when the underlying probabilities were estimated before the deterministic linkage was applied. When a pre-filter removes the definite cases, the underlying population of data pairs changes; it is crucial to take this into account for model-based probabilistic RL.
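The pre-filter described above is simple to state in code. A minimal sketch in Python (attribute names and example pairs are hypothetical; the residual pairs would then be passed to a probabilistic or tree-based classifier):

```python
def deterministic_prefilter(pairs, attributes):
    """Classify a pair as a definite 'match' only if all considered
    attributes agree exactly; everything else stays undecided."""
    matches, residual = [], []
    for rec_a, rec_b in pairs:
        if all(rec_a[attr] == rec_b[attr] for attr in attributes):
            matches.append((rec_a, rec_b))   # removed from the candidate set
        else:
            residual.append((rec_a, rec_b))  # left for probabilistic RL or trees
    return matches, residual

# Hypothetical example pairs
pairs = [
    ({"name": "Ann Lee", "dob": "1980-01-02"},
     {"name": "Ann Lee", "dob": "1980-01-02"}),
    ({"name": "Ann Lee", "dob": "1980-01-02"},
     {"name": "Ann Leigh", "dob": "1980-01-02"}),
]
matches, residual = deterministic_prefilter(pairs, ["name", "dob"])
```

Note that, as the abstract stresses, removing the definite matches changes the population of pairs on which any downstream model-based probabilities are estimated.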


Author(s):  
LOTFI BEN ROMDHANE ◽  
NADIA FADHEL ◽  
BECHIR AYEB

Data mining (DM) is an emerging discipline that aims to extract knowledge from data using a variety of techniques. DM has proved useful in business, where the data describing customers and their transactions is on the order of terabytes. In this paper, we propose an approach for building customer models (also called profiles in the literature) from business data. Our approach proceeds in three steps. In the first step, we use fuzzy clustering to categorize customers, i.e., to determine groups of customers. A key feature is that the number of groups (or clusters) is computed automatically from the data, using the partition entropy as a validity criterion. In the second step, we perform a dimensionality reduction that keeps, for each group of customers, only the most informative attributes. For this, we define the information loss to quantify the information content of an attribute. As a result of this second step, we obtain groups of customers, each described by a distinct set of attributes. In the third and final step, we use backpropagation neural networks to extract useful knowledge from these groups. Experimental results on real-world data sets reveal a good performance of our approach and should stimulate future research.
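The partition entropy used as a cluster-validity criterion can be sketched as follows (a minimal illustration, not the authors' implementation): lower entropy indicates a crisper, and hence more plausible, partition.

```python
import math

def partition_entropy(memberships):
    """Partition entropy H = -(1/n) * sum_i sum_k u_ik * log(u_ik),
    where u_ik is the fuzzy membership of customer i in cluster k."""
    n = len(memberships)
    h = 0.0
    for row in memberships:
        for u in row:
            if u > 0.0:
                h -= u * math.log(u)
    return h / n

# A crisp partition has entropy 0; a maximally fuzzy one has entropy log(k).
crisp = partition_entropy([[1.0, 0.0], [0.0, 1.0]])
fuzzy = partition_entropy([[0.5, 0.5], [0.5, 0.5]])
```

One would run the fuzzy clustering for several candidate numbers of clusters and keep the value that minimizes this entropy.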


Author(s):  
Rainer Schnell ◽  
Christian Borgs

ABSTRACT
Objective: In most European settings, record linkage across different institutions has to be based on personal identifiers such as names, date of birth or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates of up to 20% per identifier, so linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study; this kind of record linkage will therefore generate biased estimates. These problems gave rise to techniques of privacy-preserving record linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, but very few of them are suitable for practical applications with large databases containing millions of records, as is typical for administrative or medical databases. One proven technique for large-scale PPRL is based on Bloom filters.
Method: Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. With suitable blocking strategies, linking can be done in reasonable time.
Result: However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straightforward application of Bloom filters carries a non-zero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers with different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance in terms of precision, recall and re-identification risk on large databases.
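The basic Bloom filter encoding behind this line of PPRL can be sketched as follows (a simplified illustration with arbitrary parameters, not the hardened variant the abstract announces): each identifier is split into bigrams, each bigram is hashed k times into an m-bit array, and encrypted records are compared with the Dice coefficient at the bit level.

```python
import hashlib

def bloom_encode(identifier, m=256, k=2):
    """Hash the bigrams of an identifier into an m-bit Bloom filter
    (bits held in a Python int)."""
    bits = 0
    padded = f"_{identifier.lower()}_"
    bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    for gram in bigrams:
        for seed in range(k):  # k keyed hash functions, simulated by a seed
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            bits |= 1 << (int(digest, 16) % m)
    return bits

def dice_similarity(a, b):
    """Dice coefficient of two bit arrays: 2|A & B| / (|A| + |B|)."""
    inter = bin(a & b).count("1")
    return 2.0 * inter / (bin(a).count("1") + bin(b).count("1"))

sim_same = dice_similarity(bloom_encode("Smith"), bloom_encode("Smith"))
sim_close = dice_similarity(bloom_encode("Smith"), bloom_encode("Smyth"))
sim_far = dice_similarity(bloom_encode("Smith"), bloom_encode("Jones"))
```

Because similar names share bigrams, their filters share bits, which is what makes the encoding linkable despite encryption; it is also the structure that the cryptographic attacks mentioned above exploit, and that the diffusion techniques are designed to break.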


2020 ◽  
Vol 36 (4) ◽  
pp. 803-825
Author(s):  
Marco Fortini

Abstract
Record linkage addresses the problem of identifying pairs of records that come from different sources and refer to the same unit of interest. Fellegi and Sunter propose an optimal statistical test for assigning the match status to candidate pairs, in which the needed parameters are obtained through the EM algorithm applied directly to the set of candidate pairs, without recourse to training data. However, this procedure has quadratic complexity as the two lists to be matched grow. In addition, the EM-estimated parameters are strongly biased in this case, so the problem is usually tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, such methods cannot assess the probability that excluded pairs are actually true matches.
The present work proposes an efficient approach in which the comparisons of records between lists are minimised, while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased estimates of the parameters. The improvement achieved by the suggested method is shown by means of simulations and an application based on real data.
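For readers unfamiliar with the Fellegi-Sunter setup, a toy EM fit on binary agreement patterns can be sketched as below (assuming conditionally independent comparison fields; illustrative only, without the structural-zero correction this paper develops):

```python
def fs_em(patterns, n_iter=100):
    """EM for a Fellegi-Sunter mixture: patterns are tuples of 0/1
    agreement indicators; estimates the match proportion p and the
    per-field m- (agree | match) and u- (agree | non-match) probabilities."""
    K = len(patterns[0])
    p, m, u = 0.1, [0.9] * K, [0.1] * K
    for _ in range(n_iter):
        # E-step: posterior match probability g for each candidate pair
        g = []
        for x in patterns:
            pm, pu = p, 1.0 - p
            for k in range(K):
                pm *= m[k] if x[k] else 1.0 - m[k]
                pu *= u[k] if x[k] else 1.0 - u[k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate p, m and u from the posteriors
        p = sum(g) / len(g)
        for k in range(K):
            m[k] = sum(gi for gi, x in zip(g, patterns) if x[k]) / sum(g)
            u[k] = sum(1 - gi for gi, x in zip(g, patterns) if x[k]) / sum(1 - gi for gi in g)
    return p, m, u

# Synthetic patterns: ten pairs agreeing on all three fields, ten on none
p, m, u = fs_em([(1, 1, 1)] * 10 + [(0, 0, 0)] * 10)
```

The bias the paper addresses arises when such estimates are computed on a filtered candidate set; modelling the structural zeros of the resulting contingency tables is what restores unbiasedness.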


2021 ◽  
pp. 13-26
Author(s):  
Felix Kruse ◽  
Jan-Philipp Awick ◽  
Jorge Marx Gómez ◽  
Peter Loos

This paper explores record linkage as a step in the data integration process, focusing on the entity type 'company'. For the integration of company data, the company name is a crucial attribute, and it often includes the legal form. The legal form is not represented concisely and consistently across different data sources, which leads to considerable data quality problems for the subsequent record linkage steps. To solve these problems, we classify and extract the legal form from the company name attribute. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid one combining a rule set with a supervised machine learning model. With our hybrid approach, any company data set from research or business can be processed, improving the data quality for subsequent data processing steps such as record linkage. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.
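A minimal sketch of the rule-based half of such a hybrid (the token mapping and function name are hypothetical, not the authors' rule set): names the rules do not cover would be handed to the supervised model.

```python
import re

# Hypothetical mapping of legal-form tokens to canonical labels
LEGAL_FORMS = {"gmbh": "GmbH", "ag": "AG", "ltd": "Ltd", "inc": "Inc", "llc": "LLC"}

def extract_legal_form(company_name):
    """Return the canonical legal form found in a company name, or None."""
    tokens = re.split(r"[\s,.()]+", company_name.lower())
    for token in reversed(tokens):  # the legal form usually comes last
        if token and token in LEGAL_FORMS:
            return LEGAL_FORMS[token]
    return None
```

Stripping and canonicalizing the legal form before comparing names removes a major source of spurious disagreement between sources.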


Author(s):  
Murat Sariyar ◽  
Jürgen Holm

Record linkage refers to a range of methods for merging and consolidating data in such a way that duplicates are detected and false links are avoided. For such a task, it is crucial to discern between the similarity and the identity of entities. This paper explores the implications of the ontological concept of identity for record linkage (RL) on biomedical data sets. In order to draw substantial conclusions, we use the differentiation between numerical identity, qualitative identity and relational identity. We discuss the problems of using similarity measures for record pairs and qualitative identity for ascertaining the real status of these pairs. We conclude that relational identity should be operationalized for RL.


Author(s):  
SUNITHA YEDDULA ◽  
K. LAKSHMAIAH

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that would be too costly to acquire. Removing duplicate records from a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing in record linkage and deduplication is the proper definition of blocking keys.
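Standard blocking, the simplest of the surveyed indexing techniques, can be sketched as follows (records and blocking key are hypothetical): only records sharing a blocking-key value are ever paired, so obvious non-matches are never compared.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key_func):
    """Group records by blocking key and generate candidate pairs
    only within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

records = [
    {"surname": "smith", "zip": "10115"},
    {"surname": "smyth", "zip": "10115"},
    {"surname": "jones", "zip": "10115"},
    {"surname": "smith", "zip": "99999"},
]
# Key: first letter of the surname plus the postcode; the full comparison
# space of these four records has 6 pairs, the blocked one far fewer.
pairs = block_pairs(records, lambda r: r["surname"][0] + r["zip"])
```

As the survey's experiments stress, the definition of the key governs the trade-off: a key that is too coarse leaves too many candidate pairs, while one that is too tight splits true matches into different blocks and loses them.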


2017 ◽  
Author(s):  
Colby Redfield ◽  
Abdulhakim Tlimat ◽  
Yoni Halpern ◽  
David Schoenfeld ◽  
Edward Ullman ◽  
...  

Abstract
Background: Linking EMS electronic patient care reports (ePCRs) to ED records can give clinicians access to vital information that can alter management. It can also create rich databases for research and quality improvement. Unfortunately, previous attempts at ePCR-ED record linkage have had limited success.
Objective: To derive and validate an automated record linkage algorithm between EMS ePCRs and ED records using supervised machine learning.
Methods: All consecutive ePCRs from a single EMS provider between June 2013 and June 2015 were included. A primary reviewer matched ePCRs to a list of ED patients to create a gold standard. Age, gender, last name, first name, social security number (SSN), and date of birth (DOB) were extracted. The data were randomly split into 80%/20% training and test sets. We derived missing indicators, identical indicators, edit distances, and percent differences. A multivariate logistic regression model was trained using 5-fold cross-validation (label k-fold), L2 regularization, and class re-weighting.
Results: A total of 14,032 ePCRs were included in the study. Inter-rater reliability between the primary and secondary reviewers had a kappa of 0.9. The algorithm had a sensitivity of 99.4%, a PPV of 99.9% and an AUC of 0.99 in both the training and test sets. DOB match had the highest odds ratio (16.9), followed by last name match (10.6). SSN match had an odds ratio of 3.8.
Conclusions: We successfully derived and validated a probabilistic record linkage algorithm from a single EMS ePCR provider to our hospital EMR.
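The comparison features described in the Methods can be sketched as follows (field names are hypothetical; the trained logistic regression itself is omitted):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def pair_features(rec_a, rec_b, fields):
    """Per field: a missing indicator, an exact-match indicator and an
    edit distance, in the spirit of the abstract's feature set."""
    feats = []
    for f in fields:
        va, vb = rec_a.get(f), rec_b.get(f)
        missing = va is None or vb is None
        feats.append(1.0 if missing else 0.0)               # missing indicator
        feats.append(0.0 if missing else float(va == vb))   # identical indicator
        feats.append(0.0 if missing else float(edit_distance(va, vb)))
    return feats
```

Such feature vectors, one per candidate ePCR-ED pair, are what the regularized logistic regression is trained on against the gold-standard labels.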


2010 ◽  
Vol 10 (1) ◽  
Author(s):  
Rosemary Karmel ◽  
Phil Anderson ◽  
Diane Gibson ◽  
Ann Peut ◽  
Stephen Duckett ◽  
...  
