Record Linkage

The Experts below are selected from a list of 16560 Experts worldwide ranked by ideXlab platform

V.s. Verykios - One of the best experts on this subject based on the ideXlab platform.

  • Privacy-preserving Record Linkage
    Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2013
    Co-Authors: V.s. Verykios, Peter Christen
    Abstract:

    It has been recognized that sharing data between organizations can be of great benefit, since it can help discover novel and valuable information that is not available in individual databases. However, as organizations are under pressure to better utilize their large databases through sharing, integration, and analysis, protecting the privacy of personal information in such databases is an increasingly difficult task. Record Linkage is the task of identifying and matching Records that correspond to the same real-world entity in several databases. This task implies a crucial infrastructure component in many modern information systems. Privacy and confidentiality concerns, however, commonly prevent the matching of databases that contain personal information across different organizations. In the past decade, efforts in the research area of privacy-preserving Record Linkage (PPRL) have aimed to develop techniques that facilitate the matching of Records across databases such that besides the matched Records no private or confidential information is being revealed to any organization involved in such a Linkage, or to any external party. We discuss the development of key techniques that solve the three main subproblems of PPRL, namely privacy, Linkage quality, and scaling PPRL to large databases. We then highlight open challenges in this research area.

  • A taxonomy of privacy-preserving Record Linkage techniques
    Information Systems, 2013
    Co-Authors: Dinusha Vatsalan, Peter Christen, V.s. Verykios
    Abstract:

    The process of identifying which Records in two or more databases correspond to the same entity is an important aspect of data quality activities such as data pre-processing and data integration. Known as Record Linkage, data matching or entity resolution, this process has attracted interest from researchers in fields such as databases and data warehousing, data mining, information systems, and machine learning. Record Linkage has various challenges, including scalability to large databases, accurate matching and classification, and privacy and confidentiality. The latter challenge arises because commonly personal identifying data, such as names, addresses and dates of birth of individuals, are used in the Linkage process. When databases are linked across organizations, the issue of how to protect the privacy and confidentiality of such sensitive information is crucial to successful application of Record Linkage. In this paper we present an overview of techniques that allow the linking of databases between organizations while at the same time preserving the privacy of these data. Known as 'privacy-preserving Record Linkage' (PPRL), various such techniques have been developed. We present a taxonomy of PPRL techniques to characterize these techniques along 15 dimensions, and conduct a survey of PPRL techniques. We then highlight shortcomings of current techniques and discuss avenues for future research.

  • Advances in Privacy Preserving Record Linkage
    E-Activity and Intelligent Web Construction, 2011
    Co-Authors: Alexandros Karakasidis, V.s. Verykios
    Abstract:

    However, even though many solutions have been proposed towards addressing this problem, a new side effect arises regarding the privacy of the data, which usually has to be protected during Linkage. Sensitive information such as names, addresses, and illnesses, especially in cases of medical data, should not be revealed without further evidence to any participant of the merging procedure. This raises the need to create new techniques for linking data while, at the same time, preserving the privacy of the subjects described by these data. This need led to the emergence of a new research area called privacy preserving Record Linkage. This chapter will attempt to present the state of the art of the methods proposed to address the privacy preserving Record Linkage problem and provide a taxonomy of these techniques based on their core characteristics.

  • Privacy preserving Record Linkage approaches
    International Journal of Data Mining Modelling and Management, 2009
    Co-Authors: V.s. Verykios, Alexandros Karakasidis, Vassilios K. Mitrogiannis
    Abstract:

    Privacy-preserving Record Linkage is a very important task, mostly because of the very sensitive nature of the personal data. The main focus in this task is to find a way to match Records from among different organisation data sets or databases without revealing competitive or personal information to non-owners. Towards accomplishing this task, several methods and protocols have been proposed. In this work, we propose a certain methodology for preserving the privacy of various Record Linkage approaches and we implement, examine and compare four pairs of privacy preserving Record Linkage methods and protocols. Two of these protocols use n-gram based similarity comparison techniques, the third protocol uses the well known edit distance and the fourth one implements the Jaro-Winkler distance metric. All of the protocols used are enhanced by private key cryptography and hash encoding. This paper also presents a blocking scheme as an extension to the privacy preserving Record Linkage methodology. Our comparison is backed up by extended experimental evaluation that demonstrates the performance achieved by each of the proposed protocols.
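
As a rough illustration of the n-gram comparison with keyed hash encoding mentioned in the abstract above, the following Python sketch hashes each character bigram with HMAC-SHA256 and compares the resulting sets with the Dice coefficient. It is not the authors' protocol: the function names, the choice of HMAC-SHA256, the Dice coefficient, and the key value are all assumptions made for illustration.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a normalized string into its set of character bigrams."""
    s = value.strip().lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def encode(value: str, key: bytes) -> set[str]:
    """Replace each bigram with a keyed hash so that no plaintext is exchanged."""
    return {
        hmac.new(key, g.encode("utf-8"), hashlib.sha256).hexdigest()
        for g in bigrams(value)
    }


def dice(a: set[str], b: set[str]) -> float:
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical secret shared by the two data owners (assumption).
shared_key = b"hypothetical-shared-secret"
score = dice(encode("smith", shared_key), encode("smyth", shared_key))
```

Because both parties hash with the same key, equal bigrams map to equal digests, so the set overlap (and therefore the similarity score) is the same as on the plaintext bigrams.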

  • Optimal Stopping: A Record-Linkage Approach
    Journal of Data and Information Quality, 2009
    Co-Authors: George V. Moustakides, V.s. Verykios
    Abstract:

    Record-Linkage is the process of identifying whether two separate Records refer to the same real-world entity when some elements of the Record’s identifying information (attributes) agree and others disagree. Existing Record-Linkage decision methodologies use the outcomes from the comparisons of the whole set of attributes. Here, we propose an alternative scheme that assesses the attributes sequentially, allowing for a decision to be made at any attribute’s comparison stage, and thus before exhausting all available attributes. The scheme we develop is optimum in that it minimizes a well-defined average cost criterion while the corresponding optimum solution can be easily mapped into a decision tree to facilitate the Record-Linkage decision process. Experimental results performed in real datasets indicate the superiority of our methodology compared to existing approaches.
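
The idea of deciding before all attributes have been compared can be sketched with a simple threshold rule on a running log-likelihood ratio. This is only a Wald-style illustration, not the optimal cost-minimizing scheme the abstract describes; the agreement probabilities `m_probs` and `u_probs` and the thresholds `upper` and `lower` are hypothetical parameters.

```python
import math


def sequential_decision(agreements, m_probs, u_probs, upper, lower):
    """Compare attributes one at a time and stop once the evidence is decisive.

    agreements: list of booleans, one per attribute, in comparison order.
    m_probs / u_probs: assumed probabilities that an attribute agrees for
    true matches / true non-matches.
    Returns (decision, number_of_attributes_used).
    """
    score = 0.0
    used = 0
    for agrees, m, u in zip(agreements, m_probs, u_probs):
        used += 1
        # Add the log-likelihood ratio of the observed outcome.
        score += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
        if score >= upper:
            return "match", used
        if score <= lower:
            return "non-match", used
    return "undecided", used


# Two agreeing attributes already push the score past the upper threshold.
result = sequential_decision([True, True, True], [0.9] * 3, [0.1] * 3, 4.0, -4.0)
```

In this example the third attribute is never inspected, which is the saving a sequential scheme aims for.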

A.k. Elmagarmid - One of the best experts on this subject based on the ideXlab platform.

  • Behavior based Record Linkage
    Very Large Data Bases, 2010
    Co-Authors: Mohamed Yakout, A.k. Elmagarmid, Hazen Elmeleegy, Mourad Ouzzani
    Abstract:

    In this paper, we present a new Record Linkage approach that uses entity behavior to decide if potentially different entities are in fact the same. An entity's behavior is extracted from a transaction log that Records the actions of this entity with respect to a given data source. The core of our approach is a technique that merges the behavior of two possible matched entities and computes the gain in recognizing behavior patterns as their matching score. The idea is that if we obtain a well recognized behavior after merge, then most likely, the original two behaviors belong to the same entity as the behavior becomes more complete after the merge. We present the necessary algorithms to model entities' behavior and compute a matching score for them. To improve the computational efficiency of our approach, we precede the actual matching phase with a fast candidate generation that uses a "quick and dirty" matching method. Extensive experiments on real data show that our approach can significantly enhance Record Linkage quality while being practical for large transaction logs.

  • ICDE - Efficient Private Record Linkage
    2009 IEEE 25th International Conference on Data Engineering, 2009
    Co-Authors: Mohamed Yakout, Mikhail J. Atallah, A.k. Elmagarmid
    Abstract:

    Record Linkage is the computation of the associations among Records of multiple databases. It arises in contexts like the integration of such databases, online interactions and negotiations, and many others. The autonomous entities who wish to carry out the Record matching computation are often reluctant to fully share their data. In such a framework where the entities are unwilling to share data with each other, the problem of carrying out the Linkage computation without full data exchange has been called private Record Linkage. Previous private Record Linkage techniques have made use of a third party. We provide efficient techniques for private Record Linkage that improve on previous work in that (i) they make no use of a third party; (ii) they achieve much better performance than that of previous schemes in terms of execution time and quality of output (i.e., practically without false negatives and minimal false positives). Our software implementation provides experimental validation of our approach and the above claims.

  • Record Linkage Based on Entities' Behavior
    2008
    Co-Authors: Mohamed Yakout, A.k. Elmagarmid, Hazen Elmeleegy, Mourad Ouzzani
    Abstract:

    Record Linkage is the problem of identifying similar Records across different data sources. Traditional Record Linkage techniques focus on using simple database attributes in a textual similarity comparison to decide on matched and non-matched Records. Recently, Record Linkage techniques have considered useful extracted knowledge and domain information to help enhance the matching accuracy. In this paper, we present a new technique for Record Linkage that is based on an entity’s behavior, which can be extracted from a transaction log. In the matching process, we measure the improvement in identifying a behavior when comparing two entities by merging their transaction logs. To do so, we use two matching phases: first, a candidate generation phase, which is fast and produces almost no false negatives but has low precision; second, an accurate matching phase, which enhances the precision of the matching at a high run-time cost. In the candidate generation phase, behavior is represented by points in the complex plane, where we perform approximate evaluations. In the accurate matching phase, we use a heuristic called compressibility, where identified behaviors are more compressible. Our experiments show that the proposed technique can be used to enhance the Record Linkage quality while being practical for large logs. We also perform extensive sensitivity analysis for the technique’s accuracy and performance.

  • TAILOR: a Record Linkage toolbox
    Proceedings 18th International Conference on Data Engineering, 2002
    Co-Authors: M.g. Elfeky, V.s. Verykios, A.k. Elmagarmid
    Abstract:

    Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the Record pairs that represent the same entity (duplicate Records), commonly known as Record Linkage, is one of the essential elements of data cleaning. In this paper, we address the Record Linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage toolbox named TAILOR (backwards acronym for "Record Linkage Toolbox"). Users of TAILOR can build their own Record Linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the Record Linkage process, and is designed in an extensible way to interface with existing and future Record Linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning Record Linkage models outperform the existing ones both in accuracy and in performance.

Peter Christen - One of the best experts on this subject based on the ideXlab platform.

  • Evaluation measure for group-based Record Linkage
    International Journal of Population Data Science, 2020
    Co-Authors: Charini Nanayakkara, Peter Christen, Thilina Ranbaduge, Eilidh Garrett
    Abstract:

    Introduction: The robustness of Record Linkage evaluation measures is of high importance since Linkage techniques are assessed based on these. However, minimal research has been conducted to evaluate the suitability of existing evaluation measures in the context of linking groups of Records. Linkage quality is generally evaluated based on traditional measures such as precision and recall. As we show, these traditional evaluation measures are not suitable for evaluating groups of linked Records because they evaluate the quality of individual Record pairs rather than the quality of Records grouped into clusters. Objectives: We highlight the shortcomings of traditional evaluation measures and then propose a novel method to evaluate clustering quality in the context of group-based Record Linkage. Methods: The proposed Linkage evaluation method assesses how well individual Records have been allocated into predicted groups/clusters with respect to ground-truth data. We first identify the best representative predicted cluster for each ground-truth cluster and, based on the resulting mapping, each Record in a ground-truth cluster is assigned to one of seven categories. These categories reflect how well the Linkage technique assigned Records into groups. Results: We empirically evaluate our proposed method using real-world data and show that it better reflects the quality of clusters generated by three group-based Record Linkage techniques. We also show that traditional measures such as precision and recall can produce ambiguous results whereas our method does not. Conclusions: The proposed evaluation method provides unambiguous results regarding the assessed group-based Record Linkage approaches. The method comprises seven categories which reflect how each Record was predicted, providing more detailed information about the quality of the Linkage result. This will help to make better-informed decisions about which Linkage technique is best suited for a given Linkage application.
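
For context, the traditional pair-based precision and recall that the abstract above argues against can be computed from two clusterings as in the Python sketch below. It does not implement the seven-category method the abstract proposes; the function names and the toy record identifiers are assumptions.

```python
from itertools import combinations


def linked_pairs(clusters):
    """Return the set of unordered record pairs implied by a clustering."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Precision and recall over record pairs, ignoring cluster structure."""
    pred, true = linked_pairs(predicted), linked_pairs(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(true) if true else 1.0
    return precision, recall


# Toy ground truth and prediction over five hypothetical record ids.
truth = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c", "d", "e"}]
precision, recall = pairwise_precision_recall(predicted, truth)
```

Both values are 0.5 here, yet the numbers alone do not say that record "c" was moved into the wrong group, which is the kind of per-record detail a group-based measure is meant to expose.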

  • Advanced Record Linkage methods and privacy aspects for population reconstruction
    2014
    Co-Authors: Peter Christen
    Abstract:

    Recent times have seen an increased interest into techniques that allow the linking of Records across databases. The main challenges of Record Linkage are (1) scalability to the increasingly large databases common today; (2) accurate and efficient classification of compared Records into matches and non-matches in the presence of variations and errors in the data; and (3) privacy issues that occur when the linking of Records is based on sensitive personal information about individuals. The first challenge has been addressed by the development of scalable indexing techniques, the second through advanced classification techniques that either employ machine learning or graph based methods, and the third challenge is investigated by research into privacy-preserving Record Linkage. In this paper, we describe these major challenges of Record Linkage in the context of population reconstruction, outline recent developments of advanced Record Linkage methods, and provide directions for future research.

  • Privacy-preserving Record Linkage
    Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2013
    Co-Authors: V.s. Verykios, Peter Christen
    Abstract:

    It has been recognized that sharing data between organizations can be of great benefit, since it can help discover novel and valuable information that is not available in individual databases. However, as organizations are under pressure to better utilize their large databases through sharing, integration, and analysis, protecting the privacy of personal information in such databases is an increasingly difficult task. Record Linkage is the task of identifying and matching Records that correspond to the same real-world entity in several databases. This task implies a crucial infrastructure component in many modern information systems. Privacy and confidentiality concerns, however, commonly prevent the matching of databases that contain personal information across different organizations. In the past decade, efforts in the research area of privacy-preserving Record Linkage (PPRL) have aimed to develop techniques that facilitate the matching of Records across databases such that besides the matched Records no private or confidential information is being revealed to any organization involved in such a Linkage, or to any external party. We discuss the development of key techniques that solve the three main subproblems of PPRL, namely privacy, Linkage quality, and scaling PPRL to large databases. We then highlight open challenges in this research area.

  • A taxonomy of privacy-preserving Record Linkage techniques
    Information Systems, 2013
    Co-Authors: Dinusha Vatsalan, Peter Christen, V.s. Verykios
    Abstract:

    The process of identifying which Records in two or more databases correspond to the same entity is an important aspect of data quality activities such as data pre-processing and data integration. Known as Record Linkage, data matching or entity resolution, this process has attracted interest from researchers in fields such as databases and data warehousing, data mining, information systems, and machine learning. Record Linkage has various challenges, including scalability to large databases, accurate matching and classification, and privacy and confidentiality. The latter challenge arises because commonly personal identifying data, such as names, addresses and dates of birth of individuals, are used in the Linkage process. When databases are linked across organizations, the issue of how to protect the privacy and confidentiality of such sensitive information is crucial to successful application of Record Linkage. In this paper we present an overview of techniques that allow the linking of databases between organizations while at the same time preserving the privacy of these data. Known as 'privacy-preserving Record Linkage' (PPRL), various such techniques have been developed. We present a taxonomy of PPRL techniques to characterize these techniques along 15 dimensions, and conduct a survey of PPRL techniques. We then highlight shortcomings of current techniques and discuss avenues for future research.

  • AusDM - Towards automated Record Linkage
    2006
    Co-Authors: Karl Goiser, Peter Christen
    Abstract:

    The field of Record Linkage is concerned with identifying Records from one or more datasets which refer to the same underlying entities. Where entity-unique identifiers are not available and errors occur, the process is non-trivial. Many techniques developed in this field require human intervention to set parameters, manually classify possibly matched Records, or provide examples of matched and non-matched Records. Whilst of great use and providing high quality results, the requirement of human input, besides being costly, means that if the parameters or examples are not produced or maintained properly, Linkage quality will be compromised. The contributions of this paper are a critical discussion on the Record Linkage process, arguing for a more restrictive use of blocking in research, and evaluating and modifying the farthest-first clustering technique to produce results close to a supervised technique.
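
The farthest-first clustering mentioned in the abstract above can be sketched in its basic, unmodified form as follows. The paper's modifications are not reproduced here; treating each candidate Record pair as a vector of per-attribute similarities, and the function names, are assumptions made for illustration.

```python
import math


def farthest_first_centers(points, k):
    """Pick k centers: start with the first point, then repeatedly add the
    point whose distance to its nearest chosen center is largest."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        next_center = max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)
        )
        centers.append(next_center)
    return centers


def assign(points, centers):
    """Assign each point to the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]


# Hypothetical similarity vectors for four candidate record pairs.
vectors = [(1.0, 1.0), (0.9, 1.0), (0.1, 0.0), (0.0, 0.0)]
centers = farthest_first_centers(vectors, 2)
labels = assign(vectors, centers)
```

With two clusters, one center ends up near the high-similarity vectors and the other near the low-similarity ones, so the clusters can be read as likely matches and likely non-matches without labelled examples.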

Stephen E Fienberg - One of the best experts on this subject based on the ideXlab platform.

  • A Bayesian Approach to Graphical Record Linkage and Deduplication
    Journal of the American Statistical Association, 2016
    Co-Authors: Rebecca C. Steorts, Rob Hall, Stephen E Fienberg
    Abstract:

    We propose an unsupervised approach for linking Records across arbitrarily many files, while simultaneously detecting duplicate Records within files. Our key innovation involves the representation of the pattern of links between Records as a bipartite graph, in which Records are directly linked to latent true individuals, and only indirectly linked to other Records. This flexible representation of the Linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive Linkage probabilities across Records (and represent this visually), and propagate the uncertainty of Record Linkage into later analyses. Our method makes it particularly easy to integrate Record Linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our Linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously Record Linkage ap...

  • Privacy-preserving Record Linkage
    Privacy in Statistical Databases, 2010
    Co-Authors: Rob Hall, Stephen E Fienberg
    Abstract:

    Record Linkage has a long tradition in both the statistical and the computer science literature. We survey current approaches to the Record Linkage problem in a privacy-aware setting and contrast these with the more traditional literature. We also identify several important open questions that pertain to private Record Linkage from different perspectives.

Erhard Rahm - One of the best experts on this subject based on the ideXlab platform.

  • Optimization of the Mainzelliste software for fast privacy-preserving Record Linkage.
    Journal of translational medicine, 2021
    Co-Authors: Florens Rohde, Ziad Sehili, Martin Franke, Martin Lablans, Erhard Rahm
    Abstract:

    Background: Data analysis for biomedical research often requires a Record Linkage step to identify Records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, Record Linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in Record Linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving Record Linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source Record Linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods: We evaluate the Linkage quality and performance of the Linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching Records. We conduct a comparison between (plaintext) Record Linkage and PPRL based on encoded Records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving Record Linkage. Results: The Mainzelliste achieves high Linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high Linkage quality. Conclusion: We conduct the first comprehensive evaluation of the Record Linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed a very high Linkage quality for both plaintext and encoded data even in the presence of errors. The provided blocking methods yield orders-of-magnitude improvements in runtime performance, thus facilitating the use in research projects with large datasets and many participants.
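
A field-level Bloom filter encoding of the kind referred to in the abstract above can be sketched in Python as follows. This is a generic illustration and not the Mainzelliste implementation: the filter size, the number of hash functions, the use of HMAC-SHA256 to derive bit positions, and the key are all assumptions.

```python
import hashlib
import hmac


def bloom_encode(value: str, key: bytes, size: int = 256, num_hashes: int = 4) -> int:
    """Encode the bigrams of one field into a Bloom filter, returned as an int bitmask."""
    s = value.strip().lower()
    bits = 0
    for i in range(len(s) - 1):
        gram = s[i:i + 2].encode("utf-8")
        for j in range(num_hashes):
            # Prefix the hash-function index so each j yields a different position.
            digest = hmac.new(key, bytes([j]) + gram, hashlib.sha256).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_bits(a: int, b: int) -> float:
    """Dice coefficient on set bits: 2 * |a AND b| / (|a| + |b|)."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 1.0


# Hypothetical secret shared by the data owners (assumption).
key = b"hypothetical-shared-secret"
similarity = dice_bits(bloom_encode("meier", key), bloom_encode("meyer", key))
```

Spelling variants share some bigrams and therefore some set bits, so their encodings remain similar without either side revealing the plaintext name. Comparing every pair of filters is quadratic in the number of Records, which is why the abstract stresses blocking for larger datasets.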

  • BTW - Privacy Preserving Record Linkage with PPJoin.
    2015
    Co-Authors: Ziad Sehili, Rainer Schnell, Lars Kolb, Christian Borgs, Erhard Rahm
    Abstract:

    Privacy-preserving Record Linkage (PPRL) becomes increasingly important to match and integrate Records with sensitive data. PPRL not only has to preserve the anonymity of the persons or entities involved but should also be highly efficient and scalable to large datasets. We therefore investigate how to adapt PPJoin, one of the fastest approaches for regular Record Linkage, to PPRL resulting in a new approach called P4Join. The use of bit vectors for PPRL also allows us to devise a parallel execution of P4Join on GPUs. We evaluate the new approaches and compare their efficiency with a PPRL approach based on multibit trees.