Duplicate Detection

The Experts below are selected from a list of 8265 Experts worldwide ranked by the ideXlab platform

Felix Naumann - One of the best experts on this subject based on the ideXlab platform.

  • Data Preparation for Duplicate Detection
    Journal of Data and Information Quality, 2020
    Co-Authors: Ioannis K. Koumarelas, Lan Jiang, Felix Naumann
    Abstract:

    Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of Duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing Duplicate Detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints for domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space, ineffective data preparations are discarded based on whether they improve or worsen pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and identifies redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of Duplicate Detection by up to 19% in AUC-PR.
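
    A minimal sketch of the second, leave-one-out phase described above, assuming the caller supplies the candidate preparation functions and a scoring callable that runs Duplicate Detection on a prepared dataset and returns its AUC-PR. All names and the example preparations are illustrative, not the authors' implementation.

```python
def leave_one_out_selection(preparations, dataset, score_auc_pr):
    """Drop data preparations whose removal does not lower AUC-PR.

    preparations : list of callables, each mapping a record (dict) to a prepared record
    dataset      : list of records
    score_auc_pr : callable(prepared_dataset) -> AUC-PR in [0, 1]  (assumed to be supplied)
    """
    def apply_all(preps):
        prepared = dataset
        for prep in preps:
            prepared = [prep(r) for r in prepared]
        return prepared

    selected = list(preparations)
    baseline = score_auc_pr(apply_all(selected))
    changed = True
    while changed and len(selected) > 1:
        changed = False
        for prep in list(selected):
            remaining = [p for p in selected if p is not prep]
            score = score_auc_pr(apply_all(remaining))
            if score >= baseline:            # dropping 'prep' did not hurt: it is redundant
                selected, baseline = remaining, score
                changed = True
                break                        # re-scan with the smaller preparation set
    return selected

# Example candidate preparations (whitespace removal is mentioned in the abstract;
# lower-casing is merely a further illustrative candidate).
strip_ws  = lambda r: {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
lowercase = lambda r: {k: v.lower() if isinstance(v, str) else v for k, v in r.items()}
```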

  • MDedup: Duplicate Detection with matching dependencies
    Proceedings of the VLDB Endowment, 2020
    Co-Authors: Ioannis K. Koumarelas, Thorsten Papenbrock, Felix Naumann
    Abstract:

    Duplicate Detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing Duplicate Detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based and fully automatic Duplicate Detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as Duplicate Detection rules. Once trained, the model can select useful MDs for Duplicate Detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain- or target-data-specific configuration.
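
    The sketch below illustrates how a single matching dependency (MD) can act as a Duplicate Detection rule: if a record pair is sufficiently similar on the MD's left-hand-side attributes, it is declared a Duplicate. The example MD, its thresholds, and the difflib-based similarity are assumptions for illustration, not the MDs discovered or selected by MDedup.

```python
from difflib import SequenceMatcher
from itertools import combinations

def sim(a, b):
    return SequenceMatcher(None, str(a), str(b)).ratio()

def md_rule(lhs):
    """Turn an MD's left-hand side (attribute -> similarity threshold) into a pair classifier."""
    def is_duplicate(r1, r2):
        return all(sim(r1[attr], r2[attr]) >= t for attr, t in lhs.items())
    return is_duplicate

records = [
    {"name": "Jon Smith",  "city": "Berlin"},
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Ann Jones",  "city": "Potsdam"},
]

# Hypothetical MD: high similarity on name and city implies a Duplicate.
rule = md_rule({"name": 0.8, "city": 0.9})
duplicates = [(i, j) for (i, r1), (j, r2) in combinations(enumerate(records), 2) if rule(r1, r2)]
print(duplicates)   # [(0, 1)] for this toy data
```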

  • ICDM Workshops - Cluster-Based Sorted Neighborhood for Efficient Duplicate Detection
    2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016
    Co-Authors: Ahmad Samiei, Felix Naumann
    Abstract:

    Duplicate Detection intends to find multiple and syntactically different representations of the same real-world entities in a dataset. The naive way of Duplicate Detection entails a quadratic number of pair-wise record comparisons to identify the Duplicates. This number of comparisons might take hours even for an average-sized dataset. As today's databases grow very fast, different candidate-selection methods, such as sorted neighborhood, blocking, canopy clustering, and their variations, address this problem by shrinking the comparison space. The volume and velocity of data change require ever faster and more flexible methods of Duplicate Detection. In particular, they need dynamic indices that can be updated efficiently as new data arrives. We present a novel approach, which combines the idea of cluster-based methods with the well-known sorted neighborhood method. It carefully filters out irrelevant candidate pairs, which are less likely to yield Duplicates, by pre-clustering records based not only on their proximity after sorting, but also on their similarity in selected attributes. An empirical evaluation on synthetic and real-world datasets shows that our approach improves the overall runtime over existing approaches while maintaining comparable result quality. Moreover, it uses dynamic indices, which in turn make it useful for deduplicating streaming data.
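
    A minimal sketch of the sorted-neighborhood idea the approach builds on, extended with a cheap similarity filter on a selected attribute as a stand-in for the pre-clustering step. The actual method and its dynamic index are more involved; window size, threshold, and data below are illustrative.

```python
from difflib import SequenceMatcher

def sorted_neighborhood_pairs(records, key, window=3, filter_attr=None, min_sim=0.5):
    """Classic sorted neighborhood plus a cheap attribute-similarity filter on window pairs."""
    ordered = sorted(records, key=key)
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            r1, r2 = ordered[i], ordered[j]
            if filter_attr is not None:
                s = SequenceMatcher(None, str(r1[filter_attr]), str(r2[filter_attr])).ratio()
                if s < min_sim:
                    continue             # unlikely to be a Duplicate: skip the expensive comparison
            yield r1, r2

records = [
    {"id": 1, "name": "Smith, John", "zip": "10115"},
    {"id": 2, "name": "Smith, Jon",  "zip": "10115"},
    {"id": 3, "name": "Smyth, Anna", "zip": "14467"},
]
for a, b in sorted_neighborhood_pairs(records, key=lambda r: r["name"], filter_attr="zip"):
    print(a["id"], b["id"])              # only the pair 1/2 survives the zip filter
```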

  • Progressive Duplicate Detection
    IEEE Transactions on Knowledge and Data Engineering, 2015
    Co-Authors: Thorsten Papenbrock, Arvid Heise, Felix Naumann
    Abstract:

    Duplicate Detection is the process of identifying multiple representations of the same real-world entities. Today, Duplicate Detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive Duplicate Detection algorithms that significantly increase the efficiency of finding Duplicates if the execution time is limited: They maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional Duplicate Detection and significantly improve upon related work.
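
    One way to make sorted-neighborhood comparisons progressive, in the spirit of the abstract: after sorting, all pairs at sort distance 1 are compared first, then distance 2, and so on, so most Duplicates are reported early and the process can stop when the time budget runs out. This is an illustrative simplification with made-up data, not the paper's exact algorithms.

```python
import time
from difflib import SequenceMatcher

def progressive_pairs(records, key, max_distance):
    """Yield record pairs ordered by sort distance: the most promising pairs first."""
    ordered = sorted(records, key=key)
    for distance in range(1, max_distance + 1):
        for i in range(len(ordered) - distance):
            yield ordered[i], ordered[i + distance]

def is_duplicate(r1, r2, threshold=0.85):
    return SequenceMatcher(None, r1["name"], r2["name"]).ratio() >= threshold

records = [{"name": n} for n in ["Ann Meyer", "Anne Meyer", "Bob Stone", "Bob Stones"]]

deadline = time.time() + 1.0                      # limited execution time
for r1, r2 in progressive_pairs(records, key=lambda r: r["name"], max_distance=3):
    if time.time() > deadline:
        break                                     # stop early; results found so far are kept
    if is_duplicate(r1, r2):
        print("duplicate:", r1["name"], "/", r2["name"])
```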

  • ICIQ - On choosing thresholds for Duplicate Detection.
    2013
    Co-Authors: Uwe Draisbach, Felix Naumann
    Abstract:

    Duplicate Detection, i.e., the discovery of records that refer to the same real-world entity, is a task that usually depends on multiple input parameters provided by an expert. Most notably, an expert must specify some similarity measure and some threshold that declares a record pair a Duplicate if its similarity surpasses it. Both are typically developed in a trial-and-error manner with a given (sample) dataset. We posit that the similarity measure largely depends on the nature of the data and the errors contained in it that cause the Duplicates, but that the threshold largely depends on the size of the dataset it was tested on. In consequence, configurations of Duplicate Detection runs work well on the test dataset, but perform worse if the size of the dataset changes. This weakness is due to the transitive nature of duplicity: In larger datasets, transitivity can cause more records to enter a Duplicate cluster than intended. We analyze this interesting effect extensively on four popular test datasets using different Duplicate Detection algorithms and report on our observations.
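
    The transitivity effect can be made concrete in a few lines: pairs that individually pass a fixed similarity threshold are chained together by the transitive closure, so records that are not directly similar can still end up in the same Duplicate cluster. Threshold, similarity measure, and names below are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def clusters_by_transitive_closure(values, threshold):
    parent = {v: v for v in values}

    def find(x):                                   # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(values, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            parent[find(a)] = find(b)              # union: a and b are declared Duplicates

    groups = {}
    for v in values:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

# Pairs such as "meier"/"mayer" are below the threshold, yet the chain of
# above-threshold pairs pulls all four names into a single cluster.
names = ["meier", "meyer", "mayer", "maier"]
print(clusters_by_transitive_closure(names, threshold=0.75))
```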

Xuanjing Huang - One of the best experts on this subject based on the ideXlab platform.

  • efficient partial Duplicate Detection based on sequence matching
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010
    Co-Authors: Qi Zhang, Yue Zhang, Xuanjing Huang
    Abstract:

    With the ever-increasing growth of the Internet, numerous copies of documents have become a serious problem for search engines, opinion mining, and many other web applications. Since partial Duplicates contain only a small piece of text taken from other sources, and most existing near-Duplicate Detection approaches focus on the document level, partial Duplicates cannot be dealt with well. In this paper, we propose a novel algorithm for the partial-Duplicate Detection task. Besides computing the similarities between documents, our proposed algorithm can simultaneously locate the Duplicated parts. The main idea is to divide the partial-Duplicate Detection task into two subtasks: sentence-level near-Duplicate Detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results suggest that our proposed method detects partial Duplicates on large web collections both effectively and efficiently.
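
    A minimal sketch of the two-subtask decomposition: first detect near-Duplicate sentences between two documents, then apply sequence matching over the matched sentence indices to locate the longest Duplicated passage. The sentence splitting, similarity measure, threshold, and example texts are illustrative choices, not those of the paper.

```python
import re
from difflib import SequenceMatcher

def sentences(text):
    return [s.strip() for s in re.split(r"[.!?]\s*", text) if s.strip()]

def near_duplicate(s1, s2, threshold=0.9):
    return SequenceMatcher(None, s1.lower(), s2.lower()).ratio() >= threshold

def duplicated_passage(doc_a, doc_b):
    sa, sb = sentences(doc_a), sentences(doc_b)
    # Subtask 1: sentence-level near-Duplicate Detection.
    matches = [(i, j) for i, s1 in enumerate(sa) for j, s2 in enumerate(sb)
               if near_duplicate(s1, s2)]
    # Subtask 2: sequence matching -- find the longest run of consecutive matched sentences.
    best, run = [], []
    for i, j in sorted(matches):
        if run and i == run[-1][0] + 1 and j == run[-1][1] + 1:
            run.append((i, j))
        else:
            run = [(i, j)]
        if len(run) > len(best):
            best = list(run)
    return [sa[i] for i, _ in best]

a = "We study duplicate detection. The web grows quickly. Results are promising."
b = "Unrelated intro. The web grows quickly! Results are promising. Unrelated end."
print(duplicated_passage(a, b))   # ['The web grows quickly', 'Results are promising']
```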

Melanie Weis - One of the best experts on this subject based on the ideXlab platform.

  • Industry-scale Duplicate Detection
    Proceedings of the VLDB Endowment, 2008
    Co-Authors: Melanie Weis, Felix Naumann, Ulrich Jehle, Jens Lufter, Holger Schuster
    Abstract:

    Duplicate Detection is the process of identifying multiple representations of the same real-world object in a data source. Duplicate Detection is a problem of critical importance in many applications, including customer relationship management, personal information management, and data mining. In this paper, we present how a research prototype, namely DogmatiX, which was designed to detect Duplicates in hierarchical XML data, was successfully extended and applied to a large-scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main business line is to store and retrieve credit histories of over 60 million individuals. Here, correctly identifying Duplicates is critical both for individuals and companies: On the one hand, an incorrectly identified Duplicate potentially results in a false negative credit history for an individual, who will then not be granted credit anymore. On the other hand, it is essential for companies that Schufa detects Duplicates of a person who deliberately tries to create a new identity in the database in order to have a clean credit history. Besides the quality of Duplicate Detection, i.e., its effectiveness, scalability cannot be neglected, because of the considerable size of the database. We describe our solution to coping with both problems and present a comprehensive evaluation based on large volumes of real-world data.

  • structure based inference of xml similarity for fuzzy Duplicate Detection
    Conference on Information and Knowledge Management, 2007
    Co-Authors: Luís Leitão, Pável Calado, Melanie Weis
    Abstract:

    Fuzzy Duplicate Detection aims at identifying multiple representations of real-world objects stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with equal schema). Algorithms for fuzzy Duplicate Detection in more complex structures, e.g., hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the Duplicate status of their direct neighbors, e.g., children in hierarchical data, to improve Duplicate Detection effectiveness. In this paper, we propose a novel method for fuzzy Duplicate Detection in hierarchical and semi-structured XML data. Unlike previous approaches, it does not merely consider the Duplicate status of children, but rather the probability of descendants being Duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show the proposed algorithm is able to maintain high precision and recall values, even when dealing with data containing a high amount of errors and missing information. Our proposal is also able to outperform a state-of-the-art Duplicate Detection system on three different XML databases.
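
    The following toy sketch conveys the underlying intuition: the probability that two XML elements are Duplicates combines the similarity of their own text with the recursively computed Duplicate probabilities of their children, instead of a hard per-child Duplicate decision. The simple averaging and greedy child matching used here are only placeholders for the paper's Bayesian network; data and weights are made up.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET

def dup_probability(e1, e2):
    """Combine the elements' own text similarity with their children's Duplicate probabilities."""
    t1, t2 = (e1.text or "").strip(), (e2.text or "").strip()
    own = SequenceMatcher(None, t1, t2).ratio() if (t1 or t2) else None
    children1, children2 = list(e1), list(e2)
    if not children1 or not children2:
        return own if own is not None else 0.0
    # For each child of e1, take its best-matching child of e2 (greedy, for brevity).
    child_probs = [max(dup_probability(c1, c2) for c2 in children2) for c1 in children1]
    child_avg = sum(child_probs) / len(child_probs)
    return child_avg if own is None else 0.5 * own + 0.5 * child_avg

p1 = ET.fromstring("<person><name>John Smith</name><city>Berlin</city></person>")
p2 = ET.fromstring("<person><name>Jon Smith</name><city>Berlin</city></person>")
p3 = ET.fromstring("<person><name>Ann Jones</name><city>Potsdam</city></person>")

print(round(dup_probability(p1, p2), 2))   # close to 1: likely Duplicates
print(round(dup_probability(p1, p3), 2))   # much lower: likely distinct
```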

  • Relationship-Based Duplicate Detection
    2006
    Co-Authors: Melanie Weis, Felix Naumann
    Abstract:

    Recent work in both the relational and the XML world has shown that the efficacy and efficiency of Duplicate Detection are enhanced by regarding relationships between ancestors and descendants. We present a novel comparison strategy that uses relationships but dispenses with the strict bottom-up and top-down approaches proposed for hierarchical data. Instead, pairs of objects at any level of the hierarchy are compared in an order that depends on their relationships: Objects with many dependants influence many other duplicity decisions, and thus it should be decided early whether they are Duplicates themselves. We apply this ordering strategy to two algorithms. RECONA allows re-examining an object if its influencing neighbors turn out to be Duplicates; here, ordering reduces the number of such re-comparisons. ADAMA is more efficient by not allowing any re-comparison; here, the order minimizes the number of mistakes made.
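
    A small sketch of the ordering strategy described above: candidate pairs are compared in descending order of how many other duplicity decisions they influence, so that objects with many dependants are decided early. The dependant sets and pairs are toy data, not the paper's data model.

```python
from itertools import combinations

# dependants[x] = objects whose duplicity decision is influenced by x (toy data)
dependants = {
    "publisher_A": {"book_1", "book_2", "book_3"},
    "publisher_B": {"book_4"},
    "book_1": set(), "book_2": set(), "book_3": set(), "book_4": set(),
}

def influence(pair):
    a, b = pair
    return len(dependants[a]) + len(dependants[b])

candidate_pairs = list(combinations(dependants, 2))
ordered = sorted(candidate_pairs, key=influence, reverse=True)

for a, b in ordered[:3]:
    print(a, "<->", b)   # pairs involving the influential publishers come first
```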

  • Fuzzy Duplicate Detection on XML Data
    2005
    Co-Authors: Melanie Weis
    Abstract:

    XML is popular for data exchange and data publishing on the Web, but it comes with errors and inconsistencies inherent to real-world data. Hence, there is a need for XML data cleansing, which requires solutions for fuzzy Duplicate Detection in XML. The hierarchical and semi-structured nature of XML strongly differs from the flat and structured relational model, which has received the main attention in Duplicate Detection so far. We consider four major challenges of XML Duplicate Detection to develop effective, efficient, and scalable solutions to the problem.

Qi Zhang - One of the best experts on this subject based on the ideXlab platform.

  • efficient partial Duplicate Detection based on sequence matching
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010
    Co-Authors: Qi Zhang, Yue Zhang, Xuanjing Huang
    Abstract:

    With the ever-increasing growth of the Internet, numerous copies of documents have become a serious problem for search engines, opinion mining, and many other web applications. Since partial Duplicates contain only a small piece of text taken from other sources, and most existing near-Duplicate Detection approaches focus on the document level, partial Duplicates cannot be dealt with well. In this paper, we propose a novel algorithm for the partial-Duplicate Detection task. Besides computing the similarities between documents, our proposed algorithm can simultaneously locate the Duplicated parts. The main idea is to divide the partial-Duplicate Detection task into two subtasks: sentence-level near-Duplicate Detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results suggest that our proposed method detects partial Duplicates on large web collections both effectively and efficiently.

Luís Leitão - One of the best experts on this subject based on the ideXlab platform.

  • Efficient and Effective Duplicate Detection in Hierarchical Data
    IEEE Transactions on Knowledge and Data Engineering, 2012
    Co-Authors: Luís Leitão, Pável Calado, Melanie Herschel
    Abstract:

    Although there is a long line of work on identifying Duplicates in relational data, only a few solutions focus on Duplicate Detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML Duplicate Detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being Duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores on several datasets. XMLDup is also able to outperform another state-of-the-art Duplicate Detection solution, both in terms of efficiency and effectiveness.
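
    The general pruning pattern suggested above can be sketched as follows: evaluate the evidence for a candidate pair piece by piece, keep an optimistic upper bound on the final Duplicate probability, and abandon the pair as soon as that bound falls below the decision threshold. The weighted-average scoring below is only a stand-in for the Bayesian network evaluated by XMLDup, not its actual pruning strategy; all names and numbers are illustrative.

```python
def prob_upper_bound(scores, weights, remaining_weight):
    """Optimistically assume every not-yet-evaluated similarity equals 1.0."""
    optimistic = sum(s * w for s, w in zip(scores, weights)) + remaining_weight
    return optimistic / (sum(weights) + remaining_weight)

def evaluate_pair(similarities, weights, threshold=0.8):
    """similarities: zero-argument callables, cheapest first; returns (is_duplicate, evaluations_used)."""
    scores, used = [], []
    for i, (sim, w) in enumerate(zip(similarities, weights)):
        remaining = sum(weights[i:])
        if prob_upper_bound(scores, used, remaining) < threshold:
            return False, i      # prune: even perfect remaining evidence cannot reach the threshold
        scores.append(sim())
        used.append(w)
    final = sum(s * w for s, w in zip(scores, used)) / sum(used)
    return final >= threshold, len(scores)

# Toy pair: the first, cheap similarity is so low that the pair is pruned after a
# single evaluation instead of four.
sims = [lambda: 0.1, lambda: 0.2, lambda: 0.9, lambda: 0.9]
print(evaluate_pair(sims, weights=[1, 1, 1, 1]))   # (False, 1)
```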

  • CIKM - Duplicate Detection through structure optimization
    Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011
    Co-Authors: Luís Leitão, Pável Calado
    Abstract:

    Detecting and eliminating Duplicates in databases is a task of critical importance in many applications. Although solutions for traditional models, such as relational data, have been widely studied, recently there has been some focus on solutions for more complex hierarchical structures, for instance, XML data. Such data presents many different challenges, among which is the issue of how to exploit the schema structure to determine whether two objects are Duplicates. In this paper, we argue that structure can indeed have a significant impact on the process of Duplicate Detection. We propose a novel method that automatically restructures database objects in order to take full advantage of the relations between their attributes. This new structure reflects the relative importance of the attributes in the database and avoids the need to perform a manual selection. To test our approach, we applied it to an existing Duplicate Detection system. Experiments performed on several datasets show that, using the new learned structure, we consistently outperform both the results obtained with the original database structure and those obtained by letting a knowledgeable user manually choose the attributes to compare.
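
    A loose sketch of the underlying idea, under the assumption that an attribute's importance can be estimated by how well its similarity separates known Duplicate pairs from non-Duplicate pairs; such a ranking could then drive the restructuring. The scoring function and data are illustrative, not the paper's method.

```python
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, str(a), str(b)).ratio()

def attribute_importance(attr, duplicate_pairs, non_duplicate_pairs):
    """Large gap between the two averages = the attribute separates the classes well."""
    dup_avg = sum(sim(r1[attr], r2[attr]) for r1, r2 in duplicate_pairs) / len(duplicate_pairs)
    non_avg = sum(sim(r1[attr], r2[attr]) for r1, r2 in non_duplicate_pairs) / len(non_duplicate_pairs)
    return dup_avg - non_avg

dups = [({"title": "Data Cleaning", "year": "2007"},
         {"title": "Data Cleaning.", "year": "2007"})]
nons = [({"title": "Data Cleaning", "year": "2007"},
         {"title": "Query Optimization", "year": "2007"})]

ranking = sorted(["title", "year"],
                 key=lambda a: attribute_importance(a, dups, nons), reverse=True)
print(ranking)   # 'title' ranks above 'year': the year is identical even for non-Duplicates
```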

  • Soft Computing in XML Data Management - An Overview of XML Duplicate Detection Algorithms
    Soft Computing in XML Data Management, 2010
    Co-Authors: Pável Calado, Melanie Herschel, Luís Leitão
    Abstract:

    Fuzzy Duplicate Detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration tasks. It has a long history for relational data, stored in a single table or in multiple tables with an equal schema. However, algorithms for fuzzy Duplicate Detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the Duplicate status of their direct neighbors to improve Duplicate Detection effectiveness. In this chapter, we study different approaches that have been proposed for XML fuzzy Duplicate Detection. Our study includes a description and analysis of the different approaches, as well as a comparative experimental evaluation performed on both artificial and real-world data. The two main dimensions used for comparison are the methods' effectiveness and efficiency. Our comparison shows that the DogmatiX system [44] is the most effective overall, as it yields the highest recall and precision values for various kinds of differences between Duplicates. Another system, called XMLDup [27], has a similar performance, being most effective especially at low recall values. Finally, the SXNM system [36] is the most efficient, as it avoids executing too many pairwise comparisons, but its effectiveness is greatly affected by errors in the data.

  • structure based inference of xml similarity for fuzzy Duplicate Detection
    Conference on Information and Knowledge Management, 2007
    Co-Authors: Luís Leitão, Pável Calado, Melanie Weis
    Abstract:

    Fuzzy Duplicate Detection aims at identifying multiple representations of real-world objects stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with equal schema). Algorithms for fuzzy Duplicate Detection in more complex structures, e.g., hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the Duplicate status of their direct neighbors, e.g., children in hierarchical data, to improve Duplicate Detection effectiveness. In this paper, we propose a novel method for fuzzy Duplicate Detection in hierarchical and semi-structured XML data. Unlike previous approaches, it does not merely consider the Duplicate status of children, but rather the probability of descendants being Duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show the proposed algorithm is able to maintain high precision and recall values, even when dealing with data containing a high amount of errors and missing information. Our proposal is also able to outperform a state-of-the-art Duplicate Detection system on three different XML databases.
