Pay-as-You-Go

The Experts below are selected from a list of 13,530 Experts worldwide, ranked by the ideXlab platform.

Suzanne Embury - One of the best experts on this subject based on the ideXlab platform.

  • Pay-as-You-Go Configuration of Entity Resolution
    Lecture Notes in Computer Science, 2016
    Co-Authors: Ruhaila Maskat, Norman W. Paton, Suzanne Embury
    Abstract:

    Entity resolution, which seeks to identify records that represent the same entity, is an important step in many data integration and data cleaning applications. However, entity resolution is challenging both in terms of scalability (all-against-all comparisons are computationally impractical) and result quality (syntactic evidence on record equivalence is often equivocal). As a result, end-to-end entity resolution proposals involve several stages, including blocking to efficiently identify candidate duplicates, detailed comparison to refine the conclusions from blocking, and clustering to identify the sets of records that may represent the same entity. However, the quality of the result is often crucially dependent on configuration parameters in all of these stages, for which it may be difficult for a human expert to provide suitable values. This paper describes an approach in which a complete entity resolution process is optimized, on the basis of feedback such as might be obtained from crowds on candidate duplicates. Given such feedback, an evolutionary search of the space of configuration parameters is carried out, with a view to maximizing the fitness of the resulting clusters. The approach is Pay-as-You-Go in that more feedback can be expected to give rise to better outcomes. An empirical evaluation shows that the co-optimization of the different stages in entity resolution can yield significant improvements over default parameters, even with small amounts of feedback.
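The feedback-driven configuration search described in this abstract can be sketched as a toy evolutionary loop. This is an illustrative sketch, not the paper's implementation: the single `threshold` parameter, the hard-coded similarity table, and the function names are invented stand-ins for a full blocking/comparison/clustering pipeline with many parameters.

```python
# Sketch: evolutionary search over an ER configuration, scored against
# crowd feedback on candidate duplicate pairs. All names and data are
# hypothetical; a real pipeline would have many more parameters.
import random

# Hypothetical crowd feedback: pair -> True if judged a duplicate.
feedback = {
    ("r1", "r2"): True,
    ("r1", "r3"): False,
    ("r2", "r3"): False,
}

def resolve(config, pairs):
    """Stand-in for blocking + comparison + clustering: declares a pair a
    duplicate when a (fake) similarity meets the configured threshold."""
    sim = {("r1", "r2"): 0.9, ("r1", "r3"): 0.4, ("r2", "r3"): 0.3}
    return {p: sim[p] >= config["threshold"] for p in pairs}

def fitness(config):
    """Fraction of feedback pairs on which the pipeline agrees with the crowd."""
    decisions = resolve(config, feedback.keys())
    return sum(decisions[p] == feedback[p] for p in feedback) / len(feedback)

def evolve(generations=20, pop_size=8, seed=0):
    """Keep the fittest configurations and refill the population by mutation."""
    rng = random.Random(seed)
    pop = [{"threshold": rng.random()} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [
            {"threshold": min(1.0, max(0.0, c["threshold"] + rng.gauss(0, 0.1)))}
            for c in survivors
        ]
    return max(pop, key=fitness)

best = evolve()
```

More feedback pairs would sharpen the fitness signal, which is what makes the approach Pay-as-You-Go.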

  • SOFSEM - Pay-as-You-Go Data Integration: Experiences and Recurring Themes
    Lecture Notes in Computer Science, 2016
    Co-Authors: Norman W. Paton, Suzanne Embury, Alvaro A. A. Fernandes, Khalid Belhajjame, Ruhaila Maskat
    Abstract:

    Data integration typically seeks to provide the illusion that data from multiple distributed sources comes from a single, well managed source. Providing this illusion in practice tends to involve the design of a global schema that captures the users' data requirements, followed by manual (with tool support) construction of mappings between sources and the global schema. This overall approach can provide high quality integrations but at high cost, and tends to be unsuitable for areas with large numbers of rapidly changing sources, where users may be willing to cope with a less than perfect integration. Pay-as-You-Go data integration has been proposed to overcome the need for costly manual data integration. Pay-as-You-Go data integration tends to involve two steps. Initialisation: automatic creation of mappings (generally of poor quality) between sources. Improvement: the obtaining of feedback on some aspect of the integration, and the application of this feedback to revise the integration. There has been considerable research in this area over a ten-year period. This paper reviews some experiences with Pay-as-You-Go data integration, providing a framework that can be used to compare or develop Pay-as-You-Go data integration techniques.
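The two-step Initialisation/Improvement loop described in this abstract can be sketched minimally as follows. All function names and the attribute-name bootstrap are invented for illustration; they are not from the paper.

```python
# Sketch of the Pay-as-You-Go loop: bootstrap cheap, likely-imperfect
# mappings, then fold user feedback back into the integration.

def initialise(sources):
    """Initialisation: automatically derive candidate mappings.
    Toy bootstrap: map every source attribute to a same-named global attribute."""
    return {(src, attr): attr for src, attrs in sources.items() for attr in attrs}

def improve(mappings, user_feedback):
    """Improvement: apply feedback ((source, attr) -> correct?) to revise
    the integration, here by dropping mappings flagged as wrong."""
    return {m: target for m, target in mappings.items()
            if user_feedback.get(m, True)}

sources = {"s1": ["name", "tel"], "s2": ["name", "phone"]}
mappings = initialise(sources)
# A user marks the automatic mapping for s2.phone as wrong:
mappings = improve(mappings, {("s2", "phone"): False})
```

Each round of feedback refines the mapping set, so integration quality improves incrementally rather than requiring up-front design.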

  • SIGMOD Conference - Pay-as-You-Go mapping selection in dataspaces
    Proceedings of the 2011 international conference on Management of data - SIGMOD '11, 2011
    Co-Authors: Cornelia Hedeler, Suzanne Embury, Alvaro A. A. Fernandes, Khalid Belhajjame, Norman W. Paton, Lu Mao, Chenjuan Guo
    Abstract:

    The vision of dataspaces proposes an alternative to classical data integration approaches with reduced up-front costs followed by incremental improvement on a Pay-as-You-Go basis. In this paper, we demonstrate DSToolkit, a system that allows users to provide feedback on results of queries posed over an integration schema. Such feedback is then used to annotate the mappings with their respective precision and recall. The system then allows a user to state the expected levels of precision (or recall) that the query results should exhibit and, in order to produce those results, the system selects those mappings that are predicted to meet the stated constraints.
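The mapping-selection idea in this abstract can be sketched as a simple filter. This is an illustrative sketch with invented names, not DSToolkit's API: each mapping is annotated with a precision estimate from feedback counts, and only mappings predicted to meet the user's stated precision constraint are selected.

```python
# Sketch: select mappings whose feedback-estimated precision meets a
# user-stated threshold. Feedback counts and mapping names are hypothetical.

# Per-mapping feedback: how many of its query results users marked
# correct vs incorrect.
mapping_feedback = {
    "m1": {"correct": 18, "incorrect": 2},
    "m2": {"correct": 5, "incorrect": 5},
    "m3": {"correct": 9, "incorrect": 1},
}

def estimated_precision(counts):
    """Precision estimate for one mapping from its feedback counts."""
    total = counts["correct"] + counts["incorrect"]
    return counts["correct"] / total if total else 0.0

def select_mappings(feedback, min_precision):
    """Keep only mappings predicted to meet the stated precision constraint."""
    return sorted(
        m for m, counts in feedback.items()
        if estimated_precision(counts) >= min_precision
    )

selected = select_mappings(mapping_feedback, min_precision=0.8)
```

A recall constraint would work symmetrically, trading fewer selected mappings (higher precision) against more complete answers.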

Norman W. Paton - One of the best experts on this subject based on the ideXlab platform.

  • Pay-as-You-Go Configuration of Entity Resolution
    Lecture Notes in Computer Science, 2016
    Co-Authors: Ruhaila Maskat, Norman W. Paton, Suzanne Embury

  • SOFSEM - Pay-as-You-Go Data Integration: Experiences and Recurring Themes
    Lecture Notes in Computer Science, 2016
    Co-Authors: Norman W. Paton, Suzanne Embury, Alvaro A. A. Fernandes, Khalid Belhajjame, Ruhaila Maskat

  • SWIM - Pay-as-You-Go data integration for linked data: opportunities, challenges and architectures
    Proceedings of the 4th International Workshop on Semantic Web Information Management - SWIM '12, 2012
    Co-Authors: Norman W. Paton, Alvaro A. A. Fernandes, Klitos Christodoulou, Bijan Parsia, Cornelia Hedeler
    Abstract:

    Linked Data (LD) provides principles for publishing data that underpin the development of an emerging web of data. LD follows the web in providing low barriers to entry: publishers can make their data available using a small set of standard technologies, and consumers can search for and browse published data using generic tools. Like the web, consumers frequently consume data in broadly the form in which it was published; this will be satisfactory in some cases, but the diversity of publishers means that the data required to support a task may be stored in many different sources, and described in many different ways. As such, although RDF provides a syntactically homogeneous language for describing data, sources typically manifest a wide range of heterogeneities, in terms of how data on a concept is represented. This paper makes the case that many aspects of both publication and consumption of LD stand to benefit from a Pay-as-You-Go approach to data integration. Specifically, the paper: (i) identifies a collection of opportunities for applying Pay-as-You-Go techniques to LD; (ii) describes some preliminary experiences applying a Pay-as-You-Go data integration system to LD; and (iii) presents some open issues that need to be addressed to enable the full benefits of Pay-as-You-Go integration to be realised.

  • SIGMOD Conference - Pay-as-You-Go mapping selection in dataspaces
    Proceedings of the 2011 international conference on Management of data - SIGMOD '11, 2011
    Co-Authors: Cornelia Hedeler, Suzanne Embury, Alvaro A. A. Fernandes, Khalid Belhajjame, Norman W. Paton, Lu Mao, Chenjuan Guo

Ruhaila Maskat - One of the best experts on this subject based on the ideXlab platform.

  • Pay-as-You-Go Configuration of Entity Resolution
    Lecture Notes in Computer Science, 2016
    Co-Authors: Ruhaila Maskat, Norman W. Paton, Suzanne Embury

  • SOFSEM - Pay-as-You-Go Data Integration: Experiences and Recurring Themes
    Lecture Notes in Computer Science, 2016
    Co-Authors: Norman W. Paton, Suzanne Embury, Alvaro A. A. Fernandes, Khalid Belhajjame, Ruhaila Maskat

Alon Halevy - One of the best experts on this subject based on the ideXlab platform.

  • Functional Dependency Generation and Applications in Pay-as-You-Go Data Integration Systems
    International Workshop on the Web and Databases, 2009
    Co-Authors: Daisy Zhe Wang, Anish Das Sarma, Michael J. Franklin, Xin Luna Dong, Alon Halevy
    Abstract:

    Recently, the opportunity of extracting structured data from the Web has been identified by a number of research projects. One such example is that millions of relational-style HTML tables can be extracted from the Web. Traditional data integration approaches do not scale over such corpora with hundreds of small tables in one domain. To solve this problem, previous work has proposed Pay-as-You-Go data integration systems to provide, with little up-front cost, base services over loosely-integrated information. One key component of such systems, which has received little attention to date, is the need for a framework to gauge and improve the quality of the integration. We propose a framework based on functional dependencies (FDs). Unlike in traditional database design, where FDs are specified as statements of truth about all possible instances of the database, in the web environment FDs are not specified over the data tables. Instead, we generate FDs by counting-based algorithms over many data sources, and extend the FDs with probabilities to capture the inherent uncertainties in them. Given these probabilistic FDs, we show how to solve two problems to improve data and schema quality in a Pay-as-You-Go system: (1) pinpointing dirty data sources and (2) normalizing large mediated schemas. We describe these techniques and evaluate them over real-world data sets extracted from the Web.
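The counting-based generation of probabilistic FDs described here can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the probability of A → B is estimated simply as the fraction of sources whose data satisfies the dependency exactly, and the table data and column names are invented.

```python
# Sketch: estimate P(A -> B) by counting, over many sources, how often
# each A-value maps to a single B-value.

def holds(table, a, b):
    """Does A -> B hold in this table (every A-value has one B-value)?"""
    seen = {}
    for row in table:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False
        seen[row[a]] = row[b]
    return True

def fd_probability(sources, a, b):
    """Probabilistic FD: fraction of sources in which A -> B holds."""
    return sum(holds(t, a, b) for t in sources) / len(sources)

sources = [
    [{"zip": "100", "city": "NY"}, {"zip": "200", "city": "LA"}],
    [{"zip": "100", "city": "NY"}, {"zip": "100", "city": "NY"}],
    [{"zip": "100", "city": "NY"}, {"zip": "100", "city": "Boston"}],  # violates
]
p = fd_probability(sources, "zip", "city")
```

Weighting sources by size or trust, as a real system might, would refine the estimate; the per-source count alone already separates clean FDs from accidental ones.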

  • Discovering Functional Dependencies in Pay-as-You-Go Data Integration Systems
    2009
    Co-Authors: Daisy Zhe Wang, Anish Das Sarma, Michael J. Franklin, Luna Dong, Alon Halevy
    Abstract:

    Functional dependency is one of the most extensively researched subjects in database theory, originally for improving quality of schemas, and recently for improving quality of data. In a Pay-as-You-Go data integration system, where the goal is to provide best-effort service even without thorough understanding of the underlying domain and the various data sources, functional dependency can play an even more important role, applied in normalizing an automatically generated mediated schema, pinpointing sources of low quality, resolving conflicts in data from different sources, improving efficiency of query answering, and so on. Despite its importance, discovering functional dependencies in such a context is challenging: we cannot assume upfront domain knowledge for specifying dependencies, and the data can be dirty, incomplete, or even misinterpreted, making automatic discovery of dependencies hard. This paper studies how one can automatically discover functional dependencies in a Pay-as-You-Go data integration system. We introduce the notion of probabilistic functional dependencies (pFDs) and design Bayes models that compute probabilities of dependencies according to data from various sources. As an application, we study how to normalize a mediated schema based on the pFDs we generate. Experiments on real-world data sets with tens or hundreds of data sources show that our techniques obtain high precision and recall in dependency discovery and generate high-quality results in mediated-schema normalization.

  • Bootstrapping Pay-as-You-Go Data Integration Systems
    International Conference on Management of Data, 2008
    Co-Authors: Anish Das Sarma, Xin Dong, Alon Halevy
    Abstract:

    Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a Pay-as-You-Go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a Pay-as-You-Go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
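Automatic creation of a mediated schema from the sources, as described above, can be sketched crudely by clustering source attribute names. This is an illustrative sketch under strong simplifications, not the paper's probabilistic mediated schema: a string-similarity measure stands in for the probabilistic treatment of uncertain matches, and the attribute names are invented.

```python
# Sketch: bootstrap a mediated schema by greedy single-link clustering of
# attribute names collected from all sources; each cluster becomes one
# mediated attribute.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    """Crude proxy for match probability between two attribute names."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def cluster_attributes(source_attrs, threshold=0.7):
    """Greedily group attribute names; each group is a mediated attribute."""
    clusters = []
    for attr in source_attrs:
        for cluster in clusters:
            if any(similar(attr, member, threshold) for member in cluster):
                cluster.append(attr)
                break
        else:
            clusters.append([attr])
    return clusters

attrs = ["phone", "phone-no", "email", "e-mail", "address"]
mediated = cluster_attributes(attrs)
```

The paper's contribution is precisely to keep the uncertainty in such groupings, entertaining multiple candidate mediated schemas with probabilities rather than committing to one clustering as this sketch does.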

  • Pay-as-You-Go User Feedback for Dataspace Systems
    International Conference on Management of Data, 2008
    Co-Authors: Shawn R. Jeffery, Michael J. Franklin, Alon Halevy
    Abstract:

    A primary challenge to large-scale data integration is creating semantic equivalences between elements from different data sources that correspond to the same real-world entity or concept. Dataspaces propose a Pay-as-You-Go approach: automated mechanisms such as schema matching and reference reconciliation provide initial correspondences, termed candidate matches, and then user feedback is used to incrementally confirm these matches. The key to this approach is to determine in what order to solicit user feedback for confirming candidate matches. In this paper, we develop a decision-theoretic framework for ordering candidate matches for user confirmation using the concept of the value of perfect information (VPI). At the core of this concept is a utility function that quantifies the desirability of a given state; thus, we devise a utility function for dataspaces based on query result quality. We show in practice how to efficiently apply VPI in concert with this utility function to order user confirmations. A detailed experimental evaluation on both real and synthetic datasets shows that the ordering of user feedback produced by this VPI-based approach yields a dataspace with a significantly higher utility than a wide range of other ordering strategies. Finally, we outline the design of Roomba, a system that utilizes this decision-theoretic framework to guide a dataspace in soliciting user feedback in a Pay-as-You-Go manner.
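The VPI-based ordering described in this abstract can be sketched with a toy utility model. This is an illustrative sketch, not Roomba's utility function: utility here is a single gain/cost trade-off per match, under which the value of perfect information works out to min(p·gain, (1−p)·cost), so maximally uncertain matches are confirmed first.

```python
# Sketch: order candidate matches for user confirmation by value of
# perfect information (VPI) under a toy accept/ignore utility model.
# Match probabilities and identifiers are hypothetical.

def vpi(p, gain=1.0, cost=1.0):
    """VPI for one candidate match.

    Without feedback we either accept it (expected utility
    p*gain - (1-p)*cost) or ignore it (utility 0). With perfect
    information we accept only true matches (expected utility p*gain).
    VPI is the difference between the two expectations."""
    before = max(p * gain - (1 - p) * cost, 0.0)
    after = p * gain
    return after - before

def order_candidates(candidates, gain=1.0, cost=1.0):
    """Solicit user feedback on the highest-VPI candidates first."""
    return sorted(candidates, key=lambda c: vpi(c["p"], gain, cost), reverse=True)

candidates = [
    {"id": "A=B", "p": 0.95},  # near-certain match: little to learn
    {"id": "C=D", "p": 0.50},  # maximally uncertain: ask about this first
    {"id": "E=F", "p": 0.10},  # near-certain non-match
]
ordered = [c["id"] for c in order_candidates(candidates)]
```

The dataspace-specific part of the paper is the utility function itself, which is grounded in query result quality rather than this flat per-match gain and cost.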

Øystein Thøgersen - One of the best experts on this subject based on the ideXlab platform.