Schema Information

The experts below are selected from a list of 41,250 experts worldwide, ranked by the ideXlab platform.

H V Jagadish - One of the best experts on this subject based on the ideXlab platform.

  • Scaling entity resolution: a loosely schema-aware approach
    2019
    Co-Authors: Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, H V Jagadish
    Abstract:

    In big data sources, real-world entities are typically represented with a variety of schemata and formats (e.g., relational records, JSON objects). Different profiles (i.e., representations) of an entity often contain redundant and/or inconsistent information. Thus, identifying which profiles refer to the same entity is a fundamental task (called Entity Resolution) to unleash the value of big data. The naive all-pairs comparison solution is impractical on large data, hence blocking methods are employed to partition a profile collection into (possibly overlapping) blocks and limit the comparisons to profiles that appear in the same block together. Meta-blocking is the task of restructuring a block collection, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features, under the assumption that handling the schema variety of big data does not pay off for such a task. In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract the loose schema information by adopting an LSH-based step for efficiently handling the volume and schema heterogeneity of the data. Furthermore, we introduce a novel meta-blocking algorithm that can be employed to efficiently execute Blast on MapReduce-like systems (such as Apache Spark). Finally, we experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art (meta-)blocking approaches.
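
The blocking and meta-blocking steps described in this abstract can be illustrated with a minimal sketch. It is not the Blast algorithm: it assumes plain schema-agnostic token blocking and a simple pruning rule (keep the pairs whose number of shared blocks is at least the average), and the profile data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for pid, attrs in profiles.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    # A block with a single profile induces no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks):
    """Weight each candidate pair by its number of shared blocks,
    then keep only pairs whose weight reaches the average weight."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    if not weights:
        return set()
    threshold = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= threshold}


# Hypothetical profiles with heterogeneous attribute names.
profiles = {
    "p1": {"name": "John Smith", "city": "Ann Arbor"},
    "p2": {"full_name": "Smith John", "location": "Ann Arbor"},
    "p3": {"name": "Ann Lee", "city": "Modena"},
}

blocks = token_blocking(profiles)
candidates = meta_blocking(blocks)
print(candidates)
```

In this toy input the token "ann" places all three profiles in one block, but only the pair (p1, p2) shares enough blocks to survive pruning, which is the kind of superfluous comparison meta-blocking is meant to remove.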

  • Blast: a loosely schema-aware meta-blocking approach for entity resolution
    2016
    Co-Authors: Giovanni Simonini, Sonia Bergamaschi, H V Jagadish
    Abstract:

    Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g., semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches. Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features. In this paper, we demonstrate how "loose" schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches and, in many cases, also the supervised one.
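
The abstract does not spell out the LSH-based step, so the following is only a sketch of one common LSH building block, MinHash, applied to the idea of comparing attributes by the values they contain. The attribute names, token sets, and the choice of MD5 as a seeded hash are assumptions made for illustration, not details taken from the paper.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature of a non-empty token set.
    Each seed simulates one hash function via a seeded MD5 digest."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{token}".encode()).digest()[:8], "big"
            )
            for token in tokens
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical value tokens observed under three attributes of two sources.
name_tokens = {"john", "smith", "ann", "lee"}
full_name_tokens = {"smith", "john", "lee", "ann"}
city_tokens = {"modena", "arbor", "detroit"}

sig_name = minhash_signature(name_tokens)
sig_full_name = minhash_signature(full_name_tokens)
sig_city = minhash_signature(city_tokens)

print(estimated_jaccard(sig_name, sig_full_name))
print(estimated_jaccard(sig_name, sig_city))
```

Attributes whose signatures agree on many positions are likely to hold similar values, which is one way statistics collected directly from the data could suggest that two differently named attributes correspond.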

Giovanni Simonini - One of the best experts on this subject based on the ideXlab platform.

  • Scaling entity resolution: a loosely schema-aware approach
    2019
    Co-Authors: Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, H V Jagadish

  • Blast: a loosely schema-aware meta-blocking approach for entity resolution
    2016
    Co-Authors: Giovanni Simonini, Sonia Bergamaschi, H V Jagadish

Sonia Bergamaschi - One of the best experts on this subject based on the ideXlab platform.

  • Scaling entity resolution: a loosely schema-aware approach
    2019
    Co-Authors: Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, H V Jagadish

  • Blast: a loosely schema-aware meta-blocking approach for entity resolution
    2016
    Co-Authors: Giovanni Simonini, Sonia Bergamaschi, H V Jagadish

Igor Tatarinov - One of the best experts on this subject based on the ideXlab platform.

  • Schema mediation in peer data management systems
    2003
    Co-Authors: Alon Halevy, Zachary G Ives, Dan Suciu, Igor Tatarinov
    Abstract:

    Intuitively, data management and data integration tools should be well-suited for exchanging information in a semantically meaningful way. Unfortunately, they suffer from two significant problems: they typically require a comprehensive schema design before they can be used to store or share information, and they are difficult to extend because schema evolution is heavyweight and may break backwards compatibility. As a result, many small-scale data sharing tasks are more easily facilitated by non-database-oriented tools that have little support for semantics. The goal of the peer data management system (PDMS) is to address this need: we propose the use of a decentralized, easily extensible data management architecture in which any user can contribute new data, schema information, or even mappings between other peers' schemas. PDMSs represent a natural step beyond data integration systems, replacing their single logical schema with an interlinked collection of semantic mappings between peers' individual schemas. We consider the problem of schema mediation in a PDMS. Our first contribution is a flexible language for mediating between peer schemas, which extends known data integration formalisms to our more complex architecture. We precisely characterize the complexity of query answering for our language. Next, we describe a reformulation algorithm for our language that generalizes both global-as-view and local-as-view query answering algorithms. Finally, we describe several methods for optimizing the reformulation algorithm, and an initial set of experiments studying its performance.
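
The idea of replacing a single logical schema with an interlinked collection of peer mappings can be sketched as follows. This is a heavy simplification, assuming that mappings are plain attribute renamings between hypothetical peers; the mediation language and reformulation algorithm of the paper are far more expressive and are not reproduced here.

```python
from collections import deque

# Hypothetical peers and attribute-level mappings (renamings only).
mappings = {
    ("PeerA", "PeerB"): {"course.title": "class.name"},
    ("PeerB", "PeerC"): {"class.name": "subject.label"},
}


def mapping_path(source, target):
    """Breadth-first search over the directed peer mappings.
    Returns the list of peers from source to target, or None if unreachable."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for (a, b) in mappings:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


def translate(attribute, source, target):
    """Rewrite an attribute name along a chain of peer mappings.
    Returns None if no chain exists or some mapping lacks the attribute."""
    path = mapping_path(source, target)
    if path is None:
        return None
    for a, b in zip(path, path[1:]):
        attribute = mappings[(a, b)].get(attribute)
        if attribute is None:
            return None
    return attribute


print(translate("course.title", "PeerA", "PeerC"))
```

No mapping is defined directly between PeerA and PeerC, yet the attribute can still be translated by following the chain through PeerB, which is the transitive use of mappings that distinguishes a PDMS from a system built around one global schema.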

Luca Gagliardelli - One of the best experts on this subject based on the ideXlab platform.

  • Scaling entity resolution: a loosely schema-aware approach
    2019
    Co-Authors: Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, H V Jagadish