Synthetic Datasets

The Experts below are selected from a list of 42,654 Experts worldwide, ranked by the ideXlab platform.

Jörg Drechsler - One of the best experts on this subject based on the ideXlab platform.

  • New data dissemination approaches in old Europe – Synthetic Datasets for a German establishment survey
    Journal of Applied Statistics, 2012
    Co-Authors: Jörg Drechsler
    Abstract:

    Disseminating microdata to the public that provide a high level of data utility while guaranteeing the confidentiality of survey respondents is a difficult task. Generating multiply imputed Synthetic Datasets is an innovative statistical disclosure limitation technique with the potential to enable the data-disseminating agency to achieve this twofold goal. So far, the approach has been successfully implemented only for a limited number of Datasets in the U.S. In this paper, we present the first successful implementation outside the U.S.: the generation of partially Synthetic Datasets for an establishment panel survey at the German Institute for Employment Research. We describe the whole evolution of the project, from the early discussions concerning variables at risk to the final synthesis. We also present our disclosure risk evaluations and provide first results on the data utility of the generated Datasets. A variance-inflated imputation model is introduced that incorporates addit...
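
As a rough, hedged illustration of the partially Synthetic approach the abstract above describes, the sketch below replaces the values of one sensitive variable with draws from a model fitted to the original records and repeats this m times. The variable names and the simple normal regression model are assumptions for illustration only, not the imputation models used for the IAB establishment panel.

```python
# Minimal sketch of partially synthetic data generation (illustrative only; the
# variable names and the simple normal regression model are assumptions, not
# the imputation models used for the IAB establishment panel).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy "original" survey: one sensitive variable (payroll) and two predictors.
n = 500
original = pd.DataFrame({
    "employees": rng.integers(1, 200, n),
    "industry": rng.integers(0, 5, n),
})
original["payroll"] = 30.0 * original["employees"] + 5.0 * original["industry"] + rng.normal(0, 50, n)

def synthesize(df, m=5):
    """Replace the sensitive column with model-based draws, m times."""
    X = np.column_stack([np.ones(len(df)), df["employees"], df["industry"]])
    y = df["payroll"].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = (y - X @ beta).std(ddof=X.shape[1])
    synthetic = []
    for _ in range(m):
        copy = df.copy()
        # Draw new values from the fitted model instead of releasing the originals.
        copy["payroll"] = X @ beta + rng.normal(0, sigma, len(df))
        synthetic.append(copy)
    return synthetic

releases = synthesize(original, m=5)
print(len(releases), releases[0].head())
```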

  • Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation
    2011
    Co-Authors: Jörg Drechsler
    Abstract:

    Contents: Introduction; Background on Multiply Imputed Synthetic Datasets; Background on Multiple Imputation; The IAB Establishment Panel; Multiple Imputation for Nonresponse; Fully Synthetic Datasets; Partially Synthetic Datasets; Multiple Imputation for Nonresponse and Statistical Disclosure Control; A Two-Stage Imputation Procedure to Balance the Risk-Utility Trade-Off; Chances and Obstacles for Multiply Imputed Synthetic Datasets.

  • Background on Multiply Imputed Synthetic Datasets
    Synthetic Datasets for Statistical Disclosure Control, 2011
    Co-Authors: Jörg Drechsler
    Abstract:

    In 1993, the Journal of Official Statistics published a special issue on data confidentiality. Two articles in this volume laid the foundation for the development of multiply imputed Synthetic Datasets (MISDs). In his discussion “Statistical Disclosure Limitation,” Rubin (1993) first suggested generating Synthetic Datasets based on his ideas of multiple imputation for missing values (Rubin, 1987). He proposed to treat all the observations from the sampling frame that are not part of the sample as missing data and to impute them according to the multiple imputation framework. Afterwards, simple random samples from these fully imputed Datasets should be released to the public. Because the released dataset does not contain any real data, disclosure of sensitive information is very difficult. On the other hand, if the imputation models are selected carefully and the predictive power of the models is high, most of the information contained in the original data will be preserved. This approach is now called generating fully Synthetic Datasets in the literature.
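
Estimates from the m released Datasets are then combined with rules analogous to Rubin's multiple-imputation rules. The sketch below computes the usual point estimate and variance components; the specific variance formulas (for partially vs. fully Synthetic data) follow the standard multiple-imputation-based literature and are included here as an illustration, not quoted from this chapter.

```python
# Sketch of the usual combining rules for estimates from m synthetic datasets.
# The formulas (q_bar, between-variance b, within-variance v_bar, and the two
# variance estimators) follow the standard multiple-imputation-based synthetic
# data literature; they are included for illustration, not quoted from the book.
import numpy as np

def combine(point_estimates, within_variances, partially_synthetic=True):
    q = np.asarray(point_estimates, dtype=float)   # q_l from each synthetic dataset
    v = np.asarray(within_variances, dtype=float)  # estimated variance of each q_l
    m = len(q)
    q_bar = q.mean()
    b = q.var(ddof=1)        # between-imputation variance
    v_bar = v.mean()         # average within-imputation variance
    if partially_synthetic:
        T = v_bar + b / m                        # partially synthetic datasets
    else:
        T = max((1 + 1 / m) * b - v_bar, 0.0)    # fully synthetic datasets (floored at 0)
    return q_bar, T

# Example: a mean estimated on m = 5 synthetic datasets.
q_bar, T = combine([10.2, 9.8, 10.5, 10.1, 9.9], [0.04, 0.05, 0.04, 0.05, 0.04])
print(q_bar, T)
```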

  • Fully Synthetic Datasets
    Synthetic Datasets for Statistical Disclosure Control, 2011
    Co-Authors: Jörg Drechsler
    Abstract:

    In 1993, Rubin suggested creating fully Synthetic Datasets based on the multiple imputation framework. His idea was to treat all units in the population that have not been selected in the sample as missing data, impute them according to the multiple imputation approach, and draw simple random samples from these imputed populations for release to the public. Most surveys are conducted using complex sampling designs. Releasing simple random samples simplifies research for the potential user of the data since the design doesn’t have to be incorporated in the model. It is not necessary, however, to release simple random samples.
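
A minimal sketch of this procedure, under strong simplifying assumptions (a single design variable, a normal regression model, and toy frame and sample sizes), might look as follows; a full implementation would also draw new model parameters for each synthesis.

```python
# Illustrative sketch of the fully synthetic approach summarized above: treat
# the non-sampled units of the frame as missing, impute them from a model
# fitted on the observed sample, and release simple random samples from the
# completed populations. The normal model and all sizes are assumptions.
import numpy as np

rng = np.random.default_rng(1)

N, n, m = 10_000, 400, 3           # frame size, sample size, number of syntheses
frame_x = rng.uniform(0, 10, N)    # design variable known for the whole frame
sample_idx = rng.choice(N, size=n, replace=False)
y_obs = 2.0 * frame_x[sample_idx] + rng.normal(0, 1, n)   # outcome observed only in the sample

# Fit a simple regression on the observed sample (a real implementation would
# redraw the model parameters for each synthesis).
A = np.column_stack([np.ones(n), frame_x[sample_idx]])
beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
sigma = (y_obs - A @ beta).std(ddof=2)
nonsampled = np.setdiff1d(np.arange(N), sample_idx)

releases = []
for _ in range(m):
    y_full = np.empty(N)
    y_full[sample_idx] = y_obs                 # sampled units keep their observed values
    y_full[nonsampled] = (beta[0] + beta[1] * frame_x[nonsampled]
                          + rng.normal(0, sigma, nonsampled.size))
    # Release a simple random sample from the completed population.
    release_idx = rng.choice(N, size=n, replace=False)
    releases.append(np.column_stack([frame_x[release_idx], y_full[release_idx]]))

print(len(releases), releases[0].shape)
```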

Bianchi Serique Meiguins - One of the best experts on this subject based on the ideXlab platform.

  • Synthetic Datasets generator for testing information visualization and machine learning techniques and tools
    IEEE Access, 2020
    Co-Authors: Sandro De Paula Mendonca, Yvan Pereira Dos Santos Brito, Carlos Gustavo Resque Dos Santos, Rodrigo Santos Do Amor Divino Lima, Tiago Araujo, Bianchi Serique Meiguins
    Abstract:

    Data generators are applications that produce Synthetic Datasets, which are useful for testing data analytics applications such as machine learning algorithms and information visualization techniques. Each data generator application has a different approach to generating data. Consequently, each one has functionality gaps that make it unsuitable for some tasks (e.g., lack of ways to create outliers and non-random noise). This paper presents a data generator application that aims to fill relevant gaps scattered across other applications, providing a flexible tool to assist researchers in exhaustively testing their techniques in more diverse ways. The proposed system allows users to define and compose known statistical distributions to produce the desired outcome, visualizing the behavior of the data in real time to analyze whether it has the characteristics needed for efficient testing. This paper presents the tool's functionalities in detail, describes how to create Datasets, and provides a usage scenario to illustrate the process of data creation.
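
In the same spirit, a toy distribution-composing generator can be sketched as below. The column specification format, the injected outliers, and the systematic drift used as "non-random noise" are all invented for illustration; they are not the interface of the tool described in the abstract.

```python
# Minimal sketch of a distribution-composing data generator in the spirit of
# the tool described above (the column specification format and parameters are
# invented for illustration; they are not the tool's actual interface).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

spec = {
    "age":    ("normal",    {"loc": 40, "scale": 12}),
    "income": ("lognormal", {"mean": 10, "sigma": 0.5}),
    "visits": ("poisson",   {"lam": 3}),
}

def generate(spec, n=1000, outlier_frac=0.01, noise_scale=0.05):
    # Compose columns from named numpy distributions.
    df = pd.DataFrame({name: getattr(rng, dist)(size=n, **params)
                       for name, (dist, params) in spec.items()})
    # Inject controlled outliers: inflate a small random subset of one column.
    idx = rng.choice(n, size=max(1, int(outlier_frac * n)), replace=False)
    df.loc[idx, "income"] *= 10
    # Add non-random (systematic) noise: a drift that grows with the row index.
    df["age"] += noise_scale * np.arange(n)
    return df

data = generate(spec, n=1000)
print(data.describe())
```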

  • A Prototype Application to Generate Synthetic Datasets for Information Visualization Evaluations
    2018 22nd International Conference Information Visualisation (IV), 2018
    Co-Authors: Yvan Pereira Dos Santos Brito, Sandro De Paula Mendonca, Carlos Gustavo Resque Dos Santos, Tiago Araujo, Alexandre Abreu De Freitas, Bianchi Serique Meiguins
    Abstract:

    Evaluation is an essential step in works that propose new information visualization techniques or tools. A common type is the controlled experiment, in which researchers measure user performance on specific tasks carried out with the proposed method. Furthermore, the Datasets used for these tests must contain known features to be evaluated (e.g., level of noise, percentage of missing values) in a controlled way. Thus, this article proposes an application to generate Synthetic databases for evaluating information visualization techniques and tools. The system aims to create a dataset generator model that allows the construction of Datasets with a diversity of profiles in a controlled manner. The creator of the model can save it for future experiments or updates and can export it, enabling other groups to replicate the experiments easily. To better explain the application's features and how to use it, this work also presents two usage scenarios, explaining the model created for each situation.
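
A hedged sketch of the general idea, an exportable dataset "profile" with a controlled noise level and a controlled percentage of missing values, is shown below; the JSON schema and field names are assumptions, not the prototype's actual format.

```python
# Sketch of a saveable dataset "profile" with a controlled noise level and a
# controlled percentage of missing values, in the spirit of the prototype
# described above. The JSON schema and field names are assumptions.
import json
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

profile = {
    "n_rows": 500,
    "columns": {"x": [0.0, 1.0], "y": [10.0, 2.0]},  # mean, std per column
    "noise_level": 0.1,         # std of additive Gaussian noise
    "missing_fraction": 0.05,   # share of cells blanked out at random
}

def generate_from_profile(profile):
    n = profile["n_rows"]
    df = pd.DataFrame({c: rng.normal(mu, sd, n) for c, (mu, sd) in profile["columns"].items()})
    df += rng.normal(0, profile["noise_level"], df.shape)       # controlled noise
    mask = rng.random(df.shape) < profile["missing_fraction"]   # controlled missingness
    return df.mask(mask)

# The profile can be exported so other groups can replicate the experiment.
with open("profile.json", "w") as f:
    json.dump(profile, f, indent=2)

data = generate_from_profile(profile)
print(data.isna().mean())
```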

Emiliano A. Valdez - One of the best experts on this subject based on the ideXlab platform.

  • Nested Stochastic Valuation of Large Variable Annuity Portfolios: Monte Carlo Simulation and Synthetic Datasets
    2018
    Co-Authors: Guojun Gan, Emiliano A. Valdez
    Abstract:

    Dynamic hedging has been adopted by many insurance companies to mitigate the financial risks associated with variable annuity guarantees. To simulate the performance of dynamic hedging for variable annuity products, insurance companies rely on nested stochastic projections, which are highly computationally intensive and often prohibitive for large variable annuity portfolios. Metamodeling techniques have recently been proposed to address the computational issues. However, it is difficult for researchers to obtain real Datasets from insurance companies to test metamodeling techniques and publish the results in academic journals. In this paper, we create Synthetic Datasets that can be used to address the computational issues associated with the nested stochastic valuation of large variable annuity portfolios. Creating these Synthetic Datasets would require about three years of runtime on a single CPU. These Datasets are readily available to researchers and practitioners so that they can focus on testing metamodeling techniques.
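
To see why nested stochastic projection is so expensive, the toy sketch below revalues a single guarantee with an inner set of risk-neutral paths at every node of every outer real-world scenario. The GBM dynamics, the GMMB-style payoff, and all parameters are simplifying assumptions, not the simulation engine used to build the Synthetic Datasets.

```python
# Toy sketch of a nested stochastic projection: for each outer (real-world)
# scenario and time step, the guarantee liability is revalued with an inner set
# of (risk-neutral) paths. The GBM dynamics, guarantee payoff, and parameters
# are simplifying assumptions meant only to show why the computation explodes.
import numpy as np

rng = np.random.default_rng(3)

n_outer, n_inner, n_steps = 100, 500, 12   # already 100 * 12 inner valuations of 500 paths each
s0, mu, r, sigma, dt = 100.0, 0.05, 0.02, 0.2, 1.0 / 12
guarantee = 100.0                           # GMMB-style guaranteed amount at the horizon

def inner_value(s_t, steps_left):
    """Risk-neutral Monte Carlo value of the guarantee shortfall at maturity."""
    if steps_left == 0:
        return max(guarantee - s_t, 0.0)
    z = rng.normal(size=(n_inner, steps_left))
    paths = s_t * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1))
    payoff = np.maximum(guarantee - paths[:, -1], 0.0)
    return np.exp(-r * steps_left * dt) * payoff.mean()

liabilities = np.zeros((n_outer, n_steps))
for i in range(n_outer):
    s = s0
    for t in range(n_steps):
        # One real-world step of the account value, then one full inner valuation.
        s *= np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.normal())
        liabilities[i, t] = inner_value(s, n_steps - t - 1)

print(liabilities.mean())
```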

  • Valuation of Large Variable Annuity Portfolios: Monte Carlo Simulation and Synthetic Datasets
    Dependence Modeling, 2017
    Co-Authors: Guojun Gan, Emiliano A. Valdez
    Abstract:

    Metamodeling techniques have recently been proposed to address the computational issues related to the valuation of large portfolios of variable annuity contracts. However, it is extremely difficult, if not impossible, for researchers to obtain real Datasets from insurance companies in order to test their metamodeling techniques on such real Datasets and publish the results in academic journals. To facilitate the development and dissemination of research related to the efficient valuation of large variable annuity portfolios, this paper creates a large Synthetic portfolio of variable annuity contracts based on the properties of real portfolios of variable annuities and implements a simple Monte Carlo simulation engine for valuing the Synthetic portfolio. In addition, this paper presents fair market values and Greeks for the Synthetic portfolio of variable annuity contracts, which are important quantities for managing the financial risks associated with variable annuities. The resulting Datasets can be used by researchers to test and compare the performance of various metamodeling techniques.
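
For a single toy contract, a fair market value and a delta can be estimated by bump-and-revalue Monte Carlo, as sketched below; the payoff and parameters are placeholders rather than the contract features of the Synthetic portfolio.

```python
# Sketch of computing a fair market value and a delta for one toy contract by
# bump-and-revalue Monte Carlo. The payoff and parameters are placeholders, not
# the synthetic portfolio's actual contract features.
import numpy as np

rng = np.random.default_rng(11)

r, sigma, T, guarantee, n_paths = 0.02, 0.2, 10.0, 100.0, 100_000

def fair_value(s0, z):
    """Risk-neutral value of a GMMB-style shortfall guarantee at maturity."""
    s_T = s0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(guarantee - s_T, 0.0).mean()

z = rng.normal(size=n_paths)       # common random numbers for a stable delta estimate
s0, h = 100.0, 1.0
v = fair_value(s0, z)
delta = (fair_value(s0 + h, z) - fair_value(s0 - h, z)) / (2 * h)   # central difference
print(v, delta)
```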

Sanja Fidler - One of the best experts on this subject based on the ideXlab platform.

  • Meta-Sim: Learning to Generate Synthetic Datasets
    2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
    Co-Authors: Amlan Kar, Aayush Prakash, Ming-yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, Sanja Fidler
    Abstract:

    Training models to high-end performance requires the availability of large labeled Datasets, which are expensive to obtain. The goal of our work is to automatically synthesize labeled Datasets that are relevant for a downstream task. We propose Meta-Sim, which learns a generative model of Synthetic scenes and obtains images, together with their corresponding ground truth, via a graphics engine. We parametrize our dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars so as to minimize the distribution gap between its rendered outputs and target data. If the real dataset comes with a small labeled validation set, we additionally aim to optimize a meta-objective, i.e., downstream task performance. Experiments show that the proposed method can greatly improve content generation quality over a human-engineered probabilistic scene grammar, both qualitatively and quantitatively as measured by performance on a downstream task.
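
A highly simplified sketch of the idea, a learned transformation that nudges scene-graph attributes sampled from a grammar so that "rendered" statistics move closer to the target data, is given below. The linear attribute modifier, the stand-in renderer, the moment-matching objective, and the finite-difference updates are all assumptions for illustration; they are not the Meta-Sim architecture or training procedure.

```python
# Simplified sketch only: a learned offset on grammar-sampled attributes is
# tuned so that "rendered" statistics move toward the target data. The linear
# modifier, the stand-in renderer, the moment-matching objective, and the
# finite-difference updates are assumptions, not Meta-Sim's actual method.
import numpy as np

rng = np.random.default_rng(5)

def sample_attributes(n):
    # Stand-in for attributes (e.g., object positions/scales) sampled from a scene grammar.
    return rng.normal(0.0, 1.0, size=(n, 3))

def render_features(attrs):
    # Stand-in for the graphics engine plus a fixed feature extractor.
    return np.tanh(attrs @ np.array([1.0, 0.5, -0.3]))

target_mean = render_features(rng.normal(1.5, 0.8, size=(2000, 3))).mean()  # "real" data statistic

theta = np.zeros(3)  # learnable per-attribute offsets (stand-in for the neural modifier)

def gap(theta):
    attrs = sample_attributes(1000)
    # Moment matching as a crude proxy for the distribution gap between rendered and real data.
    return (render_features(attrs + theta).mean() - target_mean) ** 2

# Finite-difference descent as a stand-in for gradient-based training.
for _ in range(200):
    grad = np.array([(gap(theta + 0.01 * e) - gap(theta - 0.01 * e)) / 0.02 for e in np.eye(3)])
    theta -= 0.5 * grad

print(theta, gap(theta))
```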

David Bull - One of the best experts on this subject based on the ideXlab platform.

  • A Deep Learning Approach to Detecting Volcano Deformation from Satellite Imagery Using Synthetic Datasets
    Remote Sensing of Environment, 2019
    Co-Authors: Nantheera Anantrasirichai, Juliet Biggs, F Albino, David Bull
    Abstract:

    Satellites enable widespread, regional or global surveillance of volcanoes and can provide the first indication of volcanic unrest or eruption. Here we consider Interferometric Synthetic Aperture Radar (InSAR), which can be employed to detect surface deformation with a strong statistical link to eruption. Recent developments in technology as well as improved computational power have resulted in unprecedented quantities of monitoring data, which can no longer be inspected manually. The ability of machine learning to automatically identify signals of interest in these large InSAR Datasets has already been demonstrated, but data-driven techniques, such as convolutional neural networks (CNNs), require balanced training Datasets of positive and negative signals to effectively differentiate between real deformation and noise. As only a small proportion of volcanoes are deforming and atmospheric noise is ubiquitous, the use of machine learning for detecting volcanic unrest is more challenging than many other applications. In this paper, we address this problem using Synthetic interferograms to train the AlexNet CNN. The Synthetic interferograms are composed of three parts: 1) deformation patterns based on a Monte Carlo selection of parameters for analytic forward models, 2) stratified atmospheric effects derived from weather models, and 3) turbulent atmospheric effects based on statistical simulations of correlated noise. The AlexNet architecture trained with Synthetic data outperforms that trained using real interferograms alone, based on classification accuracy and positive predictive value (PPV). However, the models used to generate the Synthetic signals are a simplification of the natural processes, so we retrain the CNN with a combined dataset consisting of Synthetic models and selected real examples, achieving a final PPV of 82%. Although applying atmospheric corrections to the entire dataset is computationally expensive, it is relatively simple to apply them to the small subset of positive results. This further improves the detection performance without a significant increase in computational burden (PPV of 100%). Thus, we demonstrate that training with Synthetic examples can improve the ability of CNNs to detect volcano deformation in satellite images, and we propose an efficient workflow for the development of automated systems.
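
A toy composition of such a Synthetic interferogram, summing a forward-model deformation pattern, a stratified atmospheric term, and turbulent (spatially correlated) atmospheric noise, is sketched below. The Gaussian subsidence bowl, the linear delay-versus-height term, and the Gaussian-filtered noise are illustrative assumptions, not the paper's exact simulation setup.

```python
# Sketch of composing a synthetic interferogram from the three components the
# abstract lists: a forward-model deformation pattern, a stratified atmospheric
# term, and turbulent (spatially correlated) atmospheric noise. The Gaussian
# subsidence bowl, linear topography term, and filtered noise are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(2)
size = 256
y, x = np.mgrid[:size, :size]

# 1) Deformation: a smooth radially symmetric subsidence bowl (toy forward model).
r2 = (x - size / 2) ** 2 + (y - size / 2) ** 2
deformation = -0.05 * np.exp(-r2 / (2 * 30.0 ** 2))            # metres of line-of-sight change

# 2) Stratified atmosphere: a delay that scales with a synthetic topography.
topography = gaussian_filter(rng.normal(0, 1, (size, size)), 20) * 500 + 1000  # metres
stratified = -1e-5 * topography                                 # crude linear delay/height term

# 3) Turbulent atmosphere: spatially correlated noise from filtered white noise.
turbulent = gaussian_filter(rng.normal(0, 1, (size, size)), 10) * 0.01

los = deformation + stratified + turbulent                      # combined line-of-sight signal
wavelength = 0.055                                              # roughly C-band radar, in metres
phase = np.angle(np.exp(1j * (4 * np.pi / wavelength) * los))   # wrapped interferometric phase
print(phase.shape, phase.min(), phase.max())
```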
