Data Mining Task

14,000,000 Leading Edge Experts on the ideXlab platform

The experts below are selected from a list of 29325 experts worldwide ranked by the ideXlab platform

Mehdi Kaytoue - One of the best experts on this subject based on the ideXlab platform.

  • Anytime discovery of a diverse set of patterns with Monte Carlo tree search
    Data Mining and Knowledge Discovery, 2018
    Co-Authors: Guillaume Bosc, Chedy Raïssi, Jean-François Boulicaut, Mehdi Kaytoue
    Abstract:

    The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that make it possible to elicit such interesting patterns from labeled data. A question remains fairly open: how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches use beam search, sampling, and genetic algorithms to discover a pattern set that is non-redundant and of high quality with respect to a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns that are both of high quality and different enough are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations, which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.

  • Anytime Discovery of a Diverse Set of Patterns with Monte Carlo Tree Search
    2016
    Co-Authors: Guillaume Bosc, Chedy Raïssi, Jean-François Boulicaut, Mehdi Kaytoue
    Abstract:

    Discovering patterns that strongly distinguish one class label from another is a challenging data mining task. The unsupervised discovery of such patterns would enable the construction of intelligible classifiers and the elicitation of interesting hypotheses from the data. Subgroup Discovery (SD) is one framework that formally defines this pattern mining task. However, SD still faces two major issues: (i) how to define appropriate quality measures to characterize the uniqueness of a pattern; (ii) how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is unfeasible. The first issue has been tackled by the Exceptional Model Mining (EMM) framework. This general framework aims to find patterns that cover tuples that locally induce a model that substantially differs from the model of the whole dataset. The second issue has been studied in SD and EMM mainly with the use of beam-search strategies and genetic algorithms for discovering a pattern set that is non-redundant, diverse and of high quality. In this article, we argue that the greedy nature of most of these approaches produces pattern sets that lack diversity. Consequently, we propose to formally define pattern mining as a single-player game, as in a puzzle, and to solve it with Monte Carlo Tree Search (MCTS), a recent technique mainly used for artificial intelligence and planning problems. The exploitation/exploration trade-off and the power of random search of MCTS lead to an any-time mining approach which tends towards an exhaustive search if given enough time and memory. Given a reasonable time and memory budget, MCTS quickly drives the search towards a diverse pattern set of high quality. MCTS does not need any knowledge of the pattern quality measure, and we show to what extent it is agnostic to the pattern language. We assess our claims with an exhaustive set of experiments.

  • Biclustering meets triadic concept analysis
    Annals of Mathematics and Artificial Intelligence, 2014
    Co-Authors: Mehdi Kaytoue, Juraj Macko, Sergei O. Kuznetsov, Amedeo Napoli
    Abstract:

    Biclustering numerical data became a popular data mining task at the beginning of the 2000s, especially for gene expression data analysis and recommender systems. A bicluster reflects a strong association between a subset of objects and a subset of attributes in a numerical object/attribute data table. So-called biclusters of similar values can be thought of as maximal sub-tables with close values. Only a few methods address a complete, correct and non-redundant enumeration of such patterns, a well-known intractable problem, while no formal framework exists. We introduce important links between biclustering and Formal Concept Analysis (FCA). Indeed, FCA is known to be, among others, a methodology for biclustering binary data. Handling numerical data is not direct, and we argue that Triadic Concept Analysis (TCA), the extension of FCA to ternary relations, provides a powerful mathematical and algorithmic framework for biclustering numerical data. We hence discuss both theoretical and computational aspects of biclustering numerical data with triadic concept analysis. These results also scale to n-dimensional numerical datasets.

  • Mining Biclusters of Similar Values with Triadic Concept Analysis
    2011
    Co-Authors: Mehdi Kaytoue, Juraj Macko, Sergei O. Kuznetsov, Wagner Meira, Amedeo Napoli
    Abstract:

    Biclustering numerical data became a popular data mining task at the beginning of the 2000s, especially for analysing gene expression data. A bicluster reflects a strong association between a subset of objects and a subset of attributes in a numerical object/attribute data table. So-called biclusters of similar values can be thought of as maximal sub-tables with close values. Only a few methods address a complete, correct and non-redundant enumeration of such patterns, which is a well-known intractable problem, while no formal framework exists. In this paper, we introduce important links between biclustering and formal concept analysis. More specifically, we show for the first time that Triadic Concept Analysis (TCA) provides a nice mathematical framework for biclustering. Interestingly, existing TCA algorithms, which usually apply to binary data, can be used (directly or with slight modifications) after a preprocessing step for extracting maximal biclusters of similar values.
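
To make the notion of a "bicluster of similar values" from the two biclustering abstracts above more concrete, here is a minimal brute-force sketch. It assumes that "close values" means the largest and smallest value in the sub-table differ by at most a tolerance `theta`; that tolerance, the function names and the toy table are illustrative assumptions and are not taken from the papers. The sketch does not use Triadic Concept Analysis; it only checks the definition on a given sub-table.

```python
from typing import Sequence, Set

Table = Sequence[Sequence[float]]


def has_similar_values(table: Table, rows: Set[int], cols: Set[int], theta: float) -> bool:
    """True if all values of the sub-table rows x cols lie within theta of each other."""
    values = [table[r][c] for r in rows for c in cols]
    return bool(values) and max(values) - min(values) <= theta


def is_maximal(table: Table, rows: Set[int], cols: Set[int], theta: float) -> bool:
    """True if the sub-table has similar values and no row or column can be added.

    Checking single-row and single-column extensions is enough: any larger
    valid sub-table would also make each of those smaller extensions valid.
    """
    if not has_similar_values(table, rows, cols, theta):
        return False
    for r in range(len(table)):
        if r not in rows and has_similar_values(table, rows | {r}, cols, theta):
            return False
    for c in range(len(table[0])):
        if c not in cols and has_similar_values(table, rows, cols | {c}, theta):
            return False
    return True


# Hypothetical 4 x 5 numerical object/attribute table (objects are rows).
TOY_TABLE = [
    [1, 2, 2, 1, 6],
    [2, 1, 1, 0, 6],
    [2, 2, 1, 7, 6],
    [8, 9, 2, 6, 7],
]

if __name__ == "__main__":
    print(is_maximal(TOY_TABLE, {0, 1, 2}, {0, 1, 2}, theta=1))
    print(is_maximal(TOY_TABLE, {0, 1}, {0, 1, 2}, theta=1))
```

With `theta=1`, the sub-table on rows {0, 1, 2} and columns {0, 1, 2} is maximal, while the one on rows {0, 1} is not, because row 2 can still be added. Enumerating all maximal biclusters this way would be exponential, which is consistent with the abstracts describing complete enumeration as intractable.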

Raj P. Gopalan - One of the best experts on this subject based on the ideXlab platform.

  • Australian Conference on Artificial Intelligence - Clustering transactional data streams
    Lecture Notes in Computer Science, 2006
    Co-Authors: Yanrong Li, Raj P. Gopalan
    Abstract:

    The challenge of mining data streams is threefold. Firstly, an algorithm for a particular data mining task is subject to the sequential one-pass constraint; secondly, it must work under bounded resources such as memory and disk space; thirdly, it should be able to answer time-sensitive queries. Dealing with transactional data streams is even more challenging due to their high dimensionality and sparseness. In this paper, algorithms for clustering transactional data streams are proposed by incorporating the incremental clustering algorithm INCLUS into the equal-width time window model and the elastic time window model. These algorithms can efficiently cluster a transactional data stream in one pass and answer time-sensitive queries at different granularities with limited resources.
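
The three constraints named in the abstract above (one pass, bounded memory, time-sensitive queries) can be illustrated with a small equal-width time window sketch. This is only an assumption-laden illustration: the abstract does not describe INCLUS, so each window here keeps plain item counts as a stand-in summary instead of clusters, and the class name, parameters and toy stream are hypothetical. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, OrderedDict
from typing import Iterable


class EqualWidthWindows:
    """One-pass summaries of a transaction stream, one summary per time window."""

    def __init__(self, width: float, max_windows: int):
        if max_windows < 1:
            raise ValueError("max_windows must be at least 1")
        self.width = width              # length of each time window
        self.max_windows = max_windows  # memory budget: windows kept at most
        self.windows: "OrderedDict[int, Counter]" = OrderedDict()

    def add(self, timestamp: float, transaction: Iterable[str]) -> None:
        """Process one transaction; it is read once and never stored itself."""
        index = int(timestamp // self.width)
        if index not in self.windows:
            self.windows[index] = Counter()
            # Bounded memory: drop the oldest windows when over budget.
            while len(self.windows) > self.max_windows:
                self.windows.popitem(last=False)
        self.windows[index].update(transaction)

    def query(self, first: int, last: int) -> Counter:
        """Merge the summaries of the retained windows first..last (inclusive)."""
        total = Counter()
        for index, summary in self.windows.items():
            if first <= index <= last:
                total.update(summary)
        return total


# Hypothetical stream of (timestamp, set of items) pairs.
stream = [(0.5, {"a", "b"}), (1.2, {"a"}), (2.7, {"b", "c"}), (3.1, {"a", "c"})]
demo = EqualWidthWindows(width=1.0, max_windows=3)
for ts, items in stream:
    demo.add(ts, items)

if __name__ == "__main__":
    print(demo.query(2, 3))
```

Querying a range of window indices gives coarser or finer time granularity from the same per-window summaries. In this toy run the window with index 0 has already been evicted, so a query over windows 0 to 1 only reflects window 1.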

Sergio Peignier - One of the best experts on this subject based on the ideXlab platform.

  • Evolutionary Subspace Clustering Using Variable Genome Length
    Computational Intelligence, 2020
    Co-Authors: Sergio Peignier, Christophe Rigotti, Guillaume Beslon
    Abstract:

    Subspace clustering is a data mining task that groups similar data objects and at the same time searches for the subspaces where similarities appear. For this reason, subspace clustering is recognized as more general and complicated than standard clustering. In this article, we present ChameleoClust+, a bioinspired evolutionary subspace clustering algorithm that takes advantage of an evolvable genome structure to detect various numbers of clusters located in different subspaces. ChameleoClust+ incorporates several biolike features such as a variable genome length, both functional and nonfunctional elements, and mutation operators including large rearrangements. It was assessed and compared with state-of-the-art methods on a reference benchmark using both real-world and synthetic data sets. Although other algorithms may need complex parameter settings, ChameleoClust+ requires setting only one ad hoc and intuitive subspace clustering parameter: the maximal number of clusters. The remaining parameters of ChameleoClust+ are related to the evolution strategy (e.g., population size, mutation rate), and a single setting for all of them turned out to be effective for all the benchmark data sets. A sensitivity analysis has also been carried out to study the impact of each parameter on the subspace clustering quality.

  • Subspace clustering on static Datasets and dynamic Data streams using bio-inspired algorithms
    2017
    Co-Authors: Sergio Peignier
    Abstract:

    An important task that has been investigated in the context of high-dimensional data is subspace clustering. This data mining task is recognized as more general and complicated than standard clustering, since it aims to detect groups of similar objects, called clusters, and at the same time to find the subspaces where these similarities appear. Furthermore, subspace clustering approaches as well as traditional clustering ones have recently been extended to deal with data streams by updating clustering models in an incremental way. The different algorithms that have been proposed in the literature rely on very different algorithmic foundations. Among these approaches, evolutionary algorithms have been under-explored, even though these techniques have proven valuable in addressing other NP-hard problems. The aim of this thesis was to take advantage of new knowledge from evolutionary biology in order to conceive evolutionary subspace clustering algorithms for static datasets and dynamic data streams. Chameleoclust, the first algorithm developed in this work, takes advantage of the large degree of freedom provided by bio-like features such as a variable genome length, the existence of functional and non-functional elements and mutation operators including chromosomal rearrangements. KymeroClust, our second algorithm, is a k-medians based approach that relies on the duplication and the divergence of genes, a cornerstone evolutionary mechanism. SubMorphoStream, the last one, tackles the subspace clustering task over dynamic data streams. It relies on two important mechanisms that favor fast adaptation of bacteria to changing environments, namely gene amplification and foreign genetic material uptake. All these algorithms were compared to the main state-of-the-art techniques, obtaining competitive results. Results suggest that these algorithms are useful complementary tools in the analyst's toolbox.
    In addition, two applications called EvoWave and EvoMove have been developed to assess the capacity of these algorithms to address real-world problems. EvoWave is an application that handles the analysis of Wi-Fi signals to detect different contexts. EvoMove, the second one, is a musical companion that produces sounds based on the clustering of dancer moves captured using motion sensors.

  • EvoEvo Deliverable 5.1
    2016
    Co-Authors: Guillaume Beslon, Sergio Peignier, Jonas Abernot, Christophe Rigotti
    Abstract:

    Subspace clustering is a data mining task that searches for objects that share similar features and at the same time looks for the subspaces where these similarities appear. For this reason, subspace clustering is recognized as more general and complicated than standard clustering, since the latter only requires detecting groups of similar objects, or clusters. In this report we present ChameleoClust+, an evolutionary algorithm to tackle the subspace clustering problem. ChameleoClust+ is a bio-inspired algorithm implementing an evolvable genome structure, including several bio-like features such as a variable genome length, both functional and non-functional elements and mutation operators including chromosomal rearrangements. The main purpose of the design of ChameleoClust+ is to take advantage of the large degree of freedom provided by its evolvable structure to detect various numbers of clusters in subspaces of various dimensions. This algorithm was assessed and compared to state-of-the-art methods, with satisfying results, on a reference benchmark using both real-world and synthetic datasets. While other algorithms may need more complex parameter settings, ChameleoClust+ requires setting only one ad hoc subspace clustering parameter: the maximal number of clusters. This single parameter sets the maximal level of detail of the subspace clustering and is quite intuitive. The remaining parameters of ChameleoClust+ are related to the evolution strategy (population size, mutation rate, ...), and it is possible to use a single setting for them that turns out to be effective enough for all the benchmark datasets. A sensitivity analysis has also been carried out to study the impact of each parameter on the subspace clustering quality. This report also presents Evowave, an application of ChameleoClust+ to analyze a real dynamic stream.
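
The subspace clustering abstracts above describe clusters as groups of similar objects together with the subspace where the similarity appears. The sketch below illustrates only that idea and is not ChameleoClust+, KymeroClust or SubMorphoStream: given a group of objects, it keeps the dimensions on which the group is tight and measures distance on those dimensions alone. The spread threshold, the function names and the toy points are assumptions made for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def relevant_dimensions(points: Sequence[Point], max_spread: float) -> List[int]:
    """Dimensions on which the group's values span at most max_spread."""
    dims = []
    for d in range(len(points[0])):
        column = [p[d] for p in points]
        if max(column) - min(column) <= max_spread:
            dims.append(d)
    return dims


def subspace_distance(a: Point, b: Point, dims: Sequence[int]) -> float:
    """Manhattan distance restricted to the given dimensions."""
    return sum(abs(a[d] - b[d]) for d in dims)


# Hypothetical group: close on dimensions 0 and 2, scattered on dimension 1.
group = [(1.0, 5.0, 0.2), (1.1, 9.0, 0.1), (0.9, 1.0, 0.3)]
subspace = relevant_dimensions(group, max_spread=0.5)
inside = subspace_distance(group[0], group[1], subspace)
full = subspace_distance(group[0], group[1], range(3))

if __name__ == "__main__":
    print(subspace, inside, full)
```

On this toy group the relevant subspace is dimensions 0 and 2. The first two objects are near each other inside that subspace but far apart in the full space, which is why a standard clustering on all dimensions could miss the group.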

Yanrong Li - One of the best experts on this subject based on the ideXlab platform.

  • Australian Conference on Artificial Intelligence - Clustering transactional data streams
    Lecture Notes in Computer Science, 2006
    Co-Authors: Yanrong Li, Raj P. Gopalan
    Abstract:

    The challenge of mining data streams is threefold. Firstly, an algorithm for a particular data mining task is subject to the sequential one-pass constraint; secondly, it must work under bounded resources such as memory and disk space; thirdly, it should be able to answer time-sensitive queries. Dealing with transactional data streams is even more challenging due to their high dimensionality and sparseness. In this paper, algorithms for clustering transactional data streams are proposed by incorporating the incremental clustering algorithm INCLUS into the equal-width time window model and the elastic time window model. These algorithms can efficiently cluster a transactional data stream in one pass and answer time-sensitive queries at different granularities with limited resources.

Guillaume Beslon - One of the best experts on this subject based on the ideXlab platform.

  • Evolutionary Subspace Clustering Using Variable Genome Length
    Computational Intelligence, 2020
    Co-Authors: Sergio Peignier, Christophe Rigotti, Guillaume Beslon
    Abstract:

    Subspace clustering is a data mining task that groups similar data objects and at the same time searches for the subspaces where similarities appear. For this reason, subspace clustering is recognized as more general and complicated than standard clustering. In this article, we present ChameleoClust+, a bioinspired evolutionary subspace clustering algorithm that takes advantage of an evolvable genome structure to detect various numbers of clusters located in different subspaces. ChameleoClust+ incorporates several biolike features such as a variable genome length, both functional and nonfunctional elements, and mutation operators including large rearrangements. It was assessed and compared with state-of-the-art methods on a reference benchmark using both real-world and synthetic data sets. Although other algorithms may need complex parameter settings, ChameleoClust+ requires setting only one ad hoc and intuitive subspace clustering parameter: the maximal number of clusters. The remaining parameters of ChameleoClust+ are related to the evolution strategy (e.g., population size, mutation rate), and a single setting for all of them turned out to be effective for all the benchmark data sets. A sensitivity analysis has also been carried out to study the impact of each parameter on the subspace clustering quality.

  • EvoEvo Deliverable 5.1
    2016
    Co-Authors: Guillaume Beslon, Sergio Peignier, Jonas Abernot, Christophe Rigotti
    Abstract:

    Subspace clustering is a data mining task that searches for objects that share similar features and at the same time looks for the subspaces where these similarities appear. For this reason, subspace clustering is recognized as more general and complicated than standard clustering, since the latter only requires detecting groups of similar objects, or clusters. In this report we present ChameleoClust+, an evolutionary algorithm to tackle the subspace clustering problem. ChameleoClust+ is a bio-inspired algorithm implementing an evolvable genome structure, including several bio-like features such as a variable genome length, both functional and non-functional elements and mutation operators including chromosomal rearrangements. The main purpose of the design of ChameleoClust+ is to take advantage of the large degree of freedom provided by its evolvable structure to detect various numbers of clusters in subspaces of various dimensions. This algorithm was assessed and compared to state-of-the-art methods, with satisfying results, on a reference benchmark using both real-world and synthetic datasets. While other algorithms may need more complex parameter settings, ChameleoClust+ requires setting only one ad hoc subspace clustering parameter: the maximal number of clusters. This single parameter sets the maximal level of detail of the subspace clustering and is quite intuitive. The remaining parameters of ChameleoClust+ are related to the evolution strategy (population size, mutation rate, ...), and it is possible to use a single setting for them that turns out to be effective enough for all the benchmark datasets. A sensitivity analysis has also been carried out to study the impact of each parameter on the subspace clustering quality. This report also presents Evowave, an application of ChameleoClust+ to analyze a real dynamic stream.