Low Cardinality

The Experts below are selected from a list of 2,142 Experts worldwide, ranked by the ideXlab platform.

Hasan M. Jamil - One of the best experts on this subject based on the ideXlab platform.

  • DEXA (1) - A hybrid index structure for set-valued attributes using itemset tree and inverted list
    Lecture Notes in Computer Science, 2010
    Co-Authors: Shahriyar Hossain, Hasan M. Jamil
    Abstract:

    The use of set-valued objects is becoming increasingly commonplace in modern application domains such as multimedia, genetics, and the stock market. Recent research on set indexing has focused mainly on containment joins and data mining, without considering basic set operations on set-valued attributes. In this paper, we propose a novel indexing scheme for processing superset, subset, and equality queries on set-valued attributes. The proposed index structure is a hybrid of an itemset-transaction tree over "frequent items" and an inverted list over "infrequent items", which takes advantage of developments in itemset research in data mining. In this hybrid scheme, the expectation is that basic set operations on frequent Low Cardinality sets will yield superior retrieval performance, while avoiding the high cost of constructing and maintaining an itemset tree for infrequent large itemsets. We demonstrate, through extensive experiments, that the proposed method performs as expected and yields superior overall performance compared to the state-of-the-art indexing scheme for set-valued attributes, i.e., inverted lists.
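
As a rough illustration of the idea, the sketch below builds a hybrid index in Python: items above a support threshold go into a prefix tree over sorted itemsets, the rest into per-item inverted lists, and a superset query intersects the two sides. The `HybridSetIndex` class and its methods are hypothetical names for illustration, not the paper's implementation.

```python
# Minimal sketch of a hybrid set index (hypothetical API, not the paper's code):
# frequent items go into a prefix tree over sorted itemsets, infrequent items
# into per-item inverted lists; a superset query intersects both sides.
from collections import defaultdict

class HybridSetIndex:
    def __init__(self, records, min_support):
        # records: dict record_id -> set of items (assumed hashable, e.g. strings)
        counts = defaultdict(int)
        for items in records.values():
            for it in items:
                counts[it] += 1
        self.frequent = {it for it, c in counts.items() if c >= min_support}
        self.trie = {}                      # prefix tree over sorted frequent subsets
        self.inverted = defaultdict(set)    # infrequent item -> record ids
        for rid, items in records.items():
            node = self.trie
            for it in sorted(items & self.frequent):
                node = node.setdefault(it, {})
            node.setdefault("_ids", set()).add(rid)   # "_ids" reserved, not an item
            for it in items - self.frequent:
                self.inverted[it].add(rid)

    def superset_query(self, query):
        # records whose item set contains every item of `query`
        freq_part = sorted(set(query) & self.frequent)
        infreq_part = set(query) - self.frequent
        hits = set()
        self._collect(self.trie, freq_part, hits)     # walk tree for frequent part
        for it in infreq_part:                        # intersect with inverted lists
            hits &= self.inverted.get(it, set())
        return hits

    def _collect(self, node, remaining, out):
        if not remaining:
            # all query items consumed: every record in this subtree qualifies
            out |= node.get("_ids", set())
            for key, child in node.items():
                if key != "_ids":
                    self._collect(child, remaining, out)
            return
        for key, child in node.items():
            if key == "_ids":
                continue
            if key == remaining[0]:
                self._collect(child, remaining[1:], out)
            elif key < remaining[0]:
                self._collect(child, remaining, out)  # skip past non-query items
            # key > remaining[0]: prune, sorted paths cannot contain remaining[0]
```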

Giorgio Valentini - One of the best experts on this subject based on the ideXlab platform.

  • Model order selection for bio-molecular data clustering.
    BMC bioinformatics, 2007
    Co-Authors: Alberto Bertoni, Giorgio Valentini
    Abstract:

    Cluster analysis has been widely applied to investigate structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough a priori biological knowledge to evaluate either the number of clusters or their validity. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, assessing the statistical significance of the discovered clustering solutions and detecting multiple structures simultaneously present in high-dimensional bio-molecular data remain major problems. We propose a stability method based on randomized maps that exploits the high dimensionality and relatively Low Cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A chi-square-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant, and where possible multi-level, structures simultaneously present in the data (e.g. hierarchical structures). The experimental results show that our model order selection methods are competitive with other state-of-the-art stability-based algorithms and are able to detect multiple levels of structure underlying both synthetic and gene expression data.
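
The sketch below illustrates the general shape of such a stability procedure in Python, using scikit-learn's GaussianRandomProjection, k-means, and the adjusted Rand index as stand-ins for the paper's randomized maps and similarity measures; the chi-square significance test is omitted. `stability_scores` and its parameters are hypothetical names for illustration.

```python
# Hedged sketch of stability-based model order selection via random projections
# (simplified stand-in for the paper's method: Gaussian random maps + k-means +
# pairwise adjusted Rand index; the chi-square significance test is omitted).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.random_projection import GaussianRandomProjection

def stability_scores(X, k_range, n_projections=20, proj_dim=50, seed=0):
    # X: (samples x genes) matrix; assumes proj_dim << number of genes
    rng = np.random.default_rng(seed)
    scores = {}
    for k in k_range:
        labelings = []
        for _ in range(n_projections):
            proj = GaussianRandomProjection(
                n_components=proj_dim,
                random_state=int(rng.integers(1 << 30)))
            Xp = proj.fit_transform(X)  # random linear combinations of inputs
            labelings.append(
                KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xp))
        # similarity distribution over all pairs of clusterings
        sims = [adjusted_rand_score(a, b)
                for i, a in enumerate(labelings)
                for b in labelings[i + 1:]]
        scores[k] = float(np.mean(sims))
    return scores  # higher mean similarity -> more stable candidate k

# e.g.: scores = stability_scores(X, range(2, 7)); best_k = max(scores, key=scores.get)
```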

  • Bio-molecular diagnosis through random subspace ensembles of learning machines
    2006
    Co-Authors: Giorgio Valentini, Alberto Bertoni, Raffaella Folgieri
    Abstract:

    High-throughput bio-technologies (e.g. DNA microarrays) generate data characterized by high dimensionality and Low Cardinality. The bio-molecular diagnosis of malignancies based on these biotechnologies is a difficult learning task, due to the characteristics of these high-dimensional data. Many supervised machine learning techniques, among them support vector machines (SVMs), have been applied to it, often together with feature selection methods to reduce the dimensionality of the data. In this paper we investigate an alternative approach based on random subspace ensemble methods. The high dimensionality of the data is reduced by randomly sampling subsets of features (gene expression levels), and accuracy is improved by aggregating the resulting base classifiers. Considering the high computational cost of the proposed technique, we used the high-performance C.I.L.E.A. Avogadro cluster of dual-processor Xeon workstations to perform all our computational experiments.
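
A minimal sketch of the random subspace method follows, assuming a recent scikit-learn (>= 1.2, where the base learner is passed as `estimator`): a BaggingClassifier with bootstrapping disabled and max_features below 1.0 trains each base SVM on a random subset of features and aggregates their predictions by voting. Hyperparameter values here are illustrative, not the paper's.

```python
# Sketch of a random subspace ensemble for high-dimensional, low-cardinality
# data: BaggingClassifier with bootstrap=False and max_features < 1.0 is the
# random subspace method (each base learner sees all samples, few features).
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

ensemble = BaggingClassifier(
    estimator=SVC(kernel="linear"),  # base learner, matching the SVM setting
    n_estimators=50,                 # number of random feature subsets
    max_features=0.05,               # each SVM sees ~5% of the genes (illustrative)
    bootstrap=False,                 # keep all samples: cardinality is already low
    bootstrap_features=False,        # draw feature subsets without replacement
    random_state=0,
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```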

  • Letters: Bio-molecular cancer prediction with random subspace ensembles of support vector machines
    Neurocomputing, 2005
    Co-Authors: Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
    Abstract:

    Support vector machines (SVMs) and other supervised learning techniques have been applied to the bio-molecular diagnosis of malignancies, often together with feature selection methods. The classification task is particularly difficult because of the high dimensionality and Low Cardinality of gene expression data. In this paper we investigate a different approach based on random subspace ensembles of SVMs: a set of base learners is trained and aggregated using subsets of features randomly drawn from the available DNA microarray data. Experimental results on colon adenocarcinoma diagnosis and medulloblastoma clinical outcome prediction show the effectiveness of the proposed approach.

  • WIRN - Feature Selection Combined with Random Subspace Ensemble for Gene Expression Based Diagnosis of Malignancies
    Biological and Artificial Intelligence Environments
    Co-Authors: Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
    Abstract:

    The bio-molecular diagnosis of malignancies is a difficult learning task because of the high dimensionality and Low Cardinality of the data. Many supervised learning techniques, among them support vector machines, have been applied to it, often together with feature selection methods to reduce the dimensionality of the data. As an alternative to feature selection, we previously proposed random subspace ensembles, which reduce the dimensionality of the data by randomly sampling subsets of features and improve accuracy by aggregating the resulting base classifiers. In this paper we experiment with the combination of random subspace ensembles and feature selection methods, showing preliminary experimental results that seem to confirm the effectiveness of the proposed approach.
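
A hedged sketch of the combination follows: a univariate filter (scikit-learn's SelectKBest with the ANOVA F-score) first prunes the genes, and the random subspace ensemble is then drawn from the survivors, as in the sketch above. The filter choice and all hyperparameters are illustrative stand-ins; the paper's exact feature selection methods may differ.

```python
# Sketch: univariate feature selection followed by a random subspace ensemble.
# Assumes scikit-learn >= 1.2 and k <= number of input features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

model = make_pipeline(
    SelectKBest(f_classif, k=500),             # keep the 500 top-ranked genes
    BaggingClassifier(estimator=SVC(kernel="linear"),
                      n_estimators=50, max_features=0.1,
                      bootstrap=False, random_state=0),
)
# model.fit(X_train, y_train); model.predict(X_test)
```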

Yannis Kotidis - One of the best experts on this subject based on the ideXlab platform.

  • Building Space-Efficient Inverted Indexes on Low Cardinality Dimensions
    Database and Expert Systems Applications, 2015
    Co-Authors: Vasilis Spyropoulos, Yannis Kotidis
    Abstract:

    Many modern applications naturally lead to the use of inverted indexes for effectively managing large collections of data items. Creating an inverted index on a Low Cardinality data domain results in replication of data descriptors, leading to increased storage overhead. For example, the use of RFID or similar sensing devices in supply chains results in massive tracking datasets that require effective spatial or spatio-temporal indexes. As the volume of data grows proportionally larger than the number of spatial locations or time epochs, it is unavoidable that many of the resulting lists share large subsets of common items. In this paper we present techniques that exploit this characteristic of modern big-data applications in order to losslessly compress the resulting inverted indexes, by discovering large common item sets and adapting the index so as to store just one copy of them. We apply our method in the supply-chain domain using modern big-data tools and show that our techniques in many cases achieve compression ratios exceeding 50%.
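
As a toy illustration of the compression idea (not the paper's algorithm), the Python sketch below factors the items common to all inverted lists into a single shared group and leaves each list with a group reference plus its residual items. `factor_common_items` and the data layout are hypothetical.

```python
# Toy sketch of the compression idea: when many inverted lists share a large
# common item set, store that set once as a "group" and let each list keep
# only a reference to the group plus its residual items.
from functools import reduce

def factor_common_items(index, min_shared=2):
    # index: dict key -> set of item ids (e.g. RFID tag ids per location/epoch)
    keys = list(index)
    common = reduce(set.intersection, (index[k] for k in keys)) if keys else set()
    if len(common) < min_shared:
        return index, {}                     # not worth factoring out
    groups = {"g0": frozenset(common)}       # shared items stored once
    compressed = {k: ("g0", index[k] - common) for k in keys}
    return compressed, groups

index = {"loc1": {1, 2, 3, 4}, "loc2": {1, 2, 3, 5}, "loc3": {1, 2, 3}}
compressed, groups = factor_common_items(index)
# groups == {"g0": frozenset({1, 2, 3})}; each list stores a group ref + residual
```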

  • DEXA (1) - Building Space-Efficient Inverted Indexes on Low-Cardinality Dimensions
    Lecture Notes in Computer Science, 2015
    Co-Authors: Vasilis Spyropoulos, Yannis Kotidis
    Abstract:

    Many modern applications naturally lead to the use of inverted indexes for effectively managing large collections of data items. Creating an inverted index on a Low Cardinality data domain results in replication of data descriptors, leading to increased storage overhead. For example, the use of RFID or similar sensing devices in supply chains results in massive tracking datasets that require effective spatial or spatio-temporal indexes. As the volume of data grows proportionally larger than the number of spatial locations or time epochs, it is unavoidable that many of the resulting lists share large subsets of common items. In this paper we present techniques that exploit this characteristic of modern big-data applications in order to losslessly compress the resulting inverted indexes, by discovering large common item sets and adapting the index so as to store just one copy of them. We apply our method in the supply-chain domain using modern big-data tools and show that our techniques in many cases achieve compression ratios exceeding 50%.

Shahriyar Hossain - One of the best experts on this subject based on the ideXlab platform.

  • DEXA (1) - A hybrid index structure for set-valued attributes using itemset tree and inverted list
    Lecture Notes in Computer Science, 2010
    Co-Authors: Shahriyar Hossain, Hasan M. Jamil
    Abstract:

    The use of set-valued objects is becoming increasingly commonplace in modern application domains such as multimedia, genetics, and the stock market. Recent research on set indexing has focused mainly on containment joins and data mining, without considering basic set operations on set-valued attributes. In this paper, we propose a novel indexing scheme for processing superset, subset, and equality queries on set-valued attributes. The proposed index structure is a hybrid of an itemset-transaction tree over "frequent items" and an inverted list over "infrequent items", which takes advantage of developments in itemset research in data mining. In this hybrid scheme, the expectation is that basic set operations on frequent Low Cardinality sets will yield superior retrieval performance, while avoiding the high cost of constructing and maintaining an itemset tree for infrequent large itemsets. We demonstrate, through extensive experiments, that the proposed method performs as expected and yields superior overall performance compared to the state-of-the-art indexing scheme for set-valued attributes, i.e., inverted lists.

Alberto Bertoni - One of the best experts on this subject based on the ideXlab platform.

  • Model order selection for bio-molecular data clustering.
    BMC bioinformatics, 2007
    Co-Authors: Alberto Bertoni, Giorgio Valentini
    Abstract:

    Cluster analysis has been widely applied to investigate structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough a priori biological knowledge to evaluate either the number of clusters or their validity. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, assessing the statistical significance of the discovered clustering solutions and detecting multiple structures simultaneously present in high-dimensional bio-molecular data remain major problems. We propose a stability method based on randomized maps that exploits the high dimensionality and relatively Low Cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A chi-square-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant, and where possible multi-level, structures simultaneously present in the data (e.g. hierarchical structures). The experimental results show that our model order selection methods are competitive with other state-of-the-art stability-based algorithms and are able to detect multiple levels of structure underlying both synthetic and gene expression data.

  • Bio-molecular diagnosis through random subspace ensembles of learning machines
    2006
    Co-Authors: Giorgio Valentini, Alberto Bertoni, Raffaella Folgieri
    Abstract:

    High-throughput bio-technologies (e.g. DNA microarrays) generate data characterized by high dimensionality and Low Cardinality. The bio-molecular diagnosis of malignancies based on these biotechnologies is a difficult learning task, due to the characteristics of these high-dimensional data. Many supervised machine learning techniques, among them support vector machines (SVMs), have been applied to it, often together with feature selection methods to reduce the dimensionality of the data. In this paper we investigate an alternative approach based on random subspace ensemble methods. The high dimensionality of the data is reduced by randomly sampling subsets of features (gene expression levels), and accuracy is improved by aggregating the resulting base classifiers. Considering the high computational cost of the proposed technique, we used the high-performance C.I.L.E.A. Avogadro cluster of dual-processor Xeon workstations to perform all our computational experiments.

  • Letters: Bio-molecular cancer prediction with random subspace ensembles of support vector machines
    Neurocomputing, 2005
    Co-Authors: Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
    Abstract:

    Support vector machines (SVMs) and other supervised learning techniques have been applied to the bio-molecular diagnosis of malignancies, often together with feature selection methods. The classification task is particularly difficult because of the high dimensionality and Low Cardinality of gene expression data. In this paper we investigate a different approach based on random subspace ensembles of SVMs: a set of base learners is trained and aggregated using subsets of features randomly drawn from the available DNA microarray data. Experimental results on colon adenocarcinoma diagnosis and medulloblastoma clinical outcome prediction show the effectiveness of the proposed approach.

  • WIRN - Feature Selection Combined with Random Subspace Ensemble for Gene Expression Based Diagnosis of Malignancies
    Biological and Artificial Intelligence Environments
    Co-Authors: Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
    Abstract:

    The bio-molecular diagnosis of malignancies is a difficult learning task because of the high dimensionality and Low Cardinality of the data. Many supervised learning techniques, among them support vector machines, have been applied to it, often together with feature selection methods to reduce the dimensionality of the data. As an alternative to feature selection, we previously proposed random subspace ensembles, which reduce the dimensionality of the data by randomly sampling subsets of features and improve accuracy by aggregating the resulting base classifiers. In this paper we experiment with the combination of random subspace ensembles and feature selection methods, showing preliminary experimental results that seem to confirm the effectiveness of the proposed approach.