Automatic Categorization

The experts below are selected from a list of 6714 experts worldwide ranked by the ideXlab platform.

Gillian Millburn - One of the best experts on this subject based on the ideXlab platform.

  • Automatic categorization of diverse experimental information in the bioscience literature
    BMC Bioinformatics, 2012
    Co-Authors: Ruihua Fang, Mary Ann Tuli, Paul Davis, Wen Chen, Kimberly Van Auken, Xiaodong Wang, Gary Schindelman, Jolene S. Fernandes, Steven J. Marygold, Gillian Millburn
    Abstract:

    Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.

    Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.

    Conclusions: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and can thus be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
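
    The abstract names the SVM as the learning method but gives no implementation details. As a rough illustration only, the sketch below trains a linear SVM on binary bag-of-words features with Pegasos-style sub-gradient steps. The feature choice, the training procedure, the hyperparameters and the four toy training sentences are all assumptions made for this example, not the authors' pipeline.

```python
import random


def features(text):
    """Binary bag-of-words: the set of lowercase alphanumeric tokens."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return set(cleaned.split())


def train_linear_svm(examples, lam=0.01, epochs=50, seed=0):
    """Pegasos-style training of a linear SVM (hinge loss + L2 penalty).

    examples: list of (text, label) pairs, label is +1 or -1.
    Returns a dict mapping token -> weight. Hyperparameters are arbitrary.
    """
    rng = random.Random(seed)
    data = [(features(text), label) for text, label in examples]
    weights = {}
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # visit every example once per epoch
        for i in order:
            tokens, label = data[i]
            step += 1
            eta = 1.0 / (lam * step)  # decreasing learning rate
            margin = label * sum(weights.get(tok, 0.0) for tok in tokens)
            # Regularisation part of the sub-gradient: shrink all weights.
            shrink = 1.0 - eta * lam
            for tok in weights:
                weights[tok] *= shrink
            # Hinge loss is active: push weights towards this example's label.
            if margin < 1.0:
                for tok in tokens:
                    weights[tok] = weights.get(tok, 0.0) + eta * label
    return weights


def decision(weights, text):
    """Signed score: positive means 'paper contains the data type'."""
    return sum(weights.get(tok, 0.0) for tok in features(text))


# Toy training set (invented): +1 = contains RNAi results, -1 = does not.
TRAINING = [
    ("RNAi knockdown of the gene caused embryonic lethality", 1),
    ("feeding RNAi screen identified suppressors of the phenotype", 1),
    ("antibody staining showed protein localization", -1),
    ("crystal structure of the enzyme was solved", -1),
]
model = train_linear_svm(TRAINING)
print(decision(model, "RNAi knockdown screen") > 0)     # flagged for curation
print(decision(model, "antibody staining enzyme") > 0)  # not flagged
```

    In a setting like the one described, one such binary classifier would presumably be trained per data type, so a paper can be associated with several data types at once.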

Patrick Ruch - One of the best experts on this subject based on the ideXlab platform.

  • Comparing a Rule Based vs. Statistical System for Automatic Categorization of MEDLINE® Documents According to Biomedical Specialty
    2020
    Co-Authors: Susanne M Humphrey, Aurelie Neveol, Julien Gobeill, Patrick Ruch, Stefan Jacques Darmoni, Allen Browne
    Abstract:

    Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings® (MeSH®) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for one hundred MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures, performance is comparable, and for one measure, JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule based) might be combined and then evaluated to show whether they are complementary to one another.

    Note on the gold standard: Humphrey and Darmoni, experts in the JD and MT approaches, respectively, spent several weeks working to achieve the consensus by telephone and email. Their work was based on the title and abstract of the documents. About two weeks in total were necessary to establish a correspondence between JDs and MTs. An additional two weeks were spent developing the gold standard consensus: both experts spent about one hour on each document. The documents were not categorized by either of their systems prior to their work, so that the gold standard was obtained independently of the automatic methods.
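
    The statistical side of the comparison rests on associations between journal descriptors and text words. The abstract does not give the formula, so the sketch below is only one plausible reading: a word's association with a descriptor is the fraction of training documents containing the word that carry that descriptor, and a document's score for a descriptor is the average over its known words. The descriptor names and training sentences are invented for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_word_jd_associations(training_docs):
    """training_docs: list of (text, descriptors), where the descriptors are
    assumed to be inherited from the journal the document appeared in.

    Returns word -> {descriptor: fraction of documents containing the word
    that carry the descriptor}.
    """
    docs_with_word = Counter()
    docs_with_word_and_jd = defaultdict(Counter)
    for text, descriptors in training_docs:
        for word in set(tokenize(text)):
            docs_with_word[word] += 1
            for jd in descriptors:
                docs_with_word_and_jd[word][jd] += 1
    return {
        word: {jd: n / docs_with_word[word] for jd, n in counts.items()}
        for word, counts in docs_with_word_and_jd.items()
    }


def rank_descriptors(text, associations):
    """Average the word-level associations over the known words of a document."""
    words = [w for w in tokenize(text) if w in associations]
    scores = Counter()
    for word in words:
        for jd, value in associations[word].items():
            scores[jd] += value / len(words)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented training documents with hypothetical descriptor labels.
TRAINING_DOCS = [
    ("cardiac arrest outcomes after surgery", ["Cardiology"]),
    ("heart failure and cardiac output", ["Cardiology"]),
    ("tumor growth after surgery", ["Oncology"]),
    ("tumor markers in breast cancer", ["Oncology"]),
]
ASSOCIATIONS = train_word_jd_associations(TRAINING_DOCS)
ranking = rank_descriptors("cardiac output after surgery", ASSOCIATIONS)
print(ranking)  # [('Cardiology', 0.75), ('Oncology', 0.25)]
```

    No human indexing of the individual document is needed here, only the journal-level labels of the training set, which is the overhead argument the abstract makes in favour of the statistical approach.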

  • Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty
    Journal of the Association for Information Science and Technology, 2009
    Co-Authors: Susanne M Humphrey, Aurelie Neveol, Allen C Browne, Julien Gobeill, Patrick Ruch, Stefan Jacques Darmoni
    Abstract:

    Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings (MeSH) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures performance is comparable, and for one measure JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords, and it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule-based) might be combined and then evaluated to show whether they are complementary to one another. © 2009 Wiley Periodicals, Inc.
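
    The abstract does not say which six trec_eval measures were used. As a hedged illustration of how a ranked list of categories can be scored against a gold standard, the sketch below computes two standard ranking measures, average precision and precision at k; whether either was among the six is not stated in the source, and the category names are made up.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs,
    divided by the total number of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k


# Hypothetical system output for one document and its gold-standard categories.
system_ranking = ["Cardiology", "Oncology", "Neurology", "Surgery"]
gold = {"Cardiology", "Neurology"}

ap = average_precision(system_ranking, gold)   # (1/1 + 2/3) / 2
p2 = precision_at_k(system_ranking, gold, 2)   # 1 of the top 2 is relevant
print(round(ap, 4), p2)
```

    Averaging such per-document values over the 100 gold-standard documents would give one number per system and per measure, which is the form of comparison the abstract reports.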

  • Unsupervised documents categorization using new threshold sensitive weighting technique
    Artificial Intelligence in Medicine in Europe, 2007
    Co-Authors: Frederic Ehrler, Patrick Ruch
    Abstract:

    As the number of published documents increases quickly, there is a crucial need for fast and sensitive categorization methods to manage the information produced. In this paper, we focus on the categorization of biomedical documents with concepts of the Gene Ontology, an ontology dedicated to gene description. Our approach discovers associations between the predefined concepts and the documents using string matching techniques. The assignments are ranked according to a score computed using several strategies. The effects of these different scoring strategies on categorization effectiveness are evaluated. In particular, a new weighting technique based on term frequency is presented. This new weighting technique improves categorization effectiveness in most of the experiments performed. This paper shows that a clever use of frequency can bring substantial benefits when performing automatic categorization on large collections of documents.
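
    The abstract describes string matching between concept labels and documents, with a frequency-based score and a threshold, but not the actual weighting formula. The sketch below is an assumption-laden stand-in: each concept is scored by the mean term frequency of its label words in the document, and concepts below `min_score` are dropped. The concept identifiers `C1` to `C3` are placeholders, not real Gene Ontology identifiers.

```python
from collections import Counter


def tokenize(text):
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def rank_concepts(document, concepts, min_score=1.0):
    """Score each concept by the mean document frequency of its label words.

    concepts: dict mapping a concept id to its label string.
    min_score: hypothetical threshold; concepts scoring below it are dropped.
    Returns (concept_id, score) pairs, best first.
    """
    term_freq = Counter(tokenize(document))
    ranked = []
    for concept_id, label in concepts.items():
        words = tokenize(label)
        if not words:
            continue
        # Label words missing from the document contribute 0 to the mean.
        score = sum(term_freq[w] for w in words) / len(words)
        if score >= min_score:
            ranked.append((concept_id, score))
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Placeholder concept ids with labels in the style of ontology terms.
CONCEPTS = {
    "C1": "cell cycle",
    "C2": "DNA repair",
    "C3": "protein transport",
}
DOCUMENT = (
    "DNA repair genes control the cell cycle; "
    "repair of DNA damage delays cell division."
)
print(rank_concepts(DOCUMENT, CONCEPTS))                 # [('C2', 2.0), ('C1', 1.5)]
print(rank_concepts(DOCUMENT, CONCEPTS, min_score=1.6))  # [('C2', 2.0)]
```

    Raising the threshold trades recall for precision, which is one way a frequency-based weight can be made "threshold sensitive"; the paper's own technique may differ.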

  • Query and document translation by automatic text categorization: a simple approach to establish a strong textual baseline for ImageCLEFmed 2006
    CLEF (Working Notes), 2006
    Co-Authors: Julien Gobeill, Henning Muller, Patrick Ruch
    Abstract:

    In this paper, we report on the fusion of simple retrieval strategies with thesaural resources in order to perform document and query translation for cross-language retrieval in a collection of medical cases. The collection contains textual and visual content. In this paper, we focus on the textual content of the collection, which contains documents in three languages: French, English and German. The fusion of visual and textual content will also be treated. Unlike most automatic categorization systems, which rely on training data in order to infer text-to-concept relationships, our approach can be applied with any controlled vocabulary and does not use any training data. For the 2006 ImageCLEFmed experiments we use the Medical Subject Headings (MeSH), a terminology maintained by the National Library of Medicine which exists in a dozen languages. The basic idea consists of annotating every textual content of the collection (documents and queries) with a set of MeSH concepts using an automatic text categoriser, thus allowing an interlingual mapping between queries and documents. For tuning purposes, the system uses a sample of MEDLINE from the OHSUMED collection. Our results confirmed that such a simple approach is competitive with the best-performing cross-language retrieval methods for such a collection. Several simple linear approaches were used to combine textual and visual features.
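
    The interlingual mapping can be pictured as follows: queries and documents in any language are reduced to sets of language-independent concept identifiers, and retrieval compares those sets. The sketch below uses a tiny hand-made vocabulary and plain dictionary lookup as a stand-in for the categoriser. The concept ids, the terms and the Jaccard ranking are all assumptions for illustration; real MeSH identifiers and the authors' categoriser are not reproduced here.

```python
# Hypothetical multilingual vocabulary: concept id -> term per language.
CONCEPT_TERMS = {
    "C_HEART": {"en": "heart", "fr": "coeur", "de": "herz"},
    "C_LUNG": {"en": "lung", "fr": "poumon", "de": "lunge"},
    "C_FRACTURE": {"en": "fracture", "fr": "fracture", "de": "fraktur"},
}

# Reverse index: any term in any language -> concept id.
TERM_TO_CONCEPT = {
    term: concept_id
    for concept_id, terms in CONCEPT_TERMS.items()
    for term in terms.values()
}


def annotate(text):
    """Map a text in any supported language to a set of concept ids."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return {TERM_TO_CONCEPT[tok] for tok in cleaned.split() if tok in TERM_TO_CONCEPT}


def retrieve(query, documents):
    """Rank documents by Jaccard overlap between query and document concepts."""
    query_concepts = annotate(query)
    results = []
    for doc_id, text in documents.items():
        doc_concepts = annotate(text)
        union = query_concepts | doc_concepts
        if not union:
            continue
        score = len(query_concepts & doc_concepts) / len(union)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


DOCUMENTS = {
    "doc_fr": "radiographie du poumon gauche",
    "doc_de": "fraktur am linken arm",
    "doc_en": "enlarged heart on chest x-ray",
}
print(retrieve("lung x-ray", DOCUMENTS))  # the French document matches
print(retrieve("herz", DOCUMENTS))        # the English document matches
```

    Because matching happens on concept ids, an English query can retrieve a French document without any translation step and without training data, which is the property the abstract emphasises.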

Malay K Kundu - One of the best experts on this subject based on the ideXlab platform.

  • Automated classification of Pap smear images to detect cervical dysplasia
    Computer Methods and Programs in Biomedicine, 2017
    Co-Authors: Kangkana Bora, Manish Chowdhury, Lipi B Mahanta, Malay K Kundu, Anup K Das
    Abstract:

    An automated Pap smear classifier is proposed to detect cervical dysplasia. The study is performed on both cell-level and smear-level indigenous real images collected from two diagnostic centers. Analysis is performed on shape, texture and color features, 121 features in total. An ensemble classifier is designed using LSSVM, MLP and Random Forest with weighted majority voting. Classification reflects the established Bethesda pathological classification of cervical cancer.

    Background and objectives: The present study proposes an intelligent system for automatic categorization of Pap smear images to detect cervical dysplasia, which has been an open problem for the last five decades.

    Methods: The classification technique is based on shape, texture and color features. It classifies cervical dysplasia into two-level (normal and abnormal) and three-level (Negative for Intraepithelial Lesion or Malignancy, Low-grade Squamous Intraepithelial Lesion and High-grade Squamous Intraepithelial Lesion) classes, reflecting the established Bethesda system of classification used for diagnosis of cancerous or precancerous lesions of the cervix. The system is evaluated on two generated databases obtained from two diagnostic centers, one containing 1610 single cervical cells and the other 1320 complete smear-level images. The main objective of this database generation is to categorize the images according to the Bethesda system of classification, both of which require a great deal of training and expertise. The system is also trained and tested on the benchmark Herlev University database, which is publicly available. In this contribution a new segmentation technique has also been proposed for extracting shape features. Ripplet Type I transform, histogram first-order statistics and Gray Level Co-occurrence Matrix have been used for color and texture features respectively. To improve classification results, an ensemble method is used, which integrates the decisions of three classifiers. Assessments are performed using 5-fold cross-validation.

    Results: Extended experiments reveal that the proposed system can successfully classify Pap smear images, performing significantly better than other existing methods.

    Conclusion: This type of automated cancer classifier will be of particular help in early detection of cancer.
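
    The ensemble step combines three classifiers by weighted majority voting. The abstract does not say how the weights were chosen, so the sketch below simply takes them as given; the weight values and the tie-breaking rule are assumptions for illustration.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: dict mapping classifier name -> predicted label.
    weights: dict mapping classifier name -> non-negative weight.
    Ties go to the alphabetically first label (an arbitrary choice).
    """
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical weights, e.g. derived from validation accuracy.
WEIGHTS = {"lssvm": 0.5, "mlp": 0.3, "random_forest": 0.4}

# Two-level case: two weaker classifiers together outvote the strongest one.
two_level = weighted_majority_vote(
    {"lssvm": "abnormal", "mlp": "normal", "random_forest": "normal"}, WEIGHTS
)
# Three-level case with full disagreement: the heaviest classifier decides.
three_level = weighted_majority_vote(
    {"lssvm": "HSIL", "mlp": "LSIL", "random_forest": "NILM"}, WEIGHTS
)
print(two_level, three_level)  # normal HSIL
```

    With equal weights this reduces to plain majority voting; unequal weights let a more reliable classifier break disagreements among the three.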

Kangkana Bora - One of the best experts on this subject based on the ideXlab platform.

  • Automated classification of Pap smear images to detect cervical dysplasia
    Computer Methods and Programs in Biomedicine, 2017
    Co-Authors: Kangkana Bora, Manish Chowdhury, Lipi B Mahanta, Malay K Kundu, Anup K Das
    Abstract:

    An automated Pap smear classifier is proposed to detect cervical dysplasia. The study is performed on both cell-level and smear-level indigenous real images collected from two diagnostic centers. Analysis is performed on shape, texture and color features, 121 features in total. An ensemble classifier is designed using LSSVM, MLP and Random Forest with weighted majority voting. Classification reflects the established Bethesda pathological classification of cervical cancer.

    Background and objectives: The present study proposes an intelligent system for automatic categorization of Pap smear images to detect cervical dysplasia, which has been an open problem for the last five decades.

    Methods: The classification technique is based on shape, texture and color features. It classifies cervical dysplasia into two-level (normal and abnormal) and three-level (Negative for Intraepithelial Lesion or Malignancy, Low-grade Squamous Intraepithelial Lesion and High-grade Squamous Intraepithelial Lesion) classes, reflecting the established Bethesda system of classification used for diagnosis of cancerous or precancerous lesions of the cervix. The system is evaluated on two generated databases obtained from two diagnostic centers, one containing 1610 single cervical cells and the other 1320 complete smear-level images. The main objective of this database generation is to categorize the images according to the Bethesda system of classification, both of which require a great deal of training and expertise. The system is also trained and tested on the benchmark Herlev University database, which is publicly available. In this contribution a new segmentation technique has also been proposed for extracting shape features. Ripplet Type I transform, histogram first-order statistics and Gray Level Co-occurrence Matrix have been used for color and texture features respectively. To improve classification results, an ensemble method is used, which integrates the decisions of three classifiers. Assessments are performed using 5-fold cross-validation.

    Results: Extended experiments reveal that the proposed system can successfully classify Pap smear images, performing significantly better than other existing methods.

    Conclusion: This type of automated cancer classifier will be of particular help in early detection of cancer.
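
    The assessment protocol is 5-fold cross-validation. As a minimal sketch of what that involves, the code below shuffles sample indices, splits them into five folds, and averages accuracy over the five train/test rounds. The `fit` and `predict` callables, the seed and the majority-class baseline are hypothetical placeholders; the abstract does not describe how the folds were drawn.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(samples, labels, fit, predict, k=5, seed=0):
    """Mean test accuracy over k folds for any fit/predict pair."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k, seed):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Placeholder "classifier": always predicts the most common training label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


samples = list(range(10))                    # stand-ins for feature vectors
labels = ["normal"] * 6 + ["abnormal"] * 4   # invented labels
accuracy = cross_validate(samples, labels, fit_majority, predict_majority)
print(f"mean accuracy over 5 folds: {accuracy:.2f}")
```

    Every sample is used for testing exactly once, so the averaged score does not depend on one lucky or unlucky split.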

Ruihua Fang - One of the best experts on this subject based on the ideXlab platform.

  • Automatic categorization of diverse experimental information in the bioscience literature
    BMC Bioinformatics, 2012
    Co-Authors: Ruihua Fang, Mary Ann Tuli, Paul Davis, Wen Chen, Kimberly Van Auken, Xiaodong Wang, Gary Schindelman, Jolene S. Fernandes, Steven J. Marygold, Gillian Millburn
    Abstract:

    Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.

    Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.

    Conclusions: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and can thus be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
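
    The cross-corpus idea, borrowing training papers of a similar data type from another body of literature when one corpus has too few, can be sketched as a small data-preparation step. The abstract does not describe the procedure itself, so the `min_positives` rule, the corpus layout and the example papers below are hypothetical.

```python
def pool_training_sets(corpora, target, data_type, min_positives=3):
    """Build a binary training set for one data type.

    corpora: dict mapping corpus name -> list of (text, set of data types).
    target: name of the corpus the classifier is intended for.
    min_positives: hypothetical cut-off; if the target corpus has fewer
        positive papers than this, papers from all other corpora are added.
    Returns a list of (text, label, source corpus) with label +1 or -1.
    """
    def labelled(name):
        return [
            (text, 1 if data_type in types else -1, name)
            for text, types in corpora[name]
        ]

    pooled = labelled(target)
    positives = sum(1 for _, label, _ in pooled if label == 1)
    if positives < min_positives:
        for name in sorted(corpora):
            if name != target:
                pooled.extend(labelled(name))
    return pooled


# Invented mini-corpora keyed by organism literature.
CORPORA = {
    "C. elegans": [
        ("RNAi knockdown in worms", {"RNAi"}),
        ("antibody staining in worms", {"antibody"}),
    ],
    "D. melanogaster": [
        ("RNAi screen in flies", {"RNAi"}),
        ("dsRNA injection RNAi in embryos", {"RNAi"}),
        ("fly wing phenotype", {"phenotype"}),
    ],
}

pooled = pool_training_sets(CORPORA, "C. elegans", "RNAi")
print(len(pooled), sum(1 for _, label, _ in pooled if label == 1))  # 5 3
```

    Keeping the source corpus with each example makes it possible to check later whether the borrowed papers actually help on the target database.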