Database Selection

The Experts below are selected from a list of 82275 Experts worldwide, ranked by the ideXlab platform.

Minjie Zhang - One of the best experts on this subject based on the ideXlab platform.

  • Two-stage statistical language models for text Database Selection
    Information Retrieval, 2006
    Co-Authors: Hui Yang, Minjie Zhang
    Abstract:

    As the number and diversity of distributed Web Databases on the Internet increase exponentially, it is difficult for users to know which Databases are appropriate to search. Given Database language models that describe the content of each Database, Database Selection services can provide assistance in locating Databases relevant to the information needs of users. In this paper, we propose a Database Selection approach based on statistical language modeling. The basic idea behind the approach is that, for Databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and then the Databases are ranked by their similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent inaccuracy due to word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art Database Selection approaches.

  • PRICAI - Association-rule based information source Selection
    PRICAI 2004: Trends in Artificial Intelligence, 2004
    Co-Authors: Hui Yang, Minjie Zhang, Zhongzhi Shi
    Abstract:

    The proliferation of information sources available on the World Wide Web has resulted in a need for Database Selection tools to locate potentially useful information sources with respect to the user's information need. Current Database Selection tools always treat each Database independently, ignoring the implicit, useful associations between distributed Databases. To overcome this shortcoming, we introduce a data-mining approach that assists the process of Database Selection by extracting potentially interesting association rules between web Databases from a collection of previous Selection results. With a topic hierarchy, we exploit intraclass and interclass associations between distributed Databases, and use the discovered knowledge on distributed Databases to refine the original Selection results. We present experimental results to demonstrate that this technique is useful in improving the effectiveness of Database Selection.

  • Australian Conference on Artificial Intelligence - A Language Modeling Approach to Search Distributed Text Databases
    Lecture Notes in Computer Science, 2003
    Co-Authors: Hui Yang, Minjie Zhang
    Abstract:

    As the number and diversity of distributed information sources on the Internet increase exponentially, it is difficult for the user to know which Databases are appropriate to search. Given Database language models that describe the content of each Database, Database Selection services can provide assistance in locating Databases relevant to the user’s information need. In this paper, we propose a Database Selection approach based on statistical language modeling. The basic idea behind the approach is that, for Databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and then the Databases are ranked by their similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent the inaccuracy due to word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art Database Selection approaches.
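
The two-stage idea described in the language-modeling abstracts above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstracts do not give the smoothing formula, so linear interpolation with a fixed weight `lam` is assumed here, and the function names, the `hierarchy` layout, and the cut-off parameters are all hypothetical.

```python
import math
from collections import Counter


def smoothed_log_likelihood(query_terms, model_counts, background_counts, lam=0.5):
    """log P(query | model), interpolated with a background model (assumed smoothing)."""
    model_total = sum(model_counts.values())
    background_total = sum(background_counts.values())
    score = 0.0
    for term in query_terms:
        p_model = model_counts[term] / model_total if model_total else 0.0
        p_back = background_counts[term] / background_total if background_total else 0.0
        p = lam * p_model + (1.0 - lam) * p_back
        # Floor keeps log() defined for terms that appear nowhere.
        score += math.log(max(p, 1e-12))
    return score


def two_stage_select(query_terms, hierarchy, top_categories=1, top_databases=2):
    """hierarchy: {category: {database_name: Counter of term counts}} (hypothetical layout)."""
    # Pool term counts per category and over the whole collection.
    category_counts = {}
    global_counts = Counter()
    for category, databases in hierarchy.items():
        pooled = Counter()
        for counts in databases.values():
            pooled.update(counts)
        category_counts[category] = pooled
        global_counts.update(pooled)

    # Stage 1: rank topic categories, smoothing each with the global model.
    ranked = sorted(
        category_counts,
        key=lambda c: smoothed_log_likelihood(query_terms, category_counts[c], global_counts),
        reverse=True,
    )[:top_categories]

    # Stage 2: rank databases inside the chosen categories,
    # smoothing each database model with its category model.
    candidates = []
    for category in ranked:
        for name, counts in hierarchy[category].items():
            score = smoothed_log_likelihood(query_terms, counts, category_counts[category])
            candidates.append((score, name))
    candidates.sort(reverse=True)
    return [name for _, name in candidates[:top_databases]]
```

Smoothing each database against its own category, rather than against the whole collection, is one plausible way to address the word-sparseness problem the abstracts mention; the published models may differ.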

Hui Yang - One of the best experts on this subject based on the ideXlab platform.

  • Two-stage statistical language models for text Database Selection
    Information Retrieval, 2006
    Co-Authors: Hui Yang, Minjie Zhang
    Abstract:

    As the number and diversity of distributed Web Databases on the Internet increase exponentially, it is difficult for users to know which Databases are appropriate to search. Given Database language models that describe the content of each Database, Database Selection services can provide assistance in locating Databases relevant to the information needs of users. In this paper, we propose a Database Selection approach based on statistical language modeling. The basic idea behind the approach is that, for Databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and then the Databases are ranked by their similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent inaccuracy due to word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art Database Selection approaches.

  • Methodologies for information source Selection under distributed information environments
    2005
    Co-Authors: Hui Yang
    Abstract:

    The information revolution is upon us. In fact, we are increasingly overwhelmed by the exponential growth of information on the Web. The profusion of resources on the Web has given rise to considerable interest in the research of information retrieval. Traditional information retrieval techniques are facing new challenges in distributed information environments such as the Internet. One of the more important research issues is information source Selection, which is to select a small number of information sources that may contain most of the potentially useful documents when a user information need is presented. This thesis investigates new methodologies for information source Selection in distributed information environments. We have identified potential Selection cases within the context of distributed textual Databases, and have classified the types of textual Databases. The connection between Selection cases and Database types is analysed, and necessary constraints are given for each Selection case. These research results could be used as guidance for developing effective Database Selection algorithms. A framework for a topic-based Database Selection system is proposed that uses a topic hierarchy. In this framework, firstly, distributed textual Databases are hierarchically categorised into a topic hierarchy for convenience of access and management. Secondly, two-stage Database language models are presented to employ topic-based Database Selection within the context of the hierarchy of topics. At the category-specific search stage, a smoothed class-based language model is developed to determine the appropriate topic categories with respect to the user query. A number of Databases associated with the chosen topics are selected as candidate Databases for the next search stage. At the term-specific search stage, a smoothed term-based language model is used to find the Databases that are likely to contain the specified query terms. Finally, the original Selection result is further refined by a set of topic-based association rules. These topic-based association rules contain useful information about the relationships between Databases, which are extracted from a collection of previous Selection results. To overcome the drawback of keyword-based search, which treats words as independent of each other and ignores potential semantic relationships between them, we propose a concept-based search mechanism to search distributed web Databases using domain-specific ontologies. A domain-specific ontology provides rich information about the semantic relationships between concepts in a specific topic domain. This information is used for the generation of concept-related resource descriptions of web Databases, query disambiguation and concept-based query matching in Database Selection.

  • PRICAI - Association-rule based information source Selection
    PRICAI 2004: Trends in Artificial Intelligence, 2004
    Co-Authors: Hui Yang, Minjie Zhang, Zhongzhi Shi
    Abstract:

    The proliferation of information sources available on the World Wide Web has resulted in a need for Database Selection tools to locate potentially useful information sources with respect to the user's information need. Current Database Selection tools always treat each Database independently, ignoring the implicit, useful associations between distributed Databases. To overcome this shortcoming, we introduce a data-mining approach that assists the process of Database Selection by extracting potentially interesting association rules between web Databases from a collection of previous Selection results. With a topic hierarchy, we exploit intraclass and interclass associations between distributed Databases, and use the discovered knowledge on distributed Databases to refine the original Selection results. We present experimental results to demonstrate that this technique is useful in improving the effectiveness of Database Selection.

  • Australian Conference on Artificial Intelligence - A Language Modeling Approach to Search Distributed Text Databases
    Lecture Notes in Computer Science, 2003
    Co-Authors: Hui Yang, Minjie Zhang
    Abstract:

    As the number and diversity of distributed information sources on the Internet increase exponentially, it is difficult for the user to know which Databases are appropriate to search. Given Database language models that describe the content of each Database, Database Selection services can provide assistance in locating Databases relevant to the user’s information need. In this paper, we propose a Database Selection approach based on statistical language modeling. The basic idea behind the approach is that, for Databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and then the Databases are ranked by their similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent the inaccuracy due to word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art Database Selection approaches.
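
The association-rule refinement mentioned in the abstracts above (mining rules between Databases from previous Selection results, then using them to refine a new result) might look roughly like the sketch below. It is a simplified illustration, not the authors' method: it mines only pairwise rules, ignores the topic hierarchy (so there is no intraclass/interclass distinction), and the thresholds and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(past_selections, min_support=0.3, min_confidence=0.6):
    """Return {(a, b): confidence} for rules 'a was selected => b was selected'."""
    n = len(past_selections)
    single = Counter()
    pair = Counter()
    for selected in past_selections:
        items = set(selected)
        single.update(items)
        # Ordered pairs, so a => b and b => a are counted separately.
        pair.update(permutations(sorted(items), 2))
    rules = {}
    for (a, b), count in pair.items():
        support = count / n             # fraction of past results containing both
        confidence = count / single[a]  # fraction of results with a that also have b
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def refine_selection(initial, rules):
    """Append databases implied by rules whose antecedent was initially selected."""
    refined = list(initial)
    # Apply the most confident rules first; ties are broken by name.
    for (a, b), _ in sorted(rules.items(), key=lambda kv: (-kv[1], kv[0])):
        if a in initial and b not in refined:
            refined.append(b)
    return refined
```

For example, if databases `A` and `B` were selected together in most earlier results, a new result containing only `A` would be extended with `B`.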

Jamie Callan - One of the best experts on this subject based on the ideXlab platform.

  • The impact of Database Selection on distributed searching
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000
    Co-Authors: Allison L. Powell, James C. French, Jamie Callan, Margaret E Connell, Charles L Viles
    Abstract:

    The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts — Database Selection, query processing, and results merging. In this paper we examine the effect of Database Selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good Database Selection can result in better retrieval effectiveness than can be achieved in a centralized Database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when Database Selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in Database Selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single Database) systems. Given a centralized Database and a good Selection mechanism, retrieval performance can be improved by decomposing that Database conceptually and employing a Selection step.

  • Effective and Efficient Automatic Database Selection
    1999
    Co-Authors: James C. French, Allison L. Powell, Jamie Callan
    Abstract:

    We examine a class of Database Selection algorithms that require only document frequency information. The CORI algorithm is an instance of this class of algorithms. In previous work, we showed that CORI is more effective than gGlOSS when evaluated against a relevance-based standard. In this paper, we introduce a family of other algorithms in this class and examine components of these algorithms and of the CORI algorithm to begin identifying the factors responsible for their performance. We establish that the class of algorithms studied here is more effective and efficient than gGlOSS and is applicable to a wider variety of operational environments. In particular, this methodology is completely decoupled from the Database indexing technology so is as useful in heterogeneous environments as in homogeneous environments.

  • SIGIR - Comparing the performance of Database Selection algorithms
    Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '99, 1999
    Co-Authors: James C. French, Allison L. Powell, Jamie Callan, Charles L Viles, Travis Emmitt, Kevin J Prey, Yun Mou
    Abstract:

    We compare the performance of two Database Selection algorithms reported in the literature. Their performance is compared using a common testbed designed specifically for Database Selection techniques. The testbed is a decomposition of the TREC/TIPSTER data into 236 subcollections. We present results of a recent investigation of the performance of the CORI algorithm and compare the performance with earlier work that examined the performance of gGlOSS. The Databases from our testbed were ranked using both the gGlOSS and CORI techniques and compared to the RBR baseline, a baseline derived from TREC relevance judgements. We examined the degree to which CORI and gGlOSS approximate this baseline. Our results confirm our earlier observation that the gGlOSS Ideal(l) ranks do not estimate relevance-based ranks well. We also find that CORI is a uniformly better estimator of relevance-based ranks than gGlOSS for the test environment used in this study. Part of the advantage of the CORI algorithm can be explained by a strong correlation between gGlOSS and a size-based baseline (SBR). We also find that CORI produces consistently accurate rankings on testbeds ranging from 100 to 921 sites. However, for a given level of recall, search effort appears to scale linearly with the number of Databases.
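
The abstracts above describe CORI as a Selection algorithm that needs only document frequency information. A minimal sketch of a CORI-style score is given below. The formula and the constants (50, 150, and the default belief 0.4) follow the form in which CORI is usually stated in the distributed retrieval literature; they are not given in the abstracts themselves, and the `db_stats` layout and function names are hypothetical.

```python
import math


def cori_belief(df, cf, num_dbs, cw, avg_cw, default_belief=0.4):
    """Belief that one database satisfies one query term.

    df: documents in this database containing the term
    cf: number of databases containing the term
    cw: size of this database in words; avg_cw: mean size over all databases
    """
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)
    return default_belief + (1 - default_belief) * t * i


def cori_rank(query_terms, db_stats, default_belief=0.4):
    """db_stats: {name: {"df": {term: int}, "cw": int}}. Returns names, best first."""
    if not query_terms:
        raise ValueError("query must contain at least one term")
    num_dbs = len(db_stats)
    avg_cw = sum(s["cw"] for s in db_stats.values()) / num_dbs
    scores = {}
    for name, stats in db_stats.items():
        beliefs = []
        for term in query_terms:
            cf = sum(1 for s in db_stats.values() if s["df"].get(term, 0) > 0)
            if cf == 0:
                # Term occurs in no database: fall back to the default belief.
                beliefs.append(default_belief)
                continue
            df = stats["df"].get(term, 0)
            beliefs.append(cori_belief(df, cf, num_dbs, stats["cw"], avg_cw, default_belief))
        # Average the per-term beliefs to score the database for the whole query.
        scores[name] = sum(beliefs) / len(beliefs)
    return sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on per-term document frequencies and database sizes, it can be computed without access to the underlying indexing technology, which matches the decoupling the abstract emphasises.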

Margaret E Connell - One of the best experts on this subject based on the ideXlab platform.

  • The impact of Database Selection on distributed searching
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000
    Co-Authors: Allison L. Powell, James C. French, Jamie Callan, Margaret E Connell, Charles L Viles
    Abstract:

    The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts — Database Selection, query processing, and results merging. In this paper we examine the effect of Database Selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good Database Selection can result in better retrieval effectiveness than can be achieved in a centralized Database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when Database Selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in Database Selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single Database) systems. Given a centralized Database and a good Selection mechanism, retrieval performance can be improved by decomposing that Database conceptually and employing a Selection step.

  • SIGMOD Conference - Automatic discovery of language models for text Databases
    Proceedings of the 1999 ACM SIGMOD international conference on Management of data - SIGMOD '99, 1999
    Co-Authors: Jamie Callan, Margaret E Connell
    Abstract:

    The proliferation of text Databases within large organizations and on the Internet makes it difficult for a person to know which Databases to search. Given language models that describe the contents of each Database, a Database Selection algorithm such as GlOSS can provide assistance by automatically selecting appropriate Databases for an information need. Current practice is that each Database provides its language model upon request, but this cooperative approach has important limitations. This paper demonstrates that cooperation is not required. Instead, the Database Selection service can construct its own language models by sampling Database contents via the normal process of running queries and retrieving documents. Although random sampling is not possible, it can be approximated with carefully selected queries. This sampling approach avoids the limitations that characterize the cooperative approach, and also enables additional capabilities. Experimental results demonstrate that accurate language models can be learned from a relatively small number of queries and documents.
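
The sampling idea in the last abstract above (building a language model by running queries and reading the retrieved documents) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `ToyDatabase` is a hypothetical in-memory stand-in for a remote search interface, and choosing the next query term uniformly at random from the vocabulary learned so far is just one possible query-selection strategy.

```python
import random
from collections import Counter


class ToyDatabase:
    """Stand-in for a remote text database that only exposes a search interface."""

    def __init__(self, documents):
        self._documents = [doc.lower().split() for doc in documents]

    def search(self, term, limit):
        """Return (doc_id, tokens) for up to `limit` documents containing `term`."""
        hits = [(i, doc) for i, doc in enumerate(self._documents) if term in doc]
        return hits[:limit]


def sample_language_model(database, seed_term, max_queries=20, docs_per_query=2, rng=None):
    """Learn term counts for `database` using only its search interface."""
    rng = rng or random.Random(0)
    learned = Counter()
    seen_ids = set()
    term = seed_term
    for _ in range(max_queries):
        for doc_id, tokens in database.search(term, docs_per_query):
            if doc_id not in seen_ids:  # count each document only once
                seen_ids.add(doc_id)
                learned.update(tokens)
        if not learned:
            break  # the seed term matched nothing; the caller needs another seed
        # Next query: a term drawn from the vocabulary learned so far.
        term = rng.choice(sorted(learned))
    return learned, seen_ids
```

The learned counts can then feed any Selection algorithm that expects a per-database language model, without the database having to publish one itself.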

Allison L. Powell - One of the best experts on this subject based on the ideXlab platform.

  • Metrics for evaluating Database Selection techniques
    World Wide Web, 2000
    Co-Authors: James C. French, Allison L. Powell
    Abstract:

    The increasing availability of online Databases and other information resources in digital libraries and on the World Wide Web has created the need for efficient and effective algorithms for selecting Databases to search. A number of techniques have been proposed for query routing or Database Selection. We have developed a methodology and metrics that can be used to directly compare competing techniques. They can also be used to isolate factors that influence the performance of these techniques so that we can better understand performance issues. In this paper we describe the methodology we have used to examine the performance of Database Selection algorithms such as gGlOSS and CORI. In addition we develop the theory behind a “random” Database Selection algorithm and show how it can be used to help analyze the behavior of realistic Database Selection algorithms.

  • The impact of Database Selection on distributed searching
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000
    Co-Authors: Allison L. Powell, James C. French, Jamie Callan, Margaret E Connell, Charles L Viles
    Abstract:

    The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts — Database Selection, query processing, and results merging. In this paper we examine the effect of Database Selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good Database Selection can result in better retrieval effectiveness than can be achieved in a centralized Database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when Database Selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in Database Selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single Database) systems. Given a centralized Database and a good Selection mechanism, retrieval performance can be improved by decomposing that Database conceptually and employing a Selection step.

  • Database Selection Using Document and Collection Term Frequencies
    2000
    Co-Authors: Tram Phan, Allison L. Powell, Nisanti Mohanraj, James C. French
    Abstract:

    We examine the impact of two types of information - document frequency (df) and collection term frequency (ctf) - on the effectiveness of Database Selection. We introduce a family of Database Selection algorithms based on this information, and compare their effectiveness to two existing Database Selection approaches, CORI and gGlOSS. We demonstrate that a simple Selection algorithm that uses only document frequency information is more effective than gGlOSS, and achieves effectiveness that is very close to that of CORI.

  • Comparing the performance of Database Selection algorithms
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999
    Co-Authors: James C. French, Allison L. Powell, Jamie Callan, Charles L Viles, Travis Emmitt, Kevin J Prey, Yun Mou
    Abstract:

    We compare the performance of two Database Selection algorithms reported in the literature. Their performance is compared using a common testbed designed specifically for Database Selection techniques. The testbed is a decomposition of the TREC/TIPSTER data into 236 subcollections. We present results of a recent investigation of the performance of the CORI algorithm and compare the performance with earlier work that examined the performance of gGlOSS. The Databases from our testbed were ranked using both the gGlOSS and CORI techniques and compared to the RBR baseline, a baseline derived from TREC relevance judgements. We examined the degree to which CORI and gGlOSS approximate this baseline. Our results confirm our earlier observation that the gGlOSS Ideal(l) ranks do not estimate relevance-based ranks well. We also find that CORI is a uniformly better estimator of relevance-based ranks than gGlOSS for the test environment used in this study. Part of the advantage of the CORI algorithm can be explained by a strong correlation between gGlOSS and a size-based baseline (SBR). We also find that CORI produces consistently accurate rankings on testbeds ranging from 100 to 921 sites. However, for a given level of recall, search effort appears to scale linearly with the number of Databases.
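
Comparing an estimated Database ranking against a relevance-based baseline, as the abstracts above describe, can be sketched with a recall-like measure. This is an illustrative assumption rather than the exact metric from the papers: `merit` is taken to be the number of relevant documents each database holds for a query, and the score is the merit accumulated by the top n estimated databases divided by the merit accumulated by the top n databases of the baseline.

```python
def relevance_based_ranking(merit):
    """Baseline ranking: databases ordered by decreasing merit (ties broken by name)."""
    return sorted(merit, key=lambda db: (-merit[db], db))


def recall_at_n(estimated_ranking, merit, n):
    """Merit captured by the top-n estimated databases, relative to the best possible."""
    baseline = relevance_based_ranking(merit)
    best = sum(merit[db] for db in baseline[:n])
    if best == 0:
        return 0.0  # no database holds relevant documents for this query
    got = sum(merit.get(db, 0) for db in estimated_ranking[:n])
    return got / best
```

A value of 1.0 at a given n means the Selection algorithm did as well as the baseline for that many databases; plotting the value against n shows how quickly an algorithm approaches the baseline as more sites are searched.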