Relevance Judgment

The Experts below are selected from a list of 2,907 Experts worldwide, ranked by the ideXlab platform.

Matthew Lease - One of the best experts on this subject based on the ideXlab platform.

  • Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement?
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018
    Co-Authors: Mucahid Kutlu, Tyler Mcdonnell, Yassmine Barkallah, Tamer Elsayed, Matthew Lease
    Abstract:

    While crowdsourcing offers a low-cost, scalable way to collect Relevance Judgments, lack of transparency with remote crowd work has limited understanding about the quality of collected Judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide rationales for each Relevance Judgment (McDonnell et al., 2016). In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd Judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging, rather than across topics, finding this improves quality. Overall, we show that crowd Judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs. objective forms of disagreement, as well as the relative importance of each disagreement cause, and we present a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd Judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.
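    A minimal illustrative sketch (in Python; not the authors' code, and every name and data layout below is hypothetical) of the pipeline the abstract implies: aggregate the redundant crowd labels per topic-document pair by majority vote, then rank systems by mean average precision over the resulting judgments.

    from collections import Counter, defaultdict

    def majority_vote(labels):
        # most common label among the redundant crowd judgments for one pair
        return Counter(labels).most_common(1)[0][0]

    def average_precision(ranked_docs, relevant):
        hits, score = 0, 0.0
        for i, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                score += hits / i
        return score / len(relevant) if relevant else 0.0

    def rank_systems(crowd_labels, runs):
        # crowd_labels: {(topic, doc): [label, ...]}; runs: {system: {topic: [doc, ...]}}
        qrels = defaultdict(set)
        for (topic, doc), labels in crowd_labels.items():
            if majority_vote(labels) == "relevant":
                qrels[topic].add(doc)
        scores = {system: sum(average_precision(docs, qrels[topic])
                              for topic, docs in topics.items()) / len(topics)
                  for system, topics in runs.items()}
        return sorted(scores, key=scores.get, reverse=True)  # best system first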

Zhiwei Chen - One of the best experts on this subject based on the ideXlab platform.

  • Relevance Judgment: What Do Information Users Consider Beyond Topicality?
    Journal of the American Society for Information Science and Technology, 2006
    Co-Authors: Zhiwei Chen
    Abstract:

    How does an information user perceive a document as relevant? The literature on Relevance has identified numerous factors affecting such a Judgment. Taking a cognitive approach, this study focuses on the criteria users employ in making Relevance Judgments beyond topicality. On the basis of Grice's theory of communication, we propose a five-factor model of Relevance: topicality, novelty, reliability, understandability, and scope. Data are collected from a semicontrolled survey and analyzed following a psychometric procedure. Topicality and novelty are found to be the two essential Relevance criteria. Understandability and reliability are also found to be significant, but scope is not. The theoretical and practical implications of this study are discussed.
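    Purely for illustration, one naive way to turn the five factors into an overall Relevance score is a weighted combination; the weights below are invented to echo the reported findings (topicality and novelty dominant, scope negligible) and are not taken from the paper, which estimates factor importance psychometrically rather than prescribing a formula.

    # hypothetical weights, chosen only to reflect the reported ordering of factors
    FACTOR_WEIGHTS = {
        "topicality": 0.35,          # essential criterion
        "novelty": 0.30,             # essential criterion
        "reliability": 0.15,         # significant
        "understandability": 0.15,   # significant
        "scope": 0.05,               # not found significant
    }

    def relevance_score(ratings):
        # ratings: {factor: value in [0, 1]} -> weighted overall relevance in [0, 1]
        return sum(FACTOR_WEIGHTS[f] * ratings.get(f, 0.0) for f in FACTOR_WEIGHTS)

    print(relevance_score({"topicality": 0.9, "novelty": 0.7, "reliability": 0.8,
                           "understandability": 0.9, "scope": 0.5}))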

  • User-Oriented Relevance Judgment: A Conceptual Model
    Hawaii International Conference on System Sciences, 2005
    Co-Authors: Zhiwei Chen
    Abstract:

    The concept of Relevance has been heatedly debated in the last decade. Not satisfied with the narrow and technical definition of system Relevance, researchers have turned to the subjective and situational aspects of this concept. How does a user perceive a document as relevant? The literature on Relevance has identified numerous factors affecting such Judgments. Taking a cognitive approach, this study focuses on the criteria users employ in making Relevance Judgments. Based on Grice's theory of communication, this paper proposes a five-factor model of Relevance: topicality, novelty, reliability, understandability, and scope. Data are collected from a semi-controlled survey study and analyzed following a psychometric procedure. The results support topicality and novelty as the key Relevance criteria. Theoretical and practical implications of this study are discussed.

Sally Mcclean - One of the best experts on this subject based on the ideXlab platform.

  • Several Methods of Ranking Retrieval Systems with Partial Relevance Judgment
    International Conference on Digital Information Management, 2007
    Co-Authors: Sally Mcclean
    Abstract:

    Some measures, such as mean average precision and recall level precision, are considered good system-oriented measures because they concern both precision and recall, the two important aspects of effectiveness evaluation for information retrieval systems. However, these measures suffer from some shortcomings when partial Relevance Judgment is used. In this paper, we discuss how to rank retrieval systems under partial Relevance Judgment, which is common in major retrieval evaluation events such as the TREC conferences and NTCIR workshops. Four system-oriented measures are discussed: mean average precision, recall level precision, normalized discounted cumulative gain, and normalized average precision over all documents. Our investigation shows that averaging values over a set of queries may not be the most reliable approach to ranking a group of retrieval systems. Some alternatives, such as the Borda count, Condorcet voting, and the zero-one normalization method, are investigated. Experimental results are also presented for the evaluation of these methods.
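    A minimal sketch, not the authors' implementation, of the Borda-count alternative mentioned above: each query ranks the systems by some per-query effectiveness score, a system earns one point for every system ranked below it on that query, and the point totals give the final system ordering.

    from collections import defaultdict

    def borda_rank(per_query_scores):
        # per_query_scores: {query: {system: effectiveness score}} -> systems, best first
        points = defaultdict(int)
        for scores in per_query_scores.values():
            ordered = sorted(scores, key=scores.get, reverse=True)
            for position, system in enumerate(ordered):
                points[system] += len(ordered) - 1 - position
        return sorted(points, key=points.get, reverse=True)

    print(borda_rank({
        "q1": {"A": 0.52, "B": 0.40, "C": 0.31},
        "q2": {"A": 0.20, "B": 0.45, "C": 0.35},
    }))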

  • Information Retrieval Evaluation with Partial Relevance Judgment
    British National Conference on Databases, 2006
    Co-Authors: Sally Mcclean
    Abstract:

    Mean Average Precision has been widely used by researchers in information retrieval evaluation events such as TREC, and it is believed to be a good system measure because of its sensitivity and reliability. However, its drawbacks with regard to partial Relevance Judgment have been largely ignored. In many cases, partial Relevance Judgment is probably the only reasonable solution due to the large document collections involved. In this paper, we address this issue through analysis and experiment. Our investigation shows that when only partial Relevance Judgment is available, mean average precision suffers from several drawbacks: inaccurate values, the lack of an explicit interpretation, and sensitivity to the evaluation environment. Further, mean average precision is not superior in sensitivity or reliability to some other measures, such as precision at a given document level, even though these are believed to be its major advantages. Our experiments also suggest that average precision over all documents would be a good measure for such a situation.
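    The instability discussed here can be illustrated with a toy example (ours, not the paper's): the same ranked list receives different average-precision values depending on how unjudged documents are handled, e.g. counted as non-relevant versus dropped from the list.

    def average_precision(ranked, qrels):
        # qrels: {doc: True/False}; documents absent from qrels are unjudged
        hits, score, n_rel = 0, 0.0, sum(qrels.values())
        for i, doc in enumerate(ranked, start=1):
            if qrels.get(doc, False):
                hits += 1
                score += hits / i
        return score / n_rel if n_rel else 0.0

    qrels = {"d1": True, "d3": True, "d4": False}      # d2 and d5 are unjudged
    ranked = ["d1", "d2", "d3", "d4", "d5"]

    ap_unjudged_nonrelevant = average_precision(ranked, qrels)
    ap_condensed_list = average_precision([d for d in ranked if d in qrels], qrels)
    print(ap_unjudged_nonrelevant, ap_condensed_list)  # the two conventions disagree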

  • Evaluation of system measures for incomplete Relevance Judgment in IR
    Lecture Notes in Computer Science, 2006
    Co-Authors: Sally Mcclean
    Abstract:

    Incomplete Relevance Judgment has become the norm for the evaluation of some major information retrieval evaluation events such as TREC, but its effect on some system measures has not been well understood. In this paper, we evaluate four system measures under incomplete Relevance Judgment: mean average precision, R-precision, normalized average precision over all documents, and normalized discounted cumulative gain. Among them, the measure of normalized average precision over all documents is introduced, and both mean average precision and R-precision are generalized for graded Relevance Judgment. These four measures share a common characteristic: complete Relevance Judgment is required to calculate their accurate values. We empirically investigate these measures through extensive experiments on TREC data to determine the effect of incomplete Relevance Judgment on them. From these experiments, we conclude that incomplete Relevance Judgment significantly affects the values of all four measures. When the pooling method is used, as in TREC, the more incomplete the Relevance Judgment is, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, normalized discounted cumulative gain and normalized average precision over all documents are the most reliable but least sensitive measures, and R-precision is in the middle.
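    For reference, minimal binary-relevance implementations (ours, not the paper's) of two of the four measures discussed, R-precision and normalized discounted cumulative gain, both of which assume the full relevant set is known:

    import math

    def r_precision(ranked, relevant):
        # precision at rank R, where R is the number of relevant documents
        r = len(relevant)
        return sum(1 for doc in ranked[:r] if doc in relevant) / r if r else 0.0

    def ndcg(ranked, relevant, k=None):
        # binary-gain NDCG: DCG of the ranking divided by the ideal DCG
        k = k or len(ranked)
        dcg = sum(1.0 / math.log2(i + 1)
                  for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
        ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
        return dcg / ideal if ideal else 0.0

    relevant = {"d1", "d3", "d6"}
    print(r_precision(["d1", "d2", "d3", "d4"], relevant),
          ndcg(["d1", "d2", "d3", "d4"], relevant))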

Mucahid Kutlu - One of the best experts on this subject based on the ideXlab platform.

  • Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement?
    International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018
    Co-Authors: Mucahid Kutlu, Tyler Mcdonnell, Yassmine Barkallah, Tamer Elsayed, Matthew Lease
    Abstract:

    While crowdsourcing offers a low-cost, scalable way to collect Relevance Judgments, lack of transparency with remote crowd work has limited understanding about the quality of collected Judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide rationales for each Relevance Judgment (McDonnell et al., 2016). In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd Judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging, rather than across topics, finding this improves quality. Overall, we show that crowd Judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs. objective forms of disagreement, as well as the relative importance of each disagreement cause, and we present a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd Judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.

Dawei Song - One of the best experts on this subject based on the ideXlab platform.

  • QI - Investigating Bell Inequalities for Multidimensional Relevance Judgments in Information Retrieval
    Quantum Interaction, 2019
    Co-Authors: Sagar Uprety, Dimitrios Gkoumas, Dawei Song
    Abstract:

    Relevance Judgment in Information Retrieval is influenced by multiple factors. These include not only the topicality of the documents but also other user-oriented factors such as trust and user interest. Recent works have identified and classified these various factors into seven dimensions of Relevance. In a previous work, these Relevance dimensions were quantified and the user's cognitive state with respect to a document was represented as a state vector in a Hilbert space, with each Relevance dimension representing a basis. It was observed that some Relevance dimensions are incompatible for certain documents when making a Judgment. Since incompatibility is a fundamental feature of Quantum Theory, this motivated us to test the Quantum nature of Relevance Judgments using Bell-type inequalities. However, none of the Bell-type inequalities tested showed any violation. We discuss our methodology for constructing incompatible bases for documents from real-world query log data, the experiments to test Bell inequalities on this dataset, and possible reasons for the lack of violation.
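    The arithmetic of a CHSH-type Bell test, shown as a sketch: given expectation values of ±1-valued outcomes for two "measurement settings" per side (here imagined to be pairs of Relevance dimensions), the classical bound is |S| ≤ 2. How the settings and correlations are constructed from query-log data is the paper's methodology and is not reproduced here; the numbers below are made up.

    def chsh_statistic(E):
        # E: {(a, b): expectation in [-1, 1]} for settings a in {a1, a2}, b in {b1, b2}
        return E[("a1", "b1")] + E[("a1", "b2")] + E[("a2", "b1")] - E[("a2", "b2")]

    E = {("a1", "b1"): 0.42, ("a1", "b2"): 0.38,
         ("a2", "b1"): 0.40, ("a2", "b2"): -0.31}      # invented correlations
    S = chsh_statistic(E)
    print(S, "violates the classical bound" if abs(S) > 2 else "no violation (|S| <= 2)")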

  • Investigating Bell Inequalities for Multidimensional Relevance Judgments in Information Retrieval.
    arXiv: Information Retrieval, 2018
    Co-Authors: Sagar Uprety, Dimitrios Gkoumas, Dawei Song
    Abstract:

    Relevance Judgment in Information Retrieval is influenced by multiple factors. These include not only the topicality of the documents but also other user-oriented factors such as trust and user interest. Recent works have identified and classified these various factors into seven dimensions of Relevance. In a previous work, these Relevance dimensions were quantified and the user's cognitive state with respect to a document was represented as a state vector in a Hilbert space, with each Relevance dimension representing a basis. It was observed that some Relevance dimensions are incompatible for certain documents when making a Judgment. Since incompatibility is a fundamental feature of Quantum Theory, this motivated us to test the Quantum nature of Relevance Judgments using Bell-type inequalities. However, none of the Bell-type inequalities tested showed any violation. We discuss our methodology for constructing incompatible bases for documents from real-world query log data, the experiments to test Bell inequalities on this dataset, and possible reasons for the lack of violation.

  • Investigating Order Effects in Multidimensional Relevance Judgment Using Query Logs
    International Conference on the Theory of Information Retrieval, 2018
    Co-Authors: Sagar Uprety, Dawei Song
    Abstract:

    There is a growing body of research investigating how Relevance Judgment in IR is influenced by multiple factors or dimensions. At the same time, Order Effects in sequential decision making have been quantitatively detected and studied in Mathematical Psychology. Combining the two phenomena, several user studies have investigated Order Effects, and thus incompatibility, among different dimensions of Relevance. In this work, we propose a methodology for carrying out such an investigation at large scale on real-world data using the query logs of a web search engine, and devise a test to detect the presence of irrational user behavior in the Relevance Judgment of documents. We further validate this behavior through a Quantum Cognitive explanation of the Order and Context effects.
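    As a sketch of the kind of test involved (the grouping of query-log sessions into presentation orders follows the paper's methodology and is not reproduced here; the counts below are invented), an order effect can be checked by comparing the rate of "relevant" Judgments under the two orders with a two-proportion z-test.

    import math

    def two_proportion_z(x1, n1, x2, n2):
        # z-statistic for the difference between two proportions x1/n1 and x2/n2
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # invented counts: "relevant" judgments / sessions under each presentation order
    z = two_proportion_z(x1=620, n1=1000, x2=540, n2=1000)
    print(z)  # |z| > 1.96 would indicate a significant order effect at the 5% level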

  • Investigating the Dynamic Decision Mechanisms of Users' Relevance Judgment for Information Retrieval via Log Analysis
    Pacific Rim International Conference on Artificial Intelligence, 2018
    Co-Authors: Dawei Song, Pengqing Zhang, Yazhou Zhang
    Abstract:

    Measuring the Relevance of documents with respect to a user's query is at the heart of information retrieval (IR), and the user's Relevance Judgment criteria have been recognized as multi-dimensional. A set of Relevance dimensions considered critical factors in document Relevance Judgment has been investigated, including topicality, novelty, and reliability. However, most existing work focuses on individual Relevance dimensions, neglecting how different dimensions interact with each other to influence the overall Relevance Judgment in real-world search scenarios. This paper takes an initial step toward filling that gap. Specifically, we divide the seven Relevance dimensions of an enriched Multidimensional User Relevance Model (MURM) into three categories according to three main requirements for document Relevance: document content, document quality, and personalization. We then exploit the Learning to Rank framework to conduct document ranking experiments on a query log dataset from a prominent search engine. The experimental results indicate the existence of an order effect between different dimensions, and suggest that considering dimensions from different categories in different orders for document Relevance Judgment can lead to distinct search results. Our findings provide valuable insights for building more intelligent and user-centric information retrieval systems, and could potentially benefit other natural language processing tasks that involve decision making from multiple perspectives.
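    One simple way (ours, not the paper's Learning to Rank setup) to see how applying the dimension categories in different orders can change the result set is a re-ranking cascade that keeps the top fraction of documents by each category's score in turn; different category orders can let different documents survive.

    def cascade(docs, order, keep_ratio=0.5):
        # docs: {doc_id: {category: score}}; order: list of category names to apply in turn
        surviving = list(docs)
        for category in order:
            surviving.sort(key=lambda d: docs[d][category], reverse=True)
            surviving = surviving[:max(1, int(len(surviving) * keep_ratio))]
        return surviving

    docs = {
        "d1": {"content": 0.9, "quality": 0.2, "personalization": 0.4},
        "d2": {"content": 0.7, "quality": 0.9, "personalization": 0.1},
        "d3": {"content": 0.6, "quality": 0.8, "personalization": 0.9},
        "d4": {"content": 0.3, "quality": 0.7, "personalization": 0.8},
    }
    print(cascade(docs, ["content", "quality", "personalization"]))  # -> ['d2']
    print(cascade(docs, ["personalization", "quality", "content"]))  # -> ['d3']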