Video Retrieval


The experts below are selected from a list of 24,276 experts worldwide, ranked by the ideXlab platform.

Alexander G Hauptmann - One of the best experts on this subject based on the ideXlab platform.

  • Beyond Audio and Video Retrieval: Towards Multimedia Summarization
    International Conference on Multimedia Retrieval, 2012
    Co-Authors: Duo Ding, Michael G Christel, Florian Metze, Shourabh Rawat, Peter Schulam, Susanne Burger, Ehsan Younessian, Lei Bao, Alexander G Hauptmann
    Abstract:

    Given the deluge of multimedia content that is becoming available over the Internet, it is increasingly important to be able to effectively examine and organize these large stores of information in ways that go beyond browsing or collaborative filtering. In this paper we review previous work on audio and video processing, and define the task of Topic-Oriented Multimedia Summarization (TOMS) using natural language generation: given a set of automatically extracted features from a video (such as visual concepts and ASR transcripts), a TOMS system will automatically generate a paragraph of natural language ("a recounting") that summarizes the important information in a video belonging to a certain topic area and explains why the video was matched and retrieved. We see this as a first step towards systems that will be able to discriminate visually similar but semantically different videos, compare two videos and provide textual output, or summarize a large number of videos at once. In this paper, we introduce our approach to solving the TOMS problem. We extract visual concept features and ASR transcription features from a given video, and develop a template-based natural language generation system that produces a textual recounting based on the extracted features. We also propose possible experimental designs for continuously evaluating and improving TOMS systems, and present results of a pilot evaluation of our initial system.
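
    A minimal sketch in Python of the template-based recounting idea described above. The concept names, confidence threshold, and sentence templates are invented for illustration; the paper's actual feature extractors and templates are not shown.

    def generate_recounting(topic, concepts, asr_keywords, threshold=0.5):
        """Fill a fixed template from detected concepts and ASR keywords.

        concepts: dict mapping concept name -> detector confidence in [0, 1]
        asr_keywords: list of salient words from the speech transcript
        """
        # Keep only concepts the detectors are reasonably confident about,
        # strongest first.
        confident = [c for c, s in sorted(concepts.items(), key=lambda kv: -kv[1])
                     if s >= threshold]
        sentences = [f"This video appears to match the topic '{topic}'."]
        if confident:
            sentences.append("It shows " + ", ".join(confident[:-1]) +
                             (" and " if len(confident) > 1 else "") +
                             confident[-1] + ".")
        if asr_keywords:
            sentences.append("The speech track mentions " +
                             ", ".join(asr_keywords) + ".")
        return " ".join(sentences)

    print(generate_recounting(
        "board trick",
        {"skateboard": 0.91, "person": 0.88, "outdoor": 0.75, "dog": 0.20},
        ["ollie", "ramp"]))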

  • Video Retrieval Based on Semantic Concepts
    Proceedings of the IEEE, 2008
    Co-Authors: Alexander G Hauptmann, Michael G Christel
    Abstract:

    An approach using many intermediate semantic concepts is proposed, with the potential to bridge the semantic gap between what a color, shape, and texture-based "low-level" image analysis can extract from video and what users really want to find, most likely using text descriptions of their information needs. Semantic concepts such as cars, planes, roads, people, animals, and different types of scenes (outdoor, night time, etc.) can be automatically detected in video with reasonable accuracy. This leads us to ask how they can be used automatically, and how a user (or a retrieval system) translates the user's information need into a selection of related concepts, from the large list of available concepts, that would help find the relevant video clips. We illustrate how semantic concept retrieval can be automatically exploited by mapping queries into query classes and through pseudo-relevance feedback. We also provide evidence of how semantic concepts can be utilized by users in interactive retrieval, through interfaces that provide affordances of explicit concept selection and search, concept filtering, and relevance feedback. How many concepts we actually need, and how accurately they need to be detected and linked through various relationships, is specified in the ontology structure.
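
    As a hedged illustration of how concept detectors can be exploited at query time, the Python sketch below maps a text query to related concepts and ranks shots by a weighted mix of text-retrieval score and concept confidences. The mapping table, weights, and data are assumptions, not the paper's learned query classes.

    QUERY_TO_CONCEPTS = {            # hypothetical query-to-concept mapping
        "cars on a road": ["car", "road", "outdoor"],
    }

    def rank_shots(query, shots, text_weight=0.6, concept_weight=0.4):
        """shots: list of dicts with 'id', 'text_score', and 'concepts'
        (concept name -> detector confidence)."""
        related = QUERY_TO_CONCEPTS.get(query, [])
        def score(shot):
            concept_score = (sum(shot["concepts"].get(c, 0.0) for c in related)
                             / max(len(related), 1))
            return text_weight * shot["text_score"] + concept_weight * concept_score
        return sorted(shots, key=score, reverse=True)

    shots = [
        {"id": "s1", "text_score": 0.2,
         "concepts": {"car": 0.9, "road": 0.8, "outdoor": 0.7}},
        {"id": "s2", "text_score": 0.7, "concepts": {"person": 0.9}},
    ]
    # s1 wins despite its weaker text score, thanks to matching concepts.
    print([s["id"] for s in rank_shots("cars on a road", shots)])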

  • Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study with Broadcast News
    IEEE Transactions on Multimedia, 2007
    Co-Authors: Alexander G Hauptmann, Michael G Christel, Rong Yan, Weihao Lin, Howard D Wactlar
    Abstract:

    A number of researchers have been building high-level semantic concept detectors, such as outdoors, face, and building, to help with semantic video retrieval. Our goal is to examine how many concepts would be needed, and how they should be selected and used. Simulating the performance of video retrieval under different assumptions of concept detection accuracy, we find that good retrieval can be achieved even when detection accuracy is low, if sufficiently many concepts are combined. We also derive suggestions regarding the types of concepts that would be most helpful for a large concept lexicon. Since our user study finds that people cannot predict which concepts will help their query, we also suggest ways to find the best concepts to use. Ultimately, this paper concludes that "concept-based" video retrieval with fewer than 5000 concepts, detected with a minimal accuracy of 10% mean average precision, is likely to provide high-accuracy results in broadcast news retrieval.
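
    The simulation methodology can be illustrated with a toy experiment: corrupt oracle relevance labels into noisy detector scores at a chosen quality level, rank clips by the noisy scores, and measure average precision. Everything below (collection size, noise model) is an invented stand-in for the paper's setup.

    import random

    def average_precision(ranked_relevance):
        """ranked_relevance: list of 0/1 flags in ranked order."""
        hits, precisions = 0, []
        for i, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)
        return sum(precisions) / max(hits, 1)

    def simulate(num_clips=1000, relevant_fraction=0.05, detector_noise=0.9,
                 seed=0):
        rng = random.Random(seed)
        relevant = [1 if rng.random() < relevant_fraction else 0
                    for _ in range(num_clips)]
        # Noisy detector score: the true label blended with uniform noise.
        scores = [(1 - detector_noise) * r + detector_noise * rng.random()
                  for r in relevant]
        ranked = [r for _, r in sorted(zip(scores, relevant), reverse=True)]
        return average_precision(ranked)

    for noise in (0.2, 0.5, 0.9):
        print(f"detector_noise={noise}: AP={simulate(detector_noise=noise):.3f}")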

  • How Many High-Level Concepts Will Fill the Semantic Gap in News Video Retrieval?
    Conference on Image and Video Retrieval, 2007
    Co-Authors: Alexander G Hauptmann, Rong Yan, Weihao Lin
    Abstract:

    A number of researchers have been building high-level semantic concept detectors, such as outdoors, face, and building, to help with semantic video retrieval. Using the TRECVID video collection and LSCOM truth annotations for 300 concepts, we simulate the performance of video retrieval under different assumptions of concept detection accuracy. Even low detection accuracy provides good retrieval results when sufficiently many concepts are used. Extrapolating under reasonable assumptions, this paper arrives at the conclusion that "concept-based" video retrieval with fewer than 5000 concepts, detected with a minimal accuracy of 10% mean average precision, is likely to provide high-accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection. We also derive evidence that it should be feasible to find sufficiently many new, useful concepts that would be helpful for retrieval.

  • The Use and Utility of High-Level Semantic Features in Video Retrieval
    Conference on Image and Video Retrieval, 2005
    Co-Authors: Michael G Christel, Alexander G Hauptmann
    Abstract:

    This paper investigates the applicability of high-level semantic features for video retrieval using the benchmarked data from TRECVID 2003 and 2004, addressing the contributions of features like outdoor, face, and animal to retrieval, and whether users can correctly decide which features to apply for a given need. Pooled truth data gives evidence that some topics would benefit from features. A study with 12 subjects found that people often disagree on the relevance of a feature to a particular topic, including disagreement within the 8% of positive feature-topic associations strongly supported by truth data. When subjects concur, their judgments are correct. For the 51 topic-feature pairings identified as significant, we investigate the best interactive search submissions, showing that for 29 pairs, topic performance would have improved had users had access to ideal classifiers for those features. The benefits derive from generic features applied to generic topics (27 pairs) and, in one case, a specific feature applied to a specific topic. Re-ranking submitted shots based on features shows promise for automatic search runs, but not for interactive runs, where a person has already taken care to rank shots well.
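
    A small sketch of the re-ranking idea tested in the paper: shots that an (assumed ideal) feature classifier accepts are promoted ahead of the rest, preserving the original order within each group. The shot ids and feature set are made up.

    def rerank_by_feature(ranked_shot_ids, feature_positive):
        """feature_positive: set of shot ids the feature classifier accepts."""
        positives = [s for s in ranked_shot_ids if s in feature_positive]
        negatives = [s for s in ranked_shot_ids if s not in feature_positive]
        return positives + negatives  # stable within each partition

    run = ["shot3", "shot7", "shot1", "shot9"]
    outdoor_shots = {"shot1", "shot9"}
    print(rerank_by_feature(run, outdoor_shots))
    # ['shot1', 'shot9', 'shot3', 'shot7']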

Michael G Christel - One of the best experts on this subject based on the ideXlab platform.

  • Beyond Audio and Video Retrieval: Towards Multimedia Summarization
    International Conference on Multimedia Retrieval, 2012
    Co-Authors: Duo Ding, Michael G Christel, Florian Metze, Shourabh Rawat, Peter Schulam, Susanne Burger, Ehsan Younessian, Lei Bao, Alexander G Hauptmann
    Abstract:

    Given the deluge of multimedia content that is becoming available over the Internet, it is increasingly important to be able to effectively examine and organize these large stores of information in ways that go beyond browsing or collaborative filtering. In this paper we review previous work on audio and video processing, and define the task of Topic-Oriented Multimedia Summarization (TOMS) using natural language generation: given a set of automatically extracted features from a video (such as visual concepts and ASR transcripts), a TOMS system will automatically generate a paragraph of natural language ("a recounting") that summarizes the important information in a video belonging to a certain topic area and explains why the video was matched and retrieved. We see this as a first step towards systems that will be able to discriminate visually similar but semantically different videos, compare two videos and provide textual output, or summarize a large number of videos at once. In this paper, we introduce our approach to solving the TOMS problem. We extract visual concept features and ASR transcription features from a given video, and develop a template-based natural language generation system that produces a textual recounting based on the extracted features. We also propose possible experimental designs for continuously evaluating and improving TOMS systems, and present results of a pilot evaluation of our initial system.

  • Video Retrieval Based on Semantic Concepts
    Proceedings of the IEEE, 2008
    Co-Authors: Alexander G Hauptmann, Michael G Christel
    Abstract:

    An approach using many intermediate semantic concepts is proposed, with the potential to bridge the semantic gap between what a color, shape, and texture-based "low-level" image analysis can extract from video and what users really want to find, most likely using text descriptions of their information needs. Semantic concepts such as cars, planes, roads, people, animals, and different types of scenes (outdoor, night time, etc.) can be automatically detected in video with reasonable accuracy. This leads us to ask how they can be used automatically, and how a user (or a retrieval system) translates the user's information need into a selection of related concepts, from the large list of available concepts, that would help find the relevant video clips. We illustrate how semantic concept retrieval can be automatically exploited by mapping queries into query classes and through pseudo-relevance feedback. We also provide evidence of how semantic concepts can be utilized by users in interactive retrieval, through interfaces that provide affordances of explicit concept selection and search, concept filtering, and relevance feedback. How many concepts we actually need, and how accurately they need to be detected and linked through various relationships, is specified in the ontology structure.

  • Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study with Broadcast News
    IEEE Transactions on Multimedia, 2007
    Co-Authors: Alexander G Hauptmann, Michael G Christel, Rong Yan, Weihao Lin, Howard D Wactlar
    Abstract:

    A number of researchers have been building high-level semantic concept detectors, such as outdoors, face, and building, to help with semantic video retrieval. Our goal is to examine how many concepts would be needed, and how they should be selected and used. Simulating the performance of video retrieval under different assumptions of concept detection accuracy, we find that good retrieval can be achieved even when detection accuracy is low, if sufficiently many concepts are combined. We also derive suggestions regarding the types of concepts that would be most helpful for a large concept lexicon. Since our user study finds that people cannot predict which concepts will help their query, we also suggest ways to find the best concepts to use. Ultimately, this paper concludes that "concept-based" video retrieval with fewer than 5000 concepts, detected with a minimal accuracy of 10% mean average precision, is likely to provide high-accuracy results in broadcast news retrieval.

  • The Use and Utility of High-Level Semantic Features in Video Retrieval
    Conference on Image and Video Retrieval, 2005
    Co-Authors: Michael G Christel, Alexander G Hauptmann
    Abstract:

    This paper investigates the applicability of high-level semantic features for video retrieval using the benchmarked data from TRECVID 2003 and 2004, addressing the contributions of features like outdoor, face, and animal to retrieval, and whether users can correctly decide which features to apply for a given need. Pooled truth data gives evidence that some topics would benefit from features. A study with 12 subjects found that people often disagree on the relevance of a feature to a particular topic, including disagreement within the 8% of positive feature-topic associations strongly supported by truth data. When subjects concur, their judgments are correct. For the 51 topic-feature pairings identified as significant, we investigate the best interactive search submissions, showing that for 29 pairs, topic performance would have improved had users had access to ideal classifiers for those features. The benefits derive from generic features applied to generic topics (27 pairs) and, in one case, a specific feature applied to a specific topic. Re-ranking submitted shots based on features shows promise for automatic search runs, but not for interactive runs, where a person has already taken care to rank shots well.

  • Successful Approaches in the TREC Video Retrieval Evaluations
    ACM Multimedia, 2004
    Co-Authors: Alexander G Hauptmann, Michael G Christel
    Abstract:

    This paper reviews successful approaches in evaluations of video retrieval over the last three years. The task involves the search and retrieval of shots from MPEG digitized video recordings using a combination of automatic speech, image, and video analysis and information retrieval technologies. The search evaluations are grouped into interactive (with a human in the loop) and non-interactive (where the human merely enters the query into the system) submissions. Most non-interactive search approaches have relied extensively on text retrieval, and only recently have image-based features contributed reliably to improved search performance. Interactive approaches have substantially outperformed all non-interactive approaches, with most systems relying heavily on the user's ability to refine queries and reject spurious answers. We examine both the successful automatic search approaches and the user interface techniques that have enabled high-performance video retrieval.

Rong Yan - One of the best experts on this subject based on the ideXlab platform.

  • Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study with Broadcast News
    IEEE Transactions on Multimedia, 2007
    Co-Authors: Alexander G Hauptmann, Michael G Christel, Rong Yan, Weihao Lin, Howard D Wactlar
    Abstract:

    A number of researchers have been building high-level semantic concept detectors, such as outdoors, face, and building, to help with semantic video retrieval. Our goal is to examine how many concepts would be needed, and how they should be selected and used. Simulating the performance of video retrieval under different assumptions of concept detection accuracy, we find that good retrieval can be achieved even when detection accuracy is low, if sufficiently many concepts are combined. We also derive suggestions regarding the types of concepts that would be most helpful for a large concept lexicon. Since our user study finds that people cannot predict which concepts will help their query, we also suggest ways to find the best concepts to use. Ultimately, this paper concludes that "concept-based" video retrieval with fewer than 5000 concepts, detected with a minimal accuracy of 10% mean average precision, is likely to provide high-accuracy results in broadcast news retrieval.

  • How Many High-Level Concepts Will Fill the Semantic Gap in News Video Retrieval?
    Conference on Image and Video Retrieval, 2007
    Co-Authors: Alexander G Hauptmann, Rong Yan, Weihao Lin
    Abstract:

    A number of researchers have been building high-level semantic concept detectors, such as outdoors, face, and building, to help with semantic video retrieval. Using the TRECVID video collection and LSCOM truth annotations for 300 concepts, we simulate the performance of video retrieval under different assumptions of concept detection accuracy. Even low detection accuracy provides good retrieval results when sufficiently many concepts are used. Extrapolating under reasonable assumptions, this paper arrives at the conclusion that "concept-based" video retrieval with fewer than 5000 concepts, detected with a minimal accuracy of 10% mean average precision, is likely to provide high-accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection. We also derive evidence that it should be feasible to find sufficiently many new, useful concepts that would be helpful for retrieval.

  • Learning Query-Class Dependent Weights in Automatic Video Retrieval
    ACM Multimedia, 2004
    Co-Authors: Rong Yan, Jun Yang, Alexander G Hauptmann
    Abstract:

    Combining retrieval results from multiple modalities plays a crucial role in video retrieval systems, especially automatic systems that operate without user feedback or query expansion. However, most current systems utilize only query-independent combination or rely on explicit user weighting. In this work, we propose using query-class dependent weights within a hierarchical mixture-of-experts framework to combine multiple retrieval results. We first classify each user query into one of four predefined categories and then aggregate the retrieval results with query-class associated weights, which can be learned efficiently from development data and generalized easily to unseen queries. Our experimental results demonstrate that performance with query-class dependent weights can considerably surpass that with query-independent weights.
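
    The core combination step can be sketched as follows. The full paper uses a hierarchical mixture-of-experts with learned weights, whereas the query classifier, classes, and weight table here are invented placeholders.

    CLASS_WEIGHTS = {                 # per-class weights, learned on dev data
        "named_person": {"text": 0.8, "face": 0.15, "image": 0.05},
        "sports":       {"text": 0.3, "face": 0.0,  "image": 0.7},
        "general":      {"text": 0.5, "face": 0.1,  "image": 0.4},
    }

    def classify_query(query):
        """Toy stand-in for the paper's query classifier."""
        q = query.lower()
        if any(w in q for w in ("clinton", "arafat", "person")):
            return "named_person"
        if any(w in q for w in ("basketball", "soccer", "hockey")):
            return "sports"
        return "general"

    def combined_score(query, modality_scores):
        """modality_scores: dict modality -> retrieval score for one shot."""
        weights = CLASS_WEIGHTS[classify_query(query)]
        return sum(weights[m] * modality_scores.get(m, 0.0) for m in weights)

    print(combined_score("find shots of Bill Clinton",
                         {"text": 0.9, "face": 0.7, "image": 0.3}))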

  • Negative Pseudo-Relevance Feedback in Content-Based Video Retrieval
    ACM Multimedia, 2003
    Co-Authors: Rong Yan, Alexander G Hauptmann, Rong Jin
    Abstract:

    Video information retrieval requires a system to find information relevant to a query that may be represented simultaneously in different ways: through a text description, audio, still images, and/or video sequences. We present a novel approach that uses pseudo-relevance feedback from retrieved items that are NOT similar to the query items, without requiring further user feedback. We provide insight into this approach using a statistical model and suggest a score combination scheme via posterior probability estimation. An evaluation on the 2002 TREC Video Track queries shows that this technique can improve video retrieval performance on a real collection. We believe that negative pseudo-relevance feedback shows great promise for very difficult multimedia retrieval tasks, especially when combined with other retrieval algorithms.
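
    A simplified sketch of negative pseudo-relevance feedback: the lowest-ranked items from an initial retrieval pass are treated as pseudo-negatives, and every item's score is penalized by its similarity to that negative set. The dot-product similarity, penalty weight, and data are assumptions; the paper instead combines scores via posterior probability estimation.

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def negative_prf(items, num_negatives=2, alpha=0.5):
        """items: list of (id, initial_score, feature_vector)."""
        ranked = sorted(items, key=lambda it: it[1], reverse=True)
        # Bottom of the initial ranking becomes the pseudo-negative set.
        negatives = [vec for _, _, vec in ranked[-num_negatives:]]
        def adjusted(item):
            _, score, vec = item
            penalty = sum(dot(vec, n) for n in negatives) / len(negatives)
            return score - alpha * penalty
        return sorted(ranked, key=adjusted, reverse=True)

    items = [
        ("v1", 0.80, [0.9, 0.1]),  # initially top, but resembles the negatives
        ("v2", 0.75, [0.1, 0.9]),
        ("v3", 0.50, [0.9, 0.1]),
        ("v4", 0.40, [0.9, 0.2]),
    ]
    # The dissimilar v2 overtakes v1 after the negative-feedback penalty.
    print([vid for vid, _, _ in negative_prf(items)])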

Yi Yang - One of the best experts on this subject based on the ideXlab platform.

  • Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval
    IEEE Transactions on Image Processing, 2021
    Co-Authors: Jie Qin, Yi Yang, Yunhong Wang, Jiebo Luo
    Abstract:

    With the current exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries (S²Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between the two modalities, S²Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets, and the experimental results demonstrate that S²Bin outperforms state-of-the-art methods on various cross-modal video retrieval tasks.
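
    Once binary codes exist, the retrieval step itself is simple; the toy sketch below ranks videos by Hamming distance from a query code. The codes are made up, and the learned encoders that are S²Bin's actual contribution are not shown.

    def hamming(a, b):
        # Popcount of the XOR gives the number of differing bits.
        return bin(a ^ b).count("1")

    video_codes = {"v1": 0b10110010, "v2": 0b01101101, "v3": 0b10110110}
    query_code = 0b10110011  # code produced by the (unshown) text encoder

    results = sorted(video_codes,
                     key=lambda v: hamming(video_codes[v], query_code))
    print(results)  # nearest video first: ['v1', 'v3', 'v2']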

  • T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
    Computer Vision and Pattern Recognition, 2021
    Co-Authors: Xiaohan Wang, Linchao Zhu, Yi Yang
    Abstract:

    Text-video retrieval is a challenging task that aims to search relevant video content based on natural language descriptions. The key to this problem is to measure text-video similarities in a joint embedding space. However, most existing methods only consider the global cross-modal similarity and overlook local details. Some works incorporate local comparisons through cross-modal local matching and reasoning; these complex operations introduce tremendous computational cost. In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers. The local cross-modal similarities are computed between the video features and text features within the same center. This design enables meticulous local comparison and reduces the computational cost of the interaction between each text-video pair. Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective. The globally aggregated visual features also provide additional supervision, which is indispensable to the optimization of the learnable semantic centers. We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
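
    A much-simplified sketch of the global-local similarity: features are hard-assigned to a small set of shared centers (the paper uses soft assignment and learned centers), per-center aggregates are compared locally, and a globally pooled comparison is added. Dimensions, centers, and the mixing weight are illustrative.

    import math

    def cosine(u, v):
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)

    def aggregate(features, centers):
        """Assign each feature to its nearest center; average per center."""
        buckets = [[] for _ in centers]
        for f in features:
            nearest = max(range(len(centers)),
                          key=lambda i: cosine(f, centers[i]))
            buckets[nearest].append(f)
        dim = len(centers[0])
        return [[sum(f[d] for f in b) / len(b) for d in range(dim)] if b
                else [0.0] * dim for b in buckets]

    def global_local_similarity(video_frames, text_words, centers, alpha=0.5):
        local_v = aggregate(video_frames, centers)
        local_t = aggregate(text_words, centers)
        local = sum(cosine(a, b) for a, b in zip(local_v, local_t)) / len(centers)
        dim = len(centers[0])
        def pool(fs):
            return [sum(f[d] for f in fs) / len(fs) for d in range(dim)]
        return alpha * local + (1 - alpha) * cosine(pool(video_frames),
                                                    pool(text_words))

    centers = [[1.0, 0.0], [0.0, 1.0]]
    frames = [[0.9, 0.1], [0.2, 0.8]]
    words = [[0.8, 0.2], [0.1, 0.9]]
    print(round(global_local_similarity(frames, words, centers), 3))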

  • Effective Multiple Feature Hashing for Large-Scale Near-Duplicate Video Retrieval
    IEEE Transactions on Multimedia, 2013
    Co-Authors: Jingkuan Song, Zi Huang, Yi Yang, Heng Tao Shen, Jiebo Luo
    Abstract:

    Near-duplicate video retrieval (NDVR) has recently attracted much research attention due to the exponential growth of online videos. It has many applications, such as copyright protection, automatic video tagging, and online video monitoring. Many existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Moreover, while accuracy has been the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structural information of each individual feature and also globally considers the local structures of all the features to learn a group of hash functions that map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos that we collected from YouTube. This dataset has been released (http://itee.uq.edu.au/shenht/UQ_Video/). The experimental results show that the proposed method outperforms state-of-the-art techniques in both accuracy and efficiency.
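
    MFH's contribution is learning the hash functions jointly from multiple features; as a hedged stand-in, the sketch below hashes the concatenation of two feature types with random hyperplane projections (classic LSH) and ranks keyframes by Hamming distance. It shows the pipeline's shape rather than the paper's learning step, and all data are invented.

    import random

    def make_hasher(dim, num_bits, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per bit; the sign of the projection is the bit.
        planes = [[rng.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_bits)]
        def hash_vec(vec):
            return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                         for plane in planes)
        return hash_vec

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    # Two feature types per keyframe (say, color histogram + texture),
    # concatenated into one vector.
    keyframes = {
        "k1": [0.9, 0.1, 0.0] + [0.2, 0.8],
        "k2": [0.8, 0.2, 0.1] + [0.3, 0.7],   # near-duplicate of k1
        "k3": [0.0, 0.1, 0.9] + [0.9, 0.1],
    }
    hasher = make_hasher(dim=5, num_bits=16)
    codes = {k: hasher(v) for k, v in keyframes.items()}
    query = hasher([0.85, 0.15, 0.05] + [0.25, 0.75])
    # Near-duplicates of the query should rank ahead of k3.
    print(sorted(codes, key=lambda k: hamming(codes[k], query)))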

  • Multiple Feature Hashing for Real-Time Large-Scale Near-Duplicate Video Retrieval
    ACM Multimedia, 2011
    Co-Authors: Jingkuan Song, Zi Huang, Yi Yang, Heng Tao Shen, Richang Hong
    Abstract:

    Near-duplicate video retrieval (NDVR) has recently attracted much research attention due to the exponential growth of online videos. It helps in many areas, such as copyright protection, video tagging, and online video usage monitoring. Most existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Moreover, while accuracy has been the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structural information of each individual feature and also globally considers the local structures of all the features to learn a group of hash functions that map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos that we collected from YouTube. The experimental results show that the proposed method outperforms state-of-the-art techniques in both accuracy and efficiency.

Richang Hong - One of the best experts on this subject based on the ideXlab platform.

  • Stochastic Multiview Hashing for Large-Scale Near-Duplicate Video Retrieval
    IEEE Transactions on Multimedia, 2017
    Co-Authors: Yanbin Hao, Meng Wang, Richang Hong, J Y Goulermas
    Abstract:

    Near-duplicate video retrieval (NDVR) has been a significant research task in multimedia given its high impact in applications such as video search, recommendation, and copyright protection. In addition to accurate retrieval performance, the exponential growth of online videos has imposed heavy demands on the efficiency and scalability of existing systems. Aiming to improve both retrieval accuracy and speed, we propose a novel stochastic multiview hashing algorithm to facilitate the construction of a large-scale NDVR system. Reliable mapping functions, which convert multiple types of keyframe features, enhanced by auxiliary information such as video-keyframe associations and ground-truth relevance, to binary hash code strings, are learned by maximizing a mixture of the generalized retrieval precision and recall scores. A composite Kullback–Leibler divergence measure is used to approximate the retrieval scores, stochastically aligning the neighborhood structures between the original feature space and the relaxed hash code space. The efficiency and effectiveness of the proposed method are examined using two public near-duplicate video collections and compared against various classical and state-of-the-art NDVR systems.
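
    The neighborhood-alignment idea can be illustrated with a toy computation: form neighborhood probabilities from pairwise distances in the original feature space and in a relaxed code space, then measure their Kullback–Leibler divergence. This shows only the alignment term under invented data; the paper's full objective mixes generalized precision and recall scores.

    import math

    def neighborhood_probs(points):
        """Softmax over negative squared distances, per ordered pair i != j."""
        n = len(points)
        probs = {}
        for i in range(n):
            weights = []
            for j in range(n):
                if i == j:
                    continue
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append((j, math.exp(-d2)))
            z = sum(w for _, w in weights)
            for j, w in weights:
                probs[(i, j)] = w / z
        return probs

    def kl(p, q):
        # Smaller divergence means the codes preserve the neighborhoods better.
        return sum(p[k] * math.log(p[k] / q[k]) for k in p)

    features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
    relaxed_codes = [[-1.0, -1.0], [-0.9, -1.0], [1.0, 1.0]]  # well aligned
    print(round(kl(neighborhood_probs(features),
                   neighborhood_probs(relaxed_codes)), 4))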

  • Multiple Feature Hashing for Real-Time Large-Scale Near-Duplicate Video Retrieval
    ACM Multimedia, 2011
    Co-Authors: Jingkuan Song, Zi Huang, Yi Yang, Heng Tao Shen, Richang Hong
    Abstract:

    Near-duplicate video retrieval (NDVR) has recently attracted much research attention due to the exponential growth of online videos. It helps in many areas, such as copyright protection, video tagging, and online video usage monitoring. Most existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Moreover, while accuracy has been the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structural information of each individual feature and also globally considers the local structures of all the features to learn a group of hash functions that map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos that we collected from YouTube. The experimental results show that the proposed method outperforms state-of-the-art techniques in both accuracy and efficiency.