Video Representation

The Experts below are selected from a list of 43,728 Experts worldwide, ranked by the ideXlab platform

Cordelia Schmid - One of the best experts on this subject based on the ideXlab platform.

  • Composable Augmentation Encoding for Video Representation Learning
    arXiv: Computer Vision and Pattern Recognition, 2021
    Co-Authors: Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
    Abstract:

    We focus on contrastive methods for self-supervised Video Representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of Representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (e.g., fine-grained Video action recognition that would benefit from temporal information). To overcome this limitation, we propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the Video Representations for contrastive learning. We show that Representations learned by our method encode valuable information about specified spatial or temporal augmentations, and in doing so also achieve state-of-the-art performance on a number of Video benchmarks.
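
A minimal sketch of the 'augmentation-aware' idea, assuming a PyTorch setup: the temporal-shift values used to create each view are encoded by a small MLP and concatenated with the clip features before the projection head, and a standard InfoNCE loss is applied. The module names, dimensions, and the use of a single scalar shift per view are illustrative assumptions, not the authors' exact CATE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentationAwareProjector(nn.Module):
    """Projection head that sees both the clip feature and an encoding of the
    augmentation parameters (here: a single temporal shift per view)."""
    def __init__(self, feat_dim=512, aug_dim=16, proj_dim=128):
        super().__init__()
        self.aug_encoder = nn.Sequential(nn.Linear(1, aug_dim), nn.ReLU())
        self.proj = nn.Sequential(
            nn.Linear(feat_dim + aug_dim, 256), nn.ReLU(), nn.Linear(256, proj_dim))

    def forward(self, clip_feat, time_shift):
        aug_code = self.aug_encoder(time_shift.unsqueeze(-1))
        return F.normalize(self.proj(torch.cat([clip_feat, aug_code], dim=-1)), dim=-1)

def info_nce(z1, z2, temperature=0.1):
    # Matching views are positives; every other pair in the batch is a negative.
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

# Toy usage: random tensors stand in for backbone features of two views per clip.
feats_v1, feats_v2 = torch.randn(8, 512), torch.randn(8, 512)
shift_v1, shift_v2 = torch.rand(8) * 2.0, torch.rand(8) * 2.0   # hypothetical shifts (seconds)
projector = AugmentationAwareProjector()
loss = info_nce(projector(feats_v1, shift_v1), projector(feats_v2, shift_v2))
```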

  • A Robust and Efficient Video Representation for Action Recognition
    International Journal of Computer Vision, 2016
    Co-Authors: Heng Wang, Dan Oneata, Jakob Verbeek, Cordelia Schmid
    Abstract:

    This paper introduces a state-of-the-art Video Representation and applies it to efficient action recognition and detection. We first propose to improve the popular dense trajectory features by explicit camera motion estimation. More specifically, we extract feature point matches between frames using SURF descriptors and dense optical flow. The matches are used to estimate a homography with RANSAC. To improve the robustness of homography estimation, a human detector is employed to remove outlier matches from the human body as human motion is not constrained by the camera. Trajectories consistent with the homography are considered as due to camera motion, and thus removed. We also use the homography to cancel out camera motion from the optical flow. This results in significant improvement on motion-based HOF and MBH descriptors. We further explore the recent Fisher vector as an alternative feature encoding approach to the standard bag-of-words (BOW) histogram, and consider different ways to include spatial layout information in these encodings. We present a large and varied set of evaluations, considering (i) classification of short basic actions on six datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that our improved trajectory features significantly outperform previous dense trajectories, and that Fisher vectors are superior to BOW encodings for Video recognition tasks. In all three tasks, we show substantial improvements over the state-of-the-art results.
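
A rough sketch of the camera-motion compensation step, assuming two grayscale frames as NumPy arrays and OpenCV. The paper matches SURF keypoints together with dense optical-flow points and removes matches on detected humans before RANSAC; here ORB keypoints stand in for SURF (to avoid the non-free module) and the human-detector masking is omitted.

```python
import cv2
import numpy as np

def compensate_camera_motion(prev_gray, curr_gray):
    """Estimate a frame-to-frame homography with RANSAC and return optical flow
    computed after warping the current frame, so residual flow reflects
    foreground motion rather than camera motion."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    pts_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts_curr = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Homography mapping current-frame coordinates to previous-frame coordinates.
    H, _ = cv2.findHomography(pts_curr, pts_prev, cv2.RANSAC, 5.0)

    h, w = prev_gray.shape
    warped = cv2.warpPerspective(curr_gray, H, (w, h))       # cancel camera motion
    flow = cv2.calcOpticalFlowFarneback(prev_gray, warped, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow, H
```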

Yunhui Liu - One of the best experts on this subject based on the ideXlab platform.

  • Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
    Co-Authors: Jiangliu Wang, Jianbo Jiao, Wei Liu, Linchao Bao, Yunhui Liu
    Abstract:

    This paper proposes a novel pretext task to address the self-supervised Video Representation learning problem. Specifically, given an unlabeled Video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the Video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions of rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream Video analysis tasks including action recognition, Video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/Video_repres_sts.
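
A simplified sketch of how such pretext labels could be derived from dense optical flow: partition each frame into a coarse grid, pick the block with the largest accumulated motion, and quantize its dominant direction. The grid size, direction binning, and flow source are illustrative assumptions rather than the authors' exact recipe.

```python
import numpy as np

def motion_statistics_labels(flow, grid=3, n_dir_bins=8):
    """flow: (T, H, W, 2) dense optical flow for one clip.
    Returns (location_label, direction_label) for the block with the largest motion."""
    T, H, W, _ = flow.shape
    mag = np.linalg.norm(flow, axis=-1)                       # (T, H, W) motion magnitude
    ang = np.arctan2(flow[..., 1], flow[..., 0])              # per-pixel motion direction

    gh, gw = H // grid, W // grid
    block_mag = np.array([[mag[:, i*gh:(i+1)*gh, j*gw:(j+1)*gw].sum()
                           for j in range(grid)] for i in range(grid)])
    bi, bj = np.unravel_index(block_mag.argmax(), block_mag.shape)
    location_label = bi * grid + bj                           # which grid block moves most

    # Dominant direction inside that block, weighted by magnitude, quantized into bins.
    block_ang = ang[:, bi*gh:(bi+1)*gh, bj*gw:(bj+1)*gw].ravel()
    block_w = mag[:, bi*gh:(bi+1)*gh, bj*gw:(bj+1)*gw].ravel()
    bins = ((block_ang + np.pi) / (2 * np.pi) * n_dir_bins).astype(int) % n_dir_bins
    direction_label = int(np.bincount(bins, weights=block_w, minlength=n_dir_bins).argmax())
    return int(location_label), direction_label
```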

  • Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
    arXiv: Computer Vision and Pattern Recognition, 2020
    Co-Authors: Jiangliu Wang, Jianbo Jiao, Wei Liu, Linchao Bao, Yunhui Liu
    Abstract:

    This paper proposes a novel pretext task to address the self-supervised Video Representation learning problem. Specifically, given an unlabeled Video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the Video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions of rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with several 3D backbone networks, i.e., C3D, 3D-ResNet and R(2+1)D. The results show that our approach outperforms the existing approaches across the three backbone networks on various downstream Video analysis tasks including action recognition, Video retrieval, dynamic scene recognition, and action similarity labeling. The source code is made publicly available at: this https URL.

  • Self-supervised Video Representation Learning by Pace Prediction
    arXiv: Computer Vision and Pattern Recognition, 2020
    Co-Authors: Jiangliu Wang, Jianbo Jiao, Yunhui Liu
    Abstract:

    This paper addresses the problem of self-supervised Video Representation learning from a new perspective -- by Video pace prediction. It stems from the observation that the human visual system is sensitive to Video pace, e.g., slow motion, a widely used technique in film making. Specifically, given a Video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each Video clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying Video content and learns representative spatio-temporal features. In addition, we further introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar Video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and Video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised Video Representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at this https URL.
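
A minimal sketch of the pace-prediction pretext task: sample a clip at one of several playback paces by striding over frames, then train a classifier to recover the pace label. The pace set, clip length, and dummy backbone are placeholders, not the paper's configuration.

```python
import random
import torch
import torch.nn as nn

PACES = [1, 2, 4, 8]          # frame-sampling strides; illustrative choice

def sample_clip_with_pace(video_frames, clip_len=16):
    """video_frames: tensor (T, C, H, W). Returns (clip, pace_label)."""
    label = random.randrange(len(PACES))
    stride = PACES[label]
    max_start = video_frames.size(0) - clip_len * stride
    start = random.randint(0, max(max_start, 0))
    idx = torch.arange(start, start + clip_len * stride, stride)
    idx = idx.clamp(max=video_frames.size(0) - 1)             # guard short videos
    return video_frames[idx], label

class PaceClassifier(nn.Module):
    def __init__(self, backbone, feat_dim=512, n_paces=len(PACES)):
        super().__init__()
        self.backbone = backbone                  # any 3D CNN returning (B, feat_dim)
        self.head = nn.Linear(feat_dim, n_paces)

    def forward(self, clips):
        return self.head(self.backbone(clips))

# Toy usage with a dummy backbone that averages pixels into a fixed-size feature.
dummy_backbone = lambda x: x.mean(dim=(1, 2, 3, 4)).unsqueeze(-1).repeat(1, 512)
model = PaceClassifier(dummy_backbone)
video = torch.randn(200, 3, 112, 112)
clip, label = sample_clip_with_pace(video)
logits = model(clip.permute(1, 0, 2, 3).unsqueeze(0))        # (B, C, T, H, W)
loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
```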

Wei Liu - One of the best experts on this subject based on the ideXlab platform.

  • VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples.
    arXiv: Computer Vision and Pattern Recognition, 2021
    Co-Authors: Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, Wei Liu
    Abstract:

    MoCo is effective for unsupervised image Representation learning. In this paper, we propose VideoMoCo for unsupervised Video Representation learning. Given a Video sequence as an input sample, we improve the temporal feature Representations of MoCo from two perspectives. First, we introduce a generator to drop out several frames from this sample temporally. The discriminator is then learned to encode similar feature Representations regardless of frame removals. By adaptively dropping out different frames during training iterations of adversarial learning, we augment this input sample to train a temporally robust encoder. Second, we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. As the momentum encoder is updated after the keys are enqueued, the Representation ability of these keys degrades when we use the current input sample for contrastive learning. This degradation is reflected via a temporal decay that makes the input sample attend to recent keys in the queue. As a result, we adapt MoCo to learn Video Representations without empirically designing pretext tasks. By empowering the temporal robustness of the encoder and modeling the temporal decay of the keys, our VideoMoCo improves MoCo temporally based on contrastive learning. Experiments on benchmark datasets including UCF101 and HMDB51 show that VideoMoCo stands as a state-of-the-art Video Representation learning method.
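
A rough sketch of the temporal-decay idea on the MoCo memory queue: keys enqueued longer ago are attenuated when computing the contrastive logits. The decay schedule and its exact placement in the loss are simplified assumptions rather than the paper's precise formulation, and the adversarial frame-dropout generator is omitted.

```python
import torch
import torch.nn.functional as F

def moco_loss_with_temporal_decay(q, k_pos, queue, queue_age, temperature=0.07, decay=0.99):
    """
    q        : (B, D) query features
    k_pos    : (B, D) positive key features from the momentum encoder
    queue    : (K, D) negative keys from the memory queue
    queue_age: (K,)   how many iterations ago each key was enqueued
    """
    q, k_pos, queue = F.normalize(q, dim=1), F.normalize(k_pos, dim=1), F.normalize(queue, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)              # (B, 1) positive logits
    l_neg = q @ queue.t()                                      # (B, K) negative logits
    l_neg = l_neg * (decay ** queue_age)                       # attenuate stale keys
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)          # positives sit at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random features and a 1024-entry queue.
loss = moco_loss_with_temporal_decay(torch.randn(4, 128), torch.randn(4, 128),
                                     torch.randn(1024, 128), torch.arange(1024).float())
```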

  • Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
    Co-Authors: Jiangliu Wang, Jianbo Jiao, Wei Liu, Linchao Bao, Yunhui Liu
    Abstract:

    This paper proposes a novel pretext task to address the self-supervised Video Representation learning problem. Specifically, given an unlabeled Video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the Video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions of rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream Video analysis tasks including action recognition, Video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/Video_repres_sts.

  • Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
    arXiv: Computer Vision and Pattern Recognition, 2020
    Co-Authors: Jiangliu Wang, Jianbo Jiao, Wei Liu, Linchao Bao, Yunhui Liu
    Abstract:

    This paper proposes a novel pretext task to address the self-supervised Video Representation learning problem. Specifically, given an unlabeled Video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the Video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions of rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with several 3D backbone networks, i.e., C3D, 3D-ResNet and R(2+1)D. The results show that our approach outperforms the existing approaches across the three backbone networks on various downstream Video analysis tasks including action recognition, Video retrieval, dynamic scene recognition, and action similarity labeling. The source code is made publicly available at: this https URL.

Shih-Fu Chang - One of the best experts on this subject based on the ideXlab platform.

  • Spatio-Temporal Video Search Using the Object-Based Video Representation
    International Conference on Image Processing, 1997
    Co-Authors: Di Zhong, Shih-Fu Chang
    Abstract:

    Object-based Video Representation holds great promise for new search and editing functionalities. Feature regions in Video sequences are automatically segmented, tracked, and grouped to form the basis for content-based Video search and higher levels of abstraction. We present a new system for Video object segmentation and tracking using feature fusion and region grouping. We also present efficient techniques for spatio-temporal Video query based on the automatically segmented Video objects.
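
A hypothetical sketch of spatio-temporal querying over segmented objects: each object is stored as a trajectory of per-frame bounding boxes plus a color descriptor, and a query retrieves objects whose appearance matches a template and which pass through a given image region during a time interval. The data structures and similarity measure are illustrative only, not the paper's system.

```python
from dataclasses import dataclass, field

@dataclass
class VideoObject:
    color_hist: list                              # appearance descriptor (e.g., a quantized hue histogram)
    track: dict = field(default_factory=dict)     # frame index -> (x, y, w, h)

def hist_similarity(h1, h2):
    # Histogram intersection, assuming both histograms are L1-normalized.
    return sum(min(a, b) for a, b in zip(h1, h2))

def inside(box, region):
    x, y, w, h = box
    rx, ry, rw, rh = region
    cx, cy = x + w / 2, y + h / 2
    return rx <= cx <= rx + rw and ry <= cy <= ry + rh

def spatio_temporal_query(objects, query_hist, region, t_start, t_end, min_sim=0.6):
    """Return objects that look like query_hist and enter `region` within [t_start, t_end]."""
    hits = []
    for obj in objects:
        if hist_similarity(obj.color_hist, query_hist) < min_sim:
            continue
        if any(t_start <= t <= t_end and inside(box, region)
               for t, box in obj.track.items()):
            hits.append(obj)
    return hits
```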

Tao Mei - One of the best experts on this subject based on the ideXlab platform.

  • Learning hierarchical Video Representation for action recognition
    International Journal of Multimedia Information Retrieval, 2017
    Co-Authors: Zhaofan Qiu, Tao Mei, Ting Yao, Yong Rui, Jiebo Luo
    Abstract:

    Video analysis is an important branch of computer vision due to its wide applications, ranging from Video surveillance, Video indexing, and retrieval to human-computer interaction. All of these applications rely on a good Video Representation, which encodes Video content into a feature vector of fixed length. Most existing methods treat Video as a flat image sequence, but from our observations we argue that Video is an information-intensive medium with an intrinsic hierarchical structure, which is largely ignored by previous approaches. Therefore, in this work, we represent the hierarchical structure of Video with multiple granularities including, from short to long, a single frame, consecutive frames (motion), a short clip, and the entire Video. Furthermore, we propose a novel deep learning framework to model each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks and model the clip and Video granularities with 3D convolutional neural networks. Long Short-Term Memory networks are applied to the frame, motion, and clip granularities to further exploit long-term temporal clues. Consequently, the whole framework utilizes multi-stream CNNs to learn a hierarchical Representation that captures the spatial and temporal information of Video. To validate its effectiveness in Video analysis, we apply this Video Representation to the action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all the granularities, which are obtained by using a softmax layer on the top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance against several state-of-the-art methods.
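
A condensed sketch of the multi-granularity idea in PyTorch: a 2D CNN with an LSTM scores individual frames over time, a 3D CNN scores the clip as a whole, and per-stream softmax scores are fused. Layer sizes are toy values, the motion (optical-flow) stream is omitted, and simple averaging stands in for the paper's distribution-based fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameStream(nn.Module):                      # 2D CNN + LSTM over frames
    def __init__(self, n_classes, feat_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
                                 nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):                     # (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1).view(B, T, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                 # logits from the last time step

class ClipStream(nn.Module):                       # 3D CNN over the whole clip
    def __init__(self, n_classes, feat_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv3d(3, feat_dim, 3, stride=2, padding=1),
                                 nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, clip):                       # (B, 3, T, H, W)
        return self.fc(self.cnn(clip).flatten(1))

def fuse_scores(logits_per_stream):
    # Average the per-stream softmax scores (a stand-in for distribution-based fusion).
    return torch.stack([F.softmax(l, dim=1) for l in logits_per_stream]).mean(0)

# Toy usage on a random batch of 8-frame videos.
video = torch.randn(2, 8, 3, 64, 64)               # (B, T, C, H, W)
frame_stream, clip_stream = FrameStream(10), ClipStream(10)
scores = fuse_scores([frame_stream(video), clip_stream(video.permute(0, 2, 1, 3, 4))])
```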

  • IJCAI - Learning deep intrinsic Video Representation by exploring temporal coherence and graph structure
    2016
    Co-Authors: Yingwei Pan, Tao Mei, Ting Yao, Yong Rui
    Abstract:

    Learning Video Representation is not a trivial task, as Video is an information-intensive medium in which each frame does not exist independently. Locally, a Video frame is visually and semantically similar to its adjacent frames. Holistically, a Video has an inherent structure: the correlations among Video frames. For example, even frames far from each other may hold similar semantics. Such context information is therefore important for characterizing the intrinsic Representation of a Video frame. In this paper, we present a novel approach to learn a deep Video Representation by exploring both local and holistic contexts. Specifically, we propose a triplet sampling mechanism to encode the local temporal relationship of adjacent frames based on their deep Representations. In addition, we incorporate the graph structure of the Video, as a prior, to holistically preserve the inherent correlations among Video frames. Our approach is fully unsupervised and trained in an end-to-end deep convolutional neural network architecture. Through extensive experiments, we show that our learned Representation can significantly boost several Video recognition tasks (retrieval, classification, and highlight detection) over traditional Video Representations.
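
A schematic sketch combining the two ingredients described above: a triplet loss on (anchor, adjacent-frame positive, distant-frame negative) embeddings, plus a graph regularizer that pulls together embeddings of frames connected in a precomputed similarity graph. The margin, weighting, negative-sampling rule, and the random graph are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_triplet_loss(emb, margin=0.2):
    """emb: (T, D) embeddings of consecutive frames of one video."""
    anchor, positive = emb[:-2], emb[1:-1]                       # adjacent frames
    negative = emb.roll(shifts=emb.size(0) // 2, dims=0)[:-2]    # temporally distant frames
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

def graph_regularizer(emb, adjacency):
    """adjacency: (T, T) nonnegative weights encoding the video's frame graph."""
    diff = emb.unsqueeze(0) - emb.unsqueeze(1)                   # (T, T, D) pairwise differences
    return (adjacency * diff.pow(2).sum(-1)).mean()

# Toy usage: random embeddings and a random sparse frame graph.
emb = F.normalize(torch.randn(32, 128, requires_grad=True), dim=1)
adj = (torch.rand(32, 32) > 0.8).float()
loss = temporal_triplet_loss(emb) + 0.1 * graph_regularizer(emb, adj)
```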

  • Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation
    International Conference on Multimedia Retrieval, 2016
    Co-Authors: Zhaofan Qiu, Tao Mei, Ting Yao, Yong Rui, Jiebo Luo
    Abstract:

    Recognizing actions in Videos is a challenging task, as Video is an information-intensive medium with complex variations. Most existing methods have treated Video as a flat data sequence, ignoring the intrinsic hierarchical structure of the Video content. In particular, an action may span different granularities in this hierarchy including, from small to large, a single frame, consecutive frames (motion), a short clip, and the entire Video. In this paper, we present a novel framework to boost action recognition by learning a deep spatio-temporal Video Representation at hierarchical multi-granularity. Specifically, we model each granularity as a single stream by 2D (for the frame and motion streams) or 3D (for the clip and Video streams) convolutional neural networks (CNNs). The framework therefore consists of multi-stream 2D or 3D CNNs to learn both spatial and temporal Representations. Furthermore, we employ Long Short-Term Memory (LSTM) networks on the frame, motion, and clip streams to exploit long-term temporal dynamics. With a softmax layer on top of each stream, classification scores can be predicted from all the streams, followed by a novel fusion scheme based on the multi-granular score distribution. Our networks are learned in an end-to-end fashion. On two Video action benchmarks, UCF101 and HMDB51, our framework achieves promising performance compared with the state-of-the-art.
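
Complementing the architectural sketch given earlier under "Learning hierarchical Video Representation for action recognition", here is a tiny sketch of fusing per-stream class scores with learnable weights, one per granularity (frame, motion, clip, Video). The softmax-normalized weights and their end-to-end training are assumptions; the paper derives its fusion from the multi-granular score distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedScoreFusion(nn.Module):
    def __init__(self, n_streams):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_streams))   # one fusion logit per granularity

    def forward(self, stream_logits):                   # list of (B, n_classes) tensors
        weights = F.softmax(self.w, dim=0)              # normalized stream weights
        probs = torch.stack([F.softmax(l, dim=1) for l in stream_logits])  # (S, B, C)
        return (weights.view(-1, 1, 1) * probs).sum(0)  # fused class scores (B, C)

# Toy usage: four streams, 101 classes (UCF101-sized output).
fusion = WeightedScoreFusion(n_streams=4)
fused = fusion([torch.randn(2, 101) for _ in range(4)])
```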

  • Multi-Video synopsis for Video Representation
    Signal Processing, 2009
    Co-Authors: Tao Mei, In So Kweon, Xian-sheng Hua
    Abstract:

    The world is covered with millions of cameras, each recording a huge amount of Video. It is a time-consuming task to watch these Videos, as most of them are of little interest due to the lack of activity. Video Representation is thus an important technology for tackling this issue. However, conventional Video Representation methods mainly focus on a single Video, aiming to reduce the spatiotemporal redundancy as much as possible. In contrast, this paper describes a novel approach to presenting the dynamics of multiple Videos simultaneously, aiming at a less intrusive viewing experience. Given a main Video and multiple supplementary Videos, the proposed approach automatically constructs a synthesized multi-Video synopsis by integrating the supplementary Videos into the most suitable spatiotemporal portions within the main Video. The problem of finding a suitable integration between the main Video and the supplementary Videos is formulated as a maximum a posteriori (MAP) problem, in which the desired properties related to a less intrusive viewing experience, i.e., informativeness, consistency, visual naturalness, and stability, are maximized. This problem is solved by using an efficient Viterbi beam search algorithm. Furthermore, an informative blending algorithm that naturalizes the connecting boundary between different Videos is proposed. The proposed method has a wide variety of applications such as visual information Representation, surveillance Video browsing, Video summarization, and Video advertising. The effectiveness of multi-Video synopsis is demonstrated in extensive experiments over different types of Videos with different synopsis cases.
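
A greatly simplified sketch of searching for placements of supplementary Videos inside a main Video with beam search: each supplementary Video receives a (start time, spatial slot) placement, scored by a placeholder per-placement term minus a penalty when two placements collide in time and space. The scoring terms stand in for the MAP objective's informativeness, consistency, naturalness, and stability terms; they are not the paper's formulation.

```python
from itertools import product

def overlaps(p, q):
    # Two placements collide if they share a spatial slot and their time spans intersect.
    (s1, slot1, d1), (s2, slot2, d2) = p, q
    return slot1 == slot2 and s1 < s2 + d2 and s2 < s1 + d1

def beam_search_placements(durations, main_len, n_slots, score_fn,
                           beam_width=5, collide_penalty=10.0):
    """durations: length of each supplementary video (in frames or seconds)."""
    beams = [([], 0.0)]                               # (placements so far, total score)
    for dur in durations:
        candidates = []
        for placements, total in beams:
            for start, slot in product(range(0, main_len - dur + 1), range(n_slots)):
                new = (start, slot, dur)
                penalty = collide_penalty * sum(overlaps(new, p) for p in placements)
                candidates.append((placements + [new],
                                   total + score_fn(start, slot, dur) - penalty))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0]

# Toy usage: a placeholder score that prefers placements early in the main video.
best_placements, best_score = beam_search_placements(
    durations=[20, 35], main_len=100, n_slots=3,
    score_fn=lambda start, slot, dur: -0.01 * start)
```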