Temporal Modeling

The Experts below are selected from a list of 100929 Experts worldwide ranked by ideXlab platform

Ya Li - One of the best experts on this subject based on the ideXlab platform.

  • long short term memory recurrent neural network based multimodal dimensional emotion recognition
    ACM Multimedia, 2015
    Co-Authors: Linlin Chao, Minghao Yang, Ya Li
    Abstract:

    This paper presents our contribution to the Audio/Visual+ Emotion Challenge (AV+EC 2015), whose goal is to predict continuous values of the emotion dimensions arousal and valence from audio, visual and physiological modalities. The state-of-the-art classifier for dimensional recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. Beyond the regular LSTM-RNN prediction architecture, two techniques are investigated for the dimensional emotion recognition problem. The first is the use of the ε-insensitive loss as the loss function to optimize; compared to the squared loss, the most widely used loss for dimensional emotion recognition, the ε-insensitive loss is more robust to label noise and ignores small errors, which yields a stronger correlation between predictions and labels. The second is Temporal pooling, which enables Temporal Modeling in the input features and increases the diversity of the features fed into the forward prediction architecture. Experimental results show the effectiveness of the key components of the proposed method, and competitive results are obtained.
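
    The ε-insensitive loss mentioned above only penalizes the part of the error that exceeds a margin ε, which is what makes it robust to small label noise. A minimal PyTorch sketch is given below; the tensor shapes and the value of ε are illustrative assumptions, not taken from the paper.

```python
import torch

def epsilon_insensitive_loss(pred, target, eps=0.1):
    """Penalize only the part of the absolute error that exceeds eps."""
    # errors smaller than eps are ignored, which makes the loss robust to label noise
    return torch.clamp((pred - target).abs() - eps, min=0.0).mean()

# toy usage: per-frame arousal predictions vs. noisy labels (shapes are illustrative)
pred = torch.randn(16, 1)
target = torch.randn(16, 1)
loss = epsilon_insensitive_loss(pred, target, eps=0.1)
```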

Andrew W - One of the best experts on this subject based on the ideXlab platform.

  • convolutional long short term memory fully connected deep neural networks
    International Conference on Acoustics Speech and Signal Processing, 2015
    Co-Authors: Tara N Sainath, Oriol Vinyals, Andrew W
    Abstract:

    Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their Modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at Temporal Modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4–6% relative improvement in WER over an LSTM, the strongest of the three individual models.
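
    As a rough illustration of the CLDNN idea (convolutional layers to reduce frequency variation, an LSTM for Temporal Modeling, and fully connected layers to map into a more separable space), a toy PyTorch module is sketched below. The layer sizes and the single conv/LSTM layer are placeholder assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class ToyCLDNN(nn.Module):
    """Toy CNN -> LSTM -> DNN stack over log-mel input frames (sizes are illustrative)."""
    def __init__(self, n_mels=40, hidden=128, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # reduce frequency variation
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),                          # pool along frequency only
        )
        self.lstm = nn.LSTM(16 * (n_mels // 2), hidden, batch_first=True)  # temporal modeling
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))             # map to separable space

    def forward(self, x):                       # x: (batch, time, n_mels)
        b, t, _ = x.shape
        h = self.conv(x.unsqueeze(1))           # (batch, 16, time, n_mels // 2)
        h = h.permute(0, 2, 1, 3).reshape(b, t, -1)
        h, _ = self.lstm(h)
        return self.dnn(h)                      # per-frame class scores

x = torch.randn(2, 100, 40)
out = ToyCLDNN()(x)                             # (2, 100, 10)
```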

Stacy Marsella - One of the best experts on this subject based on the ideXlab platform.

  • predicting co verbal gestures a deep and Temporal Modeling approach
    Intelligent Virtual Agents, 2015
    Co-Authors: Chungcheng Chiu, Louisphilippe Morency, Stacy Marsella
    Abstract:

    Gestures during spoken dialog play a central role in human communication. As a consequence, models of gesture generation are a key challenge in research on virtual humans, embodied agents capable of face-to-face interaction with people. Machine learning approaches to gesture generation must take into account the conceptual content of utterances, the physical properties of speech signals and the physical properties of the gestures themselves. To address this challenge, we propose a gestural sign scheme to facilitate supervised learning and present the DCNF model, which jointly learns deep neural networks and a second-order linear-chain Temporal contingency. The approach captures both the mapping relation between speech and gestures and the Temporal relations among gestures. Our experiments on a human co-verbal gesture dataset show a significant improvement over previous work on gesture prediction. A generalization experiment on handwriting recognition also shows that DCNFs outperform state-of-the-art approaches.
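
    To make the combination of a deep network with linear-chain Temporal dependencies concrete, the sketch below scores a gesture-label sequence as the sum of per-frame MLP scores and learned label-transition scores. It is a deliberately simplified, first-order illustration of the idea, not the second-order DCNF of the paper, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ToyNeuralChainScorer(nn.Module):
    """Score a gesture-label sequence: per-frame MLP scores plus pairwise transition scores.

    First-order simplification for illustration only; the DCNF described in the paper
    models second-order (label-triple) temporal dependencies.
    """
    def __init__(self, feat_dim=64, n_labels=8, hidden=32):
        super().__init__()
        self.unary = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_labels))
        self.transition = nn.Parameter(torch.zeros(n_labels, n_labels))

    def forward(self, speech_feats, labels):            # (T, feat_dim), (T,)
        unary = self.unary(speech_feats)                 # (T, n_labels)
        score = unary[torch.arange(len(labels)), labels].sum()
        score = score + self.transition[labels[:-1], labels[1:]].sum()
        return score                                     # higher = better fit of labels to speech

feats = torch.randn(20, 64)
labels = torch.randint(0, 8, (20,))
s = ToyNeuralChainScorer()(feats, labels)
```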

Shilei Wen - One of the best experts on this subject based on the ideXlab platform.

  • stnet local and global spatial Temporal Modeling for action recognition
    National Conference on Artificial Intelligence, 2019
    Co-Authors: Zhichao Zhou, Chuang Gan, Xiao Liu, Liming Wang, Shilei Wen
    Abstract:

    Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-Temporal Modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D-convolution-based approaches, we explore a novel spatial-Temporal network (StNet) architecture for both local and global Modeling in videos. In particular, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-Temporal relationships. To model global spatial-Temporal structure, we apply Temporal convolution on the local spatial-Temporal feature maps. Specifically, a novel Temporal Xception block is proposed in StNet, which employs separate channel-wise and Temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
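
    The super-image construction described above is straightforward to express in code: N consecutive RGB frames are concatenated along the channel axis and fed to an ordinary 2D convolution. The sketch below uses illustrative shapes (N = 5, 112x112 frames, 64 output channels); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

# video clip: (batch, T, 3, H, W); stack N = 5 consecutive frames into one 3N-channel super-image
video = torch.randn(2, 15, 3, 112, 112)
N = 5
b, t, c, h, w = video.shape
super_images = video.view(b, t // N, N * c, h, w)          # (2, 3, 15, 112, 112)

# a plain 2D convolution over each super-image captures local spatial-temporal structure
conv2d = nn.Conv2d(N * c, 64, kernel_size=3, padding=1)
local_feats = conv2d(super_images.flatten(0, 1))            # (2*3, 64, 112, 112)
local_feats = local_feats.view(b, t // N, 64, h, w)         # per-super-image feature maps
```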

  • StNet: Local and Global Spatial-Temporal Modeling for Action Recognition.
    arXiv: Computer Vision and Pattern Recognition, 2018
    Co-Authors: Zhou Zhichao, Chuang Gan, Xiao Liu, Liming Wang, Shilei Wen
    Abstract:

    Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-Temporal Modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D-convolution-based approaches, we explore a novel spatial-Temporal network (StNet) architecture for both local and global spatial-Temporal Modeling in videos. In particular, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-Temporal relationships. To model global spatial-Temporal relationships, we apply Temporal convolution on the local spatial-Temporal feature maps. Specifically, a novel Temporal Xception block is proposed in StNet; it employs separate channel-wise and Temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
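
    The separate channel-wise and Temporal-wise convolutions used in the Temporal Xception block can be approximated with a depthwise-separable 1D convolution over the per-snippet feature sequence, as in the hedged sketch below. The feature dimension and sequence length are illustrative, and the real block contains additional structure not shown here.

```python
import torch
import torch.nn as nn

feat_dim, seq_len = 256, 8
features = torch.randn(2, feat_dim, seq_len)   # (batch, channels, time): one vector per snippet

# channel-wise (depthwise) temporal convolution: each channel is filtered independently over time
depthwise = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1, groups=feat_dim)
# pointwise convolution: mixes channels at each time step
pointwise = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)

out = pointwise(torch.relu(depthwise(features)))   # (2, 256, 8) temporally aggregated features
```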

  • revisiting the effectiveness of off the shelf Temporal Modeling approaches for large scale video classification
    arXiv: Computer Vision and Pattern Recognition, 2017
    Co-Authors: Yunlong Bian, Chuang Gan, Xiao Liu, Xiang Long, Jie Zhou, Shilei Wen, Yuanqing Lin
    Abstract:

    This paper describes our solution for the video recognition task of the ActivityNet Kinetics challenge, which ranked 1st place. Most existing state-of-the-art video recognition approaches favor an end-to-end pipeline. One exception is the DevNet framework, whose merit is that the video data are first used to learn a network (i.e., by fine-tuning or training from scratch); instead of directly using the end-to-end classification scores (e.g., softmax scores), features are extracted from the learned network and then fed into off-the-shelf machine learning models to conduct video classification. However, the effectiveness of this line of work has long been ignored and underestimated. In this submission, we make extensive use of this strategy. In particular, we investigate four Temporal Modeling approaches built on the learned features: the Multi-group Shifting Attention Network, the Temporal Xception Network, the Multi-stream Sequence Model and the Fast-Forward Sequence Model. Experimental results on the challenging Kinetics dataset demonstrate that our proposed Temporal Modeling approaches can significantly improve over existing approaches on large-scale video recognition tasks. Most remarkably, our best single Multi-group Shifting Attention Network achieves 77.7% top-1 accuracy and 93.2% top-5 accuracy on the validation set.
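
    As a simplified illustration of aggregating pre-extracted frame-level features with learned attention, the sketch below pools a sequence of frame features into a single clip-level vector with one attention head. The multi-group and shifting operations of the paper's network are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class ToyAttentionPooling(nn.Module):
    """Weight frame-level features by learned attention and sum them into a clip-level vector."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):                                # (batch, T, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)    # (batch, T, 1) attention over time
        return (weights * frames).sum(dim=1)                  # (batch, feat_dim)

frames = torch.randn(4, 300, 1024)                 # pre-extracted features for 300 frames per video
clip_feat = ToyAttentionPooling()(frames)          # one vector per video, fed to a classifier
```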

  • Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding.
    arXiv: Computer Vision and Pattern Recognition, 2017
    Co-Authors: Chuang Gan, Xiao Liu, Bian Yunlong, Xiang Long, Jie Zhou, Shilei Wen
    Abstract:

    This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge, which ranked 3rd place. Because the challenge provides pre-extracted visual and audio features instead of raw videos, we mainly investigate various Temporal Modeling approaches to aggregate the frame-level features for multi-label video recognition. Our system contains three major components: a two-stream sequence model, a fast-forward sequence model and Temporal residual neural networks. Experimental results on the challenging YouTube-8M dataset demonstrate that our proposed Temporal Modeling approaches can significantly improve over existing approaches on large-scale video recognition tasks. Notably, our fast-forward LSTM with a depth of 7 layers achieves 82.75% in terms of GAP@20 on the Kaggle public test set.
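
    The sketch below shows a deep stacked LSTM in which each layer receives an additive skip path from the layer below, in the spirit of the fast-forward connections that make very deep recurrent stacks trainable. The exact connection pattern, widths and depth used in the submission are not specified here, so treat this purely as an illustrative assumption.

```python
import torch
import torch.nn as nn

class ToyDeepLSTM(nn.Module):
    """Stack of LSTM layers where each layer's input is the previous output plus a skip path."""
    def __init__(self, feat_dim=1024, hidden=1024, depth=7):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True) for i in range(depth)]
        )
        self.proj = nn.Linear(feat_dim, hidden)    # project the input so the skip path matches shapes

    def forward(self, x):                          # x: (batch, T, feat_dim)
        skip = self.proj(x)
        h = x
        for lstm in self.layers:
            out, _ = lstm(h)
            h = out + skip                         # additive skip connection eases deep training
            skip = h
        return h.mean(dim=1)                       # clip-level representation for classification

video_feats = torch.randn(2, 300, 1024)
clip_repr = ToyDeepLSTM()(video_feats)             # (2, 1024)
```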

Linlin Chao - One of the best experts on this subject based on the ideXlab platform.

  • long short term memory recurrent neural network based multimodal dimensional emotion recognition
    ACM Multimedia, 2015
    Co-Authors: Linlin Chao, Minghao Yang, Ya Li
    Abstract:

    This paper presents our contribution to the Audio/Visual+ Emotion Challenge (AV+EC 2015), whose goal is to predict continuous values of the emotion dimensions arousal and valence from audio, visual and physiological modalities. The state-of-the-art classifier for dimensional recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. Beyond the regular LSTM-RNN prediction architecture, two techniques are investigated for the dimensional emotion recognition problem. The first is the use of the ε-insensitive loss as the loss function to optimize; compared to the squared loss, the most widely used loss for dimensional emotion recognition, the ε-insensitive loss is more robust to label noise and ignores small errors, which yields a stronger correlation between predictions and labels. The second is Temporal pooling, which enables Temporal Modeling in the input features and increases the diversity of the features fed into the forward prediction architecture. Experimental results show the effectiveness of the key components of the proposed method, and competitive results are obtained.