Language Modelling

The experts below are selected from a list of 61,869 experts worldwide, ranked by the ideXlab platform.

Marcal Rusinol - One of the best experts on this subject based on the ideXlab platform.

  • Candidate Fusion: Integrating Language Modelling into a Sequence-to-Sequence Handwritten Word Recognition Architecture
    Pattern Recognition, 2021
    Co-Authors: Lei Kang, Pau Riba, Mauricio Villegas, Alicia Fornes, Marcal Rusinol
    Abstract:

    Sequence-to-sequence models have recently become very popular for tackling handwritten word recognition problems. However, how to effectively integrate an external language model into such a recognizer remains a challenging problem. The main difficulty when training a language model is that its corpus is usually different from the one used to train the handwritten word recognition system; the bias between the two word corpora leads to incorrect transcriptions, yielding similar or even worse performance on the recognition task. In this work, we introduce Candidate Fusion, a novel way to integrate an external language model into a sequence-to-sequence architecture: the suggestions produced from external language knowledge are fed as an additional input to the sequence-to-sequence recognizer. Candidate Fusion thus provides two improvements. On the one hand, the recognizer has the flexibility not only to combine its own information with that of the language model, but also to choose how much importance to give to the language model's information. On the other hand, the external language model can adapt itself to the training corpus and even learn the most common errors produced by the recognizer. Finally, comprehensive experiments show that Candidate Fusion outperforms state-of-the-art language models for handwritten word recognition tasks.
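
    To make the idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how an external language model's candidate transcription can be fed to a sequence-to-sequence decoder as an additional input at every step; the module names, sizes and use of PyTorch are assumptions, and the attention context over the word image is omitted for brevity.

      import torch
      import torch.nn as nn

      class CandidateFusionDecoderStep(nn.Module):
          """One decoding step that consumes both the previously decoded
          character and the character suggested by an external language model."""
          def __init__(self, vocab_size, emb_dim=64, hid_dim=256):
              super().__init__()
              self.char_emb = nn.Embedding(vocab_size, emb_dim)  # recognizer's own previous output
              self.cand_emb = nn.Embedding(vocab_size, emb_dim)  # language-model suggestion
              self.rnn = nn.GRUCell(2 * emb_dim, hid_dim)        # fused input -> new decoder state
              self.out = nn.Linear(hid_dim, vocab_size)

          def forward(self, prev_char, cand_char, hidden):
              fused = torch.cat([self.char_emb(prev_char), self.cand_emb(cand_char)], dim=-1)
              hidden = self.rnn(fused, hidden)
              return self.out(hidden), hidden

      # Usage: at each step, pass the recognizer's previous character and the
      # external language model's suggestion for the same position.
      step = CandidateFusionDecoderStep(vocab_size=80)
      logits, h = step(torch.tensor([5]), torch.tensor([7]), torch.zeros(1, 256))
      print(logits.shape)  # torch.Size([1, 80])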

  • Candidate Fusion: Integrating Language Modelling into a Sequence-to-Sequence Handwritten Word Recognition Architecture
    arXiv: Computer Vision and Pattern Recognition, 2019
    Co-Authors: Lei Kang, Pau Riba, Mauricio Villegas, Alicia Fornes, Marcal Rusinol
    Abstract:

    Sequence-to-sequence models have recently become very popular for tackling handwritten word recognition problems. However, how to effectively integrate an external language model into such a recognizer remains a challenging problem. The main difficulty when training a language model is that its corpus is usually different from the one used to train the handwritten word recognition system; the bias between the two word corpora leads to incorrect transcriptions, yielding similar or even worse performance on the recognition task. In this work, we introduce Candidate Fusion, a novel way to integrate an external language model into a sequence-to-sequence architecture: the suggestions produced from external language knowledge are fed as an additional input to the sequence-to-sequence recognizer. Candidate Fusion thus provides two improvements. On the one hand, the recognizer has the flexibility not only to combine its own information with that of the language model, but also to choose how much importance to give to the language model's information. On the other hand, the external language model can adapt itself to the training corpus and even learn the most common errors produced by the recognizer. Finally, comprehensive experiments show that Candidate Fusion outperforms state-of-the-art language models for handwritten word recognition tasks.

Andriy Mnih - One of the best experts on this subject based on the ideXlab platform.

  • The Lipschitz Constant of Self-Attention
    International Conference on Machine Learning, 2021
    Co-Authors: Hyunjik Kim, George Papamakarios, Andriy Mnih
    Abstract:

    Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is *not* Lipschitz, and propose an alternative L2 self-attention that *is* Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
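
    The contrast between the two attention variants can be sketched in a few lines of numpy. This is an illustrative reading of the abstract (tied query/key projections for the L2 variant and a simple scaling choice), not the paper's exact formulation.

      import numpy as np

      def softmax(z, axis=-1):
          z = z - z.max(axis=axis, keepdims=True)
          e = np.exp(z)
          return e / e.sum(axis=axis, keepdims=True)

      def dot_product_attention(X, Wq, Wk, Wv):
          # standard scaled dot-product scores q_i . k_j (not Lipschitz on an unbounded domain)
          Q, K, V = X @ Wq, X @ Wk, X @ Wv
          return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

      def l2_attention(X, Wq, Wv):
          # scores from negative squared Euclidean distances, with tied projections (W_Q = W_K)
          Q = K = X @ Wq
          V = X @ Wv
          d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
          return softmax(-d2 / np.sqrt(Q.shape[-1])) @ V

      rng = np.random.default_rng(0)
      X = rng.normal(size=(5, 8))                       # 5 tokens, 8-dimensional features
      Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
      print(dot_product_attention(X, Wq, Wk, Wv).shape, l2_attention(X, Wq, Wv).shape)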

  • The Lipschitz Constant of Self-Attention
    arXiv: Machine Learning, 2020
    Co-Authors: Hyunjik Kim, George Papamakarios, Andriy Mnih
    Abstract:

    Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for an unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

  • Learning Distributed Representations for Statistical Language Modelling and Collaborative Filtering
    2010
    Co-Authors: Andriy Mnih
    Abstract:

    With the increasing availability of large datasets, machine learning techniques are becoming an increasingly attractive alternative to expert-designed approaches to solving complex problems in domains where data is abundant. In this thesis we introduce several models for large sparse discrete datasets. Our approach, which is based on probabilistic models that use distributed representations to alleviate the effects of data sparsity, is applied to statistical language modelling and collaborative filtering. We introduce three probabilistic language models that represent words using learned real-valued vectors. Two of the models are based on the Restricted Boltzmann Machine (RBM) architecture while the third one is a simple deterministic model. We show that the deterministic model outperforms the widely used n-gram models and learns sensible word representations. To reduce the time complexity of training and making predictions with the deterministic model, we introduce a hierarchical version of the model that can be exponentially faster. The speedup is achieved by structuring the vocabulary as a tree over words and taking advantage of this structure. We propose a simple feature-based algorithm for automatic construction of trees over words from data and show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. We then turn our attention to collaborative filtering and show how RBM models can be used to model the distribution of sparse high-dimensional user rating vectors efficiently, presenting inference and learning algorithms that scale linearly in the number of observed ratings. We also introduce the Probabilistic Matrix Factorization model, which is based on the probabilistic formulation of the low-rank matrix approximation problem for partially observed matrices. The two models are then extended to allow conditioning on the identities of the rated items, whether or not the actual rating values are known. Our results on the Netflix Prize dataset show that both RBM and PMF models outperform online SVD models.
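
    The tree-based speed-up mentioned above can be illustrated with a toy hierarchical softmax: a word's probability is the product of binary left/right decisions along its root-to-leaf path, so prediction costs O(log V) rather than O(V). The tree, codes and parameters below are toy assumptions, not the thesis' feature-based tree construction.

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def hierarchical_word_prob(context_vec, path_nodes, path_bits, node_vectors):
          """P(word | context) as a product of logistic left/right decisions,
          one per internal tree node on the word's root-to-leaf path."""
          p = 1.0
          for node, bit in zip(path_nodes, path_bits):
              p_left = sigmoid(context_vec @ node_vectors[node])
              p *= p_left if bit == 0 else (1.0 - p_left)
          return p

      rng = np.random.default_rng(0)
      node_vectors = rng.normal(size=(3, 16))   # a depth-2 tree over 4 words: 3 internal nodes
      context = rng.normal(size=16)             # predicted representation of the next word
      # probability of the word sitting at path "left, then right" from the root:
      print(hierarchical_word_prob(context, path_nodes=[0, 1], path_bits=[0, 1], node_vectors=node_vectors))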

  • Three New Graphical Models for Statistical Language Modelling
    International Conference on Machine Learning, 2007
    Co-Authors: Andriy Mnih, Geoffrey E Hinton
    Abstract:

    The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence, given several preceding words, by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from the previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance, as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.
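
    A minimal numpy sketch of the shared idea follows: predict a real-valued representation for the next word as a linear function of the previous words' representations, then score every vocabulary word against that prediction. The sizes and initialisation are illustrative, and the stochastic binary hidden features of the full models are omitted.

      import numpy as np

      def next_word_distribution(context_ids, R, C, b):
          """R: (V, D) word representations, C: per-position (D, D) matrices,
          b: (V,) biases.  Returns a softmax distribution over the vocabulary."""
          r_hat = sum(R[w] @ C[i] for i, w in enumerate(context_ids))  # predicted representation
          scores = R @ r_hat + b                                       # match against every word
          scores -= scores.max()
          p = np.exp(scores)
          return p / p.sum()

      rng = np.random.default_rng(0)
      V, D = 1000, 32
      R = 0.1 * rng.normal(size=(V, D))
      C = [0.1 * rng.normal(size=(D, D)) for _ in range(2)]            # two-word context
      dist = next_word_distribution([12, 345], R, C, np.zeros(V))
      print(dist.shape, round(float(dist.sum()), 6))                   # (1000,) 1.0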

Yoshua Bengio - One of the best experts on this subject based on the ideXlab platform.

  • Learning Hierarchical Structures On-the-Fly with a Recurrent-Recursive Model for Sequences
    Meeting of the Association for Computational Linguistics, 2018
    Co-Authors: Athul Paul Jacob, Zhouhan Lin, Alessandro Sordoni, Yoshua Bengio
    Abstract:

    We propose a hierarchical model for sequential data that learns a tree on the fly, i.e. while reading the sequence. In the model, a recurrent network adapts its structure and reuses its recurrent weights in a recursive manner. This creates adaptive skip-connections that ease the learning of long-term dependencies. The tree structure can either be inferred without supervision through reinforcement learning, or learned in a supervised manner. We provide preliminary experiments on a novel Math Expression Evaluation (MEE) task, which is created to have a hierarchical tree structure that can be used to study the effectiveness of our model. Additionally, we test our model on well-known propositional logic and language modelling tasks. Experimental results show the potential of our approach.
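
    A very loose sketch of the mechanism, as read from the abstract, follows: a single recurrent cell both reads tokens and, whenever a constituent boundary is reached, merges the finished constituent's state into its parent with the same weights, which effectively creates a skip-connection over the constituent. The boundary signal, cell type and sizes are assumptions, not the authors' architecture.

      import torch
      import torch.nn as nn

      emb = nn.Embedding(100, 16)
      cell = nn.GRUCell(16, 16)          # one set of recurrent weights, reused recursively

      def read(tokens, boundaries):
          """boundaries[i] == 1 marks the end of a constituent after token i."""
          stack = [torch.zeros(1, 16)]                 # one state per open constituent
          for tok, is_boundary in zip(tokens, boundaries):
              stack[-1] = cell(emb(torch.tensor([tok])), stack[-1])
              if is_boundary:                          # close the constituent ...
                  child = stack.pop()
                  parent = stack[-1] if stack else torch.zeros(1, 16)
                  if stack:
                      stack[-1] = cell(child, parent)  # ... and merge it into its parent
                  else:
                      stack.append(cell(child, parent))
          return stack[-1]

      print(read([3, 7, 9, 2], [0, 1, 0, 1]).shape)    # torch.Size([1, 16])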

  • Multiscale Sequence Modeling with a Learned Dictionary
    arXiv: Machine Learning, 2017
    Co-Authors: Bart Van Merrienboer, Amartya Sanyal, Hugo Larochelle, Yoshua Bengio
    Abstract:

    We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modelling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modelling tasks, especially for smaller models.
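
    The dictionary-learning step can be illustrated with the vanilla byte-pair encoding loop: repeatedly merge the most frequent adjacent pair of symbols in the corpus. The paper uses a variation of BPE, so the sketch below is only the basic algorithm on assumed toy data.

      from collections import Counter

      def learn_bpe(corpus, num_merges):
          seqs = [list(word) for word in corpus]            # start from characters
          merges = []
          for _ in range(num_merges):
              pairs = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
              if not pairs:
                  break
              (a, b), _ = pairs.most_common(1)[0]           # most frequent adjacent pair
              merges.append(a + b)
              new_seqs = []
              for s in seqs:                                # replace the pair with the merged token
                  out, i = [], 0
                  while i < len(s):
                      if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                          out.append(a + b)
                          i += 2
                      else:
                          out.append(s[i])
                          i += 1
                  new_seqs.append(out)
              seqs = new_seqs
          return merges

      print(learn_bpe(["lower", "lowest", "newer", "wider"], num_merges=4))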

  • Hierarchical Multiscale Recurrent Neural Networks
    International Conference on Learning Representations, 2017
    Co-Authors: Junyoung Chung, Yoshua Bengio
    Abstract:

    Learning both hierarchical and temporal representations has been among the long-standing challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of model can actually capture temporal dependencies by discovering the latent hierarchical structure of the sequence. In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural network, that can capture the latent hierarchical structure of a sequence by encoding temporal dependencies at different timescales using a novel update mechanism. We show some evidence that the proposed model can discover the underlying hierarchical structure of sequences without using explicit boundary information. We evaluate our proposed model on character-level language modelling and handwriting sequence generation.
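
    The update mechanism can be caricatured as a per-layer choice between three operations driven by binary boundary detectors. The sketch below is a schematic reduction (a plain tanh transition instead of the gated cell, and hard 0/1 boundaries), not the model's exact equations.

      import numpy as np

      def layer_step(h_prev, x_t, z_below, z_prev, cell):
          """COPY / UPDATE / FLUSH selection for one layer at one time step:
          - z_prev == 1 (this layer just closed a segment): FLUSH, restart the state;
          - z_below == 1 (the layer below finished a segment): UPDATE normally;
          - otherwise: COPY the previous state unchanged."""
          if z_prev == 1:
              return cell(np.zeros_like(h_prev), x_t)      # FLUSH
          if z_below == 1:
              return cell(h_prev, x_t)                     # UPDATE
          return h_prev                                    # COPY

      rng = np.random.default_rng(0)
      Wh, Wx = 0.3 * rng.normal(size=(8, 8)), 0.3 * rng.normal(size=(8, 8))
      cell = lambda h, x: np.tanh(h @ Wh + x @ Wx)         # toy recurrent transition

      h = np.zeros(8)
      for z_below, z_prev in [(1, 0), (0, 0), (1, 1)]:     # UPDATE, COPY, FLUSH
          h = layer_step(h, rng.normal(size=8), z_below, z_prev, cell)
      print(h.shape)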

  • End-to-end attention-based large vocabulary speech recognition
    ICASSP IEEE International Conference on Acoustics Speech and Signal Processing - Proceedings, 2016
    Co-Authors: Jan Chorowski, Philemon Brakel, Dmitriy Serdyuk, Yoshua Bengio
    Abstract:

    Many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the desired character sequence is learned automatically by an attention mechanism built into the RNN. For each predicted character, the attention mechanism scans the input sequence and chooses relevant frames. We propose two methods to speed up this operation: limiting the scan to a subset of the most promising frames, and pooling over time the information contained in neighboring frames, thereby reducing the source sequence length. Integrating an n-gram language model into the decoding process yields recognition accuracies similar to other HMM-free RNN-based approaches.
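
    One simple way such decoding-time language-model integration can be realised, offered here only as an illustrative stand-in for the paper's actual mechanism, is to rank candidate transcriptions by the recognizer's log-probability plus a weighted n-gram language-model log-probability; the weight and helper names are assumptions.

      def hypothesis_score(rnn_logprobs, lm_logprobs, lm_weight=0.5):
          """Combine per-character scores from the attention-based recognizer
          with scores from an external n-gram language model."""
          return sum(rnn_logprobs) + lm_weight * sum(lm_logprobs)

      # two candidate transcriptions of the same utterance (toy numbers)
      hyp_a = hypothesis_score([-0.2, -0.1, -0.3], [-1.0, -0.8, -0.5])
      hyp_b = hypothesis_score([-0.3, -0.2, -0.2], [-0.4, -0.3, -0.2])
      print("best hypothesis:", "A" if hyp_a > hyp_b else "B")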

  • Batch Normalized Recurrent Neural Networks
    arXiv: Machine Learning, 2015
    Co-Authors: Cesar Laurent, Philemon Brakel, Gabriel Pereyra, Ying Zhang, Yoshua Bengio
    Abstract:

    Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs does not help the training procedure. We also show that, when applied to the input-to-hidden transitions, batch normalization can lead to faster convergence of the training criterion but does not seem to improve generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
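
    The variant reported to converge faster, normalising only the input-to-hidden projection while leaving the recurrent transition untouched, can be sketched as follows; the cell type, sizes and use of PyTorch are illustrative assumptions.

      import torch
      import torch.nn as nn

      class BNInputRNNCell(nn.Module):
          """A simple tanh RNN step with batch normalization applied only to
          the input-to-hidden projection, not the hidden-to-hidden one."""
          def __init__(self, in_dim, hid_dim):
              super().__init__()
              self.wx = nn.Linear(in_dim, hid_dim, bias=False)
              self.wh = nn.Linear(hid_dim, hid_dim)
              self.bn = nn.BatchNorm1d(hid_dim)        # normalizes only the input projection

          def forward(self, x_t, h_prev):
              return torch.tanh(self.bn(self.wx(x_t)) + self.wh(h_prev))

      cell = BNInputRNNCell(32, 64)
      h = torch.zeros(8, 64)                           # batch of 8 sequences
      for t in range(5):
          h = cell(torch.randn(8, 32), h)
      print(h.shape)                                   # torch.Size([8, 64])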

Lei Kang - One of the best experts on this subject based on the ideXlab platform.

  • Candidate Fusion: Integrating Language Modelling into a Sequence-to-Sequence Handwritten Word Recognition Architecture
    Pattern Recognition, 2021
    Co-Authors: Lei Kang, Pau Riba, Mauricio Villegas, Alicia Fornes, Marcal Rusinol
    Abstract:

    Sequence-to-sequence models have recently become very popular for tackling handwritten word recognition problems. However, how to effectively integrate an external language model into such a recognizer remains a challenging problem. The main difficulty when training a language model is that its corpus is usually different from the one used to train the handwritten word recognition system; the bias between the two word corpora leads to incorrect transcriptions, yielding similar or even worse performance on the recognition task. In this work, we introduce Candidate Fusion, a novel way to integrate an external language model into a sequence-to-sequence architecture: the suggestions produced from external language knowledge are fed as an additional input to the sequence-to-sequence recognizer. Candidate Fusion thus provides two improvements. On the one hand, the recognizer has the flexibility not only to combine its own information with that of the language model, but also to choose how much importance to give to the language model's information. On the other hand, the external language model can adapt itself to the training corpus and even learn the most common errors produced by the recognizer. Finally, comprehensive experiments show that Candidate Fusion outperforms state-of-the-art language models for handwritten word recognition tasks.

  • Candidate Fusion: Integrating Language Modelling into a Sequence-to-Sequence Handwritten Word Recognition Architecture
    arXiv: Computer Vision and Pattern Recognition, 2019
    Co-Authors: Lei Kang, Pau Riba, Mauricio Villegas, Alicia Fornes, Marcal Rusinol
    Abstract:

    Sequence-to-sequence models have recently become very popular for tackling handwritten word recognition problems. However, how to effectively integrate an external language model into such a recognizer remains a challenging problem. The main difficulty when training a language model is that its corpus is usually different from the one used to train the handwritten word recognition system; the bias between the two word corpora leads to incorrect transcriptions, yielding similar or even worse performance on the recognition task. In this work, we introduce Candidate Fusion, a novel way to integrate an external language model into a sequence-to-sequence architecture: the suggestions produced from external language knowledge are fed as an additional input to the sequence-to-sequence recognizer. Candidate Fusion thus provides two improvements. On the one hand, the recognizer has the flexibility not only to combine its own information with that of the language model, but also to choose how much importance to give to the language model's information. On the other hand, the external language model can adapt itself to the training corpus and even learn the most common errors produced by the recognizer. Finally, comprehensive experiments show that Candidate Fusion outperforms state-of-the-art language models for handwritten word recognition tasks.

Donna E Youngs - One of the best experts on this subject based on the ideXlab platform.

  • A Language Modelling Approach to Linking Criminal Styles with Offender Characteristics
    Data and Knowledge Engineering, 2010
    Co-Authors: Richard Bache, Fabio Crestani, David V Canter, Donna E Youngs
    Abstract:

    Attempts to infer the characteristics of offenders from their criminal behaviour ('offender profiling') have been only partially successful, since they have relied on subjective judgments based on limited data. The words and structured data used in crime descriptions recorded by the police relate to behavioural features. Language modelling was therefore applied to an existing police archive to link behavioural features with significant characteristics of offenders. Both multinomial and multiple-Bernoulli models were used. Although the categories selected are gender, age group, ethnic appearance and broad occupation (employed or not), the approach can in principle be applied to any recorded characteristic. Results indicate that statistically significant relationships exist for all of the characteristics in many types of crime. Bernoulli models tend to perform better than multinomial ones. It is also possible to automatically identify specific terms which, when taken together, give insight into the style of offending related to a particular group.
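
    The two model families can be illustrated with a toy unigram scorer: a multinomial model uses smoothed term frequencies within a category's descriptions, while a multiple-Bernoulli model uses smoothed term presence/absence across them. The data, smoothing and category names below are invented for illustration and do not reflect the study's estimation details.

      import math
      from collections import Counter

      def multinomial_logprob(doc_terms, category_docs, vocab, alpha=1.0):
          counts = Counter(t for d in category_docs for t in d)        # term frequencies
          total = sum(counts.values())
          return sum(math.log((counts[t] + alpha) / (total + alpha * len(vocab)))
                     for t in doc_terms)

      def bernoulli_logprob(doc_terms, category_docs, vocab, alpha=1.0):
          n = len(category_docs)
          df = Counter(t for d in category_docs for t in set(d))       # document frequencies
          present = set(doc_terms)
          def p(t):
              return (df[t] + alpha) / (n + 2 * alpha)
          return sum(math.log(p(t)) if t in present else math.log(1 - p(t))
                     for t in vocab)

      group_a = [["window", "forced", "night"], ["window", "alarm", "night"]]
      group_b = [["online", "card", "daytime"], ["card", "transfer", "daytime"]]
      vocab = sorted({t for d in group_a + group_b for t in d})
      new_description = ["window", "night", "alarm"]
      for name, fn in [("multinomial", multinomial_logprob), ("bernoulli", bernoulli_logprob)]:
          a, b = fn(new_description, group_a, vocab), fn(new_description, group_b, vocab)
          print(name, "favours", "group_a" if a > b else "group_b")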

  • A Language Modelling Approach to Linking Criminal Styles with Offender Characteristics
    Applications of Natural Language to Data Bases, 2008
    Co-Authors: Richard Bache, Fabio Crestani, David V Canter, Donna E Youngs
    Abstract:

    Attempts to infer the characteristics of offenders from their criminal behaviour ('offender profiling') have been only partially successful, since they have relied on subjective judgments based on limited data. The words and structured data used in crime descriptions recorded by the police relate to behavioural features. Language modelling was therefore applied to an existing police archive to link behavioural features with significant characteristics of offenders. Both multinomial and multiple-Bernoulli models were used. Although the categories selected are gender and age group, the approach can in principle be applied to any recorded characteristic. Results indicate that statistically significant relationships exist for both age and sex in certain types of crime. Both types of language model perform with similar effectiveness. It is also possible to automatically identify specific terms which, when taken together, give insight into the style of offending related to a particular group.

  • Mining Police Digital Archives to Link Criminal Styles with Offender Characteristics
    International Conference on Asian Digital Libraries, 2007
    Co-Authors: Richard Bache, Fabio Crestani, David V Canter, Donna E Youngs
    Abstract:

    The partial success of inferring the characteristics of offenders from their criminal behaviour ('offender profiling') has relied on limited data and subjective judgments. We therefore sought to determine whether Information Retrieval techniques, and in particular language modelling, could be applied directly to existing police digital records of criminal events to identify significant characteristics of offenders. The categories selected were gender and age group. Results showed that distinct differences in characteristics do exist.