Training Error

The Experts below are selected from a list of 108,006 Experts worldwide, ranked by the ideXlab platform.

David McAllester - One of the best experts on this subject based on the ideXlab platform.

  • Computable Shell Decomposition Bounds
    Journal of Machine Learning Research, 2004
    Co-Authors: John Langford, David McAllester
    Abstract:

    Haussler, Kearns, Seung and Tishby introduced the notion of a shell decomposition of the union bound as a means of understanding certain empirical phenomena in learning curves such as phase transitions. Here we use a variant of their ideas to derive an upper bound on the generalization Error of a hypothesis computable from its Training Error and the histogram of Training Errors for the hypotheses in the class. In most cases this new bound is significantly tighter than traditional bounds computed from the Training Error and the cardinality of the class. Our results can also be viewed as providing a rigorous foundation for a model selection algorithm proposed by Scheffer and Joachims.
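
    A rough numerical illustration of the idea (not the exact theorem of the paper): the sketch below compares a classical union-bound (Occam-style) penalty, which charges every hypothesis for the full cardinality of the class, with a shell-style penalty that charges only the log-count of hypotheses whose Training Error falls in the same histogram bin, plus a small union over bins. The constants, the binning, and the toy hypothesis class are assumptions made for readability.

```python
import numpy as np

def occam_bound(train_err, class_size, m, delta=0.05):
    """Classical union-bound (Occam-style) guarantee: every hypothesis pays a
    complexity term of log |H| for the full class."""
    return train_err + np.sqrt((np.log(class_size) + np.log(1.0 / delta)) / (2.0 * m))

def shell_style_bound(train_err, err_histogram, bin_edges, m, delta=0.05):
    """Illustrative shell-style guarantee: the complexity term counts only the
    hypotheses whose Training Error falls in the same histogram bin, plus a
    small union over the bins.  Constants are simplified, not the paper's."""
    bin_idx = int(np.searchsorted(bin_edges, train_err, side="right")) - 1
    shell_size = max(int(err_histogram[bin_idx]), 1)
    n_bins = len(err_histogram)
    penalty = np.log(shell_size) + np.log(n_bins) + np.log(1.0 / delta)
    return train_err + np.sqrt(penalty / (2.0 * m))

# Toy class: 100,000 hypotheses whose Training Errors cluster around 0.4, plus
# one good hypothesis with Training Error 0.05, measured on m = 2,000 examples.
rng = np.random.default_rng(0)
m = 2000
errs = np.clip(rng.normal(0.4, 0.05, size=100_000), 0.0, 1.0)
bin_edges = np.linspace(0.0, 1.0, 21)               # 20 bins of width 0.05
hist, _ = np.histogram(errs, bins=bin_edges)

print("Occam bound :", occam_bound(0.05, len(errs), m))
print("shell bound :", shell_style_bound(0.05, hist, bin_edges, m))
```

    Because the good hypothesis sits in a nearly empty shell, its complexity charge is far smaller than log of the class size, which is the qualitative effect the bound exploits.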

  • PAC-Bayesian Stochastic Model Selection
    Machine Learning, 2003
    Co-Authors: David McAllester
    Abstract:

    PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a “posterior distribution” on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the Training Error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
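
    As a hedged sketch of the quantities involved (using a common McAllester-style form of the bound rather than the paper's exact statement), the code below builds a Gibbs posterior over a finite hypothesis class, computes the Training Error of the resulting stochastic classifier and its KL-divergence from the prior, and evaluates the bound. The hypothesis class, the uniform prior, and the temperature lam are illustrative assumptions.

```python
import numpy as np

def pac_bayes_bound(gibbs_train_err, kl_qp, m, delta=0.05):
    """McAllester-style PAC-Bayes guarantee for the stochastic (Gibbs)
    classifier: its Training Error plus a term driven by KL(Q || P).  The
    constants follow the common sqrt((KL + ln(m/delta)) / (2(m-1))) form and
    may differ slightly from the paper's exact statement."""
    return gibbs_train_err + np.sqrt((kl_qp + np.log(m / delta)) / (2.0 * (m - 1)))

def gibbs_posterior(prior, train_errs, lam, m):
    """Gibbs posterior Q(h) proportional to P(h) * exp(-lam * m * train_err(h)),
    the exponential-weights form that optimizes bounds of this type."""
    logits = np.log(prior) - lam * m * train_errs
    logits -= logits.max()                          # numerical stability
    q = np.exp(logits)
    return q / q.sum()

# Toy finite hypothesis class with a uniform prior; lam is an illustrative choice.
rng = np.random.default_rng(0)
m = 5000
train_errs = rng.uniform(0.1, 0.5, size=1000)
prior = np.full(1000, 1.0 / 1000)

q = gibbs_posterior(prior, train_errs, lam=0.01, m=m)
gibbs_err = float(q @ train_errs)                   # Training Error of the stochastic classifier
kl = float(np.sum(q * (np.log(q) - np.log(prior))))
print("Gibbs Training Error:", round(gibbs_err, 4))
print("PAC-Bayes bound     :", round(pac_bayes_bound(gibbs_err, kl, m), 4))
```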

  • COLT - Computable Shell Decomposition Bounds
    2000
    Co-Authors: John Langford, David McAllester
    Abstract:

    Haussler, Kearns, Seung and Tishby introduced the notion of a shell decomposition of the union bound as a means of understanding certain empirical phenomena in learning curves such as phase transitions. Here we use a variant of their ideas to derive an upper bound on the generalization Error of a hypothesis computable from its Training Error and the histogram of Training Errors for the hypotheses in the class. In most cases this new bound is significantly tighter than traditional bounds computed from the Training Error and the cardinality of the class. Our results can also be viewed as providing a rigorous foundation for a model selection algorithm proposed by Scheffer and Joachims.

Tianbao Yang - One of the best experts on this subject based on the ideXlab platform.

  • Stagewise Training accelerates convergence of testing Error over SGD
    Neural Information Processing Systems, 2019
    Co-Authors: Zhuoning Yuan, Yan Yan, Rong Jin, Tianbao Yang
    Abstract:

    The stagewise Training strategy is widely used for learning neural networks: it runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (a.k.a. learning rate) and geometrically decreases the step size after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size, in terms of both Training Error and testing Error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise Training strategy for minimizing an empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of "nice-behaviored" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise Training than vanilla SGD under the PL condition, on both Training Error and testing Error. Experiments on stagewise learning of deep residual networks show that they satisfy one type of non-convexity assumption and can therefore be explained by our theory.
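
    A minimal sketch of the two step-size schedules being compared, run with plain SGD on a toy strongly convex quadratic (which satisfies the PL condition); the stage length, decay factor, and noise level are illustrative assumptions, not the paper's settings or its ResNet experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def stagewise_lr(t, eta0=0.5, stage_len=2000, decay=0.5):
    """Stagewise schedule: constant step size within a stage, shrunk
    geometrically (here halved) at every stage boundary."""
    return eta0 * decay ** (t // stage_len)

def polynomial_lr(t, eta0=0.5, alpha=0.5):
    """Vanilla polynomially decaying schedule eta_t = eta0 / (1 + t)^alpha."""
    return eta0 / (1.0 + t) ** alpha

def run_sgd(schedule, steps=10_000, dim=20, noise=0.1):
    """Plain SGD on f(w) = 0.5 * ||w||^2, a strongly convex objective (hence
    satisfying the PL condition), with additive noise standing in for
    minibatch gradient noise."""
    w = np.ones(dim)
    for t in range(steps):
        grad = w + noise * rng.standard_normal(dim)
        w = w - schedule(t) * grad
    return 0.5 * float(w @ w)                        # final objective value ("Training Error")

print("stagewise final objective :", run_sgd(stagewise_lr))
print("polynomial final objective:", run_sgd(polynomial_lr))
```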

  • Stagewise Training accelerates convergence of testing Error over SGD
    arXiv: Machine Learning, 2018
    Co-Authors: Zhuoning Yuan, Yan Yan, Rong Jin, Tianbao Yang
    Abstract:

    The stagewise Training strategy is widely used for learning neural networks: it runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (a.k.a. learning rate) and geometrically decreases the step size after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size, in terms of both Training Error and testing Error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise Training strategy for minimizing an empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of "nice-behaviored" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise Training than vanilla SGD under the PL condition, on both Training Error and testing Error. Experiments on stagewise learning of deep residual networks show that they satisfy one type of non-convexity assumption and can therefore be explained by our theory. Of independent interest, the testing Error bounds for the considered non-convex loss functions are dimensionality- and norm-independent.

Zhuoning Yuan - One of the best experts on this subject based on the ideXlab platform.

  • Stagewise Training accelerates convergence of testing Error over SGD
    Neural Information Processing Systems, 2019
    Co-Authors: Zhuoning Yuan, Yan Yan, Rong Jin, Tianbao Yang
    Abstract:

    The stagewise Training strategy is widely used for learning neural networks: it runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (a.k.a. learning rate) and geometrically decreases the step size after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size, in terms of both Training Error and testing Error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise Training strategy for minimizing an empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of "nice-behaviored" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise Training than vanilla SGD under the PL condition, on both Training Error and testing Error. Experiments on stagewise learning of deep residual networks show that they satisfy one type of non-convexity assumption and can therefore be explained by our theory.

  • Stagewise Training accelerates convergence of testing Error over SGD
    arXiv: Machine Learning, 2018
    Co-Authors: Zhuoning Yuan, Yan Yan, Rong Jin, Tianbao Yang
    Abstract:

    The stagewise Training strategy is widely used for learning neural networks: it runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (a.k.a. learning rate) and geometrically decreases the step size after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size, in terms of both Training Error and testing Error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise Training strategy for minimizing an empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of "nice-behaviored" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise Training than vanilla SGD under the PL condition, on both Training Error and testing Error. Experiments on stagewise learning of deep residual networks show that they satisfy one type of non-convexity assumption and can therefore be explained by our theory. Of independent interest, the testing Error bounds for the considered non-convex loss functions are dimensionality- and norm-independent.

Sumio Watanabe - One of the best experts on this subject based on the ideXlab platform.

  • Equations of states in singular statistical estimation
    Neural Networks, 2010
    Co-Authors: Sumio Watanabe
    Abstract:

    Learning machines that have hierarchical structures or hidden variables are singular statistical models because they are nonidentifiable and their Fisher information matrices are singular. In singular statistical models, neither does the Bayes a posteriori distribution converge to the normal distribution nor does the maximum likelihood estimator satisfy asymptotic normality. This is the main reason that it has been difficult to predict their generalization performance from trained states. In this paper, we study four Errors, (1) the Bayes generalization Error, (2) the Bayes Training Error, (3) the Gibbs generalization Error, and (4) the Gibbs Training Error, and prove that there are universal mathematical relations among these Errors. The formulas proved in this paper are equations of states in statistical estimation because they hold for any true distribution, any parametric model, and any a priori distribution. Also we show that the Bayes and Gibbs generalization Errors can be estimated by Bayes and Gibbs Training Errors, and we propose widely applicable information criteria that can be applied to both regular and singular statistical models.
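
    The "widely applicable information criterion" of the paper can be computed from posterior samples as the Bayes Training Error (negative mean log pointwise predictive density) corrected by the posterior variance of the log-likelihood. The sketch below does this for a toy conjugate normal model; the per-datum scale follows Watanabe's definition, but sign and scale conventions vary across references, and the model and sample sizes are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def waic(log_lik):
    """Widely applicable information criterion from a matrix of pointwise
    log-likelihoods with shape (posterior draws, data points):
    Bayes Training Error (negative mean log pointwise predictive density)
    plus the mean posterior variance of the log-likelihood."""
    lppd = np.log(np.exp(log_lik).mean(axis=0))      # log pointwise predictive density
    func_var = log_lik.var(axis=0)                   # functional variance per data point
    return float((-lppd + func_var).mean())

# Toy conjugate example: N(mu, 1) likelihood with a N(0, 1) prior on mu.
rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=200)                   # observed data
n = len(x)
post_mean = x.sum() / (n + 1)                        # posterior mean of mu
post_sd = np.sqrt(1.0 / (n + 1))                     # posterior std of mu
mu_draws = rng.normal(post_mean, post_sd, size=2000) # posterior samples
log_lik = stats.norm.logpdf(x[None, :], loc=mu_draws[:, None], scale=1.0)
print("WAIC (per-datum scale):", waic(log_lik))
```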

  • Equations of States in Singular Statistical Estimation
    arXiv: Learning, 2007
    Co-Authors: Sumio Watanabe
    Abstract:

    Learning machines which have hierarchical structures or hidden variables are singular statistical models because they are nonidentifiable and their Fisher information matrices are singular. In singular statistical models, neither does the Bayes a posteriori distribution converge to the normal distribution nor does the maximum likelihood estimator satisfy asymptotic normality. This is the main reason why it has been difficult to predict their generalization performance from trained states. In this paper, we study four Errors, (1) the Bayes generalization Error, (2) the Bayes Training Error, (3) the Gibbs generalization Error, and (4) the Gibbs Training Error, and prove that there are mathematical relations among these Errors. The formulas proved in this paper are equations of states in statistical estimation because they hold for any true distribution, any parametric model, and any a priori distribution. Also we show that the Bayes and Gibbs generalization Errors can be estimated from the Bayes and Gibbs Training Errors, and we propose widely applicable information criteria which can be applied to both regular and singular statistical models.

Katsuyuki Hagiwara - One of the best experts on this subject based on the ideXlab platform.

  • On the problem in model selection of neural network regression in overrealizable scenario
    Neural computation, 2002
    Co-Authors: Katsuyuki Hagiwara
    Abstract:

    In considering statistical model selection of neural networks and radial basis functions in an overrealizable case, the problem of unidentifiability emerges. Because the model selection criterion is an unbiased estimator of the generalization Error based on the Training Error, this article analyzes the expected Training Error and the expected generalization Error of neural networks and radial basis functions in overrealizable cases and clarifies the difference from regular models, for which identifiability holds. As a special case of an overrealizable scenario, we assumed a Gaussian noise sequence as Training data. In the least-squares estimation under this assumption, we first formulated the problem, in which the calculation of the expected Errors of unidentifiable networks is reduced to the calculation of the expectation of the supremum of a χ² process. Under this formulation, we gave an upper bound on the expected Training Error and a lower bound on the expected generalization Error, where the generalization is measured at the set of Training inputs. Furthermore, we gave stochastic bounds on the Training Error and the generalization Error. The obtained upper bound on the expected Training Error is smaller than in regular models, and the lower bound on the expected generalization Error is larger than in regular models. The result tells us that the degree of overfitting in neural networks and radial basis functions is higher than in regular models. Correspondingly, it also tells us that the generalization capability is worse than in the case of regular models. The article suffices to show a difference between neural networks and regular models in the context of least-squares estimation in a simple situation. This is a first step toward constructing a model selection criterion for an overrealizable case. Further important problems in this direction are also included in this article.
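
    A rough Monte-Carlo illustration of the phenomenon (not the paper's χ²-process analysis): fit pure Gaussian noise by least squares with (a) n basis functions whose centers are fixed in advance, mimicking a regular model, and (b) n Gaussian basis functions whose centers are picked greedily for each noise realization, mimicking the extra flexibility of unidentifiable networks. The greedy selection, basis widths, and problem sizes are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def ls_train_err(X, y):
    """Mean squared Training Error of the least-squares fit y ~ X w."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ w
    return float(r @ r) / len(y)

def rbf(x, c, width=0.2):
    return np.exp(-((x - c) ** 2) / (2.0 * width ** 2))

T, n, sigma2, trials = 100, 5, 1.0, 200
x = np.linspace(-1.0, 1.0, T)
fixed_centers = np.linspace(-1.0, 1.0, n)            # regular model: centers set in advance
candidate_centers = np.linspace(-1.0, 1.0, 50)       # pool the unidentifiable model may search

fixed_errs, adaptive_errs = [], []
for _ in range(trials):
    y = rng.normal(0.0, np.sqrt(sigma2), size=T)     # pure Gaussian noise as Training data
    # (a) regular model: design matrix known before seeing the noise
    Xf = np.stack([rbf(x, c) for c in fixed_centers], axis=1)
    fixed_errs.append(ls_train_err(Xf, y))
    # (b) overrealizable-style model: greedily pick the n best centers for THIS noise
    cols = []
    for _ in range(n):
        best = min(candidate_centers,
                   key=lambda c: ls_train_err(np.stack(cols + [rbf(x, c)], axis=1), y))
        cols.append(rbf(x, best))
    adaptive_errs.append(ls_train_err(np.stack(cols, axis=1), y))

print("regular-model value sigma^2 * (1 - n/T):", sigma2 * (1.0 - n / T))
print("fixed centers,    mean Training Error  :", round(float(np.mean(fixed_errs)), 4))
print("adaptive centers, mean Training Error  :", round(float(np.mean(adaptive_errs)), 4))
```

    The adaptive fit typically attains a lower average Training Error than the fixed design, which is the qualitative effect the paper quantifies: overrealizable models overfit noise more strongly than regular models.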

  • Upper bound of the expected Training Error of neural network regression for a Gaussian noise sequence
    Neural networks : the official journal of the International Neural Network Society, 2001
    Co-Authors: Katsuyuki Hagiwara, Shiro Usui, Taichi Hayasaka, Naohiro Toda, Kazuhiro Kuno
    Abstract:

    In neural network regression problems, often referred to as additive noise models, NIC (Network Information Criterion) has been proposed as a general model selection criterion for determining the optimal network size with high generalization performance. Although NIC has been derived using asymptotic expansion, it has been pointed out that this technique cannot be applied when the target function lies in the family of assumed networks but the family is not minimal for representing it, i.e., the overrealizable case, in which NIC reduces to the well-known AIC (Akaike Information Criterion) and others depending on the loss function. Because NIC is the unbiased estimator of the generalization Error based on the Training Error, the expectations of these Errors must be derived for neural networks in such cases. This paper gives upper bounds on the expectations of Training Errors with respect to the distribution of Training data, which we call the expected Training Error, for some types of networks under the squared Error loss. In the overrealizable case, because the Errors are determined by how the networks fit the noise components included in the data, the target data set is taken to be a Gaussian noise sequence. For radial basis function networks and 3-layered neural networks with a bell-shaped activation function in the hidden layer, the expected Training Error is bounded above by $\sigma_*^2 - 2n\sigma_*^2 \log T / T$, where $\sigma_*^2$ is the noise variance, $n$ is the number of basis functions or hidden units, and $T$ is the number of data. Furthermore, for 3-layered neural networks with a sigmoidal activation function in the hidden layer, we obtained an upper bound of $\sigma_*^2 - O(\log T / T)$ when $n > 2$. If the number of data is large enough, these bounds on the expected Training Error are smaller than $\sigma_*^2 - N(n)\sigma_*^2 / T$ as evaluated in NIC, where $N(n)$ is the total number of network parameters.
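
    For concreteness, the sketch below plugs illustrative numbers into the two expressions quoted in the abstract; the parameter count N(n) = 3n + 1 for a one-input, one-output 3-layer network is an assumption made here purely for the comparison.

```python
import numpy as np

def expected_train_err_upper_bound(sigma2, n, T):
    """Upper bound on the expected Training Error quoted in the abstract for
    RBF networks and 3-layer networks with bell-shaped hidden units."""
    return sigma2 - 2.0 * n * sigma2 * np.log(T) / T

def nic_evaluation(sigma2, num_params, T):
    """Expected Training Error as evaluated in NIC for a regular model with
    N(n) parameters."""
    return sigma2 - num_params * sigma2 / T

# Illustrative numbers: unit noise variance, n = 10 hidden units, T = 10,000 data.
sigma2, n, T = 1.0, 10, 10_000
print("upper bound on expected Training Error:", expected_train_err_upper_bound(sigma2, n, T))
print("NIC evaluation (N(n) = 3n + 1 assumed) :", nic_evaluation(sigma2, 3 * n + 1, T))
```

    With these numbers the bound (about 0.982) is smaller than the NIC evaluation (about 0.997), consistent with the abstract's claim for sufficiently large T.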

  • IJCNN (6) - On the problem in model selection of neural network regression in overrealizable scenario
    Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for, 2000
    Co-Authors: Katsuyuki Hagiwara, Kazuhiro Kuno, Shiro Usui
    Abstract:

    In this article, we analyze the expected Training Error and the expected generalization Error in a special case of the overrealizable scenario, in which the output data are a Gaussian noise sequence. First, we derive an upper bound on the expected Training Error of a network that is independent of the input probability distribution. Second, based on the first result, we derive a lower bound on the expected generalization Error of a network, provided that the inputs are not stochastic. The first result makes clear that the degree to which a network overfits the noise component in the data should be evaluated as larger than the evaluation given by NIC. From the second result, the expected generalization Error, which is directly associated with the model selection criterion, is larger than in NIC. These results suggest that the model selection criterion in the overrealizable scenario will be larger than NIC if the inputs are not stochastic. Additionally, the results of numerical experiments agree with our theoretical results.