Gradient Descent

14,000,000 Leading Edge Experts on the ideXlab platform

The Experts below are selected from a list of 58,719 Experts worldwide, ranked by the ideXlab platform

Quanquan Gu - One of the best experts on this subject based on the ideXlab platform.

  • Gradient Descent optimizes over-parameterized deep ReLU networks
    Machine Learning, 2019
    Co-Authors: Difan Zou, Dongruo Zhou, Yuan Cao, Quanquan Gu
    Abstract:

    We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using Gradient Descent. We show that with proper random weight initialization, Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of Gradient Descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for Gradient Descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient Descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of Gradient Descent for training deep neural networks.
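
    The training procedure the abstract describes can be sketched in a few lines. This is a toy illustration, not the paper's setting: a single hidden layer stands in for a deep network, the data and labels are synthetic, and the width and learning rate are arbitrary choices.

```python
import numpy as np

# Sketch: full-batch gradient descent with Gaussian random initialization
# on an over-parameterized one-hidden-layer ReLU network, trained with
# cross-entropy loss for binary classification on toy data.
rng = np.random.default_rng(0)
n, d, m = 40, 5, 512                 # samples, input dim, hidden width (m >> n)
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)      # toy separable labels

W = rng.normal(size=(d, m)) / np.sqrt(d)   # Gaussian initialization
v = rng.normal(size=m) / np.sqrt(m)
lr = 0.2

def loss_and_probs(W, v):
    h = np.maximum(X @ W, 0.0)               # ReLU activations
    p = 1.0 / (1.0 + np.exp(-(h @ v)))       # sigmoid output probability
    ce = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return ce, p, h

loss0, _, _ = loss_and_probs(W, v)
for step in range(500):
    _, p, h = loss_and_probs(W, v)
    g = (p - y) / n                          # d(mean cross-entropy)/d(logits)
    grad_v = h.T @ g
    grad_W = X.T @ (np.outer(g, v) * (h > 0))
    v -= lr * grad_v
    W -= lr * grad_W

loss1, p, _ = loss_and_probs(W, v)
print(f"training loss: {loss0:.3f} -> {loss1:.3f}")
```

    The iterates indeed stay close to the Gaussian initialization in this regime; the loss decreases to near zero even though the objective is non-convex.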

Difan Zou - One of the best experts on this subject based on the ideXlab platform.

  • Gradient Descent optimizes over-parameterized deep ReLU networks
    Machine Learning, 2019
    Co-Authors: Difan Zou, Dongruo Zhou, Yuan Cao, Quanquan Gu
    Abstract:

    We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using Gradient Descent. We show that with proper random weight initialization, Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of Gradient Descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for Gradient Descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient Descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of Gradient Descent for training deep neural networks.

  • Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
    arXiv: Learning, 2018
    Co-Authors: Difan Zou, Yuan Cao, Dongruo Zhou
    Abstract:

    We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using Gradient Descent and stochastic Gradient Descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both Gradient Descent and stochastic Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) Gradient Descent. Our theoretical results shed light on the optimization of deep learning and pave the way for studying the optimization dynamics of training modern deep neural networks.
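
    The stochastic variant differs from full-batch gradient descent only in that each update is computed on a random mini-batch. A minimal sketch under toy assumptions (one hidden layer rather than a deep network, synthetic separable data, arbitrary batch size and learning rate):

```python
import numpy as np

# Sketch: mini-batch stochastic gradient descent on an over-parameterized
# one-hidden-layer ReLU network with Gaussian random initialization.
rng = np.random.default_rng(1)
n, d, m, batch = 64, 5, 256, 8
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)   # labels from a random hyperplane

W = rng.normal(size=(d, m)) / np.sqrt(d)         # Gaussian initialization
v = rng.normal(size=m) / np.sqrt(m)
lr = 0.1

for step in range(2000):
    idx = rng.choice(n, size=batch, replace=False)   # stochastic mini-batch
    Xb, yb = X[idx], y[idx]
    h = np.maximum(Xb @ W, 0.0)                      # ReLU activations
    p = 1.0 / (1.0 + np.exp(-(h @ v)))               # sigmoid output
    g = (p - yb) / batch                             # d(mean loss)/d(logits)
    grad_v = h.T @ g
    grad_W = Xb.T @ (np.outer(g, v) * (h > 0))
    v -= lr * grad_v
    W -= lr * grad_W

h = np.maximum(X @ W, 0.0)
p = 1.0 / (1.0 + np.exp(-(h @ v)))
acc = float(np.mean((p > 0.5) == (y > 0.5)))
print(f"training accuracy: {acc:.2f}")
```

    Despite the noise in each mini-batch gradient, the iterates remain near the initialization and the network fits the training set, which is the behavior the analysis formalizes.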

Dongruo Zhou - One of the best experts on this subject based on the ideXlab platform.

  • Gradient Descent optimizes over-parameterized deep ReLU networks
    Machine Learning, 2019
    Co-Authors: Difan Zou, Dongruo Zhou, Yuan Cao, Quanquan Gu
    Abstract:

    We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using Gradient Descent. We show that with proper random weight initialization, Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of Gradient Descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for Gradient Descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient Descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of Gradient Descent for training deep neural networks.

  • Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
    arXiv: Learning, 2018
    Co-Authors: Difan Zou, Yuan Cao, Dongruo Zhou
    Abstract:

    We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using Gradient Descent and stochastic Gradient Descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both Gradient Descent and stochastic Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) Gradient Descent. Our theoretical results shed light on the optimization of deep learning and pave the way for studying the optimization dynamics of training modern deep neural networks.

Alexander J Smola - One of the best experts on this subject based on the ideXlab platform.

  • parallelized stochastic Gradient Descent
    Neural Information Processing Systems, 2010
    Co-Authors: Martin Zinkevich, Markus Weimer, Alexander J Smola
    Abstract:

    With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic Gradient Descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7], our variant comes with parallel acceleration guarantees and poses no overly tight latency constraints, which might only be satisfiable in the multicore setting. Our analysis introduces a novel proof technique, contractive mappings, to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect, this answers the question of how quickly stochastic Gradient Descent algorithms reach the asymptotically normal regime [1, 8].
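
    The algorithm analyzed in this paper (often referred to as SimuParallelSGD) is simple to sketch: each machine runs plain SGD on its own shard of the data with no communication, and the resulting parameter vectors are averaged once at the end. The sketch below simulates the machines sequentially and substitutes logistic regression for the generic objective; all hyperparameters are illustrative.

```python
import numpy as np

# Sketch of parameter-averaging parallel SGD: k simulated "machines" each
# run one pass of per-example SGD on a disjoint data shard, then the
# final weight vectors are averaged. In a real deployment the shard loop
# below would run concurrently, since the workers never communicate.
rng = np.random.default_rng(2)
n, d, k = 400, 10, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)        # separable toy labels

shards = np.array_split(rng.permutation(n), k)   # disjoint data shards
lr = 0.5

def sgd_on_shard(idx):
    w = np.zeros(d)                              # common initialization
    for i in rng.permutation(idx):               # one pass, random order
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]              # per-example SGD step
    return w

w_avg = np.mean([sgd_on_shard(idx) for idx in shards], axis=0)  # final average
acc = float(np.mean(((X @ w_avg) > 0) == (y > 0.5)))
print(f"accuracy of averaged model: {acc:.2f}")
```

    A single averaging step at the end is what distinguishes this scheme from schemes that synchronize every iteration, and is why it imposes no per-step latency constraints.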

Vincent C. S. Lee - One of the best experts on this subject based on the ideXlab platform.

  • Privacy-Preserving Gradient-Descent Methods
    IEEE Transactions on Knowledge and Data Engineering, 2010
    Co-Authors: Shuguo Han, Li Wan, Vincent C. S. Lee
    Abstract:

    Gradient Descent is a widely used paradigm for solving many optimization problems. Gradient Descent aims to minimize a target function in order to reach a local minimum. In machine learning or data mining, this function corresponds to a decision model that is to be discovered. In this paper, we propose a preliminary formulation of Gradient Descent with data privacy preservation. We present two approaches, a stochastic approach and a least-squares approach, under different assumptions. Four protocols are proposed for the two approaches, incorporating various secure building blocks, for both horizontally and vertically partitioned data. We conduct experiments to evaluate the scalability of the proposed secure building blocks and the accuracy and efficiency of the protocols in four different scenarios. The experimental results show that the proposed secure building blocks are reasonably scalable and that the proposed protocols allow us to determine the more suitable secure protocol for each application scenario.
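
    As a rough illustration of the flavor of such protocols (not one of the paper's four), the sketch below runs gradient descent on horizontally partitioned least-squares data. The two parties combine their local gradients through a secure sum based on additive random masking, so each step reveals only the aggregate gradient, never either party's local contribution.

```python
import numpy as np

# Sketch: privacy-preserving gradient descent on a shared linear model
# over horizontally partitioned data. Party B only ever sees party A's
# gradient plus a one-time random mask; the model itself is public.
rng = np.random.default_rng(3)
d, lr = 4, 0.05
w_true = rng.normal(size=d)

def make_party(n_rows):
    X = rng.normal(size=(n_rows, d))
    return X, X @ w_true

Xa, ya = make_party(60)        # party A's private rows
Xb, yb = make_party(40)        # party B's private rows
n_total = len(ya) + len(yb)

w = np.zeros(d)                # shared model parameters
for step in range(200):
    g_a = Xa.T @ (Xa @ w - ya)         # A's local least-squares gradient
    g_b = Xb.T @ (Xb @ w - yb)         # B's local least-squares gradient
    mask = rng.normal(size=d) * 1e3    # A's one-time random mask
    msg = g_a + mask                   # all B ever observes from A
    total = (msg + g_b) - mask         # B adds g_b; A removes its mask
    w -= lr * total / n_total          # both parties apply the same step

print(f"parameter error: {np.linalg.norm(w - w_true):.4f}")
```

    The masking here is only a building block; a full protocol must also address what the sequence of aggregate gradients itself leaks, which is part of what the paper's four protocols handle with stronger cryptographic primitives.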

  • KDD - Privacy-preservation for Gradient Descent methods
    Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07, 2007
    Co-Authors: Li Wan, Shuguo Han, Vincent C. S. Lee
    Abstract:

    Gradient Descent is a widely used paradigm for solving many optimization problems. Stochastic Gradient Descent performs a series of iterations to minimize a target function in order to reach a local minimum. In machine learning or data mining, this function corresponds to a decision model that is to be discovered. The Gradient Descent paradigm underlies many commonly used techniques in data mining and machine learning, such as neural networks, Bayesian networks, genetic algorithms, and simulated annealing. To the best of our knowledge, there has not been any work that extends the notion of privacy preservation or secure multi-party computation to Gradient-Descent-based techniques. In this paper, we propose a preliminary approach to enable privacy preservation in Gradient Descent methods in general and demonstrate its feasibility in specific Gradient Descent methods.