Gradient Descent

14,000,000 Leading Edge Experts on the ideXlab platform

The Experts below are selected from a list of 58,719 Experts worldwide, ranked by the ideXlab platform

Quanquan Gu - One of the best experts on this subject based on the ideXlab platform.

  • Gradient Descent optimizes over-parameterized deep ReLU networks
    Machine Learning, 2019
    Co-Authors: Difan Zou, Dongruo Zhou, Yuan Cao, Quanquan Gu
    Abstract:

    We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using Gradient Descent. We show that with proper random weight initialization, Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of Gradient Descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for Gradient Descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient Descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of Gradient Descent for training deep neural networks.
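
    The training procedure the abstract describes can be sketched in a few lines. This is a toy illustration, not the paper's setting: a single hidden layer stands in for a deep network, the data and labels are synthetic, and the width and learning rate are arbitrary choices.

```python
import numpy as np

# Sketch: full-batch gradient descent with Gaussian random initialization
# on an over-parameterized one-hidden-layer ReLU network, trained with
# cross-entropy loss for binary classification on toy data.
rng = np.random.default_rng(0)
n, d, m = 40, 5, 512                 # samples, input dim, hidden width (m >> n)
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)      # toy separable labels

W = rng.normal(size=(d, m)) / np.sqrt(d)   # Gaussian initialization
v = rng.normal(size=m) / np.sqrt(m)
lr = 0.2

def loss_and_probs(W, v):
    h = np.maximum(X @ W, 0.0)               # ReLU activations
    p = 1.0 / (1.0 + np.exp(-(h @ v)))       # sigmoid output probability
    ce = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return ce, p, h

loss0, _, _ = loss_and_probs(W, v)
for step in range(500):
    _, p, h = loss_and_probs(W, v)
    g = (p - y) / n                          # d(mean cross-entropy)/d(logits)
    grad_v = h.T @ g
    grad_W = X.T @ (np.outer(g, v) * (h > 0))
    v -= lr * grad_v
    W -= lr * grad_W

loss1, p, _ = loss_and_probs(W, v)
print(f"training loss: {loss0:.3f} -> {loss1:.3f}")
```

    The iterates indeed stay close to the Gaussian initialization in this regime; the loss decreases to near zero even though the objective is non-convex.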

Difan Zou - One of the best experts on this subject based on the ideXlab platform.

  • Gradient Descent optimizes over-parameterized deep ReLU networks
    Machine Learning, 2019
    Co-Authors: Difan Zou, Dongruo Zhou, Yuan Cao, Quanquan Gu
    Abstract:

    We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using Gradient Descent. We show that with proper random weight initialization, Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of Gradient Descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for Gradient Descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient Descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of Gradient Descent for training deep neural networks.

  • Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
    arXiv: Learning, 2018
    Co-Authors: Difan Zou, Yuan Cao, Dongruo Zhou
    Abstract:

    We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using Gradient Descent and stochastic Gradient Descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both Gradient Descent and stochastic Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) Gradient Descent. Our theoretical results shed light on the optimization of deep learning and pave the way for studying the optimization dynamics of training modern deep neural networks.
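
    The stochastic variant differs from full-batch gradient descent only in that each update is computed on a random mini-batch. A minimal sketch under toy assumptions (one hidden layer rather than a deep network, synthetic separable data, arbitrary batch size and learning rate):

```python
import numpy as np

# Sketch: mini-batch stochastic gradient descent on an over-parameterized
# one-hidden-layer ReLU network with Gaussian random initialization.
rng = np.random.default_rng(1)
n, d, m, batch = 64, 5, 256, 8
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)   # labels from a random hyperplane

W = rng.normal(size=(d, m)) / np.sqrt(d)         # Gaussian initialization
v = rng.normal(size=m) / np.sqrt(m)
lr = 0.1

for step in range(2000):
    idx = rng.choice(n, size=batch, replace=False)   # stochastic mini-batch
    Xb, yb = X[idx], y[idx]
    h = np.maximum(Xb @ W, 0.0)                      # ReLU activations
    p = 1.0 / (1.0 + np.exp(-(h @ v)))               # sigmoid output
    g = (p - yb) / batch                             # d(mean loss)/d(logits)
    grad_v = h.T @ g
    grad_W = Xb.T @ (np.outer(g, v) * (h > 0))
    v -= lr * grad_v
    W -= lr * grad_W

h = np.maximum(X @ W, 0.0)
p = 1.0 / (1.0 + np.exp(-(h @ v)))
acc = float(np.mean((p > 0.5) == (y > 0.5)))
print(f"training accuracy: {acc:.2f}")
```

    Despite the noise in each mini-batch gradient, the iterates remain near the initialization and the network fits the training set, which is the behavior the analysis formalizes.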

Dongruo Zhou - One of the best experts on this subject based on the ideXlab platform.

  • Gradient Descent optimizes over-parameterized deep ReLU networks
    Machine Learning, 2019
    Co-Authors: Difan Zou, Dongruo Zhou, Yuan Cao, Quanquan Gu
    Abstract:

    We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using Gradient Descent. We show that with proper random weight initialization, Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of Gradient Descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for Gradient Descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient Descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of Gradient Descent for training deep neural networks.

  • Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
    arXiv: Learning, 2018
    Co-Authors: Difan Zou, Yuan Cao, Dongruo Zhou
    Abstract:

    We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using Gradient Descent and stochastic Gradient Descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both Gradient Descent and stochastic Gradient Descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) Gradient Descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) Gradient Descent. Our theoretical results shed light on the optimization of deep learning and pave the way for studying the optimization dynamics of training modern deep neural networks.

Alexander J Smola - One of the best experts on this subject based on the ideXlab platform.

  • parallelized stochastic Gradient Descent
    Neural Information Processing Systems, 2010
    Co-Authors: Martin Zinkevich, Markus Weimer, Alexander J Smola
    Abstract:

    With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic Gradient Descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7], our variant comes with parallel acceleration guarantees and poses no overly tight latency constraints, which might only be satisfiable in the multicore setting. Our analysis introduces a novel proof technique, contractive mappings, to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect, this answers the question of how quickly stochastic Gradient Descent algorithms reach the asymptotically normal regime [1, 8].
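
    The algorithm analyzed in this paper (often referred to as SimuParallelSGD) is simple to sketch: each machine runs plain SGD on its own shard of the data with no communication, and the resulting parameter vectors are averaged once at the end. The sketch below simulates the machines sequentially and substitutes logistic regression for the generic objective; all hyperparameters are illustrative.

```python
import numpy as np

# Sketch of parameter-averaging parallel SGD: k simulated "machines" each
# run one pass of per-example SGD on a disjoint data shard, then the
# final weight vectors are averaged. In a real deployment the shard loop
# below would run concurrently, since the workers never communicate.
rng = np.random.default_rng(2)
n, d, k = 400, 10, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)        # separable toy labels

shards = np.array_split(rng.permutation(n), k)   # disjoint data shards
lr = 0.5

def sgd_on_shard(idx):
    w = np.zeros(d)                              # common initialization
    for i in rng.permutation(idx):               # one pass, random order
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]              # per-example SGD step
    return w

w_avg = np.mean([sgd_on_shard(idx) for idx in shards], axis=0)  # final average
acc = float(np.mean(((X @ w_avg) > 0) == (y > 0.5)))
print(f"accuracy of averaged model: {acc:.2f}")
```

    A single averaging step at the end is what distinguishes this scheme from schemes that synchronize every iteration, and is why it imposes no per-step latency constraints.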

Vincent C. S. Lee - One of the best experts on this subject based on the ideXlab platform.

  • Privacy-Preserving Gradient-Descent Methods
    IEEE Transactions on Knowledge and Data Engineering, 2010
    Co-Authors: Shuguo Han, Li Wan, Vincent C. S. Lee
    Abstract:

    Gradient Descent is a widely used paradigm for solving many optimization problems. Gradient Descent aims to minimize a target function in order to reach a local minimum. In machine learning or data mining, this function corresponds to a decision model that is to be discovered. In this paper, we propose a preliminary formulation of Gradient Descent with data privacy preservation. We present two approaches, a stochastic approach and a least-squares approach, under different assumptions. Four protocols are proposed for the two approaches, incorporating various secure building blocks, for both horizontally and vertically partitioned data. We conduct experiments to evaluate the scalability of the proposed secure building blocks and the accuracy and efficiency of the protocols in four different scenarios. The experimental results show that the proposed secure building blocks are reasonably scalable and that the proposed protocols allow us to determine the more suitable secure protocol for each application scenario.
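
    As a rough illustration of the flavor of such protocols (not one of the paper's four), the sketch below runs gradient descent on horizontally partitioned least-squares data. The two parties combine their local gradients through a secure sum based on additive random masking, so each step reveals only the aggregate gradient, never either party's local contribution.

```python
import numpy as np

# Sketch: privacy-preserving gradient descent on a shared linear model
# over horizontally partitioned data. Party B only ever sees party A's
# gradient plus a one-time random mask; the model itself is public.
rng = np.random.default_rng(3)
d, lr = 4, 0.05
w_true = rng.normal(size=d)

def make_party(n_rows):
    X = rng.normal(size=(n_rows, d))
    return X, X @ w_true

Xa, ya = make_party(60)        # party A's private rows
Xb, yb = make_party(40)        # party B's private rows
n_total = len(ya) + len(yb)

w = np.zeros(d)                # shared model parameters
for step in range(200):
    g_a = Xa.T @ (Xa @ w - ya)         # A's local least-squares gradient
    g_b = Xb.T @ (Xb @ w - yb)         # B's local least-squares gradient
    mask = rng.normal(size=d) * 1e3    # A's one-time random mask
    msg = g_a + mask                   # all B ever observes from A
    total = (msg + g_b) - mask         # B adds g_b; A removes its mask
    w -= lr * total / n_total          # both parties apply the same step

print(f"parameter error: {np.linalg.norm(w - w_true):.4f}")
```

    The masking here is only a building block; a full protocol must also address what the sequence of aggregate gradients itself leaks, which is part of what the paper's four protocols handle with stronger cryptographic primitives.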

  • KDD - Privacy-preservation for Gradient Descent methods
    Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07, 2007
    Co-Authors: Li Wan, Shuguo Han, Vincent C. S. Lee
    Abstract:

    Gradient Descent is a widely used paradigm for solving many optimization problems. Stochastic Gradient Descent performs a series of iterations to minimize a target function in order to reach a local minimum. In machine learning or data mining, this function corresponds to a decision model that is to be discovered. The Gradient Descent paradigm underlies many commonly used techniques in data mining and machine learning, such as neural networks, Bayesian networks, genetic algorithms, and simulated annealing. To the best of our knowledge, there has not been any work that extends the notion of privacy preservation or secure multi-party computation to Gradient-Descent-based techniques. In this paper, we propose a preliminary approach to enable privacy preservation in Gradient Descent methods in general and demonstrate its feasibility in specific Gradient Descent methods.