Function Approximation

The experts below are selected from a list of 241,893 experts worldwide, ranked by the ideXlab platform.

Richard S Sutton - One of the best experts on this subject based on the ideXlab platform.

  • Average-Reward Off-Policy Policy Evaluation with Function Approximation.
    arXiv: Learning, 2021
    Co-Authors: Shangtong Zhang, Richard S Sutton, Yi Wan, Shimon Whiteson
    Abstract:

    We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.
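
    As a rough illustration of the two quantities being estimated here, the sketch below runs plain on-policy differential TD(0) with linear features, jointly tracking a reward-rate estimate and the weights of a differential value function. It is background for the problem setting, not the paper's off-policy gradient algorithms; the step sizes, feature layout, and the two-state demo are assumptions.

    import numpy as np

    # Differential TD(0) with linear features: jointly estimate the reward rate
    # and a differential value function v(s) ~ w . x(s). This is the on-policy
    # baseline for the setting, not the off-policy gradient-TD methods proposed
    # in the paper; step sizes are illustrative.
    def differential_td0(features, rewards, next_features, alpha=0.01, eta=0.1):
        w = np.zeros(features.shape[1])   # differential value-function weights
        r_bar = 0.0                       # reward-rate estimate
        for x, r, x_next in zip(features, rewards, next_features):
            delta = r - r_bar + w @ x_next - w @ x   # differential TD error
            r_bar += eta * alpha * delta             # update the reward-rate estimate
            w += alpha * delta * x                   # semi-gradient update of the weights
        return r_bar, w

    # Two-state cycle with rewards 1 and 3: the reward-rate estimate approaches 2.
    X = np.eye(2)
    feats, rews, nexts, s = [], [], [], 0
    for _ in range(20000):
        feats.append(X[s]); rews.append(1.0 if s == 0 else 3.0); nexts.append(X[1 - s])
        s = 1 - s
    print(differential_td0(np.array(feats), np.array(rews), np.array(nexts)))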

  • Toward Off-Policy Learning Control with Function Approximation
    International Conference on Machine Learning, 2010
    Co-Authors: Hamid Reza Maei, Shalabh Bhatnagar, Richard S Sutton
    Abstract:

    We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
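
    The sketch below writes out a per-step update in the general shape of Greedy-GQ as the abstract describes it: linear action values, a secondary weight vector for the gradient correction, and a greedy target policy evaluated at the next state. The feature construction, step sizes, and function names are illustrative assumptions rather than the paper's specification.

    import numpy as np

    def phi(x, a, n_actions):
        # Stack the state features x into the block belonging to action a.
        out = np.zeros(len(x) * n_actions)
        out[a * len(x):(a + 1) * len(x)] = x
        return out

    def greedy_action(theta, x, n_actions):
        return int(np.argmax([theta @ phi(x, a, n_actions) for a in range(n_actions)]))

    def greedy_gq_step(theta, w, x, a, r, x_next, n_actions,
                       alpha=0.05, beta=0.01, gamma=0.99):
        phi_t = phi(x, a, n_actions)
        phi_next = phi(x_next, greedy_action(theta, x_next, n_actions), n_actions)  # greedy target
        delta = r + gamma * theta @ phi_next - theta @ phi_t
        # Main weights: TD step plus a gradient correction through the next features.
        theta = theta + alpha * (delta * phi_t - gamma * (w @ phi_t) * phi_next)
        # Secondary weights: track the TD error in the span of the features.
        w = w + beta * (delta - w @ phi_t) * phi_t
        return theta, w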

  • Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
    Neural Information Processing Systems, 2009
    Co-Authors: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid Reza Maei, Csaba Szepesvari
    Abstract:

    We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa, have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient descent on this function. These methods can be viewed as natural generalizations of previous TD methods, as they converge to the same limit points when used with linear function approximation. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms' effectiveness.
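
    To make the shape of such an update concrete, here is a rough TDC-style step for a generic smooth value function, with the gradient and the second-order correction taken by finite differences for readability. The tiny tanh approximator, step sizes, and the use of numerical differentiation are assumptions for illustration; this is not the paper's implementation.

    import numpy as np

    def value(theta, s):
        # A tiny smooth approximator: a single scaled tanh unit.
        w1, b1, w2 = theta
        return w2 * np.tanh(w1 * s + b1)

    def grad_value(theta, s, eps=1e-5):
        # Numerical gradient of value(theta, s) with respect to theta.
        g = np.zeros_like(theta)
        for i in range(len(theta)):
            d = np.zeros_like(theta)
            d[i] = eps
            g[i] = (value(theta + d, s) - value(theta - d, s)) / (2 * eps)
        return g

    def hessian_vector(theta, s, v, eps=1e-5):
        # Finite-difference Hessian-vector product of the value function.
        return (grad_value(theta + eps * v, s) - grad_value(theta - eps * v, s)) / (2 * eps)

    def nonlinear_tdc_step(theta, w, s, r, s_next, alpha=0.01, beta=0.05, gamma=0.99):
        g, g_next = grad_value(theta, s), grad_value(theta, s_next)
        delta = r + gamma * value(theta, s_next) - value(theta, s)
        # Second-order correction term that appears once the approximator is nonlinear.
        h = (delta - g @ w) * hessian_vector(theta, s, w)
        theta = theta + alpha * (delta * g - gamma * (g @ w) * g_next - h)
        w = w + beta * (delta - g @ w) * g
        return theta, w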

  • Policy Gradient Methods for Reinforcement Learning with Function Approximation
    Neural Information Processing Systems, 1999
    Co-Authors: Richard S Sutton, David McAllester, Satinder Singh, Yishay Mansour
    Abstract:

    Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
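
    As a minimal illustration of the approach described above, the sketch below implements a REINFORCE-style update: the policy is its own parameterized function (a softmax over linear preferences) and is moved along the gradient of expected return, with Monte Carlo returns standing in for the action-value function. The feature layout, step size, and episode format are assumptions.

    import numpy as np

    def softmax_policy(theta, x):
        # theta: (n_actions, n_features); x: state features. Returns action probabilities.
        prefs = theta @ x
        prefs = prefs - prefs.max()          # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def grad_log_pi(theta, x, a):
        # Gradient of log pi(a | x) for the softmax-linear parameterization.
        p = softmax_policy(theta, x)
        grad = -np.outer(p, x)
        grad[a] += x
        return grad

    def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
        # episode: list of (state_features, action, reward) tuples for one rollout.
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step.
        for x, a, r in reversed(episode):
            G = r + gamma * G
            theta = theta + alpha * G * grad_log_pi(theta, x, a)
        return theta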

Yishay Mansour - One of the best experts on this subject based on the ideXlab platform.

  • Policy Gradient Methods for Reinforcement Learning with Function Approximation
    Neural Information Processing Systems, 1999
    Co-Authors: Richard S Sutton, David McAllester, Satinder Singh, Yishay Mansour
    Abstract:

    Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
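
    Complementing the abstract's mention of actor-critic methods, here is a one-step actor-critic sketch: a separately parameterized softmax policy (actor) updated along its gradient, with a learned linear value function (critic) supplying the TD error as an advantage estimate. Names, step sizes, and the transition format are illustrative assumptions.

    import numpy as np

    def softmax(prefs):
        prefs = prefs - prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    def actor_critic_step(theta, w, x, a, r, x_next,
                          alpha_actor=0.01, alpha_critic=0.05, gamma=0.99):
        # One transition (x, a, r, x_next); theta: (n_actions, n_features), w: (n_features,).
        td_error = r + gamma * (w @ x_next) - (w @ x)   # one-step advantage estimate
        w = w + alpha_critic * td_error * x             # critic: semi-gradient TD(0)
        p = softmax(theta @ x)
        grad_log_pi = -np.outer(p, x)
        grad_log_pi[a] += x
        theta = theta + alpha_actor * td_error * grad_log_pi  # actor: policy-gradient step
        return theta, w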

Ronald Parr - One of the best experts on this subject based on the ideXlab platform.

  • Value Function Approximation in Zero-Sum Markov Games
    arXiv: Artificial Intelligence, 2012
    Co-Authors: Michail G. Lagoudakis, Ronald Parr
    Abstract:

    This paper investigates value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case. We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games. We present a generalization of the optimal stopping problem to a two-player, simultaneous-move Markov game. For this special problem, we provide stronger bounds and can guarantee convergence for LSTD and temporal-difference learning with linear value function approximation. We demonstrate the viability of value function approximation for Markov games by using the least-squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem.
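
    For context on the setting, the key change from an MDP is that the greedy maximization in each backup becomes the minimax value of a matrix game built from the current Q-values at a state. The sketch below solves that per-state matrix game with a generic linear program; it is background for the framework, not the paper's LSPI implementation, and the helper name and solver choice are assumptions.

    import numpy as np
    from scipy.optimize import linprog

    def matrix_game_value(Q):
        # Minimax value and maximizer's mixed strategy for payoff matrix Q,
        # where Q[i, j] is the maximizer's payoff for row i against column j.
        m, n = Q.shape
        c = np.zeros(m + 1)
        c[-1] = -1.0                                       # maximize v == minimize -v
        A_ub = np.hstack([-Q.T, np.ones((n, 1))])          # v - sum_i x_i Q[i, j] <= 0 for all j
        b_ub = np.zeros(n)
        A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)   # the strategy sums to 1
        b_eq = np.array([1.0])
        bounds = [(0, None)] * m + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        return res.x[-1], res.x[:m]

    # Rock-paper-scissors: the value is 0 and the optimal strategy is uniform.
    rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
    print(matrix_game_value(rps))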

  • Analyzing Feature Generation for Value-Function Approximation
    International Conference on Machine Learning, 2007
    Co-Authors: Ronald Parr, Christopher Painter-Wakefield, Michael L. Littman
    Abstract:

    We analyze a simple, Bellman-error-based approach to generating basis functions for value-function approximation. We show that it generates orthogonal basis functions that provably tighten approximation error bounds. We also illustrate the use of this approach in the presence of noise on some sample problems.
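
    A bare-bones version of the feature-generation loop described above is sketched here: fit the value function on the current basis, compute the Bellman residual of that fit, and append the residual as the next basis function. The known transition model, the LSTD-style fixed-point fit, and the toy chain are simplifying assumptions.

    import numpy as np

    def bellman_error_basis(P, R, gamma, n_basis):
        # P: (S, S) transition matrix under a fixed policy; R: (S,) expected rewards.
        S = len(R)
        Phi = np.ones((S, 1))                              # start from a constant feature
        for _ in range(n_basis - 1):
            # Linear fixed-point (LSTD-style) fit of V on the current basis.
            A = Phi.T @ (Phi - gamma * P @ Phi)
            b = Phi.T @ R
            v = Phi @ np.linalg.solve(A, b)
            residual = R + gamma * P @ v - v               # Bellman error of the current fit
            residual = residual / (np.linalg.norm(residual) + 1e-12)
            Phi = np.hstack([Phi, residual.reshape(-1, 1)])
        return Phi

    # Five-state chain, move right with probability 0.9, reward only in the last state.
    S, gamma = 5, 0.9
    P = np.zeros((S, S))
    for s in range(S):
        P[s, min(s + 1, S - 1)] += 0.9
        P[s, s] += 0.1
    R = np.zeros(S); R[-1] = 1.0
    print(bellman_error_basis(P, R, gamma, n_basis=4).shape)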

  • UAI - Value Function Approximation in Zero-Sum Markov Games
    2002
    Co-Authors: Michail G. Lagoudakis, Ronald Parr
    Abstract:

    This paper investigates value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case. We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games. We present a generalization of the optimal stopping problem to a two-player, simultaneous-move Markov game. For this special problem, we provide stronger bounds and can guarantee convergence for LSTD and temporal-difference learning with linear value function approximation. We demonstrate the viability of value function approximation for Markov games by using the least-squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem.
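
    As a reminder of the linear evaluation method the abstract refers to, here is a sample-based LSTD(0) sketch: accumulate the matrix A and vector b over observed transitions and solve for the weights in one shot. The transition format and the small ridge term are assumptions for illustration.

    import numpy as np

    def lstd(transitions, n_features, gamma=0.99, reg=1e-6):
        # transitions: iterable of (phi_s, reward, phi_s_next) feature-vector tuples.
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        for phi, r, phi_next in transitions:
            A += np.outer(phi, phi - gamma * phi_next)
            b += r * phi
        # A small ridge term keeps A invertible when features are collinear or data is scarce.
        return np.linalg.solve(A + reg * np.eye(n_features), b)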

Harm Van Seijen - One of the best experts on this subject based on the ideXlab platform.

  • Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation
    arXiv: Artificial Intelligence, 2016
    Co-Authors: Harm Van Seijen
    Abstract:

    Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step methods often yield substantially better performance than their single-step counterparts, due to a lower bias of the update targets. For non-linear function approximation, however, single-step methods appear to be the norm. Part of the reason could be that on many domains the popular multi-step methods TD($\lambda$) and Sarsa($\lambda$) do not perform well when combined with non-linear function approximation. In particular, they are very susceptible to divergence of value estimates. In this paper, we identify the reason behind this. Furthermore, based on our analysis, we propose a new multi-step TD method for non-linear function approximation that addresses this issue. We confirm the effectiveness of our method using two benchmark tasks with neural networks as function approximators.
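
    For reference, the multi-step targets discussed above have the form sketched below: an n-step TD target sums discounted rewards and then bootstraps from the learned value estimate n steps ahead. The trajectory layout, the callable value function, and the parameters are illustrative assumptions, and this generic target is not the new method proposed in the paper.

    import numpy as np

    def n_step_targets(rewards, states, value_fn, n, gamma):
        # rewards[t] follows states[t]; states holds one extra entry for bootstrapping.
        # G_t = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n}).
        T = len(rewards)
        targets = np.zeros(T)
        for t in range(T):
            G, discount = 0.0, 1.0
            for k in range(t, min(t + n, T)):
                G += discount * rewards[k]
                discount *= gamma
            if t + n < len(states):               # bootstrap only if the trajectory continues
                G += discount * value_fn(states[t + n])
            targets[t] = G
        return targets

    # Toy usage with a linear stand-in for the value estimate; in practice this
    # would be the prediction of a neural network.
    print(n_step_targets([0.0, 0.0, 1.0, 0.0], [0, 1, 2, 3, 4],
                         value_fn=lambda s: 0.5 * s, n=2, gamma=0.9))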

Satinder Singh - One of the best experts on this subject based on the ideXlab platform.

  • Policy Gradient Methods for Reinforcement Learning with Function Approximation
    Neural Information Processing Systems, 1999
    Co-Authors: Richard S Sutton, David McAllester, Satinder Singh, Yishay Mansour
    Abstract:

    Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
