Linear Function

The Experts below are selected from a list of 663,708 Experts worldwide, ranked by the ideXlab platform

Dongruo Zhou - One of the best experts on this subject based on the ideXlab platform.

  • uniform pac bounds for reinforcement learning with Linear Function approximation
    Neural Information Processing Systems, 2021
    Co-Authors: Dongruo Zhou
    Abstract:

    We study reinforcement learning (RL) with Linear Function approximation. Existing algorithms for this problem have only high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee convergence to the optimal policy. In this paper, in order to overcome this limitation, we propose a new algorithm, called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, as it directly implies both PAC and high-probability regret bounds, making our algorithm superior to all existing algorithms with Linear Function approximation. At the core of our algorithm are a novel minimax value Function estimator and a multi-level partition scheme that selects training samples from historical observations. Both of these techniques are new and of independent interest.
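
    For reference, the uniform-PAC criterion referred to here (in the sense of Dann et al., 2017) can be stated schematically as follows; the exact polynomial dependence achieved by FLUTE is given in the paper, so the right-hand side below is only a placeholder:

        $$\Pr\Big[\forall \epsilon > 0:\ \big|\{k \ge 1 : V^*(s_1^k) - V^{\pi_k}(s_1^k) > \epsilon\}\big| \le \mathrm{poly}\big(d, H, \log(1/\delta)\big)\cdot \epsilon^{-2}\Big] \ge 1 - \delta.$$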

  • reward free model based reinforcement learning with Linear Function approximation
    Neural Information Processing Systems, 2021
    Co-Authors: Weitong Zhang, Dongruo Zhou
    Abstract:

    We study model-based reward-free reinforcement learning with Linear Function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward Function and uses the samples collected in the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE, under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a Linear Function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward Function, UCRL-RFE needs to sample at most $\tilde O(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode and $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using a Bernstein-type bonus and show that it needs to sample at most $\tilde O(H^4d(H + d)\epsilon^{-2})$ episodes to achieve an $\epsilon$-optimal policy. By constructing a special class of Linear Mixture MDPs, we also prove that any reward-free algorithm needs to sample at least $\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$, and in terms of the dependence on $d$ when $H \ge d$.
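
    Schematically, the Linear Mixture MDP assumption described above takes the following form (norm bounds on the parameter and feature vectors vary between papers and are omitted here):

        $$\mathbb{P}_h(s' \mid s, a) = \big\langle \boldsymbol{\theta}_h, \boldsymbol{\phi}(s' \mid s, a) \big\rangle, \qquad \boldsymbol{\theta}_h \in \mathbb{R}^d,$$

    where $\boldsymbol{\phi}$ is a known feature mapping defined on the triplet of state, action, and next state, and $\boldsymbol{\theta}_h$ is the unknown parameter to be learned.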

  • variance aware off policy evaluation with Linear Function approximation
    Neural Information Processing Systems, 2021
    Co-Authors: Yifei Min, Tianhao Wang, Dongruo Zhou
    Abstract:

    We study the off-policy evaluation (OPE) problem in reinforcement learning with Linear Function approximation, which aims to estimate the value Function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value Function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic Linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value Function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
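
    A schematic of how the variance information typically enters such an estimator, as a variance-reweighted regularized least-squares regression inside Fitted Q-Iteration (the exact weights, truncation, and regularization used by VA-OPE are specified in the paper):

        $$\hat{\mathbf{w}}_h = \arg\min_{\mathbf{w}} \sum_{k} \frac{\big(\big\langle \boldsymbol{\phi}(s_h^k, a_h^k), \mathbf{w} \big\rangle - r_h^k - \hat{V}_{h+1}(s_{h+1}^k)\big)^2}{\hat{\sigma}_h^2(s_h^k, a_h^k)} + \lambda \|\mathbf{w}\|_2^2,$$

    where $\hat{\sigma}_h^2$ is the estimated conditional variance of the value Function at step $h$.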

  • provably efficient reinforcement learning with Linear Function approximation under adaptivity constraints
    Neural Information Processing Systems, 2021
    Co-Authors: Tianhao Wang, Dongruo Zhou
    Abstract:

    We study reinforcement learning (RL) with Linear Function approximation under adaptivity constraints. We consider two popular limited-adaptivity models, the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for Linear Markov decision processes. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions, and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/(dH)}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity.
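
    To see why $\sqrt{T/(dH)}$ batches suffice in the batch learning model, note that this choice of $B$ balances the two terms of the LSVI-UCB-Batch bound:

        $$\frac{dHT}{B} = dHT \cdot \sqrt{\frac{dH}{T}} = \sqrt{d^3 H^3 T},$$

    so the batching cost is of the same order as the leading $\tilde O(\sqrt{d^3H^3T})$ term.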

  • logarithmic regret for reinforcement learning with Linear Function approximation
    International Conference on Machine Learning, 2021
    Co-Authors: Dongruo Zhou
    Abstract:

    Reinforcement learning (RL) with Linear Function approximation has received increasing attention recently. However, existing work has focused on obtaining $\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed Linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value Function. More specifically, under the Linear MDP assumption (Jin et al., 2019), the LSVI-UCB algorithm can achieve $\tilde{O}(d^{3}H^5/\text{gap}_{\text{min}}\cdot \log(T))$ regret; and under the Linear mixture MDP assumption (Ayoub et al., 2020), the UCRL-VTR algorithm can achieve $\tilde{O}(d^{2}H^5/\text{gap}_{\text{min}}\cdot \log^3(T))$ regret, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, $\text{gap}_{\text{min}}$ is the minimal sub-optimality gap, and $\tilde O$ hides all logarithmic terms except $\log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with Linear Function approximation. We also establish gap-dependent lower bounds for the two Linear MDP models.
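
    Here the sub-optimality gap is defined in the standard way (the precise stage-dependent indexing may differ slightly between the two settings):

        $$\mathrm{gap}_h(s, a) = V_h^*(s) - Q_h^*(s, a), \qquad \mathrm{gap}_{\min} = \min\big\{\mathrm{gap}_h(s, a) : h \in [H],\ \mathrm{gap}_h(s, a) > 0\big\}.$$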

Daniela Tuninetti - One of the best experts on this subject based on the ideXlab platform.

  • a general coded caching scheme for scalar Linear Function retrieval
    arXiv: Information Theory, 2021
    Co-Authors: Daniela Tuninetti
    Abstract:

    Coded caching aims to minimize the network's peak-time communication load by leveraging the information pre-stored in the local caches at the users. The original single file retrieval setting by Maddah-Ali and Niesen has been recently extended to general Scalar Linear Function Retrieval (SLFR) by Wan et al., who proposed a Linear scheme that surprisingly achieves the same optimal load (under the constraint of uncoded cache placement) as in single file retrieval. This paper's goal is to characterize the conditions under which a general SLFR Linear scheme is optimal and gain practical insights into why the specific choices made by Wan et al. work. This paper shows that the optimal decoding coefficients are necessarily the product of two terms, one only involving the encoding coefficients and the other only the demands. In addition, the relationships among the encoding coefficients are shown to be captured by the cycles of certain graphs. Thus, a general Linear scheme for SLFR can be found by solving a spanning tree problem.

  • key superposition simultaneously achieves security and privacy in cache aided Linear Function retrieval
    arXiv: Information Theory, 2020
    Co-Authors: Qifa Yan, Daniela Tuninetti
    Abstract:

    This work investigates the problem of cache-aided content Secure and demand Private Linear Function Retrieval (SP-LFR), where three constraints are imposed on the coded caching system: a) each user is interested in retrieving an arbitrary Linear combination of the files in the server's library; b) the content of the library must be kept secure from a wiretapper who obtains the signal sent by the server; and c) no subset of users together can obtain any information about the demands of the remaining users. A procedure is proposed to derive an SP-LFR scheme from a given Placement Delivery Array (PDA), known to give coded caching schemes with low subpacketization for systems with neither security nor privacy constraints. This procedure uses the superposition of security keys and privacy keys in both the cache placement and the transmitted signal to guarantee content security and demand privacy, respectively. In particular, among all PDA-based SP-LFR schemes, the memory-load pairs achieved by the PDA describing the Maddah-Ali and Niesen scheme are Pareto-optimal and have the lowest subpacketization. No such strong performance guarantees on PDAs were known in the literature. Moreover, the achieved load-memory tradeoff is optimal to within a constant multiplicative gap, except for the small-memory regime when the number of files is smaller than the number of users. Remarkably, the memory-load tradeoff does not increase compared to the best known schemes that guarantee only content security in all regimes or only demand privacy in some regimes.

  • cache aided scalar Linear Function retrieval
    International Symposium on Information Theory, 2020
    Co-Authors: Kai Wan, Hua Sun, Daniela Tuninetti, Giuseppe Caire
    Abstract:

    In the shared-link coded caching problem, formulated by Maddah-Ali and Niesen (MAN), each cache-aided user demands one file (i.e., single file retrieval). This paper generalizes the MAN problem so as to allow users to request scalar Linear Functions (aka, Linear combinations with scalar coefficients) of the files. We propose a novel coded delivery scheme, based on MAN uncoded cache placement, that allows for the decoding of arbitrary scalar Linear Functions of the files on arbitrary finite fields. Surprisingly, it is shown that the load for cache-aided scalar Linear Function retrieval depends on the number of Linearly independent Functions that are demanded, akin to the cache-aided single-file retrieval problem where the load depends on the number of distinct file requests. The proposed scheme is proved to be optimal under the constraint of uncoded cache placement, in terms of worst-case load, and within a factor 2 otherwise.
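
    The following toy instance (our own illustration, not the paper's general scheme) shows why the load need not grow when users demand Linear combinations: with K = 2 users, N = 2 files, cache size M = 1, and MAN uncoded placement, a single coded subfile still serves both users even when both demand the GF(2) combination A xor B, matching the classical MAN worst-case load of half a file for this configuration.

        # Toy example: K = 2 users, N = 2 files, M = 1, GF(2) arithmetic via XOR.
        # MAN placement with t = K*M/N = 1: each file is split into 2 subfiles;
        # user 1 caches subfile 1 of every file, user 2 caches subfile 2.
        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.integers(0, 2, size=(2, 8), dtype=np.uint8)  # file A = [A1, A2]
        B = rng.integers(0, 2, size=(2, 8), dtype=np.uint8)  # file B = [B1, B2]

        cache = {1: (A[0], B[0]),   # user 1 holds A1, B1
                 2: (A[1], B[1])}   # user 2 holds A2, B2

        # Both users demand the scalar Linear Function A xor B.
        demand = lambda i: A[i] ^ B[i]      # subfile i of the demanded Function

        # Delivery: a single coded transmission, exactly as for single-file demands.
        X = (A[1] ^ B[1]) ^ (A[0] ^ B[0])

        # User 1 knows A1 ^ B1 from its cache and recovers the missing half A2 ^ B2;
        # user 2 performs the symmetric computation.
        rec1 = X ^ (cache[1][0] ^ cache[1][1])
        rec2 = X ^ (cache[2][0] ^ cache[2][1])
        assert np.array_equal(rec1, demand(1)) and np.array_equal(rec2, demand(0))
        print("load = 1 subfile = 1/2 file, the same as for single-file retrieval")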

  • on optimal load memory tradeoff of cache aided scalar Linear Function retrieval
    arXiv: Information Theory, 2020
    Co-Authors: Kai Wan, Hua Sun, Daniela Tuninetti, Giuseppe Caire
    Abstract:

    Coded caching has the potential to greatly reduce network traffic by leveraging the cheap and abundant storage available in end-user devices so as to create multicast opportunities in the delivery phase. In the seminal work by Maddah-Ali and Niesen (MAN), the shared-link coded caching problem was formulated, where each user demands one file (i.e., single file retrieval). This paper generalizes the MAN problem so as to allow users to request scalar Linear Functions of the files. This paper proposes a novel coded delivery scheme that, based on MAN uncoded cache placement, is shown to allow for the decoding of arbitrary scalar Linear Functions of the files (on arbitrary finite fields). Interestingly, and quite surprisingly, it is shown that the load for cache-aided scalar Linear Function retrieval depends on the number of Linearly independent Functions that are demanded, akin to the cache-aided single-file retrieval problem where the load depends on the number of distinct file requests. The proposed scheme is optimal under the constraint of uncoded cache placement, in terms of worst-case load, and within a factor 2 otherwise. The key idea of this paper can be extended to all scenarios which the original MAN scheme has been extended to, including demand-private and/or device-to-device settings.

Giuseppe Caire - One of the best experts on this subject based on the ideXlab platform.

  • cache aided scalar Linear Function retrieval
    International Symposium on Information Theory, 2020
    Co-Authors: Kai Wan, Hua Sun, Daniela Tuninetti, Giuseppe Caire
    Abstract:

    In the shared-link coded caching problem, formulated by Maddah-Ali and Niesen (MAN), each cache-aided user demands one file (i.e., single file retrieval). This paper generalizes the MAN problem so as to allow users to request scalar Linear Functions (aka, Linear combinations with scalar coefficients) of the files. We propose a novel coded delivery scheme, based on MAN uncoded cache placement, that allows for the decoding of arbitrary scalar Linear Functions of the files on arbitrary finite fields. Surprisingly, it is shown that the load for cache-aided scalar Linear Function retrieval depends on the number of Linearly independent Functions that are demanded, akin to the cache-aided single-file retrieval problem where the load depends on the number of distinct file requests. The proposed scheme is proved to be optimal under the constraint of uncoded cache placement, in terms of worst-case load, and within a factor 2 otherwise.

  • on optimal load memory tradeoff of cache aided scalar Linear Function retrieval
    arXiv: Information Theory, 2020
    Co-Authors: Kai Wan, Hua Sun, Daniela Tuninetti, Giuseppe Caire
    Abstract:

    Coded caching has the potential to greatly reduce network traffic by leveraging the cheap and abundant storage available in end-user devices so as to create multicast opportunities in the delivery phase. In the seminal work by Maddah-Ali and Niesen (MAN), the shared-link coded caching problem was formulated, where each user demands one file (i.e., single file retrieval). This paper generalizes the MAN problem so as to allow users to request scalar Linear Functions of the files. This paper proposes a novel coded delivery scheme that, based on MAN uncoded cache placement, is shown to allow for the decoding of arbitrary scalar Linear Functions of the files (on arbitrary finite fields). Interestingly, and quite surprisingly, it is shown that the load for cache-aided scalar Linear Function retrieval depends on the number of Linearly independent Functions that are demanded, akin to the cache-aided single-file retrieval problem where the load depends on the number of distinct file requests. The proposed scheme is optimal under the constraint of uncoded cache placement, in terms of worst-case load, and within a factor 2 otherwise. The key idea of this paper can be extended to all scenarios which the original MAN scheme has been extended to, including demand-private and/or device-to-device settings.

Shalabh Bhatnagar - One of the best experts on this subject based on the ideXlab platform.

  • an online prediction algorithm for reinforcement learning with Linear Function approximation using cross entropy method
    Machine Learning, 2018
    Co-Authors: Ajin George Joseph, Shalabh Bhatnagar
    Abstract:

    In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value Function of a model-free Markov reward process using the Linear Function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the popular cross entropy optimization method, a model-based search method for finding the global optimum of a real-valued Function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy, and stability.
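
    A plain batch sketch of the cross entropy idea applied to fitting Linear value-Function weights (the paper's algorithms are online, multi-timescale stochastic approximation variants of this, so the loop below is only a conceptual outline; the objective, data, and hyperparameters shown are placeholders):

        import numpy as np

        def ce_minimize(objective, dim, iters=50, pop=200, elite_frac=0.1, seed=0):
            """Minimize objective(w) over w in R^dim with the cross entropy method."""
            rng = np.random.default_rng(seed)
            mu, sigma = np.zeros(dim), np.ones(dim)
            n_elite = max(1, int(elite_frac * pop))
            for _ in range(iters):
                candidates = rng.normal(mu, sigma, size=(pop, dim))
                scores = np.array([objective(w) for w in candidates])
                elite = candidates[np.argsort(scores)[:n_elite]]  # keep lowest scores
                mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
            return mu

        # Placeholder objective: empirical squared TD error on sampled transitions
        # (phi, r, phi_next) under a fixed discount gamma.
        gamma = 0.95
        rng = np.random.default_rng(1)
        phi = rng.normal(size=(100, 5))
        phi_next = np.roll(phi, -1, axis=0)
        r = rng.normal(size=100)
        td_mse = lambda w: np.mean((r + gamma * phi_next @ w - phi @ w) ** 2)
        w_hat = ce_minimize(td_mse, dim=5)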

  • an online prediction algorithm for reinforcement learning with Linear Function approximation using cross entropy method
    arXiv: Learning, 2018
    Co-Authors: Ajin George Joseph, Shalabh Bhatnagar
    Abstract:

    In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value Function of a model-free Markov reward process using the Linear Function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the popular cross entropy (CE) optimization method, a model-based search method for finding the global optimum of a real-valued Function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy, and stability.

  • approximate dynamic programming with min Linear Function approximation for markov decision processes
    Conference on Decision and Control, 2014
    Co-Authors: L Chandrashekar, Shalabh Bhatnagar
    Abstract:

    The Markov Decision Process (MDP) is a useful framework to study problems of optimal sequential decision making under uncertainty. Given any MDP, the aim here is to find the optimal action selection mechanism, i.e., the optimal policy. Typically, the optimal policy (u*) is obtained by substituting the optimal value-Function (J*) in the Bellman equation. Alternatively, u* is also obtained by learning the optimal state-action value Function Q*, known as the Q value-Function. However, it is difficult to compute the exact values of J* or Q* for MDPs with a large number of states. Approximate Dynamic Programming (ADP) methods address this difficulty by computing lower dimensional approximations of J*/Q*. Most ADP methods employ Linear Function approximation (LFA), i.e., the approximate solution lies in a subspace spanned by a family of pre-selected basis Functions. The approximation is obtained via a Linear least squares projection of higher dimensional quantities, and the L2 norm plays an important role in convergence and error analysis. In this paper, we discuss ADP methods for MDPs based on LFAs in the (min,+) algebra. Here the approximate solution is a (min,+) Linear combination of a set of basis Functions whose span constitutes a subsemimodule. The approximation is obtained via a projection operator onto the subsemimodule, which is different from the Linear least squares projection used in ADP methods based on conventional LFAs. MDPs are not (min,+) Linear systems; nevertheless, we show that the monotonicity property of the projection operator helps us establish the convergence of our ADP schemes. We also discuss future directions in ADP methods for MDPs based on (min,+) LFAs.

  • approximate dynamic programming with min Linear Function approximation for markov decision processes
    arXiv: Systems and Control, 2014
    Co-Authors: Chandrashekar Lakshminarayanan, Shalabh Bhatnagar
    Abstract:

    The Markov Decision Process (MDP) is a useful framework to cast optimal sequential decision making problems. Given any MDP, the aim is to find the optimal action selection mechanism, i.e., the optimal policy. Typically, the optimal policy ($u^*$) is obtained by substituting the optimal value-Function ($J^*$) in the Bellman equation. Alternatively, $u^*$ is also obtained by learning the optimal state-action value Function $Q^*$, known as the $Q$ value-Function. However, it is difficult to compute the exact values of $J^*$ or $Q^*$ for MDPs with a large number of states. Approximate Dynamic Programming (ADP) methods address this difficulty by computing lower dimensional approximations of $J^*$/$Q^*$. Most ADP methods employ Linear Function approximation (LFA), i.e., the approximate solution lies in a subspace spanned by a family of pre-selected basis Functions. The approximation is obtained via a Linear least squares projection of higher dimensional quantities, and the $L_2$ norm plays an important role in convergence and error analysis. In this paper, we discuss ADP methods for MDPs based on LFAs in the $(\min,+)$ algebra. Here the approximate solution is a $(\min,+)$ Linear combination of a set of basis Functions whose span constitutes a subsemimodule. The approximation is obtained via a projection operator onto the subsemimodule, which is different from the Linear least squares projection used in ADP methods based on conventional LFAs. MDPs are not $(\min,+)$ Linear systems; nevertheless, we show that the monotonicity property of the projection operator helps us establish the convergence of our ADP schemes. We also discuss future directions in ADP methods for MDPs based on the $(\min,+)$ LFAs.
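
    Schematically, the difference between conventional LFA and the $(\min,+)$ LFA discussed here is which semiring the Linear combination is taken in (our paraphrase of the setup; see the paper for the precise subsemimodule and projection operator):

        $$\text{conventional: } \tilde{J}(s) = \sum_{i=1}^{k} w_i\, \phi_i(s), \qquad (\min,+): \ \tilde{J}(s) = \min_{1 \le i \le k} \big( w_i + \phi_i(s) \big),$$

    i.e., in the $(\min,+)$ semiring, $\min$ plays the role of addition and $+$ plays the role of multiplication.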

  • fast gradient descent methods for temporal difference learning with Linear Function approximation
    International Conference on Machine Learning, 2009
    Co-Authors: Richard S Sutton, Csaba Szepesvari, Shalabh Bhatnagar, Hamid Reza Maei, Doina Precup, David Silver, Eric Wiewiora
    Abstract:

    Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both Linear Function approximation and off-policy training, and whose complexity scales only Linearly in the size of the Function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional Linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective Function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, Linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend Linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
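
    For reference, the per-sample GTD2 and TDC updates with Linear features have the following form (our transcription; the step sizes alpha, beta and discount gamma below are placeholders, and the step-size conditions are given in the paper):

        import numpy as np

        def gtd2_step(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
            delta = r + gamma * phi_next @ theta - phi @ theta    # TD error
            theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
            w = w + beta * (delta - phi @ w) * phi
            return theta, w

        def tdc_step(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
            delta = r + gamma * phi_next @ theta - phi @ theta    # TD error
            theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
            w = w + beta * (delta - phi @ w) * phi
            return theta, w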

Richard S Sutton - One of the best experts on this subject based on the ideXlab platform.

  • weighted importance sampling for off policy learning with Linear Function approximation
    Neural Information Processing Systems, 2014
    Co-Authors: Rupam A Mahmood, Hado Van Hasselt, Richard S Sutton
    Abstract:

    Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to Function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(λ). We show empirically that our new WIS-LSTD(λ) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu 2009).
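
    In the Monte-Carlo setting that the paper starts from, the ordinary and weighted importance-sampling estimators of a state's value are

        $$\hat{V}_{\mathrm{OIS}} = \frac{1}{n} \sum_{i=1}^{n} \rho_i G_i, \qquad \hat{V}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} \rho_i G_i}{\sum_{i=1}^{n} \rho_i},$$

    where $G_i$ is the return of trajectory $i$ and $\rho_i$ is its product of per-step target-to-behavior likelihood ratios; the paper's contribution is extending the normalization idea on the right to Linear Function approximation and LSTD($\lambda$).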

  • dyna style planning with Linear Function approximation and prioritized sweeping
    arXiv: Artificial Intelligence, 2012
    Co-Authors: Richard S Sutton, Csaba Szepesvari, Alborz Geramifard, Michael Bowling
    Abstract:

    We consider the problem of efficiently learning optimal control policies and value Functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to Linear Function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main result is to prove that Linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized sweeping can be soundly extended to the Linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with Linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
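
    A schematic of a single Linear Dyna planning update under a learned Linear model (F, b), where F predicts the expected next feature vector and b the expected reward (our sketch; how the planning feature vector x is sampled and how the prioritized-sweeping queue is maintained are left to the paper):

        import numpy as np

        def linear_dyna_plan_step(theta, F, b, x, alpha=0.05, gamma=0.99):
            x_next = F @ x                     # model-predicted expected next features
            reward = b @ x                     # model-predicted expected reward
            delta = reward + gamma * theta @ x_next - theta @ x
            return theta + alpha * delta * x   # TD(0)-style update on the imagined step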

  • fast gradient descent methods for temporal difference learning with Linear Function approximation
    International Conference on Machine Learning, 2009
    Co-Authors: Richard S Sutton, Csaba Szepesvari, Shalabh Bhatnagar, Hamid Reza Maei, Doina Precup, David Silver, Eric Wiewiora
    Abstract:

    Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both Linear Function approximation and off-policy training, and whose complexity scales only Linearly in the size of the Function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional Linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective Function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, Linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend Linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.

  • a convergent o(n) algorithm for off policy temporal difference learning with Linear Function approximation
    Neural Information Processing Systems, 2008
    Co-Authors: Richard S Sutton, Csaba Szepesvari, Hamid Reza Maei
    Abstract:

    We introduce the first temporal-difference learning algorithm that is stable with Linear Function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales Linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
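
    For reference, the per-sample GTD updates described here can be written as follows (our transcription; $\alpha_k$, $\beta_k$ are step sizes satisfying the usual stochastic approximation conditions):

        $$w_{k+1} = w_k + \beta_k \big( \delta_k \phi_k - w_k \big), \qquad \theta_{k+1} = \theta_k + \alpha_k \big( \phi_k - \gamma \phi'_k \big) \big( \phi_k^\top w_k \big),$$

    where $\delta_k = r_k + \gamma\, \theta_k^\top \phi'_k - \theta_k^\top \phi_k$ and $w_k$ tracks the expected TD(0) update vector $\mathbb{E}[\delta \phi]$.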

  • a convergent o(n) temporal difference algorithm for off policy learning with Linear Function approximation
    Neural Information Processing Systems, 2008
    Co-Authors: Richard S Sutton, Hamid Reza Maei, Csaba Szepesvari
    Abstract:

    We introduce the first temporal-difference learning algorithm that is stable with Linear Function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales Linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its $L_2$ norm. Our analysis proves that its expected update is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without its quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.