Policy Iteration

The Experts below are selected from a list of 9960 Experts worldwide ranked by ideXlab platform

Derong Liu - One of the best experts on this subject based on the ideXlab platform.

  • continuous time time varying Policy Iteration
    IEEE Transactions on Systems Man and Cybernetics, 2020
    Co-Authors: Qinglai Wei, Zehua Liao, Zhanyu Yang, Derong Liu
    Abstract:

    A novel Policy Iteration algorithm, called the continuous-time time-varying (CTTV) Policy Iteration algorithm, is presented in this paper to obtain optimal control laws for infinite-horizon CTTV nonlinear systems. The adaptive dynamic programming (ADP) technique is used to obtain the iterative control laws that optimize the performance index function. Monotonicity, convergence, and optimality of the iterative value function are analyzed, and the iterative value function is proven to converge monotonically to the optimal solution of the Hamilton–Jacobi–Bellman (HJB) equation. Furthermore, each iterative control law is guaranteed to be admissible and to stabilize the nonlinear system. In the implementation of the presented CTTV Policy Iteration algorithm, the approximate iterative control laws and value functions are obtained by neural networks. Finally, numerical results are given to verify the effectiveness of the presented method.
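
The continuous-time algorithm above is beyond a short snippet, but the evaluate/improve cycle it builds on is easy to show. Below is a minimal, purely illustrative tabular policy iteration on a random discounted-cost MDP (the model and all names are made up for the sketch); the paper extends this cycle to continuous-time time-varying systems with neural-network approximation.

```python
# Minimal tabular policy iteration on a random discounted-cost MDP.
# Illustrative only: the CTTV algorithm in the paper works on continuous-time
# nonlinear systems with neural-network approximators, not on a finite MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
g = rng.random((nS, nA))                        # one-stage cost

def evaluate(policy):
    """Solve (I - gamma * P_pi) J = g_pi exactly for the current policy."""
    P_pi = P[np.arange(nS), policy]             # nS x nS
    g_pi = g[np.arange(nS), policy]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, g_pi)

def improve(J):
    """Greedy (cost-minimizing) one-step lookahead with respect to J."""
    Q = g + gamma * P @ J                       # nS x nA
    return Q.argmin(axis=1)

policy = np.zeros(nS, dtype=int)
for _ in range(20):
    J = evaluate(policy)
    new_policy = improve(J)
    if np.array_equal(new_policy, policy):      # stable policy: optimal for this MDP
        break
    policy = new_policy
print("optimal policy:", policy, "\ncosts:", evaluate(policy).round(3))
```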

  • data based optimal control for weakly coupled nonlinear systems using Policy Iteration
    IEEE Transactions on Systems Man and Cybernetics, 2018
    Co-Authors: Derong Liu, Ding Wang
    Abstract:

    In this paper, a data-based online learning algorithm is established to solve the optimal control problem for weakly coupled continuous-time nonlinear systems with completely unknown dynamics. Using the weak coupling theory, we reformulate the original problem into three reduced-order optimal control problems. We establish an online model-free integral Policy Iteration algorithm to solve the decoupled optimal control problems without system dynamics. To implement the data-based online learning algorithm, the actor-critic technique based on neural networks and the least squares method are used. Two simulation examples are given to verify the effectiveness of the developed algorithm.

  • discrete time optimal control via local Policy Iteration adaptive dynamic programming
    IEEE Transactions on Systems Man and Cybernetics, 2017
    Co-Authors: Qinglai Wei, Derong Liu, Qiao Lin, Ruizhuo Song
    Abstract:

    In this paper, a discrete-time optimal control scheme is developed via a novel local Policy Iteration adaptive dynamic programming algorithm. In the discrete-time local Policy Iteration algorithm, the iterative value function and iterative control law can be updated in a subset of the state space, which reduces the computational burden compared with the traditional Policy Iteration algorithm. Convergence properties of the local Policy Iteration algorithm are presented to show that the iterative value function is monotonically nonincreasing and converges to the optimum under some mild conditions. The admissibility of the iterative control law is proven, which shows that the control system can be stabilized under any of the iterative control laws, even if the iterative control law is updated only in a subset of the state space. Finally, two simulation examples are given to illustrate the performance of the developed method.
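
A hedged, discrete caricature of the "local" idea in the abstract above: the control law is improved only on a subset of states at each iteration. The random MDP and the subset-selection rule are illustrative assumptions, not the paper's ADP implementation for nonlinear systems.

```python
# Sketch of the "local" update on a finite MDP: the control law is improved
# only on a chosen subset of states at each iteration.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 8, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
g = rng.random((nS, nA))

def evaluate(policy):
    P_pi = P[np.arange(nS), policy]
    g_pi = g[np.arange(nS), policy]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, g_pi)

policy = np.zeros(nS, dtype=int)
for k in range(40):
    J = evaluate(policy)
    subset = rng.choice(nS, size=3, replace=False)   # local region updated at step k
    Q = g + gamma * P @ J
    policy[subset] = Q[subset].argmin(axis=1)        # improve only inside the subset
print("policy after local updates:", policy)
```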

  • a novel optimal tracking control scheme for a class of discrete time nonlinear systems using generalised Policy Iteration adaptive dynamic programming algorithm
    International Journal of Systems Science, 2017
    Co-Authors: Qiao Lin, Qinglai Wei, Derong Liu
    Abstract:

    In this paper, a novel iterative adaptive dynamic programming (ADP) algorithm, called the generalised Policy Iteration ADP algorithm, is developed to solve optimal tracking control problems for discrete-time nonlinear systems. The idea is to use two Iteration procedures, an i-Iteration and a j-Iteration, to obtain the iterative tracking control laws and the iterative value functions. By a system transformation, we first convert the optimal tracking control problem into an optimal regulation problem. Then the generalised Policy Iteration ADP algorithm, a general scheme that combines Policy Iteration and value Iteration, is introduced to deal with the optimal regulation problem. The convergence and optimality properties of the generalised Policy Iteration algorithm are analysed. Three neural networks are used to implement the developed algorithm. Finally, simulation examples are given to illustrate the performance of the presented algorithm.

  • a novel Policy Iteration based deterministic q learning for discrete time nonlinear systems
    Science in China Series F: Information Sciences, 2015
    Co-Authors: Qinglai Wei, Derong Liu
    Abstract:

    In this paper, a novel iterative Q-learning algorithm, called the “Policy Iteration-based deterministic Q-learning algorithm,” is developed to solve optimal control problems for discrete-time deterministic nonlinear systems. The idea is to use an iterative adaptive dynamic programming (ADP) technique to construct the iterative control law that optimizes the iterative Q function. When the optimal Q function is obtained, the optimal control law can be achieved by directly minimizing the optimal Q function, so that a mathematical model of the system is not required. The convergence property is analyzed to show that the iterative Q function is monotonically nonincreasing and converges to the solution of the optimality equation. It is also proven that any of the iterative control laws is a stable control law. Neural networks are used to implement the Policy Iteration-based deterministic Q-learning algorithm by approximating the iterative Q function and the iterative control law, respectively. Finally, two simulation examples are presented to illustrate the performance of the developed algorithm.
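
A minimal tabular sketch of the Q-function-based cycle described above: evaluate Q for the current control law, then take the control that minimizes Q. The deterministic transition table and cost values are made-up placeholders; the paper uses neural networks and measured data in place of a model.

```python
# Tabular sketch of a Q-function-based policy iteration for a deterministic
# system x' = f(x, u) with cost U(x, u).  The transition table below stands in
# for observed data; the paper's implementation is model-free and neural.
import numpy as np

rng = np.random.default_rng(2)
nX, nU, gamma = 6, 3, 0.9
f = rng.integers(0, nX, size=(nX, nU))      # deterministic successor states
U = rng.random((nX, nU))                    # one-stage utility (cost)

def evaluate_Q(policy, sweeps=200):
    """Fixed-point iteration for Q^pi(x,u) = U(x,u) + gamma * Q^pi(x', pi(x'))."""
    Q = np.zeros((nX, nU))
    for _ in range(sweeps):
        Q = U + gamma * Q[f, policy[f]]
    return Q

policy = np.zeros(nX, dtype=int)
for _ in range(20):
    Q = evaluate_Q(policy)
    new_policy = Q.argmin(axis=1)           # control law directly minimizes Q
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("greedy control law:", policy)
```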

Qinglai Wei - One of the best experts on this subject based on the ideXlab platform.

  • Policy Iteration algorithm for constrained cost optimal control of discrete time nonlinear system
    International Joint Conference on Neural Networks, 2021
    Co-Authors: Qinglai Wei, Ruizhuo Song
    Abstract:

    In this paper, optimal control problems with constraints on the summation of an auxiliary utility function are called constrained cost optimal control problems, and a constrained cost Policy Iteration adaptive dynamic programming (ADP) algorithm is developed to solve such problems for discrete-time nonlinear systems. A convergence analysis guarantees that the iterative value functions converge nonincreasingly to the approximate optimal value function. It is also proven that each iterative control Policy is feasible and can stabilize the nonlinear system. Finally, a simulation example is given to illustrate the performance of the developed constrained cost Policy Iteration algorithm.

  • continuous time distributed Policy Iteration for multicontroller nonlinear systems
    IEEE Transactions on Systems Man and Cybernetics, 2021
    Co-Authors: Qinglai Wei, Xiong Yang
    Abstract:

    In this article, a novel distributed Policy Iteration algorithm is established for infinite horizon optimal control problems of continuous-time nonlinear systems. In each Iteration of the developed distributed Policy Iteration algorithm, only one controller’s control law is updated while the other controllers’ control laws remain unchanged. The main contribution of the presented algorithm is to improve the iterative control laws one by one, instead of updating all the control laws in each Iteration as in traditional Policy Iteration algorithms, which effectively reduces the computational burden of each Iteration. The properties of the distributed Policy Iteration algorithm for continuous-time nonlinear systems are analyzed, including the admissibility of the iterative control laws. Monotonicity, convergence, and optimality are discussed, showing that the iterative value function converges nonincreasingly to the solution of the Hamilton–Jacobi–Bellman equation. Finally, numerical simulations are conducted to illustrate the effectiveness of the proposed method.
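
A rough finite-MDP analogue of the one-controller-at-a-time update: the joint action is (a1, a2), and each iteration improves only one controller's law while the other is held fixed. Everything here (the random model, the alternating schedule) is an illustrative assumption; the paper treats continuous-time nonlinear systems via ADP.

```python
# Discrete analogue of the distributed update: improve one controller's law per
# iteration, holding the other fixed.
import numpy as np

rng = np.random.default_rng(3)
nS, nA1, nA2, gamma = 6, 2, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA1, nA2))   # P[s, a1, a2, s']
g = rng.random((nS, nA1, nA2))

def evaluate(pi1, pi2):
    idx = np.arange(nS)
    P_pi = P[idx, pi1, pi2]
    g_pi = g[idx, pi1, pi2]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, g_pi)

pi1 = np.zeros(nS, dtype=int)
pi2 = np.zeros(nS, dtype=int)
for k in range(30):
    J = evaluate(pi1, pi2)
    Q = g + gamma * P @ J                              # Q[s, a1, a2]
    if k % 2 == 0:                                     # update controller 1 only
        pi1 = Q[np.arange(nS), :, pi2].argmin(axis=1)
    else:                                              # update controller 2 only
        pi2 = Q[np.arange(nS), pi1, :].argmin(axis=1)
print("controller 1:", pi1, "\ncontroller 2:", pi2)
```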

  • a partial Policy Iteration adp algorithm for nonlinear neuro optimal control with discounted total reward
    Neurocomputing, 2021
    Co-Authors: Mingming Liang, Qinglai Wei
    Abstract:

    This paper constructs a partial Policy Iteration adaptive dynamic programming (ADP) algorithm to solve the optimal control problem of nonlinear systems with a discounted total reward. Compared with the traditional Policy Iteration ADP algorithm, the approach updates the iterative control law only in a local region of the global system state space. With the benefit of this feature, the overall computational burden of each Iteration can be significantly reduced, which enables the algorithm to be executed on low-performance devices such as smartphones, smartwatches and Internet of Things (IoT) objects. We provide a convergence analysis to show that the generated sequence of value functions is monotonically nonincreasing and finally reaches a local optimum. In addition, the corresponding local Policy space is developed theoretically for the first time. Moreover, when the sequence of local system state spaces is chosen properly, we prove that the developed algorithm is capable of finding the global optimal performance index function for the nonlinear systems. Finally, we present a numerical simulation to demonstrate the effectiveness of the proposed algorithm.

  • continuous time time varying Policy Iteration
    IEEE Transactions on Systems Man and Cybernetics, 2020
    Co-Authors: Qinglai Wei, Zehua Liao, Zhanyu Yang, Derong Liu
    Abstract:

    A novel Policy Iteration algorithm, called the continuous-time time-varying (CTTV) Policy Iteration algorithm, is presented in this paper to obtain optimal control laws for infinite-horizon CTTV nonlinear systems. The adaptive dynamic programming (ADP) technique is used to obtain the iterative control laws that optimize the performance index function. Monotonicity, convergence, and optimality of the iterative value function are analyzed, and the iterative value function is proven to converge monotonically to the optimal solution of the Hamilton–Jacobi–Bellman (HJB) equation. Furthermore, each iterative control law is guaranteed to be admissible and to stabilize the nonlinear system. In the implementation of the presented CTTV Policy Iteration algorithm, the approximate iterative control laws and value functions are obtained by neural networks. Finally, numerical results are given to verify the effectiveness of the presented method.

  • discrete time optimal control via local Policy Iteration adaptive dynamic programming
    IEEE Transactions on Systems Man and Cybernetics, 2017
    Co-Authors: Qinglai Wei, Derong Liu, Qiao Lin, Ruizhuo Song
    Abstract:

    In this paper, a discrete-time optimal control scheme is developed via a novel local Policy Iteration adaptive dynamic programming algorithm. In the discrete-time local Policy Iteration algorithm, the iterative value function and iterative control law can be updated in a subset of the state space, which reduces the computational burden compared with the traditional Policy Iteration algorithm. Convergence properties of the local Policy Iteration algorithm are presented to show that the iterative value function is monotonically nonincreasing and converges to the optimum under some mild conditions. The admissibility of the iterative control law is proven, which shows that the control system can be stabilized under any of the iterative control laws, even if the iterative control law is updated only in a subset of the state space. Finally, two simulation examples are given to illustrate the performance of the developed method.

Michail G. Lagoudakis - One of the best experts on this subject based on the ideXlab platform.

  • rollout sampling approximate Policy Iteration
    European conference on Machine Learning, 2008
    Co-Authors: Christos Dimitrakakis, Michail G. Lagoudakis
    Abstract:

    Several researchers [2,3] have recently investigated the connection between reinforcement learning and classification. Our work builds on [2], which suggests an approximate Policy Iteration algorithm for learning a good Policy represented as a classifier, without explicit value function representation. At each Iteration, a new Policy is produced using training data obtained through rollouts of the previous Policy on a simulator. These rollouts aim at identifying better action choices over a subset of states in order to form a set of data for training the classifier representing the improved Policy. Even though [2,3] examine how to distribute training states over the state space, their major limitation remains the large amount of sampling employed at each training state. We suggest methods to reduce the number of samples needed to obtain a high-quality training set. This is done by viewing the setting as akin to a bandit problem over the states from which rollouts are performed. Our contribution is two-fold: (a) we suitably adapt existing bandit techniques for rollout management, and (b) we suggest a more appropriate statistical test for identifying states with dominating actions early and with high confidence. Experiments on two classical domains (inverted pendulum, mountain car) demonstrate an improvement in sample complexity that substantially increases the applicability of rollout-based algorithms. In future work, we aim to obtain algorithms specifically tuned to this task with even lower sample complexity and to address the question of the choice of sampling distribution.

  • Rollout sampling approximate Policy Iteration
    Machine Learning, 2008
    Co-Authors: Christos Dimitrakakis, Michail G. Lagoudakis
    Abstract:

    Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate Policy Iteration schemes without value functions, which focus on Policy representation using classifiers and address Policy learning as a supervised learning problem. This paper proposes variants of an improved Policy Iteration scheme which treats the core sampling problem of evaluating a Policy through simulation as a multi-armed bandit problem. The resulting algorithm offers performance comparable to that of the previous algorithm, achieved, however, with significantly less computational effort. An order of magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
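
A sketch of the rollout step that both versions of this paper build on: at each sampled state, Monte Carlo rollouts of the current policy estimate Q(s, a) for every action, and the apparently best action becomes a training label. The bandit-style allocation of rollouts and the early-elimination statistical test, which are the papers' actual contributions, are omitted; the MDP, sample sizes, and names are made up for illustration.

```python
# Rollout step that produces training labels for the classifier-based policy:
# estimate Q(s, a) at sampled states by rollouts of the current policy.
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, horizon = 10, 3, 0.95, 30
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.random((nS, nA))

def step(s, a):
    return rng.choice(nS, p=P[s, a]), R[s, a]

def rollout_return(s, a, policy):
    """One rollout: take a at s, then follow the current policy for `horizon` steps."""
    total, discount = 0.0, 1.0
    for t in range(horizon):
        s, r = step(s, a)
        total += discount * r
        discount *= gamma
        a = policy[s]
    return total

policy = rng.integers(0, nA, size=nS)          # current (to-be-improved) policy
rollout_states = rng.choice(nS, size=5, replace=False)
n_rollouts = 20
labels = {}
for s in rollout_states:
    q_hat = [np.mean([rollout_return(s, a, policy) for _ in range(n_rollouts)])
             for a in range(nA)]
    labels[s] = int(np.argmax(q_hat))          # training label for the classifier
print("state -> empirically best action:", labels)
```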

  • least squares Policy Iteration
    Journal of Machine Learning Research, 2003
    Co-Authors: Michail G. Lagoudakis, Ronald Parr
    Abstract:

    We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate Policy Iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed Policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares Policy Iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental Policy improvement within a Policy-Iteration framework. LSPI is a model-free, off-Policy method which can efficiently use (and reuse in each Iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.
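
A minimal sketch of the LSPI loop with the LSTD-Q solve at its core, using one-hot (tabular) state-action features on a random MDP. The feature map, sample-collection scheme, and model are illustrative assumptions; the paper targets general linear architectures and continuous benchmarks such as the pendulum and bicycle.

```python
# Minimal LSPI loop: LSTD-Q policy evaluation followed by greedy improvement,
# with one-hot state-action features on a small random MDP.
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.random((nS, nA))
dim = nS * nA

def phi(s, a):
    f = np.zeros(dim)
    f[s * nA + a] = 1.0                    # one-hot (tabular) state-action feature
    return f

# Samples (s, a, r, s') collected from an arbitrary behavior policy (off-policy).
samples = []
for _ in range(3000):
    s, a = rng.integers(nS), rng.integers(nA)
    s_next = rng.choice(nS, p=P[s, a])
    samples.append((s, a, R[s, a], s_next))

def lstdq(samples, policy):
    """Least-squares fixed point for Q^pi with linear features."""
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy[s_next])
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.lstsq(A, b, rcond=None)[0]

policy = np.zeros(nS, dtype=int)
for _ in range(10):
    w = lstdq(samples, policy)
    Q = w.reshape(nS, nA)
    new_policy = Q.argmax(axis=1)          # greedy improvement (reward convention)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("greedy policy:", policy)
```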

  • model free least squares Policy Iteration
    Neural Information Processing Systems, 2001
    Co-Authors: Michail G. Lagoudakis, Ronald Parr
    Abstract:

    We propose a new approach to reinforcement learning which combines least squares function approximation with Policy Iteration. Our method is model-free and completely off-Policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems; however, it has heretofore not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI), addresses these issues. The result is an off-Policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

Mohammad Ghavamzadeh - One of the best experts on this subject based on the ideXlab platform.

  • regularized Policy Iteration with nonparametric function spaces
    Journal of Machine Learning Research, 2016
    Co-Authors: Amirmassoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvari, Shie Mannor
    Abstract:

    We study two regularization-based approximate Policy Iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. At the core of these algorithms are regularized extensions of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms' Policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the Policy evaluation error and the performance loss of the Policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the Policy evaluation bound on the number of samples is minimax optimal. This is the first work that provides such a strong guarantee for a nonparametric approximate Policy Iteration algorithm.
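
The regularized policy evaluation step differs from the plain LSTD-Q solve in the LSPI sketch earlier only in the penalized solve. Below is a minimal sketch in which a ridge (L2) penalty stands in for the RKHS-norm regularizers (REG-LSPI / REG-BRM) analyzed in the paper; the function name and the exact form of the penalty are assumptions for illustration.

```python
# Ridge-regularized variant of the LSTD-Q solve (A and b are built from samples
# exactly as in the plain LSTD-Q sketch above).  The L2 penalty below stands in
# for the RKHS-norm regularizers used by REG-LSPI / REG-BRM.
import numpy as np

def reg_lstdq_solve(A, b, reg):
    """Solve the penalized least-squares fixed-point system (A + reg*I) w = b."""
    dim = A.shape[0]
    return np.linalg.solve(A + reg * np.eye(dim), b)

# Usage: w = reg_lstdq_solve(A, b, reg=1e-2); a larger `reg` shrinks the fitted
# value function and controls the effective complexity of the function space.
```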

  • Approximate modified Policy Iteration and its application to the game of Tetris
    Journal of Machine Learning Research, 2015
    Co-Authors: Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, Matthieu Geist
    Abstract:

    Modified Policy Iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated Policy and value Iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value Iteration, fitted-Q Iteration, and classification-based Policy Iteration. We provide an error propagation analysis that unifies those for approximate Policy and value Iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples.
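
A minimal sketch of exact tabular Modified Policy Iteration showing the role of the main parameter m discussed above: m = 1 recovers value iteration and m → ∞ recovers policy iteration. The random MDP and the fixed m are illustrative; the paper's subject is the approximate versions (fitted values, fitted Q, classifiers).

```python
# Exact tabular Modified Policy Iteration: each iteration applies the greedy
# policy's Bellman operator m times instead of solving the evaluation exactly.
import numpy as np

rng = np.random.default_rng(6)
nS, nA, gamma, m = 6, 3, 0.9, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.random((nS, nA))

V = np.zeros(nS)
for k in range(100):
    Q = R + gamma * P @ V
    policy = Q.argmax(axis=1)                  # greedy policy w.r.t. current V
    for _ in range(m):                         # m applications of T_pi (partial evaluation)
        V = R[np.arange(nS), policy] + gamma * P[np.arange(nS), policy] @ V
print("value estimates:", V.round(3), "\ngreedy policy:", policy)
```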

  • Approximate Modified Policy Iteration
    2012
    Co-Authors: Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Matthieu Geist
    Abstract:

    Modified Policy Iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated Policy and value Iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value Iteration, fitted-Q Iteration, and classification-based Policy Iteration. We provide error propagation analyses that unify those for approximate Policy and value Iteration. For the last, classification-based implementation, we develop a finite-sample analysis that shows that MPI's main parameter makes it possible to control the balance between the estimation error of the classifier and the overall value function approximation.

  • finite sample analysis of least squares Policy Iteration
    Journal of Machine Learning Research, 2012
    Co-Authors: Alessandro Lazaric, Mohammad Ghavamzadeh, Remi Munos
    Abstract:

    In this paper, we report a performance bound for the widely used least-squares Policy Iteration (LSPI) algorithm. We first consider the problem of Policy evaluation in reinforcement learning, that is, learning the value function of a fixed Policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each Policy evaluation step is propagated through the Iterations of a Policy Iteration method, and derive a performance bound for the LSPI algorithm.

  • analysis of a classification based Policy Iteration algorithm
    International Conference on Machine Learning, 2010
    Co-Authors: Alessandro Lazaric, Mohammad Ghavamzadeh, Remi Munos
    Abstract:

    We present a classification-based Policy Iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of Policy improvement steps, the number of rollouts used in each Iteration, the capacity of the considered Policy space, and a new capacity measure which indicates how well the Policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based Policy Iteration setting. We also study the consistency of the method when there exists a sequence of Policy spaces with increasing capacity.
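
A hedged sketch of the classification step in a classification-based policy iteration: the improved policy is a classifier fit to (state, best action) pairs. The placeholder labels and the decision-tree policy space are assumptions for illustration; in the algorithm analyzed in the paper the labels come from rollout estimates of the action values at sampled states.

```python
# The improved policy is represented by a classifier trained on (state, best
# action) pairs.  The labels here come from a placeholder rule standing in for
# rollout estimates of Q^pi; a decision tree stands in for the policy space.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
states = rng.uniform(-1.0, 1.0, size=(200, 2))          # sampled rollout states in R^2
# Placeholder labels: in the real algorithm these are argmax_a of rollout
# estimates of Q^pi(s, a) at each sampled state.
labels = (states[:, 0] + states[:, 1] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(states, labels)                                  # pi_{k+1} is the fitted classifier
new_policy = lambda s: int(clf.predict(np.atleast_2d(s))[0])
print("action at state (0.3, -0.1):", new_policy([0.3, -0.1]))
```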

Dimitri P. Bertsekas - One of the best experts on this subject based on the ideXlab platform.

  • on line Policy Iteration for infinite horizon dynamic programming
    arXiv: Optimization and Control, 2021
    Co-Authors: Dimitri P. Bertsekas
    Abstract:

    In this paper we propose an on-line Policy Iteration (PI) algorithm for finite-state infinite horizon discounted dynamic programming, whereby the Policy improvement operation is done on-line, only for the states that are encountered during operation of the system. This allows the continuous updating/improvement of the current Policy, thus resulting in a form of on-line PI that incorporates the improved controls into the current Policy as new states and controls are generated. The algorithm converges in a finite number of stages to a type of locally optimal Policy, and suggests the possibility of variants of PI and multiagent PI where the Policy improvement is simplified. Moreover, the algorithm can be used with on-line replanning, and is also well-suited for on-line PI algorithms with value and Policy approximations.
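
A simplified tabular caricature of the idea of improving the policy only at encountered states; it is not the paper's exact algorithm or its convergence conditions, and the random model, initialization, and update order are assumptions for illustration.

```python
# Simplified caricature of on-line policy improvement: the policy and cost
# estimates are updated only at the states actually visited while the system
# runs.  Illustration of the idea only, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(8)
nS, nA, gamma = 8, 3, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
g = rng.random((nS, nA))

policy = rng.integers(0, nA, size=nS)         # initial (suboptimal) policy
# Cost estimates initialized with the exact cost of the initial policy.
P0 = P[np.arange(nS), policy]
J = np.linalg.solve(np.eye(nS) - gamma * P0, g[np.arange(nS), policy])

s = 0
for t in range(500):                          # on-line operation of the system
    q = g[s] + gamma * P[s] @ J               # one-step lookahead at the visited state
    policy[s] = int(q.argmin())               # improve the control only at state s
    J[s] = q.min()                            # refresh the cost estimate only at state s
    s = rng.choice(nS, p=P[s, policy[s]])     # apply the improved control, move on
print("policy after on-line improvement:", policy)
```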

  • lambda Policy Iteration a review and a new implementation
    arXiv: Systems and Control, 2015
    Co-Authors: Dimitri P. Bertsekas
    Abstract:

    In this paper we discuss $\lambda$-Policy Iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value Iteration (VI) and Policy Iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each Policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE($\lambda$), LSTD($\lambda$), or TD($\lambda$) for Policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.
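
A minimal sketch of exact tabular $\lambda$-policy iteration, using the closed-form multistep evaluation $J \leftarrow J + (I - \lambda\gamma P_\mu)^{-1}(T_\mu J - J)$; $\lambda = 0$ gives value iteration and $\lambda \to 1$ gives policy iteration. The random model is illustrative, and the paper's simulation-based implementation with geometric sampling is not shown.

```python
# Exact tabular lambda-policy iteration.  The evaluation step applies the
# multistep operator T^(lambda) via
#   J <- J + (I - lam*gamma*P_mu)^{-1} (T_mu J - J),
# so lam = 0 gives value iteration and lam -> 1 gives policy iteration.
import numpy as np

rng = np.random.default_rng(9)
nS, nA, gamma, lam = 6, 3, 0.9, 0.7
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
g = rng.random((nS, nA))

J = np.zeros(nS)
for k in range(100):
    Q = g + gamma * P @ J
    policy = Q.argmin(axis=1)                          # greedy policy mu_k
    P_mu = P[np.arange(nS), policy]
    T_mu_J = g[np.arange(nS), policy] + gamma * P_mu @ J
    J = J + np.linalg.solve(np.eye(nS) - lam * gamma * P_mu, T_mu_J - J)
print("costs:", J.round(3), "\npolicy:", policy)
```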

  • A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies
    Mathematics of Operations Research, 2015
    Co-Authors: Dimitri P. Bertsekas
    Abstract:

    We consider stochastic optimal control models with Borel spaces and universally measurable policies. For such models the standard Policy Iteration is known to have difficult measurability issues and cannot be carried out in general. We present a mixed value and Policy Iteration method that circumvents this difficulty. The method allows the use of stationary policies in computing the optimal cost function in a manner that resembles Policy Iteration. It can also be used to address similar difficulties of Policy Iteration in the context of upper and lower semicontinuous models. We analyze the convergence of the method in infinite horizon total cost problems for the discounted case where the one-stage costs are bounded and for the undiscounted case where the one-stage costs are nonpositive or nonnegative. For undiscounted total cost problems with nonnegative one-stage costs, we also give a new convergence theorem for value Iteration that shows that value Iteration converges whenever it is initialized with a function that is above the optimal cost function and yet bounded by a multiple of the optimal cost function. This condition resembles Whittle’s bridging condition and is partly motivated by it. The theorem is also partly motivated by a result of Maitra and Sudderth that showed that value Iteration, when initialized with the constant function zero, could require a transfinite number of Iterations to converge. We use the new convergence theorem for value Iteration to establish the convergence of our mixed value and Policy Iteration method for the nonnegative cost case.

  • A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies
    2014
    Co-Authors: Dimitri P. Bertsekas
    Abstract:

    We consider the stochastic control model with Borel spaces and universally measurable policies. For this model the standard Policy Iteration is known to have difficult measurability issues and cannot be carried out in general. We present a mixed value and Policy Iteration method that circumvents this difficulty. The method allows the use of stationary policies in computing the optimal cost function, in a manner that resembles Policy Iteration. It can also be used to address similar difficulties of Policy Iteration in the context of upper and lower semicontinuous models. We analyze the convergence of the method in infinite horizon total cost problems, for the discounted case where the one-stage costs are bounded, and for the undiscounted case where the one-stage costs are nonpositive or nonnegative. For undiscounted total cost problems with nonnegative one-stage costs, we also give a new convergence theorem for value Iteration, which shows that value Iteration converges whenever it is initialized with a function that is above the optimal cost function and yet bounded by a multiple of the optimal cost function. This condition resembles Whittle’s bridging condition and is partly motivated by it. The theorem is also partly motivated by a result of Maitra and Sudderth, which showed that value Iteration, when initialized with the constant function zero, could require a transfinite number of Iterations to converge. We use the new convergence theorem for value Iteration to establish the convergence of our mixed value and Policy Iteration method for the nonnegative cost case.