Reference
L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, "Online
least-squares policy iteration for reinforcement learning control,"
Proceedings of the 2010 American Control Conference,
Baltimore, Maryland, pp. 486-491, June-July 2010.
Abstract
Reinforcement learning is a promising paradigm for learning optimal control. We
consider policy iteration (PI) algorithms for reinforcement learning, which
iteratively evaluate and improve control policies. State-of-the-art, least-squares
techniques for policy evaluation are sample-efficient and have relaxed
convergence requirements. However, they are typically used in offline PI,
whereas a central goal of reinforcement learning is to develop
online algorithms. Therefore, we propose an online PI
algorithm that evaluates policies with the so-called least-squares temporal
difference for Q-functions (LSTD-Q). The crucial difference between this
online least-squares policy iteration (LSPI) algorithm and
its offline counterpart is that, in the online case, policy improvements must
be performed once every few state transitions, using only an incomplete
evaluation of the current policy. In an extensive experimental evaluation,
online LSPI is found to work well for a wide range of its parameters, and to
learn successfully in a real-time example. Online LSPI also compares favorably
with offline LSPI and with a different flavor of online PI, which instead of
LSTD-Q employs another least-squares method for policy evaluation.
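The online scheme described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact algorithm. It assumes a discrete action set, a fixed epsilon-greedy exploration rate, and a small regularizer on the LSTD-Q matrix. The names `online_lspi`, `step`, `phi`, `K`, `eps`, and `delta` are hypothetical and chosen only for illustration. The LSTD-Q statistics are updated after every transition, and the policy is improved every `K` transitions from whatever incomplete evaluation is available at that point.

```python
import numpy as np

def online_lspi(step, phi, n_features, n_actions, x0, n_steps=500, K=5,
                gamma=0.9, eps=0.2, delta=1e-3, seed=0):
    """Illustrative online LSPI loop (a sketch under the assumptions above).

    step(x, u) -> (x_next, reward) is the environment (hypothetical interface).
    phi(x, u)  -> feature vector of length n_features for the pair (x, u).
    """
    rng = np.random.default_rng(seed)
    A = delta * np.eye(n_features)  # regularizer keeps A invertible early on
    b = np.zeros(n_features)
    theta = np.zeros(n_features)    # Q(x, u) is approximated by phi(x, u) @ theta

    def greedy(x):
        # Policy implied by the current parameters (ties go to the lowest index).
        return int(np.argmax([phi(x, u) @ theta for u in range(n_actions)]))

    x = x0
    for k in range(1, n_steps + 1):
        # Epsilon-greedy exploration: online learning needs exploratory actions.
        u = int(rng.integers(n_actions)) if rng.random() < eps else greedy(x)
        x_next, r = step(x, u)

        # LSTD-Q update, evaluating the current greedy policy at the next state.
        f = phi(x, u)
        f_next = phi(x_next, greedy(x_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r

        # Policy improvement every K transitions, from an incomplete evaluation.
        if k % K == 0:
            theta = np.linalg.solve(A, b)
        x = x_next
    return theta
```

In this sketch the statistics `A` and `b` are carried over across policy improvements rather than reset, and a smaller `K` makes the policy change more often at the cost of a less accurate evaluation between improvements.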
BibTeX
@inproceedings{BusErn:10-009,
author = {Bu{\c{s}}oniu, Lucian and Ernst, Damien and De Schutter, Bart
and Babu{\v{s}}ka, Robert},
title = {Online Least-Squares Policy Iteration for Reinforcement
Learning Control},
booktitle = {Proceedings of the 2010 American Control Conference},
address = {Baltimore, Maryland},
pages = {486--491},
month = jun # {--} # jul,
year = {2010}
}