
Record


Released

Book Chapter

Policy Gradient Methods

MPG Authors

Peters, J.
Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society;
Max Planck Institute for Biological Cybernetics, Max Planck Society;

Full texts (restricted access)
No full texts are currently released for your IP range.
Full texts (freely accessible)
No freely accessible full texts are available in PuRe.
Supplementary material (freely accessible)
No freely accessible supplementary materials are available.
Citation

Peters, J., & Bagnell, J. (2010). Policy Gradient Methods. In C. Sammut, & G. Webb (Eds.), Encyclopedia of Machine Learning (pp. 774-776). Berlin, Germany: Springer.


Citation link: https://hdl.handle.net/11858/00-001M-0000-0013-BD40-C
Abstract
A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by a variant of gradient descent. These methods belong to the class of policy search techniques that maximize the expected return of a policy within a fixed policy class, in contrast to traditional value function approximation approaches that derive policies from a value function. Policy gradient approaches have several advantages: they allow domain knowledge to be incorporated straightforwardly in the policy parametrization, and an optimal policy is often represented more compactly than the corresponding value function; many such methods are guaranteed to converge to at least a locally optimal policy; and they naturally handle continuous states and actions, and often even imperfect state information. The countervailing drawbacks include difficulties in off-policy settings, potentially very slow convergence and high sample complexity, and convergence to local optima that are not globally optimal.
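The abstract describes the idea only in prose; as a minimal sketch of what "directly optimizes a parametrized control policy by a variant of gradient descent" can look like, the following REINFORCE-style likelihood-ratio estimator in Python illustrates one concrete instance. The linear softmax policy, the features map, and the Gym-style env.reset()/env.step() interface are assumptions made for illustration only; they are not taken from the chapter.

import numpy as np

def softmax_policy(theta, phi):
    """Linear softmax policy: phi is the (n_actions, dim) feature matrix
    for the current state, theta the policy parameter vector."""
    prefs = phi @ theta
    prefs -= prefs.max()                     # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_episode(theta, env, features, alpha=0.01, gamma=0.99):
    """Run one episode with the current parametrized policy, then take one
    gradient-ascent step on a Monte Carlo (likelihood-ratio) estimate of
    the gradient of the expected return."""
    state, _ = env.reset()                   # assumed Gym-style interface
    trajectory, done = [], False
    while not done:
        phi = features(state)                # hypothetical feature map
        probs = softmax_policy(theta, phi)
        action = np.random.choice(len(probs), p=probs)
        state_next, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((phi, action, reward))
        state, done = state_next, terminated or truncated

    # Accumulate grad log pi(a|s) weighted by the return-to-go.
    G, grad = 0.0, np.zeros_like(theta)
    for phi, action, reward in reversed(trajectory):
        G = reward + gamma * G
        probs = softmax_policy(theta, phi)
        grad_log_pi = phi[action] - probs @ phi
        grad += grad_log_pi * G
    return theta + alpha * grad              # ascend the estimated gradient

Repeating reinforce_episode over many episodes performs stochastic gradient ascent on the expected return within this fixed (linear softmax) policy class, which is the kind of procedure the abstract contrasts with value-function-based approaches.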