Conference Paper

Relative Entropy Inverse Reinforcement Learning

MPS-Authors

Boularias, A.
Dept. Empirical Inference, Max Planck Institute for Intelligent Systems, Max Planck Society

Kober, J.
Dept. Empirical Inference, Max Planck Institute for Intelligent Systems, Max Planck Society

Peters, J.
Dept. Empirical Inference, Max Planck Institute for Intelligent Systems, Max Planck Society

Citation

Boularias, A., Kober, J., & Peters, J. (2011). Relative Entropy Inverse Reinforcement Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).


Cite as: https://hdl.handle.net/11858/00-001M-0000-0010-4EFD-0
Abstract
We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstrations, based on the assumption that the expert acts optimally in a Markov Decision Process (MDP). Most past work on IRL requires that a (near-)optimal policy can be computed for different reward functions. However, this requirement is hard to satisfy in systems with a large or continuous state space. In this paper, we propose a model-free IRL algorithm in which the relative entropy between the empirical distribution of the state-action trajectories under a uniform policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using approximate MDP models. Empirical results on simulated car racing, gridworld, and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.
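
The optimization described in the abstract can be written schematically as follows. This is only an illustrative reading of the abstract: the feature-matching constraint, the symbols Q, P, f, and the exponential-family dual are standard for this family of methods but are assumptions here, not details taken from this record.

% Schematic sketch (assumptions noted above; symbols defined below).
% Q(\tau): empirical distribution of state-action trajectories under the uniform policy
% P(\tau): trajectory distribution under the learned policy
% f(\tau): feature counts of trajectory \tau; \hat{f}_i: empirical feature counts of the expert demonstrations
\min_{P}\; \mathrm{KL}\!\left(P \,\|\, Q\right)
  = \sum_{\tau} P(\tau)\,\ln\frac{P(\tau)}{Q(\tau)}
\quad \text{s.t.} \quad
\left|\, \mathbb{E}_{P}\!\left[f_i(\tau)\right] - \hat{f}_i \,\right| \le \epsilon_i \quad \forall i,
% whose dual solution has the exponential-family form
P_{\theta}(\tau) \;\propto\; Q(\tau)\,\exp\!\big(\theta^{\top} f(\tau)\big),
% with \theta fitted by stochastic gradient descent, e.g. estimating
% \mathbb{E}_{P_{\theta}}[f(\tau)] by importance sampling from trajectories
% drawn under the uniform policy.

In a sketch of this kind, the expectation under the learned distribution is estimated from trajectories sampled under the uniform policy rather than from a transition model, which is consistent with the abstract's description of the algorithm as model-free.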