
Released

Poster

Sweeping Improvements to Exploration

MPS-Authors
Antonov,  G
Department of Computational Neuroscience, Max Planck Institute for Biological Cybernetics, Max Planck Society;

Dayan,  P
Department of Computational Neuroscience, Max Planck Institute for Biological Cybernetics, Max Planck Society;

Citation

Antonov, G., & Dayan, P. (2022). Sweeping Improvements to Exploration. Poster presented at 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2022), Providence, RI, USA.


Cite as: https://hdl.handle.net/21.11116/0000-000A-825E-A
Abstract
A modern synthesis of the many studies examining hippocampal replay in decision-making tasks suggests that such patterns of behaviourally-relevant neural activity may support offline generative planning mechanisms, such as DYNA, that have been postulated in reinforcement learning (RL). A key observation in favour of this suggestion is the apparently close association between specific choices of replay experiences and both reward and the animal's policy, i.e., the decisions it subsequently makes; however, the rules that govern the selection of these experiences remain poorly understood. A recent theory, based on key optimising ideas from RL, provides an astute normative account of this prioritisation, suggesting that replay experiences should be ordered according to their expected immediate impact on the value accrued by applying the newly changed policy. This theory closely matches experimental data from both rodent and human experiments; however, it focuses on exploitation to the exclusion of exploration, which limits its applicability. Here, we consider how offline, replay-like planning mechanisms can contribute to information-seeking behaviour in the form of directed exploration, by extending the theory to partially observable domains. We analyse the resulting exploratory replay choices in two cases: a stateless bandit with uncertainty about arm outcomes, and a dynamic maze with a removable barrier, which we model as a continual learning problem.
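The prioritisation the abstract refers to (ordering remembered transitions by the expected immediate impact of replaying them on the value the updated policy would accrue) can be sketched in a few lines for a tabular Dyna-style agent. The sketch below is purely illustrative and is not the poster's model: the function name `evb_priorities` and all parameters are hypothetical, it computes only the policy-improvement "gain" of a single backup under a softmax policy, omits the "need" (state-occupancy) weighting used in the full normative theory, and does not implement the extension to partially observable domains described above.

```python
import numpy as np

def evb_priorities(Q, memories, gamma=0.9, alpha=1.0, beta=5.0):
    """Score each remembered transition (s, a, r, s_next) by how much
    the expected value at s would improve if this one Q-backup were
    applied and the softmax policy at s re-derived. Illustrative sketch
    only; the full theory also weights each score by a 'need' term."""

    def softmax(q):
        # Numerically stable softmax over action values.
        e = np.exp(beta * (q - q.max()))
        return e / e.sum()

    scores = []
    for (s, a, r, s_next) in memories:
        q_new = Q.copy()
        # One Dyna-style backup of this single transition.
        target = r + gamma * q_new[s_next].max()
        q_new[s, a] += alpha * (target - q_new[s, a])
        # Gain: value of the new policy minus value of the old policy,
        # both evaluated under the updated action values at state s.
        gain = softmax(q_new[s]) @ q_new[s] - softmax(Q[s]) @ q_new[s]
        scores.append(gain)
    return np.array(scores)
```

Under this scoring, a transition whose backup actually changes the policy (e.g. one revealing reward at a previously unvalued action) receives a higher priority than a transition whose backup leaves the action values, and hence the policy, unchanged.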