cost for switching options or for deliberating too
long (Bacon and Precup 2015) — an interpretation
that finds its roots in the bounded rationality framework (Simon 1957).
When departing from perfect rationality, boundedly rational systems are naturally pressured into
making use of the regularities of their environment.
When such systems are learning representations,
only the essential elements can be captured because
the resources — time, energy, computation, favorable
opportunities — are scarce. Evaluation platforms that
suitably reflect these conditions are not yet available
in reinforcement learning. However, we are hoping
to extend our experiments to a more naturalistic scenario by learning in a continuing fashion rather than
in a single task.
From the bounded rationality perspective, providing “good enough” behavior at all times in an efficient manner might be the raison d’être for options.
For example, consider a problem setting where the
world does not wait for a carefully thought out best
next action: maybe a rhinoceros is suddenly charging — no time to waste, acting is all that matters.
There is an inherent cost in nature, but also in artificial systems, for carrying out excessive computation.
Having options that are sufficiently temporally
extended seems to provide a balance between fast
decision making and high-level deliberative reasoning.
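This trade-off can be made concrete with a toy sketch. The code below is illustrative only: the environment, the fixed-duration option class, and the cost constant are our assumptions, not part of the option-critic architecture. It shows how, under call-and-return execution, a fixed cost charged at every switch is amortized when options are temporally extended.

```python
# Hypothetical sketch: call-and-return execution where every switch of
# option at the root level incurs a fixed deliberation cost. Temporally
# extended options pay this cost less often over the same horizon.
# All names and values here are illustrative assumptions.

DELIBERATION_COST = 0.5  # penalty charged each time a new option is chosen

class FixedDurationOption:
    """Toy option: repeats one dummy action for `duration` steps, then terminates."""
    def __init__(self, duration):
        self.duration = duration
        self.steps_run = 0

    def policy(self, state):
        self.steps_run += 1
        return 0  # a single dummy primitive action

    def terminates(self, state):
        done = self.steps_run >= self.duration
        if done:
            self.steps_run = 0
        return done

class UnitRewardEnv:
    """Toy environment: every step yields reward 1."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, 1.0

def run_episode(env, make_option, steps=20):
    """Total reward net of deliberation costs over one rollout."""
    state = env.reset()
    option, total = None, 0.0
    for _ in range(steps):
        if option is None:              # deliberate only when switching
            option = make_option()
            total -= DELIBERATION_COST  # pay the switching cost
        state, reward = env.step(option.policy(state))
        total += reward
        if option.terminates(state):    # call-and-return: control goes
            option = None               # back to the policy over options
    return total

short = run_episode(UnitRewardEnv(), lambda: FixedDurationOption(1))
long_ = run_episode(UnitRewardEnv(), lambda: FixedDurationOption(5))
```

With one-step options the agent deliberates at every step (net return 10.0 over 20 steps), while five-step options incur only four switches (net return 18.0), illustrating the balance between fast acting and deliberation.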
Initiation sets also provide a mechanism for managing computation. We avoided working with them
in our option-critic approach, however, by making
the assumption that options are available everywhere. By their very nature, initiation sets are not
parameterized functions, so it is difficult to use our
usual optimization toolkit to learn them end to end,
as we did with termination functions. Addressing this problem requires first redefining initiation sets as initiation functions. Then, to
derive a policy gradient–like theorem for initiation
functions, we would also need to represent initiation functions with a randomized and differentiable parameterization, as we have for termination functions. In this case, the meaning of randomized initiation functions would have
to be clarified in relation to the call-and-return execution model. Finally, we might need to enforce
compositional properties of options so as to avoid an
option terminating in a region of the state space
where no other options can be taken. It is not clear
at this point how this property could be tractably
enforced in our optimization objective.
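One way to picture initiation sets recast as initiation functions is to parameterize them exactly like termination functions: a randomized, differentiable gate over state features. The sketch below is a hypothetical illustration (the class, its weights, and the feature representation are all our assumptions); the standard Bernoulli score-function gradient would then give a policy gradient–like update for initiations, mirroring the one for terminations.

```python
# Hypothetical sketch: a sigmoid gate over state features, usable both as a
# termination function beta_o(s) and as an initiation function i_o(s).
# All names and shapes are illustrative assumptions.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class SigmoidGate:
    """Randomized, differentiable gate: fires with probability
    sigmoid(w . phi(s)) given a linear feature vector."""
    def __init__(self, weights):
        self.weights = weights

    def prob(self, features):
        return sigmoid(sum(w * f for w, f in zip(self.weights, features)))

    def sample(self, features, rng=random):
        return rng.random() < self.prob(features)

    def grad_log_prob(self, features, fired):
        # Bernoulli score function d/dw log p(fired | s): the building
        # block of a policy gradient-like update for the gate's weights.
        p = self.prob(features)
        return [((1.0 - p) if fired else -p) * f for f in features]

def available_options(initiation_gates, features, rng=random):
    """Under call-and-return, randomized initiation functions restrict
    which options the policy over options may call in this state."""
    return [o for o, gate in enumerate(initiation_gates)
            if gate.sample(features, rng)]
```

A soft penalty on states where no initiation gate opens would be one conceivable (though untested) stand-in for the compositionality constraint discussed above, since it discourages options from terminating where nothing can be initiated.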
We gratefully acknowledge the funding received for
this work from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Québec – Nature et technologies (FRQNT). We are very grateful to Jean Harb for the
experimental results presented here, to Genevieve
Fried for her feedback on this article, and to Rich Sutton for many inspiring conversations on options.
Araújo, D., and Davids, K. 2011. What Exactly Is Acquired
During Skill Acquisition? Journal of Consciousness Studies
18(3–4): 7–23.
Bacon, P.-L. 2013. On the Bottleneck Concept for Options
Discovery: Theoretical Underpinnings and Extension in
Continuous State Spaces. Master’s thesis, Dept. of Computer Science, McGill University.
Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The Option-Critic Architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 1726–1734. Palo Alto, CA: AAAI Press.
Bacon, P.-L., and Precup, D. 2015. Learning with Options:
Just Deliberate and Relax. Paper presented at the NIPS
Bounded Optimality and Rational Metareasoning Workshop, Montréal, Québec, Canada, December 11.
Baird, L. C. 1993. Advantage Updating. Technical Report
WL–TR-93-1146, Wright Laboratory, Wright-Patterson Air
Force Base, OH.
Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M.
2013. The Arcade Learning Environment: An Evaluation
Platform for General Agents. Journal of Artificial Intelligence
Research 47(1): 253–279.
Botvinick, M. M.; Niv, Y.; and Barto, A. C. 2009. Hierarchically Organized Behavior and Its Neural Foundations: A Reinforcement Learning Perspective. Cognition 113(3): 262–280.
Bouvrie, J. V., and Maggioni, M. 2012. Efficient Solution of
Markov Decision Problems with Multiscale Representations.
In 50th Annual Allerton Conference on Communication, Control, and Computing, 474–481. Piscataway, NJ: Institute of Electrical and Electronics Engineers. doi.org/10.1109/Aller-
Chaganty, A. T.; Gaur, P.; and Ravindran, B. 2012. Learning
in a Small World. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS'12), volume 1, 391–397. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.
Dayan, P., and Hinton, G. E. 1992. Feudal Reinforcement
Learning. In Advances in Neural Information Processing Systems 5, 271–278. San Francisco: Morgan Kaufmann.
Dean, T., and Lin, S.-H. 1995. Decomposition Techniques
for Planning in Stochastic Domains. In Proceedings of the
14th International Joint Conference on Artificial Intelligence
(IJCAI’ 95), volume 2, 1121–1127. San Francisco: Morgan
Dietterich, T. G. 1998. The MAXQ Method for Hierarchical
Reinforcement Learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), 118–126.
San Francisco: Morgan Kaufmann Publishers.
Dietterich, T. G. 2000. Hierarchical Reinforcement Learning
with the MAXQ Value Function Decomposition. Journal of
Artificial Intelligence Research 13: 227–303.
Drescher, G. L. 1991. Made-Up Minds: A Constructivist
Approach to Artificial Intelligence. Cambridge, MA: The MIT Press.
Fikes, R.; Hart, P. E.; and Nilsson, N. J. 1972. Learning and