option specialized in action sequences going
upwards on the way to replenish the oxygen, while
the other executed only when descending below the
water surface. Because nothing is specified a priori
about options beyond the control objective, there is
no mechanistic explanation for how these specific
options came to be. However, we postulate that the
elementary memory structure of options might represent aspects of the game dynamics having to do
with oxygen management. When nearing low levels
of oxygen, the agent must resist the urge to replenish
the tank long enough to reach the surface. This
would be more difficult to represent in a purely reactive fashion or without having recourse to temporal
abstraction.
Conclusion and Future Work
The option-critic architecture is based on the general
idea that the responsibility of learning should go to
the learner (Drescher 1991). Instead of requiring an
expert to make guesses about what aspects of the task
and environment might be useful for building
options, we let our system learn the right kind of
options for the task at hand directly from its stream
of experience. Building from the blueprints of policy
gradient methods, we provide gradient theorems for
options that allow their internal policies and termi-
nation conditions to be adjusted continually and
simultaneously in order to solve the task. If
desired, regularizers can also be added to this objec-
tive to make the system easily informable (Nilsson
1995). The option-critic architecture can then be
instantiated through different implementations of
the stochastic gradient ascent procedure associated
with these gradients.
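As a rough illustration of how such gradient updates might be carried out, the following minimal tabular sketch performs one intra-option policy-gradient step and one termination-gradient step. The parameterization (softmax intra-option policies, sigmoid terminations), the names pi_w, beta_w, q_u, and q_omega, and the greedy value baseline are assumptions of this sketch, not the authors' implementation; the critic tables would themselves be updated by a temporal-difference rule, omitted here.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class OptionCriticSketch:
    """Tabular sketch: softmax intra-option policies with weights
    pi_w[s, o, a], sigmoid terminations with weights beta_w[s, o],
    and critic tables q_u[s, o, a] and q_omega[s, o]."""

    def __init__(self, n_states, n_options, n_actions, lr_pi=0.1, lr_beta=0.1):
        self.pi_w = np.zeros((n_states, n_options, n_actions))
        self.beta_w = np.zeros((n_states, n_options))
        self.q_u = np.zeros((n_states, n_options, n_actions))   # Q_U(s, o, a)
        self.q_omega = np.zeros((n_states, n_options))          # Q_Omega(s, o)
        self.lr_pi, self.lr_beta = lr_pi, lr_beta

    def intra_option_policy_step(self, s, o, a):
        # Raise the log-probability of the taken action in proportion to
        # the critic's estimate of its value under the current option.
        probs = softmax(self.pi_w[s, o])
        grad_log = -probs
        grad_log[a] += 1.0                      # gradient of log softmax
        self.pi_w[s, o] += self.lr_pi * grad_log * self.q_u[s, o, a]

    def termination_step(self, s_next, o):
        # Lower the termination probability of the current option where its
        # advantage over the best available option is positive (greedy
        # baseline assumed here), and raise it where the advantage is negative.
        advantage = self.q_omega[s_next, o] - self.q_omega[s_next].max()
        beta = sigmoid(self.beta_w[s_next, o])
        d_beta = beta * (1.0 - beta)            # gradient of sigmoid
        self.beta_w[s_next, o] -= self.lr_beta * d_beta * advantage
```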
In spite of the success of the option-critic approach
in Atari games, many questions have yet to be
answered. For example, a common problem observed
in practice is that, as the system becomes more proficient, the average duration of its options also tends
to decrease. Given that the option-critic learns options to maximize the expected return,
this phenomenon is hardly surprising. From a pure
optimization perspective, options are indeed useless
for achieving optimal control: their optimal value
function cannot be greater than the optimal value
function of the MDP. As we also know, the optimal
value function in a discounted MDP is always attainable by a greedy policy using only primitive actions
(Puterman 1994). Hence, in a dynamic programming
setting, having long temporally extended actions
provides no benefit over primitive actions if optimal
control is the only goal.
To prevent options from collapsing to primitive
actions, we devised simple regularization strategies
that could be incorporated readily into the objective
without altering the learning architecture. For
instance, the approach used in the Atari games consisted of adding a scalar margin to the advantage
function used in the termination gradient. Intuitively, the effect of this margin term was to set a baseline
of advantageousness in favor of maintaining the
same option. We can also think of the margin as a
Figure 5. Interpretable and Specialized Options Found in the Game of Seaquest. The action trajectory over time shows the transition from option 1 (white), a downward shooting sequence, to option 2 (black), an upward shooting sequence.
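To make the effect of the margin term described above concrete, here is a minimal sketch that continues the hypothetical tabular parameterization given earlier. The variable names, the default margin value, and the greedy value baseline are assumptions of this illustration, not the article's notation; it shows only how a positive scalar added to the advantage would bias the termination gradient toward persistence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_step_with_margin(beta_w, q_omega, s_next, o,
                                 margin=0.01, lr_beta=0.1):
    """One termination-gradient step with a scalar margin added to the
    advantage, biasing the update toward keeping the current option active."""
    advantage = q_omega[s_next, o] - q_omega[s_next].max()   # A(s', o) <= 0
    beta = sigmoid(beta_w[s_next, o])
    d_beta = beta * (1.0 - beta)
    # The positive margin makes continuing with the current option look
    # slightly advantageous even when the raw advantage is zero, which
    # pushes termination probabilities down and lengthens options.
    beta_w[s_next, o] -= lr_beta * d_beta * (advantage + margin)
```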