In the gradient theorem for option policies, the interpretation of the original policy gradient theorem is maintained: if an action chosen within an option is useful, the gradient update will make it more likely to be picked again in the same state. This result also provides the global optimality property that we initially required. In fact, treating the choice of option as part of the state space results in a critic that provides estimates of the expected discounted return for the system as a whole, given that an action is taken in a certain state and under a certain option. Therefore, the gradient for option policies takes into account how a local change in the choice of actions would affect the performance of the entire system.
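As a sketch, in notation that is assumed here rather than taken from this article (intra-option policies \(\pi_{\omega,\theta}\) with parameters \(\theta\), a discounted weighting \(\mu_\Omega\) of state-option pairs, and a critic \(Q_U(s,\omega,a)\) estimating the expected discounted return of taking action \(a\) in state \(s\) under option \(\omega\)), the gradient theorem for option policies can be written as
\[
\frac{\partial Q_\Omega(s_0,\omega_0)}{\partial \theta}
= \sum_{s,\omega} \mu_\Omega(s,\omega \mid s_0,\omega_0)
  \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s,\omega,a).
\]
Each local change to an action probability is thus weighted by the critic's system-wide estimate of its consequences.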
The gradient theorem for termination functions
also has a clear interpretation, but involves a different form of critic feedback than the one used for option policies. The termination gradient makes an option more likely to
terminate if there is no longer an advantage in following it. Conversely, if committing to an option is
deemed advantageous by the critic, its probability of
terminating should be decreased, so as to lengthen
that option. The expression "advantageous," loosely used up to now, is defined precisely in terms of the advantage function (Baird 1993): the difference
between the value of a given option at a state and the
expected value over all options. Interestingly, the termination gradient theorem for options can be seen as
another instantiation of the interruption execution
model (Sutton, Precup, and Singh 1999), whereby
the policy over options commits to an option unless
a better one can be taken.
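In the same assumed notation, with termination functions \(\beta_{\omega,\vartheta}\) parameterized by \(\vartheta\), the option-value function \(Q_\Omega\), its expectation \(V_\Omega\) over options, and \(U(\omega,s')\) the value of arriving at \(s'\) with option \(\omega\) still active, the termination gradient theorem can be sketched as
\[
\frac{\partial U(\omega_0, s_1)}{\partial \vartheta}
= -\sum_{s',\omega} \mu_\Omega(s',\omega \mid s_1,\omega_0)\,
  \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\,
  A_\Omega(s',\omega),
\qquad
A_\Omega(s',\omega) = Q_\Omega(s',\omega) - V_\Omega(s').
\]
The negative sign and the advantage term capture the interpretation above: when the advantage of the current option is positive, the gradient pushes its termination probability down; when it is negative, the gradient pushes it up.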
Deep Options
In addition to options, a state representation can also be learned end to end using the option-critic architecture. With the Arcade Learning Environment (ALE) (Bellemare et al. 2013) in mind, we designed a parameterization based on the deep network architecture of the DQN algorithm (Mnih et al. 2015). Because in this environment the agent observes images, the first few layers of the network (figure 3) apply convolutions to a concatenation of the last four frames. In the penultimate layer, the extracted high-level visual features are combined into a shared representation across all option policies, termination functions, and value outputs.
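The following PyTorch sketch illustrates this kind of parameterization. The input shape (four stacked 84 x 84 frames, as in DQN), the layer sizes, and the number of options are illustrative assumptions, not the exact configuration used in the experiments.

```python
# A minimal sketch of a shared DQN-style trunk with option-critic heads.
import torch
import torch.nn as nn


class OptionCriticNet(nn.Module):
    def __init__(self, num_actions: int, num_options: int):
        super().__init__()
        self.num_actions = num_actions
        self.num_options = num_options
        # DQN-style convolutional layers over the last four frames.
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Shared representation used by all heads.
        self.shared = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        # One value output per option, used both for the epsilon-greedy
        # policy over options and for intra-option Q-learning.
        self.q_omega = nn.Linear(512, num_options)
        # One termination probability per option (sigmoid).
        self.termination = nn.Linear(512, num_options)
        # One softmax action distribution per option.
        self.option_policies = nn.Linear(512, num_options * num_actions)

    def forward(self, frames: torch.Tensor):
        x = self.conv(frames)
        x = self.shared(x.flatten(start_dim=1))
        q = self.q_omega(x)                         # (batch, options)
        beta = torch.sigmoid(self.termination(x))   # (batch, options)
        logits = self.option_policies(x).view(
            -1, self.num_options, self.num_actions)
        pi = torch.softmax(logits, dim=-1)          # (batch, options, actions)
        return q, beta, pi
```

All three heads share every parameter up to the penultimate layer, which is what makes the representation learning end to end.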
While we could have chosen to also parameterize the policy over options, we decided to use an epsilon-greedy (Sutton and Barto 1998) policy over options derived from the value outputs. Therefore, the stream of computation going from the input to the value outputs and the epsilon-greedy policy mirrors the design of DQN. However, the second path of computation, ending in the option policies and termination functions, must be stochastic, as required by the gradient theorems for options. Because the action space is discrete, we chose softmax distributions (Sutton et al. 1999) for the option policies and sigmoids for the termination functions.
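A hedged sketch of how these outputs could drive per-step behavior with the hypothetical OptionCriticNet above: epsilon-greedy over the value outputs for the policy over options, a softmax sample for the primitive action, and a Bernoulli draw from the sigmoid termination. Function and variable names are illustrative.

```python
# Per-step option and action selection (single-environment case).
import torch


def act(net, frames, current_option, epsilon=0.05):
    # frames: tensor of shape (1, 4, 84, 84).
    q, beta, pi = net(frames)
    q, beta, pi = q[0], beta[0], pi[0]
    # Terminate the active option with probability beta(s), or pick a
    # first option if none is active yet.
    if current_option is None or torch.bernoulli(beta[current_option]):
        # Epsilon-greedy policy over options derived from the values.
        if torch.rand(1).item() < epsilon:
            current_option = torch.randint(net.num_options, (1,)).item()
        else:
            current_option = q.argmax().item()
    # Sample a primitive action from the active option's softmax policy.
    action = torch.multinomial(pi[current_option], 1).item()
    return action, current_option
```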
Different kinds of parameter updates are also necessary in each of the two streams. For the value updates and control over options, we used the idea of a target network from DQN, but in combination with intra-option Q-learning (Sutton, Precup, and Singh 1999) instead of Q-learning (Watkins 1989). Because the target network is frozen for a fixed interval, the target for the value updates becomes more stationary and learning is more stable. We computed both kinds of updates at every step with samples coming from two different sources: from an experience replay buffer (Lin 1992) for learning values, and from fresh online samples for the option updates. The reason for not using replayed samples with option gradients (or policy gradients in general) was to ensure that our gradient estimates would truly come from the distribution induced by the policies currently being followed.
Figure 3. Network Architecture for Option-Critic in the ALE Environment. (The last four frames pass through convolutional layers into a shared representation, which feeds the internal policies, the termination functions, and the policy over options.)
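As a sketch of the value-learning stream, and under the same assumed notation, the one-step intra-option Q-learning target for a replayed transition \((s, \omega, a, r, s')\), with the option values \(Q_\Omega\) supplied by the frozen target network, could be
\[
g = r + \gamma \Big[(1-\beta_{\omega}(s'))\, Q_\Omega(s',\omega)
      + \beta_{\omega}(s') \max_{\omega'} Q_\Omega(s',\omega')\Big].
\]
If the option continues at \(s'\), the target bootstraps from that option's own value; if it terminates, the target bootstraps from the best available option, consistent with the epsilon-greedy control over options described above.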