decouple the problem of policy evaluation in the critic from the improvement of a policy in the actor. This separation of concerns between the specification of an objective and the means to achieve it is a powerful concept that eases our alignment goal: the ability to set an optimization target and to learn the right solution accordingly. Another benefit of the actor-critic approach is the flexibility in choosing the class of policies represented in the actor. Using a combination of randomized policies and approximate value functions, it is possible to seamlessly handle continuous spaces of actions and states.
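As a concrete illustration of such a parameterization (a minimal sketch of our own, not code from any of the systems discussed here), consider a randomized Gaussian policy over a continuous action, linear in a feature vector of the state; its log-likelihood gradient is available in closed form, which is all the actor needs.

import numpy as np

# A minimal sketch: a randomized Gaussian policy over a continuous action,
# linear in state features. All names are illustrative.
class GaussianPolicy:
    def __init__(self, n_features):
        self.w_mean = np.zeros(n_features)  # parameters of the action mean
        self.log_std = 0.0                  # log of the action spread

    def sample(self, phi):
        mean = phi @ self.w_mean
        return np.random.normal(mean, np.exp(self.log_std))

    def grad_log_prob(self, phi, action):
        # Gradient of log N(action; mean, std) with respect to w_mean.
        mean = phi @ self.w_mean
        return (action - mean) / np.exp(self.log_std) ** 2 * phi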
Given any differentiable parameterization of a randomized policy, the policy gradient theorem (Sutton
et al. 1999; Konda and Tsitsiklis 2000) provides an
expression for the gradient of either the expected discounted return or the average reward criterion with
respect to the parameters of the policy. The main
result can be stated rather simply: if an action is
good, the policy gradient will update parameters to
make that action more likely to be chosen again. Determining whether an action was good is where the critic intervenes, using its value estimates. In
an actor-critic architecture, the action values are
learned in the critic in parallel with the policy
updates. Figure 1 shows an actor-critic architecture
with policy gradient updates and temporal difference
learning in the critic.
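The following sketch, written under assumptions of our own (a tabular softmax actor, a one-step TD(0) critic, and a toy chain environment added only to make it self-contained), shows how the two updates of figure 1 interact; it illustrates the architecture rather than reproducing any particular implementation.

import numpy as np

n_states, n_actions = 5, 2
alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.99
theta = np.zeros((n_states, n_actions))   # actor: policy parameters
V = np.zeros(n_states)                    # critic: state-value estimates

def step(s, a):
    # Toy chain: action 1 moves right, action 0 moves left; reaching the
    # rightmost state yields reward 1 and ends the episode.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for _ in range(20000):
    p = policy(s)
    a = np.random.choice(n_actions, p=p)
    s_next, r, done = step(s, a)
    # Critic: temporal-difference error for the current value estimates.
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha_critic * td_error
    # Actor: if the TD error says the action was better than expected,
    # the gradient step makes that action more likely in this state.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    s = 0 if done else s_next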
The option-critic architecture of figure 2 is our
adaptation (Bacon, Harb, and Precup 2017) of the
actor-critic architecture for learning Markov options
end to end by stochastic gradient ascent. As in regular policy gradient methods, we require the option
policies to be represented by differentiable randomized policies. Similarly, the termination conditions
need to be randomized, and the chosen parameterization must be differentiable. We therefore refer to parameterized termination conditions simply as termination functions.
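For concreteness, one possible parameterization (ours, for illustration only) uses a softmax intra-option policy and a sigmoid termination function for each option, both linear in state features and therefore differentiable in their parameters.

import numpy as np

n_options, n_features, n_actions = 4, 8, 3
theta = np.zeros((n_options, n_actions, n_features))  # intra-option policies
vartheta = np.zeros((n_options, n_features))          # termination functions

def option_policy(omega, phi):
    # Softmax over primitive actions for option omega: pi_omega(a | s).
    prefs = theta[omega] @ phi
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def termination_prob(omega, phi):
    # Sigmoid termination function for option omega: beta_omega(s).
    return 1.0 / (1.0 + np.exp(-vartheta[omega] @ phi))

phi = np.random.randn(n_features)        # state features
print(option_policy(0, phi), termination_prob(0, phi))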
Theoretical Results
Because of these requirements, we cannot directly apply the policy gradient theorem to learn parameterized options: working with options brings us into the SMDP framework. Yet working only at the SMDP level prevents us from considering the structure within an option, that is, what happens while an option's policy is executing. We addressed this problem by using Markov options and by adopting the intra-option learning perspective (Sutton, Precup, and Singh 1999). As a consequence, the Markov property could be recovered both over actions and over state-option pairs.
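To illustrate the intra-option perspective (a simplified, tabular sketch of our own, assuming a greedy policy over options), the critic can evaluate state-option pairs by bootstrapping through what happens upon arrival in the next state: the current option either continues or terminates and hands control back to the policy over options.

import numpy as np

def u_on_arrival(Q_omega, beta, w, s_next):
    # Q_omega[s, w] estimates the value of running option w from state s;
    # beta[w, s] is the probability that option w terminates in s.
    # Upon arriving in s_next while executing w, the option either
    # continues (probability 1 - beta) or terminates, in which case the
    # best available option is chosen.
    cont = Q_omega[s_next, w]
    switch = Q_omega[s_next].max()
    return (1.0 - beta[w, s_next]) * cont + beta[w, s_next] * switch

Q_omega = np.zeros((6, 4))         # 6 states, 4 options
beta = np.full((4, 6), 0.5)        # uniform termination probabilities
print(u_on_arrival(Q_omega, beta, w=2, s_next=3))

# One-step target for the value of a state-option pair after an action
# that yielded reward r and next state s_next:
#   target = r + gamma * u_on_arrival(Q_omega, beta, w, s_next)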
The first step towards deriving gradient theorems
for options was to describe precisely (Bacon, Harb,
and Precup 2017) the probabilistic structure of the
Markov chain, which takes the memory (stack) into
account. We focused on a Markov chain over an augmented space, consisting of states, the option that is
executing (in other words, the content of the stack),
and the primitive action choice. It then sufficed to apply standard calculus tools to this chain in order to derive gradients for the policy of an option and its termination function.
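Up to notation and the details of discounting, the two gradients take roughly the following form (see Bacon, Harb, and Precup 2017 for the precise statements), where mu_Omega denotes a discounted weighting of augmented state-option pairs, Q_U the value of an action taken in a state-option pair, A_Omega the advantage of an option under the policy over options, and U(omega_0, s_1) the expected return upon arriving in s_1 while executing omega_0:

\[
\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta}
= \sum_{s,\omega} \mu_\Omega(s, \omega \mid s_0, \omega_0)
  \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a),
\]
\[
\frac{\partial U(\omega_0, s_1)}{\partial \vartheta}
= -\sum_{s',\omega} \mu_\Omega(s', \omega \mid s_1, \omega_0)\,
  \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega).
\]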
Figure 1. The Original Actor-Critic Architecture. [Diagram: the actor (policy) sends actions to the environment; the environment returns states and rewards; the critic (value function) computes a TD error and passes a gradient to the actor.]
Figure 2. Our Proposed Option-Critic Architecture. [Diagram: the environment exchanges states, actions, and rewards with the agent; a policy over options (μ_θ) selects among the options, each with an intra-option policy π_θ and termination function β_θ; the critic maintains estimates Q_θ and A_θ and feeds gradients and a TD error back to these components.]