The agent will typically seek a way of choosing
actions, conditioned on states, that is rewarding in
the long run. Such a stochastic decision procedure is
called a policy (denoted by π). Rather than simply
maximizing the total reward, which may not be
bounded in general, the agent usually attempts to
maximize discounted returns. The discount factor γ
can be conceptualized as an inflation rate that depreciates rewards at every time step.
In the policy evaluation problem, the goal is to
compute the expected discounted return for a given,
fixed policy over the distribution of possible trajectories. This information is summarized in a value
function:

\[
v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(S_t, A_t) \;\middle|\; S_0 = s \right]
\]
In the control problem, the goal is to find a policy
that maximizes the expected return.
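In the same notation, one standard way to state this objective (the symbols here are assumed for illustration rather than taken from the article) is to seek a policy whose value function is maximal in every state:

\[
\pi^{*} \;\in\; \arg\max_{\pi}\, v_\pi(s) \qquad \text{for every state } s .
\]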
A natural approach to these problems for agents
that are continually acting and learning is temporal
difference (TD) learning, introduced by Sutton (1984,
1988) in the context of policy evaluation, and later
adapted for control through the Q-learning (Watkins
1989) and Sarsa (Rummery and Niranjan 1994) algorithms. For the policy evaluation case, the core idea
is that, once learning is complete, the value of a
state should be equal, in expectation, to the reward
plus the discounted value of the next state. The temporal difference error quantifies how different the
estimated value of a state is at the current time step,
compared to one time step later (when a new sample
transition and reward have been observed). The algorithm uses this error to train an approximation of vπ.
In the case of control, this idea is supplemented with
a simple strategy for changing the policy π over time:
actions that lead to better-than-expected outcomes
should be taken more often (that is, reinforced).
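As a concrete illustration of these update rules (a minimal sketch, not taken from the article), the Python code below implements tabular TD(0) policy evaluation and tabular Q-learning. The environment interface (env.reset() returning a state index, env.step(a) returning (next_state, reward, done)), the step size alpha, and the exploration parameter epsilon are assumptions made for this example.

```python
import numpy as np

def td0_evaluation(env, policy, num_states, gamma=0.99, alpha=0.1, episodes=500):
    """Tabular TD(0): move v(s) toward the target r + gamma * v(s')."""
    v = np.zeros(num_states)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                          # action sampled from the fixed policy pi
            s_next, r, done = env.step(a)          # one observed transition
            target = r + gamma * v[s_next] * (not done)
            v[s] += alpha * (target - v[s])        # the TD error nudges the estimate
            s = s_next
    return v

def q_learning(env, num_states, num_actions, gamma=0.99, alpha=0.1,
               epsilon=0.1, episodes=500):
    """Tabular Q-learning: better-than-expected outcomes raise q(s, a)."""
    q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit current estimates, sometimes explore
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            target = r + gamma * np.max(q[s_next]) * (not done)
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q
```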
Actions with Variable Duration
With the modern foundations of learning through
reinforcement established by the end of the 1980s, a
number of proposals were made to extend the scope
of reinforcement learning methods from actions of
fixed duration to temporally extended actions
(Watkins 1989; Singh 1992; Dayan and Hinton 1992;
Kaelbling 1993; Thrun and Schwartz 1995; Sutton
1995), culminating at the end of the ’90s (Parr and
Russell 1998; Hauskrecht et al. 1998; Dietterich 1998;
Sutton, Precup, and Singh 1999) with several formulations based on semi-Markov decision processes
(SMDP) (Howard 1963).
The MDP model makes the assumption that the
environment transitions to a new state in a single
time step (or, equivalently, in a constant amount of
time), while in an SMDP, the transition duration is a
random variable. The SMDP framework is therefore a
natural fit for representing temporally abstract
actions, which can persist over time. More precisely,
in an SMDP, the choice of action at a given state
induces a joint probability distribution over both the
next state and the duration of the transition. Hence,
a trajectory in an SMDP includes, in addition to
states, actions, and rewards, the duration of each
transition. The name semi-Markov stems from the fact
that the process is only assumed to be Markovian
from decision point to decision point, which conveniently allows for existing dynamic programming
results to apply at the level of decisions (or action
choices). However, the evolution of the system
between two decisions may not even be Markovian,
and it is also allowed to unfold over continuous time.
In fact, when the transition duration is exponentially distributed, the resulting model is known as a continuous-time Markov decision process (CTMDP) (Puterman 1994).
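To make the role of the random duration concrete, the evaluation (Bellman) equation for a discounted SMDP can be sketched as follows; the notation here (τ for the random transition duration, R for the reward accumulated during the transition) is an assumption of this sketch rather than taken from the article:

\[
v(s) \;=\; \mathbb{E}\!\left[\, R \,+\, \gamma^{\tau}\, v(S') \;\middle|\; S = s \right],
\]

where the expectation is over the joint distribution of the next state \(S'\), the duration \(\tau\), and the accumulated reward \(R\) induced by the action chosen in \(s\).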
Seeing Through the
Black Box with Options
Despite adopting the same SMDP formalism, the
options framework (Sutton, Precup, and Singh 1999)
differs from its contemporaries (Parr and Russell
1998; Dietterich 1998) in its emphasis on exposing
and leveraging the structure both within and over
temporally extended actions. The evolution of the
process between two decision points is no longer a
black box, which means it can be both observed and
controlled. The ability to seamlessly learn and plan at
different levels of abstraction stems from the assump-
tion that there exists a base MDP that is overlaid with
temporally extended actions, known as options: the
combination is shown to induce an SMDP. With the
expression “between MDPs and semi-MDPs” in the
title of their paper, Sutton, Precup, and Singh (1999)
tried to convey the idea that options provide a lens
of variable resolution.
An option is a combination of three components:
an initiation set, a policy (sometimes called internal),
and a termination condition. The initiation set Io for
an option o is a subset of the states in which the given option could be chosen. In the most common
execution model, when an option is executed, its
internal policy πo acts until the probabilistic termination condition βo is met. More precisely, upon
entering the next state St+1, the system has to toss a
coin, which indicates termination with probability
βo(St+1) and continuation with 1 – βo(St+1). Once an
option has terminated, a policy over options μ
chooses a new option, and the process is repeated.
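The sketch below (again in Python, with the same assumed env.step interface as before, and with the Option container, run_option helper, and gamma parameter introduced purely for illustration) shows this call-and-return execution model: the internal policy acts until the termination condition fires, and the caller gets back the landing state, the discounted reward accumulated along the way, and the number of elapsed steps.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]              # I_o: states where the option may be chosen
    policy: Callable[[int], int]          # pi_o: internal policy over primitive actions
    termination: Callable[[int], float]   # beta_o(s): probability of stopping in state s

def run_option(env, option, s, gamma=0.99):
    """Call-and-return execution of one option starting from state s."""
    assert s in option.initiation_set
    total_reward, discount, steps, done = 0.0, 1.0, 0, False
    while not done:
        a = option.policy(s)                   # internal policy picks a primitive action
        s, r, done = env.step(a)
        total_reward += discount * r
        discount *= gamma
        steps += 1
        # upon entering the next state, terminate with probability beta_o(s)
        if random.random() < option.termination(s):
            break
    return s, total_reward, steps

# A policy over options mu would then pick the next option among those whose
# initiation set contains the state where the previous one terminated.
```

Viewed from the outside, each such call behaves like a single SMDP transition: a next state, an accumulated reward, and a random duration.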
Constructivist Influence
To get some perspective on what the options framework is and what it ought to be, it is useful to trace
its lineage back to the schema mechanism of Drescher
(1991). Inspired by Piaget’s constructivism (Piaget
1937), Drescher puts forward the idea that all knowl-