ventional DP algorithms. Sample models, on the other hand, produce state transitions and rewards that
are sampled from these distributions. Clearly, if one
has a distribution model, one can sample as needed
from the distributions. But as the elevator task made
clear, it is possible to generate samples by a simulation program that does not contain explicit representations of the underlying probability distributions. While it is theoretically possible to make these
distributions explicit, it is not necessary. For this reason, a sample model is often much easier to create
than the corresponding distribution model.
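The difference between the two kinds of model can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the two-state chain, its probabilities, and the small queue simulator are not taken from the elevator task. The first part stores explicit probabilities (a distribution model) and samples from them; the second produces samples directly from simulated random events, so the transition probabilities are never written down.

```python
import random

# Distribution model: explicit probabilities for every (next_state, reward)
# outcome. The two-state chain and its numbers are hypothetical.
DISTRIBUTION_MODEL = {
    ("low", "wait"): [(0.9, "low", 0.0), (0.1, "high", 1.0)],
    ("high", "wait"): [(0.5, "low", 0.0), (0.5, "high", 1.0)],
}


def sample_from_distribution(state, action, rng):
    """Use a distribution model as a sample model by drawing one outcome."""
    outcomes = DISTRIBUTION_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def simulate_queue_step(queue_length, serve, rng):
    """Sample model written directly as a simulation (hypothetical queue).

    The probabilities of each next queue length are implied by the random
    events below; they are never represented explicitly.
    """
    # Up to three independent arrivals, each with probability 0.3.
    arrivals = sum(1 for _ in range(3) if rng.random() < 0.3)
    # Serving removes at most two waiting customers.
    served = min(queue_length, 2) if serve else 0
    next_length = queue_length - served + arrivals
    reward = -float(next_length)  # cost grows with the number still waiting
    return next_length, reward


rng = random.Random(0)
print(sample_from_distribution("low", "wait", rng))
print(simulate_queue_step(5, True, rng))
```

Writing out the full distribution for the queue would require enumerating every combination of arrivals for every queue length, which is exactly the work a sample model lets one skip.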
I realized, then, that another advantage of RL over
optimization methods that depend on distribution
models, such as conventional DP, is that RL can
approximate optimal solutions through Monte Carlo optimization using only sample models. This
advantage would not have been news to those in other disciplines who already understood the advantages
of simulation-based optimization, but for me it was
an important realization.
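As a rough illustration of optimizing with nothing but a sample model, the following Python sketch estimates the value of a few candidate policies by averaging simulated returns and then picks the best one. It is a deliberately simple form of simulation-based optimization, not an RL algorithm from the article; the queue simulator, the three policies, and all numerical settings are assumptions made for the example.

```python
import random


def sample_step(queue_length, serve, rng):
    """Hypothetical sample model: returns one sampled (next_state, reward)."""
    arrivals = 1 if rng.random() < 0.4 else 0
    served = min(queue_length, 1) if serve else 0
    next_length = min(queue_length - served + arrivals, 10)
    cost = next_length + (0.5 if serve else 0.0)  # waiting cost plus service cost
    return next_length, -cost


def rollout_return(policy, start, horizon, gamma, rng):
    """Discounted return of one simulated trajectory under `policy`."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        state, reward = sample_step(state, policy(state), rng)
        total += discount * reward
        discount *= gamma
    return total


def monte_carlo_value(policy, start=3, episodes=2000, horizon=30,
                      gamma=0.95, seed=0):
    """Average the returns of many sampled trajectories (Monte Carlo estimate)."""
    rng = random.Random(seed)
    returns = [rollout_return(policy, start, horizon, gamma, rng)
               for _ in range(episodes)]
    return sum(returns) / len(returns)


# Candidate policies map a queue length to a serve / do-not-serve decision.
candidates = {
    "always_serve": lambda s: True,
    "never_serve": lambda s: False,
    "serve_if_waiting": lambda s: s > 0,
}
values = {name: monte_carlo_value(p) for name, p in candidates.items()}
best = max(values, key=values.get)
print(best, values)
```

No transition probabilities appear anywhere in the procedure: the only access to the environment is through calls to `sample_step`.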
One of the most exciting connections between RL
and another discipline is the result of what neuroscientists
are learning about the brain’s reward system.
There is mounting evidence from neuroscience that
the nervous systems of humans and many other animals
implement algorithms that correspond in striking
ways to RL algorithms. The most remarkable
point of contact involves dopamine, a chemical fundamentally
involved in reward processing in the
brains of mammals.
Experiments conducted in the late 1980s and the
1990s in the laboratory of neuroscientist Wolfram
Schultz (reviewed in Schultz ) showed that
neurons that produce dopamine as a neurotransmitter
respond to rewarding events with substantial
bursts of activity only if the animal does not expect
those events. This finding suggests that dopamine-producing
neurons are signaling reward prediction
errors instead of reward itself. Further, these experi-
SPRING 2019 11
Figure 2. Four Elevators in a Ten-Story Building.
From Sutton and Barto (1998).