ments showed that as an animal learns to predict a
rewarding event on the basis of preceding sensory
cues, the bursting activity of dopamine-producing
neurons shifts to earlier predictive cues while
decreasing to later predictive cues.
Researchers familiar with RL quickly recognized
that these results are strikingly similar to how the TD
error behaves as an RL agent learns to predict reward
(for example, Barto; Schultz, Dayan, and
Montague). It is not an exaggeration to say
that the results of the experiments of Schultz and colleagues, together with their correspondence to RL
algorithms, have revolutionized the neuroscience of
reward processing in the brain. It is now almost universally accepted that bursts of dopamine neuron
activity convey reward prediction errors to brain
structures where learning and decision-making take
place, and evidence supports the idea that the prediction errors might be TD errors.
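The behavior of the TD error described above can be reproduced in a few lines. The following sketch is an illustrative setup rather than a model of the experiments themselves (the trial length, learning rate, and reward schedule are all chosen here for demonstration): TD(0) is run on trials in which a cue at step 0 reliably predicts a reward at the final step, and the prediction error shifts from the time of reward to the time of the cue.

```python
import numpy as np

T = 5                        # time steps per trial
alpha = 0.3                  # learning rate
V = np.zeros(T)              # predicted future reward at each step

def run_trial(V):
    """Run one trial; return the TD error at cue onset and at each step
    (discount factor of 1, terminal value of 0)."""
    deltas = np.zeros(T + 1)
    deltas[0] = V[0]         # the cue itself is unpredicted, so the jump
                             # in predicted value is a prediction error
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0          # reward only at the end
        v_next = V[t + 1] if t + 1 < T else 0.0
        deltas[t + 1] = r + v_next - V[t]       # TD error
        V[t] += alpha * deltas[t + 1]
    return deltas

first = run_trial(V)
for _ in range(500):
    last = run_trial(V)
# On the first trial the error occurs at the reward (first[-1] == 1.0);
# after training it has shifted to the cue (last[0] near 1, last[-1] near 0).
```

As training proceeds, the error at the reward shrinks because the reward becomes fully predicted, while an error appears at the cue because the cue's onset is the earliest unpredicted event, paralleling the shift in dopamine neuron bursting.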
RL theory provides a model for understanding the
functional significance of reward prediction errors. In
addition to driving the learning of reward predictions, reward prediction errors are ideal signals for implementing trial-and-error learning. Actions followed by greater-than-expected reward (a positive reward prediction error) are selected for; actions followed by less-than-expected reward (a negative reward prediction error) are selected against. This
observation suggests that the brain might implement
something like an actor-critic algorithm in which TD
errors are both error signals to train the critic’s pre-
dictions and signals for encouraging or discouraging
the actor’s choice of actions.
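A minimal sketch of such an actor-critic scheme illustrates this dual role of the TD error. The setup here is a deliberately simplified assumption, not taken from the text: a single state, one-step episodes (so δ has no next-state term), a softmax actor over two actions, and made-up learning rates. The same δ trains the critic's prediction and adjusts the actor's action preferences.

```python
import numpy as np

rng = np.random.default_rng(0)

prefs = np.zeros(2)           # actor: preferences over two actions
V = 0.0                       # critic: predicted reward for the single state
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    pi = softmax(prefs)
    a = rng.choice(2, p=pi)              # sample an action from the policy
    r = 1.0 if a == 0 else 0.0           # action 0 is the rewarded one
    delta = r - V                        # TD error (one-step episode, so
                                         # no next-state value term)
    V += alpha_critic * delta            # critic: improve the prediction
    grad = -pi                           # actor: policy-gradient-style update,
    grad[a] += 1.0                       # scaled by the same TD error
    prefs += alpha_actor * delta * grad
```

A positive δ after an action raises that action's preference and a negative δ lowers it, so the actor converges on the rewarded action while the critic's prediction converges on the reward that action yields.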
Figure 3 illustrates a hypothesis about how the
brain might implement an actor-critic algorithm.
Panel (a) shows the actor-critic algorithm as an artificial neural network. The actor adjusts a policy based
on the TD error δ it receives from the critic; the critic adjusts reward predictions using the same δ. The
critic produces a TD error from the reward signal, r,
and its current reward predictions. Panel (b) shows a
hypothetical neural implementation of an actor-critic algorithm. The actor and the critic are respectively
placed in particular parts of the brain. The TD error
is transmitted by dopamine-producing neurons to
modulate changes in the synaptic weights of the inputs to these structures.

While these developments do not directly support
Klopf’s hypothesis that individual neurons implement a kind of law of effect, a recent study by Athalye et al. (2018), entitled “Evidence for a Neural Law
of Effect,” adds to its plausibility.
Figure 3: Actor-Critic as an Artificial Neural Network and a Hypothetical Neural Implementation.
Adapted from Takahashi, Schoenbaum, and Niv (2008).