A number of research projects have investigated the scenario in which this reward comes as feedback from a human user rather than a function predefined by an expert (Isbell et al. 2006, Thomaz and Breazeal 2008, Knox and Stone 2012). In evaluating the feasibility of nonexpert users teaching through reward signals, these researchers aimed both to leverage human knowledge to improve learning speed and to permit users to customize an agent's behavior to fit their own needs.
Thomaz and Breazeal (2008) observed that people have a strong tendency to give more positive rewards than negative rewards. Knox and Stone (2012) later confirmed this positive bias in their own experiments. They further demonstrated that such bias leads many agents to avoid the goal that users are teaching them to reach (for example, the water in figure 4). This undesirable consequence occurs with a common class of reinforcement learning algorithms: agents that value reward accrued over the long term and are being taught to complete so-called episodic tasks. This insight provided justification for the previously popular solution of making agents that hedonistically pursue only short-term human reward, and it led Knox and Stone (2013) to create an algorithm that successfully learns by valuing human reward that can be gained in the long term. Agents trained through their novel approach were more robust to environmental changes and behaved more appropriately in unfamiliar states than did more hedonistic (that is, myopic) variants. These agents and the algorithmic design guidelines Knox and Stone created were the result of multiple iterations of user studies.
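To make the failure mode concrete, consider a minimal sketch; the reward values, step counts, and discount factors below are assumptions for illustration, not figures from these studies. Steps the trainer approves of earn +1.0, every other step still earns a small +0.3 because trainers rarely punish, and reaching the goal terminates the episode along with all future reward.

# Minimal Python sketch of the positive-bias failure mode described above.
# All numbers are hypothetical: +1.0 for trainer-approved steps toward the goal,
# +0.3 for any other step (the positive bias), and the episode ends at the goal.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

STEPS_TO_GOAL = 2     # two approved steps, then the episode (and its reward) ends
LOOP_HORIZON = 500    # long finite stand-in for "wander forever"

go_to_goal = [1.0] * STEPS_TO_GOAL      # reach the goal quickly
loop_forever = [0.3] * LOOP_HORIZON     # avoid the goal, keep earning biased reward

for gamma in (0.95, 0.0):               # long-term versus myopic valuation
    v_goal = discounted_return(go_to_goal, gamma)
    v_loop = discounted_return(loop_forever, gamma)
    better = "reach the goal" if v_goal > v_loop else "avoid the goal and keep looping"
    print(f"gamma={gamma}: V(goal)={v_goal:.2f}, V(loop)={v_loop:.2f} -> {better}")

Under these assumed numbers, the long-term learner (gamma = 0.95) values looping at roughly three times the value of reaching the goal, which mirrors the goal-avoidance behavior described above, while the myopic learner (gamma = 0) prefers the goal.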
People Want to Demonstrate How Learners Should Behave
In an experiment by Thomaz and Breazeal (2008), users trained a simulated agent to bake a cake through a reinforcement learning framework. In their interface, users gave feedback to the learner by clicking and dragging a mouse — longer drags gave larger-magnitude reward values, and the drag direction determined the valence (+/–) of the reward value (figure 4). Further, users could click on specific objects to signal that the feedback was specific to that object, but they were told that they could not communicate which action the agent should take.
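A small sketch suggests how such a gesture might be reduced to a scalar reward; the saturation length, the up-is-positive convention, and the Feedback record are assumptions for illustration, not details of the study's interface.

# Hypothetical sketch of mapping a click-and-drag gesture to a reward signal:
# drag length sets the magnitude, vertical direction sets the sign, and the
# object under the initial click (if any) marks the feedback as object-specific.

from dataclasses import dataclass
from typing import Optional
import math

MAX_DRAG_PIXELS = 200.0   # assumed length at which the reward magnitude saturates

@dataclass
class Feedback:
    value: float                   # signed reward in [-1, 1]
    target_object: Optional[str]   # object the feedback refers to, if any

def drag_to_feedback(x0, y0, x1, y1, clicked_object=None):
    """Convert a mouse drag from (x0, y0) to (x1, y1) into a Feedback record."""
    length = math.hypot(x1 - x0, y1 - y0)
    magnitude = min(length / MAX_DRAG_PIXELS, 1.0)
    sign = 1.0 if y1 < y0 else -1.0    # screen y grows downward, so upward drags count as positive
    return Feedback(value=sign * magnitude, target_object=clicked_object)

print(drag_to_feedback(100, 300, 100, 150, clicked_object="empty bowl"))
# -> Feedback(value=0.75, target_object='empty bowl')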
Thomaz and Breazeal found evidence that people nonetheless gave positive feedback to objects that they wanted the agent to manipulate, such as an empty bowl that the agent was in position to pick up. These users violated the instructions by applying what could be considered an irrelevant degree of freedom — giving feedback to objects that had not been recently manipulated — to provide guidance to the agent about future actions, rather than actual feedback about previous actions. After Thomaz and Breazeal adapted the agent's interface and algorithm to incorporate such guidance, the agent's learning performance significantly improved.
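One plausible way to fold such guidance into a learner, sketched below with assumed weights rather than the authors' actual algorithm, is to bias exploration so that actions involving the highlighted object are tried more often.

# Hypothetical sketch of guidance-biased exploration: actions that involve the
# object the user pointed at are sampled with extra weight. The boost factor and
# action representation are assumptions, not Thomaz and Breazeal's exact method.

import random

GUIDANCE_BOOST = 5.0   # assumed: guided-object actions are five times more likely to be explored

def choose_exploratory_action(actions, guided_object=None):
    """Pick an exploratory (verb, object) action, biased toward the guided object."""
    weights = [
        GUIDANCE_BOOST if guided_object is not None and obj == guided_object else 1.0
        for _, obj in actions
    ]
    return random.choices(actions, weights=weights, k=1)[0]

candidates = [("pick-up", "empty bowl"), ("pick-up", "spoon"), ("use", "oven")]
print(choose_exploratory_action(candidates, guided_object="empty bowl"))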
Other researchers have reached similar conclusions. In a Wizard-of-Oz study (that is, the agent's outputs were secretly provided by a human) by Kaochar and colleagues (2011), users taught a simu-
Figure 4. Two Task Domains for Reinforcement Learning Agents Taught by Human Users.
Left: A cooking robot that must pick up and use the ingredients in an acceptable order (Thomaz and Breazeal 2006). The green vertical bar displays positive feedback given by a click-and-drag interface. Right: A simulated robot frog that users teach how to navigate to the water (Knox and Stone 2012).