Classic Approaches to Coordination Through Reward Shaping
There are three classic approaches to coordinating complex
multiagent systems: robot totalitarianism, robot
socialism, and robot capitalism. Each has specific
advantages and drawbacks.
Robot Totalitarianism (Centralized Control)
First, consider a centralized system in which one
agent is making all necessary decisions for the entire
system as a whole, and all other agents are merely following orders. The advantages here are that perfect
coordination is possible and the pieces of the system
as a whole will cooperate to increase system performance. This typically works well for small systems
consisting of just a few agents (Sutton and Barto
1998). However, such a centralized system can fall
prey to complexities such as communication restrictions, component failures (especially where a single point of failure can stop the entire system), and simply the difficulty of solving a problem for hundreds or thousands of agents simultaneously. In most realistic situations, this approach is simply not feasible.
Robot Socialism (Global or Team Reward)
Next, consider a system in which each agent is
allowed to act autonomously in the way that it sees
fit, and every agent is given the same global reward,
which represents the system performance as a whole.
They will single-mindedly pursue improvements to this reward, and because that reward is perfectly aligned with system performance, their efforts are directed toward improving the system as a whole. However,
because there may be hundreds or thousands of
agents acting simultaneously in the shared environment, it may not be clear what led to the reward. In
a completely linear system of n agents, each agent is
only responsible for 1/n of the reward that they all
receive, which can be entirely drowned out by the (n – 1)/n portion for which that agent is not responsible.
In a system with 100 agents, that means an agent
might only have dominion over 1 percent of the
reward it receives! This could lead to situations in
which an agent chooses to do nothing, but the system reward increases, because other agents found
good actions to take. This would encourage that
agent to continue doing nothing, even though this
hurts the system, due to a lack of sensitivity of the reward to that agent's own actions.
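To make this sensitivity problem concrete, the following sketch (in Python, with entirely hypothetical numbers and a simple additive system reward) shows an agent that idles while the global reward it observes still rises because its teammates improved.

```python
# A minimal sketch (hypothetical numbers, simple additive system reward) of the
# credit-assignment problem under a shared global reward: agent 0 does nothing,
# yet the reward it observes still rises because the other agents improved.

def global_reward(contributions):
    """System performance is the sum of every agent's contribution."""
    return sum(contributions)

n = 100

# Episode 1: every agent contributes 1 unit of useful work.
episode_1 = [1.0] * n
# Episode 2: agent 0 idles, but the other 99 agents each improve to 1.5 units.
episode_2 = [0.0] + [1.5] * (n - 1)

print(global_reward(episode_1))  # 100.0
print(global_reward(episode_2))  # 148.5 -- higher, even though agent 0 did nothing

# Agent 0 saw its reward go up after choosing to do nothing, so a learner
# trained on this signal is encouraged to keep doing nothing.
```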
Robot Capitalism (Local or Perfectly Learnable Reward)
Finally, consider a system in which each agent has a
local reward function related to how productive it is.
For example, a planetary rover could be evaluated on
how many photographs it captures of interesting
rocks. This means that its reward is dependent only
on itself, creating high sensitivity. However, the team
of rovers obtaining hundreds of photographs of the
same rock is not as interesting as obtaining hundreds
of photographs of different rocks, though these
would be evaluated the same with a local scheme.
This means that the local reward is not aligned with
the system-level reward.
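One way to see the misalignment is with a small sketch of the rover example; the rock names and the two reward definitions below are toy assumptions for illustration only.

```python
# A minimal sketch of the misalignment, using a toy rover domain (the rock names
# and reward definitions here are assumptions for illustration only).

def local_rewards(photos_by_rover):
    """Local scheme: each rover is paid for every photo it takes, regardless of subject."""
    return [len(photos) for photos in photos_by_rover]

def system_reward(photos_by_rover):
    """System-level reward: the number of distinct rocks photographed by the team."""
    return len({rock for photos in photos_by_rover for rock in photos})

# Scenario A: three rovers all photograph the same rock.
same_rock = [["rock_1"], ["rock_1"], ["rock_1"]]
# Scenario B: three rovers photograph three different rocks.
different_rocks = [["rock_1"], ["rock_2"], ["rock_3"]]

print(local_rewards(same_rock), system_reward(same_rock))              # [1, 1, 1] 1
print(local_rewards(different_rocks), system_reward(different_rocks))  # [1, 1, 1] 3

# Every rover sees exactly the same local reward in both scenarios, yet the
# system reward differs by a factor of three: local and system rewards disagree.
```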
Each of the reward functions has benefits and drawbacks that are closely mirrored in human systems.
However, we are not limited to just these reward
functions; as we mentioned before, an agent will single-mindedly seek to increase its reward, no matter
what it is, whether or not this is in the best interest
of the system at large. Is there, perhaps, a method that could be as aligned as the global reward and as sensitive as the local reward, while still avoiding the pitfalls of the centralized approach?
An ideal solution would be to create a reward that is
aligned with the system reward while removing the
noise associated with other agents acting in the system. This would lead agents toward doing everything they can to improve the system's performance.
Such a reward in a multirover system would reward
a rover for taking a good action that coordinates well
with rovers that are close to it, and would ignore the
effects of distant rovers that were irrelevant.
A way to represent this analytically is to take the
global reward G(z) of the world z, and subtract off
everything that doesn’t have to do with the agent
we’re evaluating, revealing how much of a difference
the agent made to the overall system. This takes the form

Di(z) = G(z) – G(z–i)                    (1)

where G(z–i) is the global reward of the world without the contributions of agent i, and Di(z) is the difference reward.
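Equation 1 is easy to sketch in the same toy rover setting (again, the domain and the choice of G are assumptions for illustration): the counterfactual term G(z–i) is obtained by removing agent i's photographs and re-evaluating the global reward.

```python
# A minimal sketch of equation 1 in the toy rover domain (the domain and the
# global reward G are assumptions for illustration only).

def G(photos_by_rover):
    """Global reward: the number of distinct rocks photographed by the team."""
    return len({rock for photos in photos_by_rover for rock in photos})

def difference_reward(photos_by_rover, i):
    """Di(z) = G(z) - G(z-i): re-evaluate the world with agent i's contribution removed."""
    z_minus_i = photos_by_rover[:i] + [[]] + photos_by_rover[i + 1:]
    return G(photos_by_rover) - G(z_minus_i)

photos = [["rock_1"], ["rock_1"], ["rock_2"]]

for i in range(len(photos)):
    print(i, difference_reward(photos, i))
# 0 0 -- rover 0's photo of rock_1 is redundant (rover 1 photographed it too)
# 1 0 -- likewise for rover 1
# 2 1 -- rover 2 photographed a rock nobody else covered
```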
Let us first consider the alignment of this reward.
G(z) is perfectly aligned with the system reward. G(z–i) may or may not be aligned, but in this case, it doesn't matter, because agent i (whom we are evaluating)
has no impact on G(z–i), by definition. This means
that Di(z) is perfectly aligned, because all parts that
agent i affects are aligned: agent i taking action to
improve Di(z) will simultaneously improve G(z).
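For readers who prefer symbols, the same argument can be restated in one line; purely for illustration, assume G is differentiable in agent i's action a_i (notation introduced here), and recall that G(z–i) is constructed to be independent of that action.

```latex
% A one-line restatement of the alignment argument, assuming (for illustration)
% that G is differentiable in agent i's action a_i; G(z_{-i}) is constructed so
% that it does not depend on a_i, so its derivative with respect to a_i vanishes.
\[
  \frac{\partial D_i(z)}{\partial a_i}
  = \frac{\partial G(z)}{\partial a_i} - \frac{\partial G(z_{-i})}{\partial a_i}
  = \frac{\partial G(z)}{\partial a_i}
\]
```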
Now, let us consider the sensitivity of this reward.
G(z) is as sensitive as the system reward, because it is
identical. However, we remove G(z–i) from the equation; that is, a large portion of the system, on whose performance agent i has no impact,
does not affect Di(z). This means that Di(z) is very
sensitive to the actions of agent i and includes little
noise from the actions of other agents.
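The reduced noise is easy to check in the toy rover sketch from before (the definitions are repeated so the snippet runs on its own): another rover changing its behavior moves G(z) but leaves rover 0's difference reward unchanged.

```python
# A quick numeric check of the reduced noise, reusing the toy rover domain
# (definitions repeated so this runs on its own; everything here is illustrative).

def G(photos_by_rover):
    """Global reward: the number of distinct rocks photographed by the team."""
    return len({rock for photos in photos_by_rover for rock in photos})

def D(photos_by_rover, i):
    """Difference reward for rover i."""
    z_minus_i = photos_by_rover[:i] + [[]] + photos_by_rover[i + 1:]
    return G(photos_by_rover) - G(z_minus_i)

before = [["rock_1"], ["rock_2"]]            # rover 1 photographs one rock
after  = [["rock_1"], ["rock_2", "rock_3"]]  # rover 1 finds an additional rock

print(G(before), D(before, 0))  # 2 1
print(G(after),  D(after, 0))   # 3 1 -- G(z) changed, but rover 0's D did not

# Rover 0 did nothing differently, and its difference reward is unchanged:
# the noise from the other rover's actions has been removed from the signal.
```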
Difference rewards are not a miracle cure. They do
require additional computation to determine which parts of the system an agent affects, that is, to evaluate the counterfactual term G(z–i).