portions of the system reward are caused by each
agent. However, it is important to note that it is not
necessary to analytically compute these contributions.
In many cases, a simple approximation that removes much of the noise introduced by the system-level reward yields significant performance gains over using the system reward alone.
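The difference reward underlying this discussion is commonly written D_i = G(z) − G(z−i): the system reward minus a counterfactual system reward with agent i's contribution removed. A minimal sketch (function names and the toy reward are illustrative, not from the article):

```python
def difference_reward(global_reward, joint_state, agent):
    """Sketch of D_i = G(z) - G(z - z_i): the system reward minus the
    system reward evaluated with agent i's contribution removed."""
    # Full system reward with all agents present
    g_full = global_reward(joint_state)
    # Counterfactual: the same joint state without agent i
    counterfactual = [s for j, s in enumerate(joint_state) if j != agent]
    return g_full - global_reward(counterfactual)

# Toy system reward: number of distinct positions covered by the team
def coverage(states):
    return len(set(states))

# Agents 0 and 1 both sit at 'a'; agent 2 alone covers 'b'
print(difference_reward(coverage, ['a', 'a', 'b'], 0))  # 0: redundant agent
print(difference_reward(coverage, ['a', 'a', 'b'], 2))  # 1: unique contribution
```

The redundant agent receives zero, so its learning signal is free of the noise created by its teammates' actions; this is the "cleaned" credit assignment the text refers to.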
Although in this article we focus on the continuous rover domain, both the difference reward and the
visualization approach have broad applicability. The
difference reward used in this article has been applied
to many domains, including data routing over a
telecommunication network (Tumer and Wolpert
2000), multiagent gridworld (Tumer, Agogino, and
Wolpert 2002), congestion games such as traffic toll
lanes (Tumer and Wolpert 2004a, 2004b; Wolpert
and Tumer 2001), and optimization problems such
as bin packing (Wolpert, Tumer, and Bandari 2004)
and faulty device selection (Tumer 2005).
Continuous Rover Domain
To examine the properties of the difference reward in
a more practical way, let us return to our example of
a team of rovers on a mission to explore an extraterrestrial body, like the moon or Mars (figure 3). We
allow each rover to take continuous actions to move
in the space, while receiving noisy sensor data at discrete time steps (Agogino and Tumer 2004).
Points of Interest
Certain points in the team’s area of operation have
been identified as points of interest (POIs), which we
represent as green dots. Figure 4 shows one of the POI layouts we studied, with a series of lower-valued POIs on the left half of the rectangular world and a single high-valued POI on the
right half. Because multiple simultaneous observations of the same POI are not valued higher than a
single observation in this domain, the best policy for
the team is to spread out: one agent will closely study
the large POI, while the remainder of the team will
cover the smaller POIs on the other side.
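This reward structure, in which redundant observations of a POI add nothing, can be sketched by crediting each POI according to its closest observer only. The function and weighting below are an illustrative assumption, not the article's exact formula:

```python
import math

def system_reward(poi_values, poi_positions, rover_positions, min_dist=1.0):
    """Illustrative team reward: each POI contributes its value scaled by
    the *closest* rover's proximity, so a second rover at the same POI
    adds nothing -- spreading out is the best team policy."""
    total = 0.0
    for value, (px, py) in zip(poi_values, poi_positions):
        closest = min(math.hypot(px - rx, py - ry)
                      for rx, ry in rover_positions)
        total += value / max(closest, min_dist)  # cap credit at close range
    return total

# Two equal POIs: a spread-out team outscores a clustered one
pois = [10.0, 10.0]
positions = [(0.0, 0.0), (5.0, 0.0)]
clustered = system_reward(pois, positions, [(0.0, 0.0), (0.0, 0.0)])
spread = system_reward(pois, positions, [(0.0, 0.0), (5.0, 0.0)])
print(clustered < spread)  # True
```

Because only the nearest observation counts, the second rover in the clustered team contributes nothing to the reward, which is exactly why the best team policy is to spread out as described above.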
We assume that the rovers have the ability to sense
the whole domain (except in the results we present
later marked with PO for partial observability), but
even so, using state variables to represent each of the
rovers and POIs individually results in an intractable
learning problem: there are simply too many parameters. This is also why a centralized controller does
not function well in this case. We reduce the state
space by providing eight inputs through the process
illustrated in figure 5. For each quadrant, which
rotates to remain aligned with the rover as it moves
through the space, the rover has a rover sensor and
a POI sensor. The rover sensor calculates the relative
density and proximity of rovers within that quadrant and condenses this to a single value. The POI
Figure 3. A Team of Rovers Exploring Various Points of Interest on the Martian Surface.
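The rover-sensor computation described above, condensing the density and proximity of rovers within one rover-aligned quadrant into a single value, might be sketched as follows. The inverse-distance weighting is an assumption (a common choice), not necessarily the article's exact formula:

```python
import math

def rover_sensor(rover_pos, rover_heading, others, quadrant):
    """Illustrative sketch of one of the four rover sensors: sum the
    inverse distances of the other rovers falling in the given quadrant.
    Quadrants (indexed 0..3) rotate with the rover's heading."""
    x, y = rover_pos
    reading = 0.0
    for ox, oy in others:
        # Angle to the other rover, relative to this rover's heading
        angle = (math.atan2(oy - y, ox - x) - rover_heading) % (2 * math.pi)
        if int(angle // (math.pi / 2)) == quadrant:
            dist = math.hypot(ox - x, oy - y)
            reading += 1.0 / max(dist, 1e-6)  # closer rovers weigh more
    return reading

# Four rover-sensor inputs (one per quadrant); four analogous POI sensors
# would complete the eight inputs described in the text.
others = [(1.0, 0.5), (-2.0, 0.0)]
inputs = [rover_sensor((0.0, 0.0), 0.0, others, q) for q in range(4)]
```

Collapsing each quadrant to one scalar keeps the input dimension fixed at eight regardless of how many rovers and POIs are in the world, which is what makes the learning problem tractable where the per-entity state representation was not.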