Figure 8. We use the sensor information from the
rover (left) to determine which spot in the state
space we will update (right). The alignment or
sensitivity calculation (Agogino and Tumer 2008) is
then represented by a symbol that takes the form of
a “+” or “–” sign; the brighter the shade of the
spot, the further the reward is from the average. In
figure 9, a bright “+” thus represents a very
aligned or very sensitive reward, and a bright “–”
an antialigned or insensitive reward, for a given
POI and rover density. In figure 10 we present these
calculations projected onto a specific case of the
actual space that the rovers move through. A more
general version of this technique projects onto the
principal components of the state space and is
explored more thoroughly in other work (Agogino and
Tumer 2008).
Sensitivity and Alignment Analysis
A reward with simultaneously high alignment and
sensitivity will be the easiest for agents to use to
establish high-performing policies. Figure 9 presents
the visualization for each of the reward structures.
Notice that the perfectly learnable reward Pi does
indeed have high sensitivity across the space, but has
low alignment with the global reward in most of the
center areas, which correspond to a moderate
concentration of rovers and POIs. This area near the
center of the visualization represents circumstances
that the rovers find themselves in most often (Agogino
and Tumer 2008).
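The marker calculation described in figure 8 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin count, the agreement measure (sign agreement between the agent's reward change and the global reward change), and every function and parameter name below are hypothetical.

```python
import numpy as np

def alignment_markers(rover_density, poi_density, local_delta, global_delta, bins=4):
    """Sketch of the figure 8 marker calculation (hypothetical names).

    Each sampled circumstance is binned by its sensor readings; within a
    bin, alignment is scored +1 when the agent's reward change agrees in
    sign with the global reward change and -1 when it opposes it.
    """
    r_idx = np.clip((np.asarray(rover_density) * bins).astype(int), 0, bins - 1)
    p_idx = np.clip((np.asarray(poi_density) * bins).astype(int), 0, bins - 1)
    agree = np.sign(np.asarray(local_delta) * np.asarray(global_delta))

    totals = np.zeros((bins, bins))
    counts = np.zeros((bins, bins))
    for r, p, a in zip(r_idx, p_idx, agree):
        totals[r, p] += a
        counts[r, p] += 1
    mean = totals / np.maximum(counts, 1)  # per-bin mean agreement

    # Deviation from the space-wide average: its sign chooses the "+"
    # or "-" symbol, its magnitude the brightness of the spot.
    avg = mean[counts > 0].mean() if counts.any() else 0.0
    return np.sign(mean - avg), np.abs(mean - avg)
```

A bin whose rewards agree with the global reward more often than average gets a bright “+”; one that agrees less often than average gets a bright “–”; a near-average bin stays dim.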
The team reward Ti, by contrast, is very aligned
throughout the search space, but is extremely lacking in sensitivity (denoted by the many “–” signs
throughout the space).
The difference reward Di is both highly aligned
and highly sensitive throughout the search space.
When we reduce the radius at which Di can sense
other rovers and POIs, the visualization from the
Di(PO) row indicates that the sensitivity remains
strong everywhere, but there is a slight drop in alignment throughout the space.
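For reference, the difference reward analyzed above is commonly defined as Di(z) = G(z) − G(z−i): the global reward minus what the global reward would have been without agent i's contribution. A minimal sketch, with a hypothetical global_reward function standing in for the rover domain's G:

```python
def difference_reward(global_reward, state, agent):
    # D_i(z) = G(z) - G(z_{-i}): evaluate the global reward with and
    # without agent i's contribution and take the difference.  Here
    # `state` is simply a list of per-agent contributions; in the rover
    # domain it would be the joint state of all rovers and POIs.
    without_i = [s for j, s in enumerate(state) if j != agent]
    return global_reward(state) - global_reward(without_i)
```

Because only agent i's term differs between the two evaluations, Di remains sensitive to that agent's own actions while staying aligned with G, which is consistent with the visualization in figure 9.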
So, it would appear that difference rewards (Di)
offer benefits over other rewards, even with partial
observability (Di(PO)), but what does this mean in a
more practical sense? To address this, we created figure 10, which projects the same type of alignment
into the actual plane in which the rovers are operating.
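The projection in figure 10 can be sketched as a lookup: each location in the physical plane is stamped with the marker of the state-space bin that its local sensor readings select. The sense function and all names here are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def project_markers(sign_grid, sense, width, height):
    # `sense(x, y)` is a hypothetical stand-in for the rovers' sensors:
    # it returns the (rover density, POI density) observed at a plane
    # location, each scaled to [0, 1].
    bins = sign_grid.shape[0]
    plane = np.zeros((height, width))
    for y in range(height):
        for x in range(width):
            r, p = sense(x, y)
            r_idx = min(int(r * bins), bins - 1)
            p_idx = min(int(p * bins), bins - 1)
            plane[y, x] = sign_grid[r_idx, p_idx]
    return plane
```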