sensor does the same for all POIs within the quadrant.
We model the continuous motion of the rovers at each finite time step as shown in figure 6. We maintain the current heading of each rover, and at each time step the rovers select values for dy and dx, where dy represents how far forward the rover will move and dx represents how much the rover will turn at that time step. The rover's heading for the next time step is the direction of the resultant vector (dx + dy), shown as the solid line in figure 6.
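This motion model can be sketched as follows, treating dy as the forward component along the current heading and dx as the lateral (turning) component perpendicular to it, with the new heading taken from the resultant displacement. The exact coordinate conventions and the zero-motion guard are illustrative assumptions, not from the original.

```python
import math

def step(x, y, heading, dx, dy):
    """One motion step: dy moves the rover forward along its current
    heading, dx displaces it laterally (the turn); the next heading is
    the direction of the resultant vector dx + dy."""
    if dx == 0 and dy == 0:
        return x, y, heading  # no motion: heading unchanged
    # forward component along the current heading
    fx, fy = dy * math.cos(heading), dy * math.sin(heading)
    # lateral component perpendicular to the heading
    lx = dx * math.cos(heading + math.pi / 2)
    ly = dx * math.sin(heading + math.pi / 2)
    new_x, new_y = x + fx + lx, y + fy + ly
    new_heading = math.atan2(new_y - y, new_x - x)
    return new_x, new_y, new_heading
```

With dx = 0 the rover drives straight ahead; with dy = 0 it displaces sideways and its heading rotates a quarter turn toward that direction.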
The rovers use multilayer perceptrons (MLPs) with sig-
moid activation functions to map the eight inputs
provided by the four POI sensors and four rover sen-
sors through 10 hidden units to two outputs, dx and
dy, which govern the motion of the rover. The weights
associated with the MLP are established through an
online simulated annealing algorithm that changes
the weights with preset probabilities (Kirkpatrick,
Gelatt, and Vecchi 1983). This is a form of direct pol-
icy search, where the MLPs are the policies.
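A minimal sketch of such a controller and one annealing update is given below. The 8-10-2 sigmoid architecture matches the description above; the weight-initialization range, per-weight change probability, perturbation scale, and Boltzmann acceptance rule are illustrative assumptions rather than the authors' exact settings.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class RoverMLP:
    """Eight sensor inputs -> 10 sigmoid hidden units -> two outputs (dx, dy)."""

    def __init__(self, n_in=8, n_hidden=10, n_out=2, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)]
                   for _ in range(n_out)]

    def forward(self, sensors):
        hidden = [sigmoid(sum(w * s for w, s in zip(row, sensors)))
                  for row in self.w1]
        dx, dy = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
                  for row in self.w2]
        return dx, dy

def anneal_step(net, reward_fn, temperature, p_change=0.1, rng=random):
    """One simulated-annealing update on the policy weights: perturb each
    weight with a preset probability, keep the change if reward improves,
    otherwise accept it with a Boltzmann probability (illustrative values)."""
    old_w1 = [row[:] for row in net.w1]
    old_w2 = [row[:] for row in net.w2]
    old_r = reward_fn(net)
    for layer in (net.w1, net.w2):
        for row in layer:
            for j in range(len(row)):
                if rng.random() < p_change:
                    row[j] += rng.gauss(0.0, 0.5)
    new_r = reward_fn(net)
    if new_r < old_r and rng.random() >= math.exp((new_r - old_r) / temperature):
        net.w1, net.w2 = old_w1, old_w2  # reject: restore the old policy
```

Because the search perturbs whole weight vectors and evaluates them only through the episode reward, the MLP weights themselves are the policy parameters being searched, as the text notes.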
In this work, we present visualizations of the alignment and sensitivity of four reward structures. The perfectly learnable local reward, Pi, is calculated by considering the value of the observations of all POIs made by agent i throughout the course of the simulation, ignoring the contributions of any other agents to the system.
The global team reward, Ti, is calculated by considering the best observation the team as a whole made
during the course of the simulation.
The difference reward, Di, is calculated similarly to the perfectly learnable reward Pi, with the exception that if a second agent j also observed a POI, agent i is rewarded only with the difference between the qualities of their observations. Thus, if two agents observe a POI equally well, the observation adds to neither agent's reward, because the team would have observed it equally well without either of them.
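The three rewards defined so far can be sketched as below. The data layout (a nested dict of best observation qualities per POI per agent), the choice to sum over POIs, and writing Di as the global reward minus the global reward with agent i removed are assumptions consistent with the descriptions above, not the paper's exact formulas.

```python
def team_reward(obs):
    """T: for each POI, count only the best observation anyone made,
    then aggregate over POIs. obs[poi][agent] = observation quality."""
    return sum(max(per_agent.values(), default=0.0)
               for per_agent in obs.values())

def local_reward(obs, i):
    """P_i: value of agent i's own observations, ignoring all other agents."""
    return sum(per_agent.get(i, 0.0) for per_agent in obs.values())

def difference_reward(obs, i):
    """D_i: the team reward minus what the team would have earned
    without agent i, so duplicated observations contribute nothing."""
    without_i = {poi: {a: q for a, q in per_agent.items() if a != i}
                 for poi, per_agent in obs.items()}
    return team_reward(obs) - team_reward(without_i)
```

For a POI observed equally well by two agents, `difference_reward` is zero for both, matching the text: removing either agent leaves the team's best observation unchanged.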
Figure 4. A Team of Rovers Observing a Set of Points of Interest.
Each POI has a value, represented by its size here. The team will ideally send one rover to observe the large POI on the right
closely, while the rest spread out in the left region to observe as many small POIs as possible.