ularization methods for neural networks, this optimization is sufficient to recover f* up to an additive
constant C (specifying what object height corresponds to 0). Qualitative results from our network
applied to fresh images after training are shown in
figure 4.
Evaluation
We manually label the height of our falling objects in
pixel space. Note that labeling the true height in
meters requires knowing the object’s distance from
the camera, so we instead evaluate by measuring the
correlation of predicted heights with ground truth
pixel measurements. All results are evaluated on test
images not seen during training. Note that a uniform
random output would have an expected correlation
of 12.1 percent. Our network results in a correlation
of 90.1 percent. For comparison, we also train a
supervised network on the labels to directly predict
the height of the object in pixels. This network
achieves a correlation of 94.5 percent, although this
task is somewhat easier as it does not require the network to compensate for the object’s distance from
the camera.
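The evaluation metric above can be sketched directly: because the network's output is only defined up to an additive constant, Pearson correlation, which is invariant to shift and scale, allows comparison against pixel-space labels. This is a minimal sketch with illustrative names, not the original implementation.

```python
# Sketch: Pearson correlation between predicted heights (arbitrary units,
# defined only up to an additive constant C) and hand-labeled pixel heights.
import numpy as np

def pearson_correlation(predicted, labeled_pixels):
    """Shift- and scale-invariant, so the unknown offset C does not matter."""
    p = np.asarray(predicted, dtype=float)
    q = np.asarray(labeled_pixels, dtype=float)
    p = p - p.mean()
    q = q - q.mean()
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# A shifted and rescaled copy of the labels correlates perfectly:
labels = np.array([10.0, 14.0, 21.0, 33.0, 40.0])
print(round(pearson_correlation(2.5 * labels + 7.0, labels), 6))  # 1.0
```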
This experiment demonstrates that one can teach
a neural network to extract object information from
real images by writing down only the equations of
physics that the object obeys.
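As a rough illustration of this idea (a sketch under assumed units and frame rate, not the paper's exact loss), supervision from free-fall physics can be written as the distance between the predicted height sequence and the closest trajectory with the known gravitational curvature:

```python
# Sketch of a physics-based loss: per-frame height predictions for an
# object in free fall should lie on some parabola with fixed curvature g/2.
# The curvature value, time step, and names are illustrative assumptions.
import numpy as np

G_HALF = 4.9  # assumed curvature term (g/2) in the units of the heights

def free_fall_loss(heights, dt=0.1):
    heights = np.asarray(heights, dtype=float)
    t = np.arange(len(heights)) * dt
    # Remove the known quadratic term, then fit the remaining linear part
    # (initial height and velocity) by least squares.
    linear_part = heights + G_HALF * t**2
    A = np.stack([np.ones_like(t), t], axis=1)
    coef, *_ = np.linalg.lstsq(A, linear_part, rcond=None)
    fit = A @ coef - G_HALF * t**2   # closest valid free-fall trajectory
    return float(np.mean((heights - fit) ** 2))

# Predictions that already follow a correct parabola incur (near) zero loss:
t = np.arange(10) * 0.1
print(free_fall_loss(100.0 + 3.0 * t - 4.9 * t**2))  # ≈ 0.0
```

Minimizing such a loss never requires height labels; the trajectory shape alone constrains the network.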
Detecting Objects with
Causal Relationships
In addition to physics, other sources of domain
knowledge can in principle be used to provide supervision in the learning process. For example, significant effort has been devoted over the past few decades to constructing large knowledge bases (Lenat 1995; Bollacker et al. 2008). This knowledge is typically encoded using logic- and constraint-based formalisms.
Thus, in our next experiment, we explore the possibilities of learning from logical constraints imposed
on single images. More specifically, we ask whether it
is possible to learn from causal phenomena.
We provide our model with images containing a stochastic collection of up to four characters: Peach,
Mario, Yoshi, and Bowser, with each character having
small appearance changes across frames due to rotation and reflection. Example images can be seen in
figure 5. While the existence of objects in each frame
is nondeterministic, the generating distribution
encodes the underlying phenomenon that Mario will
always appear whenever Peach appears. Our aim is to
create a pair of neural networks f = (f1, f2) for identifying Peach and Mario, respectively. The networks, f1
and f2, map the image to the discrete Boolean variables, y1 and y2. This problem is challenging because
the networks must simultaneously learn to recognize
the characters and select them according to logical
relationships.
Constraints
Rather than supervising with direct labels, we train
the networks by constraining their outputs to have
the logical relationship y1 ⇒ y2, which means if y1 is
true (1), y2 can never be false (0). However, merely
satisfying the constraint y1 ⇒ y2 is not sufficient to
certify learning. For example, the system might falsely report the constant output y1 ≡ 1, y2 ≡ 1 on every
image. Such a solution would satisfy the constraint,
but say nothing about the presence of characters in
the image.
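One common way to relax the logical constraint y1 ⇒ y2 into a differentiable training signal is to penalize the probability mass the networks assign to the single forbidden case (y1 = 1, y2 = 0). This particular relaxation is an illustrative choice, not necessarily the paper's exact formulation:

```python
# Differentiable relaxation of y1 => y2 on probabilities p1, p2 in [0, 1]:
# penalize the forbidden assignment (y1 = 1, y2 = 0), i.e. Peach without Mario.
def implication_loss(p1, p2):
    return p1 * (1.0 - p2)

print(implication_loss(1.0, 1.0))  # constraint satisfied: 0.0
print(implication_loss(1.0, 0.0))  # constraint violated:  1.0
```

As the text notes, this loss alone is trivially minimized by constant outputs such as p1 ≡ p2 ≡ 1.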
To avoid such trivial solutions, we add three loss
terms. The first loss forces rotational independence
of the output by applying a random horizontal and
vertical reflection, ρ, to images and requiring the network output to be the same. This encourages the network to focus on the existence of objects, rather than
location. The second and third losses help avoid trivial solutions by encouraging, respectively, high standard deviation and high entropy in the outputs across an input batch of 16 images.
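The three auxiliary terms can be sketched as follows for a batch of predicted probabilities; the exact functional forms and names here are assumptions for illustration:

```python
# Illustrative sketch of the three auxiliary losses for a batch of
# probabilities p (shape [16]) on images x and reflected copies rho(x).
import numpy as np

def auxiliary_losses(p, p_reflected, eps=1e-8):
    # 1) Invariance: the output should not change under the reflection rho,
    #    pushing the network toward existence rather than location.
    invariance = float(np.mean((p - p_reflected) ** 2))
    # 2) Reward high standard deviation across the batch
    #    (penalizes constant outputs such as p == 1 everywhere).
    std_term = float(-np.std(p))
    # 3) Reward high entropy of the mean batch prediction
    #    (this term is the negative binary entropy, to be minimized).
    q = float(np.clip(np.mean(p), eps, 1 - eps))
    entropy_term = q * np.log(q) + (1 - q) * np.log(1 - q)
    return invariance, std_term, entropy_term
```

A batch of varied predictions (e.g. half zeros, half ones) scores strictly better on the second and third terms than a constant batch, which is exactly what rules out the trivial solution.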
Even with these constraints, the loss remains invariant to logical permutations. For example, given a correct solution (y1*, y2*), the incorrect solution

y1′ = y1*,  y2′ = (y1* ∧ y2*) ∨ (¬y1* ∧ ¬y2*)   (4)

would still satisfy y1′ ⇒ y2′ and have the same entropy.
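A quick truth-table check confirms this kind of invariance: mapping a solution (y1, y2) to (y1, y1 XNOR y2) preserves the implication on every valid assignment, so the constraint alone cannot distinguish the permuted solution from the correct one.

```python
# Verify that the permutation (y1, y2) -> (y1, (y1 AND y2) OR (NOT y1 AND NOT y2))
# keeps y1 => y2 satisfied for every assignment that satisfied it originally.
def implies(a, b):
    return (not a) or b

for y1 in (False, True):
    for y2 in (False, True):
        if implies(y1, y2):                       # original solution is valid
            y1p = y1
            y2p = (y1 and y2) or (not y1 and not y2)
            assert implies(y1p, y2p)              # permuted solution is too
print("all valid assignments preserved")
```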
We address this issue by forcing each Boolean output to derive its value from a single region of the image (each character can be identified from a small region of the image). The Peach network, f1, runs a
series of convolution and pooling layers to reduce
the original input image to a 7 × 7 × 64 grid. We find
the 64-dimensional spatial vector with the greatest
mean and use the information contained in it to predict the first binary variable. Examples of channel
means for the Mario and Peach networks can be seen
in figure 5. The Mario network, f2, performs the same process, but if the Peach network claims to have found an object, f2 is prevented from picking any vector within two grid spaces of the location selected by f1.
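The spatial-selection step can be sketched as follows; grid shape, function names, and the exclusion rule's exact form are illustrative assumptions:

```python
# Sketch: pick the 7x7 grid cell whose 64-dim feature vector has the
# greatest mean; for the second network, mask out every cell within two
# grid spaces of the first network's chosen location.
import numpy as np

def select_cell(features, exclude=None, radius=2):
    """features: [7, 7, 64] grid; exclude: (row, col) to mask around."""
    means = features.mean(axis=-1)                        # [7, 7] cell means
    if exclude is not None:
        r, c = exclude
        rows, cols = np.ogrid[:means.shape[0], :means.shape[1]]
        near = (np.abs(rows - r) <= radius) & (np.abs(cols - c) <= radius)
        means = np.where(near, -np.inf, means)            # forbid nearby cells
    idx = np.unravel_index(np.argmax(means), means.shape)
    return idx, features[idx]                             # cell index + 64-dim vector

grid = np.random.default_rng(0).random((7, 7, 64))
peach_cell, peach_vec = select_cell(grid)                 # f1's choice
mario_cell, mario_vec = select_cell(grid, exclude=peach_cell)  # f2's choice
```

By construction, `mario_cell` differs from `peach_cell` by more than two grid spaces in at least one coordinate, so the two outputs cannot both latch onto the same character.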
Evaluation
On a test set of 128 images, the network learns to
map each image to a correct description of whether
the image contains Peach and Mario. This experiment demonstrates that networks can learn from
constraints that operate over discrete sets with potentially complex logical rules. Removing any of the additional constraints causes learning to fail. Thus, the
experiment also shows that sophisticated sufficiency
conditions can be key to success when learning from
constraints.
Pendulum Tracking
For this task, we downloaded a video of a pendulum
from YouTube,1 and we ask whether it is possible to
extract the angle of the pendulum over time. Given