Learning with Explicit Constraints
The goal of constraint-based learning is to train a model f that maps inputs to the outputs we care about, using only high-level rules rather than labeled examples.
We focus on structured prediction problems, in
which the output is a vector y with correlated components. For example, y may correspond to the trajectory of a falling object in a sequence of video
frames x. Clearly, the object's heights in successive frames are not independent: the sequence follows a well-defined structure governed by physics. We can use this physical law as a constraint and learn to detect the object without resorting to exhaustive labeling.
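As an illustration, one way to encode such a physical law as a constraint is sketched below in PyTorch; the names (free_fall_constraint, the frame spacing dt) are illustrative, and this is a minimal sketch rather than the exact formulation used in the experiments described here. For an object in free fall observed at evenly spaced frames, the second-order finite differences of its height should equal -g (dt)^2.

import torch

def free_fall_constraint(heights, dt=1.0 / 30.0, g=9.81):
    # heights: (batch, T) predicted heights (in metres) over T consecutive frames.
    # Under constant-acceleration free fall, h[t+1] - 2*h[t] + h[t-1] = -g * dt^2.
    accel = heights[:, 2:] - 2.0 * heights[:, 1:-1] + heights[:, :-2]
    return ((accel + g * dt ** 2) ** 2).mean()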
Replacing Supervised Losses with Constraints
To enforce our prior knowledge of the structure of y,
we specify a weighted constraint function g, which
penalizes output structures that are not consistent
with our understanding of the task. The key question we explore in this work is whether this weak form of supervision is sufficient to achieve high accuracy on a labeled test set.
While one clearly needs labels to evaluate the optimal function, labels may not be necessary to discover it. If prior knowledge tells us that the outputs of the optimal function have properties that uniquely distinguish it from other functions in the hypothesis class, we can use these properties as constraints and train the system without explicit labeled examples.
Specifically, we first consider an approach in which no labels are provided to us, and optimize for a necessary property of the output (the constraint) instead. That is, we search for the function that optimally satisfies the constraint requirements
\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} g(x_i, f(x_i)) + R(f) \qquad (2)
In our experiments, we find that commonly used hypothesis classes (for example, convolutional layers encoding translation invariance) and simple regularization terms may be sufficient to rule out functions that satisfy the constraint but not the original loss. In these settings, we can optimize the constraint in place of the loss function with stochastic gradient descent (SGD), freeing us from the need for labels.
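A minimal training loop implementing this idea is sketched below; model, frames_loader, constraint_fn, and reg_fn are illustrative placeholders rather than components of the actual system. The loop simply performs minibatch SGD on the objective in equation (2) and never touches a label.

import torch

def train_with_constraint(model, frames_loader, constraint_fn, reg_fn=None, epochs=10, lr=1e-4):
    # Minimizes the objective of equation (2) with minibatch SGD, using no labels.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames in frames_loader:                        # frames: (batch, T, C, H, W), unlabeled clips
            b, t = frames.shape[:2]
            preds = model(frames.flatten(0, 1)).view(b, t)  # per-frame predictions f(x)
            loss = constraint_fn(preds)                     # constraint loss g in place of a supervised loss
            if reg_fn is not None:
                loss = loss + reg_fn(preds)                 # optional regularizer R(f)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model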
Regularization for Constraint Learning
When optimizing for the constraint alone is not sufficient to find the desired solution, we may add additional regularization terms R(f) that guide the model toward it. For example, if the constraint is undesirably satisfied by a function that produces a constant output at every frame, we add a term that favors outputs with higher entropy, leading
to the correct function. The process of designing the precise constraint and the regularization term is itself a form of supervision, and can require a significant time investment. But unlike hand labeling, this effort does not grow with the size of the training data set, and the same constraint can often be applied to new data sets without modification.
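One simple regularizer of this kind is sketched below; the variance-based penalty and its parameters are illustrative stand-ins for an entropy-style term. The degenerate constant-output solution has zero spread across frames, so penalizing sequences whose predictions vary too little discourages it.

import torch

def variance_regularizer(preds, weight=0.1, margin=1.0):
    # preds: (batch, T) per-frame predictions.
    # A constant-output solution has zero standard deviation over time,
    # so we penalize sequences whose spread falls below a margin.
    per_seq_std = preds.std(dim=1)
    return weight * torch.relu(margin - per_seq_std).mean()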
Adversarial Constraint Learning
In the sciences, discovering constraints is often a data-driven process: the laws of physics, for example, are typically discovered by validating hypotheses against experimental results before they are summarized as formulas.
Motivated by this idea, we ask ourselves whether
we can learn constraints (such as physical laws) from
data, rather than requiring that they be specified by
humans. This approach enables us to apply constraint learning in settings in which the invariants
governing a system’s output are too complex to be
specified manually.
Learning Constraints from Data
Suppose that we are given a small number of outputs
(that is, labels that are not necessarily associated with
inputs) or a black-box mechanism/simulator for generating such outputs. We formulate the task of learning a constraint loss from these output samples using
the framework of generative adversarial learning
(Goodfellow et al. 2014).
Our ultimate goal is to learn a function f(x) that
produces samples that lie close to the manifold of
true output samples in Y. To achieve this, we follow the approach of Goodfellow et al. (2014) and define an auxiliary classifier D, called a discriminator, which tries to assign higher scores to the set of real labels (since they satisfy the constraints by assumption) and lower scores to outputs from f(x). At the
same time, we train f(x) to produce outputs that score
higher under the discriminator. Thus, the discriminator effectively learns to extract the constraints implicit in the samples and impose them on f(x); and because f(x) is trained to produce outputs that score highly under the discriminator, it learns to satisfy the desired constraints.
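A minimal sketch of one round of this adversarial training is shown below; the regressor f, the discriminator D (assumed to return one logit per trajectory), and the optimizers are hypothetical stand-ins, not the architectures actually used. The discriminator is pushed to score simulator trajectories above generated ones, while the regressor is pushed to produce trajectories the discriminator scores as real.

import torch
import torch.nn.functional as F

def adversarial_constraint_step(f, D, frames, real_trajs, opt_f, opt_D):
    # frames: (batch, T, C, H, W) unlabeled clips; real_trajs: (batch, T) simulator trajectories.
    b, t = frames.shape[:2]
    fake_trajs = f(frames.flatten(0, 1)).view(b, t)          # trajectories predicted by the regressor

    # Discriminator update: real trajectories -> label 1, generated trajectories -> label 0.
    d_loss = F.binary_cross_entropy_with_logits(D(real_trajs), torch.ones(b, 1)) \
           + F.binary_cross_entropy_with_logits(D(fake_trajs.detach()), torch.zeros(b, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Regressor update: produce trajectories the discriminator scores as real,
    # that is, trajectories satisfying the implicitly learned constraint.
    f_loss = F.binary_cross_entropy_with_logits(D(fake_trajs), torch.ones(b, 1))
    opt_f.zero_grad()
    f_loss.backward()
    opt_f.step()
    return d_loss.item(), f_loss.item()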
Figure 3 shows an overview of the adversarial constraint learning framework when the outputs form a
trajectory from an object tracking system. The discriminator tries to distinguish generated trajectories
(outputs from the function f(x)) from real sample trajectories, while the regressor tries to output trajectories that match the distribution provided by a black-box simulator. When trained to optimality (and
assuming both models have enough capacity), the
discriminator represents the implicit constraint while
the regressor learns to perform structured prediction
that satisfies this constraint.