When the hypotheses are conditional probability distributions h(y|x) = Pr(y|x) and the loss function is −log h(yi|xi), then the empirical risk minimization problem

find h ∈ H to minimize
$$\sum_{i=1}^{N} L(h(x_i), y_i)$$

becomes maximum likelihood estimation.
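To make this correspondence concrete, here is a minimal sketch (using a hypothetical logistic model on synthetic data, neither of which comes from the text) in which minimizing the empirical risk under the −log h(yi|xi) loss is exactly maximizing the log-likelihood by gradient descent:

```python
import numpy as np

# Synthetic data for illustration (not from the article).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(100) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def empirical_risk(w):
    # Average loss with L(h(x), y) = -log h(y|x) for a logistic model:
    # this is exactly the negative mean log-likelihood.
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient descent on the empirical risk = maximum likelihood estimation.
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / len(y)  # gradient of the negative log-likelihood
```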
The main weakness of empirical risk minimization
is that if the hypothesis space H is highly expressive,
then the function h that works best on the training
data will often work poorly on new data points. The
problem is that in an expressive hypothesis space
there are usually functions that can essentially memorize the training data without generalizing well.
A widely adopted solution to this problem is to
define a measure ‖h‖ of the “complexity” of h. For example, if h is defined by a set of parameters (for example, coefficients of a linear function, weights of a neural network), then ‖h‖ might be defined as the sum of the squares of these coefficients. We then define the following complexity-regularized optimization problem:

find h ∈ H to minimize
$$\sum_{i=1}^{N} L(h(x_i), y_i) + \lambda \lVert h \rVert,$$
where λ is the regularization parameter. If we set λ =
0, then we recover the maximum likelihood solution.
As λ increases, the optimal h will be forced to become
progressively less complex. In the limit λ → ∞, all of
the coefficients are forced to be zero. The value of λ is
usually determined using a separate set of validation
data (or through the technique of cross-validation).
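As a sketch of how λ is chosen in practice (assuming a squared-weight penalty for ‖h‖ and the logistic loss above; the splits and candidate values are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)
X_tr, y_tr = X[:150], y[:150]      # training split
X_val, y_val = X[150:], y[150:]    # held-out validation split

def fit(lam, steps=3000, lr=0.3):
    # Minimize  sum_i L(h(x_i), y_i) + lam * ||w||^2  by gradient descent.
    w = np.zeros(X_tr.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X_tr @ w))
        w -= lr * (X_tr.T @ (p - y_tr) / len(y_tr) + 2 * lam * w)
    return w

def val_loss(w):
    p = np.clip(1 / (1 + np.exp(-X_val @ w)), 1e-9, 1 - 1e-9)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

# lam = 0 recovers maximum likelihood; larger lam shrinks the weights.
# Pick the value that performs best on the validation data.
best_lam = min([0.0, 1e-3, 1e-2, 1e-1, 1.0], key=lambda lam: val_loss(fit(lam)))
```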
Regularization is the key to preventing overfitting,
and virtually all applications of machine learning
employ some form of regularization. We can view the
regularization penalty as a force that “pulls the solu-
tion back” from the unpenalized optimum.
Interestingly, one intuitive definition of overfitting is that a hypothesis h has overfit the training data if the loss increases sharply when we perturb the training examples: L(h(xi + δi), yi) ≫ L(h(xi), yi). Equivalently, we can measure the ability of a hypothesis h to generalize to new points in terms of how stable its predictions are in the presence of perturbations.
Recent research by Shie Mannor and his collaborators
(Xu, Caramanis, and Mannor 2009) has shown that
this intuition can be formalized. Specifically, for the
case of the linear support vector machine (where L is
the so-called hinge loss), the regularized optimization
problem is equivalent to the following robust optimization problem:

find h ∈ H to minimize
$$\max_{\delta_1, \ldots, \delta_N} \sum_{i=1}^{N} L(h(x_i + \delta_i), y_i) \quad \text{subject to} \quad \sum_{i=1}^{N} \lVert \delta_i \rVert \le \lambda,$$

in which the parameter λ is equal to a perturbation budget given to the adversary and ‖δi‖ is the Euclidean distance of the perturbation δi.
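The following sketch illustrates the mechanics behind this equivalence for a linear scorer with the hinge loss. It uses a per-example perturbation budget ε (a simplification of the total-budget formulation above; the data and ε are hypothetical): against a linear model, the worst-case perturbation of xi shrinks the margin by exactly ε‖w‖, which is why the adversary's budget reappears as a penalty on ‖w‖.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = np.sign(rng.normal(size=50))   # labels in {-1, +1}
w = rng.normal(size=4)             # some fixed linear hypothesis
eps = 0.1                          # per-example Euclidean budget (illustrative)

# Closed form: the adversary moves each x_i by eps along -y_i * w / ||w||,
# reducing the margin y_i * (w @ x_i) by exactly eps * ||w||.
margins = y * (X @ w)
worst_case = hinge(margins - eps * np.linalg.norm(w)).sum()

# Numerical check with the explicit worst-case perturbations:
delta = -eps * np.outer(y, w / np.linalg.norm(w))
explicit = hinge(y * ((X + delta) @ w)).sum()
assert np.isclose(worst_case, explicit)
```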
In summary, regularization is an important technique for helping machine-learning algorithms to
generalize well. For the case of the linear support vector machine, regularization is equivalent to robust
optimization, and the parameter λ, instead of being
an arbitrary parameter, turns out to be the perturbation budget given to an adversary.
Idea 3: Risk-Sensitive Objectives
Let us now turn to the problem of planning in
Markov decision processes (MDPs). In an MDP, an
agent is interacting with a fully observable world. At
time t, the world is in state st. The agent observes that
state and then chooses an action at to perform
according to a policy function π: at = π(st). When the
agent performs action at, the world makes a stochastic state transition to a new state st+1 according to the probability distribution P(st+1 | st, at). The agent
receives a reward R(st, at). This reward is a random
variable, because it depends on all of the stochastic
transitions up to time t. Let us consider the problem
of finding a policy that optimizes the expected T-step
reward starting from an initial world state s0:

find π to maximize
$$J(\pi) = E\left[\sum_{t=0}^{T-1} R(s_t, \pi(s_t)) \;\middle|\; s_0\right].$$
As I have indicated here, the standard objective is to optimize the expected total reward, and the vast majority of work in MDP planning and reinforcement learning optimizes this (or a closely related) objective.
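A minimal Monte Carlo sketch of this objective (the toy MDP, its transition table, rewards, and fixed policy are all invented for illustration): J(π) is estimated by averaging the total reward over many simulated T-step trials.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, T = 5, 2, 20
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> dist over s'
R = rng.normal(size=(n_states, n_actions))                        # reward R(s, a)
policy = rng.integers(n_actions, size=n_states)                   # fixed policy: a = pi(s)

def rollout(s0):
    # One T-step trial: follow pi and sum the rewards along the trajectory.
    s, total = s0, 0.0
    for _ in range(T):
        a = policy[s]
        total += R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return total

# Monte Carlo estimate of J(pi) = E[ sum_t R(s_t, pi(s_t)) | s_0 ].
returns = np.array([rollout(0) for _ in range(5000)])
print("estimated J(pi):", returns.mean())
```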
However, in some situations, one might be concerned about the down-side risk — that is, the risk
that the actual reward received in a particular T-step
trial will be very small. This is natural, for example,
when investing for retirement. It is also a concern in
problems involving endangered species where the
expected total reward might be good even though the
species goes extinct (a down-side risk) 25 percent of
the time. In such cases, we seek a more conservative
policy that may forgo large up-side outcomes in order
to avoid down-side risk.
Many modifications of J(π) have been studied that
include some notion of risk or down-side risk. One of
the best of these is the conditional value at risk
(CVaR). To understand CVaR, imagine we have adopted a fixed policy π and that we can perform repeated
trials in which we start in state s0 and follow the
actions of π for T steps. The T-step return VT that we
receive is a random variable:

$$V_T = \sum_{t=0}^{T-1} R(s_t, \pi(s_t)).$$
This random variable will exhibit some probability
distribution resulting from the interaction of the policy π and the transition probabilities. Suppose this
distribution is the black curve shown in figure 9a.
Note that while the great majority of outcomes have
large values, the distribution has a long tail of low
returns. The conditional value at risk is controlled by a quantile level α: CVaRα is the expected return conditioned on the return falling in the worst α fraction of outcomes (the lower tail of the distribution).
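A sketch of the empirical CVaR computation (the mixture distribution below is invented to mimic the shape described, mostly good outcomes with a long lower tail; α = 0.05 is an assumed level):

```python
import numpy as np

def cvar(returns, alpha=0.05):
    # Empirical CVaR: the average of the worst alpha-fraction of outcomes.
    var = np.quantile(returns, alpha)   # the alpha-quantile (value at risk)
    return returns[returns <= var].mean()

rng = np.random.default_rng(4)
# Hypothetical T-step returns: 95% good outcomes, 5% in a long lower tail.
good = rng.normal(100.0, 5.0, size=10000)
bad = rng.normal(20.0, 10.0, size=10000)
returns = np.where(rng.random(10000) < 0.95, good, bad)

print("mean return:", returns.mean())             # the standard objective J
print("CVaR(0.05):", cvar(returns, alpha=0.05))   # focuses on the lower tail
```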