a parameter α that specifies a quantile in the distribution of returns. For α = 0.1, the red vertical line indicates this quantile. The CVaR is the expected value of
all of the outcomes to the left of the red line — in this
example, the 10 percent worst outcomes. The expected value of those outcomes for this distribution is
3.06. The CVaR objective seeks to maximize the
expected value of these 10 percent worst outcomes.
We search the space of policies to find the policy that
maximizes this expectation.
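To make the definition concrete, the following is a minimal sketch of estimating the lower-tail CVaR from sampled returns. The function name `cvar_lower_tail` and the normal return distribution are illustrative assumptions, not part of the analysis above.

```python
import numpy as np

def cvar_lower_tail(returns, alpha):
    """Estimate CVaR as the mean of the worst alpha-fraction of sampled returns."""
    ordered = np.sort(np.asarray(returns, dtype=float))
    # Number of samples in the lower tail; always keep at least one.
    k = max(1, int(round(alpha * ordered.size)))
    return float(ordered[:k].mean())

# Hypothetical return distribution, used only to exercise the function.
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=1.0, size=100_000)

quantile_10 = float(np.quantile(samples, 0.1))  # position of the "red line"
cvar_10 = cvar_lower_tail(samples, 0.1)         # mean of outcomes left of it
print(f"10% quantile = {quantile_10:.2f}, CVaR = {cvar_10:.2f}")
```

Because the CVaR averages only the outcomes at or below the quantile, it can never exceed the quantile itself.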
A typical result is shown by the red curve in figure
9b. This is the distribution of $V_T$ under the CVaR-optimal
policy. Again, the red line marks the 10 percent
quantile. The CVaR has improved to 3.94. Note that
to achieve this we have sacrificed a significant
amount of upside reward.
It is interesting to ask the following question: Does
acting conservatively (in the sense of CVaR) improve
robustness to model error? Recent work, also by Shie
Mannor and his colleagues, shows that the answer is
yes. Optimizing CVaR is equivalent to solving a
robust optimization problem in which an adversary is
allowed to modify the transition probabilities.
Consider an adversary who at each time step $t$ can
choose a multiplicative perturbation $\delta_t$ and modify
the MDP transition probabilities so that instead of
making a transition from $s_t$ to $s_{t+1}$ with probability
$P(s_{t+1} \mid s_t, a_t)$, the probability is changed to be
$P(s_{t+1} \mid s_t, a_t) \cdot \delta_t$. To be more precise, let $\delta_t$ be a vector that
specifies a multiplier, $\delta_t(s)$, for each possible state $s$.
Then $P(s_{t+1} \mid s_t, a_t)$ is multiplied by $\delta_t(s_{t+1})$.
We will place two constraints on the possible values
of $\delta$. First, the perturbed values of $P$ must still be valid
probability distributions. Second, the product of the
perturbations $\delta_t(s_t)$ along any possible trajectory
$\langle s_1, \ldots, s_t, \ldots, s_T \rangle$ must be at most $\eta$.
This is the “perturbation budget” given to the adversary.
These two constraints interact to limit the extent
to which $\delta$ values can become small (or even zero).
This is because if $\delta_t(s) = 0$ for several states $s$, then $\delta_t(s')$
will be forced to become large for some other states $s'$,
which will violate the $\eta$ budget.
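The interaction between the two constraints can be checked numerically. The sketch below is illustrative only: the helper names `perturb_row` and `within_budget`, the three-state transition row, and the budget value are assumptions chosen for the example.

```python
import numpy as np

def perturb_row(p_next, delta):
    """Multiply one transition row P(. | s, a) by per-state multipliers delta(s')."""
    p_next = np.asarray(p_next, dtype=float)
    delta = np.asarray(delta, dtype=float)
    perturbed = p_next * delta
    # Constraint 1: the perturbed row must still be a probability distribution.
    if np.any(delta < 0) or not np.isclose(perturbed.sum(), 1.0):
        raise ValueError("perturbed row is not a valid probability distribution")
    return perturbed

def within_budget(multipliers_on_trajectory, eta):
    """Constraint 2: the product of multipliers along one trajectory is at most eta."""
    return float(np.prod(multipliers_on_trajectory)) <= eta

# Hypothetical row over three successor states.
p = [0.5, 0.3, 0.2]
# Zeroing the first state forces a large multiplier (3.5) on the third.
delta = [0.0, 1.0, 3.5]
print(perturb_row(p, delta))

eta = 10.0
print(within_budget([3.5], eta))       # one visit to the third state fits
print(within_budget([3.5, 3.5], eta))  # two visits give 12.25 and exceed eta
```

In this toy case, removing all probability from one state is affordable for a single step, but repeating the same perturbation along a trajectory exhausts the budget.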
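Let $\Delta$ be the space of all perturbations $\langle \delta_1, \ldots, \delta_T \rangle$
that satisfy these constraints. Then the robust optimization
problem is to find the policy $\pi$ that maximizes the expected
total reward under the worst-case perturbation in $\Delta$.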
Chow et al. prove that this π is exactly the policy
that maximizes the CVaR with quantile α = 1/η.
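The relationship α = 1/η can be illustrated in a much simpler setting than the sequential one treated by Chow et al.: a single step with equally likely sampled returns. There, an adversary who may scale each outcome's probability by a multiplier between 0 and η, while keeping the probabilities summing to 1, does the most damage by loading the worst outcomes, and the resulting expected value equals the CVaR at α = 1/η. This is the standard dual representation of CVaR, not a reproduction of the proof for MDPs. The function names and the sample data below are assumptions made for the sketch.

```python
import numpy as np

def adversarial_value(returns, eta):
    """Worst-case mean when equally likely outcomes are reweighted by
    multipliers in [0, eta] whose average must remain 1."""
    ordered = np.sort(np.asarray(returns, dtype=float))
    n = ordered.size
    weights = np.zeros(n)
    remaining = float(n)  # total multiplier mass (average of 1 over n outcomes)
    for i in range(n):    # greedily give the worst outcomes the largest weight
        weights[i] = min(eta, remaining)
        remaining -= weights[i]
    return float(np.mean(weights * ordered))

def cvar_lower_tail(returns, alpha):
    """Mean of the worst alpha-fraction of equally likely outcomes."""
    ordered = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(round(alpha * ordered.size)))
    return float(ordered[:k].mean())

# Hypothetical sampled returns.
rng = np.random.default_rng(1)
returns = rng.normal(loc=5.0, scale=1.0, size=1000)

eta = 10.0
print(adversarial_value(returns, eta))
print(cvar_lower_tail(returns, 1.0 / eta))
```

The two printed values agree up to floating-point error, because the adversary's best strategy places weight η on exactly the worst 1/η fraction of outcomes.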
In summary, the optimal risk-averse CVaR policy is
also a policy that is robust to errors in the transition
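Perturbed transition probabilities:

$$P(s_{t+1} \mid s_t, a_t) := P(s_{t+1} \mid s_t, a_t) \cdot \delta_t(s_{t+1}).$$

Perturbation budget:

$$\prod_t \delta_t(s_t) \le \eta \quad \forall \, \langle s_1, \ldots, s_t, \ldots, s_T \rangle.$$

Robust optimization problem:

$$\text{find } \pi \text{ to maximize } \min_{\langle \delta_1, \ldots, \delta_T \rangle \in \Delta} \mathbb{E}\left[ \sum_t R(s_t, \pi(s_t)) \,\middle|\, s_0 \right].$$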
Figure 9. Conditional Value at Risk. Plot annotations: α = 0.1; CVaR = 3.06 (before CVaR optimization); CVaR = 3.94 (CVaR-optimal policy, figure 9b).