questions to better learn how to map features to edge strengths more effectively. This section describes
our attempts to do so.
There are two main ways that we use the scenario-based training set. First, we create a set of closed-form
inference methods, most of which are approximations of methods described in the previous section,
but are more straightforward to optimize. We use
these methods to create equations that express candidate confidences in terms of the model, and then
optimize the model to maximize certain objectives
such as accuracy. Second, we combine all the inference methods, including the belief engine, into an
ensemble and use the scenario-based training set to optimize the weights of the ensemble.
Closed-Form Inference Methods
This section describes a series of closed-form inference methods. Note that none of these methods by
itself has been shown to give higher accuracy
than the belief engine method described earlier.
However, they are more amenable to optimization,
and the ensemble of methods may perform better.
The noisy-OR model is most similar to the indicative semantics used in the belief engine, with some
differences. While the belief engine allows a graph
with directed cycles, the noisy-OR model requires a
directed acyclic graph (DAG). As mentioned above,
the assertion graph is not, in general, free of cycles.
Additionally, the assertion graph contains matching
relations, which are undirected. To form a DAG, the
nodes in the assertion graph are first clustered by
these matching relations, and then cycles are broken
by applying heuristics to reorient edges to point from
factors to hypotheses. Confidence is computed in a
feed-forward manner. The confidence in factors
extracted by scenario analysis is 1.0. For all other
nodes the confidence is defined recursively in terms
of the confidences of the parents and the confidence
of the edges produced by the question-answering system. This generates an equation for each candidate,
expressing its confidence in terms of the parameters.
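The feed-forward computation can be sketched as follows. This is a minimal illustration of noisy-OR propagation over a DAG; the node names, edge strengths, and graph are invented for the example, not taken from the actual assertion graph.

```python
# A minimal sketch of noisy-OR confidence propagation over a DAG.
# Node names, edge strengths, and the graph itself are illustrative.

def noisy_or_confidence(node, parents, edge_strength, confidence):
    """Confidence in `node` under noisy-OR: one minus the probability
    that every parent-edge 'fails' to activate it."""
    p_all_fail = 1.0
    for parent in parents[node]:
        p_activate = confidence[parent] * edge_strength[(parent, node)]
        p_all_fail *= 1.0 - p_activate
    return 1.0 - p_all_fail

# Factors extracted by scenario analysis are fixed at confidence 1.0;
# downstream nodes are computed in topological (feed-forward) order.
parents = {"h": ["f1", "f2"]}
edge_strength = {("f1", "h"): 0.8, ("f2", "h"): 0.5}
confidence = {"f1": 1.0, "f2": 1.0}
confidence["h"] = noisy_or_confidence("h", parents, edge_strength, confidence)
# h: 1 - (1 - 0.8)(1 - 0.5) = 0.9
```

Because every parent can only raise the probability that the child is activated, confidence under this model increases monotonically with the number of supporting parents.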
The edge-type variant of the noisy-OR model considers the type of the edge when propagating confidence from parents to children. The strength of the
edge according to the question-answering model is
multiplied by a per-edge-type learned weight, then a
sigmoid function is applied. In this model, different
types of subquestions may have different influence
on confidences, even when the question-answering
model produces similar features for them.
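The per-edge-type reweighting can be sketched as below; the edge-type names and weight values are hypothetical stand-ins for the learned parameters.

```python
import math

def edge_type_strength(raw_strength, edge_type, type_weight):
    """Scale the QA model's edge strength by a learned per-edge-type
    weight, then squash with a sigmoid. Weights here are illustrative."""
    return 1.0 / (1.0 + math.exp(-type_weight[edge_type] * raw_strength))

type_weight = {"indicates": 2.0, "matches": 0.5}   # hypothetical edge types
s_ind = edge_type_strength(0.9, "indicates", type_weight)
s_mat = edge_type_strength(0.9, "matches", type_weight)
# The same raw strength propagates differently depending on the edge type.
```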
The matching model estimates the confidence in a
hypothesis according to how well each factor in the
scenario, plus the answers to forward questions asked
about it, match against either the hypothesis or the
answers to the backward questions asked from it. We
estimate this degree of match using the term matchers described earlier in the Matching Graph Nodes section.
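A skeletal version of this scoring is shown below. The toy term matcher stands in for the actual term matchers, and the averaging of best-match scores is one simple way to aggregate; the real aggregation may differ.

```python
def match_score(factor_texts, hypothesis_texts, term_match):
    """Score a hypothesis by how well each factor (or its forward answers)
    matches the hypothesis (or its backward answers): take the best match
    per factor and average. `term_match` stands in for the term matchers."""
    best = [max(term_match(f, h) for h in hypothesis_texts) for f in factor_texts]
    return sum(best) / len(best)

# Toy term matcher: exact match 1.0, shared word 0.5, else 0.0 (illustrative).
def toy_match(a, b):
    if a == b:
        return 1.0
    return 0.5 if set(a.split()) & set(b.split()) else 0.0

score = match_score(["chest pain", "fever"], ["fever", "acute chest pain"], toy_match)
# "chest pain" best-matches "acute chest pain" (0.5); "fever" matches exactly (1.0).
```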
The feature addition model uses the same DAG as
the noisy-OR model, but confidence in the intermediate nodes is computed by adding the feature values
for the questions that lead to it and then applying
the logistic model to the resulting vector. An effect is
that the confidence for a node does not increase
monotonically with the number of parents. Instead,
if features that are negatively associated with correctness are present in one edge, it can lower the confidence of the node below the confidence given by the other parents alone.
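The contrast with noisy-OR can be made concrete. In this sketch the feature vectors, weights, and bias are invented; the point is only that a logistic model over summed edge features is not monotone in the number of parents.

```python
import math

def feature_addition_confidence(parent_edge_features, weights, bias):
    """Sum the feature vectors over all incoming edges, then apply a
    logistic model. A strongly negative feature on one edge can pull the
    node's confidence below what the other parents alone would give."""
    total = [0.0] * len(weights)
    for feats in parent_edge_features:
        for i, v in enumerate(feats):
            total[i] += v
    z = bias + sum(w * v for w, v in zip(weights, total))
    return 1.0 / (1.0 + math.exp(-z))

weights, bias = [1.5, -2.0], 0.0          # illustrative learned parameters
one_parent = feature_addition_confidence([[1.0, 0.0]], weights, bias)
two_parents = feature_addition_confidence([[1.0, 0.0], [0.0, 1.0]], weights, bias)
# Adding a parent carrying a negatively weighted feature lowers confidence.
```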
The causal model attempts to capture causal semantics by expressing the confidence for each candidate
as the product over every clinical factor of the probability that either the diagnosis could explain the factor (as estimated from Watson/question-answering
features), or the factor “leaked” — it is an unexplained observation or is not actually relevant.
In the closed-form inference systems described,
there is no constraint that the answer confidences
sum to one. We implement a final stage where features based on the raw confidence from the inference
model are transformed into a proper probability distribution over the candidate answers.
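One simple instance of this final stage is plain renormalization, sketched below; the actual transform is feature-based and more involved than this.

```python
def normalize(raw_confidences):
    """Turn raw inference-model confidences into a probability
    distribution over the candidate answers (a minimal sketch of
    the final stage; the real transform uses learned features)."""
    total = sum(raw_confidences.values())
    return {cand: v / total for cand, v in raw_confidences.items()}

dist = normalize({"A": 0.9, "B": 0.3, "C": 0.3})
# Confidences now sum to one: {"A": 0.6, "B": 0.2, "C": 0.2}
```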
The methods described in the previous section permit expressing the confidence in the correct answer
as a closed-form expression. Summing the log of the
confidence in the correct hypothesis across the training set T, we construct a learning problem with log-likelihood in the correct final answer as our objective
function. The result is a function that is nonconvex,
and in some cases (due to max operations) not differentiable in the parameters.
To limit overfitting and encourage a sparse, interpretable parameter weighting, we use L1-regularization. The absolute value of all learned weights is subtracted from the objective function.
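The resulting objective can be sketched as follows. The `confidence_fn` here is a placeholder for any of the closed-form methods above, and the toy confidence function and regularization strength are invented for illustration.

```python
import math

def objective(weights, training_set, confidence_fn, l1=0.1):
    """L1-regularized log-likelihood: sum the log of the model's
    confidence in the correct hypothesis over the training set,
    minus the L1 penalty on the learned weights."""
    ll = sum(math.log(confidence_fn(weights, scenario)) for scenario in training_set)
    return ll - l1 * sum(abs(w) for w in weights)

# Toy check: one "scenario" whose correct-answer confidence is sigmoid(w0).
def toy_conf(weights, scenario):
    return 1.0 / (1.0 + math.exp(-weights[0]))

obj = objective([2.0], [None], toy_conf, l1=0.1)
# The penalty strictly lowers the objective relative to the unregularized value.
```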
To learn the parameters for the inference models
we apply a “black-box” optimization method:
greedy-stochastic local search. Learning explores the
parameter space, tending to search in regions of high
value while avoiding becoming permanently stuck in a local maximum.
We also experimented with the Nelder-Mead simplex method (Nelder and Mead 1965) and the multi-directional search method of Torczon (1989) but
found weaker performance from these methods.
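A minimal sketch of greedy-stochastic local search is given below. The perturbation scale, step count, and occasional-acceptance rule are illustrative choices, not the actual schedule used.

```python
import random

def greedy_stochastic_search(objective, init, steps=2000, scale=0.5, seed=0):
    """Sketch of greedy-stochastic local search: perturb one random
    parameter, keep the move if it improves the objective, and
    occasionally accept a worse move so the search can escape a
    local maximum. Details of the acceptance rule are illustrative."""
    rng = random.Random(seed)
    best, best_val = list(init), objective(init)
    cur, cur_val = list(best), best_val
    for _ in range(steps):
        cand = list(cur)
        cand[rng.randrange(len(cand))] += rng.gauss(0.0, scale)
        val = objective(cand)
        if val > cur_val or rng.random() < 0.05:   # greedy, with escapes
            cur, cur_val = cand, val
        if cur_val > best_val:
            best, best_val = list(cur), cur_val
    return best, best_val

# Toy objective with a known maximum of 0.0 at (1, -2).
f = lambda w: -((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)
w, v = greedy_stochastic_search(f, [0.0, 0.0])
```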
We have multiple inference methods, each approaching the problem of combining the subquestion confidences from a different intuition and formalizing it
in a different way. To combine all these different
approaches we train an ensemble.
This is a final, convex confidence estimation over
the multiple-choice answers using the predictions of
the inference models as features.
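The final stage can be sketched as a logistic combination of the per-method confidences, normalized over the candidates. The weights and scores below are invented; the real ensemble weights are trained on the scenario-based training set.

```python
import math

def ensemble_confidence(model_scores, weights, bias=0.0):
    """Combine each inference method's confidence for a candidate via a
    linear model plus sigmoid, then normalize across the multiple-choice
    answers. A minimal sketch; weights here are illustrative, not learned."""
    raw = {}
    for cand, scores in model_scores.items():
        z = bias + sum(w * s for w, s in zip(weights, scores))
        raw[cand] = 1.0 / (1.0 + math.exp(-z))
    total = sum(raw.values())
    return {cand: v / total for cand, v in raw.items()}

# Hypothetical scores per candidate from three inference methods
# (e.g., noisy-OR, matching, causal).
final_dist = ensemble_confidence(
    {"A": [0.9, 0.8, 0.7], "B": [0.2, 0.3, 0.4]},
    weights=[1.0, 1.0, 1.0],
)
```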