result in vastly different amounts of labeled data.
Consequently, a learner trained with one set of seeds
may have, say, twice as much training data as a learner trained with a different set of seeds. A common
mistake is for people to identify seeds by hand,
assuming that they know what the most frequent
words will be. Corpora are highly idiosyncratic, and
words that are common in general usage will not necessarily be common in any particular text collection. Furthermore, people can miss words that are extremely
common in the corpus and that could therefore be
valuable seeds. Looking at corpus statistics is the
most reliable way to know that your seeds are aligned
with your actual data.
For lexicon induction, we typically follow a procedure first suggested by Roark and Charniak (1998) to
ensure high-frequency seeding. To select seed words,
sort all of the candidates (for example, nouns) in the
unannotated text corpus by frequency. Then manually review the list, selecting the first (most frequent)
k words that belong to the category of interest. This
process ensures that the seeds will be highly frequent
in the corpus.
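This frequency-based selection procedure can be sketched as follows; the (word, tag) document representation and the `is_category_member` callback are illustrative assumptions, with the callback standing in for the manual review step:

```python
from collections import Counter

def candidate_frequencies(tagged_docs, candidate_pos=frozenset({"NOUN"})):
    """Count how often each candidate word (here, nouns) appears in the
    unannotated corpus. Each document is assumed to be a list of
    (word, pos_tag) pairs produced by whatever tagger is in use."""
    counts = Counter()
    for doc in tagged_docs:
        for word, pos in doc:
            if pos in candidate_pos:
                counts[word.lower()] += 1
    return counts

def select_seeds(counts, is_category_member, k=10):
    """Walk the candidates from most to least frequent, keeping the first k
    that the reviewer judges to belong to the category of interest."""
    seeds = []
    for word, freq in counts.most_common():
        if is_category_member(word):  # stands in for manual review
            seeds.append(word)
            if len(seeds) == k:
                break
    return seeds
```

Because the walk proceeds in descending frequency order, any seed it returns is guaranteed to be at least as frequent in the corpus as every candidate it skipped.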
High Precision
The seeding heuristics are used to automatically label
data for training, so the accuracy of the heuristics
directly correlates with the quality of the training
data. Consequently, the heuristics should have high
precision. However, high precision can be at odds
with high frequency because there is often a trade-off
between precision and recall (that is, high-precision
rules are often low recall, and vice versa). Consequently, it is sometimes difficult to identify seeding
heuristics that satisfy both the high-precision and
high-frequency goals. When forced to choose, high
precision is usually preferable for two reasons: (1) too
much noise in the training data can render it ineffective, and (2) low recall can often be compensated
for by increasing the size of the text corpus. Since
only unannotated texts are needed, obtaining more
texts is often feasible.
Diversity
The seeding heuristics should be able to label a
diverse set of examples, so as to produce labeled data
that is (reasonably) representative of the corpus as a
whole. If the seeding heuristics only label instances
that are highly similar, or if they have poor coverage
across subclasses, then the resulting data will be
strongly biased. Of course, by definition the labeled
instances will share whatever properties are selected
for by the seeding mechanism. But diversity can
often be achieved by defining seeding heuristics that
are not too specific and by using multiple seeding
heuristics that cover different parts of the search
space. It has been shown that if we characterize the
corpus as a graph based on the connectivity of seeds
and contexts, then the seeds should be chosen to cover any subgraphs that are disconnected from the main graph (Jones
2004).
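The connectivity check above amounts to finding connected components in an undirected graph whose nodes are candidate seeds and the contexts they occur in. A minimal sketch, assuming edges are given as (seed, context) pairs:

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Find the connected components of an undirected graph given as
    (node, node) edges. In a seed-context graph, one endpoint is a
    candidate seed word and the other is a context it occurs in;
    components disconnected from the largest one indicate regions
    of the corpus that the current seeds cannot reach."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first traversal of one component
            node = queue.popleft()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(comp)
    return components
```

If this returns more than one component, adding a seed from each isolated component would improve coverage of the corpus.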
Seeding Mechanisms
Bootstrapped learning has been applied to a wide
variety of NLP tasks, using many types of seeding
strategies. Here we take a brief look at some of the
seeding mechanisms that have been used successfully, to emphasize that bootstrapping can be initiated
in many different ways.
Seed words are a common form of seeding that has
been used for lexicon induction, pattern learning,
and word sense disambiguation (Yarowsky 1995).
Seed patterns have been used to identify relevant
contexts, for example to classify relevant and irrelevant texts for bootstrapped pattern learning (Yangarber et al. 2000) and to identify relevant regions to
train a sentence classifier (Patwardhan and Riloff
2007). Seeding rules have been used for named entity
recognition and coreference resolution. Collins and
Singer (1999) heuristically labeled named entities to
create training data by defining rules such as
Contains(X, “Mr.”) ⇒ PERSON(X). Bean and Riloff (2004)
used lexical and syntactic seeding rules to generate
labeled data for coreference resolution.
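A seeding rule of this kind is straightforward to express in code. The sketch below is illustrative rather than a reconstruction of the Collins and Singer rule set; the title list is an assumption:

```python
def label_named_entities(tokens):
    """Apply a high-precision seeding rule in the spirit of
    Contains(X, "Mr.") => PERSON(X): label a capitalized token as PERSON
    when it is immediately preceded by a personal title. Tokens that match
    no rule are simply left unlabeled for the learner to handle later."""
    titles = {"Mr.", "Mrs.", "Ms.", "Dr."}  # illustrative title list
    labels = []
    for i, tok in enumerate(tokens):
        if i > 0 and tokens[i - 1] in titles and tok[0].isupper():
            labels.append((tok, "PERSON"))
    return labels
```

Note that the rule deliberately says nothing about tokens it cannot match: abstaining on unclear cases is exactly what keeps a seeding heuristic high precision.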
Another approach is to create an initial seeding
classifier that is applied to unlabeled texts to produce
an initial set of labeled instances, which are then
used as training data to jumpstart a bootstrapped
learning process. This scenario makes sense when it
is possible to easily construct a high-precision, but
potentially low-recall, classifier. This type of classifier can initially label some instances with high accuracy, which the bootstrapping process can use to
learn new information. This approach has been used
for opinion analysis, to learn patterns representing
subjective expressions (Riloff and Wiebe 2003) and to
train subjective and objective sentence classifiers
without annotated data (Wiebe and Riloff 2005).
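As a concrete illustration of a high-precision, low-recall seeding classifier, consider labeling subjective sentences only when multiple strong subjective clues co-occur. The clue set and threshold below are illustrative assumptions, not the actual Riloff and Wiebe lexicon:

```python
def seed_label_sentences(sentences, strong_subjective, min_clues=2):
    """Label a sentence SUBJECTIVE only if it contains at least min_clues
    strong subjective clue words; otherwise leave it unlabeled. Requiring
    multiple clues trades recall for precision, so the labeled output is
    accurate enough to serve as training data for bootstrapping."""
    labeled = []
    for sent in sentences:
        words = set(sent.lower().split())
        if len(words & strong_subjective) >= min_clues:
            labeled.append((sent, "SUBJECTIVE"))
        # sentences below the threshold remain unlabeled rather than
        # being guessed at, preserving the precision of the seed data
    return labeled
```

The unlabeled remainder is not wasted: it is exactly the pool from which the bootstrapping process later harvests new instances as it learns.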
It is worth noting that distant supervision is also a
seeding method, although it is typically used to generate a large set of labeled data to train a supervised
learner in a single step. Distant supervision takes
advantage of an existing knowledge base (KB) to
heuristically label instances that correspond to data
found in the KB. For example, distant supervision has
been applied to relation extraction by identifying
pairs of entities listed in a knowledge base as having
a relation, and then heuristically labeling instances
of the entity pairs that appear in close proximity as
positive instances of the relation (Mintz et al. 2009).
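The distant supervision heuristic can be sketched as follows; entity spotting is simplified here to checking a precomputed entity list per sentence, whereas real systems rely on named entity recognition and entity linking:

```python
def distant_label(sentences_with_entities, kb_pairs, relation):
    """Distant supervision in the spirit of Mintz et al. (2009): any
    sentence that mentions both entities of a pair listed in the knowledge
    base for the given relation is labeled a positive instance of that
    relation. Input is (sentence, [entity mentions]) pairs."""
    positives = []
    for sent, entities in sentences_with_entities:
        ents = set(entities)
        for e1, e2 in kb_pairs:
            if e1 in ents and e2 in ents:
                positives.append((sent, e1, e2, relation))
    return positives
```

This single labeling pass can produce a large training set in one step, which is why distant supervision typically feeds a supervised learner directly rather than an iterative bootstrapping loop.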
Secondary Benefits of Bootstrapping
The primary benefit of bootstrapped learning is that
it eliminates the need for manually annotated training data, which is expensive and time-consuming to
obtain. However, bootstrapping methods have several secondary benefits as well, which are often underappreciated.
First, bootstrapped learning allows for easier and
more freewheeling system design, development, and
experimentation. Since supervised learning depends