on manually annotated data, system development
often must wait until annotated data has been collected. And then, system developers are handcuffed
to that data set because it is the only available training data. In contrast, with bootstrapped learning systems, the only input consists of unannotated texts
for the domain and a seeding mechanism. Unannotated text corpora are relatively easy to obtain, and
most seeding strategies are lightweight. As a result, it
is blissfully easy to try bootstrapped learning on different domains, text corpora, and tasks.
For example, suppose someone is interested in
semantic lexicon induction for a new domain. If the
person has an interest in creating an NLP system for
the domain, then they probably already have (or
know where to find) a large collection of texts for
that domain. Given the text corpus, the person needs
only to define a small number of seed terms for each
semantic category. A key question is what the ideal
set of semantic categories should be. That’s where the
benefits of the bootstrapping paradigm become
apparent. The developer can choose an initial set of
categories based on their domain knowledge and
define a small set of seed words for each one. This
process may take as little as an hour. Then the developer can apply the learning algorithm and inspect
the results. If new words are learned that clearly
belong to different categories, then new categories
can be added simply by defining a few seed words for
them. If some categories are behaving similarly, then
the developer may choose to merge categories to represent a higher-level concept. If frequency counts seem too low, the developer can expand the corpus simply by obtaining more unannotated texts.
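The seed-then-inspect loop described above can be made concrete with a small sketch. The toy corpus, the hypothetical DRUG seed terms, and the simple context-pattern scoring below are illustrative assumptions made for this example, not the algorithm from our paper:

```python
from collections import Counter

# Toy unannotated "domain corpus"; in practice this would be thousands of texts.
corpus = [
    "the patient was given aspirin for the headache",
    "the patient was given ibuprofen for the fever",
    "the doctor prescribed penicillin for the infection",
    "the doctor prescribed insulin for the diabetes",
]

# Hypothetical seed terms for a single semantic category (DRUG).
lexicon = {"aspirin", "penicillin"}

def context_pattern(tokens, i, window=2):
    """Represent position i by its surrounding words, with the slot blanked out."""
    return tuple(tokens[max(0, i - window):i]) + ("_",) + tuple(tokens[i + 1:i + 1 + window])

for _ in range(2):  # a couple of bootstrapping iterations
    # 1. Collect context patterns that extract current lexicon members.
    pattern_hits = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok in lexicon:
                pattern_hits[context_pattern(toks, i)] += 1

    # 2. Score unknown words by how often known patterns extract them.
    candidates = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok not in lexicon:
                candidates[tok] += pattern_hits[context_pattern(toks, i)]

    if not candidates or candidates.most_common(1)[0][1] == 0:
        break
    # 3. Promote the best candidate into the lexicon and repeat.
    lexicon.add(candidates.most_common(1)[0][0])

print(sorted(lexicon))  # seeds plus newly learned category members
```

Inspecting the learned words after a run like this is exactly the point where a developer might decide to add, merge, or re-seed categories before iterating again.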
Furthermore, cross-resource experimentation is
relatively straightforward. Experimenting with different text corpora and even languages (depending
on the task) can be as simple as replacing one text
collection with another and mapping the seeding
strategy onto the new resource. While these changes
may not be trivial, explorations like these are substantially easier in a bootstrapping paradigm than
they would be in a supervised learning paradigm that
requires manually annotated training data.
Another advantage of seeding strategies over manually annotated data is that seeds are typically much easier for people to produce. In natural language processing, manually annotating texts can be deceptively difficult
because of issues pertaining to phrase boundaries,
edge cases (borderline concepts), and idiosyncratic
expressions. Given any set of natural language documents, many of these issues are likely to appear and
can be impossible to avoid. Overall, bootstrapped learning offers many advantages, in terms of both data requirements and research and system development effort.
Summary
Nineteen years after appearing in the AAAI'99 conference, our paper continues to be cited. As we have
tried to show, we did not start the revolution alone,
but were part of a movement that has continued to
have an impact on research today.
It is difficult to believe that the long-term success
of natural language processing will rely on manually
annotated text corpora for every conceivable task,
domain, and language. Bootstrapping, weakly supervised learning, and distant labeling are important
tools for the future, especially as text corpora continue to grow in size, massive computing power
becomes increasingly available to support large-scale
text processing, and NLP applications are ever more
ubiquitous in everyday life.
There remain many open questions and research
avenues for future work, both within natural language processing in general and for bootstrapped
learning methods in particular. Accuracies are still far
from perfect for many NLP tasks, and new applications for NLP are constantly emerging. Our hope is
that the next generation of researchers will continue
investigating and improving bootstrapped learning
methods for natural language processing and that
these techniques will play a major role in future NLP
technologies.
Notes
1. This is essentially a form of distributional similarity,
which has become a widely used NLP tool for empirical
semantic analysis.
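As a toy illustration of that idea (the corpus and word choices here are invented for the example), distributional similarity compares words by the contexts in which they occur; words sharing many contexts receive a high similarity score:

```python
import math
from collections import Counter

# Invented toy corpus; real distributional methods use very large corpora.
corpus = ("the cat chased the mouse . the dog chased the cat . "
          "the dog ate the food . the cat ate the mouse .").split()

def context_vector(word, window=1):
    """Count the words appearing within `window` positions of each occurrence of `word`."""
    vec = Counter()
    for i, tok in enumerate(corpus):
        if tok == word:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    vec[corpus[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# "cat" and "dog" share contexts like "the _ chased/ate", so they score
# higher with each other than "cat" does with "food".
print(cosine(context_vector("cat"), context_vector("dog")))
print(cosine(context_vector("cat"), context_vector("food")))
```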