through a combination of two techniques: human-guided active learning, and learning a language model that is robust under class imbalance. We cover the
first technique in this section and discuss the language model induction in the following section.
Previous research has shown that under extreme
class imbalance, simply finding examples of the
minority class and providing them to the model at
learning time significantly improves the resulting
model quality and reduces human labeling cost
(Attenberg and Provost 2010). In this work, we leverage human-guided machine learning, a novel
learning method that considerably reduces the
amount of human effort required to reach any given
level of model quality, even when the number of
negatives is many orders of magnitude larger than
the number of positives (Sadilek et al. 2013). In our
domain, the ratio of sick to healthy tweets is roughly 1 : 2500.
In each human guided learning iteration, nEmesis
samples representative and informative examples to
be sent for human review. As the focus is on minority-class examples, we sample 90 percent of tweets for a given labeling batch from the top 10 percent of the most likely sick tweets (as predicted by our language model). The remaining 10 percent is sampled uniformly at random to increase diversity. We use the HITs described above to obtain the labeled examples.
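The batch-sampling scheme just described can be sketched in Python as follows. This is a minimal illustration, not nEmesis's implementation; the function and parameter names (`sample_labeling_batch`, `top_frac`, `focus_frac`) are ours:

```python
import random

def sample_labeling_batch(tweets, scores, batch_size=100,
                          top_frac=0.1, focus_frac=0.9):
    """Select a batch for human review: 90 percent drawn from the
    top 10 percent of tweets by predicted sickness score, and the
    remaining 10 percent sampled uniformly at random for diversity.
    Names and defaults are illustrative assumptions."""
    # Rank tweet indices by the language model's score, descending.
    ranked = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)
    top_pool = ranked[: max(1, int(len(tweets) * top_frac))]

    # Most of the batch comes from the likely-sick pool.
    n_focus = int(batch_size * focus_frac)
    batch = random.sample(top_pool, min(n_focus, len(top_pool)))

    # Fill the remainder uniformly from tweets not yet chosen.
    chosen = set(batch)
    remaining = [i for i in ranked if i not in chosen]
    batch += random.sample(remaining, min(batch_size - len(batch), len(remaining)))
    return [tweets[i] for i in batch]
```

Biasing the batch toward the model's most confident positives is what makes labeling cost-effective under a 1:2500 class ratio, while the uniform remainder guards against the model's blind spots.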
In parallel with this automated process, we hired workers to actively find examples of tweets in which the author indicates he or she has an upset stomach. We asked them to paste a direct link to each tweet they found into a text box. Workers received a base pay of 10 cents for accepting the task and were motivated by a bonus of 10 cents for each unique relevant tweet they provided. Each incorrect tweet resulted in a
10 cent deduction from the current bonus balance of
a worker. Tweets judged to be too ambiguous were
neither penalized nor rewarded. Overall, we have
posted 50 HITs that resulted in 1971 submitted tweets
(mean of 39. 4 per worker). Removing duplicates
yielded 1176 unique tweets.
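The payout scheme above can be summarized with a short sketch. The function name, the per-tweet labels, and the assumption that the bonus balance is floored at zero are ours, not stated in the text:

```python
def worker_payout(submissions, base_cents=10, bonus_cents=10):
    """Compute a worker's pay: a base payment, plus a bonus per unique
    relevant tweet, minus the same amount per incorrect tweet; ambiguous
    tweets are neither rewarded nor penalized, and duplicates earn nothing.
    Flooring the bonus balance at zero is an assumption."""
    bonus = 0
    seen = set()
    for tweet_id, label in submissions:  # label: "relevant" | "wrong" | "ambiguous"
        if tweet_id in seen:
            continue  # duplicate submissions earn nothing
        seen.add(tweet_id)
        if label == "relevant":
            bonus += bonus_cents
        elif label == "wrong":
            bonus = max(0, bonus - bonus_cents)  # deduct from current balance
    return base_cents + bonus
```

For example, a worker submitting two copies of one relevant tweet, one wrong tweet, one more relevant tweet, and one ambiguous tweet would earn the 10-cent base plus a 10-cent net bonus.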
As a result, we employ human workers who "guide" the classifier induction by correcting the system when it makes erroneous predictions and by proactively seeking out examples of the rare positive class.
Figure 5. Example of a Mechanical Turk Task.
In this task, online workers are asked to label a given tweet. While tweets are often ambiguous, we encouraged workers to
use their best judgment and try to polarize their answers. We found that when workers are presented with too many options,
they tend to select "Can't tell" even when the text contains strong evidence of illness.
Help us find health problems looming behind these tweets.
Please use your best judgment to evaluate these tweets for signs of upset stomach, e.g., food poisoning, diarrhea, stomach ache, or food-related disease. Use the radio buttons to select what you think is the most likely answer for each tweet. You will be paid based on agreement of your input with other workers and with our automated system. Please consider each tweet carefully. Use the last response ("It's absolutely impossible to tell from this tweet") only when absolutely sure the health of the person cannot be estimated.
• Evaluate all tweets to complete the HIT.
• The tweets are often ambiguous or even nonsensical. Please use your best judgment to find the best label for each tweet.
• You are not required to follow any links that may be included in the text.
• The tweets are unfiltered and therefore may contain offensive language.
• Enjoy the HIT, you are helping science! :-)
Do you think the author of this tweet has an upset stomach today?
I want to go to bed. It's 1am and I can't fall asleep because I'm sad :(
Yes: This person likely has an upset stomach
No: This person does NOT indicate upset stomach in this tweet
It's absolutely impossible to tell from this tweet