inspection would focus on the preparation of the particular foods consumed and the risk factors for the
contamination, proliferation or amplification, and
survival of the causative organism. This type of
inspection is reactive in nature, and while it may prevent additional disease, problems in the facility have
already occurred. The ultimate goal of all of these
types of inspections is to prevent foodborne illness.
Historically, there has been no easy way to identify restaurants whose food handling practices are declining and to intervene before illness occurs, as inspections are scheduled largely by the time elapsed since the previous inspection. As a result, these types of inspections represent the bulk of inspection activities but tend to be rather inefficient at identifying problem facilities.
Complaint-driven inspections, while important,
identify the problems after they have occurred,
which is too late to prevent disease. More importantly, foodborne illnesses are frequently underdiagnosed
and underreported (Scallan et al. 2011), preventing
public health officials from identifying the source of
illness for most foodborne infections.
Clark County, Nevada, is home to more than 2
million people and hosts over 41 million annual visitors to the Las Vegas metropolitan area. The Southern Nevada Health District (SNHD) is the governmental agency responsible for all public health
matters within the county and is among the largest
local health departments in the United States by population served. In 2014, SNHD conducted 35,855
food inspections (of all types) in nearly 16,000 permitted facilities. In Southern Nevada, inspection violations are weighted by their likelihood to directly cause a foodborne illness: critical violations carry 5 demerits each (for example, food handlers not washing hands between handling raw food and ready-to-eat food), major violations carry 3 demerits each (a hand sink not stocked with soap), and good food management practices carry no demerits (a leak at the hand sink). Demerits are
converted to letter grades, where 0–10 is an A, 11–20
is a B, 21–39 is a C, and 40+ is an F (immediate closure). A repeated violation of a critical or major item
causes the letter grade to drop to the next lower rank.
A grade of C or F represents a serious health hazard.
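The grading scheme above can be sketched in a few lines of Python; the function and variable names below are our own illustration, not SNHD's actual implementation.

```python
# Sketch of the SNHD demerit-to-grade conversion described in the text.
# Names and structure are hypothetical; only the thresholds and the
# repeat-violation downgrade rule come from the source.

GRADES = ["A", "B", "C", "F"]  # ordered best to worst

def base_grade(demerits: int) -> str:
    """Map a demerit total to a letter grade: 0-10 A, 11-20 B, 21-39 C, 40+ F."""
    if demerits <= 10:
        return "A"
    if demerits <= 20:
        return "B"
    if demerits <= 39:
        return "C"
    return "F"  # 40+ demerits: immediate closure

def final_grade(demerits: int, repeated_critical_or_major: bool) -> str:
    """A repeated critical or major violation drops the grade one rank."""
    grade = base_grade(demerits)
    if repeated_critical_or_major and grade != "F":
        grade = GRADES[GRADES.index(grade) + 1]
    return grade

print(final_grade(8, False))  # -> A
print(final_grade(8, True))   # -> B (repeat violation drops one rank)
```

Under this rule, even a facility with few demerits can lose its A grade if it repeats a critical or major violation.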
During the experiment, when a food establishment
was flagged by nEmesis in an inspector’s area, he was
instructed to conduct a standard, routine inspection
on both the flagged facility (adaptive inspection) and a provided control facility (routine inspection).
Control facilities were selected according to location, size, cuisine, and permit type so as to pair the facilities as closely as possible. The inspector was blind
as to which facility was which, and each facility
received the same risk-based inspection as the other.
Labeling Data at Scale
To scale the laborious process of labeling training
data for our language model, we turn to Amazon’s
Mechanical Turk. 2 Mechanical Turk allows requesters
to harness the power of the crowd in order to complete a set of human intelligence tasks (HITs). These
HITs are then completed online by hired workers
(Mason and Suri 2012).
We formulated the task as a series of short surveys,
each 25 tweets in length. For each tweet, we asked “Do
you think the author of this tweet has an upset stomach today?” There are three possible responses
(“Yes,” “No,” “Can’t tell”), out of which a worker has
to choose exactly one (figure 5). We paid the workers
1 cent for every tweet evaluated, making each survey
25 cents in total. Each worker was allowed to label a
given tweet only once. The order of tweets was randomized. Each survey was completed by exactly five
workers independently. This redundancy was added
to reduce the effect of workers who might give erroneous or outright malicious responses. Inter-annotator agreement measured by Cohen’s κ is 0.6, considered a moderate to substantial agreement in the
literature (Landis and Koch 1977). Responses from
workers who exhibit consistently low annotator
agreement with the majority were eliminated.
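For reference, Cohen's κ measures observed agreement between two annotators corrected for the agreement expected by chance. The function below is an illustrative sketch of that statistic, not code from the study.

```python
# Illustrative computation of Cohen's kappa for two annotators.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
# and p_e is chance agreement from each annotator's label frequencies.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

a = ["Yes", "Yes", "No", "No", "Yes", "No"]
b = ["Yes", "No", "No", "No", "Yes", "No"]
print(round(cohens_kappa(a, b), 3))  # -> 0.667
```

A κ of 0.6, as obtained here, thus reflects agreement well above chance but short of perfect consensus.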
Workers were paid for their efforts only after we
were reasonably sure their responses were sincere
based on inter-annotator agreement. For each tweet,
we calculate the final label by adding up the five constituent labels provided by the workers (Yes = 1, No = −1, Can't tell = 0). In the event of a tie (a score of 0), we consider the tweet healthy in order to obtain a high-precision data set.
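This aggregation rule can be sketched as follows; the function and label names are hypothetical, not drawn from the nEmesis codebase.

```python
# Sketch of the worker-label aggregation described above:
# sum five votes (Yes = 1, No = -1, Can't tell = 0); ties default
# to "healthy" so the positive ("sick") class stays high-precision.

LABEL_VALUES = {"Yes": 1, "No": -1, "Can't tell": 0}

def aggregate(worker_labels):
    score = sum(LABEL_VALUES[label] for label in worker_labels)
    return "sick" if score > 0 else "healthy"

print(aggregate(["Yes", "Yes", "No", "Can't tell", "Yes"]))  # -> sick (score +2)
print(aggregate(["Yes", "No", "Can't tell", "Can't tell", "Can't tell"]))  # -> healthy (tie)
```

Breaking ties toward "healthy" trades recall for precision: some genuinely sick tweets are discarded, but the tweets labeled sick are more reliable.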
Designing HITs to elicit optimal responses from
workers is a difficult problem (Mason and Suri 2012).
Pricing HITs poorly can lead to workers not even
considering a task; HITs that are too long can cause
worker attrition; and poorly or ambiguously worded HITs will lead to noisy data. Worker satisfaction is also an
important “latent” factor, which should not be taken lightly. Many Mechanical Turk workers are members of communities that offer requester reviews,
very similar to Amazon’s product review system. As
a result, requesters who are unresponsive or opportunistic will soon find it hard to get any HIT completed.
Given that tweets indicating foodborne illness are
relatively rare, learning a robust language model poses considerable challenges (Japkowicz et al. 2000;
Chawla, Japkowicz, and Kotcz 2004). This problem
is called class imbalance and complicates virtually all
machine learning. In the world of classification,
models induced in a skewed setting tend to simply
label all data as members of the majority class. The
problem is compounded by the fact that the minority class members (sick tweets) are often of greater
interest than the majority class.
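A toy example makes this failure mode concrete; the 99:1 class ratio below is illustrative, not the actual skew in the tweet data.

```python
# Illustration of the class-imbalance pitfall described above: on a
# heavily skewed dataset, a degenerate classifier that always predicts
# the majority class scores high accuracy yet never detects a single
# minority ("sick") example.

labels = ["healthy"] * 990 + ["sick"] * 10   # hypothetical 99:1 skew

predictions = ["healthy"] * len(labels)      # majority-class "model"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall_sick = sum(p == y == "sick"
                  for p, y in zip(predictions, labels)) / labels.count("sick")

print(accuracy)     # 0.99 -- looks excellent
print(recall_sick)  # 0.0  -- but no sick tweet is ever found
```

This is why accuracy alone is a misleading objective under imbalance, and why the minority class requires special handling during training.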
We overcome class imbalance faced by nEmesis