large geographical area, typically at the level of a state
or large city. Researchers have examined influenza
tracking (Culotta 2010; Achrekar et al. 2012; Sadilek
and Kautz 2013; Broniatowski and Dredze 2013;
Brennan, Sadilek, and Kautz 2013), mental health
and depression (Golder and Macy 2011; De Choudhury et al. 2013), as well as general public health
across a broad range of diseases (Brownstein, Freifeld,
and Madoff 2009; Paul and Dredze 2011b).
Some researchers have begun modeling health and
contagion of specific individuals by leveraging fine-grained online social and web search data (Ugander
et al. 2012; White and Horvitz 2008; De Choudhury
et al. 2013). For example, in Sadilek, Kautz, and Silenzio (2012) we showed that Twitter users exhibiting
symptoms of influenza can be accurately detected
using a language model of their Twitter posts. A detailed
epidemiological model can be subsequently built by
following the interactions between sick and healthy
individuals in a population, where physical encounters are estimated by spatiotemporal colocated
Our earlier work on nEmesis (Sadilek et al. 2013)
scored restaurants in New York City by their number
of sick tweets using an initial version of the language
model described here. We showed a weak but significant correlation between the scores and published
NYC Department of Health inspection scores.
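The scoring and comparison just described can be illustrated with a minimal sketch. It assumes that tweets have already been classified as illness self-reports and matched to venues. The venue names, tweet records, inspection values, and the choice of Pearson's r are all hypothetical and chosen only for illustration; the source does not state which correlation statistic was used.

```python
from collections import Counter
from math import sqrt


def score_venues(sick_tweets):
    """Count classified illness self-reports per venue.

    `sick_tweets` is an iterable of (venue_id, tweet_text) pairs.
    """
    return Counter(venue for venue, _text in sick_tweets)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data, not taken from the study.
sick_tweets = [
    ("venue_a", "stomach ache after lunch"),
    ("venue_a", "food poisoning, never again"),
    ("venue_a", "feeling sick since dinner"),
    ("venue_b", "threw up last night"),
]
# Hypothetical inspection scores; assumed here that a higher value
# means more violation points.
inspection_scores = {"venue_a": 27, "venue_b": 12, "venue_c": 5}

scores = score_venues(sick_tweets)
venues = sorted(inspection_scores)
r = pearson_r(
    [scores.get(v, 0) for v in venues],       # venues with no sick tweets score 0
    [inspection_scores[v] for v in venues],
)
ranking = sorted(venues, key=lambda v: scores.get(v, 0), reverse=True)
```

With real data, a significance test would accompany the coefficient; it is omitted here to keep the sketch dependency-free.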
Although the data came from the same year, many
months typically separated the inspections and the
Other researchers have recently tried to use Yelp
restaurant reviews to identify restaurants that should
be inspected (Harrison et al. 2014). Keywords were
used to filter 294,000 Yelp reviews for New York City
down to 893 possible reports of illness. These were manually screened and resulted in the identification of 3
Background: Foodborne Illness
Foodborne illness, known colloquially as food poisoning, is any illness that results from consuming food contaminated by pathogenic bacteria,
viruses, or parasites, or from consuming chemical or natural toxins such
as poisonous mushrooms. The US Centers for Disease
Control and Prevention (CDC) estimates that 47.8
million Americans (roughly 1 in 6 people) are sickened each year by foodborne disease. Of that total,
nearly 128,000 people are hospitalized, while just
over 3,000 die of foodborne diseases (CDC 2013).
CDC classifies cases of foodborne illness according
to whether they are caused by one of 31 known foodborne illness pathogens or by unspecified agents.
These 31 known pathogens account for 9.4 million
(20 percent of the total) cases of food poisoning each
year, while the remaining 38.4 million cases (80 percent of the total) are caused by unspecified agents.
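As a quick consistency check on the figures quoted above, the following sketch recomputes the two shares and the hospitalization and death rates from the stated counts. The variable and function names are illustrative. The hospitalization and death counts are entered as the rounded values 128,000 and 3,000, whereas the text says "nearly" and "just over", so the resulting rates are approximate.

```python
def share(part, total):
    """Return `part` as a fraction of `total`."""
    return part / total


# Annual US estimates as quoted in the text (CDC 2013), in millions of cases.
total_cases = 47.8
known_pathogen_cases = 9.4
unspecified_agent_cases = 38.4

known_share = share(known_pathogen_cases, total_cases)
unspecified_share = share(unspecified_agent_cases, total_cases)

# Rounded counts; the text gives "nearly 128,000" and "just over 3,000".
hospitalization_rate = share(128_000, total_cases * 1_000_000)
death_rate = share(3_000, total_cases * 1_000_000)

print(f"known pathogens:    {known_share:.1%}")
print(f"unspecified agents: {unspecified_share:.1%}")
print(f"hospitalized:       {hospitalization_rate:.2%} of cases")
print(f"deaths:             {death_rate:.4%} of cases")
```

The two case counts sum to the stated total, and the shares round to the 20 percent and 80 percent given in the text.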
populations (Eubank et al. 2004), and by analysis of
official statistics (Grenfell, Bjornstad, and Kappey
2001). Such models are typically developed for the purposes of assessing the impact a particular combination
of an outbreak and a containment strategy would have
on humanity or ecology (Chen, David, and Kempe
However, the above works focus on aggregate or
simulated populations. By contrast, we address the
problem of predicting the health of real-world populations composed of individuals embedded in a social
structure and geo-located on a map.
Most prior work on using data about users’ online
behavior has estimated aggregate disease trends in a
Figure 2. nEmesis Web Interface.
The top window shows a portion of the list of food venues ranked by the
number of tweeted illness self-reports by patrons. The bottom window provides a map of the selected venue, and allows the user to view the specific
tweets that were classified as illness self-reports.