An important goal for many conservation applica-
tions is spatial prioritization, the identification,
delineation, and ranking of regions for management
actions (Moilanen, Wilson, and Possingham 2009).
For applications with large geographic extents, mul-
tiscale spatial prioritization is essential for land man-
agers to identify land parcels for acquisition (Schus-
ter and Arcese 2013) or remediation. For example,
with declining populations of long-distance migrat-
ing birds, a key question is whether declines are
caused by events on breeding grounds, nonbreeding
grounds, or during migrations. Answering this ques-
tion requires the comparison of regional population
estimates across continents. Once important large-
scale regions are identified, fine-scale information is
needed to identify critical habitat patches and indi-
vidual migration stopover sites.
Multiscale information is also vital to a broad
range of related sustainability applications. Scientists
need to prioritize regions for disease control management (Ostfeld, Glass, and Keesing 2005). Policy
makers need to select sites for human development
while trying to minimize ecological costs, for example, when developing wind farms (Drewitt and
Langston 2006). In these examples, multiscale information is valuable because it allows managers to
inform policy and make objective decisions at the
appropriate spatial and temporal scale (Gomes 2009).
One of the fundamental challenges of studying
multiscale processes is the collection of data. Consistent sources of fine-resolution data are needed across
broad extents. For many types of biodiversity data,
the largest collection programs are national in scope.
Unfortunately, the variation among national programs hinders ecological study and conservation
planning for broadly distributed species. Because of
the difficulty and expense of collecting systematic
biodiversity data across large extents, many
researchers are beginning to use data collected by citizen science projects through crowdsourcing techniques (Dickinson, Zuckerberg, and Bonter 2010).
Crowdsourcing projects that engage the public to
collect data have been very successful at collecting
data across large areas. However, these data tend to be
irregularly and sparsely distributed. When participants opportunistically choose where to report their
observations, the data tend to follow patterns of
human activity (Hochachka et al. 2012); see, for example,
figure 1. This structure presents a challenge for the
analysis of multiscale processes because variation in
data density translates into variation in scale at
which valid inferences can be made. Intuitively, as
data density increases at a particular location, the
information available for estimating processes operating there also increases, allowing study of smaller
scale processes. In addition to the density of data, the
scale structure of analytical models affects the scale at
which valid inference can be made (Dungan et al.
2002). Thus, to take full advantage of crowdsourced
data, models that can discover multiscale structure
by adapting to the varying density of irregularly dis-
tributed observations are needed. Additionally, to
make full use of these models, tools are needed to
quantify and communicate the finest scales at which
inferences can reliably be made.
The most common approach to account for spatial
and spatiotemporal scale has been to model how correlation varies as a function of proximity. This has
been an active research area in statistics and machine
learning for the past two decades (Cressie 1993; Rasmussen and Williams 2006; Cressie and Wikle 2012).
Methodologies such as Kriging (Cressie 1986), Gaussian processes (Paciorek and Schervish 2004), Gaussian Markov random fields (Rue and Held 2005),
splines (Pintore, Speckman, and Holmes 2006; Kammann and Wand 2003), and autoregressive models
(Huang, Cressie, and Gabrosek 2002; Tzeng, Huang,
and Cressie 2005) have been proposed to estimate
and account for spatial correlation in stationary settings, where the effects of proximity are assumed to
be constant. More recently, research has focused on
accounting for nonstationary spatial correlation that
allows for varying scales. Nonstationary covariance
functions have been proposed for Gaussian processes and Kriging models (Stein 2005; Paciorek and
Schervish 2004; Jun and Stein 2008; Pintore and
Holmes 2004). Similarly, spline methods have been
developed with spatially varying penalties (Pintore,
Speckman, and Holmes 2006; Crainiceanu et al.
2007). However, the computational complexity of
many of these models is high (Cressie and Johannesson 2008; Gelfand 2012), necessitating trade-offs
between computational efficiency and the scale of
analysis for large data sets where the number of
observations and locations is in the millions.
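To make the distinction concrete, the sketch below contrasts a stationary squared-exponential covariance, which uses a single global length scale, with an isotropic nonstationary variant in the spirit of Paciorek and Schervish (2004), in which each location carries its own length scale. This is an illustration only, not any of the cited implementations; the function names and the toy length-scale field are assumptions made for the example.

```python
import numpy as np

def stationary_se(s1, s2, sigma2=1.0, ell=1.0):
    """Stationary squared-exponential covariance: one global length scale,
    so the effect of proximity is the same everywhere in the study region."""
    d2 = float(np.sum((np.asarray(s1, float) - np.asarray(s2, float)) ** 2))
    return sigma2 * np.exp(-d2 / ell ** 2)

def nonstationary_se(s1, s2, ell_fn, sigma2=1.0):
    """Isotropic nonstationary covariance in the spirit of Paciorek and
    Schervish (2004): each location s has its own length scale ell_fn(s),
    so correlation can decay over short distances where data are dense and
    over long distances where data are sparse.  Reduces to stationary_se
    when ell_fn is constant."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    d = s1.size                                # spatial dimension
    l1, l2 = ell_fn(s1), ell_fn(s2)
    avg = 0.5 * (l1 ** 2 + l2 ** 2)            # averaged squared length scale
    d2 = float(np.sum((s1 - s2) ** 2))
    # The prefactor keeps the kernel a valid covariance as the scale varies.
    prefactor = (l1 * l2 / avg) ** (d / 2.0)
    return sigma2 * prefactor * np.exp(-d2 / avg)

# Toy length-scale field: correlations decay faster near the origin
# (imagine a data-rich region) than far from it.
ell_fn = lambda s: 0.1 + 0.5 * np.linalg.norm(s)
print(stationary_se([0.0, 0.0], [0.3, 0.4]))
print(nonstationary_se([0.0, 0.0], [0.3, 0.4], ell_fn))
```

When the length-scale field is constant, the nonstationary kernel collapses to the stationary one; letting it vary over space is what allows the scale of correlation to differ from region to region.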
In this article we present an ensemble model
designed to discover scale-dependent, nonstationary
predictor-response relationships from large, irregularly distributed observational data (Fink, Damoulas,
and Dave 2013). We call this model AdaSTEM, an
extension to the spatiotemporal exploratory model
(Fink et al. 2010) based on a simple yet effective
divide-and-recombine strategy. The first stage of the
model divides the extent of analysis into regional
units based on data density using tree data structures.
Next, a mixture model is used to organize regional
units into a cohesive framework while facilitating
discovery of nonstationary patterns of predictor-response associations among regions. Within the
regional units, a user-specified model carries out the
supervised learning task that associates predictors
and responses. AdaSTEM is a highly automated
ensemble model with a pleasingly parallel implementation that scales to big data. The experiments
described here were conducted on the Lonestar cluster through an allocation on XSEDE (www.xsede.org).
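To make the divide-and-recombine strategy concrete, the following sketch builds an adaptive quadtree whose leaf size tracks data density, fits a trivial local model in each leaf, and recombines by averaging the predictions of every leaf that contains a query location, pooled across several randomized partitions. It is a minimal illustration rather than the authors' implementation: the thresholds, the random padding used to vary the partitions, and the regional-mean "model" (standing in for the user-specified learner) are all assumptions made for the example.

```python
import numpy as np

MAX_POINTS = 200   # split a region when it holds more observations than this
MIN_POINTS = 20    # do not split further when a child would hold fewer than this
MAX_DEPTH = 12     # guard against unbounded recursion on duplicated locations

def build_quadtree(xy, idx, bbox, depth=0):
    """Recursively quarter the bounding box until each leaf holds a manageable
    number of observations: dense areas end up with small leaves (fine scale),
    sparse areas keep large leaves (coarse scale)."""
    if idx.size <= MAX_POINTS or depth >= MAX_DEPTH:
        return [(bbox, idx)]                    # leaf: (region, observation ids)
    x0, y0, x1, y1 = bbox
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    leaves = []
    for qx0, qy0, qx1, qy1 in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                               (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        in_quad = ((xy[idx, 0] >= qx0) & (xy[idx, 0] < qx1) &
                   (xy[idx, 1] >= qy0) & (xy[idx, 1] < qy1))
        child = idx[in_quad]
        if child.size >= MIN_POINTS:
            leaves += build_quadtree(xy, child, (qx0, qy0, qx1, qy1), depth + 1)
        elif child.size > 0:
            leaves.append(((qx0, qy0, qx1, qy1), child))
    return leaves

def fit_ensemble(xy, y, n_partitions=10, rng=None):
    """Fit one local model per leaf for several randomly padded partitions.
    The 'model' here is just the regional mean of the response; in practice a
    user-specified supervised learner would be trained on the predictors."""
    if rng is None:
        rng = np.random.default_rng(0)
    ensemble = []
    for _ in range(n_partitions):
        # Pad the bounding box by a random amount so that leaf boundaries
        # differ across ensemble members (each member is one partition).
        pad = rng.uniform(0.01, 0.25, size=4)
        bbox = (xy[:, 0].min() - pad[0], xy[:, 1].min() - pad[1],
                xy[:, 0].max() + pad[2], xy[:, 1].max() + pad[3])
        leaves = build_quadtree(xy, np.arange(len(y)), bbox)
        ensemble.append([(region, y[ids].mean()) for region, ids in leaves])
    return ensemble

def predict(ensemble, s):
    """Recombine: average the predictions of every leaf whose region contains
    the query location s, pooling across all ensemble members."""
    preds = [pred for member in ensemble
             for (x0, y0, x1, y1), pred in member
             if x0 <= s[0] < x1 and y0 <= s[1] < y1]
    return float(np.mean(preds)) if preds else float("nan")

# Toy usage: 5,000 unevenly distributed observations of a smooth spatial signal.
rng = np.random.default_rng(1)
xy = rng.beta(2.0, 5.0, size=(5000, 2))
y = np.sin(3 * xy[:, 0]) + np.cos(2 * xy[:, 1]) + rng.normal(0, 0.1, 5000)
model = fit_ensemble(xy, y)
print(predict(model, (0.2, 0.3)))
```

Because each leaf model is fit independently of every other leaf, the ensemble members can be trained concurrently, which is what makes a pleasingly parallel implementation possible.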
We illustrate the use of AdaSTEM with an analysis