policies to allocate the work to be performed across
available hardware threads and compared these
against the baseline run time of the original single-threaded algorithm. All implementations, other than
the baseline, were built to be configurable in the
number of worker threads.
Some of our policies have the main thread distribute work to the worker threads, in which case the set
of explanations after each observation is collected
and redistributed to threads on the next observation.
The others have the worker threads pull work when
they are otherwise idle. This means these schedulers
do not need to have all the worker threads complete
their work and fall idle after each observation but can
instead keep all threads working until all the observations have been processed. We will highlight these
distinctions for each of the five implemented policies.
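Before turning to the individual policies, the sketch below shows one way such a family of schedulers could sit behind a common interface. This is purely illustrative; the type and member names (Scheduler, WorkUnit, distribute, fetch) are invented for the sketch and are not identifiers from the implementation.

#include <cstddef>
#include <vector>

// Illustrative unit of work: a batch of explanations to be extended
// against the next observation.
struct WorkUnit {
    std::vector<int> explanation_ids;  // placeholder for explanation handles
};

// Push-style schedulers have the main thread call distribute() after each
// observation; pull-style schedulers let idle workers call fetch() themselves.
class Scheduler {
public:
    virtual ~Scheduler() = default;
    virtual void set_worker_count(std::size_t n) = 0;
    virtual void distribute(std::vector<WorkUnit> work) = 0;
    virtual bool fetch(WorkUnit& out) = 0;  // returns false when no work remains
};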
First, the baseline implementation is the original
implementation, albeit with the thread-safety guarantees in place. This involved ensuring the reference-counting implementation used for releasing memory
was thread safe, and guaranteeing that the shared
data structures used were not modified for the duration of the search.
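As an illustration of the kind of change this required, the fragment below sketches a reference count made safe for concurrent retain and release by using an atomic counter. It is a minimal sketch of the general technique, not the memory manager itself.

#include <atomic>

// Minimal thread-safe reference counting, assuming the counted object is
// not modified while shared (as the search requires of shared data).
class RefCounted {
public:
    void retain() { count_.fetch_add(1, std::memory_order_relaxed); }

    void release() {
        // The acquire/release pairing ensures all writes to the object
        // happen before its deletion, regardless of which thread drops
        // the last reference.
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
        }
    }

protected:
    virtual ~RefCounted() = default;

private:
    std::atomic<int> count_{1};
};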
Second, the naive scheduler (Herlihy and Shavit
2012) implementation is a proof of concept for multithreading the algorithm; it spawns a new thread for
each unit of work to be scheduled, and the thread is
destroyed when the unit of work is completed. For
each observation, one unit of work is produced for
each thread, and the set of explanations is shared
equally between units of work.
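A minimal sketch of this thread-per-work-unit pattern follows; process_explanations is a hypothetical stand-in for the step that extends one slice of the explanation set against the current observation.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the real search step over one slice of the explanation set.
void process_explanations(const std::vector<int>& slice) {
    (void)slice;  // the real implementation generates new explanations here
}

// Naive scheduler: spawn one short-lived thread per unit of work and
// join (destroy) them all before the next observation is processed.
void process_observation_naive(const std::vector<int>& explanations,
                               unsigned num_workers) {
    std::vector<std::thread> workers;
    const std::size_t chunk =
        (explanations.size() + num_workers - 1) / num_workers;
    for (unsigned i = 0; i < num_workers; ++i) {
        const std::size_t begin = i * chunk;
        if (begin >= explanations.size()) break;
        const std::size_t end = std::min(begin + chunk, explanations.size());
        std::vector<int> slice(explanations.begin() + begin,
                               explanations.begin() + end);
        workers.emplace_back(process_explanations, std::move(slice));
    }
    for (auto& t : workers) t.join();
}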
Third, the blocking scheduler (Herlihy and Shavit
2012) gives each worker thread a queue, and the
main thread distributes work to these queues on each
observation. Threads can block on an empty work
queue instead of repeatedly having to check the
queue. As in the naive scheduler, explanations are
redistributed equally among threads on each new observation.
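The essential mechanism is a per-worker queue built on a mutex and a condition variable, along the lines of the sketch below; shutdown signaling and the redistribution step are omitted for brevity.

#include <condition_variable>
#include <deque>
#include <mutex>

// One blocking queue per worker: the main thread pushes work after each
// observation, and the worker sleeps on the condition variable while
// its queue is empty.
template <typename Work>
class BlockingQueue {
public:
    void push(Work w) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(w));
        }
        not_empty_.notify_one();
    }

    // Blocks until an item is available (a complete queue would also
    // support a shutdown signal so workers can exit cleanly).
    Work pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        Work w = std::move(items_.front());
        items_.pop_front();
        return w;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<Work> items_;
};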
Fourth, the global queue (Herlihy and Shavit 2012)
scheduler uses a single multiple-producer, multiple-consumer work queue shared between all the threads
and guarded by a mutex at both ends. Worker threads
push new work into this queue as they produce new
explanations and fetch work from this queue when
they fall idle. This policy has a second configurable
parameter, the batch size, which specifies the maximum number of explanations to be added to a unit
of work to be scheduled. The larger the batch size, the
fewer units of work we need to schedule when processing, but the more potential there is for missed
parallelism due to underutilization. By measuring the
run time with different batch sizes, we determined a
batch size of 32 to be adequate, although larger values may be preferable for large problems.
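The batching behavior can be sketched as follows: each worker buffers the explanations it produces and pushes a unit of work onto the shared queue once the batch limit is reached. The single mutex and the kBatchSize constant are simplifications for this sketch; the actual queue is locked at both ends.

#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Shared multiple-producer, multiple-consumer queue (simplified to a
// single lock here); each batch is one unit of work.
struct GlobalQueue {
    std::mutex mutex;
    std::queue<std::vector<int>> batches;

    void push(std::vector<int> batch) {
        std::lock_guard<std::mutex> lock(mutex);
        batches.push(std::move(batch));
    }

    bool try_pop(std::vector<int>& out) {
        std::lock_guard<std::mutex> lock(mutex);
        if (batches.empty()) return false;
        out = std::move(batches.front());
        batches.pop();
        return true;
    }
};

// Workers buffer new explanations and publish them in batches of at
// most kBatchSize (32 worked well in our measurements).
constexpr std::size_t kBatchSize = 32;

void publish(GlobalQueue& queue, std::vector<int>& local_buffer,
             int explanation) {
    local_buffer.push_back(explanation);
    if (local_buffer.size() >= kBatchSize) {
        queue.push(std::move(local_buffer));
        local_buffer.clear();  // leave the buffer valid for reuse after the move
    }
}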
Fifth and finally, the work-stealing (Blumofe and
Leiserson 1999) scheduler gives each worker thread a
queue. When worker threads produce new explana-
tions, they schedule new work units into their own
queue, and threads that run out of work can steal
work from other threads’ queues. We implemented a
lockless work-stealing deque due to Chase and Lev (2005).
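The overall pattern is sketched below. For brevity the per-thread deque is shown guarded by a mutex; the actual implementation substitutes the lock-free Chase-Lev deque, in which the owning thread pushes and pops at the bottom and idle threads steal from the top.

#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Per-worker deque: the owner works at the bottom, thieves steal from
// the top. A mutex stands in for the lock-free Chase-Lev deque.
template <typename Work>
class WorkDeque {
public:
    void push_bottom(Work w) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(w));
    }
    std::optional<Work> pop_bottom() {   // called only by the owning thread
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return std::nullopt;
        Work w = std::move(items_.back());
        items_.pop_back();
        return w;
    }
    std::optional<Work> steal_top() {    // called by idle threads
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return std::nullopt;
        Work w = std::move(items_.front());
        items_.pop_front();
        return w;
    }
private:
    std::mutex mutex_;
    std::deque<Work> items_;
};

// Worker loop: run local work first, then try to steal from the others.
template <typename Work, typename Process>
void worker_loop(std::size_t self, std::vector<WorkDeque<Work>>& deques,
                 Process process) {
    for (;;) {
        if (auto w = deques[self].pop_bottom()) { process(*w); continue; }
        bool stole = false;
        for (std::size_t victim = 0; victim < deques.size(); ++victim) {
            if (victim == self) continue;
            if (auto w = deques[victim].steal_top()) {
                process(*w);
                stole = true;
                break;
            }
        }
        if (!stole) break;  // a full scheduler would also detect global termination
    }
}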
We tested the performance of the schedulers described
above on three domains. First, a simplified robotic
kitchen-cleaning domain involving picking up
objects and putting them away (XPER). This domain
is based on the European Union-FP7 XPERIENCE
robotics project. Second, a logistics domain (LOGIS-
TICS), involving the transporting of packages between
cities using trucks and airplanes. This domain is based
on a domain in the First International Planning Com-
petition (Long and Fox 2003). Third and finally, a
cyber-security-based domain (CYBER) based on recog-
nizing the actions of hostile cyber attackers in a cloud-
based network computing environment.
For each domain a problem with a run time
between a second and a minute for the baseline algorithm was generated by hand. This problem was then
presented to each of the algorithms running on a
multiprocessor machine using 1 to 12 cores. We will
present data on the speedup of each algorithm on
the problem, defined as the single threaded run time
divided by the run time with a larger number of
threads. Ideally we would like to achieve linear
speedup (speedup equal to the number of threads).
In the following graphs, we compute the speedup
against the baseline run time of the original algorithm. This tells us how much faster we processed
the input compared to using a single-threaded
implementation. In later figures, where the baseline
implementation is not included, we instead compute
the speedup by comparing the run time for a single
thread and the run time for the current number of threads.
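In symbols, with $T_{\text{baseline}}$ the run time of the original single-threaded algorithm, $T_1$ a scheduler's run time with one thread, and $T_p$ its run time with $p$ threads:

\[
\text{speedup}_{\text{baseline}}(p) = \frac{T_{\text{baseline}}}{T_p},
\qquad
\text{speedup}_{\text{self}}(p) = \frac{T_1}{T_p}.
\]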
Figures 4, 5, and 6 show the average speedup for
each scheduler while varying the number of threads.
Each data point was generated by averaging 20 runs.
Comparing the results for different schedulers on all
three problem domains, the work-stealing scheduler
is the clear winner; the next best scheduler varies
depending on the domains, but the work-stealing
scheduler dominates the others.
The work-stealing scheduler does this by ensuring
threads that are starved for work can rapidly find
more, and the lockless work-stealing deque implementation has very low overhead. Given this convincing success, the remainder of our experiments
focused on the work-stealing scheduler.
Figure 7 compares the speedups achieved on all
three domains using the work-stealing scheduler.
The algorithm performs significantly worse on the
CYBER domain than the XPER and LOGISTICS