vice to both internal and external customers. The
core R&D components of the SKILL system (see figure
1) are similar in design to previous NEN efforts as
well as to systems that generate taxonomies from
data sources using knowledge bases such as
Wikipedia. Both the taxonomy generation and WSD
phases are offline processes that can be scheduled to
run on demand. The taxonomy generation phase
makes extensive use of data scraping, cleaning, and
extraction scripts that run over millions of job postings.
Since disambiguation is a computationally intensive
process that usually involves clustering, it was also
designed as an offline batch job. The skill tagging
algorithm, by contrast, was developed to support the near real-time requirements of a web service.
On the data engineering side, the SKILL service
was developed as a Java 5 web service using the
servlets framework. Library file generation is done
offline ahead of time through a separate process.
These libraries are loaded into memory at web service startup time. The service is RESTful: it holds no
state and it always returns the same response for a
given request. The service accepts HTTP GET and
POST requests and handles the two identically; POST
support is provided exclusively as a means of accepting larger payloads (most of our customers use POST
as the default for all skill extraction requests).
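The stateless, method-agnostic behavior described above can be illustrated with a minimal sketch. This is not the service's actual servlet code: the handle function, the tag_skills stand-in, and its two-entry lookup are hypothetical names introduced here, and the parameter name content is assumed from the request description. The point is only that GET and POST differ in where the parameters arrive, after which both take the same code path and nothing is stored between calls.

```python
from urllib.parse import parse_qs


def tag_skills(content):
    # Hypothetical stand-in for the real tagger: a fixed, read-only lookup,
    # so the result depends only on the input text.
    known = {"java": "Java", "sql": "SQL"}
    return [known[w] for w in content.lower().split() if w in known]


def handle(method, query_string="", body=""):
    # GET carries parameters in the query string, POST in a form-encoded body.
    if method == "GET":
        raw = query_string
    elif method == "POST":
        raw = body
    else:
        raise ValueError("unsupported method: " + method)
    # From here on the two methods are handled identically.
    params = parse_qs(raw)
    content = params.get("content", [""])[0]
    return tag_skills(content)
```

Calling handle("GET", query_string="content=java+and+sql") and handle("POST", body="content=java+and+sql") yields the same list, and repeating either call yields it again, since no state is kept between requests.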
An incoming request must contain a content
string with the text upon which skill extraction
should be performed. A request may also optionally
provide a language string (we support skill tagging in
22 languages), a threshold decimal value between 0
and 1 for controlling minimum relevancy, and an
auto_thres Boolean value that controls the extractor’s
behavior on inputs containing 150 or fewer words.
After successful extraction, the service returns a JSON
payload containing an array of extracted skill objects.
Each skill object contains a unique identifier code, its
normalized term, a confidence score between 0 and
1, and a skill type.
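As a rough client-side sketch of the request and response shapes just described: only auto_thres is a parameter name given in the text; the names content, language, and threshold, the JSON keys skills, id, term, confidence, and type, and every value in SAMPLE_RESPONSE are assumptions made for illustration, not output from the actual service.

```python
import json
from urllib.parse import urlencode

# Illustrative payload only; keys and values are assumed, not real service output.
SAMPLE_RESPONSE = """
{"skills": [
  {"id": "SK-0001", "term": "Java", "confidence": 0.93, "type": "TYPE_A"},
  {"id": "SK-0002", "term": "Teamwork", "confidence": 0.71, "type": "TYPE_B"}
]}
"""


def build_params(content, language=None, threshold=None, auto_thres=None):
    # content is the only required field; the others are optional.
    if not content:
        raise ValueError("content is required")
    params = {"content": content}
    if language is not None:
        params["language"] = language
    if threshold is not None:
        # The threshold is a decimal between 0 and 1 (minimum relevancy).
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        params["threshold"] = threshold
    if auto_thres is not None:
        params["auto_thres"] = "true" if auto_thres else "false"
    return params


def parse_skills(payload, min_confidence=0.0):
    # Return (id, term, confidence, type) tuples, optionally filtered client-side.
    skills = json.loads(payload)["skills"]
    return [
        (s["id"], s["term"], s["confidence"], s["type"])
        for s in skills
        if s["confidence"] >= min_confidence
    ]
```

The dictionary from build_params can be passed through urlencode to form either a GET query string or a POST body, and parse_skills(SAMPLE_RESPONSE, min_confidence=0.8) keeps only the first of the two illustrative skills.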
Development and Deployment
The core system was developed by two data scientists
over a period of one year. The taxonomy component
was developed in R, Hadoop, and C (for the
word2vec word vectors), while the WSD component
was developed in C++ to alleviate the computational
cost of large-scale clustering processes. Since the skill
tagging algorithm is a core component of the
deployed SKILL service, it was implemented in Java
to take advantage of CB’s deployment and scaling
Over time, the technical implementation of the
service has evolved. The initial implementation was
not written with clean code principles in mind; over
time, functions have been shortened, new functions
and classes have emerged, redundancies have been
eliminated, and variables, functions, and classes have
Table 5. Skill Tagging Results per Confidence Score Level.
The correlation between confidence score and approval rate is 0.81.
Relevancy Score   Approved Skills   Total Skills   Approval Rate
.95               130               149            .8725
.90               316               371            .8518
.85               455               546            .8333
.80               317               407            .7897
.75               109               146            .7466
.70               8                 12             .6667
Figure 3. Scaling Relevancy Score by Beta Distribution.