also find application in compensation analytics,
which helps quantify the value of specific skills and
assists in improving employee wages.
Several skills taxonomies and skill extraction
systems exist. ESCO4 is a European Commission project to
categorize skills, occupations, and other relevant
competencies. It aims to provide semantic interoperability between labor markets and education and
training programs. No information is available on
what techniques and methodologies were used to
create the ESCO taxonomies. The approach discussed
in Kivimäki et al. (2013) uses the LinkedIn skills taxonomy in conjunction with the spreading activation
algorithm applied to the Wikipedia hyperlink graph
to extract both inferred and explicitly stated skills
from text. A skill inference model based on social
graph connections is discussed by Wang et al. (2014).
This approach also uses data from LinkedIn and
builds a factor graph model using textual information contained in the skills and expertise section, personal profile connections (shared majors, titles, companies, and universities), and skill connections (skills
that co-occur). While the model based on
skill connections is more accurate than the one that
uses only profile connections, the joint model that
uses both connection types gives the best results.
The LinkedIn Skills system (Bastian et al. 2014)
uses a data-driven approach to build a skills folksonomy. The folksonomy-building pipeline consists of
discovery, disambiguation, and deduplication steps.
The system also includes a skills inference component, which uses profile attributes such as company,
title, and industry (among others) as features. The
approach is similar to the skill inference model presented by Wang et al. (2014) except that it uses a
Naïve Bayes model instead of a graph-based one. Skill recommendation and inference also find application in
talent management in large enterprises. Varshney et
al. (2013) discuss a matrix factorization–based
approach to skill recommendation. This approach
also leverages employee data from enterprise social
networking tools, human resources (HR), and management data.
Our previous work (Zhao et al. 2015) gave an
overview of an early version of SKILL, a system that
utilized a novel approach to named entity normalization (NEN) of occupational skills by leveraging
properties of semantic word vectors. In this article,
we build on that work and provide more details on
the deployed SKILL system; more specifically, we discuss the skill tagging algorithm as well as the skill
entity sense disambiguation component. As the system has been in production for over a year, we also
discuss the wide range of use cases at CareerBuilder5
(CB), end-user feedback that resulted in improvements to the system, and best practices and lessons
learned from deploying and maintaining the system
in production for a global customer base.
SKILL System Overview
The key challenges of our task are summarized
below. An effective skill system should be able to:
1. Recognize skill entities from both job postings and
resumes. These sources are semistructured and may
contain varying degrees of noise.
2. Handle name variations. The skill entity “artificial
intelligence” can appear in plural form (“artificial intelligences”) or in acronym form (“AI”), and it
might contain typos (“artificially intelligent”).
3. Leverage semantic context to recognize unspecified
skill entities. A statistician job posting, for example,
might contain correlation analysis and multivariate
regression skill entities, as well as other skills required
by PhD- or master's-level work, but not logistic regression or hypothesis testing. These unspecified skills
should be recognized with reasonable confidence.
4. Reduce false positives in tagging skills with multiple
senses. The term “stock,” for example, has meaning
both in the context of food preparation and the context of finance.
Some of these challenges (1 and 3) are unique to the
recruitment domain, while others (2 and 4) also exist
in other typical NEN tasks. In this section, we describe
the workings of the SKILL system, which aims to
address all of these challenges. Figure 1
summarizes our system architecture. On the left, we
present the skill taxonomy generation. Once this task
is completed, we employ the resulting skill library for
the tagging task, as shown on the right of the figure.
Skill Taxonomy Generation
Collect. To generate candidate skills, we collect skill-related content from over 60 million candidate
resumes and 1.6 million job postings available at the
CB online career site. For resumes, the selected section can be
skills, technical skills, or technical proficiency; for job postings,
it is the requirements section. We do not
interpret the meaning of the extracted content at this point, as the goal of this step is to capture
as much skill data as possible.
Clean. We split text by punctuation, then remove
any noise. The predefined noise dictionary contains
stop words,6 country and city names, additional
adverbs and adjectives, and other terms chosen
using domain expertise. Our goal here is to discard the
ubiquitous words that contribute little to no semantic value in building the skill taxonomy.
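The Clean step can be illustrated with a minimal sketch. The noise dictionary below is a tiny hypothetical stand-in (the actual dictionary is not given in this article), and the set of delimiter characters and the function name `clean` are likewise assumptions made for illustration.

```python
import re

# Hypothetical noise dictionary: a few stop words, adjectives, and place
# names. The real dictionary is far larger and is not listed in the text.
NOISE_TERMS = {"and", "the", "of", "in", "with", "strong", "excellent", "usa", "chicago"}

# Assumed delimiters: common list punctuation, plus a period only when it
# ends a token (so that terms such as "Node.js" stay intact).
_SPLIT_PATTERN = re.compile(r"[,;:|()\[\]\n]+|\.(?=\s|$)")

def clean(section_text):
    """Split a skills section by punctuation and drop noise words."""
    candidates = []
    for chunk in _SPLIT_PATTERN.split(section_text):
        # Lowercase, tokenize on whitespace, and remove noise terms.
        tokens = [t for t in chunk.lower().split() if t not in NOISE_TERMS]
        if tokens:
            candidates.append(" ".join(tokens))
    return candidates

raw_terms = clean("Excellent Java, Python; strong SQL skills in Chicago.")
```

The output is a list of raw candidate terms in document order; duplicates are deliberately kept here, since deduplication is handled in a later step.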
Call. After gathering raw terms (also known as
surface forms), we call the Wikipedia API7 for normalization and deduplication. We perform an open search action
using seed phrases as the input query, followed by a
query action for associated Wikipedia documents, if
any, and then collect category tags and redirections.
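As a rough sketch of the two requests described in the Call step, the snippet below builds the request URLs without sending them. It assumes the public MediaWiki endpoint for English Wikipedia and a particular choice of parameters (`limit`, `prop=categories|redirects`); the article does not state which parameters the deployed system uses, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public MediaWiki Action API for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def opensearch_url(seed_phrase, limit=5):
    """Build an opensearch request that looks up pages matching a seed phrase."""
    params = {
        "action": "opensearch",
        "search": seed_phrase,
        "limit": limit,  # assumed value; not specified in the article
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def page_details_url(title):
    """Build a query request for a page's category tags and redirects."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "categories|redirects",  # category tags and pages redirecting here
        "redirects": 1,                  # resolve the title if it is itself a redirect
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

search_request = opensearch_url("machine learning")
details_request = page_details_url("Machine learning")
```

In this sketch, surface forms that resolve to the same Wikipedia page (directly or through a redirect) could be treated as variants of one skill, which is one plausible reading of how the API supports normalization and deduplication.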
Validate. The goal of this step is to retain surface
forms directly related to occupational skills. We rely
on keywords from the standard occupational classification (SOC) system8 to validate the returned