Wikipedia category tags and make a decision on the
qualification of the surface form as a skill. Overall, an
input query is considered a skill surface form if
its resulting Wikipedia document title category tags
pass the SOC keyword screening.
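The screening step can be sketched as a keyword test over category tags. This is a minimal illustration only: the keyword set, the function name `passes_soc_screening`, and the rule that a single matching tag is enough are all assumptions, since the text does not list the SOC-derived keywords or the exact pass criterion.

```python
# Hypothetical screening keywords; the actual SOC-derived list is not given in the text.
SOC_KEYWORDS = {"software", "programming", "engineering", "management"}


def passes_soc_screening(category_tags, keywords=SOC_KEYWORDS):
    """Return True if any category tag shares a whole word with the keyword set.

    Assumption: one matching tag is sufficient to qualify the surface form.
    """
    for tag in category_tags:
        tokens = set(tag.lower().split())
        if tokens & keywords:
            return True
    return False


# Example with made-up category tags for two Wikipedia documents.
print(passes_soc_screening(["Object-oriented programming languages"]))
print(passes_soc_screening(["Snakes of Asia"]))
```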
Disambiguate. The objective here is to address the
word sense disambiguation (WSD) problem. For
example, a surface form may link to multiple qualified
Wikipedia documents, and hence to multiple normalized
skills. Our initial approach to WSD used the
Google Search API.⁹ Given a surface
form with multiple senses, we selected the sense with the
highest Google Search ranking (by relevancy). This
approach, however, has the obvious weakness of
not considering semantic context, which led us to
develop a more robust approach for the WSD task,
described in detail in the next section.
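As a rough sketch of the initial ranking-based heuristic, the function below picks the candidate sense that appears highest in a relevancy-ordered result list. No search API is called here: `search_ranking` is a stubbed list standing in for the returned results, and the function name and example titles are hypothetical.

```python
def pick_sense_by_search_rank(candidates, search_ranking):
    """Pick the candidate Wikipedia title ranked highest in the search results.

    candidates: candidate document titles for one surface form.
    search_ranking: titles ordered by relevancy, best first (stubbed input).
    Returns None when no candidate appears in the ranking.
    """
    rank = {title: position for position, title in enumerate(search_ranking)}
    ranked = [c for c in candidates if c in rank]
    if not ranked:
        return None
    return min(ranked, key=rank.__getitem__)


# Illustrative data only.
senses = ["Java (island)", "Java (programming language)"]
results = ["Java (programming language)", "Java", "Java (island)"]
print(pick_sense_by_search_rank(senses, results))
```

Note that the function never looks at the document in which the surface form occurs, which is exactly the weakness described above.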
Skill Library. In our current taxonomy, there are
39,000 surface forms mapped to 26,000 normalized
skill entities. Each skill entity contains a unique
identification code (skill ID), its raw term (or surface
form), its normalized term, a vector of related surface
forms, the corresponding vector of cosine
similarities, and a skill type (such as hard skill, soft
skill, or certification). See Table 1 for an illustration
of a typical skill entity in our skill library.
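One possible in-memory layout for such a skill entity is sketched below. The field names, the `top_related` helper, and the example values are assumptions for illustration; they are not taken from Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SkillEntity:
    skill_id: str                     # unique identification code
    surface_form: str                 # raw term
    normalized_term: str              # normalized skill name
    related_surface_forms: List[str] = field(default_factory=list)
    similarities: List[float] = field(default_factory=list)  # cosine similarities
    skill_type: str = "hard skill"    # e.g. hard skill, soft skill, certification

    def __post_init__(self):
        # The two vectors are parallel, so their lengths must agree.
        if len(self.related_surface_forms) != len(self.similarities):
            raise ValueError("related_surface_forms and similarities differ in length")

    def top_related(self, k: int = 3) -> List[Tuple[str, float]]:
        """Return the k related surface forms with the highest similarity."""
        pairs = zip(self.related_surface_forms, self.similarities)
        return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]


# Made-up example values.
entity = SkillEntity(
    skill_id="S0001",
    surface_form="ml",
    normalized_term="Machine learning",
    related_surface_forms=["deep learning", "statistics", "data mining"],
    similarities=[0.82, 0.61, 0.74],
)
print(entity.top_related(2))
```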
Skill Tagging
Identify. We identify seed skills from a given input
document by a direct match in the taxonomy. We
break the input text into unigram tokens, assemble
n-grams sequentially, and then match them against
the existing taxonomy, which is stored as a hashmap
with surface forms as keys.
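The identification step described above can be sketched as follows. The tokenizer, the maximum n-gram length `max_n`, and the toy taxonomy are assumptions; the text specifies only unigram tokenization, sequential n-gram assembly, and lookup in a hashmap keyed by surface form.

```python
import re


def tag_seed_skills(text, taxonomy, max_n=3):
    """Return (surface form, normalized skill) pairs found by direct match.

    taxonomy: dict mapping lowercase surface forms to normalized skills.
    max_n: longest n-gram to assemble (an assumed limit).
    """
    # Break the input into lowercase unigram tokens (simple assumed tokenizer).
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    matches = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            if start + n > len(tokens):
                break
            ngram = " ".join(tokens[start:start + n])
            if ngram in taxonomy:  # constant-time hashmap lookup
                matches.append((ngram, taxonomy[ngram]))
    return matches


# Toy taxonomy for illustration.
taxonomy = {
    "python": "Python (programming language)",
    "machine learning": "Machine learning",
}
print(tag_seed_skills("Experienced in Python and machine learning.", taxonomy))
```

This sketch keeps every matching n-gram; whether overlapping matches should be merged or filtered is left to the later relevancy and filtering steps.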
Figure 1. Architecture of the SKILL System.
[Figure: two stages. Taxonomy generation: collect skill-relevant contents, clean predefined noises, call Wikipedia API, validate normalized terms, disambiguate multisense terms. Tagging: identify raw terms, compute relevancy score, filter irrelevant / ambiguous skills, return normalized skills. Components shown: database of resumes / jobs, raw skill terms, skills library, input document (resume / job).]