ESCO index
ESCO data
The full dataset in all languages was downloaded from the ESCO download page (as .csv files) and is included in this repository. Some files may be missing for some languages.
The following files are located in the directory esco/data/{lang_code}.
List of files in each language directory:
Skills:
skillsHierarchy (Defines the hierarchy of skills that have sub-skills, i.e. parent and child skills)
skills (Skills that do not have children, i.e. 'final skills')
broaderRelationsSkillPillar (Defines each skill and its parent skill)
digCompSkillsCollection (Collection of skills related to DigComp)
digitalSkillsCollection (Collection of digital skills)
greenSkillsCollection (Collection of skills related to ecological skills)
languageSkillsCollection (Final language skills, i.e. those that do not have children, and their parents)
researchSkillsCollection (Collection of skills related to research and their parents)
skillGroups (Skills that define a group or subgroup, like skillsHierarchy)
skillSkillRelations (Relations between a skill and another skill; check the 'optional Knowledge' and 'optional for' relation types)
transversalSkillsCollection ('Final transversal skills' and their parent skills)
Occupations:
broaderRelationsOccPillar (Defines each occupation and its parent occupation)
ISCOGroups (Defines the hierarchy of occupations, parent and child occupations)
occupations (All occupations)
researchOccupationsCollection (Occupations related to research and their parents)
Other files:
conceptSchemes (English only; collection of schemas related to ESCO, ISCO and DigComp)
occupationSkillRelations (Relations between occupations and the skills that each occupation has)
Files needed to retrieve all skills & occupations: skills, occupations & occupationSkillRelations
ESCO src
This directory contains all the implementation needed to process the downloaded ESCO data from the .csv files in order to populate the esco_occupations_sbert and esco_skills_sbert indexes. The implementation ONLY makes use of the official languages of the EU; those languages are defined in a file within esco/src/lib/data.
All the logic can be found in the executable file escoIndexImporter.js, which is in charge of processing the ESCO data and sending it to the OpenSearch service for storage.
ESCO src/lib
This directory contains the subdirectories in charge of the configuration of the script.
ESCO src/lib/data
This directory contains the JavaScript files in charge of configuring static values: the languages, the language files and the index names.
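As an illustration, such a configuration module could look like the following sketch (the constant names, the helper function and the values shown are hypothetical, not the actual contents of esco/src/lib/data):

```js
// Hypothetical sketch of a static configuration module like the ones
// in esco/src/lib/data (names and values are illustrative assumptions).

// 2-letter codes of the official EU languages (subset shown for brevity).
export const LANGUAGES = ['en', 'es', 'fr', 'de', 'it'];

// Builds the path of a data file for a given language,
// e.g. occupations_en.csv inside esco/data/en.
export function dataFilePath(fileName, langCode) {
  return `esco/data/${langCode}/${fileName}_${langCode}.csv`;
}

// Names of the OpenSearch indexes populated by the importer.
export const OCCUPATIONS_INDEX = 'esco_occupations_sbert';
export const SKILLS_INDEX = 'esco_skills_sbert';
```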
ESCO src/lib/service
This directory contains the JavaScript files in charge of configuring the Axios HTTP client and performing the HTTP requests to the different API endpoints.
To perform HTTP requests to the Sbert_AI API it is mandatory to first define an environment variable called SBERT whose value is the protocol, domain and port, e.g. http://localhost:5000.
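A minimal sketch of such a client, assuming only what is stated above (the client name and the timeout value are illustrative assumptions):

```js
import axios from 'axios';

// The SBERT environment variable must hold protocol, domain and port,
// e.g. 'http://localhost:5000' (see above).
if (!process.env.SBERT) {
  throw new Error('The SBERT environment variable is not defined');
}

// Hypothetical Axios client for the Sbert_AI API; the timeout
// is an illustrative assumption, not taken from the repository.
export const sbertClient = axios.create({
  baseURL: process.env.SBERT,
  timeout: 60_000,
});
```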
Run the implementation
To run the script we recommend using Node.js version 16.19.1 in order to avoid conflicts with ESM modules.
To list the installed Node versions and see which one is active use the command:
nvm ls
To install the recommended version use the command:
nvm install 16.19.1
To change the current node version use the command:
nvm use <node version>
Run locally the docker-compose file of the Sbert_AI repository. To perform HTTP requests to the Sbert_AI API the environment variable SBERT must be defined with the value of the protocol, domain and port. To set the environment variable run the following command in the CLI:
export SBERT=http://localhost:5000
To run the script you only need to run the command:
yarn start
from the directory opensearch/esco. This command will install all the needed dependencies and run the escoIndexImporter.js file.
Libraries needed
To perform all the tasks needed to process and insert the data into the OpenSearch API, the following libraries are used:
Axios: A promise-based HTTP client in charge of sending and handling the HTTP requests sent to the OpenSearch API.
CSV parser: A library that converts CSV files into JSON. It is needed to read the ESCO skills & occupations from each .csv file and parse them into JSON so the data can be processed easily.
Sbert_AI API: An AI API that vectorises input text, such as a skill description.
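As an example of how such parsing is typically done (assuming the csv-parser npm package; the helper and the file path are illustrative, not taken from the repository), a .csv file can be streamed into JSON objects roughly like this:

```js
import fs from 'node:fs';
import csv from 'csv-parser';

// Reads every row of a .csv file into an array of plain objects,
// one field per column (sketch; the path below is just an example).
function readCsv(path) {
  return new Promise((resolve, reject) => {
    const rows = [];
    fs.createReadStream(path)
      .pipe(csv())
      .on('data', (row) => rows.push(row))
      .on('end', () => resolve(rows))
      .on('error', reject);
  });
}

const occupations = await readCsv('esco/data/en/occupations_en.csv');
console.log(`Read ${occupations.length} occupations`);
```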
Populate ESCO index
In order to create and populate the esco_occupations_sbert and esco_skills_sbert indexes with occupations and skills, run the script escoIndexImporter.js located in esco/src (check the documentation in the section Run the implementation).
Once the script is executed, it first asks the user for the credentials needed to send HTTP requests to the OpenSearch API. If the username or password is not provided, the script execution ends. If the provided credentials are wrong, the request to the OpenSearch API returns a 401 error code and the execution ends.
After the user enters the credentials, the script reads the data of the file occupations_${langCode} stored in esco/data/<lang_dir>. If any target language is defined in the selectLanguages function of escoIndexImporter.js, only the data of those language directories is read; otherwise the data of all the files is read and loaded into the machine's memory (the English language is always mandatory).
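A sketch of that selection behaviour (the real selectLanguages in escoIndexImporter.js may differ; this is only an assumption based on the description above):

```js
// Hypothetical sketch: returns the language codes whose data files
// should be read. English is always included because it is mandatory.
function selectLanguages(allLanguages, targetLanguages = []) {
  if (targetLanguages.length === 0) return allLanguages; // read everything
  return [...new Set(['en', ...targetLanguages])];
}
```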
ESCO occupations index
Once the data of the files is loaded into memory, the script first checks whether an index named esco_occupations_sbert exists; if it does not, it is created via an HTTP PUT request with a configuration that creates a KNN OpenSearch index. After that first step, all the data of occupations_${langCode} stored in esco/data/<lang_dir> is processed, building one object per row: for each column of the file a field is created in the object with the same name as the column and the value of the current row. A field called langCode is then added to each object, with a 2-letter language code that identifies which language the occupation comes from. As the last step of the process, the values of "preferredLabel", "altLabels" and "hiddenLabels" are merged into a single string using a "\n" separator between each value, and this string together with the description of each occupation is sent to the given Sbert_AI API endpoint in order to get two vectors of 1024 dimensions, one for the description and one for the merged labels. Once the vectors are obtained they are added to two new fields called vector_labels and vector_description, the same fields given in the configuration when creating the esco_occupations_sbert index.
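For illustration, the index creation could be performed with a PUT request like the following sketch; the host, credentials and KNN settings are assumptions, while the field names and the 1024 dimensions come from the description above:

```js
import axios from 'axios';

// Sketch of the HTTP PUT that creates a KNN index like
// esco_occupations_sbert. The host and credentials are assumptions;
// only vector_labels, vector_description and the 1024 dimensions
// are taken from the description above.
await axios.put('https://localhost:9200/esco_occupations_sbert', {
  settings: { index: { knn: true } },
  mappings: {
    properties: {
      langCode: { type: 'keyword' },
      vector_labels: { type: 'knn_vector', dimension: 1024 },
      vector_description: { type: 'knn_vector', dimension: 1024 },
    },
  },
}, {
  auth: { username: 'admin', password: 'admin' }, // credentials asked at startup
});
```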
Vectorising all the occupation descriptions and labels of a language takes a long time, because a large number of HTTP requests are sent one by one to the Sbert_AI API.
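A sketch of those one-by-one calls; the /vectorize endpoint and the payload shape are assumptions, since only the "\n"-joined labels and the two resulting 1024-dimension vectors are described above:

```js
import axios from 'axios';

// Hypothetical sketch of the per-occupation vectorisation calls.
// The '/vectorize' path and request body are assumptions; the label
// merging with '\n' follows the description above.
async function vectorise(occupation) {
  const labels = [
    occupation.preferredLabel,
    occupation.altLabels,
    occupation.hiddenLabels,
  ].filter(Boolean).join('\n');

  const { data: vector_labels } =
    await axios.post(`${process.env.SBERT}/vectorize`, { text: labels });
  const { data: vector_description } =
    await axios.post(`${process.env.SBERT}/vectorize`, { text: occupation.description });

  return { ...occupation, vector_labels, vector_description };
}
```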
Finally, once the payload of each occupation is built, batches of 200 occupations are sent via HTTP POST requests to the OpenSearch API in order to populate the esco_occupations_sbert index. The batch requests are sent every 10 s by default, and after 10 successfully sent requests there is a 5-minute pause by default to give the index time to perform the indexing operations (aka "index refresh") on the data it has received; otherwise the index could be overwhelmed and close the socket, rejecting the next requests with a 429 error code or a timeout.
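A sketch of that batched population loop, using the standard OpenSearch _bulk API (the host and credentials are assumptions; the batch size and the pauses follow the defaults described above):

```js
import axios from 'axios';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: sends documents in batches of 200 via the _bulk NDJSON API,
// waiting 10 s between batches and 5 minutes after every 10 batches,
// as described above. Host and auth are assumptions.
async function populate(index, docs, auth) {
  for (let i = 0; i < docs.length; i += 200) {
    const batch = docs.slice(i, i + 200);
    const body = batch
      .flatMap((doc) => [JSON.stringify({ index: { _index: index } }), JSON.stringify(doc)])
      .join('\n') + '\n';
    await axios.post('https://localhost:9200/_bulk', body, {
      auth,
      headers: { 'Content-Type': 'application/x-ndjson' },
    });
    const batchNumber = i / 200 + 1;
    // 5-minute pause after every 10 successful batches, otherwise 10 s.
    await sleep(batchNumber % 10 === 0 ? 5 * 60_000 : 10_000);
  }
}
```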
ESCO skills_sbert index
After the population of the esco_occupations_sbert index, the same operations are performed with the skills: an index called esco_skills_sbert is created if it does not exist, and the same steps used for the esco_occupations_sbert index are performed. The main difference is that for the esco_skills_sbert index the file occupationsSkillsRelations_${langCode} stored in esco/data/<lang_dir> is also read, in order to get the relations between skills and occupations. During the data processing a field called relatedOccupations is added to each object, with the following structure: [ { "occupationUri": "URI", "relationType": "relation", "occupationPreferredLabel": "label" } ]
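A sketch of how the relatedOccupations field could be assembled from the relations file (the column names of the input rows are assumptions inferred from the structure shown above):

```js
// Hypothetical sketch: groups the occupation-skill relation rows by
// skill URI, producing the relatedOccupations array described above.
// The input column names are assumptions based on that structure.
function groupRelationsBySkill(relations) {
  const bySkill = new Map();
  for (const rel of relations) {
    const list = bySkill.get(rel.skillUri) ?? [];
    list.push({
      occupationUri: rel.occupationUri,
      relationType: rel.relationType,
      occupationPreferredLabel: rel.occupationPreferredLabel,
    });
    bySkill.set(rel.skillUri, list);
  }
  return bySkill; // skillUri -> relatedOccupations array
}
```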
Once the indexing process finishes there is a 30-minute break, in order to give the indexes time to complete all unfinished indexing tasks; then an HTTP POST request is sent to merge the segments of each index into a single one, to achieve better performance in the search operations of the indexes.
Sometimes this request takes a very long time to perform the merge operations, and the socket connection may automatically close after a while. This does not mean that the operation failed, only that the response was not available in time.
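For reference, the segment merge can be requested with the standard OpenSearch _forcemerge API, roughly as in this sketch (host and credentials are assumptions):

```js
import axios from 'axios';

// Sketch of the segment merge request described above, merging the
// segments of both indexes into a single one. Host and credentials
// are assumptions; _forcemerge is the standard OpenSearch API.
await axios.post(
  'https://localhost:9200/esco_occupations_sbert,esco_skills_sbert/_forcemerge?max_num_segments=1',
  null,
  { auth: { username: 'admin', password: 'admin' } },
);
```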