Data Curation Guidelines

Data Curation Guidelines

Ontology

Categorization

Adding new entities and relationships via UI

Step 1: Check for Duplicates

Step 2: Check for Wikidata entry

Step 3: Fill out basic information

Step 4: Add parent concept

Step 5: Add additional relationships

Adding new competencies and relationships via Excel import

Deactivating / deleting entities & relationships

Data Curation Guidelines

ProfileMap’s search and analysis features rely on a controlled knowledge base, the so-called ontology. The quality and completeness of this ontology strongly affect the quality of the searches and analyses performed within ProfileMap. ProfileMap’s User Manual documents the data curation functionality used for augmenting, correcting, and maintaining the ontology. This document adds further details and best practices to help users efficiently set up and maintain a comprehensive and effective ontology.

Ontology

An ontology is a graph used to represent knowledge. The nodes of the graph mostly correspond to entities. An entity can be a competency, a language, a certificate, or a client. Internally, translations and synonyms are also modelled as nodes, though this doesn’t affect the data curation work. The edges of the graph are directed and are called relationships. They define how two entities are related to each other. Relationships have types, and these types convey meaning.

For example, this graph represents the information stored in the ontology for the entity scrum master. The entity is called “scrum master” in English (see the term in the purple node in the middle) and also “scrum master” in German (see the red node), and it has three relationships to other entities. The INSTANCE_OF relationship indicates that the entity the edge originates from is a more specific kind of the entity the edge points to. So, in this case, “scrum master” is a specific kind of profession and a specific kind of role. The PART_OF relationship indicates that the entity the edge points to is made up of multiple parts, one of which is the entity the edge originates from. In this case, having a scrum master is part of the scrum process.
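The graph structure described above can be sketched in a few lines of Python. This is only an illustration of directed, typed edges and not ProfileMap’s internal data model; the class `Relationship`, the helper `targets`, and the target entity names (“profession”, “role”, “scrum”) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    source: str    # entity the edge originates from
    rel_type: str  # relationship type, e.g. INSTANCE_OF or PART_OF
    target: str    # entity the edge points to


# The three relationships of the "scrum master" example.
# The target names are assumed for illustration.
relationships = [
    Relationship("scrum master", "INSTANCE_OF", "profession"),
    Relationship("scrum master", "INSTANCE_OF", "role"),
    Relationship("scrum master", "PART_OF", "scrum"),
]


def targets(entity, rel_type, edges):
    """Return all entities that `entity` points to via edges of `rel_type`."""
    return [e.target for e in edges
            if e.source == entity and e.rel_type == rel_type]


print(targets("scrum master", "INSTANCE_OF", relationships))
print(targets("scrum master", "PART_OF", relationships))
```

Because the edges are directed, “scrum master” points to “role”, but “role” does not point back to “scrum master”.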

As described in the User Manual, the ontology is used in the search filter to allow ProfileMap to find candidates with indirectly matching profiles and affects the order in which the candidates are displayed.

Categorization

As described in the User Manual, the categorization (sometimes also referred to as taxonomy) of entities is used to structure the profile and to allow filtering and selecting entities based on the defined categories. Typically, the categorization is based on the customs and conventions of the respective customer. It is recommended to choose categories and subcategories that make it as natural as possible for users to find the entities they are looking for. Since this is customer-specific, we do not provide a general recommendation on how to categorize.

Adding new entities and relationships via UI

ProfileMap comes with a comprehensive ontology. In addition, each customer can adapt the ontology to their own requirements. Possible adaptations are adding or deleting entities and relationships. The functionality ProfileMap provides for this is described in the User Manual chapter “Data Curation”. This section provides further details on how to choose which entities, synonyms, and relationships to add or delete. It is also possible to add competencies using an Excel template. This option can be attractive if many competencies are to be added to the system at once, as might be the case during the rollout phase, especially if they come from domains not yet fully covered by ProfileMap’s standard ontology. The template is described in the section “Adding new competencies and relationships via Excel import”; many of the recommendations in this section apply to the Excel template as well.

Step 1: Check for Duplicates

Search for the name, translations, and synonyms of the entity using the entity filter on the data curation screen (see image) to make sure that the entity you want to add doesn’t exist in the system already. Only proceed if it doesn’t.

Step 2: Check for Wikidata entry

Search for the name, translations, and synonyms on https://www.wikidata.org/ to see whether a Wikidata entry for your entity exists. If so, copy the Wikidata-ID into the entity creation form.

You can search for terms in Wikidata using the search bar in the top right corner of the website. If no fitting Wikidata concept is found, leave the Wikidata ID field empty and continue with steps 3 to 5.
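When many terms have to be checked, for instance while preparing an Excel import, the lookup can also be done through Wikidata’s public `wbsearchentities` API instead of the search bar. The following sketch only builds the request URL; the helper name `wikidata_search_url` is an assumption, and sending the request and reading the JSON response are left out.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(term, language="en"):
    """Build a wbsearchentities request URL for a term (hypothetical helper)."""
    params = {
        "action": "wbsearchentities",  # entity search by label or alias
        "search": term,
        "language": language,          # language of the search term
        "type": "item",                # restrict results to items (Q-IDs)
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


url = wikidata_search_url("scrum master")
print(url)
```

Each hit in the response carries an `id` field holding the Wikidata ID. The result still needs a manual check that the found concept really fits the entity to be added.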

Step 3: Fill out basic information

If you entered a valid Wikidata ID, click the reload button. This will automatically fill the fields of the entity creation form and you can skip steps 4 and 5.

Otherwise, fill out the name, aliases, and description for both English and German in the entity creation form manually.

Typically, a term has several spellings as well as common misspellings that the system should ideally also recognize. Improvements planned for ProfileMap will allow the system to detect certain variations of terms automatically, so that not all variations need to be entered as synonyms. The following table gives an overview of where it is recommended to include both spellings and where one spelling suffices, either now or once the planned improvements to ProfileMap’s normalization process are integrated.

Case | Example | Currently Matched | Extension Planned | Recommendation
Compound words | “Projektmanagement” and “Projekt Management” | No | No | Enter both the compound word and the separated spelling.
Different separators | “Project Management”, “Project-Management”, “Project_Management” and “Project/Management” | No | Yes | Enter one spelling. With future extensions, more and more of the alternative spellings will be matched.
Singular / Plural | “test automation” and “test automations” | No | Yes | Enter the singular spelling. The plural will be normalized with a future extension.
Verb Forms | “debug”, “debugs”, “debugging” and “debugged” | No | Yes | Enter the base form. The other forms will be normalized with a future extension.
Camel Cases | “ProfileMap” and “Profile Map” | No | Yes | Enter the more common spelling. The alternative one will be matched with a future extension.
Two skills in one term | “Regressions- und Integrationstests” | No | Yes | Only include the individual terms (“Regressionstest” and “Integrationstest” in the example), not the combined term as well. Identifying the first of the two skills is a big challenge for the system, so it might take time until this matching is available, and it might be more error-prone than other normalizations. Nevertheless, we do not recommend adding all combinations because of the effort involved and the effects on the ontology model.
Skill including an “and” | “CI/CD”, “CI / CD”, “CI & CD”, “CI and CD”, “Continuous Integration and Continuous Deployment”, “Continuous Integration / Continuous Deployment” and “Continuous Integration & Continuous Deployment” | No | Yes | In contrast to the “Two skills in one term” case, the terms here typically appear together and are to be considered a single skill. In this case, only include one spelling for the full version and one spelling for the abbreviated one (if there is one, as in the example). The white space and the different ways of combining the two terms (“/”, “&”, “and”) will be normalized with future extensions.
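Before entering a list of synonyms, it can help to check which candidates differ only in their separators, since the table recommends entering just one of them. The sketch below is a curator-side helper under that assumption; it is not ProfileMap’s normalization process, and the function name `normalize_separators` is hypothetical.

```python
import re


def normalize_separators(term):
    """Map separator variants of a term to one comparable key."""
    # Treat hyphen, underscore and slash as a plain space.
    spaced = re.sub(r"[-_/]+", " ", term)
    # Collapse repeated white space and ignore case.
    return " ".join(spaced.split()).lower()


variants = [
    "Project Management",
    "Project-Management",
    "Project_Management",
    "Project/Management",
]

# All separator variants collapse to the same key, so one spelling suffices.
keys = {normalize_separators(v) for v in variants}
print(keys)

# Compound words are NOT unified by this helper, matching the table:
# both spellings still have to be entered.
print(normalize_separators("Projektmanagement"))
print(normalize_separators("Projekt Management"))
```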

Whether to include software versions such as “Ubuntu 20.10” or “Ubuntu 21.04” depends on whether users intend to formulate such specific search requests. There is no general recommendation from minnosphere on whether to include them. It is also possible to include them only in certain areas and to be more general in others. If the versioned entities are connected to a more general entity (“Ubuntu” in the example above) using an INSTANCE_OF relationship, a search for the more general skill will also find candidates whose profiles contain only the versioned entities.

Sometimes a competency seems to be formed by combining two existing competencies; for example, “linux firewall” seems to be made up of the two separate competencies “linux” and “firewall”. There are two ways of dealing with such situations. The first is not to enter the combined term in the ontology: users simply search for the two entities (here “linux” and “firewall”) and receive candidates that have both in their profile. This can produce candidates who look appropriate but aren’t, for example a candidate who knows other aspects of “linux” and only a different, specific “firewall”. However, such cases tend to be rare, and this approach reduces the number of entities that the data curators need to maintain.

If such detailed distinctions are required, on the other hand, the recommended approach is the following: add a new entity (in the example “linux firewall”) and create appropriate relationships to the two other entities. Often the second word is the headword of a term, and the new entity is a special kind of the entity corresponding to the headword. In the example, “linux firewall” is a special kind of “firewall”, so it should be connected to “firewall” with an INSTANCE_OF relationship. The “linux firewall” is also a part of the whole linux system. Thus, a PART_OF relationship should be created between “linux firewall” and “linux”.

A good description helps the system make better use of the entity. In the future it might, for example, contribute to the ranking of candidates in the search result list or to distinguishing between ambiguous terms. Relationships will also be used to disambiguate terms. Thus, curating both well will contribute to better disambiguation in the future.

⚠ At the moment, no disambiguation algorithm is applied when extracting terms from text. Thus, all entities that have a matching synonym are displayed. If a very rare entity has a synonym that happens to also be a very commonly used word (like “automatic network dialing” being abbreviated as “and”), this entity will be suggested every time the common word appears in a text. From a UX perspective, it might be better not to include this synonym for the very rare entity, sparing users the need to remove the entity manually again and again.
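The effect described in the warning can be reproduced with a simple synonym lookup. The sketch below assumes plain whole-word matching against a synonym index; the function `extract_entities` and the index layout are hypothetical and do not describe how ProfileMap extracts terms.

```python
import re

# Synonym -> entity. "and" is the problematic abbreviation from the warning.
synonym_index = {
    "and": "automatic network dialing",
    "scrum master": "scrum master",
}


def extract_entities(text, index):
    """Return every entity whose synonym occurs as a whole word in the text."""
    lowered = text.lower()
    found = []
    for synonym, entity in index.items():
        if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
            found.append(entity)
    return found


text = "Worked as scrum master and developer"
print(extract_entities(text, synonym_index))

# Without the common-word synonym, the rare entity is no longer suggested.
cleaned_index = {s: e for s, e in synonym_index.items() if s != "and"}
print(extract_entities(text, cleaned_index))
```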

Step 4: Add parent concept

Add a parent concept for your skill by creating an INSTANCE_OF relationship to a broader category. Choose the most specific applicable category, e.g. “Test Automation Framework” rather than “Software”.

Typical parent concepts for competencies include:

Wikidata-ID | Term | Example Children
Q11862829 | Academic Discipline | Statistics, Human Resource Management, Robotics, Marketing
Q66747126 | Computer Science Term | Systems Programming, Debugging
Q188267 | Programming Paradigm | Functional Programming, Big Data, Event-Driven Programming
Q188522 | Software Testing | White-box testing, data-driven testing, test automation
Q7397 | Software | Netscape, Google Analytics, Adobe Reader 5.0
Q1077784 | Programming Tool | Automake, lint, pkg-config, Windows driver frameworks
Q271680 | Software Framework | OpenCL, Apache Forrest, Puppet
Q188860 | Software Library | Qt, Akka, Java Media Framework, Apache pdfbox
Q15618492 | Software Testing Tool | Tricentis Tosca, Apache JMeter, Oracle Application Testing Suite
Q7705752 | Test Automation Framework | Unity, JUnit, mojotest, Selenium
Q1330336 | Web Framework | React, Angular, JavaServer Faces, django
Q9143 | Programming Language | Java, Scala, C++, Python
Q9135 | Operating System | Windows 8.1, Nintendo IOS, webOS
Q241317 | Computing Platform | Java Virtual Machine, Cygwin, MS DOS, Platform as a service
Q13741 | Integrated Development Environment | Intellij IDEA, Eclipse, Xcode
Q2727468 | Build Automation | Gitlab, cmake, Gradle, sbt
Q891055 | Package Management System | Apache Maven, npm, Homebrew
Q1480561 | Issue Tracking System | Jira, Trello, Redmine
Q176165 | Database Management System | Redis, Apache HBase, Microsoft Access
Q595971 | Graph Database | Blazegraph, Amazon Neptune, Neo4j
Q3932296 | Relational Database | MySQL, Amazon Redshift, Postgres
Q82231 | NoSQL Database Management System | MongoDB, Amazon DynamoDB, Apache Cassandra, OrientDB
(none) | SAP Product | SAP Cloud Platform, SAP Analytics, SAP Business One
(none) | SAP Functional Module | SAP Quality Management, SAP Financial Services, SAP Customer Service
Q1378470 | Software Development Methodology | Waterfall Model, Agile Software Development, Scrum, Continuous Delivery
Q8187769 | Economic Activity | Software as a Service, Accounting, Customer Relationship Management
Q131093 | Content Management System | Typo3, WordPress, Drupal, Microsoft Sharepoint
Q8148 | Industry | Robotics, Human Resource Management, Computer Security
Q15910354 | Soft Skills |

Parent concepts are used to allow for indirect searches. For example, if a user searches for a candidate who knows about “Build Automation”, the results will also include candidates whose profiles contain more specific competencies like “Gitlab” or “cmake” but not “Build Automation” itself. This works transitively: if the curator chooses “Functional Programming Language” as the parent term, whose parent term in turn is “Programming Language”, the new entity will be matched both by a search for “Functional Programming Language” and by a search for the more general “Programming Language”. Choosing a more specific parent term allows more of these indirect searches to take effect.
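The transitive behaviour of parent concepts can be sketched as a walk along INSTANCE_OF edges. This is a minimal illustration and not ProfileMap’s search implementation; the functions `ancestors` and `matches` are hypothetical, and “Haskell” stands in for a newly added entity.

```python
# Child -> direct parent concepts (INSTANCE_OF edges).
# "Haskell" is an assumed example of a new entity.
parents = {
    "Haskell": ["Functional Programming Language"],
    "Functional Programming Language": ["Programming Language"],
    "Gitlab": ["Build Automation"],
    "cmake": ["Build Automation"],
}


def ancestors(entity, parent_map):
    """Collect all direct and indirect parent concepts of an entity."""
    seen = set()
    stack = list(parent_map.get(entity, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against cycles in the data
            seen.add(current)
            stack.extend(parent_map.get(current, []))
    return seen


def matches(search_term, profile, parent_map):
    """True if the profile contains the term itself or a more specific entity."""
    return any(entity == search_term
               or search_term in ancestors(entity, parent_map)
               for entity in profile)


print(matches("Build Automation", ["cmake"], parents))
print(matches("Programming Language", ["Haskell"], parents))
```

Had “Haskell” been attached directly to “Programming Language”, a search for “Functional Programming Language” would not find it, which is why the most specific parent is preferable.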

Typical parent concepts for certificates:

Wikidata-ID | Term | Example Children
(none) | Project Management Certifications | Certified Scrum Master, PMI Professional Project Manager
(none) | Applications Basics | SAP Solution Manager, Microsoft Office User Specialist
(none) | Test Management | Tosca certified specialist, SQS AG – Initial Training, msg systems intensive training certified tester

Currently, there is no mechanism for adding categories for certificates or clients. If you require additional categories, please get in touch with the ProfileMap help desk (e-mail to msg.ProfileMap.HelpDesk@msg.group).

Parent concept for languages:

Wikidata-ID | Term | Example Children
Q34770 | Language | Arabic, Spanish, Italian

Step 5: Add additional relationships

Choose additional relationships to concepts. Explanations of recommended relationships:

Relationship | Explanation | Examples
INSTANCE_OF | Relationship from a more specific concept to a more general one (usually one can form an “is a” sentence between the two, e.g. Java is a programming language) | Java – INSTANCE_OF -> Programming Language; Waterfall model – INSTANCE_OF -> Software development methodology
PART_OF | Relationship to a bigger concept from the concepts it consists of (usually one can form an “is made up of” sentence, e.g. artificial intelligence is made up of machine learning and logic) | Microsoft Access – PART_OF -> Microsoft Office; Machine learning – PART_OF -> Artificial intelligence
USE | Relationship between a tool and a method or another tool that is used by it | Git – USE -> Version control; Apache Maven – USE -> Build management; IBM Lotus Notes Expeditor – USE -> OSGi
PROGRAMMING_LANGUAGE | The programming language of a tool. This is important to set between a software library and a programming language. | Apache commons – PROGRAMMING_LANGUAGE -> Java; Pandas – PROGRAMMING_LANGUAGE -> Python
PROGRAMMING_PARADIGM | A paradigm present in a programming language | Clojure – PROGRAMMING_PARADIGM -> Functional programming; C# – PROGRAMMING_PARADIGM -> Object-oriented programming
OPERATING_SYSTEM | The operating system on which a tool is available, or the operating system installed on a specific hardware | iOS SDK – OPERATING_SYSTEM -> macOS; Git – OPERATING_SYSTEM -> Cross-platform

A single skill can have several relationships of each of these types.

⚠ It is recommended not to use the relationship types SUBCLASS_OF, TRANSLATION, ALSO_KNOWN_AS, DEVELOPER, BELONGS_TO or CATEGORY_OF when creating relationships in the entity creation form.

⚠ The INSTANCE_OF, SUBCLASS_OF, USE, PART_OF and PROGRAMMING_LANGUAGE relationships are or will be used for indirect searches as described above. The general-specific pattern described above is only one of several indirect search patterns. Connecting entities incorrectly with these relationship types affects which candidates are found by the search and which terms are displayed as related terms in the side-by-side comparison, and possibly also the search performance if significantly more candidates are returned.

Adding new competencies and relationships via Excel import

The Excel template provided for adding competencies using a script includes the following fields:

Field | Required | Values | Description
Category | No | String | The field can be used to organize the competencies in the same way they will be categorized in ProfileMap. Taxonomy relationships between the skill and the category will be created during import.
Skill name EN | Yes | String | The English name of the competency.
Skill name DE | Yes | String | The German name of the competency.
Wikidata ID | No | Valid Wikidata-ID | If a Wikidata-ID exists, it can be set here. If a Wikidata-ID is set, the data from the template and from Wikidata will be combined. The names of the competency will be the ones from the template. The synonyms contain the Wikidata names (if different from the ones in the template) and both the synonyms from the template and from Wikidata. Likewise, the relationships both from the template and from Wikidata will be created. The descriptions will be set based on the contents of the template. If a description entry is left empty in the template, the description from Wikidata will be used instead.
Synonyms EN | No | Comma-separated list | The English synonyms of the competency.
Synonyms DE | No | Comma-separated list | The German synonyms of the competency.
Relationships | No | Relationship list | The relationships to other entities. It is possible to create relationships to other entities in the same template if they are in earlier rows.
Description EN | No | String | A continuous text describing the entity in English.
Description DE | No | String | A continuous text describing the entity in German.
Comments | No | String | A field that can be used for comments while filling out the template. This column, like all other columns to its right, will be ignored during import.

A relationship list here refers to a comma-separated list of pairs, each consisting of a relationship type followed by an entity. The entity is represented by its Wikidata-ID or its name.

Example: PART_OF Q456157, SUBCLASS_OF sap product
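To check a filled-out template before handing it over, the relationship list format can be parsed as sketched below. This is an assumed reading of the format shown in the example, not the import script itself; the function name `parse_relationship_list` is hypothetical, and the Wikidata-ID check simply tests for “Q” followed by digits.

```python
import re

WIKIDATA_ID = re.compile(r"Q\d+")


def parse_relationship_list(cell):
    """Split a relationship list cell into (type, kind, entity) tuples."""
    result = []
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and empty cells
        # The first word is the relationship type, the rest is the entity.
        rel_type, _, target = item.partition(" ")
        target = target.strip()
        if not target:
            raise ValueError(f"missing entity in {item!r}")
        kind = "wikidata_id" if WIKIDATA_ID.fullmatch(target) else "name"
        result.append((rel_type, kind, target))
    return result


parsed = parse_relationship_list("PART_OF Q456157, SUBCLASS_OF sap product")
print(parsed)
```

Since the comma is the list separator, an entity whose name itself contains a comma cannot be expressed by name in this sketch and would have to be referenced by its Wikidata-ID.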

⚠ To add new entities, the names of the entities in the template must differ from the names of the entities already in the ontology. If the German or English entity name or the entered Wikidata ID already exists, the existing entity will be updated instead of a new one being created. An entity is updated by adding those synonyms and relationships from the template that did not exist already. In addition, any descriptions given in the template will overwrite the existing descriptions.
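The update rule from the warning above can be summarized in a short sketch: synonyms and relationships are only added, while descriptions are overwritten when the template provides them. The function `apply_template_row`, the dictionary layout, and all values are placeholders chosen for illustration and do not reflect the actual import script.

```python
def apply_template_row(existing, row):
    """Return the entity as it would look after an update from a template row."""
    updated = dict(existing)
    # Synonyms and relationships: add only what is not there yet.
    for key in ("synonyms", "relationships"):
        additions = [v for v in row.get(key, []) if v not in existing[key]]
        updated[key] = existing[key] + additions
    # Descriptions: a non-empty template value overwrites the existing one.
    for key in ("description_en", "description_de"):
        if row.get(key):
            updated[key] = row[key]
    return updated


existing = {
    "synonyms": ["synonym a"],
    "relationships": [("INSTANCE_OF", "role")],
    "description_en": "old English description",
    "description_de": "old German description",
}
row = {
    "synonyms": ["synonym a", "synonym b"],
    "relationships": [("INSTANCE_OF", "role"), ("PART_OF", "scrum")],
    "description_en": "new English description",
    "description_de": "",  # left empty in the template
}

merged = apply_template_row(existing, row)
print(merged)
```

Nothing is removed by an update in this sketch, which mirrors the description above: outdated synonyms or relationships have to be deleted separately.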

⚠ All recommendations concerning adding entities and relationships from the previous section, especially those concerning choosing sensible values for the different fields, also apply to filling out the Excel template.

Once the template is filled out, get in contact with the ProfileMap help desk (e-mail to msg.ProfileMap.HelpDesk@msg.group). Your ProfileMap contact person will then perform the import.

Deactivating / deleting entities & relationships

The functionality ProfileMap provides for deactivating entities is described in the User Manual section “Deactivating terms” in the chapter “Data Curation”. Certain entities in the ontology play a special role in the ranking algorithm (see the section “Step 4: Searching” in the chapter “Search for candidates” in the User Manual). Deactivating these entities will not break the aspect similarity calculations used for the machine learning. Most of these entities are in the table of potential parent concepts or are parent entities of the ones in the table (see section “Step 4: Add parent concept” in chapter “Adding new entities and relationships via UI”). If you are thinking of deactivating one of these entities and replacing it with a similar one that suits your use case slightly better, it is recommended to instead make the new entity a child node of the original one, so that the machine learning model weights the new entity and its children correctly.

Relationships of competencies can be deleted using the functionality described in the subsection “Editing Skills” (section “Editing existing Skills, Certificates, Clients”). The relationships INSTANCE_OF, SUBCLASS_OF, USE, PART_OF and PROGRAMMING_LANGUAGE play a special role in the search filter and the ranking algorithm of ProfileMap’s search. These relationships determine which related entities are considered when choosing the list of candidates returned for a search and, in some cases, which entities belong to which of the aspects used for ordering the search result list. Thus, existing relationships of these types should only be deleted if it is certain that they are incorrect.