National Library of the Netherlands (KB) Thesaurus Mapping Task

Participants in this task are expected to align two Dutch thesauri used to index books from two collections held by the National Library of the Netherlands (KB). The required alignment links are inspired by the SKOS vocabulary. Results have been evaluated by members of the STITCH project and against an application-specific gold standard.

This task reiterates the OAEI 2007 Library track.

Current Status

Results have been submitted. Evaluation is available here.

Data sets

KB maintains two big collections: the Deposit Collection, containing all the Dutch printed publications (one million items), and the Scientific Collection, with about 1.4 million books mainly about the history, language and culture of the Netherlands.

Each collection is described according to its own indexing system and conceptual vocabulary. On the one hand, the Scientific Collection is described using the GTT, a huge vocabulary containing 35,000 general concepts ranging from Wolkenkrabbers (Skyscrapers) to Verzorging (Care). On the other hand, the books contained in the Deposit Collection are mainly indexed against the Brinkman thesaurus, containing a large set of headings (more than 5,000) that are expected to serve as global subjects of books. Both thesauri have similar coverage (more than 2,000 concepts have exactly the same label) but differ in granularity.
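The label overlap mentioned above suggests a simple lexical baseline. The following is a minimal sketch, not the organizers' method: the concept identifiers and the Brinkman labels are hypothetical, and the normalization (lower-casing, whitespace collapsing) is an assumption that goes slightly beyond strict label identity.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Lower-case and collapse whitespace so trivial spelling variants compare equal."""
    return " ".join(label.lower().split())


def exact_label_matches(gtt: dict[str, str], brinkman: dict[str, str]) -> list[tuple[str, str]]:
    """Return (gtt_id, brinkman_id) pairs whose preferred labels are equal after normalization."""
    by_label = defaultdict(list)
    for concept_id, label in brinkman.items():
        by_label[normalize(label)].append(concept_id)

    pairs = []
    for concept_id, label in gtt.items():
        for other_id in by_label.get(normalize(label), []):
            pairs.append((concept_id, other_id))
    return sorted(pairs)


# Hypothetical identifiers; the Brinkman entries are made up for illustration.
gtt = {"gtt:1": "Wolkenkrabbers", "gtt:2": "Verzorging"}
brinkman = {"brinkman:9": "wolkenkrabbers", "brinkman:10": "Geschiedenis"}

matches = exact_label_matches(gtt, brinkman)
print(matches)
```

Pairs found this way are only candidates for skos:exactMatch: because the thesauri differ in granularity, identical labels do not guarantee identical meaning.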

For each concept, the thesauri provide the usual lexical and semantic information: preferred labels, synonyms and notes, broader and related concepts, etc. The language of both thesauri is Dutch, but a substantial share of the Brinkman concepts (around 60%) comes with English labels. For the purpose of the alignment, the two thesauri have been represented according to the SKOS model, which provides all these features. An OWL version is also available (see the annex for more details).

The goal of the task is to find semantic links between the concepts of the GTT and Brinkman thesauri.

Note that the two collections described with the GTT and the Brinkman thesaurus overlap: some books are therefore described with both thesauri. This year, book descriptions will be made available to participants, so as to enable the use of extension-based alignment techniques.
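To illustrate what an extension-based technique can look like, here is a minimal sketch that scores concept pairs by the Jaccard overlap of the books they annotate. The book and concept identifiers are hypothetical, and Jaccard is only one of several possible measures; the task does not prescribe any of them.

```python
def extension_index(annotations: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a book -> concepts mapping into a concept -> books mapping."""
    index: dict[str, set[str]] = {}
    for book, concepts in annotations.items():
        for concept in concepts:
            index.setdefault(concept, set()).add(book)
    return index


def jaccard(books_a: set[str], books_b: set[str]) -> float:
    """Share of books annotated with both concepts among books annotated with either."""
    union = books_a | books_b
    if not union:
        return 0.0
    return len(books_a & books_b) / len(union)


# Hypothetical dually annotated books (described with both thesauri).
gtt_annotations = {"book1": {"gtt:a"}, "book2": {"gtt:a", "gtt:b"}, "book3": {"gtt:b"}}
brinkman_annotations = {"book1": {"br:x"}, "book2": {"br:x"}, "book3": {"br:y"}}

gtt_ext = extension_index(gtt_annotations)
br_ext = extension_index(brinkman_annotations)

# Score every GTT/Brinkman concept pair by the overlap of their extensions.
scores = {(g, b): jaccard(gtt_ext[g], br_ext[b]) for g in gtt_ext for b in br_ext}
```

A high score suggests an equivalence candidate. Asymmetric overlap (one extension largely contained in the other) could instead hint at a hierarchical link, but that requires a different measure than the one sketched here.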

Modalities

Expected Alignments

The expected alignments shall come in the format defined for the Ontology Alignment API.

As the context of the task is clearly thesaurus-based information systems, the alignment links to be produced shall be compatible with standard thesaurus semantic links. Especially expected are alignment links found in the (new) SKOS vocabulary namespace (http://www.w3.org/2008/05/skos#, referred to as skos:):

skos:exactMatch, which denotes equivalence between two concepts;
skos:broadMatch and skos:narrowMatch, denoting hierarchical generalization and specialization;
skos:relatedMatch, denoting general association.

Note that other semantic relations found in the former SKOS mapping vocabulary namespace (http://www.w3.org/2004/02/skos/mapping#, referred to as skosm:) are not allowed: skosm:minorMatch, skosm:majorMatch. It is however possible to use the SKOS mapping classes that make it possible to map defined concepts, even though not all of them will be evaluated:

skosm:AND and skosm:OR, denoting intersection and union of concepts, will be evaluated;
skosm:NOT will not be evaluated.

It is also possible to deliver alignment results in the form of OWL statements (rdfs:subClassOf, owl:equivalentClass, ...). These will be interpreted according to the rules presented in the annex on OWL.

Note on alignment cardinality and confidence measure:

results can come with confidence measures, even though these will not be formally evaluated;
the same concept may be involved in several mapping links (1-n and m-n mapping situations).
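The sketch below shows how such links, including a 1-n situation and confidence measures, might be written as cells of the Ontology Alignment API format. It rests on several assumptions that should be checked against the Alignment API documentation: the alignment namespace URI, the element names (Alignment, map, Cell, entity1, entity2, relation, measure), and the choice to write the relation as a full SKOS URI. The concept URIs are hypothetical, and the header elements of a complete alignment file are omitted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Assumed namespace URIs; verify against the Alignment API documentation.
ALIGN = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2008/05/skos#"

ET.register_namespace("align", ALIGN)


@dataclass(frozen=True)
class Cell:
    entity1: str          # URI of a GTT concept (hypothetical below)
    entity2: str          # URI of a Brinkman concept (hypothetical below)
    relation: str         # assumed encoding: full URI of the SKOS mapping relation
    measure: float = 1.0  # confidence; allowed but not formally evaluated


def cells_to_xml(cells: list[Cell]) -> str:
    """Serialize cells as a fragment: one <map><Cell>...</Cell></map> per link."""
    root = ET.Element(f"{{{ALIGN}}}Alignment")
    for cell in cells:
        wrapper = ET.SubElement(root, f"{{{ALIGN}}}map")
        elem = ET.SubElement(wrapper, f"{{{ALIGN}}}Cell")
        ET.SubElement(elem, f"{{{ALIGN}}}entity1", {f"{{{RDF}}}resource": cell.entity1})
        ET.SubElement(elem, f"{{{ALIGN}}}entity2", {f"{{{RDF}}}resource": cell.entity2})
        ET.SubElement(elem, f"{{{ALIGN}}}relation").text = cell.relation
        ET.SubElement(elem, f"{{{ALIGN}}}measure").text = str(cell.measure)
    return ET.tostring(root, encoding="unicode")


cells = [
    Cell("http://example.org/gtt/c1", "http://example.org/brinkman/c9", SKOS + "exactMatch", 0.95),
    # The same GTT concept takes part in two links (a 1-n situation).
    Cell("http://example.org/gtt/c2", "http://example.org/brinkman/c9", SKOS + "broadMatch", 0.70),
    Cell("http://example.org/gtt/c2", "http://example.org/brinkman/c10", SKOS + "relatedMatch", 0.40),
]

xml_text = cells_to_xml(cells)
print(xml_text)
```

Keeping one cell per link, rather than one record per concept, means that 1-n and m-n situations need no special handling.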

Evaluation

Evaluation of the alignments will be done by members of the STITCH team with the help of domain experts. Special attention will be given to application relevance. As last year, criteria for evaluation will be chosen according to specific application scenarios such as unified thesaurus design or book re-annotation.

Due to the size of the vocabularies, some scenarios may only be evaluated on a sample. The modalities of this sample evaluation will be determined later, depending for example on the number of participants.

Data availability

The data for the two vocabularies can be requested by e-mail from the organizers of the task, Antoine Isaac and Henk Matthezing. Participation requires accepting the terms of this Creative Commons licence and committing not to use the data for any purpose other than the OAEI campaign. [This situation arises because the data come from two different environments.]

Please note that this e-mail process is not only meant to ensure compliance with the IP situation. It will also help us keep in contact with participants, for instance if a new version of the data is produced after a complaint by a participant.

Schedule

May 19th: datasets are out
June 15th: end of commenting period
July 1st: tests are frozen
September 1st: participants send preliminary results (for interoperability-checking)
September 26th: participants send final results and papers
October 10th: organisers publish results for comments
October 26th-27th: final results ready and OM-2008 workshop.

Acknowledgements

Stefan Schlobach (VU) and Claus Zinn (MPI) have helped with the scientific organization of the track. Yvonne van der Steen, Irene Wolters, Maarten van Schie, Erik Oltmans and Johan Stapels have given crucial help on the KB side.

Organizers and contributors

Antoine Isaac, Henk Matthezing, Lourens van der Meij, Shenghui Wang.

Contacts

Please send any questions and comments to Antoine Isaac and Henk Matthezing.

Annex: OWL variant for the data

In case a participant's tool cannot read or write the proposed SKOS data, OWL versions are provided, and OWL alignment relationships can be evaluated. Note, however, that this amounts to making specific interpretations of the original data and of the produced alignments, and might reduce the quality of the final results.

The following conversions were made regarding the SKOS data made available for the track:

instances of skos:Concept are converted into instances of owl:Class;
skos:prefLabel, skos:altLabel and skos:hiddenLabel statements are converted to rdfs:label statements, which removes the subtle distinctions that exist between these different properties (in GTT for instance, many altLabels are not synonyms at all);
the various kinds of SKOS notes are converted to rdfs:comment statements;
skos:broader statements are converted into rdfs:subClassOf statements;
skos:related statements are converted into rdfs:seeAlso statements.
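The conversion rules above can be sketched as a simple rewriting of triples. This is an illustration under stated assumptions, not the script actually used to produce the OWL files: triples are represented as plain tuples of prefixed names, the identifiers and the second label are hypothetical, and the set of note properties covered (skos:note, skos:scopeNote, skos:definition) is a guess at what "various kinds" includes.

```python
# Property rewriting table derived from the conversion rules listed above.
SKOS_TO_OWL_PROPERTY = {
    "skos:prefLabel": "rdfs:label",
    "skos:altLabel": "rdfs:label",
    "skos:hiddenLabel": "rdfs:label",
    "skos:note": "rdfs:comment",        # assumed to be among the converted notes
    "skos:scopeNote": "rdfs:comment",   # assumed
    "skos:definition": "rdfs:comment",  # assumed
    "skos:broader": "rdfs:subClassOf",
    "skos:related": "rdfs:seeAlso",
}


def skos_to_owl(triples):
    """Rewrite SKOS triples into their OWL counterparts; other triples are dropped."""
    converted = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj == "skos:Concept":
            converted.append((subject, "rdf:type", "owl:Class"))
        elif predicate in SKOS_TO_OWL_PROPERTY:
            converted.append((subject, SKOS_TO_OWL_PROPERTY[predicate], obj))
    return converted


# Hypothetical identifiers and alternative label.
skos_triples = [
    ("gtt:1", "rdf:type", "skos:Concept"),
    ("gtt:1", "skos:prefLabel", "Wolkenkrabbers"),
    ("gtt:1", "skos:altLabel", "Hoogbouw"),
    ("gtt:1", "skos:broader", "gtt:2"),
    ("gtt:1", "skos:related", "gtt:3"),
    ("gtt:1", "skos:inScheme", "gtt:scheme"),
]

owl_triples = skos_to_owl(skos_triples)
```

After conversion the preferred and alternative labels are indistinguishable, which is exactly the loss of information the rules above warn about.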

The following interpretations will be made of OWL data sent back by participants:

instances of owl:Class will be interpreted as instances of skos:Concept;
owl:equivalentClass statements will be interpreted as skos:exactMatch statements;
if participants align the original concepts (as instances of skos:Concept and not owl:Class) using owl:sameAs, these statements will also be interpreted as skos:exactMatch statements;
rdfs:subClassOf statements will be interpreted as skos:broadMatch statements;
rdfs:seeAlso statements will be interpreted as skos:relatedMatch statements;
if participants' tools output alignments involving defined classes using owl:intersectionOf and owl:unionOf, the statements concerned will be interpreted as statements involving skosm:AND and skosm:OR respectively. Negation-like statements (owl:disjoint, owl:differentFrom, owl:disjointWith, owl:complementOf, which could be interpreted as statements involving skosm:NOT) will be ignored, as these are not evaluated.
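The interpretation rules above can likewise be sketched as a lookup. The code assumes statements arrive as tuples of prefixed names with hypothetical identifiers, uses skos:broadMatch (the name given in the list of expected relations) for subclass statements, and leaves out the handling of owl:intersectionOf and owl:unionOf, which would require walking the class definitions.

```python
# Relation rewriting table derived from the interpretation rules listed above.
OWL_TO_SKOS_RELATION = {
    "owl:equivalentClass": "skos:exactMatch",
    "owl:sameAs": "skos:exactMatch",
    "rdfs:subClassOf": "skos:broadMatch",
    "rdfs:seeAlso": "skos:relatedMatch",
}

# Negation-like predicates are ignored because skosm:NOT is not evaluated.
IGNORED_PREDICATES = {"owl:disjointWith", "owl:differentFrom", "owl:complementOf"}


def interpret_owl_alignment(statements):
    """Return the SKOS reading of OWL alignment statements, skipping ignored ones."""
    interpreted = []
    for subject, predicate, obj in statements:
        if predicate in IGNORED_PREDICATES:
            continue
        relation = OWL_TO_SKOS_RELATION.get(predicate)
        if relation is not None:
            interpreted.append((subject, relation, obj))
    return interpreted


# Hypothetical statements as a participant's tool might return them.
owl_statements = [
    ("gtt:1", "owl:equivalentClass", "brinkman:9"),
    ("gtt:2", "rdfs:subClassOf", "brinkman:9"),
    ("gtt:3", "owl:disjointWith", "brinkman:10"),
    ("gtt:4", "rdfs:seeAlso", "brinkman:11"),
]

skos_statements = interpret_owl_alignment(owl_statements)
```

Predicates that appear in neither table are silently skipped in this sketch; how the evaluators treat them is not stated above.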