Library track: large multilingual subject heading lists

Participants in this task are expected to create pairwise alignments between three large subject heading lists. The required alignment links are inspired by the SKOS vocabulary. Results will be evaluated by members of the TELplus project using a gold standard established by the MACS project.

From a methodological perspective, this task is similar to the OAEI 2008 Library track. However, it uses a different dataset.

Current Status

Results have been submitted. Evaluation is available here.

Data sets

The vocabularies to align are:

LCSH, the Library of Congress Subject Headings, available as linked data at http://id.loc.gov. Contains around 340K concepts, including 250K general subjects.
RAMEAU, the heading list used at the French National Library, available as linked data at http://stitch.cs.vu.nl/rameau. Contains around 150K concepts, including 90K general subjects.
SWD, the heading list used at the German National Library. Contains around 800K concepts, including 160K general subjects.

The concepts from the three vocabularies are used as subjects of books. For each concept, the usual SKOS lexical and semantic information is provided: preferred labels, synonyms and notes, broader and related concepts, etc. For the purpose of the alignment, the three thesauri have been represented according to the SKOS model, which provides all these features. An OWL version is also available (see the annex for more details).

Even though two of these vocabularies are available online as RDF data, we will provide dumps for the convenience of participants.

We will also make available part of the MACS manual mappings between these vocabularies, which can be used as a learning set.

Modalities

Expected Alignments

The expected alignments shall come in the format defined for the Ontology Alignment API.

As the context of the task is clearly thesaurus-based information systems, the alignment links to be produced shall be compatible with standard thesaurus semantic links. Especially expected are alignment links found in the SKOS vocabulary namespace (http://www.w3.org/2004/02/skos/core#, referred to as skos:):

skos:exactMatch or skos:closeMatch which denote semantic equivalence (more or less strong) between two concepts;
skos:broadMatch and skos:narrowMatch, denoting hierarchical generalization and specialization;
skos:relatedMatch, denoting general association.

Alignment results may also be submitted in the form of OWL statements (rdfs:subClassOf, owl:equivalentClass, ...). These will, however, be interpreted according to the rules presented in the annex on OWL.

Note on alignment cardinality and confidence measures:

results can come with confidence measures, even though these will not be formally evaluated;
the same concept may be involved in several mapping links (1-n and n-1 mapping situations).
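
To make the two points above concrete, here is a minimal Python sketch of how a participant might hold mapping links in memory before serializing them. The class name Correspondence, the helper group_by_source, the example.org URIs and the restriction of confidence values to the range 0 to 1 are all illustrative assumptions; none of them is prescribed by the track or by the Ontology Alignment API.

```python
from collections import defaultdict
from dataclasses import dataclass

SKOS = "http://www.w3.org/2004/02/skos/core#"

# The five SKOS mapping relations expected by the track.
MAPPING_RELATIONS = {
    SKOS + name
    for name in ("exactMatch", "closeMatch", "broadMatch",
                 "narrowMatch", "relatedMatch")
}


@dataclass(frozen=True)
class Correspondence:
    """One mapping link (hypothetical container, not an API class)."""
    source: str
    target: str
    relation: str
    confidence: float = 1.0  # assumed to lie in [0, 1]

    def __post_init__(self):
        if self.relation not in MAPPING_RELATIONS:
            raise ValueError(f"unexpected relation: {self.relation}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def group_by_source(correspondences):
    """Index links by source concept; a list longer than 1 is a 1-n case."""
    grouped = defaultdict(list)
    for link in correspondences:
        grouped[link.source].append(link)
    return dict(grouped)


# Hypothetical URIs: one source concept involved in two mapping links.
links = [
    Correspondence("http://example.org/lcsh/c1",
                   "http://example.org/rameau/c7", SKOS + "exactMatch", 0.9),
    Correspondence("http://example.org/lcsh/c1",
                   "http://example.org/rameau/c8", SKOS + "broadMatch", 0.6),
]
grouped = group_by_source(links)
```

Grouping by target instead of source would expose n-1 situations in the same way.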

Note on the use of the learning set: depending on whether systems use the partial MACS learning set, and on our own evaluation possibilities, we will try to measure the positive effect of using that learning set. In that case we will ask the participants to submit two versions of their alignments, as done for subtracks 1 and 4 of the Anatomy track.

Evaluation

Special attention will be given to application relevance. As last year, evaluation criteria will be chosen according to specific application scenarios such as re-annotation. In particular, we plan to perform automatic evaluation in the context of re-annotation of books from one subject vocabulary to the other, using sets of books that are common to two collections at a time. We will also perform evaluation against a manually built partial gold standard, namely the mappings established by the MACS project.
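
The re-annotation scenario can be illustrated with a small Python sketch. It assumes that an alignment is available as a dictionary from source concepts to sets of target concepts, and that each book common to two collections carries subjects from both vocabularies. The function names, the concept identifiers and the choice of per-book precision and recall are assumptions made for illustration; the measures actually used by the organisers are not specified here.

```python
def reannotate(source_subjects, alignment):
    """Translate a book's source-vocabulary subjects through the alignment."""
    predicted = set()
    for concept in source_subjects:
        # Concepts with no mapping contribute nothing.
        predicted.update(alignment.get(concept, ()))
    return predicted


def precision_recall(predicted, actual):
    """Compare predicted target subjects with the book's real ones."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical identifiers for one dually annotated book.
alignment = {
    "lcsh:a": {"rameau:x", "rameau:y"},
    "lcsh:b": {"rameau:z"},
}
predicted = reannotate({"lcsh:a"}, alignment)
precision, recall = precision_recall(predicted, {"rameau:x", "rameau:w"})
# One of two predicted subjects is correct, and one of two real subjects
# is found, so precision == 0.5 and recall == 0.5 for this book.
```

Averaging such per-book scores over all dually annotated books would give one possible collection-level figure.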

Research Questions

Within the evaluation we try to focus on the following aspects:

Which system performed best?
Do participants' performances vary from one evaluation setting to the other and from one vocabulary pair to the other?
How well do systems reproduce the existing partial gold standard? How far can they extend this gold standard?
Can systems find non-trivial correspondences (that is, ones not based on identical labels) in a multilingual case?
Can systems benefit from a learning set?

Data availability

The data for the three vocabularies and the MACS learning sample can be requested by sending an e-mail to both organizers of the task, Antoine Isaac and Shenghui Wang. Participating requires committing not to use the data for any purpose other than the OAEI campaign, especially for the data sets which are not already provided as linked data.

Please note that this e-mail process is not only meant to ensure compliance with the IP situation. It will also help us keep in contact with participants, for instance if a new version of the data is produced after a complaint by a participant.

Schedule

June 15th: datasets are out
June 22nd: end of commenting period
July 6th: tests are frozen
September 1st: participants send preliminary results (for interoperability-checking)
September 28th: participants send final results and papers
October 5th: organisers publish results for comments
October 25th: final results ready and OM-2009 workshop.

Acknowledgements

We thank Patrice Landry, Genevieve Clavel and Jeroen Hoppenbrouwers for the MACS data and, for LCSH, RAMEAU and SWD respectively, the Library of Congress, the French National Library and the German National Library.

Contacts

Please send any questions and comments to Antoine Isaac and Shenghui Wang.

Annex: OWL variant for the data

If a participant's tool cannot read or write the proposed SKOS data, OWL versions are provided, and alignments using OWL properties can be evaluated. Note, however, that this amounts to making specific interpretations of the original data and of the produced alignments, and might reduce the quality of the final results.

The following conversions were made regarding the SKOS data made available for the track (see also here):

instances of skos:Concept are converted into instances of owl:Class;
skos:prefLabel, skos:altLabel and skos:hiddenLabel statements are converted to rdfs:label statements, which removes the subtle distinctions that exist between these different properties (for instance, many altLabels are not synonyms at all);
the various kinds of skos:note statements are converted to rdfs:comment statements;
skos:broader statements are converted into rdfs:subClassOf statements;
skos:related statements are converted into rdfs:seeAlso statements.
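
The conversion rules above amount to a property-by-property rewrite of the SKOS triples. The following Python sketch illustrates this on triples written as plain tuples of prefixed names. It is an illustration only, not the script used to produce the OWL dumps; the function name skos_to_owl is hypothetical, and the list of note properties is taken from the SKOS specification as an assumption about what "various kinds" of notes covers.

```python
# Property rewrites described in the annex (prefixed names for brevity).
PROPERTY_MAP = {
    "skos:prefLabel": "rdfs:label",
    "skos:altLabel": "rdfs:label",
    "skos:hiddenLabel": "rdfs:label",
    "skos:broader": "rdfs:subClassOf",
    "skos:related": "rdfs:seeAlso",
}

# skos:note and its sub-properties in SKOS (assumed scope of the rule).
for note in ("note", "changeNote", "definition", "editorialNote",
             "example", "historyNote", "scopeNote"):
    PROPERTY_MAP["skos:" + note] = "rdfs:comment"


def skos_to_owl(triples):
    """Yield OWL-flavoured triples for an iterable of SKOS triples."""
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj == "skos:Concept":
            yield (subject, predicate, "owl:Class")
        elif predicate in PROPERTY_MAP:
            yield (subject, PROPERTY_MAP[predicate], obj)
        else:
            # Anything not covered by the rules is passed through unchanged.
            yield (subject, predicate, obj)


# Hypothetical concept ex:c1 with a label, a synonym and a broader concept.
skos_triples = [
    ("ex:c1", "rdf:type", "skos:Concept"),
    ("ex:c1", "skos:prefLabel", "Cats"),
    ("ex:c1", "skos:altLabel", "Felis catus"),
    ("ex:c1", "skos:broader", "ex:c0"),
]
owl_triples = list(skos_to_owl(skos_triples))
```

After conversion, the preferred label and the alternative label are both plain rdfs:label values, which shows the loss of distinction mentioned in the rules.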

The following interpretations will be made of OWL data sent back by participants:

instances of owl:Class will be interpreted as instances of skos:Concept;
owl:equivalentClass statements will be interpreted as skos:exactMatch statements;
if participants align the original concepts (as instances of skos:Concept and not owl:Class) using owl:sameAs, these statements will also be interpreted as skos:exactMatch statements;
rdfs:subClassOf statements will be interpreted as skos:broadMatch statements;
rdfs:seeAlso statements will be interpreted as skos:relatedMatch statements.
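
These interpretation rules can be summarized as a lookup table. The Python sketch below applies them to statements written as tuples of prefixed names. The function name interpret_owl_alignment is hypothetical, and leaving unrecognized statements untouched is an assumption, since the rules do not say how such statements are handled. The hierarchical case uses skos:broadMatch, the property name defined by SKOS.

```python
# OWL or RDFS property -> SKOS mapping relation, following the rules above.
OWL_TO_SKOS_MATCH = {
    "owl:equivalentClass": "skos:exactMatch",
    "owl:sameAs": "skos:exactMatch",
    "rdfs:subClassOf": "skos:broadMatch",
    "rdfs:seeAlso": "skos:relatedMatch",
}


def interpret_owl_alignment(statements):
    """Rewrite OWL alignment statements as SKOS mapping statements."""
    interpreted = []
    for subject, predicate, obj in statements:
        if predicate == "rdf:type" and obj == "owl:Class":
            interpreted.append((subject, predicate, "skos:Concept"))
        elif predicate in OWL_TO_SKOS_MATCH:
            interpreted.append((subject, OWL_TO_SKOS_MATCH[predicate], obj))
        else:
            # Assumption: statements outside the rules are kept as they are.
            interpreted.append((subject, predicate, obj))
    return interpreted


# Hypothetical statements returned by a participant's OWL-based tool.
owl_alignment = [
    ("lcsh:c1", "owl:equivalentClass", "rameau:c7"),
    ("lcsh:c2", "rdfs:subClassOf", "rameau:c8"),
    ("lcsh:c3", "rdfs:seeAlso", "swd:c9"),
]
skos_alignment = interpret_owl_alignment(owl_alignment)
```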