NWO

TABLE OF CONTENTS

Summary

1. General Problem Statement

2. Programme Strategy

2.1 Research Strategy

2.1.1 Theme 1: Semantic interoperability through metadata

2.1.2 Theme 2: Knowledge enrichment through automated analyses

2.1.3 Theme 3: Personalisation through presentation

2.2 Implementation Strategy

2.2.1 Tools for the "back office"

2.2.2 Composition of the Research Teams

2.2.3 Design Principles

2.2.4 Integrators

3. Support Strategy

3.1 Transfer of Knowledge and Tools

3.2 Continuity

4. Programme Management and Budget

4.1 Steering Committee

4.2 Programme Committee

4.3 International Scientific Advisory Board

4.4 User Groups

4.5 Programme Management Bureau

4.6 Committee of Recommendation

4.7 Budget

5. National and International Context

5.1 National context

5.1.1 The Royal Netherlands Academy of Arts and Sciences

5.1.2 The Netherlands Organisation for Scientific Research

5.1.3 SURF

5.1.4 MultimediaN

5.2 International context

5.2.1 European Union

5.2.2 International networks

5.2.3 Related programmes in the European Union

5.2.4 Related programmes in the World

APPENDIX I: Six Core Projects

640.001.401: STITCH - SemanTic Interoperability To access Cultural Heritage

640.001.402: CHOICE - CHarting the informatiOn landscape employIng ContExt information

640.002.401: RICH - Reading Images in the Cultural Heritage

640.002.402: SCRATCH - Script Analysis Tools for the Cultural Heritage

640.002.403: MITCH - Mining for Information in Texts from the Cultural Heritage

640.003.401: CHIP - Cultural Heritage Information Presentation

APPENDIX II: Involvement consortium members in related international research projects

SUMMARY

The collective memory of the Netherlands is stored in our cultural heritage. The total size of the Dutch cultural heritage is certain to be huge. In the Netherlands there are at least 80 large collections that together contain several million objects. The economic value of this heritage (estimated at 22 billion euros, art collections only) underscores its enormous importance. Cultural heritage belongs to the entire population of our country and plays a role in many aspects of society: tourism, education, research, cultural interest etc.

For historical reasons, the collections of physical objects have landed in a large number of cultural heritage institutions. This poses limitations for both visitors and researchers. Digitisation holds the promise for continuous access to all cultural heritage collections, unrestricted by time and space. All the digitised collections of the cultural heritage institutes form one large Ambient Heritage Collection. This opens unimagined possibilities for research, education, cultural leisure, and tourism.

Despite large investments, the cultural heritage institutions encounter a number of persistent obstacles that are hindering progress. There is a strong sense of urgency felt by the organisations in the cultural heritage domain to come up with new solutions to get access to the data of the digitised collections. The volume of the Dutch cultural heritage is immense and increasing every day. A new approach has to be developed. The CATCH programme aims to do research in order to find these new solutions. The two central research questions in CATCH are:

· To what extent is it possible to develop innovative tools (1) to connect knowledge and cultural objects, (2) to integrate scattered digitised cultural objects and (3) to increase the accessibility of and the interaction with our cultural heritage supporting and improving the work of the professionals?

· Can we develop scientifically relevant methods to acquire new fundamental and applied knowledge about these processes and their IT-based solutions?

The challenges implied by the research questions are common to all cultural heritage institutions in the world. The CATCH programme joins the ongoing international efforts. On the one hand CATCH aims to develop tools to improve the specific situation for Dutch cultural heritage (research question 1). On the other hand CATCH wants to contribute new methods and techniques to the international research effort (research question 2).

The CATCH research goals have been established in a process that can be characterised as demand pull rather than technology push. In a demand-pull programme the interests of the (potential) users of the research results are of outstanding importance. Hence the programme strategy has a twofold focus: research and implementation. As a direct consequence, the CATCH programme will have two types of results:

· new knowledge

· software (tools).

The challenges for the CATCH programme are: (1) multidisciplinary cooperation between cultural heritage and IT research, (2) excellent research contributions, and (3) intelligent and personalised tools. The CATCH research strategy concentrates on three research themes.

THEME 1: Semantic interoperability through metadata

THEME 2: Knowledge enrichment through automated analyses

THEME 3: Personalisation through presentation

The CATCH research focuses on the development of tools and methods to speed up the back office processes, i.e. tools and methods that will enable the collection managers of the cultural heritage institutes to do more in less time and with higher quality. All developed tools and algorithms will be implemented in two 'integrators', existing large IT-projects of national importance in the cultural heritage field.

CATCH is a coordinated effort with respect to three strategies: research, implementation and support.

The research and implementation will be done by research teams consisting of CATCH-funded temporary researchers (PhD students, postdocs), temporary scientific programmers and senior research staff (all employed by universities), and programmers and senior staff employed by cultural heritage institutions (researchers and/or collection managers or others with relevant expertise). With an estimated total budget of M€ 12,5 in subsidies (to be realised in two phases), CATCH will be able to fund about 17 of these research teams. The programme will start with six research teams, each executing one of the six core projects which lay the foundation for the programme. The 11 remaining teams will be selected in competition on the basis of research plans. All Dutch universities can enter the competition, which will be organised by NWO. The participating cultural heritage institutions will contribute M€ 2,8 in kind to the programme.

The support programme provides for the transfer of knowledge and tools (a) within the programme and (b) to all other parties interested in the CATCH results. Furthermore, the support programme aims at building and establishing a structure which guarantees continuity for the results (in particular the tools, the software, and the knowledge) of the CATCH programme.

The programme will be run by a Programme Committee with representatives of the three CATCH themes and additional experts. Daily affairs will be taken care of by an Executive Committee and the Programme Management Bureau. A Steering Committee representing all parties contributing financially to the programme is responsible for the supervision of the programme and all major (financial) decisions. Programme Committee and Steering Committee are assisted by an International Scientific Advisory Board.

The CATCH programme starts in November 2004 and will run for six years.

1. GENERAL PROBLEM STATEMENT

The collective memory of the Netherlands is stored in our cultural heritage. Enormous amounts of archives, books and magazines, paintings and other objects of art, audiovisual sources, objects of folklore, archaeological remains, and logs describing these objects are kept in numerous places, often in buildings that form part of our cultural heritage themselves. The total size of the Dutch cultural heritage is difficult to estimate but is certain to be huge. In the Netherlands there are at least 80 large collections that together contain several million objects.¹ The economic value of this heritage is even more difficult to estimate since the true value is symbolic rather than economic. Nevertheless, the estimated monetary value (22 billion euros; 1998², art collections only) underscores the enormous value of our cultural heritage. This is accentuated by the fact that the government is spending around 200 to 250 million euros on an annual basis on the management of the cultural-heritage sector. Revenues and secondary economic effects are probably much larger.

All these witnesses of our past and present are indispensable components of our national identity. Cultural heritage belongs to the entire population of our country and plays a role in many aspects of society: tourism, education, research, cultural interest etc. For historical reasons, the collections of physical objects have landed in a large number of cultural heritage institutions. This poses limitations for both visitors and researchers. Related objects are often stored at different locations. For centuries these limitations were overcome through physical movement. Visitors and researchers travelled to the objects they desired to see, or related objects belonging to different collections were moved to one place to form an exhibition. Yet because of the limitations of time and space the accessibility remained inherently restricted.

Digitisation holds the promise for continuous access to all cultural heritage collections, unrestricted by time and space. Physical constraints no longer apply. All the digitised collections of the cultural heritage institutes form one large Ambient Heritage Collection. This opens unimagined possibilities for research, education, cultural leisure, and tourism. The cultural heritage institutions and the government are very much aware of the potential possibilities the new information technology offers them to perform their public tasks: to preserve, present and propagate their collections to audiences ranging from specialised researchers to the general public. They invest heavily in the digitisation of their collections and the accessibility of the collections through the internet. There are a number of excellent examples where large digital collections have been made available to large audiences.

Despite these investments and other major efforts, the cultural heritage institutions encounter a number of persistent obstacles that are hindering progress. Below they are summarised in five points.

1. The digitisation process is slow, often cumbersome, and therefore very expensive. Most heritage objects are precious and have to be handled with care. Refined technical solutions are needed to support and automate the digitisation process with the subtlety required by such precious goods.

¹ Quick scan Digitalisering Cultureel Erfgoed in Nederlandse Collecties. Reekx Advies, April 2002.

² Source: CBS.

2. Independent collections, unconnected databases. In the same way as physical objects are kept in numerous independent collections, their digital counterparts are stored in a huge archipelago of (more or less) unconnected databases. Connecting these databases and making them interoperable is a complicated problem, which needs to be solved if the promises to lift the limitations of time and space are ever to be fulfilled.

3. Access problems. Even if the databases are technically connected and can be approached as though they were one large system, there remains the problem to search and sift through millions and millions of objects, ranging from written text to spoken text, from still images to moving images, from 2D objects to 3D objects, and to find the objects one was looking for. Progress is hampered by the great variety of schemes and systems describing the semantics of the objects.

4. The problem of knowledge enrichment. Finding the objects, however, is not enough if we want to exploit the potential of the new digital world to the largest extent possible. Data from various sources (e.g., text and images) can be connected in sensible ways to give us deeper insight into the nature of objects (e.g., paintings) or processes (e.g., historical events). The challenge is to find automated ways to make new knowledge out of existing data and knowledge.

5. The problem of personalisation. The results of the searches have to be presented in ways that correspond to the needs of the person who was looking for the information. It is almost trivial to remark that the presentation of the results of a search to a specialised researcher can, and probably has to, be of another nature than the presentation of the same results to an eight-year-old child. However, it is far from trivial to devise the techniques to realise this.

There is a strong sense of urgency felt by the organisations in the cultural heritage domain to come up with new solutions to get access to the data. The volume of the Dutch cultural heritage is immense and increasing every day. The funds and time required to digitise and present all our cultural material in a traditional way are lacking by any measure. Therefore, a new approach has to be developed, since there is an increasing demand, stimulated by the use of the internet.

This brings us to two central research questions.

· To what extent is it possible to develop innovative tools (1) to connect knowledge and cultural objects, (2) to virtually integrate scattered digitised cultural objects and (3) to increase the accessibility of and the interaction with our cultural heritage supporting and improving the work of the professionals?

· Can we develop scientifically relevant methods to acquire new fundamental and applied knowledge about these processes and their IT-based solutions?

The challenges implied by the research questions are common to all cultural heritage institutions in the world. Therefore, all over the world serious research efforts are realised to contribute to new ways of dealing with our cultural heritage. The CATCH programme joins these efforts. On the one hand CATCH aims to develop tools to improve the specific situation for Dutch cultural heritage (research question 1). On the other hand CATCH wants to contribute new methods and techniques to the international research effort (research question 2).

2. PROGRAMME STRATEGY

Essential in the CATCH research programme is the direct involvement of the cultural heritage sector in defining the aims and content of the research, right from the start. The CATCH research goals have been established in a process that can be characterised as demand pull rather than technology push. In a demand-pull programme the interests of the (potential) users of the research results are leading. The programme strategy - guided by the CATCH principle of interaction and cooperation - has a twofold focus: research and implementation. As a direct consequence, the CATCH programme will have two types of results:

· new knowledge

· software (tools)

A main characteristic of the CATCH programme is that the production of these two types of results is interwoven. Obviously, from a scientific point of view, IT research has as its principal aim the development of new methods, techniques, insights, and knowledge. The results achieved can be equally beneficial for the cultural heritage sector as for the IT research itself and a variety of commercial applications. Of course, all results will be disseminated, too, by papers, articles, dissertations etc. The universities and research institutions are responsible for the dissemination and preservation of this knowledge. Cultural heritage institutions and the participating companies should be able to have free access to the knowledge developed. Section 2.1 describes the programme's research strategy to produce new knowledge. Section 2.2 describes the programme's implementation strategy.

2.1 Research Strategy

Although the CATCH programme is ambitious, it has by no means the aspiration to deal with all obstacles mentioned in the previous chapter. Through a concerted and focused research effort, embedded within and guided by the leading Dutch cultural heritage institutions, CATCH aims at a measurable and permanent impact on an improved accessibility of digital cultural heritage.

Four characteristics of cultural heritage are particularly relevant to the CATCH programme.

1. The volume of the cultural heritage is huge.

2. The cultural-heritage objects are distributed over many distinct collections. They are exhibited or stored in 900 museums, 400 archives, and 1100 libraries in the Netherlands.

3. The collection of cultural-heritage objects is heterogeneous, ranging from buildings to books and pictures.

4. Cultural heritage is generated in a largely unpredictable autonomous process. Material and immaterial products of human activity and creativity enter the domain of cultural heritage in a continuous and perennial stream.

These characteristics combined with the obstacles mentioned earlier define the challenges for the CATCH programme: (1) multidisciplinary cooperation between cultural heritage and IT research, (2) excellent research contributions, and (3) intelligent and personalised tools. The CATCH research strategy concentrates on three research themes.

THEME 1: Semantic interoperability through metadata

THEME 2: Knowledge enrichment through automated analyses

THEME 3: Personalisation through presentation

2.1.1 Theme 1: Semantic interoperability through metadata

Situation in cultural heritage

From the start, the cultural heritage institutes have used registration systems to add metadata to their collections. However, each of the highly autonomous institutes has done so in its own way. Only recently have the institutes become more aware of the need for standards in the structure of the descriptions, the conventions within the descriptions, and the terminological sources. Nowadays, the sheer amount of heritage sources, their great diversity, the number of different registration systems used, and the ever-evolving wishes of the users make it impossible to provide the "Dutch Heritage Collection" with unambiguous metadata through intellectual human labour. The challenge is to achieve the desired situation by combining intelligent IT applications and human expertise.

Hence, cultural heritage may turn to information technology with a clear technology demand for tools and methods (1) to combine and enrich the already registered data and knowledge, (2) to document sources automatically or semi-automatically, and (3) to supply them with the necessary metadata. The (semi-)automatic generation of metadata is an essential prerequisite for the semantic interoperability of the collections. Metadata not only makes sure that a person can find a specific collection or object, it also enables bulk retrieval of digital objects that are related to each other (e.g., created by the same artist, about the same topic, from the same period, from the same geographic location, etc.). Here we reiterate that the creation of such metadata usually requires a considerable intellectual input of curators and others involved in digital heritage collections. Information technology may offer opportunities for semantic interoperability between digital collections and their metadata on a large scale, which could not be achieved by human input alone. Finally, it is remarked that the creation of a Semantic Web can only be achieved by extensive IT research on semantic interoperability.
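
The bulk retrieval of related objects described above can be illustrated with a minimal Python sketch. The records, the field names (`creator`, `period`) and the helper functions are hypothetical and serve only as an illustration; they do not correspond to any existing registration system.

```python
from collections import defaultdict

# Hypothetical catalogue records; identifiers and field names are assumptions.
RECORDS = [
    {"id": "obj-1", "creator": "Artist A", "period": "17th century"},
    {"id": "obj-2", "creator": "Artist A", "period": "17th century"},
    {"id": "obj-3", "creator": "Artist B", "period": "17th century"},
    {"id": "obj-4", "creator": "Artist B", "period": "19th century"},
]


def build_index(records, field):
    """Map each value of a metadata field to the ids of the objects carrying it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["id"])
    return index


def related_objects(records, object_id, field):
    """Return the ids of all other objects that share the field value of object_id."""
    by_id = {record["id"]: record for record in records}
    value = by_id[object_id][field]
    return [oid for oid in build_index(records, field)[value] if oid != object_id]


print(related_objects(RECORDS, "obj-1", "creator"))  # ['obj-2']
print(related_objects(RECORDS, "obj-1", "period"))   # ['obj-2', 'obj-3']
```

The sketch only works when the collections use the same values for the same concept, which is precisely the interoperability problem this theme addresses.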

Research topics

The leading question is: How can we achieve the creation of semantic metadata by applying automatic creation of metadata? An obvious research agenda reads: (1) by deriving metadata from other collections, and (2) by using ontologies for adding additional elements in metadata corpora to guarantee 'semantic cohesion' between collections and items. Although the main goal is to provide methods and tools that can be used in the "back office" to create semantically rich metadata, there are two more questions, viz. on the speed of the project execution, and on the open structure of the solutions. The tools should minimize the amount of user effort required for creating and maintaining semantic annotations and should help to increase the overall quality level of annotations.

Research will focus on methods and tools for harmonizing ontologies through semantic links between metadata corpora. This research challenge is similar to what is called the "ontology mapping" problem. Research issues with respect to ontology mapping include the following five different topics.

· Inventory of (the composition of) ontologies and vocabularies that are of potential use for cultural heritage applications.

· Types of mapping relations: e.g., equality, equivalence, subclass, instance.

· Methods for representation of mapping relations: e.g., how to add mappings without affecting the original metadata vocabularies.

· Semi-automatic learning of mapping relations; techniques such as emergent semantics (learning semantic relations from user behaviour) may be relevant here.

· Methods for combining metadata with full text documents within a single query.
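
As an illustration of the second and third topics above, the following Python sketch keeps typed mapping relations in a separate layer, so that the original vocabularies are not modified. The vocabulary terms, the `Mapping` class and the `targets` helper are hypothetical; this is one possible representation under these assumptions, not a prescribed design.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The mapping relation types named in the list above."""
    EQUALITY = "equality"
    EQUIVALENCE = "equivalence"
    SUBCLASS = "subclass"
    INSTANCE = "instance"


@dataclass(frozen=True)
class Mapping:
    source: str         # term from the first vocabulary
    relation: Relation  # type of the mapping relation
    target: str         # term from the second vocabulary


# Hypothetical vocabularies; tuples are used so they stay read-only.
VOCAB_A = ("vocabA:portrait", "vocabA:landscape")
VOCAB_B = ("vocabB:painting", "vocabB:likeness")

# The mappings live in their own list, outside both vocabularies.
MAPPINGS = [
    Mapping("vocabA:portrait", Relation.EQUIVALENCE, "vocabB:likeness"),
    Mapping("vocabA:portrait", Relation.SUBCLASS, "vocabB:painting"),
    Mapping("vocabA:landscape", Relation.SUBCLASS, "vocabB:painting"),
]


def targets(term, relation, mappings=MAPPINGS):
    """Return all terms that `term` is linked to by the given relation type."""
    return [m.target for m in mappings if m.source == term and m.relation is relation]


print(targets("vocabA:portrait", Relation.SUBCLASS))     # ['vocabB:painting']
print(targets("vocabA:portrait", Relation.EQUIVALENCE))  # ['vocabB:likeness']
```

Because the mappings are stored apart from the vocabularies, they can be added, corrected or withdrawn without touching the metadata of either institution.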

Background

To understand the research question and the research topics more in depth, we provide some background. The first two bullets underline the importance of metadata once more. The bullets three to five emphasize the various difficulties with semantics.

· Metadata can refer to various kinds of data types. It turns out that the limited and well-defined semantic scope of keyword type of metadata (like IMDI) can be seen as the backbone for collection maintenance and discovery.

· Keyword type of metadata is also one of the keys for interoperability due to the broad usage (community agreed on elements and use the same concepts) and well-defined limited semantics.

· Achieving semantic interoperability is a hard process where the goals have to be clear. The experience shows that most relationships between the elements of two disciplines can only be expressed with the help of a fuzzy type such as "mapsTo". Frameworks such as RDF(S) and OWL do not include such a relation type for good reasons. Actually, the "mapsTo" relation is exploited as a one-directional equality with some further necessary restrictions.

· The limited semantics of the keyword type of metadata and the fact that metadata creation is an expensive endeavour leading to missing values makes it necessary to use all types of contextual information (within metadata hierarchies/environments and outside) to enrich the metadata and to add it to the discovery domain. Both topics are completely new and not sorted out very well. Research has to be done to understand what is possible and how the quality of the metadata will be influenced. Also it has to be understood how metadata and context information can be combined to increase the chance of discovery.

· Semantic annotation has to rely on well-defined domain knowledge to form a coherent discovery space. Therefore, the concepts to be used should be taken from open data category registries (DCR). If a new concept is introduced due to the fact that the existing ones are semantically not sufficient, then the person intending to use it has the duty to enter it into the data category repository, i.e., defining it properly and also where possible define relationships with other existing concepts. The DCRs are essential to avoid a proliferation of concepts which would reduce its relevance for the discovery space and for achieving interoperability.
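
The one-directional character of the "mapsTo" relation mentioned in the third bullet can be sketched in Python as follows. The concept names and the `expand_query` helper are hypothetical, and the "further necessary restrictions" are not modelled; the sketch only shows that a link from one discipline to another does not imply the reverse link.

```python
# Hypothetical one-directional links: source concept -> target concepts.
MAPS_TO = {
    "disciplineA:genre": ["disciplineB:category"],
    "disciplineA:maker": ["disciplineB:creator", "disciplineB:contributor"],
}


def expand_query(concept, maps_to=MAPS_TO):
    """Return the concept itself plus every concept it maps to (one direction only)."""
    return [concept] + list(maps_to.get(concept, []))


# A query on the source side is widened to the target discipline ...
print(expand_query("disciplineA:maker"))
# ... but a query on the target side is not widened back to the source.
print(expand_query("disciplineB:creator"))
```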

2.1.2 Theme 2: Knowledge enrichment through automated analyses

Situation in cultural heritage

Collection management and research in the cultural heritage field centres around content, i.e., the meaning of texts, objects, images and their mutual relations. For unanalysed objects, this information is hidden and implicit. The goal of knowledge enrichment is to make this implicit information explicitly available. CATCH aims to develop knowledge and to demonstrate its applicability in automated knowledge enrichment tools. One group of tools aims to support experts. Another group of tools enables fully automated analyses.

There are two dimensions in these two groups of tools. First, tools can be used to assist experts, or they can perform fully automatically. Second, tools can follow existing annotation schemes, or they can discover new structures within, and relations between objects. Knowledge enrichment can be applied to any of the media types which are covered by CATCH: text, images, handwritten documents, archaeological objects, etc.

Both groups of tools aim to alleviate the following problems occurring in the daily work of collection managers, and in the quality of many existing databases, respectively.

· Cultural heritage experts (collection managers and researchers) have used and developed content annotation schemes and classifications, laid down in thesauri, reference lists, topic maps. Their ability to apply these schemes and classifications to new data is only limited by time and scale. Knowledge enrichment techniques can alleviate the time and scale bottlenecks by adding machine power to manpower; by emulating how experts annotate data. After they have learned to emulate experts by examples, they can start to annotate (classify, analyse, relate) very large amounts of new data themselves, in a fraction of the time.

· Existing databases of objects, partially or inconsistently marked up with legacy classification systems, can be automatically made more consistent with knowledge enrichment techniques. As far as they are partially or largely unannotated, disorganized, and unlinked, they can be automatically annotated, organized and linked semantically.

Research topics

The leading question is: How can we arrive at the automatic enrichment of cultural heritage data? We know that the current state of affairs asks for (1) tools to support experts in their manual enrichment work, to alleviate time and scale bottlenecks, and (2) tools for automatic data enrichment, particularly for making existing data cleaner and more consistent, and for discovering new structures and relations in data.

The research agenda that follows from these desiderata starts with the development of methods and software tools that can assist experts in their manual work, allowing them to enrich more data in less time. Such tools should be able to emulate experts' annotations, and suggest annotations of new data at such a high level of precision that experts only need to correct these suggestions occasionally. As a second step, the agenda should list the development of tools that operate in domains that demand even more automation; either because no initial annotation scheme is available (the data is still "raw") and an annotation needs to be bootstrapped from data, or because the annotation needs to be performed automatically, either due to the unavailability of experts or as an initial phase in exploring "raw" data.

This agenda calls for the use and development of methods for automatic knowledge generation in data (a broad field encompassing methods from machine learning, statistical learning, and data mining). Knowledge generation from data is typically needed in situations such as the one central to CATCH, where a digitisation effort has produced (potentially large-scale) databases of unanalysed data, and experts (collection managers) are eager to explore and analyse this data as effectively as possible in as little time as possible. Alternatively, the data is already annotated, or is receiving new annotations through a metadata project (as also present in CATCH), and knowledge enrichment is used to learn this annotation and apply it to yet unanalysed data.

This research is intrinsically empirical; the methods to be developed are based on empirical data, and the function they have can and must be judged and evaluated in terms of measurable improvements in accuracy and speed, both by objective quantitative evaluation and by the collection managers that use the methods.
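
A minimal Python sketch of such a quantitative evaluation is given below. It compares the annotations suggested by a tool with those assigned by an expert for the same objects. The labels and the `evaluate` function are hypothetical, and plain accuracy is only one of several measures that could be used.

```python
def evaluate(suggested, expert):
    """Compare tool suggestions with expert annotations for the same objects."""
    if len(suggested) != len(expert):
        raise ValueError("both lists must describe the same objects")
    if not expert:
        raise ValueError("at least one annotated object is required")
    correct = sum(1 for s, e in zip(suggested, expert) if s == e)
    return {
        "accuracy": correct / len(expert),             # share of correct suggestions
        "corrections_needed": len(expert) - correct,   # expert interventions required
    }


# Hypothetical annotations for four objects.
SUGGESTED = ["painting", "manuscript", "painting", "map"]
EXPERT = ["painting", "manuscript", "drawing", "map"]

print(evaluate(SUGGESTED, EXPERT))  # {'accuracy': 0.75, 'corrections_needed': 1}
```

The number of corrections is a rough proxy for the expert time saved; measuring actual speed would require timing the collection managers at work.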

Background

To understand the research question and the research topics more in depth, we provide some background. Table 1 shows four types of knowledge enrichment we distinguished.

A. Expert support, based on existing annotation schemes

Supporting experts in the annotation of objects in databases according to an existing annotation scheme, in a software annotation environment that is able to make accurate suggestions.

Keywords: semi-automatic annotation, domain knowledge, existing ontologies, semantic web

B. Automatic enrichment, based on existing annotation schemes

Automatic annotation of unannotated objects, and automatic cleanup of incorrectly annotated objects. Makes it possible to do what under quadrant A could not have been done in human time.

Keywords: automatic classification, machine learning, data mining, text mining

C. Expert support, automatic discovery of structure

Confronting experts with statistically salient patterns and structures within and between objects, visualising associations, suggesting new structures.

Keywords: exploratory data analysis, data mining, statistical analysis

D. Automatic enrichment, automatic discovery of structure

Discovering structures within and between objects, and exporting these discoveries to ontologies, associative networks, and clustering.

Keywords: knowledge generation from data, self-organization, clustering

Table 1: Four types of knowledge enrichment.

The "A" quadrant represents tools for the direct support of experts in the manual annotation of objects in databases. Precious time can be saved when intelligent software makes accurate suggestions to the annotator, who then only invests time when the suggestion is incorrect. Even more precious time can be saved when the same intelligent software running in the background makes preselections of especially salient objects that need to be annotated first.
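
The idea of software that learns from examples and makes suggestions to the annotator can be sketched in Python as follows. The example descriptions, the labels and the word-overlap similarity are assumptions made for illustration; a real tool would use the learning methods developed in the programme rather than this simple nearest-example rule.

```python
def tokens(text):
    """Split a description into a set of lower-case words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word overlap between two token sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical objects already annotated by an expert: (description, label).
EXAMPLES = [
    ("portrait of a merchant in oil on canvas", "painting"),
    ("oil on canvas landscape with windmill", "painting"),
    ("handwritten letter to the city council", "manuscript"),
    ("handwritten ledger of the trading company", "manuscript"),
]


def suggest(description, examples=EXAMPLES):
    """Suggest the label of the most similar annotated example, with its score."""
    query = tokens(description)
    best_label, best_score = None, 0.0
    for text, label in examples:
        score = jaccard(query, tokens(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


label, score = suggest("handwritten letter to a merchant")
print(label, score)  # manuscript 0.375
```

The score can be used to decide which suggestions are shown to the annotator and which objects are presented first; when nothing overlaps, the function returns no suggestion at all.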

The "B" quadrant takes over from the "A"-quadrant tools when the scale of the data cannot be tackled by the available human expert time. "B"-quadrant tools automatically annotate large amounts of data, and check for inconsistencies and noise in existing annotated databases. They will not do this flawlessly, but well enough that the automatically annotated data becomes largely searchable and retrievable, where before it was not.

The "C" quadrant is the mirror of the "A" quadrant, except that experts are not helped with annotation, but rather confronted with new patterns and relations that may deserve a new annotation symbol or level. A likely example is a new level of annotation which links pairs of objects to each other on grounds of some significant co-occurrence of the two, that thus far was not acknowledged by any level of annotation.
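
A minimal Python sketch of such a co-occurrence analysis is shown below. The data are hypothetical: each set lists the objects mentioned together in one source document. A simple count threshold stands in for a proper test of statistical significance, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: the objects mentioned together in each source document.
DOCUMENT_MENTIONS = [
    {"obj-1", "obj-2"},
    {"obj-1", "obj-2", "obj-3"},
    {"obj-2", "obj-3"},
    {"obj-1", "obj-2"},
]


def cooccurring_pairs(mentions, min_count=2):
    """Count how often each pair of objects occurs together; keep frequent pairs."""
    counts = Counter()
    for objects in mentions:
        for pair in combinations(sorted(objects), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


print(cooccurring_pairs(DOCUMENT_MENTIONS))
# {('obj-1', 'obj-2'): 3, ('obj-2', 'obj-3'): 2}
```

The resulting pairs are candidates that an expert can accept or reject as a new level of annotation.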

The "D" quadrant combines "B" and "C": it operates autonomously in data to discover any grouping of objects that might be of interest, on such large amounts of data that a manual inspection of the process would not be feasible, except at the very end of the automatic knowledge discovery process.

2.1.3 Theme 3: Personalisation through presentation

Situation in cultural heritage

Most of the services that are currently available have predefined presentations. The institutions determine the ways a user may view objects and their metadata. Information technology offers many new options for personalisation of the presentation, but these are hardly used at all. The reason is straightforward: there are actually no easy-to-use tools in that respect. More research into human-computer interaction and user modelling is needed to specify such tools. A clear instance is the need for better navigation through digital collections. The number of objects from cultural institutions runs into the millions, if not billions when considered on a global scale. User modelling is considered as an attractive option for navigating more quickly, easily and efficiently across digital collections or objects. By automatic analysis of the user's search behaviour and by offering the facility to create personal contexts, it is expected that users can benefit more from such information services than via direct search-and-retrieval actions.

Research topics

The leading research question is: How can we develop methods and tools for generating presentations of cultural-heritage objects that are related in a semantic way? This work also includes (1) user-modelling issues, e.g., how can user groups be related to presentation styles? and (2) user-control issues, e.g., how can the user control the presentation style? More specifically, we list the following three research questions.

· Is it possible adequately to reduce the user's effort when expressing the ambitious information need that the system must take into account besides many other elements?

· Is it possible to construct a tool that composes an agreed-upon ontology in order to determine the meaning of terms in the user's questions and in the information sources?

· To what extent is it possible to find an "optimal" mix of (1) proactive behaviour that is based solely on the user's known interests and (2) selection of information based on other users' interests or the importance of certain (unrequested) information?

For the research involved two observations are important.

· The availability of syntactically (XML-based) and semantically (RDF/OWL-based) integrated metadata opens new avenues for presentation and personalisation.

· By using semantic relations such as "period" and "style" it becomes possible to generate tailor-made presentations for groups or individuals.
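As an illustration of the second observation, the sketch below groups objects by one semantic relation so that a presentation can be organised around it. The record layout, the titles and the helper name `group_by_relation` are assumptions made for this example only; real metadata would come from the integrated RDF/OWL descriptions.

```python
from collections import defaultdict

def group_by_relation(objects, relation):
    """Group object titles by the value of one semantic relation."""
    groups = defaultdict(list)
    for obj in objects:
        value = obj.get(relation)
        if value is not None:  # skip objects that lack this relation
            groups[value].append(obj["title"])
    return dict(groups)

# Hypothetical metadata records; titles and values are made up.
objects = [
    {"title": "Portrait A", "period": "17th century", "style": "Baroque"},
    {"title": "Landscape B", "period": "17th century", "style": "Baroque"},
    {"title": "Still life C", "period": "19th century", "style": "Impressionism"},
]

# A user group that prefers a chronological view gets a grouping by period.
by_period = group_by_relation(objects, "period")
print(by_period["17th century"])  # ['Portrait A', 'Landscape B']
```

Choosing "style" instead of "period" as the relation yields a different presentation of the same objects, which is the kind of tailoring the observation refers to.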

Background

To provide an appropriate insight into the complexity of the three research questions we add some details about context and depth of the investigations. In research question 1, the "many other elements" include a user model containing the interests, goals, background and knowledge of the user, contextual information such as the physical location of the user and perhaps also his/her orientation, the time of day, the device and network he/she is using to interact with the system. Presently research is carried out on adapting the selection and presentation of information to a user based on one type of information about that user (either knowledge, interest, or context). This should be complemented by research on adaptation based on all kinds of information about the user in question and his/her context.
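The "many other elements" listed above can be pictured as a single data structure. The following is a minimal sketch, not a proposed CATCH design: the class names `UserModel` and `Context`, the field types and the helper `available_information` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Context:
    location: Optional[str] = None     # physical location of the user
    orientation: Optional[str] = None  # which way the user is facing
    time_of_day: Optional[str] = None
    device: Optional[str] = None
    network: Optional[str] = None

@dataclass
class UserModel:
    interests: list = field(default_factory=list)
    goals: list = field(default_factory=list)
    background: Optional[str] = None
    knowledge: dict = field(default_factory=dict)
    context: Context = field(default_factory=Context)

    def available_information(self):
        """Names of the kinds of information that are actually filled in."""
        kinds = [name for name in ("interests", "goals", "background", "knowledge")
                 if getattr(self, name)]
        if any(vars(self.context).values()):
            kinds.append("context")
        return kinds

user = UserModel(interests=["maps"], context=Context(device="phone"))
print(user.available_information())  # ['interests', 'context']
```

Adaptation based on one type of information would read a single field; the research question asks how to adapt when several of these fields are available at once.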

For research question 2 it is beneficial to understand that the answer to a question also consists of objects described by semantic metadata, used to determine how these objects relate to one another. This semantic information needs to be combined with descriptive metadata in order to generate a hypermedia (Web) structure that can be viewed using a "browser". While currently it is possible to generate such presentations based on one set of metadata, the combination of different types of metadata has to be investigated in order to generate the most appropriate presentation for each individual user.

Research question 3 looks somewhat further into the future: systems can be made to become proactive, selecting and presenting information that matches the user's interests and needs without the user having to express that need through a question. The automatic provision of information on a person, e.g., architect Max Weber, when dealing with housing of multicultural groups in Amsterdam, is a good example of proactive behaviour. A mix of active and proactive behaviour is needed to prevent an agent from becoming boring: an agent that only responds to explicit requests will never surprise the user with interesting but unexpected information.
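One simple way to picture the "mix" asked about in research question 3 is a weighted blend of two scores per item. This is only a sketch under strong assumptions: interest is represented as a number between 0 and 1, and the function name `rank_items` and the weight `alpha` are hypothetical. Finding a suitable mix is precisely the open research question.

```python
def rank_items(own_interest, others_interest, alpha=0.7):
    """Rank items by a weighted blend of two interest scores in [0, 1].

    alpha = 1.0 uses only the user's known interests;
    alpha = 0.0 uses only what other users found interesting.
    """
    items = set(own_interest) | set(others_interest)
    blended = {
        item: alpha * own_interest.get(item, 0.0)
        + (1 - alpha) * others_interest.get(item, 0.0)
        for item in items
    }
    # Highest blended score first; ties broken alphabetically.
    return sorted(blended, key=lambda item: (-blended[item], item))

# Hypothetical scores for three topics.
own = {"ship models": 0.9, "maps": 0.4}
others = {"maps": 0.8, "coins": 0.9}
print(rank_items(own, others))  # ['ship models', 'maps', 'coins']
```

With a lower `alpha`, items the user never asked about (here "coins") move up the ranking, which is how unexpected information could reach the user.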

For the research theme personalisation the CATCH programme aims at acquiring new knowledge in three subdomains: (1) selection of information, (2) automatic generation of presentations, and (3) adaptation or personalisation.

Selection of information. The challenge here is to answer incomplete information requests from users with an accuracy that is comparable with or even better than the database-query accuracy. Four techniques have to be combined into heuristic evaluation tools to achieve this goal. The techniques are: (1) information retrieval techniques based on (potential) natural language understanding of textual contents, (2) information retrieval techniques based on metadata using ontologies, (3) selection of objects based on descriptive metadata, and (4) database integration methods.

Automatic generation of presentations. The challenge is to "combine" selected information objects of different media types. Perhaps having different types of navigational or semantic relationships and combining them into a single virtual hypermedia (Web) presentation is the most difficult part. In that case it is necessary to adapt the result to the device and network capabilities of the user's environment. This requires a careful (automatic) selection of the use of the "dimensions" layout, time, and navigation.

Adaptation or personalisation. The results of almost any possible information request are too large to be presented to and browsed through by a user. Hence, an environment must be designed that derives additional specifications of the information or objects to be selected from past user behaviour. In order to improve this process, and especially its initial stages, users need to be clustered in groups (with similar interests, background, expertise, etc.). Finding scalable algorithms for grouping is an additional research issue here.
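The grouping of users can be sketched as follows. The example assumes that each user is described by a set of interest keywords and uses a naive greedy procedure with Jaccard similarity; the names, the profiles and the threshold are hypothetical, and the procedure is not one of the scalable algorithms the text calls for.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_users(profiles, threshold=0.5):
    """Greedy single-pass grouping: a user joins the first group whose
    first member (the seed) is similar enough, else starts a new group."""
    groups = []  # list of (seed_interests, [user names])
    for name, interests in profiles.items():
        for seed, members in groups:
            if jaccard(seed, interests) >= threshold:
                members.append(name)
                break
        else:
            groups.append((interests, [name]))
    return [members for _, members in groups]

# Hypothetical interest profiles.
profiles = {
    "anna": {"maps", "ships", "trade"},
    "bram": {"maps", "ships"},
    "carla": {"portraits", "painting"},
    "daan": {"painting", "portraits", "prints"},
}
print(group_users(profiles))  # [['anna', 'bram'], ['carla', 'daan']]
```

A new user with little recorded behaviour could initially inherit the preferences of the group that best matches a few stated interests, which is the "initial stages" improvement mentioned above.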

2.2 Implementation Strategy

The implementation strategy has two branches: the practical implementation and the structural implementation. The practical implementation focuses on the character of the project: demand pull. Hence in 2.2.1 we discuss "tools for the back office" and in 2.2.2 we deal with the composition of the research teams and their collaborations. The structural implementation emphasizes the design principles to be valid for all cultural heritage institutions and to be followed by all research teams (in 2.2.3). In 2.2.4 attention is paid to the connectedness of the knowledge suppliers (the cultural heritage), the researchers, and the end users by introducing two integrators in which the software and tools have to be implemented.

2.2.1 Tools for the "back office"

The potential users of the results of CATCH fall into two categories.

1. The collection managers of the cultural heritage institutes.

2. The end users of the services provided by the cultural heritage institutes.

The two categories have their own demands. The first group is located in the "back office". Here preparations are made for the services and products (such as exhibitions, catalogues, and websites) which will be presented to the end users: the people who are the rationale for the very existence of the cultural heritage institutions. Within the category of end users we distinguish four groups.

a. Research: scientific staff from disciplines like History, History of Art, Archaeology, Cultural Studies, Linguistics, etc.

b. Education: teachers at universities, high schools, Art Academies.

c. Media: journalists, publishers, editors, marketeers of cultural heritage institutions.

d. Entertainment and edutainment: the general public.

The CATCH research focuses on the development of tools and methods for the collection managers of the cultural heritage institutes (category 1 users) that will enable them to do more in less time and with higher quality. This speeding up of back office processes is needed for at least three reasons: (1) the rapidly growing amount of digitised heritage, (2) the existing amount of heritage that is still waiting to be processed and (3) the ever faster changes in public demand (category 2 users). Cultural heritage institutes have to adapt to these changes or they will become obsolete. Information technology can provide tools to support the back office in its endeavour to enhance the interaction between the end users and their cultural heritage. It is the ambition of CATCH to develop new knowledge and demonstrate its applicability in a number of tools suitable for use in a wide range of cultural heritage institutes.

Within the category of end users CATCH pays special attention to group (a): scientific staff from disciplines like History, History of Art, Archaeology, Cultural Studies, Linguistics etc.

2.2.2 Composition of the Research Teams

Essential for the rationale underlying the CATCH programme, the temporary researchers and programmers financed by CATCH will be employed by the universities³ but will have their daily work within the cultural heritage institutes. By physically locating the researchers in the environment where the fruits of their research will be used, CATCH aims at supporting a vivid interaction between the researchers and the prospective users. The idea is that the principal investigator remains responsible for the quality of the research being done, and that the director of the hosting cultural heritage institute has control over the daily routine. Rights and duties of all parties involved are laid down in a guest researcher agreement.

The CATCH research teams will consist of:

· CATCH-funded temporary researchers (PhD students, postdocs), employed by the universities.

· Senior research staff employed by universities.

· Senior staff employed by cultural heritage institutions (researchers and/or collection managers or others with relevant expertise).

· CATCH-funded temporary scientific programmers, employed by the universities.

· Programmers employed by the cultural heritage institutions.

Each team is a mix of persons from each of these five categories. The CATCH principles of interaction and co-operation are also manifest in the composition of the research teams. The PhD students, postdocs and programmers financed by CATCH are embedded in a team consisting of both senior researchers from one or more universities and senior staff from the cultural heritage institute acting as host. The teams are jointly headed by the principal investigator from the university and one of the senior staff members of the hosting cultural heritage institute.

3 In this programme text "universities" is used as a shorthand for "universities, Telematics Institute and Max Planck Institute for Psycholinguistics".

The programme starts with the formation of six teams. For each team, CATCH funds one PhD student (four years), one postdoc (three years) and one scientific programmer (four years). The six teams will each execute a core project; together these projects constitute the foundation for the research programme. The research details of the core projects are given in Appendix I. Table 2 gives an overview of the universities and cultural heritage institutions involved in the core projects. The second and third columns mention the principal investigator and the cultural heritage staff member who are jointly responsible for the execution of the project. The fourth column mentions the universities which will employ the researchers and programmers. The last column mentions the cultural heritage institutions in which the researchers and programmers will actually do their work.

Project	Principal investigator & university	Staff member cultural heritage	Researchers & programmers (university)	CH institution
Theme 1: Semantic interoperability through metadata
Project 1.1: STITCH	Van Harmelen, VU	Matthezing, KB	1 PhD VU, 1 Postdoc MPI, 1 Progr. VU	KB
Project 1.2: CHOICE	Veenstra, TI	Oomen, B&G	1 PhD TI, 1 Postdoc VU, 1 Progr. MPI	B&G
Theme 2: Knowledge enrichment through automated analyses
Project 2.1: RICH	Postma, UM	Lange, ROB	1 PhD UM, 1 Postdoc UM, 1 Progr. UM	ROB
Project 2.2: SCRATCH	Schomaker, RUG	Jager, NA	1 PhD RUG, 1 Postdoc RUG, 1 Progr. RUG	NA
Project 2.3: MITCH	Van den Bosch, UvT	Houtgraaf, Naturalis	1 PhD UvT, 1 Postdoc UvT, 1 Progr. UvT	Naturalis
Theme 3: Personalisation through presentation
Project 3.1: CHIP	De Bra, TUE	Sigmond, RM	1 PhD TUE, 1 Postdoc TI, 1 Progr. TUE	RM

Table 2: Distribution of core projects over themes, universities and CH institutions

Universities:

MPI = Max-Planck-Institut für Psycholinguistik, Nijmegen

RUG = Rijksuniversiteit Groningen

TI = Telematica Instituut, Enschede

TUE = Technische Universiteit Eindhoven

VU = Vrije Universiteit, Amsterdam

UM = Universiteit Maastricht

UvT = Universiteit van Tilburg

Cultural Heritage Institutions:

B&G = Nederlands Instituut voor Beeld en Geluid, Hilversum

KB = Koninklijke Bibliotheek, Den Haag

NA = Nationaal Archief, Den Haag

Naturalis = Nationaal Natuurhistorisch Museum Naturalis, Leiden

RM = Rijksmuseum, Amsterdam

ROB = Rijksdienst voor het Oudheidkundig Bodemonderzoek, Amersfoort

During its lifetime, CATCH will be able to fund a total of 17 research teams, i.e., 34 temporary researchers and 17 programmers⁴. The 11 remaining teams will be selected in competition on the basis of research plans. All Dutch universities can enter the competition, which will be organised by NWO and obey the usual NWO rules and regulations for research programmes like this.

2.2.3 Design Principles

CATCH focuses on knowledge-based access of the cultural heritage (sources, resources, and knowledge). IT provides tools to facilitate access. Three themes are formulated to guide the research and development of tools: semantic interoperability, knowledge enrichment, and personalisation. Moreover, strategy and organisation determine the constraints that the projects must meet. The software developed will have the character of open-source software.

The CATCH programme should start with determining a standard measure, i.e., an inventory of what is available on (say) November 1, 2004. This will be done in two respects. All PhD students and postdocs will start their project with a 'warming-up period' of two months to get acquainted with the state of affairs in their hosting cultural heritage institution. During this period they become aware of the problems the cultural heritage institution encounters in its IT operations. The focus is, of course, on problems related to the research project to be executed. It is very important that during this period the researchers (and their supervisors) get to know the organisational structure of the hosting institution and the people outside the research team (often support staff) who can at some stage contribute to the progress of the actual research effort. One practical way of doing this is by tackling a small practical IT problem. This will benefit both the researchers (who will get a crash course about the hosting institution) and the hosting institution (which will have one of its small IT problems solved).

The warming-up for all programmers consists of making an inventory of the existing and emerging (software) standards relevant to their hosting institutions. In a later stage the inventory can be broadened to requirements for standardisation accepted in the cultural sector.⁵ The inventory is very important, since the group of programmers will be responsible for implementing the interoperability results obtained by the researchers. The inventory can help in further focusing the research effort as the programme progresses.

4 The exact mix of personnel is to be determined during the execution of the programme.

5 DEN is the organisation that guards these standards. DEN is in contact with the Netherlands Standardization Institute NEN. Both organisations investigate forms of co-operation in the field of digital cultural heritage.

Six organisational principles will lead to a uniform development of rules for projects within CATCH (see 3.2). Below we provide the notions of the guidelines which will be set up in the first phase of the project. The CATCH design principles are as follows.

· Distributed systems

· Extreme modularity

· Open standards

· Web enabled systems

· Interoperability

· Use of adaptive IT Techniques

· Digital durability

2.2.4 Integrators

To maximise the chances of success and to assure interoperability, the software has to be implemented in at least two integrators: (1) The Memory of the Netherlands (a large database and website about digitised cultural objects maintained and developed by the Koninklijke Bibliotheek) and (2) a museum environment (e.g. the Rijksmuseum). Of course, the application must also be able to work with systems in use in the host cultural heritage institution. No software will be accepted that runs in only one environment. Knowledge and software must contribute to the integration and interoperability of collections of participating cultural heritage institutions as well as non-participating institutions. The programme committee's second task is to see to it that all the software developed in each of the projects is embedded in at least one of the CATCH integrators.

The CATCH programme is structured according to three themes. All cultural heritage institutes participating in the CATCH programme will be involved in the research lines in the first phase of the project. Figure 1 illustrates the general structure of the programme. The integrators form the centre of the programme, the testbeds where all techniques and methods come together. Going from the bottom to the top of the diagram, we observe the following stages. The cultural heritage institutions (depicted at the bottom of the diagram in Figure 1) digitise their heritage objects. Durable storage and knowledge enrichment techniques operate on the digitised objects.

[Diagram not reproduced. Labels recoverable from the original, from the user side at the top down to the institutions at the bottom: user contexts; generic services; knowledge-based access and presentation; testbed platform Digital Heritage with distributed infrastructure, interoperability, metadata model and DRM/IPR model; knowledge enrichment and durable storage; legacy systems and databases with collection management tools.]

Figure 1: Schematic illustration of the integrators and their relations to the CATCH research themes.

The results generated by, for instance, enrichment techniques lead to novel metadata. Within the integrator (the shaded area in the diagram), a metadata model is specified that prescribes the format of the newly created metadata. In addition, the integrator realises a distributed infrastructure in which the research line of interoperability plays a main role. At the user side (depicted at the top of the diagram), the research lines of theme 3 (personalisation) will enhance the accessibility for the user. Thus the integrators play a pivotal role in the CATCH programme: all research themes come together within or at the boundaries of the integrator.

3. SUPPORT STRATEGY

The CATCH programme is a demand-pull programme, with the aim to perform excellent research and produce tools and software that are valuable to the cultural heritage institutions. However, achieving this twofold aim is not by itself sufficient for the project to count as a success in the near future. Therefore, a support strategy has to be developed, in the form of a support programme with two aims.

1. To facilitate the transfer of knowledge and tools (a) within the programme and (b) to all other parties interested in the CATCH results.

2. To build and establish a structure which guarantees continuity for the results (in particular the tools, the software, and the knowledge) of the programme.

The support programme is run by the Programme Management Bureau (see section 4.5).

3.1 Transfer of knowledge and tools

In the programme's first year the Programme Committee will formulate and implement a specific plan for knowledge transfer. The costs of this plan will amount to approximately 10% of the research budget. The following seven items list the initiatives to be implemented.

Publications: The results of the fundamental strategic research will be published in the usual scientific media (doctoral theses, articles in journals, contributions to conferences and workshops).

Demonstrators: Researchers will be stimulated to develop demonstrators showing the potential of research results which can make an important contribution to the knowledge transfer.

Annual Seminars: Every year the CATCH and MultimediaN Programme Committees will jointly organize a seminar, the Dutch Multimedia Event. Furthermore, other seminars may be organised that focus on the Dutch researchers and cultural heritage experts active in fields closely related to the programme. Members of the International Scientific Advisory Board will also be invited to attend. Although these seminars will primarily focus on the Dutch experts, the organisation will invite a number of prominent foreign researchers who will be asked to comment on the status of research in the CATCH programme.

Workshops: Two international workshops will be organised: one after two and a half years and one at the end of the programme. The topics will be selected from the three themes. It is assumed that approximately 100 people will participate in these workshops, the majority of whom will be from abroad.

The workshops will in particular play a role in the programme's evaluation. To this end, the workshop halfway through the programme can be made to have consequences for the planning of the second half of the programme.

User group: User Groups will be formed in order to guarantee the transfer of knowledge to cultural heritage experts, the business community and society in general. At least three groups will be formed, one for each of the three research themes. Each User Group consists of representatives from interested industrial companies and institutions with a background that enables them to provide substantive feedback on the progress, course and results of the research. Special user seminars will also be organised in consultation with the cultural heritage and business community.

Patents: Patent applications are an important form of knowledge transfer. The CATCH programme will strive to develop patentable knowledge. Project partners will lay down agreements with regard to patents and licenses. STW will assist the possible exploitation of patents.

Website: The programme will maintain a website which will be used to provide companies, institutions and the popular scientific press access to the results of the research. The researchers in the programme will be stimulated and, where necessary, supported so that they can present the results of their research in a way which makes it accessible to outsiders. Furthermore, there will be a members-only section on the website which is accessible only to researchers directly involved in the programme.

Moreover, the Programme Committee will link the programme to initiatives like "Boulevard van het actuele verleden" ("Boulevard of the current past"), which seek to create a historical "experience" for the general public. The aim of "Boulevard" is to submerge visitors in a virtual world, recreating an historical past. The Programme Committee will explore if and in what way CATCH research can contribute to initiatives like "Boulevard".

3.2 Continuity

Initially, knowledge transfer will be promoted by (a) the participation of cultural heritage institutes and knowledge institutions in the Programme Committee who control the research and (b) by the joint participation of cultural heritage experts and academic researchers in the programme projects. More specifically, in the individual CATCH projects the researchers and programmers will be hosted by cultural heritage institutes, i.e., they will actually perform a considerable part of their research within the environment of the cultural heritage institutes thus allowing for optimal knowledge transfer opportunities.

There are six organisational principles imposed by the CATCH programme that hold for all participants.

· The Programme Committee will ensure that the IPR to the software and tools developed within the CATCH programme will be properly secured.

· Tools and software developed within the CATCH programme must be centrally registered after completion of the project (during development they will be provisionally registered). The Programme Committee has already held preliminary discussions with SURF about the support, maintenance and availability of the tools and software that will be developed within the CATCH programme (cf. DARE repositories⁶).

· Tools and software are freely available and usable for the partners. Moreover, they will also be made available for cultural heritage institutions which do not directly participate in CATCH. However, these institutions should register their use of the tools at the administration controlling the software and tools.

· Cultural heritage institutions may elaborate on the software obtained. However, they have the duty to supply their results for free to the organisation serving as a clearing house for the CATCH programme results.

6 SURF DARE repositories: http://www.darenet.nl/en/toon

4. PROGRAMME MANAGEMENT AND BUDGET

This section contains an overview of tasks and responsibilities of three committees and the Programme Management Bureau. Furthermore, a global overview is given on the budget.

4.1 Steering Committee

The Steering Committee of the CATCH programme will be formed by the members of the Council for Physical Sciences supplemented by at least one representative of the Council for Humanities and a representative of the cultural heritage institutes. A SURF representative will also be invited to sit in as an advisor. If other parties decide to contribute financially to the CATCH programme, the composition of the Steering Committee may be extended. The Steering Committee meets twice a year, or more often if necessary.

The tasks and responsibilities of the Steering Committee (SC) are as follows.

· The SC supervises the Programme Committee (PC) in the execution of the research programme with regard to progress and cohesion.

· At least once a year the SC reports to the financing bodies of the programme about the progress of the programme and its financial situation.

· The SC formally appoints the members of the Programme Committee.

· Every year the SC has to approve the PC's proposal for the budget.

· The SC makes the formal granting decisions on the basis of a PC proposal.

· The SC ensures that specific actions are taken to ensure the continued availability and maintenance of the programme results.

4.2 Programme Committee

The Programme Committee (PC) is appointed by the Steering Committee. The PC will consist of at most 12 persons, who will be appointed on the basis of their expertise related to the CATCH programme. The Programme Committee will consist of:

· the two programme leaders

· the leaders of the three research themes: per theme one computer science and one CH representative

· some representatives of related programmes.

The directors of the NWO Councils for Physical Sciences and Humanities will have a standing invitation for the meetings of the Programme Committee.

The tasks and responsibilities of the Programme Committee are as follows.

· The PC determines and monitors the course of the research programme.

· Within six months after the start of the programme, the PC will submit to the SC a list of success criteria which are to be used in evaluating the programme.

· Before the end of the first programme year, the PC will formulate a specified plan for knowledge transfer.

· The PC formulates Calls for Proposals, appropriate research themes and assessment criteria.

· Each year the PC reports to the SC about the progress of the research programme, its budgetary situation and its plans for the next years.

· The PC is responsible for organising a midterm evaluation and a final evaluation.

· At least three times a year the PC will organise a meeting at which all the researchers involved in the programme will present their results and their plans for future research. Foreign experts can be involved in these seminars.

The programme leaders, the programme manager and the directors of the Council for the Physical Sciences and the Council for the Humanities form an Executive Committee, which will be responsible for handling the day-to-day affairs.

4.3 International Scientific Advisory Board

The Programme Committee and Steering Committee will be assisted by an International Scientific Advisory Board (ISAB), consisting of internationally respected experts in the field of information science and the application of these techniques on cultural heritage data, and specialists from cultural heritage institutes with expertise in computer science. The ISAB functions as an external assessor of the six core projects that will form the basis of the CATCH programme. These projects can only start after approval from the ISAB. Moreover, the ISAB will review and prioritize the full proposals submitted in the competitions (section 4.7). Annually, the SC seeks the ISAB's advice on the quality and the direction of the CATCH research seen in international perspective. The ISAB will also be involved in the midterm and final evaluation of the project. Finally, the members of this board will be invited to attend the CATCH workshops and can be consulted as advisors for those involved in the CATCH project.

4.4 User Groups

As already mentioned, User Groups will be formed in order to guarantee the transfer of knowledge to cultural heritage experts, the business community and society in general. At least three groups will be formed, one for each of the three research themes. Each User Group consists of representatives from interested industrial companies and institutions with a background that enables them to provide substantive feedback on the progress, course and results of the research. Special user seminars will also be organised in consultation with the cultural heritage and business community. The chairman of each User Group will be part of the Programme Committee. These User Groups will also be actively involved in determining the programme's direction and in evaluating the progress of the individual projects and the programme as a whole.

4.5 Programme Management Bureau

The SC, PC, and ISAB will be supported by a Programme Management Bureau (PMB) which will be hosted by NWO. The CATCH PMB consists of a programme officer and his/her staff. The PMB costs will be covered by the programme budget.

The tasks and responsibilities of the Programme Management Bureau are as follows.

· The PMB supports the SC, the PC and ISAB and prepares their meetings.

· The PMB is responsible for the day-to-day scientific managerial and financial administrative affairs of the programme.

· The PMB organises the calls for proposals.

· The PMB monitors the progress of the programme projects and formulates the yearly progress reports.

· The PMB stimulates the coherence and knowledge transfer within the programme.

· The PMB promotes the dissemination of the programme results.

· The PMB takes care of the practical organisation of programme workshops and evaluations.

4.6 Committee of Recommendation

The Cultural and Industrial Advisory Board will consist of a number of persons with an influential cultural or industrial position in the Netherlands who have agreed to function as ambassadors for the CATCH programme.

4.7 Budget

The total budget of the programme is estimated at M€ 15,3, of which M€ 12,5 will be made available as subsidies and M€ 2,8 will be contributed in kind by the participating cultural heritage institutions. The programme starts with M€ 6,0 in subsidies, committed by the NWO Councils for Physical Sciences and Humanities. The remaining M€ 6,5 in subsidies have been reserved by NWO (M€ 5,0) and the Ministry of Education, Culture and Science (M€ 1,5), but their definitive commitment to the programme depends amongst others on the progress the programme makes.

The in kind contribution of the cultural heritage institutions will be 25% of the subsidies provided by NWO. The contributions will be realised through the participation in the CATCH research teams of researchers, programmers and other staff employed by the cultural heritage institutions (cf. section 2.2.2), and through the participation of representatives of the cultural heritage institutions in CATCH's governing bodies.

Section 5.1.1 describes developments within the Royal Netherlands Academy of Arts and Sciences (KNAW) regarding an e-Science programme for the humanities and social sciences. If granted, the programme is expected to have a budget of M€ 4,5. Although the programme will not be part of CATCH in the strict sense, there are clearly related issues in both programmes. Coordination and linkage are secured by the participation of Peter Doorn of the KNAW in the (preparatory) CATCH Programme Committee. If the KNAW programme is granted, it will contribute to the joint national effort with respect to the accessibility of digitised Dutch cultural heritage. In the table below the KNAW programme is added provisionally.

(Amounts in M€)	Phase 1	Phase 2	Total
NWO Physical Sciences	5.0	2.5	7.5
NWO Humanities	1.0		1.0
NWO General Board		2.5	2.5
Ministry of Education, Culture and Science		1.5	1.5
Total Subsidies	6.0	6.5	12.5
Contribution cultural heritage institutions	1.5	1.3	2.8
Total CATCH programme	7.5	7.8	15.3
KNAW e-Science for humanities and social sciences	PM	PM	(4.5)
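
The totals in the table above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the amounts are transcribed from the table and converted to k€ so that integer arithmetic can be used, and the names `subsidies`, `in_kind` and `phase_totals` are hypothetical and not part of the programme documentation. The provisional KNAW programme is left out, as it is not part of the CATCH total.

```python
# Illustrative check of the CATCH funding table (amounts in k EUR).
# Each entry is (phase 1, phase 2); all names here are hypothetical.
subsidies = {
    "NWO Physical Sciences": (5000, 2500),
    "NWO Humanities": (1000, 0),
    "NWO General Board": (0, 2500),
    "Ministry of Education, Culture and Science": (0, 1500),
}
in_kind = (1500, 1300)  # contribution of the cultural heritage institutions


def phase_totals(rows):
    """Sum a collection of (phase 1, phase 2) pairs column by column."""
    rows = list(rows)
    phase1 = sum(p1 for p1, _ in rows)
    phase2 = sum(p2 for _, p2 in rows)
    return phase1, phase2


subsidy_p1, subsidy_p2 = phase_totals(subsidies.values())
total_subsidies = subsidy_p1 + subsidy_p2
total_programme = total_subsidies + sum(in_kind)

print("Subsidies per phase:", subsidy_p1, subsidy_p2)
print("Total subsidies:", total_subsidies)
print("Total CATCH programme:", total_programme)
```

With the figures as transcribed, the subsidies come to k€ 6,000 and k€ 6,500 per phase (k€ 12,500 in total) and the programme total to k€ 15,300, in agreement with the amounts stated in the text.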

The budget is available for the execution of the three CATCH strategies: research, implementation and support. As described in chapter 2 and 3, these strategies are closely intertwined. The preliminary distribution of the budget over the three strategies is depicted in figure 2.

[Figure 2 (diagram not reproduced). Recoverable labels: RESEARCH; IMPLEMENTATION; investments: M€ 0,5; SUPPORT: M€ 2,4 (transfer of knowledge and tools, continuity, programme management).]

Figure 2: Preliminary distribution of the budget over the three CATCH strategies

Assuming an average project budget of k€ 565, a total of 17 projects can be funded. The subsidy covers the wages of one PhD student, one postdoc for three years and one programmer for four years. Furthermore, within each project budget k€ 24 is available for the purchase of small computing equipment and software, on top of the usual bench fee of k€ 5 for each PhD student and postdoc.

The programme starts with six core projects. The eleven remaining teams will be selected in competition on the basis of research plans. All Dutch universities can enter the competition, which will be organised by NWO and will follow the usual NWO rules and regulations for research programmes of this kind. The CATCH competitions will be part of the annual competition for NWO computer science programmes (call for proposals in November, deadline for submission in February, decision on acceptance or rejection in July).

Assuming a more or less even distribution of the research budget over the three research themes, the relation between core projects and projects to be granted in competition is:

(Amounts in k€)	core projects⁷		competition		total projects
	no.	budget	no.	budget	no.	budget
Theme 1	2	1.130	4	2.260	6	3.390
Theme 2	3	1.695	3	1.695	6	3.390
Theme 3	1	565	4	2.260	5	2.825
Total subsidy	6	3.390	11	6.215	17	9.605
Contribution CH⁸		848		1.554		2.402
Total research		4.238		7.769		12.007

Theme 1 = Semantic interoperability through metadata

Theme 2 = Knowledge enrichment through automated analyses

Theme 3 = Personalisation through presentation
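
The arithmetic behind the table can be reproduced as follows. The sketch assumes, as the figures suggest, that every project receives the average budget of k€ 565 and that the contribution of the cultural heritage institutions (CH) is 25% of the subsidy, rounded to the nearest k€ per column. The rounding rule and all function and variable names are assumptions made for illustration only.

```python
# Illustrative reconstruction of the project table (amounts in k EUR).
AVERAGE_PROJECT_BUDGET = 565  # average project budget stated in the text

# Number of projects per theme: (core projects, projects granted in competition)
projects = {"Theme 1": (2, 4), "Theme 2": (3, 3), "Theme 3": (1, 4)}


def quarter_rounded(amount):
    """Return 25% of an integer amount, rounded half up (assumed rounding rule)."""
    return (amount + 2) // 4


core_no = sum(core for core, _ in projects.values())
comp_no = sum(comp for _, comp in projects.values())

core_subsidy = core_no * AVERAGE_PROJECT_BUDGET
comp_subsidy = comp_no * AVERAGE_PROJECT_BUDGET

# In-kind contribution, computed per column and then summed, as in the table.
core_ch = quarter_rounded(core_subsidy)
comp_ch = quarter_rounded(comp_subsidy)

total_subsidy = core_subsidy + comp_subsidy
total_ch = core_ch + comp_ch
total_research = total_subsidy + total_ch

print("Projects:", core_no + comp_no)
print("Total subsidy:", total_subsidy)
print("Contribution CH:", total_ch)
print("Total research:", total_research)
```

Under these assumptions the sketch yields 17 projects, a total subsidy of k€ 9,605, a CH contribution of k€ 2,402 and a research total of k€ 12,007. Footnote 7 still applies: the actual core-project budget is lower than the average-based figure used here.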

For all budget figures, the actual distribution can be adjusted by the Programme Committee and the Steering Committee, depending on the development of the programme or the advice of the International Scientific Advisory Board.

7 In fact, the budget for the core projects is k€ 3.210 (and thus the budget for the other projects k€ 6.395), since the wages for the researchers and programmers are lower in 2004 than they will be in later years. For ease of presentation, the average project budget of k€ 565 has been used in this table.

8 On top of the k€ 2.400 mentioned in this table, the cultural heritage institutions will contribute k€ 400 through the participation of their representatives in the Programme Committee, Steering Committee and International Scientific Advisory Board.

5. NATIONAL AND INTERNATIONAL CONTEXT

Digital access to cultural heritage for the general public as well as for education and humanities research has become an important policy area since the second half of the 1990s. At the G7 Conference on the Information Society in 1995, the potential offered by Information Technologies for "Multimedia Access to World Cultural Heritage" was officially recognized. Since then, "digital heritage" and "e-culture" have taken important positions on the political agenda of the information society in many countries and international organizations. It is hardly possible to list all the programmes and projects that were set up in the past decade in the field of digital culture. Nevertheless, this section aims to give a broad overview of the context in which the CATCH programme can be placed, both nationally (in 5.1) and internationally (in 5.2).

5.1 National context

In 1997 the Royal Netherlands Academy of Arts and Sciences (KNAW) published a report calling for enhanced digital access to cultural heritage information and improved ICT for humanities research.⁹ In 1998 the report Alles uit de Kast¹⁰ outlined the contours of a national investment programme for establishing a digital infrastructure for cultural heritage. This was followed by a plan by NWO to create a virtual digital research library for the humanities.¹¹ In the beginning of 2002 the eCultuurnota¹² appeared. The report sketched the outline of a digital infrastructure for the cultural domain. In particular, the report identified the need for enhanced accessibility of cultural sources and the possibility of reusing cultural material. In May 2002 the governmental letter Digitalisering van het Cultureel erfgoed¹³ appeared. The letter described in more detail how the digitisation of the cultural heritage should come about.

Meanwhile, in 2000, the Ministry of Economic Affairs had published a report called Concurreren met ICT-Competenties, Kennis en Innovatie voor De Digitale Delta,¹⁴ emphasizing the importance of enlarging ICT competence in the Netherlands. In 2001 the taskforce "ICT-en-kennis" (the Le Pair Committee) issued the report titled Samen, Strategischer en Sterker,¹⁵ recommending the exploitation of scientific expertise in the multimedia sector to develop new application areas.

9 De computer en het alfaonderzoek. Advies van de Commissie Geesteswetenschappen over de toepassing van de informatietechnologie bij het onderzoek op het gebied van de geesteswetenschappen, voorbereid door de Subcommissie Informatietechnologie Alfaonderzoek (1997) KNAW.

10 Alles uit de Kast - Op weg naar een nationaal investeringsprogramma digitale infrastructuur cultureel erfgoed (1998). Wetenschappelijk Technische Raad SURF.

11 Een Digitale Bibliotheek voor de Geesteswetenschappen. Aanzet voor een programma voor investering in een landelijke kennisinfrastructuur voor geesteswetenschappen en cultuur (december 1999). NWO-Gebiedsbestuur Geesteswetenschappen.

12 eCultuur in Beeld, letter of the Dutch Parliamentary Undersecretary van der Ploeg to the Tweede Kamer der Staten Generaal on April 22 2002 (Kenmerk MLB/M/2002.14.192).

13 Digitalisering van het cultureel erfgoed, letter of the Dutch Parliamentary Undersecretary van der Ploeg to the Tweede Kamer der Staten Generaal on May 27 2002 (Kenmerk DCE/02/18765).

14 Concurreren met ICT-Competenties, Kennis en Innovatie voor De Digitale Delta, report of the Dutch Minister of Economic Affairs A. Jorritsma-Lebbink and the Minister of Education, Culture and Science Drs. L.M.L.H.A. Hermans, April 2000.

15 Samen, strategischer en sterker, final report of the Taskforce ICT-en-kennis (Committee Le Pair), April 2001.

The growing policy relevance of innovative digital techniques for the domain of cultural heritage and the humanities is an international phenomenon. Research into virtual libraries and museums, digital longevity of archival sources, techniques of digitization and access to cultural content is taking place in many countries by researchers from computer and information science, humanities computing and the heritage sector itself.

The umbrella organisations for the sciences and humanities in the Netherlands (KNAW, NWO and SURF; a brief overview of their activities is given below in 5.1.1, 5.1.2, and 5.1.3, respectively) have started to develop new plans to give a strong impetus to the intersection of computing, heritage and humanities.¹⁶ Meanwhile, computer and information science is increasingly aware of the research challenges posed by the cultural domain. In the national research agenda for computer science 2001-2005 (NOAG-i) this domain is present in several themes and programmes (e.g., ToKeN 2000, Cognition, Language and Speech Technology). In section 5.1.4 we provide some information on MultimediaN.

5.1.1 The Royal Netherlands Academy of Arts and Sciences

On the basis of several commission reports regarding the future of the Netherlands Institute for Scientific Information Services (NIWI), the KNAW has decided to start an e-Science programme for the humanities and social sciences.¹⁷ The new program is part of a broader KNAW policy aiming at significant advances in the effective use of ICT in the humanities and social sciences. This new policy includes actions on different levels: principles of open access to research output and data, investments in ICT infrastructure, and the establishment of data archiving networked services (jointly with the Netherlands Research Council NWO). With this new e-science research program, the KNAW seeks to fuel the development of this emerging field in the Netherlands and achieve a leading position internationally.

The KNAW e-science program needs to address a dual mission: (i) to stimulate the development of e-science in the humanities and social sciences, and (ii) to study the effects of e-science on the practice, activity and quality of research in those fields. This mission is to be pursued by an integrated program of cooperative research between the humanities, social sciences and information sciences.

The development of ICT, and in particular the Internet, has brought significant changes in three areas: (i) the ever-growing availability of computing power, both in the personal computer and through the emerging GRID technologies linking many computers together; (ii) facilities for communication and collaboration through the Internet and applications such as e-mail and the World Wide Web; (iii) access to digital collections of data, including text, sound and images.

16 NWO with the present CATCH plan; the KNAW with a programme on e-Science in the humanities and social sciences, cf.: Building the KNAW International Research Institute on e-Science Studies in the Humanities and Social Sciences (IRISS), Committee on a KNAW Research Institute for e-Science (Chair: Prof. dr. ir. Wiebe E. Bijker) (2003) KNAW; SURF has published the report E-based Humanities and E-humanities on a SURF platform, by Joost Kircz (2004) SURF.

17 KNAW (Commissie van Bemmel), E-wetenschapsonderzoek in het alfa- en gamma-domein, Advies van de tijdelijke commissie Strategie NIWI-KNAW. Koninklijke Nederlandse Akademie van Wetenschappen (Amsterdam, 2002). Commissie Informatiediensten NIWI (voorzitter: dr. N.M.H. van Dijk), Behouden Toekomst: Een advies met betrekking tot de toekomst van de diensten van het Nederlands Instituut voor Wetenschappelijke Informatiediensten (Amsterdam, 2003). Committee on a KNAW Research Institute for e-Science (Chair: Prof. dr. ir. Wiebe E. Bijker), Building the KNAW International Research Institute on e-Science Studies in the Humanities and Social Sciences (IRISS) (Amsterdam, 2003).

E-science is regarded as the combined use of these advances. Potentially e-science can have a profound influence on research, the questions researchers ask and the way research is carried out. E-science first took off in the natural and life sciences, but interest from the social sciences and humanities is growing rapidly; each of the three areas mentioned above has seen increasing activity. Computers are being widely used, and the growing power has led to new research tools.

On the whole, the development of e-science research practices in the humanities and social sciences appears to be in its early stages. This raises two sorts of questions: (1) To what extent are researchers posing new questions or approaching existing questions in a new way, are new methods desired and developed, and are new patterns of interaction and cooperation emerging among researchers internationally? (2) How do researchers organize their electronic environment, what problems do they encounter, and how can these be overcome?

The combination of these two sorts of questions, the one more reflective, the other more practice-oriented, is necessary to gain new insights into the possibilities and pitfalls of e-science, and is the essential characteristic of an e-science research programme as envisaged by the Academy.

5.1.2 The Netherlands Organisation for Scientific Research

In 1999, the NWO Research Council for Humanities established a platform to prepare the development of a production line for the Digital Library for the Humanities.¹⁸ It recognized the importance of ICT techniques for providing adequate and broad accessibility to cultural heritage and the possibilities this would create for future research in the humanities. Meanwhile, the Research Council for Physical Sciences launched a cooperation with researchers in the cognition domain. Their project was called ToKeN2000, and one of the major application areas was the cultural heritage sector. As a natural consequence of these two developments, in 2002 both councils joined forces, which has led to the present CATCH proposal. In summary, the motivation of NWO reads:

· to stimulate innovative research;

· to encourage cooperation between front-ranked researchers of different disciplines;

· to strengthen ties between researchers, research applications, and society.

5.1.3 SURF

SURF, the higher education and research partnership organisation for network services and information and communications technology in the Netherlands, is active in the field of digital heritage, humanities and computer science in several ways. The mission of SURF is to exploit and improve a common advanced ICT infrastructure that will enable higher education institutes to better realise their own ambitions and improve the quality of learning, teaching and research. In the SURF Strategic Plan 2003-2006 'The heart of the matter', SURF has changed its perspective radically: the user is now central. With this change, SURF tries to optimise the quality of education and research by applying advanced ICT support where possible. The SURF programme Digital Academic Repositories (DARE) is a joint initiative of the Dutch universities to make all their research results digitally accessible. The KB, the KNAW and NWO are also cooperating in this unique project.

18 Een Digitale Bibliotheek voor de Geesteswetenschappen. Aanzet voor een programma voor investering in een landelijke kennisinfrastructuur voor geesteswetenschappen en cultuur (december 1999). NWO-Gebiedsbestuur Geesteswetenschappen.

SURF is developing new plans for e-science in the humanities. In a recent report, an attempt has been made to develop a better understanding of those activities and processes in the humanities that are fit for dedicated ICT stimulation and support.¹⁹

5.1.4 MultimediaN

MultimediaN is an initiative of leading researchers in the area of multimedia analysis, database technology, and human computer interaction to improve the scientific base in the Netherlands for applications and services relying on analysis and enrichment of multimedia data. MultimediaN commits itself to a co-ordinated research program based on its current position in the leading edge in multimedia content extraction, efficient multimedia content management, personalised multimedia, and man-machine interaction. The consortium aims to expand and exploit the knowledge in multimedia information systems, standards, interaction, information extraction and condensation, and also in video compression, cognitive assessment of information content, and intelligent interfacing. Results are suited for implementation in the multimedia value chain in its full breadth from content enabling to service delivery.

MultimediaN is conceived as a joint venture with a co-ordinated research program. The form is a virtual centre for knowledge transfer based on multimedia science, where techniques will be demonstrated in prototypes, half-products and first-time applications. MultimediaN derives its scientific goals from close interaction with both large national digital archives and emerging high-end multimedia services over (mobile) internet. Every year the CATCH and MultimediaN Programme Committees will jointly organize a seminar, the Dutch Multimedia Event.

5.2 International Context

The CATCH consortium is well aware of the international context. For example, Het Geheugen is related to the American Memory project of the Library of Congress, but is more complex, since it deals not only with the collection of the National Library but with the collections of over 40 museums, archives and libraries. The CATCH project will of course build on the knowledge from existing international projects. CATCH differs from the DSpace project in that it deals with the massive digital-legacy collections in a wide range of Dutch cultural heritage institutions, while DSpace deals with newly generated digital material only.

19 E-based Humanities and E-humanities on a SURF Platform, Joost Kircz, Kircz Research Amsterdam (2004).

The MIT Media Lab has been very influential in the past in demonstrating on a small scale what is intended to be implemented in a more modern and advanced way, on a very large scale within the CATCH project. Many of our consortium members have close ties with or participate in international projects. Below we deal with several of the projects. We have subdivided the overview as follows: European Union (in 5.2.1), International Networks (in 5.2.2), Related Programmes in the European Union (in 5.2.3), Related Programmes in the World (in 5.2.4).

5.2.1 European Union

'Digital Heritage and Cultural Content' (DigiCULT) is a domain of research activity in the Information Society Technologies (IST) Programme, a European Commission programme addressing the pervasion of Information and Communication Technologies (ICT) into all aspects of the European citizen's life. This programme was already part of the Fifth Framework Programme for Research and Technological Development (RTD), which ran from 1998 to 2002, and continues to exist as a key thematic priority area within the 6th Framework Programme (2002-2006).

The Work Programme 2003-2004 "Integrating and strengthening the European Research Area in the Community sixth Framework Programme" specifies the content of the activities. "The focus is on improving accessibility, visibility and recognition of the commercial value of Europe's cultural and scientific resources, by developing: advanced digital libraries services, providing high-bandwidth access to distributed and highly interactive repositories of European culture, history and science; environments for intelligent heritage and tourism, recreating and visualising cultural and scientific objects and sites for enhancing user experience in cultural tourism; advanced tools, platforms and services in support of highly automated digitisation processes and workflows, digital restoration and preservation of film and video material, and digital memory management and exploitation".

With a research focus on eCulture and eScience (i.e., culture and science in a networked environment), DigiCULT aims at establishing a lasting infrastructure of technologies, guidelines, standards, human and institutional networks that will support and extend the role of Europe's libraries, museums and archives in the digital age.

Objectives of the research activities are:

· Enhancing access to and preservation of cultural and scientific heritage resources, particularly those in digital form, thus supporting Europe's heritage institutions and organisations in their core functions,

· Accelerating the appropriation of advanced technologies by Europe's libraries, museums and archives,

· Encouraging convergence in technical approaches and applications for various cultural institutions and networked services by promoting agreement on standards and guidelines critical to managing, preserving and delivering digital cultural and scientific content,

· Fostering increased co-operation between cultural and scientific content holders, i.e. libraries, archives, museums, and the research community or technological application developers, i.e. research centres, academic institutions, ICT companies, etc.

5.2.2 International Networks

In the field of digital cultural heritage, a number of international networks exist, with which the CATCH program will interact and be in contact. Below we mention two of them.

The DELOS Network of Excellence on Digital Libraries²⁰ - Digital Libraries (DL) have been made possible through the integration and use of a number of IC technologies, the availability of digital content on a global scale and a strong demand from users who are now online. They are destined to become an essential part of the information infrastructure in the 21st century.

The DELOS network conducts a joint program of activities aimed at integrating and coordinating the ongoing research activities of the major European teams working in DL-related areas with the goal of developing the next generation DL technologies. The objective is to:

· define unifying and comprehensive theories and frameworks over the life-cycle of DL information,

· build interoperable multimodal/multilingual services and integrated content management ranging from the personal to the global for the specialist and the general population.

The Network aims at developing generic DL technology to be incorporated into industrial-strength DL Management Systems (DLMSs), offering advanced functionality through reliable and extensible services.

The Network will also disseminate knowledge of DL technologies to many diverse application domains. To this end a Virtual DL Competence Centre has been established which provides specific user communities with access to advanced DL technologies, services, testbeds, and the necessary expertise and knowledge to facilitate their take-up.

The Digital Library Federation (DLF) is a consortium of libraries and related agencies that are pioneering in the use of electronic-information technologies to extend their collections and services. Through its members, the DLF provides leadership for libraries broadly by -

· identifying standards and "best practices" for digital collections and network access,

· coordinating leading-edge research-and-development in libraries' use of electronic-information technology,

· helping start projects and services that libraries need but cannot develop individually.

The DLF operates under the administrative umbrella of the Council on Library and Information Resources (CLIR).

5.2.3 Related programmes in the European Union

In the framework of the European Union there are many projects in the cultural-heritage sector. They are certainly interesting but no project coincides with our approach. Below we mention some of the important projects but we refrain from pointing out the differences with the CATCH programme.

20 http://www.delos.info/

Interoperability

In the 5th Framework, relevant activities were coordinated by the European Commission's Cultural Heritage Applications unit, DG XIII-E2 in Luxembourg. Some activities are HyperMuseum (http://www.hypermuseum.com/), CHIOS (http://www.dl-forum.de/Foerderung/Projekte/CHIOS/), CIDOC (http://www.cidoc.icom.org), META-e (Metadata Engine), and SCHEMAS (Forum for Metadata Schema Implementers).

Also in the 6th Framework (2002-2006), the European Commission is committed to supporting this area. The research domain "Digital Heritage and Cultural Content" (a research activity in the Information Society Technologies (IST) Programme) will continue to exist as a key thematic priority area within the 6th Framework Programme.

In the domain of semantic interoperability the four most recent programmes in the 5th Framework are CHIMER, COINE, ECHO, and INTERA. Below we provide a brief description.

CHIMER (Children's Heritage Interactive Models for Evolving Repositories; http://dbs.cordis.lu/fep-cgi/srchidadb). CHIMER aims to establish an open international network of children, teachers and museologists for developing an Open Evolving Multimedia Multilingual Digital Heritage Archive as a long-term storage medium for European cultural repositories.

COINE (Cultural Objects in Networked Environments; http://dbs.cordis.lu/fep-cgi/srchidadb). Empowering European citizens to tell their own stories lies at the heart of the COINE project. It will provide the tools needed to create structured, World Wide Web-based environments which are hospitable to local cultural activity but which allow content to be shared locally, regionally, nationally and internationally.

ECHO (European Cultural Heritage Online; http://echo.mpiwg-berlin.mpg.de, http://www.mpi.nl/echo) is a new project whose task is to provide rich interdisciplinary access to objects of cultural heritage. Interoperability at the metadata level between the four included disciplines is one of its core aspects.

INTERA (Integrated European language Resource Area) is an attempt to solve interoperability problems on a vertical line, not only by creating a large metadata domain of language resources but also by integrating the domain of resource descriptions with that of tool descriptions. The goal is that, depending on the type of resources selected, appropriate tools will be selected automatically.

Besides these four programmes, it is relevant to mention TEL.

TEL: The European Library. The objective of TEL is to set up a cooperative framework which will lead to a system for access to the major national and deposit collections (mainly digital, but not precluding paper) in European national libraries. TEL will investigate how to make a mixture of traditional and electronic formats available in a coherent manner to both local and remote users. TEL will contribute to the cultural and scientific knowledge infrastructure within Europe by developing co-operative and concerted approaches to technical and business issues associated with distributed access to large-scale content. It will lay down the policy and develop the technical groundwork for a sustainable pan-European digital library based on distributed digital collections and on the operational digital library developments in the participating libraries and agencies. Project websites: http://www.europeanlibrary.org, http://www.kb.nl/kb/sbo/netwerk/tel-en.html

For an overview of the many activities in Europe we provide the following list.

CHIOS (Cultural Heritage Interchange Ontology Standardization),
CHLT (Cultural Heritage Language Technologies),
CHOSA (Application of new technology to increase access to the cultural heritage of St. Albans),
CLEF (Cross-Language Evaluation Forum),
COVAX (Contemporary Culture Virtual Archive in XML),
CULTIVATE EU (Cultural Heritage Applications network),
CYCLADES (An open Collaborative Virtual Archive Environment),
DELOS (A Network of Excellence on Digital Libraries),
DOMINICO (On the trace of DOMINICO dell'Allio),
LEAF (Linking and Exploring Authority Files),
MATAHARI (Mobile Access To Artefacts and Heritage At Remote Installations),
MIND (Multimedia International Digital Libraries),
PAST (exPeriencing Archaeology across Space and Time),
POUCE (Portails Culturels Collectifs),
PULMAN (Public Libraries Mobilising Advanced Networks),
PULMAN XT (Extending the European Research Network for Public Libraries, Museums, Archives),
RENARDUS (Academic Subject Gateway Service Europe), and
SANDALYA (An open platform for accessing, co-operatively authoring and publishing the digital heritage of manuscripts and rare books).

Knowledge Enrichment

At the level of manuscripts, an internationally well-known example of cultural-heritage knowledge disclosure is the Electronic Beowulf project. Handwritten manuscripts are presented on-line and are annotated in great detail, disclosing the temporal evolution of the famous Beowulf texts (see further in 5.2.4). This example, however, is one of the few that we consider exemplary. Many other approaches simply do not exploit the power of information technology. An example of the latter kind is the Historical Archives of the European Communities (http://wwwarc.iue.it/), basically a directory service to physical documents which are only accessible by visiting the archive in person. A considerably better example is the "Digitale Bibliothek" by the Bayerische Staatsbibliothek, showing transcriptions as well as facsimile images of important printed works (http://mdz.bib-bvb.de). However, navigation is difficult, and no use of hyperlinks from within the images is possible. No panning and zooming facilities are available and the facsimiles are in monochrome. Many projects actually do much worse, merely presenting the facsimiles in a coarse resolution, giving superficial impressions only. A number of 'modern' European projects do exist, such as MUMIS²¹ (Multimedia Indexing and Searching Environment) with an emphasis on streaming media (video).

The COLLATE Collaboratory project²² comes close to what is ultimately needed in cultural-heritage knowledge disclosure: it "aims at the development and practical usage of a content-centric, user-driven information system for the management of surrogates of fragile historic multimedia objects. As a distributed Web-based multimedia repository, it will function as a 'collaboratory' supporting distributed user groups by dedicated knowledge management facilities such as content-based access, comparison and in-depth indexing/annotation of digitised sources." However, the application examples concern the domain of the cultural heritage of European movies in the 1920s and 1930s. In the audio domain, current technology for content-based retrieval and indexing is quickly developing to a usable level (Zhang & Kuo, 2001).²³ The European CIMWOS project²⁴ "aims to facilitate common procedures of archiving and retrieval of audio-visual material. The objective of the project is to develop and integrate a robust unrestricted keyword spotting algorithm and an efficient image spotting algorithm specially designed for digital audio-visual content, leading to the implementation and demonstration of a practical system for efficient retrieval in multimedia databases". This project thus aims at the development of retrieval engines only, without solving the problems of knowledge disclosure around specific high-value objects of the cultural-heritage domain.

In conclusion: although a number of efforts do exist at the European level, the potential for a successful European successor to the Electronic Beowulf approach is much greater if a focused collection from within the Netherlands is used, by researchers from the humanities and from computer science who share a common culture and enthusiasm to preserve it digitally.

Personalisation

There are initiatives on personalisation in the European Union. We provide a few references below. For an example project we refer to the Hermitage Museum's New Web Site.

HyperMuseum (http://www.hypermuseum.com/)

CHIOS (http://www.dl-forum.de/Foerderung/Projekte/CHIOS/)

CIDOC (http://www.cidoc.icom.org)

The Open Heritage initiative (http://www.openheritage.com/intro.html)

5.2.4 Related programmes in the World

There are many international initiatives, most of them of recent date. None of the programmes encountered so far covers all three themes of the CATCH Programme.

A project to mention is the Hermitage Museum's New Web Site, a cooperation between IBM (Yorktown Heights, NY) and the Hermitage Museum. The project followed the then (1997) visionary ideas of Mikhael Piotrovski, director of the Hermitage. Three end-user applications were identified: (1) multimedia-based art education housed in an education and technology centre, (2) visitor information links, and (3) a new Web site ("that would permit the Hermitage's collections to be searched and better experienced from afar")²⁵. For more relevant information worldwide we refer to Kumar et al.²⁶ In the USA attention is given to the adequate accessibility of The Library of Congress (www.loc.gov).

21 http://parlevink.cs.utwente.nl/projects/mumis/

22 http://www.collate.de/

23 Zhang, T. & Kuo, C.-C. J. (2001). Audio content analysis for on-line audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4), pp. 441-457.

24 http://www.xanthi.ilsp.gr/cimwos/

Another successful pioneering project is "Electronic Beowulf" (Kiernan, 1995) on the famous Beowulf manuscript. In this project, the original handwriting has been scanned in high resolution and has been augmented with a very detailed annotation at both the level of script (the written shapes) and the level of the textual content. Due to the high quality of this work, the on-line results on the Internet and on CD-ROM represent a true form of knowledge disclosure towards experts and regular interested users. A project with a wider scope is represented by "Digital Scriptorium" (Faulhaber, 1999). In this latter project, a wide range of mediaeval text is disclosed in digital form, to experts and the general public. The goal of Digital Scriptorium is knowledge transfer in the area of palaeography (http://sunsite.berkeley.edu/scriptorium/). Fortunately, for the multi-level coding of (a) semantic content, (b) geometric layout structure and (c) typography, new standards are emerging, such as TEI (Text Encoding Initiative, http://www.tei-c.org/). These successful international projects may serve as an example for initiatives which are aimed at the preservation of the Dutch cultural heritage.

Finally, we mention the Open Archive Initiative (www.openarchives.org).

25 F. Mintzer, G.W. Braudaway, F.P. Giordano, J.C. Lee, K.A. Magerlein, S. D'Auria, A. Ribah, G. Shapir, F. Schiattarella, J. Tolva, and A. Zelenkov (2001). Populating the Hermitage Museum's New Web Site. Communications of the ACM, Vol. 44, No. 8, pp. 52-60.

26 Kumar, K.G., et al. The Hot Media architecture: Progressive & Interactive rich media for the Internet. See www.developer.ibm.com/library/articles/hotmedia.html

APPENDIX I: SIX CORE PROJECTS

Core Project 640.001.401

1a) Project title:

SemanTic Interoperability To access Cultural Heritage

1b) Project acronym STITCH

1c) Principal investigators

Prof. dr. F. van Harmelen (Vrije Universiteit)

Drs. H. Matthezing (Koninklijke Bibliotheek)

Dr. P. Wittenburg (Max Planck Institute for Psycholinguistics)

1d) Main project location Koninklijke Bibliotheek

2) Composition of research team

· 1 Ph.D Student

· 1 Postdoc

· 1 Scientific programmer

· Prof. dr. F. van Harmelen (Vrije Universiteit)

· Drs. H. Matthezing (Koninklijke Bibliotheek)

· Drs. M.C. de Niet (Koninklijke Bibliotheek)

· Prof. dr. G. Schreiber (Vrije Universiteit)

· Dr. P. Wittenburg (Max Planck Institute for Psycholinguistics)

3) Description of the proposed research

3a) Problem statement and research objectives

Cultural-heritage collections are typically indexed with metadata derived from a range of different vocabularies, such as AAT, Iconclass and in-house standards. This presents a problem when one wants to use multiple collections in an interoperable way. In general, it is unrealistic to assume unification of vocabularies. Vocabularies have been developed in many sub-domains, each with their own emphasis and scope. Still, there is significant overlap between the vocabularies used for indexing.

The prime research objective of this subproject is to develop theory, methods and tools for allowing metadata interoperability through semantic links between the vocabularies. This research challenge is similar to what is called the "ontology mapping" problem in ontology research.

The overall objective can be divided into three research questions:

1. What kind of semantic links can be identified?

2. Which methods and tools can support manual and semi-automatic identification of semantic links between vocabularies?

3. How can such semantic links be employed to enable interoperable access to multiple collections indexed with heterogeneous vocabularies?

3b) Scientific approach and methodology

The project will be application-oriented. The goal will be to develop methods and tools that can be shown to work for relevant use cases. The project will focus on 19th-century cultural-heritage objects in different Dutch collections. For this project we assume that syntactic interoperability has been achieved through the representation of metadata and the vocabularies in RDF/OWL format [Brickley and Guha, 2004; McGuinness and van Harmelen, 2004]. This allows the project to zoom in on the semantic interoperability problems.

The project will build on research in ontology mapping. Several authors have proposed mapping relations for use in semantic linking [e.g. Niles and Pease, 2003]. These include equality, equivalence, subclass, instance and domain-specific relations. The project will use these as a starting point and evaluate and extend/revise this set of mapping relations. Research on the identification of links will first focus on baseline methods for manual specification of links, such as those developed within the ICES-KIS 2 project "Multimedia Information Analysis" [Hollink, 2003]. This will be supplemented with techniques from ontology learning targeted at finding such links automatically. The state-of-the-art techniques are not foolproof [Handschuh and Staab, 2003], so some form of human validation of the links will need to take place. This is not a big hurdle, as semantic links between vocabularies need to be established only once. Another technique to consider is the generalization of existing annotations to semantic vocabulary links. For example, if according to a particular annotation the artist of a particular painting belongs to a certain art school, we may hypothesize that this link also exists for other works of the same artist.
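The generalization of existing annotations to candidate vocabulary links can be illustrated with a short Python sketch. It is not the project's method, only a minimal illustration: the record layout, the field names `artist` and `school`, and the `min_support` threshold are assumptions made for this example. Every proposed link is marked as not yet validated, reflecting the need for human validation.

```python
from collections import Counter, defaultdict


def propose_artist_school_links(annotations, min_support=1):
    """Propose candidate (artist, school) links from object-level annotations.

    `annotations` is an iterable of dicts. The keys "artist" and "school"
    are hypothetical field names chosen for this sketch.
    """
    counts = defaultdict(Counter)
    for record in annotations:
        artist = record.get("artist")
        school = record.get("school")
        if artist and school:
            counts[artist][school] += 1

    candidates = []
    for artist, schools in sorted(counts.items()):
        for school, support in schools.most_common():
            if support >= min_support:
                # A candidate is only a hypothesis until a human confirms it.
                candidates.append({
                    "artist": artist,
                    "school": school,
                    "support": support,
                    "validated": False,
                })
    return candidates


# Illustrative records only; not taken from any real collection database.
example_annotations = [
    {"object": "painting-1", "artist": "Artist A", "school": "School X"},
    {"object": "painting-2", "artist": "Artist A", "school": "School X"},
    {"object": "painting-3", "artist": "Artist A"},
    {"object": "painting-4", "artist": "Artist B", "school": "School Y"},
]

candidate_links = propose_artist_school_links(example_annotations, min_support=2)
```

With a support threshold of 2, only the link between "Artist A" and "School X" is proposed; it could then be applied, after validation, to the unannotated "painting-3".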

With respect to the use of semantic links we will identify a number of typical use cases that should be handled by the tools being developed. Some prototypical use cases are:

· User sees painting of a historic event, such as the events in Brussels in 1830. She wants information about this event and about other art works concerning this event as well as written witness reports.

· User wants to find monuments that constitute particular types of defence works, such as those part of the "Hollandse Waterlinie". She also wants information about the architects involved and pointers to writings containing background information.

· User wants to find for a particular artist the places where the person lived and worked.

· User wants additional information about certain historic figures (e.g. King William I of The Netherlands or Thorbecke) depicted in a painting.

These use cases typically require the combination of information from different collection databases.²⁷ The target user audience for these use cases is the interested lay person.
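How semantic links could support such cross-collection queries can be sketched as follows. This is a hedged illustration only: the vocabulary prefixes `vocA` and `vocB`, the concept names, the object identifiers and the two relation names are hypothetical, and a real implementation would operate on RDF/OWL data rather than on Python tuples.

```python
# Validated semantic links between two hypothetical vocabularies.
# Each link is (source concept, relation, target concept).
LINKS = [
    ("vocA:fortification", "equivalent", "vocB:defence-work"),
    ("vocB:fortress", "subclass", "vocB:defence-work"),
]

# Two hypothetical collections, each indexed with its own vocabulary.
COLLECTIONS = {
    "monuments": [("monument-17", "vocB:fortress"), ("monument-4", "vocB:windmill")],
    "library": [("book-301", "vocA:fortification")],
}


def expand_concept(concept, links):
    """Return the concept plus everything reachable through equivalence
    links (in both directions) and through subclass links (narrower concepts)."""
    seen = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for source, relation, target in links:
            neighbours = []
            if relation == "equivalent":
                if source == current:
                    neighbours.append(target)
                if target == current:
                    neighbours.append(source)
            elif relation == "subclass" and target == current:
                neighbours.append(source)
            for neighbour in neighbours:
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append(neighbour)
    return seen


def search_collections(concept, collections, links):
    """Find objects in all collections indexed with the concept or a linked one."""
    concepts = expand_concept(concept, links)
    return sorted(
        (name, obj)
        for name, items in collections.items()
        for obj, indexed_with in items
        if indexed_with in concepts
    )


results = search_collections("vocA:fortification", COLLECTIONS, LINKS)
```

A query phrased in one vocabulary thus retrieves the book indexed with that vocabulary as well as the monument indexed with a linked, narrower concept from the other vocabulary.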

The following collection databases will be considered for application within the project:

· Catalogue of the Koninklijke Bibliotheek

· Monument preservation

· Army museum

· RKD collection

· Bibliopolis

· Rijksmuseum

· "Geheugen van Nederland" (Memory of The Netherlands)

Vocabularies and thesauri that are of potential interest here include:

· RKD Artist (i.e. Dutch version of ULAN)

· Dutch AAT

· Historic thesauri, such as under development at the Koninklijke Bibliotheek

· Iconclass

· GOO ("Gemeenschappelijke Onderwerpen Ontsluiting"), Koninklijke Bibliotheek

· GTAA (Sound and Vision, see CHOICE subproject)

3c) Scientific relevance

Ontology mapping is becoming an increasingly important research topic. It may provide the background knowledge required for accessing distributed information repositories, both within (large) companies and on the World Wide Web. Until now, much of the research effort has been spent on making syntactic interoperability feasible, i.e. to represent data models and data in a common (exchange) format. With the advent of XML and RDF/OWL, these syntactic problems are now (at least in theory) solvable, but this potential is still largely unexplored. Given the fact that semantic interoperability has not been studied very much yet, this project has taken a use-case driven approach. We expect to show that this technology can be employed to answer a new class of queries over different collections.

27 This is an indicative list with the aim of making clear the kind of questions this project tries to answer. The project may choose to work on other examples for pragmatic reasons.

3d) Related work

Finnish Museums Online [Hyvonen et al., 2003]:

The joint national museum network developed by the University of Helsinki and The Helsinki Institute for Information Technology HIIT has recently been taken into trial use. The system is based on semantic-web technology and is seemingly the first of its kind in the world. This project is unique in that it includes a semantic data search system connecting the various collections with each other.

3e) Work programme

The research proceeds in four stages of one year each. Below, the annual planned activities are outlined.

Year 1

· Selection of initial set of collections and vocabularies

· Syntactic transformations to XML/RDF/OWL, where required

· Refinement of initial target use cases into full-blown scenarios

· Construction of baseline manual semantic-linking tool

· First semantic-search prototype

Year 2

· Small-scale user experiments with initial prototype

· Revision of the set of semantic-link primitives

· Facilities for semi-automatic elicitation of semantic links, including generalization from existing annotations

· Second semantic-search prototype

Years 3 & 4

Additional development cycles involving a wider scope of collections, vocabularies and/or use-case functionalities.

3f) Deliverables

D1: Theory of mapping relations required for semantic links between heterogeneous vocabularies

D2: Method and tool for manual identification of semantic links

D3: Algorithms for semi-automatic elicitation of semantic links

D4: Semantic-search tool

4) Expected use of instrumentation

No special equipment is expected to be required.

5) Literature

5a) References to cited work

D. Brickley and R. V. Guha. RDF vocabulary description. Recommendation, W3C Consortium, 10 February 2004. See: http://www.w3.org.

S. Handschuh and S. Staab. Annotation of the shallow and the deep web. In S. Handschuh and S. Staab, editors, Annotation for the Semantic Web, volume 96 of Frontiers in Artificial Intelligence and applications, pages 25-45. IOS Press, Amsterdam, 2003.

E. Hyvonen, S. Kettula, V. Raatikka, S. Saarela, and K. Viljanen. Finnish museums on the semantic web. In Proceedings of WWW2003, Budapest, poster papers, 2003.

D. McGuinness and F. van Harmelen (eds.). OWL Web Ontology Language Overview. W3C Recommendation, World Wide Web Consortium, 10 February 2004. Latest version: http://www.w3.org/TR/owl-features/.

Alistair Miles and Brian Matthews. Review of RDF thesaurus work. Deliverable 8.2, version 0.1, SWAD-Europe, 2004. URL: http://www.w3c.rl.ac.uk/SWAD/deliverables/8.2.html.

I. Niles and A. Pease. Linking lexicons and ontologies: Mapping Wordnet to the suggested upper merged ontology. In Proceedings of the 2003 International Conference on Information and Knowledge Engineering (IKE '03), Las Vegas, Nevada, June 23-26 2003.

T. Peterson. Introduction to the Art and Architecture Thesaurus. Oxford University Press, 1994. See also: http://www.getty.edu/research/tools/vocabulary/aat/.

ULAN: Union List of Artist Names. The Getty Foundation. http://www.getty.edu/research/tools/vocabulary/ulan/, 2000.

H. van der Waal. ICONCLASS: An iconographic classification system. Technical report, Royal Dutch Academy of Sciences (KNAW), 1985.

5b) Most important publications of the research team

I. Horrocks, P. F. Patel-Schneider and F. van Harmelen, From SHIQ and RDF to OWL: The Making of a Web Ontology Language, Journal of Web Semantics, 1(1), 2003.

L. Hollink, A. Th. Schreiber, J. Wielemaker, and B. J. Wielinga. Semantic annotation of image collections. In S. Handschuh, M. Koivunen, R. Dieng, and S. Staab, editors, Knowledge Capture 2003 - Proceedings Knowledge Markup and Semantic Annotation Workshop, pages 41-48, 2003.

A. Th. Schreiber, I. I. Blok, D. Carlier, W. P. C. van Gent, J. Hokstam, and U. Roos. A mini-experiment in semantic annotation. In I. Horrocks and J. Hendler, editors, The Semantic Web - ISWC 2002, number 2342 in Lecture Notes in Computer Science, pages 404-408, Berlin, 2002. Springer-Verlag. ISSN 0302-9743.

A. Th. Schreiber, B. Dubbeldam, J. Wielemaker, and B. J. Wielinga. Ontology-based photo annotation. IEEE Intelligent Systems, 16(3):66-74, May/June 2001.

J. Wielemaker, A. Th. Schreiber, and B. J. Wielinga. Prolog-based infrastructure for RDF: performance and scalability. In D. Fensel, K. Sycara, and J. Mylopoulos, editors, The Semantic Web - Proceedings ISWC'03, Sanibel Island, Florida, volume 2870 of Lecture Notes in Computer Science, pages 644-658, Berlin/Heidelberg, October 2003. Springer-Verlag. ISSN 0302-9743.

B. J. Wielinga, A. Th. Schreiber, J. Wielemaker, and J. A. C. Sandberg. From thesaurus to ontology. In Y. Gil, M. Musen, and J. Shavlik, editors, Proceedings 1st International Conference on Knowledge Capture, Victoria, Canada, pages 194-201, New York, 21-23 October 2001. ACM Press.

Core Project 640.001.402

1a) Project title:

CHarting the informatiOn landscape employIng ContExt information

1b) Project acronym CHOICE

1c) Principal investigators

Dr. M.J.A. Veenstra (Telematica Instituut)

Prof. Dr. G. Schreiber (Vrije Universiteit)

Drs. J.F. Oomen (Nederlands Instituut voor Beeld en Geluid)

1d) Main project location

Nederlands Instituut voor Beeld en Geluid

2) Composition of research team

· 1 Ph.D Student

· 1 Postdoc

· 1 Scientific programmer

· Drs. J.F. Oomen (Nederlands Instituut voor Beeld en Geluid)

· Dr. M.J.A. Veenstra (Telematica Instituut)

· Prof. Dr. G. Schreiber (Vrije Universiteit)

· Dr. P. Wittenburg (Max Planck Institute for Psycholinguistics)

· Drs. A. Kok (Instituut Collectie Nederland)

· Drs. A. van Loo (Nederlands Instituut voor Beeld en Geluid)

3) Description of the proposed research

3a) Problem statement and research objectives

The CATCH research programme will develop key technology to ensure continuous access to the cultural riches of the world. The CHOICE project seeks to chart the uncharted information landscape, focusing on semi-automatic semantic annotation and employing context information.

Semantic annotation involves the annotation of archived objects, such as video, images and books, with semantic categories from some standardized metadata repository, such as domain thesauri and ontologies. The use of semantic annotation allows one to widen the search facilities in a collection. For example, annotating a photograph with the semantic category "bed" (in the sense of: to sleep in) from the WordNet thesaurus makes it possible to search for "sleeping beds" while not retrieving other "beds" such as "river beds". As most thesauri have a hierarchical broader/narrower structure, it also makes it possible to generalize or specialize a query in semantic terms: e.g. retrieving photographs of "cribs" (a narrower semantic category) when searching for beds in the "sleeping" sense. Hyvonen (2003) describes an example of a working system in the cultural heritage domain that allows semantic search.
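The "bed" example can be made concrete with a small Python sketch. The sense identifiers (`bed#furniture`, `bed#river`, `crib#furniture`), the photograph identifiers and the narrower-term table are invented for illustration and are not actual WordNet identifiers; the sketch only shows how sense-level annotation separates homonyms and how a narrower-term hierarchy specializes a query.

```python
# Hypothetical narrower-term table: category -> directly narrower categories.
NARROWER = {
    "bed#furniture": ["crib#furniture", "bunk-bed#furniture"],
    "bed#river": [],
}

# Hypothetical annotations: object -> set of semantic categories.
ANNOTATIONS = {
    "photo-1": {"bed#furniture"},
    "photo-2": {"bed#river"},
    "photo-3": {"crib#furniture"},
}


def with_narrower(category, narrower):
    """Return the category together with all transitively narrower categories."""
    result = {category}
    stack = [category]
    while stack:
        for child in narrower.get(stack.pop(), []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def search_by_category(category, annotations, narrower, specialise=True):
    """Retrieve objects annotated with the category, optionally including
    objects annotated with narrower categories."""
    wanted = with_narrower(category, narrower) if specialise else {category}
    return sorted(obj for obj, cats in annotations.items() if cats & wanted)


sleeping_beds = search_by_category("bed#furniture", ANNOTATIONS, NARROWER)
exact_only = search_by_category("bed#furniture", ANNOTATIONS, NARROWER, specialise=False)
```

In this toy data, the specialized query returns the photographs of the bed and the crib but not the river bed, whereas the exact query returns only the photograph annotated with the sleeping-bed sense itself.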

The driving use case of this project is the Sound and Vision video archive. The objective is 1) to show how semantic annotation can be supported in the archiving process by exploiting the available context information and 2) to show how these annotations can subsequently be used to improve search facilities. Hollink et al. (2003) show that linking a number of diverging thesauri to an annotation application for images of paintings can improve both the semantic annotation process for human annotators and the search process. In the CHOICE project, the annotation application developed by Hollink et al. will be adjusted for video annotation. The aim is to construct a video annotation system based on a shared annotation structure (in the Sound and Vision case: iMMix), allowing annotators to mark up video with relevant semantic categories from multiple thesauri relevant for the field.

At the moment automatic techniques for video analysis are still of limited value for the derivation of semantic categories (e.g., Hollink et al., 2004). On the other hand, manual semantic annotation is time-consuming. Therefore, this project will focus on speeding up the manual annotation process by applying natural language processing (NLP) techniques to generate candidate semantic categories that appear in the selected thesauri from (textual) context information. Context information provides peripheral insights into an object: how it was perceived, how it was created, how it relates to other objects made during the same era and so on. Having access to these sources enables users to expand their explorations into greater depth. In the audiovisual realm, examples of sources to be linked to objects include: commentary sheets, external reviews, broadcast schedules, viewer ratings and awards. Within CHOICE, possibly relevant statements and setting descriptions from the textual context information will be offered to the human annotator for approval or rejection. Whether a fragment of the context information is (possibly) relevant for semantic annotation is determined by checking whether concepts from relevant thesauri or from the metadata belonging to the video occur in it. Machine learning and statistical methods for natural language processing and information extraction are applied to determine which terms from fragments or sentences will be used in the statements that are offered to the annotator (Hearst, 1999; Jackson and Moulinier, 2002; Mitchell, 1999).
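The relevance check on context fragments can be sketched as below. Note that this sketch uses plain label matching in place of the machine learning and statistical methods mentioned above; the thesaurus labels, the concept identifiers, the example text and the `status` field are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def propose_candidates(context_text, thesaurus):
    """Offer candidate semantic categories found in textual context information.

    `thesaurus` maps a lower-cased label to a (hypothetical) concept identifier.
    Each candidate keeps the sentence it was found in, so that a human
    annotator can approve or reject it.
    """
    candidates = []
    for sentence in split_sentences(context_text):
        lowered = sentence.lower()
        for label, concept in sorted(thesaurus.items()):
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                candidates.append({
                    "concept": concept,
                    "evidence": sentence,
                    "status": "pending",  # to be set by the annotator
                })
    return candidates


# Illustrative thesaurus labels and context text; not real archive data.
example_thesaurus = {"limburg": "geo:limburg", "murder": "subj:murder"}
example_context = (
    "The documentary was filmed in Limburg. It received two awards. "
    "Reviewers praised the murder investigation scenes."
)

candidates = propose_candidates(example_context, example_thesaurus)
```

Only the first and third sentences are offered to the annotator, because the second one mentions no thesaurus concept.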

For the development of a semantic-annotation system for video annotation the following research issues need to be tackled:

1. How should the annotation interface for images, as developed by Hollink et al., be adapted to video annotation? In this Sound and Vision case this means integrating the iMMix model into the annotation architecture and incorporating facilities for video browsing and searching, and viewing context information.

2. Which thesauri and/or ontologies can be used as repositories of relevant semantic categories for archive search? Typical example corpora could be WordNet, a geographical thesaurus such as TGN, and the "Gemeenschappelijke Thesaurus Audiovisuele Archieven" developed by Sound and Vision and the Filmmuseum.

3. How can these thesauri/ontologies be partially mapped/integrated? This issue will build upon the work in the STITCH project, also carried out within the CATCH framework.

4. How can we use NLP and learning techniques to derive relevant semantic categories from the text? There is a link here to the MITCH project of CATCH.

5. How can these semantic categorization techniques be used to support the search process? For example, when searching for video fragments about Limburg, one could use TGN to find geographical parts of Limburg (towns, rivers, lakes, mountains) to enhance the search. As another example, when searching for videos about "crime" it should be possible to find fragments about "murder".

Scoping remarks:

· Allowing all visitors and experts to add (semantic) annotations may attract avid voluntary cataloguers, who will find surprising ways to mine and exploit the treasure trove offered. However, conducting extensive research on this topic is expected to be out of scope for this particular project.

· Integration into the Sound and Vision business process is strictly speaking not part of the project. However, the project will consider business-integration issues that have a general flavor, such as the storage of the actual context information objects and the storage of resulting annotations.

3b) Scientific approach and methodology

The proposed research is methodological. It is aimed at exploiting the possibilities of combining semantic categorization techniques with techniques for natural language processing to make possible semi-automatic semantic annotation. The NLP techniques are provided with relevant concepts (e.g. from thesauri, term lists and metadata) to focus the processing. Thus, the research is not aimed at developing new techniques for natural language processing but on applying existing techniques in a goal-oriented way.

The project will build on existing open standards for data and metadata representation, such as XML and RDF/OWL.

3c) Scientific relevance

The CHOICE project will explore a novel combination of existing semantic categorization techniques and NLP techniques in the context of semantic video annotation. These techniques will be useful in all situations where there are textual annotations of multimedia material and also a set of relevant (possibly heterogeneous) thesauri and/or ontologies. This is a common theme in the cultural-heritage setting. Almost all collections have been annotated with text. In some collections there is some degree of formality because characteristics have already been described with standardized metadata repositories such as AAT. But even in those collections the textual parts may contain relevant parts suitable for semantic search. For example, in painting collections the subject of the painting is typically only described with an informal piece of text. The techniques developed in this project could thus help make semantic subject search possible. A possible use case could be: searching for paintings about fruit will retrieve paintings about apples, pears, grapes, etc.

3d) Related work

CHOICE is a project on the intersection of semantic annotation and natural language processing with an emphasis on (semi-automatic) semantic annotation. CHOICE builds on several projects and work groups the project members are and were involved in with respect to the Semantic Web (e.g. W3C SWBPD²⁸), semantic annotation (Hollink et al., 2003; Schreiber et al., 2001), video annotation (iMMix²⁹), semantics-based presentation (CHIME³⁰, Topia³¹) and semantic interoperability (Wittenburg et al., 2004a; 2004b).

Semantic annotation is studied in the semantic-web research field. Both manual techniques and automatic techniques are being used. Annotea³² is a W3C project targeted at baseline semantic annotation. The CREAM toolset (Handschuh and Staab, 2002b) provides a mix of manual and semi-automatic annotation techniques. The Armadillo approach (Ciravegna et al., 2004) is mainly aimed at using automatic (natural-language) techniques for constructing semantic annotations. These efforts are mainly aimed at text documents. There is relatively little work on semantic annotation of multimedia documents. One of the few examples is the PhD work of Troncy (2003), who did a case study with the archives of INA, the French equivalent of Sound and Vision.

A good overview of current research on semantic annotation can be found in the proceedings of recent Semantic Annotation and Knowledge Markup Workshops (Handschuh et al., 2002a, 2003).

Hyvonen et al. (2003) describe work related to CHOICE and STITCH in the cultural heritage domain. The joint Finnish national museum network developed by the University of Helsinki and The Helsinki Institute for Information Technology HIIT has recently been taken into trial use. The system is based on semantic-web technology and is seemingly the first of its kind in the world. This project is unique in that it includes a semantic data search system connecting the various collections with each other.

3e) Work programme

The research proceeds in four stages of one year each. Below, the annually planned activities are outlined.

Year 1

Selection of a subset of the Sound and Vision archive well-suited for an early prototype, e.g. because of the availability of relevant thesauri. Selection of thesauri. Mapping of thesauri. First version of semantic annotation interface based on the iMMix model.

28 Semantic Web Best Practices and Deployment Group: http://www.w3.org/2001/sw/BestPractices/

29 iMMix is a new information system by the Netherlands Institute for Sound and Vision, in collaboration with the Ministry of Economic Affairs and the Dutch public broadcasters.

30 http://www.niwi.knaw.nl/en/oi/nod/onderzoek/OND1287669/toon

31 http://topia.telin.nl and Rutledge et al. (2003)

32 http://www.w3.org/2001/Annotea

Year 2

Selection of suitable NLP techniques. Integration of NLP techniques into semantic annotation tool resulting in a second version of the annotation tool. Including semantic search facilities.

Year 3

Exploring the use of the developed techniques outside the Sound and Vision collection, e.g. for the ICN video collection of interviews with painters from the INCCA project³³ and a linguistic corpus containing audio, video as well as text from MPI. Final version of the semi-automatic semantic annotation tool.

Year 4

Writing of documentation and dissertation.

3f) Deliverables

The project aims to deliver the following products of research:

· Three successive versions of a semantic annotation tool

· Conference proceedings papers about the application of NLP techniques in a semantic annotation context etc.

· A Ph.D. thesis

4) Expected use of instrumentation

The team needs sufficient computing power in addition to normal desktop computers. One high-end computer (dual-CPU, with ample memory and permanent storage capacity) will act as computing server.

5) Literature

5a) References to cited work

Fabio Ciravegna, Sam Chapman, Alexiei Dingli and Yorick Wilks, Learning to Harvest Information for the Semantic Web, in Proceedings of the 1st European Semantic Web Symposium, Heraklion, Greece, May 10-12, 2004.

S. Handschuh, S. Staab (eds.). Annotation for the Semantic Web. IOS Press, 2002a

S. Handschuh, M. Koivunen, R. Dieng and S. Staab (eds.): Knowledge Capture 2003 -Proceedings Knowledge Markup and Semantic Annotation Workshop, October 2003

S. Handschuh & S. Staab. Authoring and annotation of web pages in CREAM. 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pp. 462-473, 2002b. ISBN: 1-58113-449-5

Hearst, M. Untangling text data mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999

Hollink, L., G. Schreiber, J. Wielemaker and B. Wielinga. Semantic Annotation of Image Collections. In S. Handschuh, M. Koivunen, R. Dieng and S. Staab (eds.): Knowledge Capture 2003 -- Proceedings Knowledge Markup and Semantic Annotation Workshop, October 2003.

Hollink, L., G. Nguyen, D. Koelma, G. Schreiber, M. Worring. User Strategies In Video Retrieval: a Case Study. International Conference on Image and Video Retrieval CIVR 2004,Dublin, July 2004.

Hyvonen, E., S. Kettula, V. Raatikka, S. Saarela, and K. Viljanen. Finnish museums on the semantic web. In Proceedings of WWW2003, Budapest, poster papers, 2003.

33 INCCA is a project of a group of eleven international modern art museums and related institutions. INCCA's most important set of objectives, which are closely interlinked, focuses on the building of a website with underlying databases that will facilitate the exchange of professional knowledge and information about modern art. Furthermore, INCCA partners are involved in a collective effort to gather information directly from artists.

Jackson, P. and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization. Amsterdam: John Benjamins, 2002.

Mitchell, T. Machine Learning. McGraw-Hill, 1999.

Lloyd Rutledge, Martin Alberink, Rogier Brussee, Stanislav Pokraev, William van Dieten, and Mettina Veenstra. Finding the Story - Broader Applicability of Semantics and Discourse for Hypermedia Generation. In: Proceedings of the 14th ACM conference on Hypertext and Hypermedia (pages 67-76), August 23-2003, Nottingham, UK

Guus Schreiber, Barbara Dubbeldam, Jan Wielemaker, and Bob Wielinga. Ontology-based photo annotation. IEEE Intelligent Systems, May/June 2001.

R. Troncy. Integrating Structure and Semantics into Audio-visual Documents. In: D. Fensel, K. Sycara and J. Mylopoulos (eds.) The Semantic Web - Proceedings ISWC'03, Sanibel Island, Florida. Lecture Notes in Computer Science, volume 2870, Berlin/Heidelberg, Springer-Verlag, 2003.

P. Wittenburg, D. Broeder, P. Buitelaar: Towards Metadata Interoperability. Proceedings of the ACL 2004 Conference. To appear. 2004a

Peter Wittenburg, Greg Gulrajani, Daan Broeder, Marcus Uneson: Cross-Disciplinary Integration of Metadata Descriptions. Proceedings of the LREC2004 Conference. To appear. 2004b

5b) Most important publications of the research team

Guus Schreiber, Hans Akkermans, Anjo Anjewierden, Robert de Hoog, Nigel Shadbolt, Walter Van de Velde and Bob Wielinga. Knowledge Engineering and Management: The CommonKADS Methodology, MIT Press, ISBN 0262193000. 2000.

Guus Schreiber, Barbara Dubbeldam, Jan Wielemaker, and Bob Wielinga. Ontology-based photo annotation. IEEE Intelligent Systems, May/June 2001.

Mike Dean, Guus Schreiber (eds.), Sean Bechhofer, Frank van Harmelen, Jim Hendler, Ian Horrocks, Deborah McGuinness, Peter Patel-Schneider and Lynn Andrea Stein. OWL Web Ontology Language Reference. W3C Recommendation 10 February 2004.

P. Wittenburg, D. Broeder, P. Buitelaar: Towards Metadata Interoperability. Proceedings of the ACL 2004 Conference. To appear. 2004a

Core Project 640.002.401

1a) Project title

Reading Images in the Cultural Heritage

1b) Project acronym RICH

1c) Principal investigator

Prof. dr. E. Postma (Maastricht University)

1d) Main project location ROB

2) Composition of the research team:

· 1 PhD student (AI, machine learning, and image recognition)

· 1 Postdoc (AI, machine learning, and image recognition)

· 1 Scientific Programmer

· Dr. A.G. Lange (ROB)

· Prof.dr. E. Postma (UM)

· Prof.dr. J. van den Herik (UM)

· Ir. N. Bergboer (UM)

· Drs. E. Drenth (ROB)

3) Description of the proposed research

3a) Problem statement and research objectives

The archaeological heritage covers in time 99% of our collective memory. Its material of study usually lends itself especially to studying everyday life. The scarce remains of our past that are available for study consist mainly of fragmentary and dispersed (parts of) objects. Fundamental to the identification of archaeological remains is comparing the finds with similar objects from elsewhere and recombining the existing knowledge on these objects. To be able to explain archaeological phenomena, one first compares (images of) the objects at hand with (images of) objects kept elsewhere. When images match, in-depth analysis of the descriptions follows and eventually leads to an enriched knowledge base.

Archaeology as a discipline has lately seen many changes in the way it is practised. Under the influence of new European legislation (Treaty of Valletta, Malta, 1992) the number of excavations grew fast. The number of active archaeologists has grown accordingly: from fewer than 100 before "Malta" to more than 1000 now.

Perhaps the privatisation of field research has had the most profound impact. Instead of a situation in which excavation and desktop research, policy making and Archaeological Heritage Management were integrated into one or a few rather large institutions, we see the development of an archaeology market consisting mainly of small excavation units.

Together these mechanisms put the accumulation of knowledge under severe pressure. Many of the smaller firms have no direct access to the knowledge base, be it in the form of specialist knowledge or in the form of literature. What we see is a standstill in data accumulation and a threshold to the access of knowledge, while at the same time the need for ready access to state-of-the-art knowledge is growing at a high rate.

The number of recovered archaeological objects is beyond our imagination. In the archives and storerooms of the archaeological institutions there are billions of sherds, flints, metal objects, etc. The variation in form, texture (fabric) and decoration has been studied in a scientific manner for over 200 years. From this collection a corpus of knowledge has been built on the distribution in space and time, the evolution of the technology used to make things, and the function and role of particular objects in ancient society. The magnitude of this corpus, partly laid down in books, is nearly as overwhelming as the number of objects themselves. Because archaeology destroys its own primary sources by excavating, old excavation reports, monographs and catalogues, being the only remaining (secondary) sources, are still an essential part of the knowledge base. To communicate all this information archaeologists traditionally use the concept of reference collections. Much like the use of type specimens in biology, archaeologists classify the finds in types and series of types. This is a mental process that combines and recombines evidence and theory from the finds at hand and from earlier archaeological research. The result of this process is usually a theory of the site's socio-economic and cultural role and a presentation of the evidence on which this theory has been built. Sometimes this evidence is presented as a catalogue-like addendum. The ordering of the finds is described and the key objects are depicted in line drawings and photographs. Other researchers may refer to this body of knowledge, make amendments to the interpretation and consequently adjust the classification.

This is what is meant by a reference collection: a constantly updated body of knowledge, consisting of type series, that can be a subject of study in itself but that also refers to explicit knowledge accessible in books and to implicit knowledge accessible by talking to a specialist, and that is available to all who are interested.

Today we are facing four challenges:

1. How can we safeguard the existing knowledge base?

2. How can we guarantee ready access for all?

3. How can we guarantee the incorporation of new knowledge in a sustainable way?

4. How can we enrich the existing and forthcoming knowledge by new techniques?

The answer to these questions will be the development of an electronic National Reference Collection (NRc), which is under way, as part of a Europe-wide network of portals to reference collections (eRC). Archaeology is first and foremost based on the visual inspection and recognition of objects. Images will therefore be central in this development.

The field of digital vision has developed to the point where it is becoming realistic to incorporate these new techniques into the eRC and thereby enhance the quality of archaeological research and archaeological heritage management in a fundamental way. Automatic recognition of the form, fabric, and decoration of physical objects and of printed images is the focus of the RICH project. This instrument will benefit archaeological practice and knowledge building, and it is of equal importance to education and training.

The results of the RICH project are essential contributions to this development, which has as its ultimate aims:

1. increasing the efficacy and efficiency of digital access to archaeological core knowledge

2. reinforcing the infrastructure on archaeological core knowledge

3. improving the quality of material studies in Dutch archaeological heritage management and archaeological research in Europe, including the formulation of new research areas.

Research question

How can artificial intelligence support the automatic visual analysis of archaeological objects?

3b) Scientific approach and methodology

The approach followed in the RICH project is empirical. Machine-learning algorithms are trained on large collections of images. After training, the ability to recognize or classify previously unseen images is assessed, yielding a measure of generalisation performance. The scientific methodology employed consists of four phases: (1) data collection, (2) data pre-processing, (3) training, and (4) evaluation.
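The four phases can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not part of the project plan: the data are synthetic two-dimensional feature vectors rather than images, the class names are hypothetical, and a simple nearest-centroid classifier stands in for whatever machine-learning algorithm is eventually chosen. The point is only the procedure: scaling is fitted on the training data alone, and accuracy is measured on samples the classifier has not seen.

```python
import random

def collect(n_per_class, seed=0):
    """Phase 1 (stand-in): synthetic feature vectors for two hypothetical classes."""
    rng = random.Random(seed)
    centres = {"class_a": (0.0, 0.0), "class_b": (3.0, 3.0)}
    data = [((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label)
            for label, (cx, cy) in centres.items()
            for _ in range(n_per_class)]
    rng.shuffle(data)
    return data

def fit_scaler(train_set):
    """Phase 2a: estimate per-feature mean and standard deviation on training data."""
    dims = range(len(train_set[0][0]))
    n = len(train_set)
    means = [sum(x[d] for x, _ in train_set) / n for d in dims]
    stds = [(sum((x[d] - means[d]) ** 2 for x, _ in train_set) / n) ** 0.5 or 1.0
            for d in dims]
    return means, stds

def scale(data, means, stds):
    """Phase 2b: apply the training-set scaling to any data set."""
    return [(tuple((v - m) / s for v, m, s in zip(x, means, stds)), y)
            for x, y in data]

def train(samples):
    """Phase 3: compute one centroid per class (nearest-centroid classifier)."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def evaluate(centroids, test_set):
    """Phase 4: fraction of previously unseen samples classified correctly."""
    return sum(predict(centroids, x) == y for x, y in test_set) / len(test_set)

def run_experiment(seed=0, train_fraction=0.7):
    data = collect(100, seed)
    split = int(train_fraction * len(data))
    train_set, test_set = data[:split], data[split:]
    means, stds = fit_scaler(train_set)
    centroids = train(scale(train_set, means, stds))
    return evaluate(centroids, scale(test_set, means, stds))

if __name__ == "__main__":
    print(f"generalisation accuracy: {run_experiment():.2f}")
```

Holding the test samples out of both scaling and training is what makes the reported figure a measure of generalisation rather than of memorisation.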

Data collection. For the archaeological domain, digital data is collected incrementally by digitizing stored objects or newly found objects. Digitization may proceed indirectly by scanning photographs of multiple views of the objects or directly by means of a digital camera. During the project, the size of the digital collection grows steadily. The collection of data is restricted to four classes of objects: pottery, glass, flint and coins. We briefly discuss each of these classes.

· Pottery. Often, large collections of pottery are unearthed at archaeological sites. The shapes of the (fragments of) objects obey certain geometrical laws. Together with texture, the shape can be related to a certain period, location, and socio-economic or cultural entity. High-quality classification systems for pottery are available and support the archaeologist in assigning the found object to a certain class. However, the subjective nature of examining the shape and texture of objects hampers the reliability of classification. The pottery project aims at supporting the archaeologist in the classification of unearthed objects by means of advanced visual analysis techniques. It will draw attention both from professional archaeologists and from a potentially wide non-professional audience.

· Glass. The late-medieval glass collection of the ROB is well classified, dated and documented and consists of a limited number of object shapes. These shapes are often depicted in late-medieval paintings. Archaeologists and art historians are interested in finding matches between the documented and depicted shapes because such matches put constraints on the time and location of the glass under consideration. Using artificial-intelligence techniques, documented two-dimensional drawings or pictures of an object are translated into digital representations of corresponding three-dimensional objects. These representations are matched to the contents of digitized late-medieval paintings in the Rijksmuseum.

· Flint. The classification of flint artefacts is a human endeavour. Archaeological experts analyze visual characteristics such as shape and texture to assign the artefact to a certain time and location. In the flint project, a system that is trained to recognize two-dimensional views of flint artefacts is developed along the same lines as in the pottery project. The complex three-dimensional shape of flint artefacts may necessitate a user-guided classification that proceeds as follows. An artefact is presented to a digital camera (under standard light conditions). Using feature-extraction techniques the digital image is transformed and classified with a certain reliability. Initially, the reliability is rather low. However, the user can enhance the reliability by manually rotating the flint artefact in front of the camera until an acceptable classification is achieved.

· Coins. Coins are among the finds that most capture the imagination, and they were collected and studied in the Netherlands even before archaeology became a scientific discipline in 1818 at Leiden University³⁴. In coins only the illustration is significant. Because there is no need to account for variations in form and texture, they are a good starting point for computer vision analysis. For learning and comparison, both digital images and the coins themselves are available in large quantities at the Koninklijk Penningen en Munten Kabinet.

The advantages and effects of digitally guided determination of new coins that are offered by amateur archaeologists should not be underestimated. While it will not replace the expert, it will free him/her from trivial tasks and allow him/her to concentrate on more scientific activities. It will have a positive social effect when amateurs can learn about their finds without having to pass thresholds. The net effect will be that many more finds will be reported and that our knowledge will grow tremendously. A similar effect has been noted in Great Britain, where the Portable Antiquities Scheme³⁵ is highly successful.
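The user-guided classification described above for flint artefacts (present a view, classify it with a certain reliability, rotate the artefact and repeat until the classification is acceptable) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the project's method: each camera view is assumed to have been reduced to a short feature vector already, the class prototypes and the labels "scraper" and "arrowhead" are hypothetical, and reliability is modelled as a softmax over negative squared distances to the prototypes.

```python
import math

def classify_view(features, prototypes):
    """Return (label, reliability) for one view.

    Reliability is a softmax over negative squared distances to the class
    prototypes; this is an illustrative choice, not a calibrated measure.
    """
    scores = {label: -sum((f - p) ** 2 for f, p in zip(features, proto))
              for label, proto in prototypes.items()}
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total

def classify_until_reliable(views, prototypes, threshold=0.9):
    """Consume views (one per rotation) until reliability reaches the threshold.

    Returns (label, reliability, views_used). If no view reaches the
    threshold, the most reliable classification seen so far is returned.
    """
    best_label, best_reliability, used = None, 0.0, 0
    for features in views:
        used += 1
        label, reliability = classify_view(features, prototypes)
        if reliability > best_reliability:
            best_label, best_reliability = label, reliability
        if reliability >= threshold:
            break
    return best_label, best_reliability, used

if __name__ == "__main__":
    # Hypothetical prototypes and three successive views of one artefact.
    prototypes = {"scraper": (0.0, 0.0), "arrowhead": (2.0, 0.0)}
    views = [(1.0, 0.0), (0.9, 0.0), (0.1, 0.0)]
    print(classify_until_reliable(views, prototypes))
```

In this sketch the first view is ambiguous between the two classes, and the loop stops only once a view is close enough to one prototype; the threshold plays the role of the "acceptable classification" in the text.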

Data pre-processing. The pre-processing of image data is necessary for three reasons. First, variations in lighting conditions should be minimized as much as possible. The best way to achieve standard lighting conditions is to employ standardized lighting during digitization. Second, noise and sampling artefacts have to be removed to avoid mistakes in the recognition process. Third, the image data has to be transformed into a format suitable for a machine-learning algorithm. A commonly used method is to apply a wavelet transform in

34 Brongers, J. A. 2002. Een vroeg begin van de moderne archeologie; Leven en werken van Cas Reuvens (1793-1835). ROB, Amersfoort.

35 http://www.finds.org.uk/