Anton Eliëns &
Martin Kersten
CWI
email: eliens@cs.vu.nl, M.Kersten@cwi.nl
Introduction
With the growth of information spaces
the retrieval of information, based on indexing schemes,
becomes increasingly important.
When it comes to information embedded in multimedia objects, however,
we must observe that progress in automatic indexing
is rather limited.
Obviously, taking the World Wide Web as our information space,
manual classification schemes do not suffice, simply
because they do not scale.
The ACOI project [ACOI] provides a large-scale
experimentation platform to study issues in
the indexing and retrieval of multimedia objects.
The resulting ACOI framework is intended to
provide a sound model for indexing and retrieval based
on feature detection, as well
as an effective system architecture accommodating a variety
of algorithms to extract relevant properties
from multimedia objects.
The ACOI approach to multimedia feature detection
is based on the deployment of high-level feature grammars
augmented with media-specific feature detectors
to describe the structural properties of multimedia objects.
The structured objects that correspond to the parse trees
may be used for the retrieval of information.
Key challenges here are to find sufficiently selective properties
for a broad range of multimedia objects
and realistic similarity measures for the retrieval
of information.
In this report, we will look at the indexing and retrieval
of musical fragments.
We aim at providing suitable support for a user to find a musical piece of his liking, by
lyrics, genre, musical instruments, tempo, similarity to other pieces, melody and mood. We propose an
indexing scheme
that allows for the efficient retrieval of musical objects, using descriptive properties,
as well as content-based properties, including lyrics and melody.
This study is primarily aimed at establishing the
architectural requirements for the detection of musical features
and at indicating directions for exploring the
inherently difficult problem of finding proper discriminating
features and similarity measures in the musical domain.
In this study we have limited ourselves to the analysis
of music encoded in MIDI, to avoid the technical difficulties
involved in extracting basic musical properties
from raw sound material.
Currently we have a simple running prototype for
extracting higher level features from MIDI files.
In our approach to musical feature detection,
we extended the basic grammar-based ACOI framework
with an embedded logic component to facilitate
the formulation of predicates and constraints over
the musical structure obtained from the input.
At this stage, the prototype does not include actual query
facilities. However, we will discuss
what query facilities need to be incorporated and how to approach
similarity matching for musical structures to achieve efficient
retrieval.
We will also look at the issues that play a role in content-based
retrieval by briefly reviewing what we consider to
be the most significant attempts in this direction.
Structure
The structure of this report is as follows.
First we will discuss search facilities for music on the Web.
We will then look at the ACOI framework and
the interaction of components supporting grammar-based
feature detection.
We will describe a grammar for musical fragments
and a corresponding feature detector for the extraction
of features from a MIDI file or MIDI fragment.
Also, we will discuss the options for processing
queries and give a brief review of the results
that have been achieved for content-based retrieval,
in particular the recognition of melody based on
similarity metrics.
Finally, we will draw some conclusions
and indicate directions for
further research.
The ACOI framework
slide: The extended ACOI architecture
The ACOI framework is intended to accommodate
a broad spectrum of classification schemes,
manual as well as (semi) automatic, for the indexing
and retrieval of multimedia objects.
It is not the actual multimedia objects themselves that are stored,
but structural descriptions of
these objects (including their location) that may be used for
retrieval.
The ACOI model is based on the assumption that
indexing an arbitrary multimedia object
is equivalent to deriving a grammatical structure
that provides a namespace to reason about the object
and to access its components.
However, there is an important difference with
ordinary parsing in that the lexical and grammatical items
corresponding to the components of the multimedia object
must be created dynamically by inspecting the actual object.
Moreover, in general, there is not a fixed sequence
of lexicals as in the case of natural or formal languages.
To allow for the dynamic creation of lexical
and grammatical items the ACOI framework supports both
black-box and white-box (feature)
detectors.
Black-box detectors are algorithms, usually
developed by a specialist in the media domain,
that extract properties from the media object
by some form of analysis.
White-box detectors, on the other hand,
are created by defining logical
or mathematical expressions over the grammar itself.
In this paper we will focus on black-box detectors only.
The information obtained from parsing
a multimedia object is stored in the Monet database.
The feature grammar and its associated detector
further result in updating the data schemas
stored in the (Monet) database.
The Monet database, which underlies the ACOI framework,
is a customizable, high-performance, main-memory database
developed at the CWI and the University of Amsterdam [Monet].
At the user end, a feature grammar is related to
a View, Query
and Report component,
that respectively allow for inspecting a feature grammar,
expressing a query, and delivering a response
to a query.
Some examples of these components are currently implemented as applets
in Java 1.1 with Swing. See [ACOI].
The processing of a MIDI file,
using the grammar and associated detectors
described in section [Detector],
is depicted in slide [midi-processing].
slide: Processing MIDI file
The input is a MIDI file.
As indicated in the top line, the MIDI file itself may
be generated from a score.
As indicated on the bottom line, processing a MIDI file
results in a collection of features as well as in a (simplified) MIDI file
and corresponding score.
In the current prototype, a collection of Prolog facts is used
as an intermediate representation, from which higher level
features are derived by an appropriate collection of rules.
The (result) MIDI file contains an extract
of the original (input) MIDI file
that may be presented to the (end) user as the result of a query.
This setup allows us to verify whether our extract or
abstraction of the original musical structure is effective,
simply by comparing the input
musical structure with the output (MIDI) extract.
Formal specification
Formally, a feature grammar G may be defined
as G = (V, T, P, S), where V is a collection of
variables or non-terminals,
T a collection of terminals,
P a collection of productions of the form v → w,
with v ∈ V and w a sequence over V ∪ T,
and S ∈ V a start symbol.
A token sequence ts belongs to the
language L(G) if S ⇒* ts.
Sentential token sequences, those belonging to
L(G) or its sublanguages L_v(G)
for v ∈ V, correspond to a complex object
o_v, which is the object corresponding to the parse tree
for v.
The parse tree defines a hierarchical structure
that may be used to access and manipulate the components
of the multimedia object subjected to the detector.
See [Features] for further details.
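As an illustration: for the feature grammar for MIDI files given in the next section we have, roughly,
V = {midi, song, file, lyrics, melody, check}, while T contains the atoms name, text and note,
and the productions include song → file lyrics melody check.
The token sequences themselves are not given in advance but are produced dynamically
by the detectors associated with these non-terminals.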
The anatomy of a MIDI feature detector
Automatic indexing for musical data
is an inherently difficult problem.
Existing systems mostly rely on hand-crafted solutions,
geared towards a particular group of users,
such as composers of film music [MM].
In this section, we will look at a simple feature detector for
MIDI encoded musical data.
It provides a skeleton for future experimentation.
slide: MIDI features
The hierarchical information structure that we consider
is depicted in slide [midi-structure].
It contains only a limited number of basic
properties and must be extended with information
along the lines of the musical ontology proposed
in [AI].
However, the detector presented here provides
a skeleton solution that accommodates an extension with
arbitrary predicates over the musical structure in a transparent manner.
The grammar given below corresponds in an obvious way with
the structure depicted in slide [midi-structure].
detector song; to get the filename
detector lyrics; extracts lyrics
detector melody; extracts melody
detector check; to walk the tree
atom str name;
atom str text;
atom str note;
midi: song;
song: file lyrics melody check;
file: name;
lyrics: text*;
melody: note*;
slide: A simple feature grammar for MIDI files
The start symbol is a song.
The detector that is associated with song
reads in a MIDI file.
The musical information contained in the MIDI file is then stored
as a collection of Prolog facts.
This translation is very direct.
In effect, the MIDI file header information is stored, and
events are recorded as facts, as illustrated below
for a note_on and a note_off event.
event('twinkle',2,time=384, note_on:[chan=2,pitch=72,vol=111]).
event('twinkle',2,time=768, note_off:[chan=2,pitch=72,vol=100]).
After translating the MIDI file into Prolog format, the other detectors
are invoked, that is the lyrics and melody detectors,
to extract the information related to these properties.
To extract relevant fragments of the melody we
use the melody detector, of which a partial listing is given below.
int melodyDetector(tree *pt, list *tks) {
  char buf[1024]; char* _result;
  void* q = _query;   /* handle to the embedded logic component */
  int idq = 0;

  /* ask the logic component for the notes that constitute the melody */
  idq = query_eval(q, "X:melody(X)");
  while ((_result = query_result(q, idq))) {
    /* add each note to the token stream */
    putAtom(tks, "note", _result);
  }
  return SUCCESS;
}
slide: The melody detector
The embedded logic component is given the
query X:melody(X), which results in the notes
that constitute the (relevant fragment of the) melody.
These notes are then added to the token stream.
A similar detector is available for the lyrics.
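To give an impression of what a rule such as melody might look like on the logic side,
the following is a minimal sketch in plain Prolog, and not the actual rule set of the prototype.
It assumes that the melody is carried by a single MIDI channel (channel 2, as in the example facts above)
and simply collects the pitches of the note_on events in order of time.
% sketch: derive the melody as the time-ordered pitches of the
% note_on events on the (assumed) melody channel
melody(Pitches) :-
    findall(Time-Pitch,
            event(_Song, _Track, time=Time, note_on:[chan=2, pitch=Pitch, vol=_]),
            Events),
    keysort(Events, Sorted),
    findall(Pitch, member(_Time-Pitch, Sorted), Pitches).
The actual rules will in addition have to select the proper track or channel
and map pitch numbers onto note names such as a-2, but the principle,
deriving higher level features by logical rules over the event facts, remains the same.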
Parsing a given MIDI file, for example twinkle.mid,
results in updating the Monet database as indicated below.
V1 := newoid();
midi_song.insert(oid(V0),oid(V1));
V2 := newoid();
song_file.insert(oid(V1),oid(V2));
file_name.insert(oid(V2),"twinkle");
song_lyrics.insert(oid(V1),oid(V2));
lyrics_text.insert(oid(V2),"e");
lyrics_text.insert(oid(V2),"per-");
lyrics_text.insert(oid(V2),"sonne");
lyrics_text.insert(oid(V2),"Moi");
lyrics_text.insert(oid(V2),"je");
lyrics_text.insert(oid(V2),"dis");
lyrics_text.insert(oid(V2),"que");
lyrics_text.insert(oid(V2),"les");
lyrics_text.insert(oid(V2),"bon-");
lyrics_text.insert(oid(V2),"bons");
lyrics_text.insert(oid(V2),"Val-");
lyrics_text.insert(oid(V2),"ent");
song_melody.insert(oid(V1),oid(V2));
melody_note.insert(oid(V2),"a-2");
melody_note.insert(oid(V2),"a-2");
melody_note.insert(oid(V2),"g-2");
melody_note.insert(oid(V2),"g-2");
melody_note.insert(oid(V2),"f-2");
melody_note.insert(oid(V2),"f-2");
melody_note.insert(oid(V2),"e-2");
melody_note.insert(oid(V2),"e-2");
melody_note.insert(oid(V2),"d-2");
melody_note.insert(oid(V2),"d-2");
melody_note.insert(oid(V2),"e-2");
melody_note.insert(oid(V2),"c-2");
slide: Update of Monet database
The updates clearly reflect the structure of
the musical information object that corresponds to
the properties defined in the grammar.
Implementation status
Currently, we have a running prototype of the MIDI
feature detector.
It uses an adapted version of public domain
MIDI processing software.
The embedded logic component is part of the hush
framework. It uses an object extension of Prolog
that allows for the definition of native objects
to interface with the MIDI processing software.
A description of the logic component is beyond the
scope of this paper, but will be provided in [OO].
The logic component, however, allows for
the definition of arbitrary predicates to extract
the musical information, such as the melody and the lyrics.
As stated before,
the current detector must be regarded as
a skeleton implementation that provides
the basis for further experimentation.
Queries -- the user interface
Assuming that we have an adequate solution
for indexing musical data, we need to define how end users may access these data,
that is, search for musical objects in the information space
represented by the database, which for the ACOI project is the World Wide Web.
slide: Query interface
For a limited category of users, those with some musical skills,
a direct interface such as a keyboard
or a score editor, as provided by the hush framework
[Jamming], might be suitable for
querying the musical database.
Yet, for many others, a textual description, or a form-based
query will be more appropriate.
In slide [query-interface] our envisaged user interface
for querying is depicted.
It provides limited score editing facilities, to enable
the user to indicate a melody (including rhythmic structure)
in common musical notation.
At this stage, we do indeed assume some basic musical skills.
In addition to the melodic fragment, which is depicted
in the middle frame, we allow the user to give additional
information, such as the composer, the name of the song,
and (possibly) a text-based outline.
The user may also indicate a genre, the instrumentation
and additional descriptive features in a free text format,
which may include fragments of the lyrics.
As regards the matching algorithm,
the user may express a preference for either strict
or approximate matching of the melody or melodic contour,
with or without rhythm. See section [Match] for a discussion
of matching algorithms.
slide: User Query Processing
In processing a query, we may
derive a partial melody or rhythmic structure from the query,
as well as some additional features or criteria.
As explained in the previous section, the output of indexing MIDI files
consists of both information concerning features
and a musical rendering of some of these features.
These features can be used to match against
the criteria formulated in the query.
The musical renderings, which include a partial score,
may be presented to the user in response to a query,
to establish whether the result is acceptable.
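Purely as an illustration of how descriptive and content-based criteria may be combined,
consider the following hypothetical Prolog sketch. It assumes that the indexed features are
available as facts song_composer/2 and song_melody/2, which is a simplification of the
actual Monet storage scheme, and it realizes strict matching of a melodic fragment.
% sketch: a song matches if it satisfies the descriptive criterion and
% its melody contains the query fragment as a contiguous subsequence
matches(Song, Composer, QueryNotes) :-
    song_composer(Song, Composer),
    song_melody(Song, Notes),
    append(_Prefix, Rest, Notes),
    append(QueryNotes, _Suffix, Rest).
Approximate matching, as discussed in section [Match], would replace the containment test
by a similarity measure and deliver a ranked rather than a boolean result.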
The output of a query will be a ranked list
of items found in the database.
Each item in the list will be represented by a
thumbnail of the score,
an auditory icon representing the musical fragment found,
the name of the song, the composer and a reference
to the original musical object on the Web.
Conclusions
This report presents an approach to the detection
of musical features which is based on the use of feature
grammars as developed in the ACOI framework
to describe the structural properties of musical data.
The goal of this work is to support the user
in finding a musical piece of his liking, by lyrics,
genre, musical instruments, tempo,
similarity to other pieces, melody and mood.
At this stage we have a prototype for the extraction of relatively
simple features from a MIDI file, which uses an embedded logic component
to extract content-related properties of the data.
The next step in our research will consist of
creating suitable ways for querying the musical database.
We will have to explore how to present
a possibly large collection of matches,
and how to assist the end user in refining a
query so as to obtain the desired result.
The greatest effort, however, will be to arrive at a matching scheme
that allows for the retrieval of musical information from
a large database. Looking at the literature, in particular [Compare],
we discovered suitable dynamic programming algorithms that
may be used to detect similarities in melodic and rhythmic
structure. However, due to the computational complexity of
these algorithms, actual search in a large database will
be prohibitively expensive, unless some compact representation
of the original musical material can be found, to restrict
the matching process to what may be regarded as a minimal invariant
abstraction of the original piece of music.
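To indicate the flavour of such algorithms, the following is a minimal sketch,
in plain Prolog, of the edit-distance recurrence that underlies many of these
similarity measures. A practical implementation would compute the recurrence bottom-up
in a dynamic programming table and would typically compare pitch intervals and durations
rather than raw notes, with weighted costs for the various operations.
% sketch: edit distance between two note (or interval) sequences, that is,
% the minimal number of insertions, deletions and substitutions
note_distance([], Notes, D) :- length(Notes, D).
note_distance([N|Ns], [], D) :- length([N|Ns], D).
note_distance([N|Ns], [M|Ms], D) :-
    ( N == M -> Subst = 0 ; Subst = 1 ),
    note_distance(Ns, Ms, D1),          % substitute N by M (or match)
    note_distance([N|Ns], Ms, D2),      % insert M
    note_distance(Ns, [M|Ms], D3),      % delete N
    D is min(D1 + Subst, min(D2 + 1, D3 + 1)).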
An alternative solution would be to create additional indexes
based on, for example, the distribution of instrument usage,
intervals, and note durations, that may augment the matching
process by acting as an extra filter.