Musical feature detection in ACOI

Anton Eliëns & Martin Kersten
CWI, email: eliens@cs.vu.nl, M.Kersten@cwi.nl

Introduction

With the growth of information spaces, the retrieval of information based on indexing schemes becomes increasingly important. When it comes to information embedded in multimedia objects, however, progress in automatic indexing is rather limited. Obviously, taking the World Wide Web as our information space, manual classification schemes do not suffice, simply because they do not scale.

The ACOI project [ACOI] provides a large scale experimentation platform to study issues in the indexing and retrieval of multimedia objects. The resulting ACOI framework is intended to provide a sound model for indexing and retrieval based on feature detection, as well as an effective system architecture accommodating a variety of algorithms to extract relevant properties from multimedia objects. The ACOI approach to multimedia feature detection is based on the deployment of high-level feature grammars, augmented with media-specific feature detectors, to describe the structural properties of multimedia objects. The structured objects that correspond to the parse trees may be used for the retrieval of information. Key challenges here are to find sufficiently selective properties for a broad range of multimedia objects and realistic similarity measures for the retrieval of information.

In this report, we will look at the indexing and retrieval of musical fragments. We aim at providing suitable support for a user to find a musical piece of their liking, by lyrics, genre, musical instruments, tempo, similarity to other pieces, melody and mood. We propose an indexing scheme that allows for the efficient retrieval of musical objects, using descriptive properties as well as content-based properties, including lyrics and melody.

This study is primarily aimed at establishing the architectural requirements for the detection of musical features and at indicating directions for exploring the inherently difficult problem of finding proper discriminating features and similarity measures in the musical domain. We have limited ourselves to the analysis of music encoded in MIDI, to avoid the technical difficulties involved in extracting basic musical properties from raw sound material. Currently we have a simple running prototype for extracting higher level features from MIDI files. In our approach to musical feature detection, we extended the basic grammar-based ACOI framework with an embedded logic component to facilitate the formulation of predicates and constraints over the musical structure obtained from the input.

At this stage, the prototype does not include actual query facilities. However, we will discuss what query facilities need to be incorporated and how to approach similarity matching for musical structures to achieve efficient retrieval. We will also look at the issues that play a role in content-based retrieval, by briefly reviewing what we consider to be the most significant attempts in this direction.

Structure

The structure of this report is as follows. First we will discuss search facilities for music on the Web. We will then look at the ACOI framework and the interaction of components supporting grammar-based feature detection. We will describe a grammar for musical fragments and a corresponding feature detector for the extraction of features from a MIDI file or MIDI fragment. Also, we will discuss the options for processing queries and give a brief review of the results that have been achieved for content-based retrieval, in particular the recognition of melody based on similarity metrics. Finally, we will draw some conclusions and indicate directions for further research.

The ACOI framework



slide: The extended ACOI architecture

The ACOI framework is intended to accommodate a broad spectrum of classification schemes, manual as well as (semi-)automatic, for the indexing and retrieval of multimedia objects. What is stored are not the actual multimedia objects themselves, but structural descriptions of these objects (including their location) that may be used for retrieval.

The ACOI model is based on the assumption that indexing an arbitrary multimedia object is equivalent to deriving a grammatical structure that provides a namespace to reason about the object and to access its components. There is, however, an important difference with ordinary parsing: the lexical and grammatical items corresponding to the components of the multimedia object must be created dynamically by inspecting the actual object. Moreover, in general there is not a fixed sequence of lexicals as in the case of natural or formal languages. To allow for the dynamic creation of lexical and grammatical items, the ACOI framework supports both black-box and white-box (feature) detectors. Black-box detectors are algorithms, usually developed by a specialist in the media domain, that extract properties from the media object by some form of analysis. White-box detectors, on the other hand, are created by defining logical or mathematical expressions over the grammar itself. In this paper we will focus on black-box detectors only.

The information obtained from parsing a multimedia object is stored in the Monet database. The feature grammar and its associated detectors further result in updating the data schemas stored in the (Monet) database. The Monet database, which underlies the ACOI framework, is a customizable, high-performance, main-memory database developed at CWI and the University of Amsterdam [Monet].

At the user end, a feature grammar is related to a View, Query and Report component, which respectively allow for inspecting a feature grammar, expressing a query, and delivering a response to a query. Some examples of these components are currently implemented as applets in Java 1.1 with Swing. See [ACOI].

The processing that occurs for a MIDI file, using the grammar and associated detectors described in section Detector, is depicted in slide midi-processing.



slide: Processing MIDI file

The input is a MIDI file. As indicated in the top line, the MIDI file itself may be generated from a score. As indicated on the bottom line, processing a MIDI file results in a collection of features as well as in a (simplified) MIDI file and corresponding score. In the current prototype, a collection of Prolog facts is used as an intermediate representation, from which higher level features are derived by an appropriate collection of rules.

The (result) MIDI file contains an extract of the original (input) MIDI file that may be presented to the (end) user as the result of a query. This setup allows us to verify whether our extract or abstraction of the original musical structure is effective, simply by comparing the input musical structure with the output (MIDI) extract.
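As an illustration of such rules, a higher level feature might be derived from the Prolog facts along the following lines. This is a minimal sketch and not part of the actual rule set of the prototype; the event fact format is the one shown later in this report, and the predicate pitches_used/2 is hypothetical.

  % Sketch: deriving a higher level feature from the event facts produced
  % for a MIDI file; pitches_used/2 is a hypothetical example predicate.

  :- use_module(library(lists)).

  % pitches_used(+Song, -Pitches): the set of MIDI pitches occurring in Song.
  pitches_used(Song, Pitches) :-
      findall(P,
              ( event(Song, _Track, time=_T, note_on:Args),
                member(pitch=P, Args) ),
              Ps),
      sort(Ps, Pitches).    % sort/2 also removes duplicates

Similar rules could, for example, count the notes per channel or derive the set of instruments used.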

Formal specification

Formally, a feature grammar G may be defined as G = (V, T, P, S), where V is a collection of variables or non-terminals, T a collection of terminals, P a collection of productions of the form V -> (V ∪ T)*, and S a start symbol. A token sequence ts belongs to the language L(G) if S ⇒* ts. Sentential token sequences, those belonging to L(G) or to one of its sublanguages L(G_v), with G_v = (V_v, T_v, P_v, v) for v ∈ (T ∪ V), correspond to a complex object C_v, which is the object corresponding to the parse tree for v. The parse tree defines a hierarchical structure that may be used to access and manipulate the components of the multimedia object subjected to the detector. See [Features] for further details.
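To give a concrete example, anticipating the MIDI feature grammar presented below, a parse of a MIDI token sequence follows a derivation of the form

  midi ⇒ song ⇒ file lyrics melody check ⇒* name text* note*

where the terminals name, text and note are not read from a fixed input but are produced by the detectors associated with song, lyrics and melody, while check, which only walks the tree, contributes no tokens.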


The anatomy of a MIDI feature detector

Automatic indexing of musical data is an inherently difficult problem. Existing systems mostly rely on hand-crafted solutions, geared towards a particular group of users, such as composers of film music [MM]. In this section, we will look at a simple feature detector for MIDI-encoded musical data. It provides a skeleton for future experimentation.

slide: MIDI features

The hierarchical information structure that we consider is depicted in slide midi-structure. It contains only a limited number of basic properties and must be extended with information along the lines of the musical ontology proposed in [AI]. However, the detector presented here provides a skeleton solution that accommodates an extension with arbitrary predicates over the musical structure in a transparent manner.

The grammar given below corresponds in an obvious way with the structure depicted in slide midi-structure.



  
  detector song;     ## to get the filename
  detector lyrics;   ## extracts lyrics
  detector melody;   ## extracts melody
  detector check;    ## to walk the tree

  atom str name;
  atom str text;
  atom str note;

  midi: song;
  song: file lyrics melody check;
  file: name;
  lyrics: text*;
  melody: note*;

slide: A simple feature grammar for MIDI files

The start symbol is a song. The detector that is associated with song reads in a MIDI file. The musical information contained in the MIDI file is then stored as a collection of Prolog facts. This translation is very direct. In effect, the MIDI file header information is stored and the events are recorded as facts, as illustrated below for a note_on and a note_off event.

  event('twinkle',2,time=384, note_on:[chan=2,pitch=72,vol=111]).
  event('twinkle',2,time=768, note_off:[chan=2,pitch=72,vol=100]).
  
After the MIDI file has been translated into Prolog facts, the other detectors are invoked, that is, the composer, lyrics and melody detectors, to extract the information related to these properties.

To extract relevant fragments of the melody we use the melody detector, of which a partial listing is given below.


  int melodyDetector(tree *pt, list *tks) {
    char buf[1024];
    char* _result;
    void* q = _query;   /* handle to the embedded logic component */
    int idq = 0;

    /* ask the logic component for the notes that make up the melody */
    idq = query_eval(q, "X:melody(X)");

    /* each result is one note; add it to the token stream */
    while ((_result = query_result(q, idq))) {
      putAtom(tks, "note", _result);
    }
    return SUCCESS;
  }
  

slide: The melody detector

The embedded logic component is given the query X:melody(X), which results in the notes that constitute the (relevant fragment of the) melody. These notes are then added to the token stream. A similar detector is available for the lyrics.
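By way of illustration, the melody predicate might be defined along the following lines. This is a hedged sketch only; the actual rules of the prototype are not listed in this report, and the helper melody_channel/1 is hypothetical. The sketch yields numeric MIDI pitches, whereas the prototype renders notes as names such as "a-2" (see the database update below); that mapping is omitted here.

  % Sketch of a melody/1 predicate over the event facts; each solution
  % yields one note of the melody, in order of increasing time.
  % melody_channel/1 (the channel carrying the melody) is hypothetical.

  :- use_module(library(lists)).

  melody(Pitch) :-
      melody_channel(Chan),
      findall(T-P,
              ( event(_Song, _Track, time=T, note_on:Args),
                member(chan=Chan, Args),
                member(pitch=P, Args) ),
              Pairs),
      keysort(Pairs, Sorted),        % order the note_on events by time
      member(_Time-Pitch, Sorted).   % backtrack over the successive notes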

Parsing a given MIDI file, for example twinkle.mid, results in updating the Monet database as indicated below.


  V1 := newoid();
    midi_song.insert(oid(V0),oid(V1));
  V2 := newoid();
      song_file.insert(oid(V1),oid(V2));
        file_name.insert(oid(V2),"twinkle");
      song_lyrics.insert(oid(V1),oid(V2));
        lyrics_text.insert(oid(V2),"e");
        lyrics_text.insert(oid(V2),"per-");
        lyrics_text.insert(oid(V2),"sonne");
        lyrics_text.insert(oid(V2),"Moi");
        lyrics_text.insert(oid(V2),"je");
        lyrics_text.insert(oid(V2),"dis");
        lyrics_text.insert(oid(V2),"que");
        lyrics_text.insert(oid(V2),"les");
        lyrics_text.insert(oid(V2),"bon-");
        lyrics_text.insert(oid(V2),"bons");
        lyrics_text.insert(oid(V2),"Val-");
        lyrics_text.insert(oid(V2),"ent");
      song_melody.insert(oid(V1),oid(V2));
        melody_note.insert(oid(V2),"a-2");
        melody_note.insert(oid(V2),"a-2");
        melody_note.insert(oid(V2),"g-2");
        melody_note.insert(oid(V2),"g-2");
        melody_note.insert(oid(V2),"f-2");
        melody_note.insert(oid(V2),"f-2");
        melody_note.insert(oid(V2),"e-2");
        melody_note.insert(oid(V2),"e-2");
        melody_note.insert(oid(V2),"d-2");
        melody_note.insert(oid(V2),"d-2");
        melody_note.insert(oid(V2),"e-2");
        melody_note.insert(oid(V2),"c-2");
  
  

slide: Update of Monet database

The updates clearly reflect the structure of the musical information object that corresponds to the properties defined in the grammar.

Implementation status

Currently, we have a running prototype of the MIDI feature detector. It uses an adapted version of public domain MIDI processing software. The embedded logic component is part of the hush framework. It uses an object extension of Prolog that allows for the definition of native objects to interface with the MIDI processing software. A description of the logic component is beyond the scope of this paper, but will be provided in  [OO]. The logic component, however, allows for the definition of arbitrary predicates to extract the musical information, such as the melody and the lyrics. As stated before, the current detector must be regarded as a skeleton implementation that provides the basis for further experimentation.

Queries -- the user interface

Assuming that we have an adequate solution for indexing musical data, we need to define how end users may access these data, that is, search for musical objects in the information space represented by the database, which for the ACOI project is the World Wide Web.

slide: Query interface

For a limited category of users, those with some musical skills, direct input by means of a keyboard or a score editor, as provided by the hush framework [Jamming], might provide a suitable interface for querying the musical database. Yet, for many others, a textual description or a form-based query will be more appropriate.

In slide query-interface our envisaged user interface for querying is depicted. It provides limited score editing facilities, to enable the user to indicate a melody (including rhythmic structure) in common musical notation. At this stage, we do assume some basic musical skills. In addition to the melodic fragment, which is depicted in the middle frame, we allow the user to give additional information, such as the composer, the name of the song, and (possibly) a text-based outline. The user may also indicate a genre, the instrumentation and additional descriptive features in a free text format, which may include fragments of the lyrics. As regards the matching algorithm, the user may express a preference for either strict or approximate matching of the melody or melodic contour, with or without rhythm. See section Match for a discussion of matching algorithms.
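To make the notions of melody and melodic contour more concrete, the sketch below shows how a sequence of MIDI pitches could be reduced to an interval sequence and to an up/down/repeat contour. This is an illustration of common practice, not code from the prototype, and the predicate names are our own.

  % Reducing a pitch sequence to intervals and to an up/down/repeat contour.
  :- use_module(library(apply)).

  intervals([], []).
  intervals([_], []).
  intervals([P1, P2 | Ps], [I | Is]) :-
      I is P2 - P1,
      intervals([P2 | Ps], Is).

  contour(Intervals, Contour) :-
      maplist(direction, Intervals, Contour).

  direction(0, repeat).
  direction(I, up)   :- I > 0.
  direction(I, down) :- I < 0.

  % ?- intervals([60,60,67,67,69,69,67], Is), contour(Is, Cs).
  % Is = [0,7,0,2,0,-2], Cs = [repeat,up,repeat,up,repeat,down].

Strict matching would compare the pitch (or interval) sequences directly, whereas approximate matching could be performed on the contour, possibly ignoring rhythm.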



slide: User Query Processing

In processing a query, we may derive a partial melody or rhythmic structure from the query, as well as some additional features or criteria. As explained in the previous section, the output of indexing MIDI files consists of both feature information and a musical rendering of some of these features. The features can be used to match against the criteria formulated in the query. The musical renderings, which include a partial score, may be presented to the user in response to a query, to establish whether the result is acceptable.

The output of a query will be a ranked list of items found in the database. Each item in the list will be represented by a thumbnail of the score, an auditory icon representing the musical fragment found, the name of the song, the composer and a reference to the original musical object on the Web.

Conclusions

This report presents an approach to the detection of musical features based on the use of feature grammars, as developed in the ACOI framework, to describe the structural properties of musical data.

The goal of this work is to support the user in finding a musical piece of their liking, by lyrics, genre, musical instruments, tempo, similarity to other pieces, melody and mood.

At this stage we have a prototype for the extraction of relatively simple features from a MIDI file, which uses an embedded logic component to extract content-related properties of the data.

The next step in our research will consist of creating suitable ways of querying the musical database. We will have to explore how to present a possibly large collection of matches, and how to assist the end user in refining a query so as to obtain the desired result.

The greatest effort, however, will be to arrive at a matching scheme that allows the retrieval of musical information from a large database. Looking at the literature, in particular [Compare], we found suitable dynamic programming algorithms that may be used to detect similarities in melodic and rhythmic structure. However, due to the computational complexity of these algorithms, actual search in a large database will be prohibitively expensive, unless some compact representation of the original musical material can be devised that restricts the matching process to what may be regarded as a minimal invariant abstraction of the original piece of music. An alternative solution would be to create additional indexes based on, for example, the distribution of instrument usage, intervals, and note durations, which may augment the matching process by acting as an extra filter.
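For illustration, a dynamic programming comparison of two melodies could take the form of an edit distance over their pitch or interval sequences, as sketched below. This is a minimal sketch under our own assumptions, not the algorithm of [Compare]; the memo table is kept with assertz for brevity.

  :- use_module(library(lists)).
  :- dynamic memo/3.

  % edit_distance(+Seq1, +Seq2, -Cost): minimal number of insertions,
  % deletions and substitutions needed to turn Seq1 into Seq2.
  edit_distance(S1, S2, Cost) :-
      retractall(memo(_, _, _)),
      dist(S1, S2, Cost).

  dist(S1, S2, Cost) :- memo(S1, S2, Cost), !.
  dist([], S, Cost) :- !, length(S, Cost).
  dist(S, [], Cost) :- !, length(S, Cost).
  dist([X|Xs], [Y|Ys], Cost) :-
      dist(Xs, Ys, CSub),        % match or substitute X by Y
      dist([X|Xs], Ys, CIns),    % insert Y
      dist(Xs, [Y|Ys], CDel),    % delete X
      ( X == Y -> Sub = 0 ; Sub = 1 ),
      C1 is CSub + Sub, C2 is CIns + 1, C3 is CDel + 1,
      min_list([C1, C2, C3], Cost),
      assertz(memo([X|Xs], [Y|Ys], Cost)).

  % Comparing interval sequences rather than absolute pitches makes the
  % measure transposition invariant, e.g.
  % ?- edit_distance([2,2,-2,0,3], [2,2,-1,0,3], D).   gives D = 1.

A compact abstraction as mentioned above could, for instance, be the contour sequence of the melody, which reduces the alphabet over which the sequences are compared.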