topical media & game development


...



1

video

Automatic content description is no doubt much harder for video than for any other media type. Given the current state of the art, it is not realistic to expect content description by feature extraction to be feasible for video. Therefore, to realize content-based search for video, we have to rely on some knowledge representation scheme that may adequately describe the (dynamic) properties of video fragments.

In fact, the description of video content may reflect the story-board, which after all is intended to capture both time-independent and dynamically changing properties of the objects (and persons) that play a role in the video.

In developing a suitable annotation for a particular video fragment, two questions need to be answered:

video annotation

  • which aspects of the video fragment are of interest?
  • how do we represent these aspects?

Which aspects are of interest is something you have to decide for yourself. Let's see whether we can define a suitable knowledge representation scheme.

One possible knowledge representation scheme for annotating video content is proposed in  [MMDBMS]. The scheme proposed has been inspired by knowledge representation techniques in Artificial Intelligence. It captures both static and dynamic properties.

video content



  video v, frame f 
  f has associated objects and activities 
  objects and activities have properties
  
First of all, we must be able to talk about a particular video fragment v, and a frame f that occurs in it. Each frame may contain objects that play a role in some activity. Both objects and activities may have properties, that is, attributes that have some value.

property


  property: name = value 
  
As we will see in the examples, properties may also be characterized using predicates.

Some properties depend on the actual frame the object is in. Other properties (for example, sex and age) are not likely to change and may be considered frame-independent.

object schema


   (fd,fi) -- frame-dependent and frame-independent properties 
  
Finally, in order to identify objects we need an object identifier for each object. Summing up, for each object in a video fragment we can define an object instance that characterizes both the frame-independent and frame-dependent properties of the object.

object instance: (oid,os,ip)

Now, with a collection of object instances we can characterize the contents of an entire video fragment, by identifying the frame-dependent and frame-independent properties of the objects.
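To make these notions a bit more tangible, here is a minimal sketch in Python. The class and attribute names are my own, not notation from  [MMDBMS]; the sketch merely renders the (oid,os,ip) structure above in executable form.

  # sketch of an object instance (oid, os, ip); names are illustrative
  class ObjectInstance:
      """An object with frame-independent (fi) and frame-dependent (fd)
      properties; together they play the role of os and ip above."""

      def __init__(self, oid, fi, fd):
          self.oid = oid   # object identifier
          self.fi = fi     # {name: value}, e.g. {'age': 35}
          self.fd = fd     # {frame: {name: value}}, e.g. {1: {'at': 'path'}}

      def properties(self, frame):
          """All properties that hold in a given frame: the frame-independent
          ones merged with those recorded for that particular frame."""
          merged = dict(self.fi)
          merged.update(self.fd.get(frame, {}))
          return merged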

Look at the following example, borrowed from  [MMDBMS] for the Amsterdam Drugport scenario.

example


  frame   object      frame-dependent properties
  1       Jane        has(briefcase), at(path)
  -       house       door(closed)
  -       briefcase
  2       Jane        has(briefcase), at(door)
  -       Dennis      at(door)
  -       house       door(open)
  -       briefcase

In the first frame Jane is near the house, at the path that leads to the door. The door is closed. In the next frame, the door is open. Jane is at the door, holding a briefcase. Dennis is also at the door. What will happen next?

Observe that we are using predicates to represent the state of affairs. We do this simply because the predicate form has(briefcase) looks more natural than the alternative, which would be has = briefcase. There is no essential difference between the two forms.

Now, to complete our description we can simply list the frame-independent properties, as illustrated below.

frame-independent properties


  object      frame-independent property    value
  Jane        age                           35
              height                        170cm
  house       address                       ...
              color                         brown
  briefcase   color                         black
              size                          40 x 31

How to go from the tabular format to sets of statements that comprise the object schemas is left as an (easy) exercise for the student.
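As a start with that exercise, the first rows of both tables might be encoded as follows, continuing the ObjectInstance sketch given earlier (the encoding, again, is illustrative only):

  # Jane and the house, encoded with the ObjectInstance sketch above
  jane = ObjectInstance(
      oid='Jane',
      fi={'age': 35, 'height': '170cm'},
      fd={1: {'has': 'briefcase', 'at': 'path'},
          2: {'has': 'briefcase', 'at': 'door'}},
  )
  house = ObjectInstance(
      oid='house',
      fi={'color': 'brown'},
      fd={1: {'door': 'closed'}, 2: {'door': 'open'}},
  )

  print(jane.properties(2))
  # {'age': 35, 'height': '170cm', 'has': 'briefcase', 'at': 'door'}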

Let's go back to our Amsterdam Drugport scenario and see what this information might do for us in finding possible suspects. Based on the information given in the example, we can determine that there is a person with a briefcase, and another person to whom that briefcase may possibly be handed. Whether this is the case or not should be disclosed in frame 3. Now, what we are actually looking for is the possible exchange of a briefcase, which may indicate a drug transaction. So why not, following  [MMDBMS], introduce another, somewhat more abstract level of description that deals with activities.

activity

  • activity name -- id
  • statements -- role = v

An activity has a name, and further consists simply of a set of statements describing the roles that take part in the activity.

example


   { giver : Person, receiver : Person, item : Object } 
   giver = Jane, receiver = Dennis, item = briefcase 
  
For example, an exchange activity may be characterized by identifying the giver, receiver and object roles. So, instead of looking for persons and objects in a video fragment, you'd better look for activities that may have taken place, by finding a matching set of objects for the particular roles of an activity. Consult  [MMDBMS] if you are interested in a further formalization of these notions.
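To illustrate, a small Python sketch of such role matching; the typeof table and the brute-force matching strategy are my own simplifications, not the formalization of  [MMDBMS]:

  from itertools import permutations

  # an activity: role -> type declarations, as in the example above
  exchange = {'giver': 'Person', 'receiver': 'Person', 'item': 'Object'}

  def match_activity(roles, objects, typeof):
      """Yield every way to fill each role with a distinct object
      of the required type."""
      names = list(roles)
      for picked in permutations(objects, len(names)):
          if all(typeof[o] == roles[r] for r, o in zip(names, picked)):
              yield dict(zip(names, picked))

  typeof = {'Jane': 'Person', 'Dennis': 'Person', 'briefcase': 'Object'}
  for binding in match_activity(exchange, typeof, typeof):
      print(binding)
  # prints both candidate assignments, among them
  # {'giver': 'Jane', 'receiver': 'Dennis', 'item': 'briefcase'}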

...



2

video libraries

Assuming a knowledge representation scheme like the one treated above, how can we support search over a collection of videos or video fragments in a video library?

What we are interested in may roughly be summarized as

video libraries



  which videos are in the library 
  what constitutes the content of each video
  what is the location of a particular video
  
Take note that all the information about the videos or video fragments must be provided as meta-information by a (human) librarian. Just imagine for a moment how laborious and painstaking this must be, and what a relief video feature extraction would be for an operation like Amsterdam Drugport.

To query the collection of video fragments, we need a query language with access to our knowledge representation. It must support a variety of retrieval operations, including the retrieval of segments, objects and activities, and also property-based retrievals as indicated below.

query language for video libraries


  • segment retrievals -- exchange of briefcase
  • object retrievals -- all people in v:[s,e]
  • activity retrieval -- all activities in v:[s,e]
  • property-based -- find all videos with object oid

 [MMDBMS] lists a collection of video functions that may be used to extend SQL into what we may call VideoSQL. Abstractly, VideoSQL may be characterized by the following schema:

VideoSQL



  SELECT -- v:[s,e] 
  FROM -- video:<source> 
  WHERE -- term IN funcall 
  
where v:[s,e] denotes the fragment of video v, starting at frame s and ending at frame e, and term IN funcall denotes one of the video functions giving access to the information about that particular video. As an example, look at the following VideoSQL snippet:

example



  SELECT  vid:[s,e]
  FROM video:VidLib
  WHERE (vid,s,e) IN VideoWithObject(Dennis) AND
  	object IN ObjectsInVideo(vid,s,e) AND
  	object != Dennis AND
  	typeof(object) = Person
  
Notice that, apart from calling video functions, constraints can also be added with respect to the identity and type of the objects involved.
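The following Python sketch mimics that query over a toy library. Only the function names VideoWithObject and ObjectsInVideo are taken from the example above; the data layout and the rest of the code are assumptions, for illustration:

  # toy library: video id -> {frame: set of object ids}
  vidlib = {'v1': {1: {'Jane', 'house', 'briefcase'},
                   2: {'Jane', 'Dennis', 'house', 'briefcase'}}}

  typeof = {'Jane': 'Person', 'Dennis': 'Person',
            'house': 'Object', 'briefcase': 'Object'}

  def VideoWithObject(obj):
      """All fragments (vid, s, e) in which obj occurs."""
      for vid, frames in vidlib.items():
          hits = sorted(f for f, objs in frames.items() if obj in objs)
          if hits:
              yield vid, hits[0], hits[-1]

  def ObjectsInVideo(vid, s, e):
      """All objects occurring in fragment vid:[s,e]."""
      return set().union(*(vidlib[vid].get(f, set())
                           for f in range(s, e + 1)))

  # the WHERE clause of the VideoSQL example, spelled out
  for vid, s, e in VideoWithObject('Dennis'):
      for obj in ObjectsInVideo(vid, s, e):
          if obj != 'Dennis' and typeof[obj] == 'Person':
              print(vid, s, e, obj)   # v1 2 2 Jane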

...



3

example(s) -- video retrieval evaluation

The goal of the TREC conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Since 2003 there has been an independent video track devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. In the TRECVID 2004 workshop, thirty-three teams from Europe, the Americas, Asia, and Australia participated. Check it out!

...



4

research directions -- presentation and context

Let's consider an example. Suppose you have a database with (video) fragments of news and documentary items. How would you give access to that database? And how would you present its contents? Naturally, to answer the first question, you need to provide search facilities. Now, with regard to the second question, for a small database of, say, 100 items, you could present a list of the videos that match the query. But with a database of over 10,000 items this becomes problematic, not to speak of databases with over a million video fragments. For large databases, obviously, you need some way of visualizing the results, so that the user can quickly browse through the candidate set(s) of items.

 [Video] provides an interesting account of how interactive maps may be used to improve search and discovery in a (digital) video library. As they explain in the abstract:

To improve library access, the Informedia Digital Video Library uses automatic processing to derive descriptors for video. A new extension to the video processing extracts geographic references from these descriptors.

The operational library interface shows the geographic entities addressed in a story, highlighting the regions discussed in the video through a map display synchronized with the video display.

So, the idea is to use geographical information (that is somehow available in the video fragments themselves) as an additional descriptor, and to use that information to enhance the presentation of a particular video. For presenting the results of a query, candidate items may be displayed as icons in a particular region on a map, so that the user can make a choice.

Obviously, having such geographical information offers more than enhanced presentation alone:

The map can also serve as a query mechanism, allowing users to search the terabyte library for stories taking place in a selected area of interest.

The approach to extracting descriptors for video fragments is interesting in itself. The two primary sources of information are, respectively, the spoken text and graphic text overlays (which are common in news items to emphasize particular aspects of the news, such as the area where an accident occurred). Both speech recognition and image processing are needed to extract information terms, and in addition natural language processing is needed to do the actual 'geocoding', that is, translating this information into the geographical locations related to the story in the video.

Leaving technical details aside, it will be evident that this approach works because news items can meaningfully be grouped and accessed from a geographical perspective. For this type of information we may search, in other words, with three kinds of questions:

questions


  • what -- content-related
  • when -- position on time-continuum
  • where -- geographic location
and we may, evidently, use the geographic location both as a search criterion and to enhance the presentation of query results.
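As a sketch of how the 'where' question can act as a search criterion, consider the following Python fragment; the stories, coordinates and rectangular region test are invented for illustration and bear no relation to the actual Informedia implementation:

  # toy index of stories by what/when/where
  stories = [
      {'id': 's1', 'what': 'harbour accident', 'when': '2004-05-01',
       'lat': 52.37, 'lon': 4.90},   # Amsterdam
      {'id': 's2', 'what': 'election results', 'when': '2004-05-02',
       'lat': 48.85, 'lon': 2.35},   # Paris
  ]

  def in_region(story, lat_min, lat_max, lon_min, lon_max):
      """True if the story's geocoded location falls inside the
      selected map rectangle."""
      return (lat_min <= story['lat'] <= lat_max and
              lon_min <= story['lon'] <= lon_max)

  # 'where' as search criterion: stories located in the Amsterdam area
  print([s['id'] for s in stories if in_region(s, 52.0, 53.0, 4.5, 5.5)])
  # ['s1']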

mapping information spaces

Now, can we generalize this approach to other types of items as well? More specifically, can we use maps or some spatial layout to display the results of a query in a meaningful way, and so give better access to large databases of multimedia objects? According to  [Atlas], we are very likely able to do so:

More recently, it has been recognized that the process of spatialization -- where a spatial map-like structure is applied to data where no inherent or obvious one does exist -- can provide an interpretable structure to other types of data.

Actually, we are taking up the theme of visualization, again. In  [Atlas] visualizations are presented that (together) may be regarded as an atlas of cyberspace.

atlas of cyberspace


We present a wide range of spatializations that have employed a variety of graphical techniques and visual metaphors so as to provide striking and powerful images that extend from two-dimensional 'maps' to three-dimensional immersive landscapes.

As you may gather from chapter 7 and the afterthoughts, I take a personal interest in the (research) theme of virtual reality interfaces for multimedia information systems. But I am well aware of the difficulties involved. It is an area that is just beginning to be explored!

(C) Æliens 04/09/2009

You may not copy or print any of this material without explicit permission of the author or the publisher. In case of other copyright issues, contact the author.