topical media & game development

1
video
Automatic content description is no doubt much harder
for video than for any other media type.
Given the current state of the art, it is not realistic
to expect that content descriptions for video can be obtained
by automatic feature extraction.
Therefore, to realize content-based search for video,
we have to rely on some knowledge representation scheme
that adequately describes the (dynamic) properties
of video fragments.
In fact, the description of video content may reflect
the story-board, which after all is intended to capture
both time-independent and dynamically changing properties
of the objects (and persons) that play a role in the video.
In developing a suitable annotation for a particular
video fragment, two questions need to be answered:
video annotation
- what are the interesting aspects?
- how do we represent this information?
Which aspects are of interest is something you have to
decide for yourself.
Let's see whether we can define a suitable
knowledge representation scheme.
One possible knowledge representation scheme for annotating
video content is proposed in [MMDBMS].
The scheme proposed has been inspired by knowledge representation
techniques in Artificial Intelligence.
It captures both static and dynamic properties.
video content
video v, frame f
f has associated objects and activities
objects and activities have properties

First of all, we must be able to talk about a particular
video fragment v, and frame f that
occurs in it.
Each frame may contain objects that play a role
in some activity.
Both objects and activities may have properties,
that is, attributes that have some value.
property
property: name = value
As we will see in the examples, properties may
also be characterized using predicates.
Some properties depend on the actual frame the object is in.
Other properties (for example sex and age) are not likely to
change and may be considered to be frame-independent.
object schema
(fd,fi) -- frame-dependent and frame-independent properties
Finally, in order to identify objects we need an
object identifier for each object.
Summing up, for each object in a video fragment
we can define an object instance, which characterizes
both frame-independent and frame-dependent properties
of the object.
object instance: (oid,os,ip)
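To make this notation a bit more concrete, the following
sketch shows one possible way to encode object schemas and
object instances in Python; the class and field names are
merely illustrative and are not part of the scheme proposed
in [MMDBMS].

from dataclasses import dataclass

@dataclass
class ObjectSchema:
    # object schema (fd, fi): the names of the frame-dependent
    # and frame-independent properties of an object
    fd: frozenset
    fi: frozenset

@dataclass
class ObjectInstance:
    # object instance (oid, os, ip)
    oid: str            # object identifier
    os: ObjectSchema    # object schema
    ip: dict            # instantiation of the properties:
                        #   frame-independent: name -> value
                        #   frame-dependent:   (frame, name) -> value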

Now, with a collection of object instances we can
characterize the contents of an entire video fragment,
by identifying the frame-dependent and frame-independent
properties of the objects.
Look at the following example,
borrowed from [MMDBMS] for the Amsterdam Drugport
scenario.
example
frame | objects | frame-dependent properties |
1 | Jane | has(briefcase), at(path) |
- | house | door(closed) |
- | briefcase | |
2 | Jane | has(briefcase), at(door) |
- | Dennis | at(door) |
- | house | door(open) |
- | briefcase | |

In the first frame Jane is near the house,
at the path that leads to the door.
The door is closed.
In the next frame, the door is open.
Jane is at the door, holding a briefcase.
Dennis is also at the door.
What will happen next?
Observe that we are using predicates to represent
the state of affairs.
We do this, simply because the predicate form
has(briefcase) looks more natural
than the other form, which would be has = briefcase.
There is no essential difference between the two forms.
Now, to complete our description we can
simply list the frame-independent properties,
as illustrated below.
frame-independent properties
object | property | value |
Jane | age | 35 |
| height | 170cm |
house | address | ... |
| color | brown |
briefcase | color | black |
| size | 40 x 31 |

How to go from the tabular format to sets of statements
that comprise the object schemas is left as an (easy)
exercise for the student.
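Purely as an illustration of what such statements might look
like (and not as the official answer to the exercise), the Jane
rows of the two tables above could be combined into a single
object instance, using the Python sketch introduced earlier:

jane = ObjectInstance(
    oid="Jane",
    os=ObjectSchema(fd=frozenset({"has", "at"}),
                    fi=frozenset({"age", "height"})),
    ip={
        # frame-independent properties
        "age": 35,
        "height": "170cm",
        # frame-dependent properties, keyed by (frame, property)
        (1, "has"): "briefcase",
        (1, "at"): "path",
        (2, "has"): "briefcase",
        (2, "at"): "door",
    },
)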
Let's go back to our Amsterdam Drugport
scenario and see what this information might do for us,
in finding possible suspects.
Based on the information given in the example,
we can determine that there is a person with a briefcase,
and another person to whom that briefcase may possibly
be handed.
Whether this is the case or not should be disclosed
in frame 3.
Now, what we are actually looking for is the possible exchange
of a briefcase, which may indicate a drug transaction.
So why not, following [MMDBMS],
introduce another, somewhat more abstract level of description
that deals with activities?
activity
- activity name -- id
- statements -- role = v
An activity has a name and further consists
simply of a set of statements
describing the roles that take part
in the activity.
example
{ giver : Person, receiver : Person, item : Object }
giver = Jane, receiver = Dennis, item = briefcase

For example, an exchange activity may be characterized
by identifying the giver,
receiver, and item roles.
So, instead of looking for persons and objects
in a video fragment,
you'd better look for activities
that may have taken place,
by finding a matching set of objects for
the particular roles of an activity.
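As a rough sketch of what such role matching could amount to,
consider the following Python fragment; the function and its
arguments are made up for the occasion and are not taken from
[MMDBMS].

from itertools import permutations

def match_activity(roles, frame_objects):
    # roles:         role name -> required type, e.g. {"giver": "Person"}
    # frame_objects: object id -> type, for the objects present in a frame
    # returns all bindings {role: object id} that respect the role types
    names = list(roles)
    bindings = []
    for objs in permutations(frame_objects, len(names)):
        binding = dict(zip(names, objs))
        if all(frame_objects[o] == roles[r] for r, o in binding.items()):
            bindings.append(binding)
    return bindings

Applied to the roles { giver : Person, receiver : Person, item : Object }
and the objects Jane, Dennis and briefcase of frame 2, this yields
the binding giver = Jane, receiver = Dennis, item = briefcase
(as well as the assignment with giver and receiver swapped,
which only a later frame can rule out).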
Consult [MMDBMS] if you are interested in a further
formalization of these notions.

2
video libraries
Assuming a knowledge representation scheme like the one
treated above, how can we support search over a collection
of videos or video fragments in a video library?
What we are interested in may roughly be
summarized as
video libraries
which videos are in the library
what constitutes the content of each video
what is the location of a particular video
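A catalogue record along these lines might, purely as an
illustration, be encoded as follows; the field names are
hypothetical.

from dataclasses import dataclass

@dataclass
class VideoRecord:
    vid: str         # identifies the video in the library
    content: list    # its content description (object instances, activities)
    location: str    # where the actual video data resides (path or URL)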

Take note that all the information about
the videos or video fragments must be provided
as meta-information by a (human) librarian.
Just imagine for a moment how laborious and painstaking this
must be,
and what a relief video feature extraction would be
for an operation like Amsterdam Drugport.
To query the collection of video fragments, we need
a query language with access to our knowledge representation.
It must support a variety of retrieval operations,
including the retrieval of segments, objects and activities,
and also property-based retrievals as indicated below.
query language for video libraries
- segment retrievals -- exchange of briefcase
- object retrievals -- all people in v:[s,e]
- activity retrieval -- all activities in v:[s,e]
- property-based -- find all videos with object oid

[MMDBMS] lists a collection of video functions
that may be used to extend SQL into what we may call
VideoSQL.
Abstractly, VideoSQL may be characterized by the following schema:
VideoSQL
SELECT -- v:[s,e]
FROM -- video:<source><V>
WHERE -- term IN funcall

where v:[s,e] denotes the fragment of video v,
starting at frame s and ending at frame e,
and term IN funcall
checks term against the result of
one of the video functions giving access to
the information about that particular video.
As an example, look at the following VideoSQL snippet:
example
SELECT vid:[s,e]
FROM video:VidLib
WHERE (vid,s,e) IN VideoWithObject(Dennis) AND
object IN ObjectsInVideo(vid,s,e) AND
object != Dennis AND
typeof(object) = Person

Notice that, apart from calling video functions,
constraints can also be added with respect to
the identity and type
of the objects involved.
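To see what such a query amounts to operationally, here is a
rough Python rendering of the snippet above; VideoWithObject and
ObjectsInVideo are assumed to be available as ordinary functions
over the annotation data, and their signatures are guessed here
purely for the purpose of illustration.

def suspects(video_with_object, objects_in_video, typeof):
    # video_with_object(name)     -> iterable of (vid, s, e) fragments
    # objects_in_video(vid, s, e) -> iterable of object ids in the fragment
    # typeof(obj)                 -> type name, e.g. "Person"
    results = []
    for vid, s, e in video_with_object("Dennis"):
        for obj in objects_in_video(vid, s, e):
            if obj != "Dennis" and typeof(obj) == "Person":
                results.append((vid, s, e, obj))
    return results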

3
example(s) -- video retrieval evaluation
The goal of the
TREC
conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results.
Since 2003 there has been an independent video
track
devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video.
In the TRECVID
2004 workshop, thirty-three teams from Europe, the Americas, Asia, and Australia participated.
Check it out!

4
research directions -- presentation and context
Let's consider an example.
Suppose you have a database with (video) fragments of news
and documentary items.
How would you give access to that database?
And, how would you present its contents?
Naturally, to answer the first question,
you need to provide search facilities.
Now, with regard to the second question,
for a small database of, say, 100 items,
you could present a list of the videos that match the query.
But with a database of over 10,000 items this becomes
problematic, not to mention databases with over a million
video fragments.
For large databases, obviously, you need some way of
visualizing the results, so that the user can quickly browse
through the candidate set(s) of items.
[Video] provides an interesting account of how interactive maps
may be used to improve search and discovery in a (digital) video library.
As they explain in the abstract:
To improve library access, the Informedia
Digital Video Library uses automatic processing to derive
descriptors for video.
A new extension to the video processing extracts geographic
references from these descriptors.
The operational library interface shows the geographic entities addressed in a story, highlighting the regions discussed in the video through a map display synchronized with the video display.

So, the idea is to use geographical information
(that is somehow available in the video fragments themselves)
as an additional descriptor, and to use that information to
enhance the presentation of a particular video.
For presenting the results of a query, candidate items
may be displayed as icons in a particular region on a map,
so that the user can make a choice.
Obviously, having such geographical information available opens up further possibilities:
The map can also serve as a query mechanism, allowing users to search the terabyte library for stories taking place in a selected area of interest.

The approach to extracting descriptors for video fragments is
interesting in itself.
The two primary sources of information are, respectively, the
spoken text and graphic text overlays (which are common in news
items to emphasize particular aspects of the news, such as the
area where an accident occurs).
Both speech recognition and image processing are needed
to extract information terms, and in addition natural language
processing, to do the actual 'geocoding',
that is translating this information to geographical locations
related to the story in the video.
Leaving technical details aside, it will be evident
that this approach works, since news items can meaningfully
be grouped and accessed from a geographical perspective.
For this type of information we may search, in other words,
with three kinds of questions:
questions
- what -- content-related
- when -- position on time-continuum
- where -- geographic location

and we may, evidently, use the geographic location
both as a search criterion and to enhance the
presentation of query results.
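A small sketch of how these three kinds of questions might be
combined in a single query over an annotated collection (the
record fields are, again, only illustrative):

def search(items, what=None, when=None, where=None):
    # items: dicts with 'terms' (set), 'time' (number) and 'location' ((lat, lon))
    # what:  a term that must occur among the item's descriptors
    # when:  a (start, end) interval on the time-continuum
    # where: a (lat_min, lat_max, lon_min, lon_max) bounding box
    hits = []
    for item in items:
        if what is not None and what not in item["terms"]:
            continue
        if when is not None and not (when[0] <= item["time"] <= when[1]):
            continue
        if where is not None:
            lat, lon = item["location"]
            lat_min, lat_max, lon_min, lon_max = where
            if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
                continue
        hits.append(item)
    return hits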
mapping information spaces
Now, can we generalize this approach to other types of items as well?
More specifically, can we use maps or some other spatial layout
to display the results of a query in a meaningful way,
and so give better access to large databases of multimedia
objects?
According to [Atlas], we are very likely able to do so:
More recently, it has been recognized that
the process of spatialization -- where a spatial
map-like structure is applied to data where no inherent
or obvious one does exist -- can provide an interpretable
structure to other types of data.

Actually, we are taking up the theme of visualization, again.
In [Atlas] visualizations are presented that (together)
may be regarded as an atlas of cyberspace.
atlas of cyberspace
We present a wide range of spatializations that have
employed a variety of graphical techniques and visual metaphors
so as to provide striking and powerful images that extend
from two-dimensional 'maps' to three-dimensional immersive landscapes.

As you may gather from chapter 7
and the afterthoughts, I take a personal
interest in the (research) theme of
virtual reality interfaces for multimedia information systems.
But I am well aware of the difficulties involved.
It is an area that is just beginning to be explored!