Indeed, if we are looking for a general characterization, it
would be that MPEG-4 is primarily a standard for the efficient
encoding and scalable delivery of multimedia content,
and, moreover, one that is suitable for a variety of
display devices and networks, including low bitrate
mobile networks.
MPEG-4 supports scalability on a variety of levels.
Let's give an example, taken from the MPEG-4 standard
document.
example
Imagine a talking figure standing next to a desk
and a projection screen, explaining the contents of
a video that is being projected
on the screen, pointing at a globe that stands on the desk.
The user watching that scene decides to
change viewpoint to get a better look at the globe ...
How would you describe such a scene?
How would you encode it?
And how would you approach decoding
and user interaction?
The solution lies in defining media objects
and a suitable notion of composition
of media objects.
media objects
- media objects -- units of aural, visual or audiovisual content
- composition -- to create compound media objects (audiovisual scene)
- transport -- multiplex and synchronize data associated with media objects
- interaction -- feedback from users' interaction with audiovisual scene
For 3D-scene description, MPEG-4 builds on concepts
taken from VRML (Virtual Reality Modeling Language,
discussed in chapter 7).
Composition, basically, amounts to building
a scene graph, that is
a tree-like structure that specifies the relationship between
the various simple and compound media objects.
Composition allows for
placing media objects anywhere in a given coordinate system,
applying transforms to change the appearance of a media object,
applying streamed data to media objects, and
modifying the user's viewpoint.
composition
- placing media objects anywhere in a given coordinate system
- applying transforms to change the appearance of a media object
- applying streamed data to media objects
- modifying the user's viewpoint
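As a rough sketch of such a compound media object, in X3D-style XML syntax (illustrative only, not actual MPEG-4/BIFS syntax; the file names are made up), a video screen and a spoken commentary could be grouped under one spatial transform:
<Transform translation="0 1 2">
  <Shape>
    <Appearance>
      <MovieTexture url="presentation.mpg"/>
    </Appearance>
    <Box size="4 3 0.1"/>
  </Shape>
  <Sound>
    <AudioClip url="commentary.mp3"/>
  </Sound>
</Transform>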
So, when we have a multimedia presentation or
audiovisual scene, we
need to get it across some network and deliver it
to the end-user, or as phrased in [MPEG-4]:
transport
The data streams (Elementary Streams)
that result from the coding process can be transmitted
or stored separately and need
to be composed so as to create the actual
multimedia presentation at the receiver's side.
At a system level, MPEG-4 offers the following
functionalities to achieve this:
scenegraph
- BIFS (Binary Format for Scenes) -- describes spatio-temporal arrangements of (media) objects in the scene
- OD (Object Descriptor) -- defines the relationship between the elementary streams associated with an object
- event routing -- to handle user interaction

In addition, MPEG-4 defines a set of functionalities
for the delivery of streamed data, DMIF, which stands for
DMIF
Delivery Multimedia Integration Framework
that allows for transparent interaction with resources,
irrespective of whether these are available from local
storage, come from broadcast, or must be obtained from
some remote site.
Also, transparency with respect to network type is
supported.
Quality of Service is only supported to the
extent that it is possible to indicate needs for
bandwidth and transmission rate.
It is, however, the responsibility of the network provider to
realize any of this.
[ figure: (a) scene graph -- (b) sprites ]

authoring
What MPEG-4 offers may be summarized as follows:
benefits
- end-users -- interactive media across all platforms and networks
- providers -- transparent information for transport optimization
- authors -- reusable content, protection and flexibility
In effect, although MPEG-4 is primarily concerned
with efficient encoding
and scalable transport and delivery,
the object-based approach also has clear
advantages from an authoring perspective.
One advantage is the possibility of reuse.
For example, one and the same background can be reused
for multiple presentations or plays,
so you could imagine that even an amateur game
might be 'located' at the centre-court of Roland Garros or
Wimbledon.
Another, perhaps not so obvious, advantage
is that provisions have been made for
managing intellectual property
of media objects.
And finally, media objects may potentially be
annotated with meta-information to facilitate
information retrieval.

syntax
In addition to the binary formats, MPEG-4 also specifies a syntactical
format, called XMT, which stands for
eXtensible MPEG-4 Textual format.
XMT
- XMT contains a subset of X3D
- SMIL is mapped (incompletely) to XMT
When discussing RM3D, which is of interest from a
historic perspective, we will further establish
what the relations between MPEG-4,
SMIL and RM3D are,
and in particular where there is disagreement,
for example with respect to the timing model
underlying animations and the temporal control of
media objects.

example(s) -- structured audio
The Machine Listening Group
of the MIT Media Lab
is developing a suite of tools for structured audio,
which means transmitting sound by describing
it rather than compressing it.
It is claimed that tools based on the MPEG-4 standard will be the future platform for computer music, audio for gaming, streaming Internet radio, and other multimedia applications.
The structured audio project is part of a more encompassing research effort of the Music, Mind and Machine Group of the MIT Media Lab, which
envisages a new future of audio technologies and interactive applications that will change the way music is conceived, created, transmitted and experienced.
SMIL
SMIL is pronounced as smile.
SMIL, the Synchronized Multimedia Integration Language,
has been inspired by the Amsterdam Hypermedia Model (AHM).
In fact, the Dutch research group at CWI that developed the AHM
actively participated in the SMIL 1.0 committee.
Moreover, they have started a commercial spinoff
to create an editor for SMIL, based on the editor
they developed for CMIF.
The name of the editor is GRINS. Get it?
As indicated before, SMIL is intended to be used for
SMIL
TV-like multimedia presentations
The SMIL language is an XML application, resembling HTML.
SMIL presentations can be written using a simple text-editor
or any of the more advanced tools, such as GRINS.
There is a variety of SMIL players.
The most well-known perhaps is the RealNetworks G2 player,
which allows for incorporating RealAudio and RealVideo
in SMIL presentations.
parallel and sequential
Authoring a SMIL presentation comes down, basically, to
naming media components for text, images, audio and video with URLs, and scheduling their presentation either in parallel or in sequence.
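As a minimal sketch of this style of authoring (the file names are made up), an audio commentary could play in parallel with a timed sequence of images:
<par>
  <audio src="commentary.mp3"/>
  <seq>
    <img src="slide1.jpg" dur="10s"/>
    <img src="slide2.jpg" dur="10s"/>
  </seq>
</par>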
Quoting the SMIL 2.0 working draft, we can characterize
SMIL presentations as follows:
presentation characteristics
- The presentation is composed from several components that are accessible via URL's, e.g. files stored on a Web server.
- The components have different media types, such as audio, video, image or text. The begin and end times of different components are specified relative to events in other media components. For example, in a slide show, a particular slide is displayed when the narrator in the audio starts talking about it.
- Familiar looking control buttons such as stop, fast-forward and rewind allow the user to interrupt the presentation and to move forwards or backwards to another point in the presentation.
- Additional functions are "random access", i.e. the presentation can be started anywhere, and "slow motion", i.e. the presentation is played slower than at its original speed.
- The user can follow hyperlinks embedded in the presentation.

Where HTML has become successful as a means to write simple hypertext
content,
the SMIL language is meant to become a vehicle of choice
for writing synchronized hypermedia.
The working draft mentions a number of possible applications,
for example a photo album with spoken comments,
multimedia training courses, product demos with explanatory
text, timed slide presentations, online music with controls.
applications
- Photos taken with a digital camera can be coordinated with a commentary
- Training courses can be devised integrating voice and images.
- A Web site showing the items for sale, might show photos of the product range in turn on the screen, coupled with a voice talking about each as it appears.
- Slide presentations on the Web written in HTML might be timed so that bullet points come up in sequence at specified time intervals, changing color as they become the focus of attention.
- On-screen controls might be used to stop and start music.

As an example, let's consider an interactive news bulletin,
where you have a choice between viewing
a weather report or listening to some story about,
for example, the decline of another technology stock.
Here is how that could be written in SMIL:
example
<par>
  <a href="#Story"> <img src="button1.jpg"/> </a>
  <a href="#Weather"> <img src="button2.jpg"/> </a>
  <excl>
    <par id="Story" begin="0s">
      <video src="video1.mpg"/>
      <text src="captions.html"/>
    </par>
    <par id="Weather">
      <img src="weather.jpg"/>
      <audio src="weather-rpt.mp3"/>
    </par>
  </excl>
</par>

Notice that there are three parallel (PAR)
tags, and one exclusive (EXCL) tag.
The exclusive tag has been introduced in SMIL 2.0
to allow for making an exclusive choice, so that only
one of the items can be selected at a particular time.
The SMIL 2.0 working draft defines a number of elements
and attributes to control presentation, synchronization
and interactivity, extending the functionality of SMIL 1.0.
Before discussing how the functionality proposed
in the SMIL 2.0 working draft may be realized,
we might reflect on how to position SMIL
with respect to the many other approaches to
provide multimedia on the web.
As other approaches we may think of Flash,
dynamic HTML (using Javascript), or Java applets.
In the SMIL 2.0 working draft we read the following comment:
history
Experience from both the CD-ROM community and from the Web multimedia community suggested that it would be beneficial to adopt a declarative format for expressing media synchronization on the Web as an alternative and complementary approach to scripting languages.
Following a workshop in October 1996, W3C established a first working group on synchronized multimedia in March 1997. This group focused on the design of a declarative language and the work gave rise to SMIL 1.0 becoming a W3C Recommendation in June 1998.
In summary,
SMIL 2.0 proposes a declarative format
to describe the temporal behavior of a multimedia presentation,
associate hyperlinks with media objects, describe the form of the
presentation on a screen, and specify interactivity
in multimedia presentations.
Now, why such a fuss about "declarative format"?
Isn't scripting more exciting?
And aren't the tools more powerful?
Ok, ok. I don't want to go into that right now.
Let's just consider a declarative format
to be more elegant. Ok?
To support the functionality proposed for SMIL 2.0
the working draft lists a number of modules
that specify the interfaces for accessing the attributes
of the various elements.
SMIL 2.0 offers modules for animation,
content control, layout, linking, media objects, meta information,
timing and synchronization, and transition effects.
SMIL 2.0 Modules
- The Animation Modules
- The Content Control Modules
- The Layout Modules
- The Linking Modules
- The Media Object Modules
- The Metainformation Module
- The Structure Module
- The Timing and Synchronization Module
- The Time Manipulations Module
- The Transition Effects Module

This modular approach allows the
reuse of SMIL syntax and semantics in other XML-based languages, in particular those that need to represent timing and synchronization. For example:
module-based reuse
- SMIL modules could be used to provide lightweight multimedia functionality on mobile phones, and to integrate timing into profiles such as the WAP forum's WML language, or XHTML Basic.
- SMIL timing, content control, and media objects could be used to coordinate broadcast and Web content in an enhanced-TV application.
- SMIL Animation is being used to integrate animation into W3C's Scalable Vector Graphics language (SVG).
- Several SMIL modules are being considered as part of a textual representation for MPEG4.

The SMIL 2.0 working draft is at the moment of writing
being finalized.
It specifies a number of language profiles
to promote the reuse of SMIL modules.
It also improves on the accessibility features of SMIL 1.0,
which allow,
for example, replacing captions by audio descriptions.
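As a rough sketch (the systemCaptions and systemAudioDesc test attributes belong to SMIL 2.0 content control; the file names are made up), such a substitution could be expressed with a switch that picks the first acceptable alternative:
<switch>
  <audio src="audio-description.mp3" systemAudioDesc="on"/>
  <textstream src="captions.rt" systemCaptions="on"/>
</switch>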
In conclusion,
SMIL 2.0 is an interesting standard, for a number of reasons.
For one, SMIL 2.0 has solid theoretical underpinnings
in a well-understood, partly formalized, hypermedia model (AHM).
Secondly,
it proposes interesting functionality, with which
authors can make nice applications.
In the third place, it specifies a high level
declarative format, which is both expressive and flexible.
And finally, it is an open standard
(as opposed to a proprietary standard).
So everybody can join in and produce players for it!

RM3D -- not a standard
The web started with simple HTML hypertext pages.
After some time static images were allowed.
Now, there is support for all kinds of user interaction,
embedded multimedia and even synchronized hypermedia.
But despite all the graphics and fancy animations,
everything remains flat.
Perhaps surprisingly, the need for a 3D web standard arose
in the early days of the web.
In 1994, the acronym VRML was coined by Tim Berners-Lee,
to stand for Virtual Reality Markup Language.
But, since 3D on the web is not about text but more
about worlds, VRML came to stand for
Virtual Reality Modeling Language.
Since 1994, a lot of progress has been made.
www.web3d.org
- VRML 1.0 -- static 3D worlds
- VRML 2.0 or VRML97 -- dynamic behaviors
- VRML200x -- extensions
- X3D -- XML syntax
- RM3D -- Rich Media in 3D
In 1997, VRML2 was accepted as a standard, offering rich means
to create 3D worlds with dynamic behavior and user interaction.
VRML97 (which is the same as VRML2) was, however, not the success
it was expected to be, due to (among other things)
incompatibility between browsers,
incomplete implementations of the standard,
and high performance requirements.
As a consequence, the Web3D Consortium (formerly the VRML Consortium)
broadened its focus, and started thinking about
extensions or modifications of VRML97 and an XML version of
VRML (X3D).
Some among the X3D working group felt the need to rethink
the premises underlying VRML and started
the Rich Media Working Group:
groups.yahoo.com/group/rm3d/
The Web3D Rich Media Working Group was formed to develop a Rich Media standard format (RM3D) for use in next-generation media devices. It is a highly active group with participants from a broad range of companies including 3Dlabs, ATI, Eyematic, OpenWorlds, Out of the Blue Design, Shout Interactive, Sony, Uma, and others.
In particular:
RM3D
The Web3D Consortium initiative is fueled by a clear need for a standard high performance Rich Media format. Bringing together content creators with successful graphics hardware and software experts to define RM3D will ensure that the new standard addresses authoring and delivery of a new breed of interactive applications.
The working group is active in a number of areas including,
for example, multitexturing and the integration of video
and other streaming media in 3D worlds.
Among the driving forces in the RM3D group
are Chris Marrin and Richter Rafey, both from Sony,
who proposed Blendo, a rich media extension
of VRML.
Blendo has a strongly typed object model,
which is much more strictly defined than the VRML object model,
to support both declarative and programmatic extensions.
It is interesting to note that the premise underlying the
Blendo proposal confirms (again) the primacy of the TV metaphor.
That is to say, what Blendo intends to support
are TV-like presentations which allow for user
interaction such as the selection of items or playing a game.
Target platforms for Blendo include graphics PCs, set-top boxes,
and the Sony PlayStation!

requirements
The focus of the RM3D working group is not syntax
(as it is primarily for the X3D working group)
but semantics,
that is, to enhance the VRML97 standard to effectively
incorporate rich media.
Let's look in more detail at the requirements as
specified in the RM3D draft proposal.
requirements
- rich media -- audio, video, images, 2D & 3D graphics
(with support for temporal behavior, streaming and synchronisation)
- applicability -- specific application areas, as determined by
commercial needs and experience of working group members
The RM3D group aims at interoperability with other
standards.
- interoperability -- VRML97, X3D, MPEG-4, XML (DOM access)
In particular, an XML syntax is being defined in parallel
(including interfaces for the DOM).
And, there is mutual interest and exchange of ideas between the
MPEG-4 and RM3D working group.
As mentioned before, the RM3D working group has a strong
focus on defining an object model
(that acts as a common model for the representation of
objects and their capabilities) and suitable
mechanisms for extensibility
(allowing for the integration of new objects defined in Java or
C++, and associated scripting primitives and declarative
constructs).
- object model -- common model for representation of objects and capabilities
- extensibility -- integration of new objects (defined in Java or C++), scripting capabilities and declarative content
Notice that extensibility also requires the definition of
a declarative format, so that the content author need
not bother with programmatic issues.
The RM3D proposal should result in effective
3D media presentations.
So as additional requirements we may,
following the working draft, mention:
high-quality realtime rendering, for realtime interactive
media experiences;
platform adaptability, with query functions for programmatic
behavior selection;
predictable behavior, that is a well-defined order of execution;
a high precision number system, greater than single-precision IEEE
floating point numbers; and
minimal size, that is both download size and memory footprint.
- high-quality realtime rendering -- realtime interactive media experiences
- platform adaptability -- query function for programmatic behavior selection
- predictable behavior -- well-defined order of execution
- high precision number systems -- greater than single-precision IEEE floating point numbers
- minimal size -- download and memory footprint
Now, one may be tempted to ask how the RM3D proposal
is related to the other standard proposals
such as MPEG-4 and SMIL, discussed previously.
Briefly put, paraphrased from one of Chris Marrin's
messages on the RM3D mailing list:
SMIL is closer to the author
and RM3D is closer to the implementer.
MPEG-4, in this respect, is even further away from the
author since its chief focus is on compression
and delivery across a network.
RM3D takes 3D scene description as a starting point
and looks at pragmatic ways to integrate rich media.
Since 3D is itself already computationally intensive,
there are many issues that arise in finding
efficient implementations for the proposed solutions.

timing model
RM3D provides a declarative format for many
interesting features, such as for example texturing objects
with video.
In comparison to VRML, RM3D is meant to provide more temporal
control over time-based media objects and animations.
However, there is strong disagreement among the working
group members as to what time model the dynamic capabilities
of RM3D should be based on.
As we read in the working draft:
working draft
Since there are three vastly different proposals for this section (time model), the original <RM3D> 97 text
is kept. Once the issues concerning time-dependent nodes are resolved, this section can be
modified appropriately.
Now, what are the options?
Each of the standards discussed so far
provides us with a particular solution to timing.
Summarizing, we have a time model based on a spring metaphor in MPEG-4,
the notion of cascading time in SMIL (inspired by
cascading stylesheets for HTML) and timing based on the
routing of events in RM3D/VRML.
time model
- MPEG-4 -- spring metaphor
- SMIL -- cascading time
- RM3D/VRML -- event routing
The MPEG-4 standard introduces the spring metaphor
for dealing with temporal layout.
MPEG-4 -- spring metaphor
- duration -- minimal, maximal, optimal
The spring metaphor amounts to the ability
to shrink or stretch a media object within given bounds
(minimum, maximum)
to cope with, for example, network delays.
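Purely as a hypothetical sketch to convey the idea (the minDur and maxDur attributes are invented here for illustration and are not actual MPEG-4 syntax), a video could be declared with an optimal duration and the bounds within which it may be stretched or shrunk:
<!-- the player may stretch or shrink playback between 8s and 12s -->
<video src="clip.mpg" dur="10s" minDur="8s" maxDur="12s"/>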
The SMIL standard is based on a model
that allows for propagating durations and time manipulations
in a hierarchy of media elements.
Therefore it may be referred to as a
cascading model of time.
SMIL -- cascading time
- time container -- speed, accelerate, decelerate, reverse, synchronize
Media objects, in SMIL, are stored in some sort of container
whose timing properties can be manipulated.
<seq speed="2.0">
  <video src="movie1.mpg" dur="10s"/>
  <video src="movie2.mpg" dur="10s"/>
  <img src="img1.jpg" begin="2s" dur="10s">
    <animateMotion from="-100,0" to="0,0" dur="10s"/>
  </img>
  <video src="movie4.mpg" dur="10s"/>
</seq>
In the example above, we see that the speed is set to 2.0,
which will affect the pacing of each of the individual
media elements belonging to that (sequential) group.
The duration of each of the elements is specified
in relation to the parent container.
In addition, SMIL offers the possibility to
synchronize media objects to control, for example,
the end time of parallel media objects.
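For example, with the endsync attribute from the SMIL 2.0 timing and synchronization module (file names made up), a parallel group can be made to end as soon as one designated child ends:
<par endsync="vid">
  <video id="vid" src="movie.mpg"/>
  <audio src="background.mp3" repeatCount="indefinite"/>
</par>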
VRML97's capabilities for timing
rely primarily on the existence of a
TimeSensor that sends out time events
that may be routed to other objects.
RM3D/VRML -- event routing
- TimeSensor -- isActive, start, end, cycleTime, fraction, loop
When a TimeSensor starts to emit time events,
it also sends out an event notifying other objects
that it has become active.
Dependent on its so-called cycleTime,
it sends out the fraction it has covered
since it started.
This fraction may be sent to one of the standard
interpolators or a script so that some value can be set,
such as for example the orientation,
dependent on the fraction of the time interval that has passed.
When the TimeSensor is made to loop,
this is done repeatedly.
Although time in VRML is absolute,
the frequency with which fraction events are emitted depends
on the implementation and processor speed.
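As a minimal sketch of this event routing, in the X3D XML encoding (node and field names as in VRML97/X3D; the DEF names are made up), a TimeSensor drives an OrientationInterpolator which in turn rotates a Transform:
<TimeSensor DEF="Clock" cycleInterval="5" loop="true"/>
<OrientationInterpolator DEF="Spin"
    key="0 0.5 1"
    keyValue="0 1 0 0  0 1 0 3.14159  0 1 0 6.28318"/>
<Transform DEF="Globe">
  <Shape>
    <Sphere/>
  </Shape>
</Transform>
<ROUTE fromNode="Clock" fromField="fraction_changed" toNode="Spin" toField="set_fraction"/>
<ROUTE fromNode="Spin" fromField="value_changed" toNode="Globe" toField="set_rotation"/>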
Lacking consensus about a better model,
this model has provisionally been adopted,
with some modifications, for RM3D.
Nevertheless, the SMIL cascading time model
has raised interest in the RM3D working group,
to the extent that Chris Marrin remarked (in the mailing list)
"we could go to school here".
One possibility for RM3D would be to
introduce time containers
that allow for a temporal transform of
their children nodes,
in a similar way as grouping containers
allow for spatial transforms of
their children nodes.
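Purely as a hypothetical illustration (the TimeGroup node and its speed field are invented here and do not occur in any RM3D draft), such a temporal transform could mirror the familiar spatial one:
<!-- spatial grouping: Transform applies a spatial transform to its children -->
<Transform translation="0 1 0">
  <!-- hypothetical temporal grouping: TimeGroup would scale the timing of its children -->
  <TimeGroup speed="2.0">
    <TimeSensor DEF="Clock" cycleInterval="10" loop="true"/>
  </TimeGroup>
</Transform>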
However,
that would amount to a dual hierarchy,
one to control (spatial) rendering
and one to control temporal characteristics.
Merging the two hierarchies,
as is (implicitly) the case in SMIL,
might not be such a good idea,
since the rendering and timing semantics of
the objects involved might be radically different.
An interesting problem, indeed,
but there seems to be no easy solution.

example(s) -- rich internet applications
In a seminar held by Lost Boys,
which is a Dutch subdivision of
Icon Media Lab,
rich internet applications (RIA) were
presented as the new solution for presenting
applications on the web.
As indicated by
Macromedia, which is one of the leading
companies in this field,
experience matters,
and so plain HTML pages do not suffice since they
require the user to move from one page to another
in a quite unintuitive fashion.
Macromedia presents its new line of Flash-based products
to create such rich internet applications.
An alternative solution, based on general W3C recommendations,
is proposed by
BackBase.
Interestingly enough, using either technology, many of
the participants of the seminar indicated a strong preference
for a back button, with similar functionality to the
back button commonly found in web browsers.
research directions -- meta standards
All these standards!
Wouldn't it be nice to have one single standard
that encompasses them all?
No, it would not!
Simply because such a standard is inconceivable,
unless you take some proprietary standard or a particular
platform as the de facto standard
(which is the way some people look at the Microsoft win32
platform, ignoring the differences between 95/98/NT/2000/XP/...).
In fact, there is a standard that acts as a glue between
the various standards for multimedia, namely XML.
XML allows for the interchange of data between various
multimedia applications, that is, the transformation of one encoding
into another.
But this is only syntax.
What about the semantics?
With regard to both delivery and presentation,
the MPEG-4 proposal makes an attempt to delineate
chunks of core functionality that may be shared between applications.
With regard to presentation, SMIL may serve as an example.
SMIL applications themselves already (re)use
functionality from the basic set of XML-related
technologies,
for example to access the document structure through
the DOM (Document Object Model).
In addition, SMIL defines components that it may potentially share
with other applications.
For example, SMIL shares its animation facilities
with SVG (the Scalable Vector Graphics format recommended
by the Web Consortium).
The issue in sharing is, obviously, how to relate
constructs in the syntax to their operational support.
When it is possible to define a common base
of operational support for a variety of multimedia applications
we would approach our desired meta standard, it seems.
A partial solution to this problem has
been proposed in the now almost forgotten HyTime
standard for time-based hypermedia.
HyTime introduces the notion
of architectural forms
as a means to express the operational support needed
for the interpretation of particular encodings,
such as for example synchronization or
navigation over bi-directional links.
Apart from a base module, HyTime compliant architectures
may include a units measurement module,
a module for dealing with location addresses,
a module to support hyperlinks, a scheduling module
and a rendition module.
To conclude, wouldn't it be wonderful if, for example,
animation support could be shared between
rich media X3D
and SMIL?
Yes, it would!
But as you may remember from the discussion on the timing
models used by the various standards, there is still
too much divergence to make this a realistic option.