INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC
JTC1/SC29/WG11 N3747
La Baule October 2000
Source: WG11 (MPEG)
Status: Final
Title: MPEG-4 Overview - (V.16 La Baule Version)
Editor: Rob Koenen (rob@intertrust.com)
Overview of the MPEG-4 Standard
MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the Emmy Award winning standards known as MPEG-1 and MPEG-2. These standards made interactive video on CD-ROM and Digital Television possible.
MPEG-4 is the result of another international effort involving hundreds of researchers and engineers from all over the world.
MPEG-4, whose formal ISO/IEC designation is ISO/IEC 14496, was finalized in October 1998 and became an International Standard in the first months of 1999. The fully backward compatible extensions under the title of MPEG-4 Version 2 were frozen at the end of 1999 and acquired formal International Standard status in early 2000. Some work, on extensions in specific domains, is still in progress.
MPEG-4 builds on the proven success of three fields: digital television, interactive graphics applications (synthetic content) and interactive multimedia (the World Wide Web, distribution of and access to content). MPEG-4 provides the standardized technological elements enabling the integration of the production, distribution and content access paradigms of these three fields.
More information about MPEG-4 can be found at MPEG's home page: http://www.cselt.it/mpeg (the URL is case sensitive). This web page contains links to a wealth of information about MPEG, including much about MPEG-4, many publicly available documents, several lists of Frequently Asked Questions and links to other MPEG-4 web pages.
The standard can be bought from ISO; send mail to sales@iso.ch. Notably, the complete software for MPEG-4 Version 1 can be bought on a CD-ROM for 56 Swiss Francs (approximately 40 US Dollars). This software is free of copyright restrictions when used for implementing MPEG-4 compliant technology. (This does not mean that the software is free of patents; see also Section 7, The MPEG-4 Industry Forum.)
This document gives an overview of the MPEG-4 standard, explaining which
pieces of technology it includes and what
sort of applications are supported by this technology.
The
MPEG-4 standard provides a set of technologies
to satisfy the needs of authors, service providers and end users alike.
For authors, MPEG-4 enables the production of content that has far greater reusability and flexibility than is possible today with individual technologies such as digital television, animated graphics, World Wide Web (WWW) pages and their extensions. Also, it is now possible to better manage and protect content owners' rights.
For network service providers MPEG-4 offers transparent information, which can be interpreted and translated into the appropriate native signaling messages of each network with the help of relevant standards bodies. The foregoing, however, excludes Quality of Service considerations, for which MPEG-4 provides a generic QoS descriptor for different MPEG-4 media. The exact translations from the QoS parameters set for each media to the network QoS are beyond the scope of MPEG-4 and are left to network providers. Signaling of the MPEG-4 media QoS descriptors end-to-end enables transport optimization in heterogeneous networks.
For end users, MPEG-4 brings higher levels of interaction with content, within the limits set by the author. It also brings multimedia to new networks, including those employing relatively low bitrates, and to mobile networks. An MPEG-4 applications document, available on the MPEG home page (www.cselt.it/mpeg), describes many end-user applications, including interactive multimedia broadcast and mobile communications.
For all parties involved, MPEG seeks to avoid a multitude of proprietary, non-interworking formats and players.
MPEG-4 achieves these goals by providing standardized ways to:
1. represent units of aural, visual or audiovisual content, called "media objects".
These media objects can be of natural or synthetic origin; this means they could be recorded with a camera or microphone,
or generated with a computer;
2. describe the composition of these objects to create compound media objects that form audiovisual scenes;
3. multiplex and synchronize the data associated with media objects, so that they can be transported over network channels providing a QoS appropriate for the nature of the specific media objects; and
4. interact with the audiovisual scene generated at the receiver's end.
The following sections illustrate the MPEG-4 functionalities described
above, using the audiovisual scene depicted in Figure 1.
MPEG-4 audiovisual scenes are composed of several media objects, organized in a hierarchical fashion. At the leaves of the hierarchy, we find primitive media objects, such as:
still images (e.g. as a fixed background),
video objects (e.g. a talking person - without the background)
audio objects (e.g. the voice associated with that person);
etc.
MPEG-4 standardizes a number of such primitive media objects, capable of representing both natural and synthetic content types, which can be either 2- or 3-dimensional.
In addition to the media objects mentioned above and shown in Figure 1, MPEG-4 defines the coded representation of objects such as:
text and graphics;
talking synthetic heads and associated text used to synthesize the speech and animate the head;
synthetic sound
A media object in its coded form consists of descriptive elements that allow handling the object in an audiovisual scene, as well as associated streaming data, if needed. It is important to note that in its coded form, each media object can be represented independently of its surroundings or background.
The coded representation of media objects is as efficient as possible
while taking into
account the desired functionalities. Examples of such functionalities are error robustness, easy extraction
and editing
of an object, or having an object available in a scalable form.
Figure 1 explains the way in which an audiovisual scene in MPEG-4 is described as composed of individual objects. The figure contains compound media objects that group primitive media objects together. Primitive media objects correspond to leaves in the descriptive tree while compound media objects encompass entire sub-trees.
As an example: the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound media object, containing both the aural and visual components of that talking person.
Such grouping allows authors to construct complex scenes, and enables
consumers
to manipulate meaningful (sets of) objects.
More generally, MPEG-4 provides a standardized way to describe
a scene,
allowing, for example, to:
place media objects anywhere in a given coordinate system;
apply transforms to change the geometrical or acoustical appearance of a media object;
group primitive media objects in order to form compound media objects;
apply streamed data to media objects, in order to modify their attributes (e.g. a sound, a moving texture belonging to an object; animation parameters driving a synthetic face);
change, interactively, the user's viewing and listening points anywhere in the scene.
The scene description builds on several concepts from the Virtual Reality
Modeling
Language (VRML), in terms of both its structure and the functionality of object composition nodes, and extends it
to fully
enable the aforementioned features.
Figure 1 - an example of an MPEG-4 Scene
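To make the grouping idea above concrete, the following C++ sketch builds a tiny scene tree in which a compound "person" node groups a video object and its voice, and a transform places the compound object in the scene. All type and field names are invented for illustration; this is not BIFS or VRML syntax, merely the grouping-plus-transform idea described above, under those stated assumptions.
    // Hypothetical sketch: hierarchical grouping of media objects with a transform (not BIFS syntax).
    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    struct Transform2D { float tx, ty, scale; };

    // A node is either a primitive media object (leaf) or a compound object grouping a sub-tree.
    struct SceneNode {
        std::string name;
        Transform2D transform{0, 0, 1};   // placement within the parent's coordinate system
        std::vector<std::shared_ptr<SceneNode>> children;

        void print(int depth = 0) const {
            std::cout << std::string(depth * 2, ' ') << name
                      << " (tx=" << transform.tx << ", ty=" << transform.ty << ")\n";
            for (const auto& c : children) c->print(depth + 1);
        }
    };

    std::shared_ptr<SceneNode> makeNode(const std::string& name) {
        auto n = std::make_shared<SceneNode>();
        n->name = name;
        return n;
    }

    int main() {
        // The compound "person" node groups the video object and the voice, so moving or
        // scaling "person" affects both its aural and visual components together.
        auto scene = makeNode("scene");
        auto background = makeNode("still image background");
        auto person = makeNode("person (compound)");
        person->transform = {120, 40, 1};
        person->children = {makeNode("video object (talking person, no background)"),
                            makeNode("audio object (voice)")};
        scene->children = {background, person};
        scene->print();
        return 0;
    }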
Media
objects may need streaming data, which is conveyed in one or
more elementary streams. An object descriptor identifies all streams associated with one media object. This allows handling
hierarchically encoded data as well as the association of
meta-information about the content (called object content information) and the intellectual property rights associated with it.
Each stream
itself is characterized by a set of descriptors for configuration
information, e.g., to determine the required decoder
resources and the precision of encoded timing information. Furthermore, the descriptors may carry hints about the Quality of Service (QoS) required for transmission (e.g., maximum bit rate, bit error rate, priority).
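The following C++ sketch illustrates, in a purely hypothetical form, how an object descriptor could tie several elementary streams (e.g. a hierarchically encoded base and enhancement layer) to one media object, each stream carrying its own decoder configuration, timestamp resolution and QoS hints. The structures and field names are illustrative assumptions, not the normative MPEG-4 descriptor syntax.
    // Hypothetical sketch: an object descriptor grouping the elementary streams of one media
    // object; field names are illustrative, not the normative MPEG-4 descriptor syntax.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    struct QoSHints {              // hints the descriptor may carry for transmission
        uint32_t maxBitrate;       // bit/s
        double   maxBitErrorRate;
        int      priority;
    };

    struct ESDescriptor {
        uint16_t    esId;
        std::string decoderConfig;         // e.g. which decoder and how much resource it needs
        uint32_t    timestampResolution;   // precision of the encoded timing information (ticks/s)
        QoSHints    qos;
    };

    struct ObjectDescriptor {
        uint16_t    objectDescriptorId;
        std::string contentInfo;           // object content information (OCI), e.g. rights holder
        std::vector<ESDescriptor> streams; // all streams associated with one media object
    };

    int main() {
        // A talking person carried as hierarchically encoded video (base + enhancement) plus audio.
        ObjectDescriptor person{7, "talking person; rights holder given via OCI", {
            {101, "video base layer decoder",        90000, {64000, 1e-4, 1}},
            {102, "video enhancement layer decoder", 90000, {256000, 1e-5, 2}},
            {103, "CELP audio decoder",              48000, {12000, 1e-4, 1}},
        }};
        for (const auto& es : person.streams)
            std::cout << "ES " << es.esId << ": " << es.decoderConfig
                      << ", max " << es.qos.maxBitrate << " bit/s\n";
        return 0;
    }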
Synchronization
of elementary streams is achieved through time stamping of individual access
units within elementary streams. The synchronization
layer manages the identification of such access units and the time
stamping. Independent of the media type, this layer allows
identification of the type of access unit (e.g., video or audio
frames, scene description commands) in elementary streams,
recovery of the media object's or scene description's time base, and it enables synchronization among them. The syntax of this layer is configurable in a large number of ways, allowing use in a broad spectrum of systems.
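As a sketch of the idea of time-stamped access units, the following hypothetical C++ fragment expresses timestamps against each stream's own time base and converts them to a common clock so that streams with different timestamp resolutions can be aligned. The data layout is an assumption made for illustration; it is not the normative sync layer syntax.
    // Hypothetical sketch of sync-layer style timestamping: each access unit carries times
    // expressed in ticks of the stream's own (recovered) time base.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct AccessUnit {
        uint32_t decodingTime;     // when the unit must be decoded, in ticks
        uint32_t compositionTime;  // when the decoded unit becomes valid for composition
        bool     randomAccessPoint;
    };

    struct ElementaryStream {
        uint32_t ticksPerSecond;          // timestamp resolution, signalled in the descriptor
        std::vector<AccessUnit> units;
    };

    // Convert a timestamp to seconds so that streams with different resolutions
    // (e.g. 90 kHz video, 48 kHz audio) can be synchronized against a common clock.
    double toSeconds(const ElementaryStream& es, uint32_t ticks) {
        return static_cast<double>(ticks) / es.ticksPerSecond;
    }

    int main() {
        ElementaryStream video{90000, {{0, 3000, true}, {3000, 6000, false}}};
        ElementaryStream audio{48000, {{0, 1024, true}, {1024, 2048, false}}};

        // Compare composition times of corresponding access units on a common clock.
        double vt = toSeconds(video, video.units[1].compositionTime);
        double at = toSeconds(audio, audio.units[1].compositionTime);
        std::cout << "video AU at " << vt << " s, audio AU at " << at
                  << " s, skew " << (vt - at) << " s\n";
        return 0;
    }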
The synchronized delivery of streaming information from source to destination, exploiting
different
QoS as available from the network, is specified in terms of the synchronization layer and a delivery layer containing
a two-layer
multiplexer, as depicted in Figure 2.
The first multiplexing layer is managed according to the DMIF specification, part 6 of the MPEG-4 standard (DMIF stands for Delivery Multimedia Integration Framework). This multiplex
may be embodied by
the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams (ESs) with a low multiplexing
overhead. Multiplexing
at this layer may be used, for example, to group ESs with similar QoS requirements, reduce the number of network connections or reduce the end-to-end delay.
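The following C++ sketch shows the general idea of such a low-overhead first multiplexing layer: access units from several elementary streams are interleaved into one stream of packets, each tagged with a small channel index for demultiplexing. The packet layout and the round-robin scheduling are illustrative assumptions, not the actual FlexMux tool syntax.
    // Hypothetical sketch of FlexMux-style interleaving: several elementary streams share one
    // multiplexed stream, each payload tagged with a small channel number.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct MuxPacket {
        uint8_t channel;               // which elementary stream this payload belongs to
        std::vector<uint8_t> payload;  // one access unit (or fragment) of that stream
    };

    // Round-robin interleaving keeps the end-to-end delay low for streams grouped together.
    std::vector<MuxPacket> interleave(const std::vector<std::vector<std::vector<uint8_t>>>& streams) {
        std::vector<MuxPacket> mux;
        for (size_t i = 0; ; ++i) {
            bool any = false;
            for (size_t ch = 0; ch < streams.size(); ++ch) {
                if (i < streams[ch].size()) {
                    mux.push_back({static_cast<uint8_t>(ch), streams[ch][i]});
                    any = true;
                }
            }
            if (!any) break;
        }
        return mux;
    }

    int main() {
        std::vector<std::vector<std::vector<uint8_t>>> streams = {
            {{1, 2, 3}, {4, 5}},   // channel 0: e.g. low-rate audio access units
            {{9, 9, 9, 9}}         // channel 1: e.g. scene description commands
        };
        for (const auto& p : interleave(streams))
            std::cout << "channel " << int(p.channel) << ", " << p.payload.size() << " bytes\n";
        return 0;
    }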
The "TransMux" (Transport Multiplexing) layer in Figure
2 models the layer that offers
transport services matching the requested QoS. Only the interface to this layer is specified
by MPEG-4 while the concrete
mapping of the data packets and control signaling must be done in collaboration with the bodies
that have jurisdiction over
the respective transport protocol. Any suitable existing transport protocol stack such as
(RTP)/UDP/IP, (AAL5)/ATM,
or MPEG-2's Transport Stream over a suitable link layer may become a specific TransMux instance. The choice is left to the end user/service provider, and allows MPEG-4 to be used in a wide variety of operating environments.
Figure 2 - The MPEG-4 System Layer Model
Use of the FlexMux multiplexing tool is optional
and, as shown in Figure 2, this layer may be empty if the underlying
TransMux instance provides all the required functionality. The synchronization
layer, however, is always present.
With
regard to Figure 2, it is possible to:
identify access units, transport timestamps and clock reference information, and identify data loss;
optionally interleave data from different elementary streams into FlexMux streams;
convey control information to:
indicate the required QoS for each elementary stream and FlexMux stream;
translate such QoS requirements into actual network resources;
associate elementary streams with media objects;
convey the mapping of elementary streams to FlexMux and TransMux channels.
Parts of
the control functionalities are available only in conjunction with
a transport control entity like the DMIF framework.
In general,
the user observes a scene that is composed following the design of the scene's author. Depending on the degree of freedom allowed by the author, however, the user has the possibility to interact with the scene. Operations a user may be allowed to perform include:
change the viewing/listening point of the scene, e.g. by navigation through a scene;
drag objects in the scene to a different position;
trigger a cascade of events by clicking on a specific object, e.g. starting or stopping a video stream;
select the desired language when multiple language tracks are available;
More
complex kinds of behavior can also
be triggered, e.g. a virtual phone rings, the user answers and a communication link is
established.
It is important to have the possibility to identify intellectual
property
in MPEG-4 media objects. Therefore, MPEG has worked with representatives of different creative industries in
the definition
of syntax and tools to support this. A full elaboration of the requirements for the identification of intellectual
property
can be found in Management and Protection of Intellectual Property in MPEG-4, which is publicly available from the MPEG home page.
MPEG-4 incorporates identification of intellectual property by storing unique identifiers that are issued by international numbering systems (e.g. ISAN, ISRC, etc. [ISAN: International Standard Audiovisual Number, ISRC: International Standard Recording Code]). These numbers can be applied to identify the current rights holder of a media object. Since not all content is identified by such a number, MPEG-4 Version 1 offers the possibility to identify intellectual property by a key-value pair (e.g. »composer«/»John Smith«). Also, MPEG-4 offers a standardized interface, integrated tightly into the Systems layer, for people who want to use systems that control access to intellectual property. With this interface, proprietary control systems can be easily combined with the standardized part of the decoder.
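A hypothetical C++ sketch of the two identification mechanisms just described - a unique number issued by a numbering system such as ISAN or ISRC, or free-form key-value pairs - might look as follows. The structure and field names are assumptions made for illustration only, and the identifier value is a placeholder.
    // Hypothetical sketch: identifying intellectual property either by an issued number or by
    // key-value pairs, as described above. Not the normative MPEG-4 syntax.
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    struct IPIdentification {
        std::string numberingSystem;  // e.g. "ISAN" or "ISRC"; empty if no number was issued
        std::string uniqueNumber;     // the issued identifier, when available (placeholder below)
        std::vector<std::pair<std::string, std::string>> keyValuePairs;  // e.g. {"composer", "John Smith"}
    };

    int main() {
        IPIdentification song;                       // content identified by an issued number
        song.numberingSystem = "ISRC";
        song.uniqueNumber = "XX-XXX-00-00000";       // placeholder value, not a real code
        song.keyValuePairs.push_back({"composer", "John Smith"});

        IPIdentification clip;                       // content without an issued number
        clip.keyValuePairs.push_back({"rights holder", "Example Studio"});

        std::cout << song.numberingSystem << " " << song.uniqueNumber << "\n";
        for (const auto& kv : clip.keyValuePairs)
            std::cout << kv.first << " = " << kv.second << "\n";
        return 0;
    }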
This section contains, in an itemized fashion, the major functionalities that the different parts of the MPEG-4 Standard offer in the finalized MPEG-4 Version 1. Descriptions of the functionalities can be found in the following sections.
DMIF
supports the following
functionalities:
A transparent MPEG-4 DMIF-application interface irrespective of whether the peer is a remote interactive peer, broadcast or local storage media.
Control of the establishment of FlexMux channels
Use of homogeneous networks between interactive peers: IP, ATM, mobile, PSTN, Narrowband ISDN.
As explained above, MPEG-4 defines a toolbox of
advanced compression
algorithms for audio and visual information. The data streams (Elementary Streams, ES) that result
from the coding process
can be transmitted or stored separately, and need to be composed so as to create the actual multimedia
presentation at the
receiver side.
The Systems part of MPEG-4 addresses the description of the relationship between the audio-visual components that constitute a scene. The relationship is described at two main levels.
The Binary Format for Scenes (BIFS) describes the spatio-temporal arrangements of the objects in the scene. Viewers may have the possibility of interacting with the objects, e.g. by rearranging them on the scene or by changing their own point of view in a 3D virtual environment. The scene description provides a rich set of nodes for 2-D and 3-D composition operators and graphics primitives.
At a lower level, Object Descriptors (ODs) define the relationship between the Elementary Streams pertinent to each object (e.g. the audio and the video stream of a participant in a videoconference). ODs also provide additional information such as the URL needed to access the Elementary Streams, the characteristics of the decoders needed to parse them, intellectual property and others.
Other
issues addressed by MPEG-4 Systems:
Interactivity, including: client and server-based interaction; a general event model for triggering events or routing user actions; general event handling and routing between objects in the scene, upon user or scene triggered events.
A tool for interleaving of multiple streams into a single stream, including timing information (FlexMux tool).
A tool for storing MPEG-4 data in a file (the MPEG-4 File Format, MP4)
Interfaces to various aspects of the terminal and networks, in the form of Java APIs (MPEG-J)
Transport layer independence. Mappings to relevant transport protocol stacks, like (RTP)/UDP/IP or MPEG-2 transport stream can be or are being defined jointly with the responsible standardization bodies.
Text representation with international language support, font and font style selection, timing and synchronization.
The initialization and continuous management of the receiving terminal's buffers.
Timing identification, synchronization and recovery mechanisms.
Datasets covering identification of Intellectual Property Rights relating to media objects.
MPEG-4 Audio facilitates a wide variety of applications which
could range from intelligible speech
to high quality multichannel audio, and from natural sounds to synthesized sounds.
In particular, it supports the highly
efficient representation of audio objects consisting of:
Speech signals: Speech coding can be done using bitrates from 2 kbit/s up to 24 kbit/s using the speech coding tools. Lower bitrates, such as an average of 1.2 kbit/s, are also possible when variable rate coding is allowed. Low delay is possible for communications applications. When using the HVXC tools, speed and pitch can be modified under user control during playback. If the CELP tools are used, a change of the playback speed can be achieved by using an additional tool for effects processing.
Synthesized Speech: Scalable TTS coders operate at bitrates ranging from 200 bit/s to 1.2 kbit/s and take a text, or a text with prosodic parameters (pitch contour, phoneme duration, and so on), as input to generate intelligible synthetic speech. This includes the following functionalities:
Speech synthesis using the prosody of the original speech
Lip synchronization control with phoneme information.
Trick mode functionality: pause, resume, jump forward/backward.
International language and dialect support for text. (i.e. it can be signaled in the bitstream which language and dialect should be used)
International symbol support for phonemes.
support for specifying age, gender, speech rate of the speaker
support for conveying facial animation parameter (FAP) bookmarks.
General audio signals: Support for coding general audio ranging from very low bitrates up to high quality is provided by transform coding techniques. With this functionality, a wide range of bitrates and bandwidths is covered. It starts at a bitrate of 6 kbit/s and a bandwidth below 4 kHz but also includes broadcast quality audio from mono up to multichannel.
Synthesized Audio: Synthetic Audio support is provided by a Structured Audio Decoder implementation that allows the application of score-based control information to musical instruments described in a special language.
Bounded-complexity Synthetic Audio: This is provided by a Structured Audio Decoder implementation that allows the processing of a standardized wavetable format.
Examples
of additional functionality are speed
control and pitch change for speech signals and scalability in terms of bitrate,
bandwidth, error robustness, complexity,
etc. as defined below.
The speed change functionality allows the change of the time scale without altering the pitch during the decoding process. This can, for example, be used to implement a "fast forward" function (data base search) or to adapt the length of an audio sequence to a given video sequence, or for practicing dance steps at slower play back speed.
The pitch change functionality allows the change of the pitch without altering the time scale during the encoding or decoding process. This can be used, for example, for voice alteration or Karaoke type applications. This technique only applies to parametric and structured audio coding methods.
Bitrate scalability allows a bitstream to be parsed into a bitstream of lower bitrate such that the combination can still be decoded into a meaningful signal. The bitstream parsing can occur either during transmission or in the decoder.
Bandwidth scalability is a particular case of bitrate scalability, whereby part of a bitstream representing a part of the frequency spectrum can be discarded during transmission or decoding.
Encoder complexity scalability allows encoders of different complexity to generate valid and meaningful bitstreams.
Decoder complexity scalability allows a given bitstream to be decoded by decoders of different levels of complexity. The audio quality, in general, is related to the complexity of the encoder and decoder used.
Audio Effects provide the ability to process decoded audio signals with complete timing accuracy to achieve functions for mixing , reverberation, spatialization, etc.
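To make the bitrate and bandwidth scalability described above concrete, the following hypothetical C++ sketch "parses" a layered bitstream down to a lower bitrate simply by keeping the base layer and as many enhancement layers as fit within a transmission budget; the remaining subset is still decodable. The layer representation is an assumption made for illustration, not an actual MPEG-4 bitstream format.
    // Hypothetical sketch of bitrate scalability: drop enhancement layers until the stream
    // fits a given budget, keeping the base layer so the result still decodes meaningfully.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Layer {
        std::string name;
        int bitrate;   // kbit/s contributed by this layer
    };

    // Keep the base layer plus as many enhancement layers as fit within the budget.
    std::vector<Layer> parseToBudget(const std::vector<Layer>& layers, int budgetKbps) {
        std::vector<Layer> kept;
        int total = 0;
        for (const auto& l : layers) {
            if (!kept.empty() && total + l.bitrate > budgetKbps) break;
            kept.push_back(l);
            total += l.bitrate;
        }
        return kept;
    }

    int main() {
        std::vector<Layer> stream = {{"base", 6}, {"enh-1", 6}, {"enh-2", 12}};  // 24 kbit/s total
        for (const auto& l : parseToBudget(stream, 16))   // channel allows only 16 kbit/s
            std::cout << "transmit " << l.name << " (" << l.bitrate << " kbit/s)\n";
        return 0;
    }
This truncation can happen either during transmission or in the decoder, as noted above; the same selection logic, applied to layers that each cover part of the frequency spectrum, would illustrate bandwidth scalability.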
The MPEG-4 Visual standard will allow the hybrid coding
of
natural (pixel based) images and video together with synthetic (computer generated) scenes. This will, for example,
allow
the virtual presence of videoconferencing participants. To this end, the Visual standard will comprise tools and
algorithms
supporting the coding of natural (pixel based) still images and video sequences as well as tools to support the
compression
of synthetic 2-D and 3-D graphic geometry parameters (i.e. compression of wire grid parameters, synthetic
text).
The
subsections below give an itemized overview of functionalities that the tools and algorithms of the MPEG-4 visual
standard
will support.
The following formats
and bitrates will be supported by MPEG-4 Version
1:
Bitrates: typically between 5 kbit/s and 10 Mbit/s
Formats: progressive as well as interlaced video
Resolutions: typically from sub-QCIF to beyond HDTV
Efficient compression of video will be supported for all bit rates addressed. This includes the compact coding of textures with a quality adjustable between "acceptable" for very high compression ratios up to "near lossless".
Efficient compression of textures for texture mapping on 2-D and 3-D meshes.
Random access of video to allow functionalities such as pause, fast forward and fast reverse of stored video.
Content-based coding of images and video to allow separate decoding and reconstruction of arbitrarily shaped video objects.
Random access of content in video sequences to allow functionalities such as pause, fast forward and fast reverse of stored video objects.
Extended manipulation of content in video sequences to allow functionalities such as warping of synthetic or natural text, textures, image and video overlays on reconstructed video content. An example is the mapping of text in front of a moving video object where the text moves coherently with the object.
Complexity scalability in the encoder allows encoders of different complexity to generate valid and meaningful bitstreams for a given texture, image or video.
Complexity scalability in the decoder allows a given texture, image or video bitstream to be decoded by decoders of different levels of complexity. The reconstructed quality, in general, is related to the complexity of the decoder used. This may entail that less powerful decoders decode only a part of the bitstream.
Spatial scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display textures, images and video objects at reduced spatial resolution. For textures and still images, a maximum of 11 levels of spatial scalability will be supported. For video sequences, a maximum of three levels will be supported.
Temporal scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display video at reduced temporal resolution. A maximum of three levels will be supported.
Quality scalability allows a bitstream to be parsed into a number of bitstream layers of different bitrate such that the combination of a subset of the layers can still be decoded into a meaningful signal. The bitstream parsing can occur either during transmission or in the decoder. The reconstructed quality, in general, is related to the number of layers used for decoding and reconstruction.
Shape coding will be supported to assist the description and composition of conventional images and video as well as arbitrarily shaped video objects. Applications that benefit from binary shape maps with images are content-based image representations for image databases, interactive games, surveillance, and animation. Techniques are provided that allow efficient coding of binary shape. A binary alpha map defines whether or not a pixel belongs to an object: it can be on or off.
Gray Scale or Alpha Shape Coding
An alpha plane defines the transparency of an object, which is not necessarily uniform. Multilevel alpha maps are frequently used to blend different layers of image sequences. Other applications that benefit from associated binary alpha maps with images are content-based image representations for image databases, interactive games, surveillance, and animation. Techniques are provided that allow efficient coding of binary as well as gray-scale alpha planes. A binary alpha map defines whether or not a pixel belongs to an object: it can be on or off. A gray-scale map offers the possibility to define the exact transparency of each pixel.
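The following C++ sketch illustrates how a gray-scale alpha map can be used at composition time to blend an arbitrarily shaped video object over a background; a binary alpha map would simply restrict the alpha values to 0 and 255. This is a generic alpha-blending example written for illustration, not code taken from the standard.
    // Hypothetical sketch: per-pixel blending of a video object over a background using a
    // gray-scale alpha map (0 = fully transparent, 255 = fully opaque).
    #include <cstdint>
    #include <iostream>
    #include <vector>

    std::vector<uint8_t> composite(const std::vector<uint8_t>& background,
                                   const std::vector<uint8_t>& object,
                                   const std::vector<uint8_t>& alpha) {
        std::vector<uint8_t> out(background.size());
        for (size_t i = 0; i < background.size(); ++i)
            out[i] = static_cast<uint8_t>(
                (alpha[i] * object[i] + (255 - alpha[i]) * background[i]) / 255);
        return out;
    }

    int main() {
        std::vector<uint8_t> bg    = {10, 10, 10, 10};      // background pixels (one plane)
        std::vector<uint8_t> obj   = {200, 200, 200, 200};  // video object pixels
        std::vector<uint8_t> alpha = {0, 64, 128, 255};     // transparent .. opaque
        for (uint8_t p : composite(bg, obj, alpha)) std::cout << int(p) << " ";
        std::cout << "\n";   // prints a gradual blend from background to object
        return 0;
    }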
Error
resilience will be supported to assist the access of image and video over a wide range of storage and transmission media.
This includes the useful operation of image and video compression algorithms in error-prone environments at low bit-rates
(i.e., less than 64 Kbps). Tools are provided which address both the band limited nature and error resiliency aspects for
access over wireless networks.
The Face Animation part of the standard allows sending parameters that calibrate and animate synthetic faces. The models themselves are not standardized by MPEG-4; only the parameters are.
Definition and coding of face animation parameters (model independent):
Feature point positions and orientations to animate the face definition meshes
Visemes, or visual lip configurations equivalent to speech phonemes
Definition and coding of face definition parameters (for model calibration):
3-D feature point positions
3-D head calibration meshes for animation
Texture map of face
Personal characteristics
Mesh-based prediction and animated texture transfiguration
2-D Delaunay or regular mesh formalism with motion tracking of animated objects
Motion prediction and suspended texture transmission with dynamic meshes.
Geometry compression for motion vectors:
2-D mesh compression with implicit structure & decoder reconstruction
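As an illustration of the 2-D mesh idea listed above, the following hypothetical C++ fragment applies decoded per-vertex motion vectors to the vertices of a small mesh for one animation step; the texture warping that would accompany the vertex motion is omitted, and the data structures are assumptions made for illustration rather than the standardized mesh coding syntax.
    // Hypothetical sketch of 2-D mesh animation: mesh vertices are displaced by decoded motion
    // vectors each frame, and the texture mapped onto the triangles deforms with them.
    #include <iostream>
    #include <vector>

    struct Point { float x, y; };

    // One animation step: apply a motion vector to every mesh vertex.
    void animate(std::vector<Point>& vertices, const std::vector<Point>& motionVectors) {
        for (size_t i = 0; i < vertices.size(); ++i) {
            vertices[i].x += motionVectors[i].x;
            vertices[i].y += motionVectors[i].y;
        }
    }

    int main() {
        // A single triangle of a (regular or Delaunay) mesh covering part of a video object.
        std::vector<Point> mesh = {{0, 0}, {16, 0}, {8, 16}};
        std::vector<Point> mv   = {{1.5f, 0}, {1.0f, 0.5f}, {0.5f, 1.0f}};  // decoded motion vectors
        animate(mesh, mv);
        for (const auto& v : mesh) std::cout << "(" << v.x << ", " << v.y << ") ";
        std::cout << "\n";
        return 0;
    }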
Version 2 was frozen in December 1999. Existing tools and profiles from Version
1 are not replaced
in Version 2; technology will be added to MPEG-4 in the form of new profiles. Figure 3 below depicts the relationship
between the two versions.
Version 2 is a backward compatible extension of Version 1.
Figure 3 - relation between MPEG-4 Versions
Version 2 builds on Version 1 of MPEG-4.
The Systems layer of Version 2 is backward compatible with Version 1. In the area of Audio and Visual, Version 2 will add Profiles
to Version 1. The work on MPEG-4 does not stop after Version 2; more functionality will be added, albeit in particular, well-defined
areas. The same principle applies, and new tools will find their way in the standard in the form of new Profiles. This means
that existing systems will always remain compliant, because Profiles will not be changed in retrospect.
Version 2 of MPEG-4 Systems extends Version 1 to cover issues like extended BIFS functionalities,
and Java (MPEG-J) support.
Version 2 also specifies a file format to store MPEG-4 content. In Section 8, these new elements
will be introduced in more
detail.
MPEG-4 Visual Version
2 adds technology
in the following areas:
increased flexibility in object-based scalable coding,
improved coding efficiency,
improved temporal resolution stability with low buffering delay,
improved error robustness,
coding of multiple views: Intermediate views or stereoscopic views will be supported based on the efficient coding of multiple images or video sequences. A particular example is the coding of stereoscopic images or video by redundancy reduction of information contained between the images of different views.
See Section 9 for more detail.
Version 2 adds Body Animation to the Face Animation already present in V.1.
MPEG-4 Version 2 provides a suite of tools for coding
3-D polygonal meshes. Polygonal meshes are widely used as a generic representation
of 3-D objects. The underlying technologies
compress the connectivity, geometry, and properties such as shading normals,
colors and texture coordinates of 3-D polygonal
meshes.
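One ingredient of such geometry compression, quantizing vertex coordinates and coding them as deltas from the previous vertex, is sketched below in hypothetical C++; connectivity and property coding are not shown, and the representation is an illustrative assumption rather than the actual MPEG-4 3-D mesh coding syntax.
    // Hypothetical sketch: quantize vertex coordinates and code them as deltas, which typically
    // leaves small residuals that entropy-code well. Connectivity coding is not shown.
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Vertex { float x, y, z; };

    std::vector<int32_t> quantizeAndDelta(const std::vector<Vertex>& verts, float step) {
        std::vector<int32_t> out;
        int32_t px = 0, py = 0, pz = 0;              // previous quantized vertex
        for (const auto& v : verts) {
            int32_t qx = static_cast<int32_t>(std::lround(v.x / step));
            int32_t qy = static_cast<int32_t>(std::lround(v.y / step));
            int32_t qz = static_cast<int32_t>(std::lround(v.z / step));
            out.push_back(qx - px); out.push_back(qy - py); out.push_back(qz - pz);
            px = qx; py = qy; pz = qz;
        }
        return out;
    }

    int main() {
        std::vector<Vertex> mesh = {{0.00f, 0.00f, 0.00f}, {0.05f, 0.01f, 0.00f}, {0.06f, 0.03f, 0.01f}};
        for (int32_t d : quantizeAndDelta(mesh, 0.01f)) std::cout << d << " ";  // small residuals
        std::cout << "\n";
        return 0;
    }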
MPEG-4 Audio
Version 2 is an extension to MPEG-4 Audio Version 1. It adds new tools
and functionalities to the MPEG-4 Standard, while none
of the existing tools of Version 1 is replaced. The following additional
functionalities are provided by MPEG-4 Audio Version
2:
Increased error robustness
Audio coding that couples high quality to low delay
Fine Grain scalability (scalability resolution down to 1 kbit/s per channel)
Parametric Audio Coding to allow sound manipulation at low speeds
CELP Silence compression, to further lower bitrates in speech coding
Error resilient parametric speech coding
Environmental spatialization: the possibility to recreate a sound environment using perceptual and/or physical modeling techniques
A back channel that is helpful to adjust encoding or scalable play out in real time
A low-overhead, MPEG-4-audio-specific transport mechanism
See Section 10, Detailed technical description of MPEG-4 Audio.
The major features
introduced by DMIF Version 2 cover (limited) support to mobile
networks and QoS monitoring. A few minor additions were introduced
as well.
In conjunction with ITU-T, the H.245 specification has been extended (H.245v6) to include support for MPEG-4 Systems; the DMIF specification provides the appropriate walkthrough and mapping to H.245 signals. Mobile terminals can now use MPEG-4 Systems features such as BIFS and OD streams, although with some limitations (the MPEG-4 presentation is uniquely selected by the target peer).
DMIF V.2 introduces the concept of monitoring
the Quality of Service actually delivered by a network. The
DMIF-Application Interface has been extended accordingly.
The model allows for three different modes of QoS monitoring:
continuous monitoring, specific queries, and QoS violation
notification.
The DMIF model allows peer applications to exchange user messages of any kind (including stream control messages). DMIF V.2 adds to V.1 the support for acknowledgment messages.
V.2 enhances the DMIF model to
allow applications to exchange application-specific data
with the DMIF layer. This addition was introduced to enable,
within the model, the exchange of Sync Layer Protocol Data Units
as a combination of pure media data (PDU) and logical Sync
Layer information. The model acknowledges that within the existing
transport stacks there are features that overlap with
the MPEG-4 Systems Sync Layer. This is the case of RTP and MPEG-2 PES (Packetized Elementary Streams) as well as MP4 atoms in
the file format: in all such cases the obvious implementation of a
DMIF instance is to map the Sync Layer information extracted
from those structures, as well as from a true SL-PDU, into a uniform
logical representation of the Sync Layer Packet Header.
As a consequence, the appropriate parameters have been introduced
at the DAI, taking care, as usual, to make their semantic
independent of both transport stack and application.
DMIF V.2 includes an informative annex
that gives a C/C++ syntax for the DMIF
Application Interface, as a recommended API syntax.
MPEG is currently working on a number of extensions to Version 2, in the Visual and
Systems areas. There is no work
on extending MPEG-4 DMIF or Audio beyond Version 2.
In the visual area, the following
technologies are in the
process of being added:
Fine Grain scalability is in balloting phase, with proposed Streaming Video Profiles (Advanced Simple and Fine Grain Scalability). Fine Grain Scalability is a tool that allows small quality steps by adding or deleting layers of extra information. It is useful in a number of environments, notably for streaming purposes but also for dynamic (statistical) multiplexing of pre-encoded content in broadcast environments.
Tools for usage of MPEG-4 in the Studio. For these tools, care has been taken to preserve some form of compatibility with MPEG-2 profiles. Currently, the Simple Studio Profile is in a balloting phase; this is a profile with I-frame-only coding at very high bitrates (several hundred Mbit/s) which employs shape coding. Addition of a Core Studio Profile (with I and P frames) is expected.
Digital Cinema is under study. This application will require truly lossless coding, and not just the visually lossless that MPEG-4 has provided so far. A Preliminary Call for Proposals was issued in October 2000.
Advanced BIFS provides new nodes to be used in the scene graph for monitoring available media and managing media, such as sending commands to a server,
advanced control of media
playback, and the so-called EXTERNPROTO, a node that provides further compatibility with VRML,
and that allows writing
macros that define behavior of objects. Also, advanced compression of BIFS data is covered, and
in particular optimal compression
for mesh and for arrays of data.
The Extensible
MPEG-4 Textual format
(XMT) is a framework for representing MPEG-4 scene description using a textual syntax. The XMT allows
the content authors
to exchange their content with other authors, tools or service providers, and facilitates interoperability
with both
the Extensible 3D (X3D) being developed by the Web3D Consortium, and the Synchronized Multimedia Integration Language (SMIL) from the W3C consortium.
[Figure: Interoperability of the XMT textual format - the XMT can be parsed and played by a SMIL player, compiled to an MPEG-4 representation (e.g. an mp4 file) for an MPEG-4 player, or preprocessed to X3D for playback in a VRML browser; it also relates to SVG and MPEG-7.]
The XMT format can be interchanged between SMIL
players, VRML players, and MPEG-4
players. The format can be parsed and played directly by a W3C SMIL player, preprocessed
to Web3D X3D and played back by a VRML
player, or compiled to an MPEG-4 representation such as mp4, which can then be played
by an MPEG-4 player. See the figure above for a graphical description of the interoperability of the XMT. It encompasses MPEG-4, a large part of SMIL, Scalable Vector Graphics, X3D and also gives a textual representation for MPEG-7 Descriptions (see www.cselt.it/mpeg for documentation on the MPEG-7 Content Description Standard).
The XMT framework consists of two levels of textual syntax and semantics: the XMT-A format and the XMT-Ω format.
The XMT-A is an XML-based version of MPEG-4 content, which contains a subset of the X3D. Also contained in XMT-A is an MPEG-4 extension to the X3D to represent MPEG-4-specific features. The XMT-A provides a straightforward, one-to-one mapping between the textual and binary formats.
The XMT-Ω is a high-level abstraction of MPEG-4 features based on the W3C SMIL. The XMT provides a default mapping from Ω to A, since there is no deterministic mapping between the two, and it also provides content authors with an escape mechanism from Ω to A.
The Advanced Synchronization Model (usually called FlexTime) supports synchronization of objects from multiple sources with possibly different time bases. The FlexTime model specifies timing using a flexible, constraint-based timing model. In this model, media objects can be linked to one another in a time graph using relationship constraints such as "CoStart", "CoEnd", or "Meet". In addition, to allow some flexibility in meeting these constraints, each object may have a flexible duration with specific stretch and shrink mode preferences that may be applied.
The
FlexTime model is based upon a so-called "spring"
metaphor. A spring has a set of three constants: the minimum
length below which it will not shrink, the maximum length beyond
which it will break, and the optimal length at which it rests
comfortably being neither compressed nor extended. Following
this spring model, the temporal playback of media objects
can be viewed as springs, with a set of playback durations corresponding
to these three spring constants. The optimal playback
duration (optimal spring length) can be viewed as the author's preferred choice of playback duration for the media object. A player should, where possible, keep the playback length as close to the optimal duration as the presentation allows, but may choose any duration between the minimum and maximum durations as specified by the author. Note that whereas stretching or shrinking the duration of continuous media, e.g. video, implies respectively slowing down or speeding up playback, for discrete media such as a still image, shrinking or stretching merely adjusts the rendering period to be shorter or longer.
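A minimal C++ sketch of this spring metaphor is shown below: two objects that must start and end together are each given minimum, optimal and maximum durations, and a common duration is chosen as close to their optima as the constraint allows. The resolver is a simplification invented for illustration; it is not the normative FlexTime algorithm.
    // Hypothetical sketch of the FlexTime "spring" idea: each object has a minimum, optimal and
    // maximum duration, and a CoEnd-style constraint forces two co-started objects to end together.
    #include <algorithm>
    #include <iostream>

    struct Spring { double minDur, optDur, maxDur; };   // durations in seconds

    // Find a common duration for two objects that start together and must end together,
    // staying within each object's stretch/shrink limits when possible.
    double resolveCoEnd(const Spring& a, const Spring& b) {
        double lo = std::max(a.minDur, b.minDur);
        double hi = std::min(a.maxDur, b.maxDur);
        if (lo > hi) return -1.0;                       // the constraints cannot be met
        double target = (a.optDur + b.optDur) / 2.0;    // compromise near both optima
        return std::clamp(target, lo, hi);
    }

    int main() {
        Spring video{8.0, 10.0, 12.0};   // continuous media: stretching means slower playback
        Spring still{5.0,  6.0, 30.0};   // a still image can simply be displayed longer
        double d = resolveCoEnd(video, still);
        std::cout << "both objects play for " << d << " s\n";   // a value between 8.0 and 12.0
        return 0;
    }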
2D and 3D animation coding are under study following good responses received to a Call for Proposals that was evaluated in October 2000. The work is being progressed together with the Web3D Consortium.
MPEG-4 provides
a large and rich set of tools for the coding of audio-visual objects. In order to allow
effective implementations of the standard,
subsets of the MPEG-4 Systems, Visual, and Audio tool sets have been identified,
that can be used for specific applications.
These subsets, called Profiles, limit the tool set a decoder has to implement. For each of these Profiles, one or more Levels have been set, restricting the computational complexity. The approach is similar to MPEG-2, where the most well known Profile/Level combination is Main Profile @ Main Level. A Profile@Level combination allows:
a codec builder to implement only the subset of the standard he needs, while maintaining interworking with other MPEG-4 devices built to the same combination, and
checking whether MPEG-4 devices comply with the standard (conformance testing).
Profiles exist for various types of media
content (audio, visual, and graphics) and for scene
descriptions. MPEG does not prescribe or advise combinations of these
Profiles, but care has been taken that good matches
exist between the different areas.
The visual part of the standard
provides
profiles for the coding of natural, synthetic, and synthetic/natural hybrid visual content. There are five profiles
for
natural video content:
1. The Simple Visual Profile provides efficient, error resilient coding of rectangular
video objects, suitable for applications on mobile networks, such as PCS and IMT2000.
2. The Simple Scalable Visual
Profile adds support for coding of temporal and spatial scalable objects to the Simple Visual Profile. It is useful for applications
which provide services at more than one level of quality due to bit-rate or decoder resource limitations, such as Internet
use and software decoding.
3. The Core Visual Profile adds support for coding of arbitrary-shaped and temporally
scalable objects to the Simple Visual Profile. It is useful for applications such as those providing relatively simple
content-interactivity (Internet multimedia applications).
4. The Main Visual Profile adds support for coding
of interlaced, semi-transparent, and sprite objects to the Core Visual Profile. It is useful for interactive and entertainment-quality
broadcast and DVD applications.
5. The N-Bit Visual Profile adds support for coding video objects having pixel-depths
ranging from 4 to 12 bits to the Core Visual Profile. It is suitable for use in surveillance applications.
The profiles
for
synthetic and synthetic/natural hybrid visual content are:
6. The Simple Facial Animation Visual Profile
provides
a simple means to animate a face model, suitable for applications such as audio/video presentation for the hearing
impaired.
7.
The Scalable Texture Visual Profile provides spatial scalable coding of still image (texture)
objects useful
for applications needing multiple scalability levels, such as mapping texture onto objects in games, and
high-resolution
digital still cameras.
8. The Basic Animated 2-D Texture Visual Profile provides spatial
scalability, SNR scalability,
and mesh-based animation for still image (textures) objects and also simple face object
animation.
9. The Hybrid Visual
Profile combines the ability to decode arbitrary-shaped and temporally scalable
natural video objects (as in the
Core Visual Profile) with the ability to decode several synthetic and hybrid objects, including
simple face and animated
still image objects. It is suitable for various content-rich multimedia applications.
Version 2 adds the following Profiles
for natural video:
10. The Advanced
Real-Time Simple (ARTS) Profile provides advanced error resilient coding
techniques of rectangular video objects
using a back channel and improved temporal resolution stability with low buffering delay. It is suitable for real-time coding applications, such as videophone, teleconferencing and remote observation.
11.
The Core Scalable
Profile adds support for coding of temporal and spatial scalable arbitrarily shaped objects
to the Core Profile. The
main functionality of this profile is object based SNR and spatial/temporal scalability for regions
or objects of interest.
It is useful for applications such as the Internet, mobile and broadcast.
12. The Advanced Coding
Efficiency (ACE)
Profile improves the coding efficiency for both rectangular and arbitrary shaped objects. It is
suitable for applications
such as mobile broadcast reception, the acquisition of image sequences (camcorders) and other
applications where high
coding efficiency is requested and small footprint is not the prime concern.
The Version 2 profiles
for synthetic and
synthetic/natural hybrid visual content are:
13. The Advanced Scalable Texture Profile supports decoding of arbitrary-shaped texture and still images including scalable shape coding, wavelet tiling and error resilience. It is useful for applications that require fast random access as well as multiple scalability levels and arbitrary-shaped coding of still objects. Examples are fast content-based still image browsing on the Internet, multimedia-enabled PDAs, and Internet-ready high-resolution digital still cameras.
14.
The Advanced Core Profile combines the ability to decode arbitrary-shaped video objects (as in the Core Visual Profile) with the ability to decode arbitrary-shaped scalable still image objects (as in the Advanced Scalable Texture Profile). It is suitable for various content-rich multimedia applications such as interactive multimedia streaming over the Internet.
15. The Simple Face and Body Animation Profile is a superset of the Simple Face Animation Profile,
adding - obviously - body animation.
Four Audio Profiles have been defined in
MPEG-4 V.1:
1. The Speech Profile
provides HVXC, which is a very-low bit-rate parametric speech coder,
a CELP narrowband/wideband speech coder, and a Text-To-Speech
interface.
2. The Synthesis Profile provides score
driven synthesis using SAOL and wavetables and a Text-to-Speech
Interface to generate sound and speech at very low bitrates.
3.
The Scalable Profile, a superset of the Speech
Profile, is suitable for scalable coding of speech and music for networks,
such as the Internet and Narrowband Audio Digital Broadcasting (NADIB). The bitrates range from 6 kbit/s to 24 kbit/s, with bandwidths between 3.5 and 9 kHz.
4. The Main
Profile is a rich superset of all the other Profiles, containing tools
for natural and synthetic Audio.
Another
four Profiles were added in MPEG-4 V.2:
5. The High Quality Audio Profile
contains the CELP speech coder and the
Low Complexity AAC coder including Long Term Prediction. Scalable coding can be performed by the AAC Scalable object
type. Optionally, the new error resilient (ER) bitstream syntax may be used.
6.
The Low Delay Audio Profile contains
the HVXC and CELP speech coders (optionally using the ER bitstream syntax),
the low-delay AAC coder and the Text-to-Speech
interface TTSI.
7. The Natural Audio Profile contains all natural
audio coding tools available in MPEG-4, but
not the synthetic ones.
8. The Mobile Audio Internetworking Profile
contains the low-delay and scalable AAC
object types including TwinVQ and BSAC. This profile is intended to extend communication
applications using non-MPEG
speech coding algorithms with high quality audio coding capabilities.
Graphics Profiles define which graphical and
textual elements can be used in a scene. These profiles are defined
in the Systems part of the standard:
1. Simple 2-D Graphics Profile: The Simple 2-D Graphics profile provides for only those graphics elements of the BIFS tool that are necessary to place one or more visual objects in a scene.
2. Complete 2-D Graphics Profile: The Complete 2-D Graphics profile provides two-dimensional graphics functionalities and supports features such as arbitrary two-dimensional graphics and text, possibly in conjunction with visual objects.
3. Complete Graphics Profile: The Complete Graphics profile provides advanced graphical elements such as elevation grids and extrusions and allows creating content with sophisticated lighting. The Complete Graphics profile enables applications such as complex virtual worlds that exhibit a high degree of realism.
Scene Description
Profiles, defined in the Systems
part of the standard, allow audiovisual scenes with audio-only, 2-dimensional, 3-dimensional
or mixed 2-D/3-D content.
The 3-D Profile is called VRML, as it optimizes interworking with VRML material:
1. The Audio Scene Graph Profile provides
for a set of BIFS scene
graph elements for usage in audio only applications. The Audio Scene Graph profile supports applications
like broadcast
radio.
2. The Simple 2-D Scene Graph Profile provides for only those BIFS scene graph elements
necessary to place
one or more audio-visual objects in a scene. The Simple 2-D Scene Graph profile allows presentation of
audio-visual content
with potential update of the complete scene but no interaction capabilities. The Simple 2-D Scene
Graph profile supports
applications like broadcast television.
3. The Complete 2-D Scene
Graph Profile provides for all the 2-D scene
description elements of the BIFS tool. It supports features such as 2-D transformations
and alpha blending. The Complete
2-D Scene Graph profile enables 2-D applications that require extensive and customized
interactivity.
4. The Complete
Scene Graph profile provides the complete set of scene graph elements of the
BIFS tool. The Complete Scene Graph profile
will enable applications like dynamic virtual 3-D world and games.
Two MPEG-J
Profiles exist: Personal and Main:
Personal
- a lightweight package for personal devices.
The personal profile
addresses a range of constrained devices
including mobile and portable devices. Examples of such devices are cell video
phones, PDAs, personal gaming devices.
This profile includes the following packages of MPEG-J APIs
1. Network
2. Scene
3. Resource
Main -
includes all the MPEG-J API's.
The Main profile addresses a range of consumer devices including
entertainment devices.
Examples of such devices are set top boxes, computer based multimedia systems etc. It is a superset
of the Personal profile.
Apart from the packages in the Personal profile, this profile includes the following packages
of the MPEG-J APIs.
1. Decoder
2. Decoder Functionality
3. Section Filter and Service Information
The Object Descriptor Profile includes the following tools:
Object Descriptor (OD) tool
Sync Layer (SL) tool
Object Content Information (OCI) tool
Intellectual Property Management and Protection (IPMP) tool
Currently, only one profile
is defined that includes all these tools. The main reason for defining this profile is not subsetting the tools, but rather
defining levels for them. This applies especially
to the Sync Layer tool, as MPEG-4 allows multiple time bases to exist.
In the context of Levels for this Profile, restrictions
can be defined, e.g. to allow only a single time base.
MPEG carries out verification tests to check whether the standard delivers what it promises.
The
test results can be found on MPEG's home page, http://www.cselt.it/mpeg/quality_tests.htm
The main results are described below; more verification
tests are
planned.
A
number of MPEG-4's capabilities have been formally evaluated using subjective
tests. Coding efficiency, although not
the only MPEG-4 functionality, is an important selling point of MPEG-4, and one
that has been tested more thoroughly. Also
error robustness has been put to rigorous tests. Furthermore, scalability tests
were done and for one specific profile
the temporal resolution stability was examined. Many of these tests address a specific
profile.
In this Low
and Medium Bitrates Test, frame-based sequences were examined,
with MPEG-1 as a reference. (MPEG-2 would be identical
for the progressive sequences used, except that MPEG-1 is a bit more
efficient as it uses less overhead for header information).
The test uses typical test sequences for CIF and QCIF resolutions,
encoded with the same rate control for both MPEG-1 and
MPEG-4 to compare the coding algorithms without the impact of different
rate control schemes. The test was performed for
low bit rates starting at 40 kbps to medium bit rate up to 768 kbps.
The tests of the Coding Efficiency functionality show a clear superiority of MPEG-4 over MPEG-1 at both the low and medium bit rate coding conditions, whatever the criticality of the scene. The human subjects consistently chose MPEG-4 as statistically significantly superior, by a one-point difference on a full scale of five points.
The verification tests
for Content Based Coding compare
the visual quality of object-based versus frame-based coding. The major objective was
to ensure that object-based coding
can be supported without impacting the visual quality. Test content was chosen to cover
a wide variety of simulation conditions,
including video segments with various types of motions and encoding complexities.
Additionally, test conditions were established to cover low bit rates ranging from 256 kb/s to 384 kb/s, as well as high bit rates ranging from 512 kb/s to 1.15 Mb/s.
The results of the tests clearly demonstrated that object-based functionality is provided
by MPEG-4 with no overhead or
loss in terms of visual quality, when compared to frame-based coding. There is no statistically significant difference between any object-based case and the relevant frame-based ones. Hence the conclusion: MPEG-4 is
able to provide content-based
functionality without introducing any loss in terms of visual quality.
The formal verification tests
on Advanced Coding Efficiency
(ACE) Profile were performed to check whether three new Version 2 tools, as included in the MPEG-4 Visual Version 2 ACE Profile
(Global Motion Compensation, Quarter Pel Motion Compensation and Shape-adaptive DCT) enhance
the coding efficiency
compared with MPEG-4 Visual Version 1. The tests explored the performance of the ACE Profile and the
MPEG-4 Visual Version
1 Main Profile in the object-based low bit rate case, the frame-based low bit rate case and the frame-based
high bit rate case.
The results obtained show a clear superiority of the ACE Profile compared with the Main Profile; more
in detail:
For the object based case, the quality provided by the ACE Profile at 256 kb/s is equal to the quality provided by Main Profile at 384 kb/s.
For the frame based at low bit rate case, the quality provided by the ACE Profile at 128 kb/s and 256 kb/s is equal to the quality provided by Main Profile at 256 kb/s and 384 kb/s respectively.
For the frame based at high bit rate case, the quality provided by the ACE Profile at 768 kb/s is equal to the quality provided by Main Profile at 1024 kb/s.
When
interpreting these results, it must be noted that the MPEG-4 Main Profile is already more
efficient than MPEG-1 and MPEG-2.
The
performance of error resilient
video in the MPEG-4 Simple Profile was evaluated in subjective tests simulating MPEG-4
video carried in a realistic multiplex
and over correspondingly realistic radio channels, at bitrates between 32 kbit/s and 384 kbit/s. The test used a simulation of the residual errors after channel coding at bit error rates up to 10^-3, and the average length of the burst errors was about 10 ms.
The test methodology was based on a continuous quality evaluation over a period of three
minutes. In such a test, subjects
constantly score the degradation they experience.
The results show that the average
video quality achieved on the mobile
channel is high, that the impact of errors is effectively kept local by the tools in MPEG-4
video, and that the video quality
recovers quickly at the end of periods of error. These excellent results were achieved
with very low overheads, less than
those typically associated with the GOP structure used in MPEG-1 and MPEG-2 video.
The performance of error resilient video in MPEG-4 ARTS Profile was checked
in subjective tests similar
to those mentioned in the previous section, at bitrates between 32 kbit/s and 128 kbit/s. In
this case, the residual errors after channel coding were up to 10^-3, and the average length of the burst errors was about 10 ms (called "critical") or 1 ms (called "very critical"; the latter is more critical because the same amount of errors is spread more widely over the bitstream than in the "critical" case).
The results show a clear
superiority of the ARTS Profile over
the Simple Profile for both the error cases ("critical" and "very
critical"). More in detail the
ARTS Profile outperforms Simple Profile in the recovery time from transmission errors.
Furthermore ARTS Profile in the
"critical" error condition provides results that for most of the test time are
close to a complete transparency,
while Simple Profile is still severely affected by errors. These excellent results were
achieved with very low overheads and very fast error recovery provided by the NEWPRED tool, and under low-delay conditions.
This
test explored the performance of a video codec using
the Dynamic Resolution Conversion technique that adapts the resolution
to the video content and to circumstances in real-time.
Active scene content was coded at 64 kb/s, 96 kb/s and 128 kb/s data rates. The results show that at 64 kbit/s, it outperforms the already effective Simple Profile operating at 96 kbit/s, and at 96 kb/s, the visual quality is equal to that of the Simple Profile at 128 kbit/s. (The Simple Profile already compares well to other, existing systems.)
The scalability test for the Simple Scalable Profile was designed to compare the quality provided by the Temporal Scalability tool in the Simple Scalable Profile to the quality provided by Single Layer coding in the Simple Profile, and to the quality provided by Simulcast coding in the Simple Profile.
In this test, 5 sequences
with 4 combinations of bitrates were used:
a) 24
kbps for base layer and 40 kbps for enhancement layer.
b) 32 kbps for
both layers.
c) 64 kbps for the base layer and 64 kbps for
the enhancement layer.
d) 128 kbps for both layers.
The
formal verification tests showed that in all the given conditions,
the Temporal Scalability coding in Simple Scalable
Profile exhibits the same or slightly lower quality than can be achieved
using Single layer coding in Simple Profile. Furthermore
it is evident that the Temporal Scalability coding in Simple Scalable
Profile provides better quality than the simulcast
coding in Simple Profile for that condition. (Simulcast entails simultaneously
broadcasting or streaming at multiple
bitrates.)
The
verification
test was designed to evaluate the performance of MPEG-4 video Temporal Scalability tool in the Core Profile.
The test was
performed using the "Single Stimulus" method. The subjects were to evaluate how annoying the
impairments of
compressed sequences were, with and without use of temporal scalability. The test was conducted using a
total of 45 subjects
in two different laboratories, and the results showed that the quality of sequences encoded using MPEG-4 temporal scalability tools is comparable to the quality of sequences encoded without temporal scalability. Furthermore,
it is evident that
the Temporal Scalability tool in Core Profile provides better quality than the simulcast coding in Core
Profile for that
condition.
MPEG-4 audio technology is composed of many coding tools. Verification
tests have focused on small sets
of coding tools that are appropriate in one application arena, and hence can be effectively
compared. Since compression
is a critical capability in MPEG, the verification tests have for the most part compared coding
tools operating at similar
bit rates. The results of these tests will be presented progressing from higher bit rate to lower
bit rates. The exception
to this is the error robustness tools, whose performance will be noted at the end of this section.
The primary purpose of verification tests
is to report the subjective
quality of a coding tool operating at a specified bit rate. Most audio tests report this on the
subjective impairment scale. This is a continuous 5-point scale, with subjective anchors ranging from 1 ("very annoying") to 5 ("imperceptible").
The performance of the various MPEG-4 coding tools is summarized in the following table. To better enable the evaluation of MPEG-4 technology, several coders from MPEG-2 and the ITU-T were included in the tests, and their evaluation has also been included in the table.
In the table below, results from the same test can be directly compared. Results taken from different tests should not be compared, but nevertheless give an indication of the expected quality of a coding tool operating at a specific bit rate.
Coding tool | Number of channels | Total bit rate | Typical subjective quality
AAC | 5 | 320 kb/s | 4.6
1995 Backward Compatible MPEG-2 Layer II | 5 | 640 kb/s | 4.6
AAC | 2 | 128 kb/s | 4.8
AAC | 2 | 96 kb/s | 4.4
MPEG-2 Layer II | 2 | 192 kb/s | 4.3
MPEG-2 Layer III | 2 | 128 kb/s | 4.1
AAC | 1 | 24 kb/s | 4.2
Scalable: CELP base and AAC enhancement | 1 | 6 kb/s base, 18 kb/s enh. | 3.7
Scalable: Twin VQ base and AAC enhancement | 1 | 6 kb/s base, 18 kb/s enh. | 3.6
AAC | 1 | 18 kb/s | 3.2
G.723 | 1 | 6.3 kb/s | 2.8
Wideband CELP | 1 | 18.2 kb/s | 2.3
BSAC | 2 | 96 kb/s | 4.4
BSAC | 2 | 80 kb/s | 3.7
BSAC | 2 | 64 kb/s | 3.0
AAC LD (20 ms one-way delay) | 1 | 64 kb/s | 4.4
G.722 | 1 | 32 kb/s | 4.2
AAC LD (30 ms one-way delay) | 1 | 32 kb/s | 3.4
Narrowband CELP | 1 | 6 kb/s | 2.5
Twin VQ | 1 | 6 kb/s | 1.8
HILN | 1 | 16 kb/s | 2.8
HILN | 1 | 6 kb/s | 1.8
Coding tools were tested under circumstances that assessed their strengths. The salient features of the MPEG-4 audio coding tools are briefly noted here.
When coding 5-channel material at 64 kb/s/channel (320 kbit/s), Advanced Audio Coding (AAC) Main Profile was judged to have "indistinguishable quality" (relative to the original) according to the EBU definition. When coding 2-channel material at 128 kbps, both AAC Main Profile and AAC Low Complexity Profile were judged to have "indistinguishable quality" (relative to the original) according to the EBU definition.
The two scalable coders, CELP base with AAC enhancement and TwinVQ base with AAC enhancement, both performed better than an AAC "multicast" operating at the enhancement layer bitrate, but not as well as an AAC coder operating at the total bitrate.
The wideband CELP coding tool showed excellent performance for speech-only signals. (The verification test result shown is for both speech and music signals.)
Bit Slice Arithmetic Coding (BSAC) provides very fine-step bitrate scalability. At the top of the scalability range it has no penalty relative to single-rate AAC; at the bottom of the range it has a slight penalty relative to single-rate AAC.
Relative to normal AAC, Low Delay AAC (AAC LD) provides equivalent subjective quality, but with very low one-way delay and only a slight increase in bit rate.
Narrowband CELP, TwinVQ and Harmonic Individual Lines and Noise (HILN) all have the ability to provide very high signal compression.
The Error Robustness (ER) tools provide equivalently good error robustness over a wide range of channel error conditions, and do so with only a modest overhead in bit rate. Verification test results suggest that the ER tools used with an audio coding system provide performance in error-prone channels that is "nearly as good" as the same coding system operating over a clear channel.
Executive Overview
Table of Contents
1. Scope and features of the MPEG-4 standard
1.1 Coded representation of media objects
1.2 Composition of media objects
1.3 Description and synchronization of streaming data for media objects
1.4 Delivery of streaming data
1.5 Interaction with media objects
1.6 Management and Identification of Intellectual Property
2. Major Functionalities in MPEG-4 Version 1
2.1 DMIF
2.2 Systems
2.3 Audio
2.4 Visual
2.4.1 Formats Supported
2.4.2 Compression Efficiency
2.4.3 Content-Based Functionalities
2.4.4 Scalability of Textures, Images and Video
2.4.5 Shape and Alpha Channel Coding
2.4.6 Robustness in Error Prone Environments
2.4.7 Face Animation
2.4.8 Coding of 2-D Meshes with Implicit Structure
3. Major Functionalities in MPEG-4 Version 2
3.1 Systems
3.2 Visual
3.2.1 Natural Video
3.2.2 Body Animation
3.2.3 Coding of 3-D Polygonal Meshes
3.3 Audio
3.4 DMIF
3.4.1 Support to mobile networks
3.4.2 QoS Monitoring
3.4.3 User Commands with ACK
3.4.4 Management of MPEG-4 Sync Layer information
3.4.5 DAI syntax in C language
4. Extensions to MPEG-4 beyond Version 2
4.1 Visual
4.2 Systems
4.2.1 Advanced BIFS
4.2.2 Textual Format
4.2.3 Advanced Synchronization Model
4.2.4 2D and 3D animation
5. Profiles in MPEG-4
5.1 Visual Profiles
5.2 Audio Profiles
5.3 Graphics Profiles
5.4 Scene Description Profiles
5.5 MPEG-J Profiles
5.6 Object Descriptor Profile
6. Verification Testing: checking MPEG's performance
6.1 Video
6.1.1 Coding Efficiency Tests
6.1.1.1 Low and Medium Bit rates (version 1)
6.1.1.2 Content Based Coding (version 1)
6.1.1.3 Advanced Coding Efficiency (ACE) Profile (version 2)
6.1.2 Error Robustness Tests
6.1.2.1 Simple Profile (version 1)
6.1.2.2 Advanced Real-Time Simple (ARTS) Profile (version 2)
6.1.3 Temporal Resolution Stability Test
6.1.3.1 Advanced Real-Time Simple (ARTS) Profile (version 2)
6.1.4 Scalability Tests
6.1.4.1 Simple Scalable Profile (version 1)
6.1.4.2 Core Profile (version 1)
6.2 Audio
The MPEG-4 Industry Forum is a not-for-profit organization with the following goal: To further the adoption of the MPEG-4 Standard, by establishing MPEG-4 as an accepted and widely used standard among application developers, service providers, content creators and end users.
The following is a non-exhaustive excerpt from M4IF's Statutes about the way of operation:
The purpose of M4IF shall be pursued by: promoting MPEG-4, making available information on MPEG-4, making available MPEG-4 tools or giving information on where to obtain these, creating a single point for information about MPEG-4, creating industrial focus around the usage of MPEG-4
The goals are realized through the open international collaboration of all interested parties, on reasonable terms applied uniformly and openly. M4IF will contribute the results of its activities to appropriate formal standards bodies if applicable.
The business of M4IF shall not be conducted for the financial profits of its Members but for their mutual benefits.
Any corporation and individual firm, partnership, governmental body or international organization supporting the purpose of M4IF may apply for Membership.
Members are not bound to implement or use specific technology standards, or recommendations by virtue of participation in M4IF.
The initial membership fee is set at US