

Title:
MEDIA STREAM SCALING
Document Type and Number:
WIPO Patent Application WO/2007/035151
Kind Code:
A1
Abstract:
An apparatus for forming a packetized scalable media stream includes a scalability information extractor (18) determining a media scalability description as well as a content type identifier (22) determining a media content preference description. A mapper (20) maps the scalability description and the content preference description into an importance identifier included in and controlling the scalability of the media stream.

Inventors:
TALEB ANISSE (SE)
SVEDBERG JONAS (SE)
TAKACS ATTILA (HU)
Application Number:
PCT/SE2006/001056
Publication Date:
March 29, 2007
Filing Date:
September 15, 2006
Assignee:
ERICSSON TELEFON AB L M (SE)
TALEB ANISSE (SE)
SVEDBERG JONAS (SE)
TAKACS ATTILA (HU)
International Classes:
H04L29/06; H04N7/24
Domestic Patent References:
WO 2002/035844 A2, 2002-05-02
WO 2002/056563 A2, 2002-07-18
Foreign References:
US 2003/0135631 A1, 2003-07-17
US 2004/0194142 A1, 2004-09-30
Other References:
AHMED, T.; MEHAOUA, A.; BOUTABA, R.; IRAQI, Y.: "Adaptive packet video streaming over IP networks: a cross-layer approach", IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, vol. 23, no. 2, February 2005 (2005-02-01), pages 385-401, XP002413053
JITAE SHIN ET AL: "Quality-of-Service Mapping Mechanism for Packet Video in Differentiated Services Network", IEEE TRANSACTIONS ON MULTIMEDIA, vol. 3, no. 2, June 2001 (2001-06-01), XP011036246, ISSN: 1520-9210
Attorney, Agent or Firm:
AROS PATENT AB (Uppsala, SE)
Claims:
CLAIMS

1. A method of forming a packetized scalable media stream, including the steps of determining a media scalability description; determining a media content preference description; mapping said scalability description and said content preference description into an importance identifier included in and controlling the scalability of said media stream.

2. The method of claim 1, wherein said content preference description is determined from encoded media frames.

3. The method of claim 2, wherein said content preference description is determined from headers of said media frames.

4. The method of claim 3, wherein a video content preference description is determined from Network Abstraction Layer Unit (NALU) headers.

5. The method of claim 3, wherein an audio content preference description is determined from Real-time Transport Protocol (RTP) headers.

6. The method of claim 1, wherein said content preference description is determined from a content description file associated with said media stream.

7. The method of claim 1, wherein said importance identifier is stored in an Internet Protocol (IP) header.

8. The method of claim 7, wherein said importance identifier is stored in the Differentiated Services (DS) field of said Internet Protocol (IP) header.

9. The method of claim 1, wherein said importance identifier is stored in an Ethernet header.

10. The method of claim 9, wherein said importance identifier is stored in the Priority field of said Ethernet header.

11. The method of any of the preceding claims, wherein said importance identifier is mapped to the classes of a quality of service architecture.

12. An apparatus for forming a packetized scalable media stream, including a scalability information extractor (18) determining a media scalability description; a content type identifier (22) determining a media content preference description; a mapper (20) mapping said scalability description and said content preference description into an importance identifier included in and controlling the scalability of said media stream.

13. The apparatus of claim 12, wherein said content type identifier (22) determines said content preference description from encoded media frames.

14. The apparatus of claim 13, wherein said content type identifier (22) determines said content preference description from headers of said media frames.

15. The apparatus of claim 14, wherein said content type identifier (22) determines a video content preference description from Network Abstraction Layer Unit (NALU) headers.

16. The apparatus of claim 14, wherein said content type identifier (22) determines an audio content preference description from Real-time Transport Protocol (RTP) headers.

17. The apparatus of claim 12, wherein said content type identifier (22) determines said content preference description from a content description file (26) associated with said media stream.

18. The apparatus of claim 12, wherein said importance identifier is stored in an Internet Protocol (IP) header.

19. The apparatus of claim 18, wherein said importance identifier is stored in the Differentiated Services (DS) field of said Internet Protocol (IP) header.

20. The apparatus of claim 12, wherein said importance identifier is stored in an Ethernet header.

21. The apparatus of claim 20, wherein said importance identifier is stored in the Priority field of said Ethernet header.

22. The apparatus of any of the preceding claims 12-21, wherein said importance identifier is mapped to the classes of a quality of service architecture.

23. A packetized media stream scaling method, including the steps of receiving packets with a content dependent importance identifier; scaling packets based on the value of said importance identifier.

24. A media stream scaling apparatus, including an input for receiving packets with an importance identifier; an importance identifier extractor (32) for extracting said importance identifier; a scaling unit (30) for scaling packets based on the value of said importance identifier.

25. A network node receiving and forwarding a packetized scalable media stream, including a media stream scaling apparatus in accordance with claim 24.

26. A packetized scalable media stream, wherein packets include a content dependent importance identifier controlling the scalability of said media stream.

Description:

MEDIA STREAM SCALING

TECHNICAL FIELD

The present invention relates to flexible scaling of packetized media streams.

BACKGROUND

The need for offering voice and video services over packet-switched networks has been increasing dramatically and is today stronger than ever. Considerable effort at diverse standardization bodies is being mobilized to define efficient solutions for the delivery of delay-sensitive content to users. Noticeably, two major challenges still await solutions. First, the diversity of deployed networking technologies and user devices implies that the same service offered to different users may have different user-perceived quality due to the different properties of the transport networks. Hence, quality mechanisms are necessary to adapt services to the actual transport characteristics. Second, the properties of wireless links provide no favorable conditions for conversational and streaming services. Over a wireless link the available bandwidth and the transmission and transport delay vary severely. These highly unpredictable variations make it difficult to deliver real-time traffic, which fundamentally demands constantly available bandwidth and predictable delay. To efficiently address these issues, adaptive service delivery is mandatory.

Today, scalable audiovisual and, in general, media content codecs are available; in fact, scalability was one of the early design guidelines of MPEG. However, although these codecs are attractive due to their functionality, they lack the efficiency to operate at the low bit rates required by current mass-market wireless devices. With the high penetration of wireless communications, more sophisticated scalable codecs are needed.

Despite the tremendous efforts being put into adaptive services and scalable codecs, scalable services will not happen unless more attention is given to transport issues. Therefore, besides efficient codecs, an appropriate network architecture and transport framework must be considered as an enabling technology to fully utilize scalability in service delivery. Basically, three scenarios can be considered:

• Adaptation at the end-points. That is, if a lower transmission rate must be chosen, the sending side is informed and it performs scaling or codec changes.

• Adaptation at intermediate gateways. If a part of the network becomes congested, or has a different service capability, a dedicated network entity performs a transcoding of the service. With scalable codecs this could be as simple as dropping or truncating media frames.

• Adaptation inside the network. If a router or wireless interface becomes congested, adaptation is performed right at the place of the problem by dropping or truncating packets. This is a desirable solution for transient problems like handling of severe traffic bursts or channel quality variations of wireless links.

Recently proposed scalable codecs for speech, audio and video will be briefly described in the next few paragraphs. The concepts of the proposed methods for scalable service delivery will also be highlighted.

Audio coding (Non-conversational, streaming/download)

In general, the current audio research trend is to improve the compression efficiency at low bit rates (providing good enough stereo quality at bit rates below 32 kbps). Recent low bit-rate audio improvements are the finalization of the Parametric Stereo (PS) tool development in MPEG and the standardization of a mixed Code Excited Linear Prediction (CELP)/transform codec, "Extended Adaptive Multi-Rate Wideband" (a.k.a. AMR-WB+), in the 3rd Generation Partnership Project (3GPP). There is also an ongoing MPEG standardization activity around Spatial Audio Coding (surround/5.1 content), where a first reference model (RM0) has been selected.

With respect to scalable audio coding, recent standardization efforts in MPEG have resulted in a scalable-to-lossless extension tool, MPEG4-SLS. MPEG4-SLS provides progressive enhancements to the core Advanced Audio Coding/Bit Slice Arithmetic Coding (AAC/BSAC) all the way up to lossless coding, with a granularity step down to 0.4 kbps. An overview of the MPEG4 toolset can be found in [15]. Furthermore, within MPEG a Call for Information (CfI) was issued in January 2005 [6] targeting the area of scalable speech and audio coding. The key issues addressed in the CfI are scalability, consistent performance across content types (e.g. speech and music) and encoding quality at low bit rates (< 24 kbps).

Speech coding (conversational mono)

In general speech compression, the latest standardization effort is an extension of the 3GPP2 Variable-Rate Multimode Wideband (VMR-WB) codec to also support operation at a maximum rate of 8.55 kbps. In ITU-T, the multirate G.722.1 audio/video conferencing codec has been extended with two new modes providing super-wideband (14 kHz audio bandwidth, 32 kHz sampling) capability, operating at 24, 32 and 48 kbps.

With respect to scalable conversational speech coding, the main standardization effort is taking place in ITU-T (Working Party 3, Study Group 16). There, the requirements for a scalable extension of G.729 were defined recently (Nov. 2004), and the qualification process ended in July 2005. This new G.729 extension will be scalable from 8 to 32 kbps with at least 2 kbps granularity steps from 12 kbps. The main target application for the G.729 scalable extension is conversational speech over shared and bandwidth-limited xDSL links, i.e. the scaling is likely to take place in a Digital Residential Gateway that passes the Voice over IP (VoIP) packets through specific controlled voice channels (Vc's). ITU-T is also in the process of defining the requirements for a completely new scalable conversational codec in SG16/WP3/Question 9. The requirements for the Q.9/Embedded Variable rate (EV) codec were finalized in July 2005; currently the Q.9/EV requirements state a core rate of 8.0 kbps and a maximum rate of 32 kbps. The Q.9/EV core is not restricted to narrowband (8 kHz sampling) like the G.729 extension will be, i.e. Q.9/EV may provide wideband (16 kHz sampling) from the core layer onwards.

Basically, audio scalability can be achieved by:

• Changing the quantization of the signal, i.e. SNR-like scalability.

• Extending or tightening the bandwidth of the signal.

• Dropping audio channels (e.g., mono consists of 1 channel, stereo of 2 channels, surround of 5 channels). This is called spatial scalability.

Currently available is the fine-grained scalable audio codec AAC-BSAC. It can be used for both audio and speech coding, and it allows for bit-rate scalability in small increments. It produces a bit-stream which can be decoded even if certain parts of the stream are missing. There is a minimum requirement on the amount of data that must be available to permit decoding of the stream; this is referred to as the base layer. The remaining bits correspond to quality enhancements and are hence referred to as enhancement layers. AAC-BSAC supports enhancement layers of around 1 kbps/channel or smaller for audio signals.

Referring to [6]: "To obtain such fine grain scalability, a bit-slicing scheme is applied to the quantized spectral data. First the quantized spectral values are grouped into frequency bands, each of these groups containing the quantized spectral values in their binary representation. Then the bits of the group are processed in slices according to their significance and spectral content. Thus, first all most significant bits (MSB) of the quantized values in the group are processed and the bits are processed from lower to higher frequencies within a given slice. These bit-slices are then encoded using a binary arithmetic coding scheme to obtain entropy coding with minimal redundancy."

Furthermore, in [6] it is stated that: "With an increasing number of enhancement layers utilized by the decoder, providing more LSB information refines quantized spectral data. At the same time, providing bit-slices of spectral data in higher frequency bands increases the audio bandwidth. In this way, quasi-continuous scalability is achievable."
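The bit-slicing principle quoted above can be illustrated in a few lines. The following Python sketch covers only the MSB-first slicing; the frequency-band grouping and the binary arithmetic coding stage of the real AAC-BSAC scheme are omitted, and all values are illustrative.

def bit_slices(quantized, num_bits=4):
    # Yield one slice per significance level, most significant bits first.
    # Within a slice, bits run from lower to higher frequencies.
    for significance in range(num_bits - 1, -1, -1):
        yield [(value >> significance) & 1 for value in quantized]

# Four quantized spectral values of one group, low to high frequency.
group = [5, 12, 3, 9]
for level, bit_slice in enumerate(bit_slices(group)):
    print("slice", level, bit_slice)  # slice 0 holds the MSBs

Transmitting more slices of low-frequency groups corresponds to SNR refinement, while adding slices of higher frequency bands extends the audio bandwidth, matching the two scalability dimensions described next.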

That is, scalability can be achieved in a two-dimensional space. Quality, corresponding to a certain signal bandwidth, can be enhanced by transmitting more LSBs, or the bandwidth of the signal can be extended by providing more bit-slices to the receiver. Moreover, a third dimension of scalability is available by adapting the number of channels available for decoding. For example, surround audio (5 channels) could be scaled down to stereo (2 channels), which in turn can be scaled down to mono (1 channel) if, e.g., transport conditions make it necessary.

Video coding

The H.264/MPEG-4 Advanced Video Codec (AVC) is the current state-of-the-art in video coding [1]. Technically, the design of H.264/MPEG-4 AVC is based on the traditional concept of hybrid video coding, using motion-compensated temporal and spatial prediction in conjunction with block-based residual transform coding. Within this framework, H.264/MPEG-4 AVC contains a large number of innovative technical features, both in terms of improved coding efficiency and network friendliness [2]. Recently, a new standardization initiative has been launched by the Joint Video Team of ITU-T VCEG and ISO/IEC MPEG with the objective of extending the H.264/MPEG-4 AVC standard towards scalability [3, 4]. The scalability extensions of H.264/MPEG-4 AVC are referred to as Scalable Video Coding (SVC). The targeted scalability functionality should allow the removal of parts of the bit-stream while achieving reasonable coding efficiency of the decoded video at reduced SNR, temporal or spatial resolution. Conceptually, a scalable bit-stream consists of a base or core layer and one or more nested enhancement layers.

Basically, video scalability can be achieved by:

• Changing the quality of the video frames (SNR scalability).

• Reducing or increasing the frame-rate (temporal scalability).

• Changing the resolution of the sequence (spatial scalability).

The scalability enhancements of the H.264 video codec are described in [7, 8]. A scalability enhancement of the H.264 RTP header format is also proposed. In [5], an RTP payload format for the H.264 video codec is specified. The RTP payload format allows for packetization of one or more Network Abstraction Layer Units (NALUs) in each RTP payload, see [9]. NALUs are the basic transport entities of the H.264/AVC framework. With the introduction of the Scalable Video Coding (SVC) extension, a new NALU header extension is proposed. The first three bits (L2, L1, L0) indicate a layer. Layers are used to increase the spatial resolution of a scalable stream. For example, slices corresponding to Layer-0 describe the scene at a certain resolution. If an additional set of Layer-1 slices is available, the scene can be decoded at a higher spatial resolution. The next three bits (T2, T1, T0) indicate a temporal resolution. Slices assigned to temporal resolution 0 (TR-0) correspond to the lowest temporal resolution, that is, only I-frames are available. If TR-1 slices are also available, the frame rate can be increased (temporal scalability). The last two bits (Q1, Q0) specify a quality level (QL). QL-0 corresponds to the lowest quality. If additional QL slices are available, the quality can be increased (SNR scalability). Based on the information contained in the header, network entities, e.g., routers, Radio Network Controllers (RNCs), Media Gateways (MGWs), etc., can discard packets during congestion or unfavorable wireless channel conditions.
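As a simple illustration, these eight bits can be unpacked as follows; the Python sketch assumes, purely for illustration, that they occupy a single octet laid out MSB-first as L2 L1 L0 T2 T1 T0 Q1 Q0.

FIELDS = ["L2", "L1", "L0", "T2", "T1", "T0", "Q1", "Q0"]

def unpack_scalability_octet(octet):
    # Return each scalability flag of the (assumed) header octet.
    return {name: (octet >> (7 - i)) & 1 for i, name in enumerate(FIELDS)}

# Example: a Layer-1 slice at the lowest temporal resolution with one
# SNR enhancement, i.e. L1=1, T0=1, Q1=1.
print(unpack_scalability_octet(0b01000110))

A network element only needs this kind of shallow bit test, rather than a full parse of the payload, which is the property the mapping described below exploits.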

In the past years, several methods have been developed for efficient and adaptive audiovisual content delivery. As a result, the necessary elements of a sophisticated content delivery infrastructure are available. However, there was, and is, a need for a conceptual architecture that specifies, or at least considers, the relation and interoperation of the individual elements of the delivery chain. The aim of the MPEG-21 Media Adaptation Framework is to describe how these various elements fit together. "The vision for MPEG-21 is to define a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities.", see [13]. Today, the scope includes adaptation to terminals or networks. However, there is also a growing trend towards user-centric multimedia content adaptation. The key components of adaptation are: (i) the adaptation engine, and (ii) standardized descriptors for adaptation.

The adaptation engine has the role of bridging the gap between media format, terminal, network, and user characteristics. Content providers permit access to multimedia content through various connections such as Internet, Ethernet, DSL, W-LAN, cable, satellite, and broadcast networks. Moreover, users with various terminals such as desktop computers, handheld devices, mobile phones, and TV sets are allowed to access the content. This high diversity in content delivery to various users demands a system that resolves the complexity of service provisioning, service delivery, and service access.

For adaptation of the content to the user, three types of descriptions are necessary, namely a multimedia content description, a service provider environment description and a user environment description. To allow for wide deployment and good interoperability, these descriptors must follow a standardized form. While the MPEG-7 standard plays a key role in content description [11, 12], the MPEG-21 standard, especially Part 7, Digital Item Adaptation (DIA), provides, in addition to standardized descriptions, tools for adaptation engines as well, see [13].

One goal of MPEG-21 DIA is to provide standardized descriptions and tools that can be used by adaptation engines. DIA adaptation tools are divided into seven groups. In the following we highlight the most relevant groups.

Usage Environment Description Tools: The usage environment includes the description of user characteristics and preferences, terminal capabilities, network characteristics and limitations, and natural environment characteristics.

The standard provides means to specify the preferences of the user related to the type and content of the media. These can be used to specify the interests of the user, e.g., in sport events or in movies featuring a certain actor. Based on the usage preference information, a user agent can search for appropriate content or may call the user's attention to relevant multimedia broadcast content.

The user can set the "Audio Presentation Preferences" and the "Display Presentation Preferences". These descriptors specify certain properties of the media, like audio volume and color saturation, which reflect the preferred rendering of multimedia content.

With the "Conversion Preference" and "Presentation Priority Preference" descriptors the user can guide the adaptation process. For example, a user might be interested in high-resolution graphics even if this would require the loss of video contents.

With "Focus of Attention" the user can specify the most interesting part of the multimedia content. For example, a user might be interested in the news in text form rather than in video form. In this way the text part might be rendered to a larger portion of the display while the video playback resolution is severely reduced or even neglected.

Bit-stream Syntax Description (BSD): The BSD describes the syntax (high-level structure) of a binary media resource. Based on the description, the adaptation engine can perform the necessary adaptation, as all required information about the bit-stream is available through the description. The description is based on the XML language. This makes the description very flexible but, on the other hand, results in a quite extensive specification. (Examples of using BSD for multimedia resource adaptation are available in [17].)

Terminal and Network Quality of Service: Descriptors are specified that aid the adaptation decisions at the adaptation engine. The adaptation engine has the task of finding the best trade-off among network and terminal constraints, feasible adaptation operations satisfying these constraints, and the quality degradation associated with each adaptation operation. The main constraints in media resource adaptation are bandwidth and computation time. Adaptation methods include frame dropping and/or coefficient dropping, requantization, MPEG-4 Fine-Granular Scalability (FGS), wavelet reduction, and spatial size reduction.

The focus of the work carried out is on adaptation at dedicated network elements, where a sophisticated adaptation engine is located. On the other hand, with the widespread use of wireless links, transport networks are becoming an increasingly important contributor to multimedia quality degradation. To ease the problem, methods within the transport network are necessary to handle transient resource problems. MPEG-21 and MPEG-7 only provide a general framework for adaptation operations.

In [18], a system and a method for delivery of scalable media are described. The system is based on rate-distortion packet selection and organization. The method used by the system consists of scanning the encoded scalable media and scoring each data unit based on a rate-distortion score. The scored data units are then organized from the highest to the lowest score into network packets, which are transmitted to the receiver based on the available network bandwidth. Although this scheme has the clear advantage of prioritizing important data over non-important data, it has the drawback that for proper operation it needs a back channel that signals the status of the network, which prevents its usage in broadcast scenarios. Also, the ordering of data units in a packet is done once and for all at the sender side. It is hence static in the sense that the ordering inside a certain packet is fixed during transmission. This kind of static ordering prevents intelligent network nodes from performing a re-ordering depending on the actual in-place network conditions. Furthermore, the ordering of data units inherently suggests priority transmission of highly scored packets. This implies that most of the low-score packets corresponding to the same timed media, if transmitted, may arrive too late to be decoded and played back. This severely restricts the ability to successfully reach an optimal compromise.

Currently, the adaptation capabilities of audio and video codecs are mainly encoded in extensive XML sheets or put into RTP extension headers. This is, however, not adequate from the network transport point of view, since processing the XML sheets or digging deep into upper-layer protocol headers for service adaptation requires too much processing and implies significant delays. Such a solution may be acceptable if adaptation is performed at the end-systems, but becomes less attractive for transcoding at intermediate gateways and even unacceptable if scaling has to be performed inside the network.

Mechanisms to utilize the Differentiated Services architecture have been proposed [10, 11, 12]. In this context, mainly the Assured Forwarding (AF) Per-Hop Behavior (PHB) class was used to provision QoS and provide means for drop preference discrimination. With regard to scalability, packet loss is of special interest. To achieve bit-rate adaptation inside the network, some packets (network layer) or frames (link layer) must be discarded or truncated, while adaptation at intermediate gateways or at the sending end-point may also utilize dropping or truncation of entire media frames.

SUMMARY

An objective of the present invention is to more efficiently use a scalable source codec.

This objective is achieved in accordance with the attached claims.

Briefly, the present invention forms a packetized scalable media stream by determining a media scalability description and a media content preference description. The determined descriptions are mapped into an importance identifier included in and controlling the scalability of the media stream, thereby providing a content-aware scalability.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:

Fig. 1 is a block diagram of an apparatus for forming a packetized scalable media stream illustrating the principles of content-aware scalability mapping in accordance with the present invention;

Fig. 2 is a block diagram of a first embodiment of an apparatus for forming a packetized scalable media stream in accordance with the present invention;

Fig. 3 is a block diagram of a second embodiment of an apparatus for forming a packetized scalable media stream in accordance with the present invention;

Fig. 4 is a block diagram of a third embodiment of an apparatus for forming a packetized scalable media stream in accordance with the present invention;

Fig. 5 is a block diagram of a fourth embodiment of an apparatus for forming a packetized scalable media stream in accordance with the present invention;

Fig. 6 is a video example illustrating how an importance identifier may be stored in an IP header;

Fig. 7 is a video example illustrating how an importance identifier may be stored in an Ethernet header;

Fig. 8 is an audio example illustrating how an importance identifier may be stored in an IP header;

Fig. 9 is a block diagram of an embodiment of a media stream scaling apparatus in accordance with the present invention;

Fig. 10 is a flow chart illustrating a method of forming a packetized scalable media stream in accordance with the present invention;

Fig. 11 is a flow chart illustrating a packetized media stream scaling method in accordance with the present invention; and

Fig. 12 illustrates an example of a content-aware mapping in accordance with the present invention.

DETAILED DESCRIPTION

During congestion or troublesome wireless channel conditions, packets might be discarded or corrupted. Scalable codecs permit data loss while maintaining good user-perceived quality. However, which data (packet) is truncated or lost matters greatly. Hence, packets are assigned a certain importance for decodability. On the other hand, based on the content of the audiovisual data, user-perceived quality has different vulnerability to spatial, temporal, and quality (SNR) scalability for video, and to quality, bandwidth, and channel scalability for audio. The mapping proposed by the present invention permits a dynamic configuration of packet importance based on both content- and codec-aware data preferences. That is, the scalable audiovisual data is ordered for scaling by taking the content or context of the data into account as well. Moreover, with the use of the proposed mapping, joint prioritization of media frames of inherently related media flows is possible.

That is, joint scaling of the video and audio part of a scene can be realized taking into account the importance of the video and audio part for the user.

Generally, scalable video codecs permit scalability along three distinct dimensions: spatial, temporal, and SNR scalability, while audio codecs can be scaled by SNR, bandwidth, and channel scalability.

For example, with the video codec discussed earlier, the data rate can be reduced by dropping spatial enhancement packets (Layers), temporal resolution enhancement packets (TR), or quality enhancement packets (QL). The user-perceived quality is affected differently by these adaptation methods, and for different media content, different adaptation approaches are appropriate. For example, in the case of broadcasting of sport events, the motion may be more important than the quality of single video frames, since extensive and fast movements are likely to be present in the scenes. On the other hand, news or documentary content may contain slow motion, and the quality of the frames is likely to be more important than the frame rate used for play-out. Moreover, besides video, audio data is also an integral part of multimedia streams. That is, besides video frames, the corresponding audio sequence also has to be transported to the users. In the previous example of sport event broadcast, the audio part may also be less important than the motion, as seeing a "goal" is usually more satisfying than hearing a commentator describe one. Conversely, for news it is more important to hear the announcement than to watch someone announcing something that cannot be heard.

From the discussion above it follows that performing rate adaptation in a "clever" fashion requires more than just information about the scalability of the packets. This addition is a content-aware or context-aware scalability-mapping. The information required by the network for efficient service delivery is the combination of the audiovisual content and the codec-specific importance of packets.

A clear advantage of this addition, when compared to the prior art, is that the scalability-mapping is visible to the network and therefore can be used by network elements or nodes to adapt the content or context to bandwidth conditions at each network element or node, which is especially useful in broadcast applications.

With a sophisticated media source, the mapping of content- and codec-specific packet importance may be dynamic as well. That is, as the content changes during communication, the importance of different scalability measures may change as well. For example, in the sport broadcast scenario the motion is usually most important, but there may be certain scenes where the reporters are shown, e.g., during game breaks. In this case quality and voice may become more important than the motion, just as in the news scenario.

An important ingredient of a content-aware scalability-mapping is the information about the content and its effects on a preferable scaling implementation by the network. This information could be provided in several ways. A Content Type may be assigned to video scenes and audio sequences, or a joint Content Type description may be provided for a combined audiovisual scene (e.g., a movie or video phone). Each Content Type is associated with a Content Preference Description that specifies the relative importance of the quality, the movements, the audio part, the video part, etc., for the corresponding scene.

Fig. 1 is a block diagram illustrating the principles of content-aware scalability mapping in accordance with the present invention. An audio, video or audiovisual signal is encoded by a scalable encoder 10. Optionally, the encoded signal may be stored in a media store 12. Each encoded media frame 14 includes a header with scalability information and the actual coded data. The media frames 14 are forwarded to a unit 16 for UDP/RTP packetization. A scalability information extractor 18 extracts a Scalability Description from the header of each media frame 14 and forwards this information to a content-aware scalability mapper 20 in unit 16. A content type identifier 22 provides a Content Preference Description to the content-aware scalability mapper 20. The content-aware scalability mapper 20 uses the Scalability Description of a media frame and the Content Preference Description associated with the content type identification to perform a mapping of the possible scaling operations to an Importance Identifier in each IP packet. The Importance Identifier is used to indicate specific priorities and/or QoS classes to the underlying network.

A Content Type value is assigned to each media frame 14, although the same Content Type value may be assigned to several consecutive media frames. Thus, the Content Type may change even during a continuous media stream to address significant changes in the content or context characteristics.

As noted above, each Content Type is associated with a Content Preference Description. The purpose of a Content Preference Description is to define the mapping of the codec-specific importance of media frames to network-specific priorities or QoS classes. The objective of introducing this mapping is to permit different priority or QoS assignments to media frames with the same codec-specific importance when the context in which a frame is encoded makes a different scaling implementation more desirable from the user-perceived quality point of view.
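The following Python sketch illustrates this idea with hypothetical content types and priority values: the same codec-specific enhancement maps to different priorities depending on the Content Preference Description.

# Which scalability dimension to protect first, per (hypothetical)
# Content Type; earlier in the list = more important to preserve.
CONTENT_PREFERENCES = {
    "sports": ["temporal", "spatial", "quality"],
    "news":   ["quality", "spatial", "temporal"],
}

def importance(content_type, enhanced_dimension):
    # Lower value = higher priority; base-layer frames are most important.
    if enhanced_dimension is None:
        return 0
    order = CONTENT_PREFERENCES[content_type]
    return 1 + order.index(enhanced_dimension)

# The same temporal-enhancement frame lands on different priorities:
print(importance("sports", "temporal"))  # 1: protected for sports
print(importance("news", "temporal"))    # 3: sacrificed first for news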

The Content Type identification may be performed in several ways. For example, the Content Type may be assigned to the stream in a separate description file 24, as illustrated by the embodiment in Fig. 2. This file may either be stored in media store 12 or in separate storage.

If no a priori information about the Content Type is available, it may also be calculated based on the media stream itself, as illustrated by the embodiment in Fig. 3. There are many methods for gathering information about media content. For example, one method is to monitor the bandwidth used to encode a certain part of the media. A high bandwidth demand suggests a highly dynamic situation (e.g. frequent movements in a video scene, or music instead of speech in an audio sequence), which may favor SNR scalability instead of temporal scalability for video, and preferably channel dropping for audio, in case of network congestion. Methods relying on media stream monitoring are typically based on such estimates and heuristic rules.

The embodiment illustrated in Fig. 4 is similar to the embodiment of Fig. 3. However, in this case the information included in the headers of media frames 14 is assumed to be sufficient to estimate the relevant properties of the content. For example, if motion compensation frames require high bit rates, then high motion is to be expected in the media. Hence, protecting these frames is a feasible solution. To derive a more accurate guess of the content, the properties of several frames could be incorporated in the decision about the actual Content Type association.

In the embodiment illustrated in Fig. 5, the Content Preference Description from Content Type Identifier 22 depends not only on the actually detected Content Type, but also on a Content Type service binding from a storage 26 that associates different Content Types with different services. For example:

• The Content Type may be assigned based on user defined preferences.

• It may be assigned in advance to certain services, like telephony or video conferencing. In such an embodiment the mapping will use the same Content Type value for all media frames associated with that service.

• It may be assigned to certain service-contexts. That is, for example, in a video telephony service the context when the user is speaking may be assigned a different Content Type than the context when the user is listening.

A service binding storage 26 may optionally be combined with the other previously described embodiments.
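Conceptually, such a service binding is a simple lookup table. The following Python sketch uses hypothetical service, context and Content Type names.

# Hypothetical Content Type service binding (storage 26): a static table
# from a service, or a (service, context) pair, to a Content Type value.
SERVICE_BINDING = {
    ("telephony", None): "speech",
    ("video telephony", "user speaking"): "talking head",
    ("video telephony", "user listening"): "listening scene",
}

def bound_content_type(service, context=None):
    return SERVICE_BINDING.get((service, context), "default")

print(bound_content_type("video telephony", "user speaking"))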

Content-aware scalability-mapping may also be used to allow content-based relative differentiation among services in a network domain. That is, certain Content Types may be used, for example, in gold services, while others are used for silver and bronze differentiation. In this way, not only are the packets of a single service scaled based on the actual content, but content-based differentiation is also realized among different services and media streams.

Content-aware scalability-mapping may be used to derive importance identifiers for audiovisual media that require joint handling to increase the user-perceived quality. Such media include movies and broadcast events where video is accompanied by audio data. In this context the mapping must consider, besides the distinct scaling of video and audio data, the coherence of the audio and video streams. For example, under severe conditions the entire loss of audio may be acceptable for sport events, while for video conferencing the loss or degradation of the video signal is more appropriate.

The content-aware scalability-mapping to produce an importance identifier may be applied at the media source, as described in the above embodiments, or at any appropriate network entity (node). This identifier may be encoded in an optional IP header extension or a Differentiated Services code-point, or it may be signaled out-of-band with an appropriate signaling protocol.

If a media frame is packetized into more than one IP packet, all the packets are classified according to the importance identifier derived for the corresponding media frame.

The importance identifier should be examined by network elements such as switches, routers, RNCs and MGWs when they predict or currently experience undesirable conditions. Undesirable conditions include, but are not limited to, network congestion, buffer overflow, and undesirable wireless channel conditions.

Network elements may initiate local adaptation procedures to prevent or recover from undesirable conditions. That is, packets or in general data may be treated differently, e.g., discarded based on the value of the importance identifier.

Intermediate gateways may initiate transcoding procedures to prevent or recover from undesirable conditions. That is, packets or in general data may be treated differently, e.g., discarded based on the value of the importance identifier.

The lower layers (i.e., below the Application Layer) at the source endpoint may also initiate adaptation procedures to prevent or recover from undesirable conditions. That is, packets or in general data may be treated differently, e.g., discarded based on the value of the importance identifier.

A video example:

Fig. 6 is a video example illustrating how an importance identifier may be stored in an IP header. As mentioned earlier, network entities can make use of the information encoded in the newly proposed NALU header extension for H.264/AVC SVC. However, to access this information, network elements must dig deep into higher-layer protocol headers. For example, routers operate at the Network Layer and can therefore easily access the IP header. However, there are additional protocol headers in the packet which hinder direct access to the desired information. The protocol stack is as follows: IP/UDP/RTP/NALU. One option for network elements to scale the media traffic by accessing the NALU header would be to parse higher-layer headers, but this has the disadvantage, among others, of introducing a lot of processing overhead and reducing robustness and transparency. Instead, a mapping is used to calculate an importance identifier based on the NALU header extension NALU-H and the content of the stream, to derive a Network or Link Layer specific priority or QoS class association. For example, if motion is more important for the perception of the current scene than quality, the importance identifier could be set to L2,L1,L0,T2,T1,T0,Q1,Q0, where a higher identifier corresponds to a lower drop preference or probability. If quality is more important, then the value L2,L1,L0,Q1,Q0,T2,T1,T0 could be used as the importance identifier. As illustrated in Fig. 6, the importance identifier may be stored in the Differentiated Services (DS) field of the IP header IP-H. The importance identifier may, for example, be encoded in accordance with the Assured Forwarding (AF) services defined by the Internet Engineering Task Force (IETF). AF defines four priority classes with three drop precedence levels for each class. Thus, it is possible to differentiate between, for example, conversational, audio and video traffic. The video class may then have three distinct packet drop precedence levels. The information in the NALU header and the content description may thus be used to map video packets into one of these three drop precedence levels.
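As an illustration, a sender could place such an identifier in the DS field through the standard socket API, as in the following Python sketch. The importance-to-codepoint table is an assumption made here; the codepoints are the RFC 2597 values AF11-AF13, and the IP_TOS option is available on platforms such as Linux.

import socket

# Assumed mapping: importance 0 (highest priority) -> AF11 (lowest drop
# precedence), down to importance 2 -> AF13 (dropped first).
AF_VIDEO_DSCP = {0: 0b001010, 1: 0b001100, 2: 0b001110}

def send_with_importance(sock, payload, addr, importance):
    dscp = AF_VIDEO_DSCP[importance]
    # The DSCP occupies the upper six bits of the former TOS octet.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    sock.sendto(payload, addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_with_importance(sock, b"media frame fragment", ("192.0.2.1", 5004), 1)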

Fig. 7 is a video example illustrating how an importance identifier may be stored in an Ethernet header. This header includes a Tag Control Information (TCI) field having a Priority field (3 bits) and a Drop Eligible field (1 bit). These fields may be used to map the scalability description and content preference description to an importance identifier representing specific service classes. The 3 priority bits may be used to encode up to 8 service classes. The Drop Eligible bit may be used in conjunction with the priority classes to mark frames that should be dropped first.
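The following Python sketch packs these fields into the 16-bit TCI word; the VLAN identifier is included only because the TCI also carries it, and its value here is arbitrary.

def make_tci(priority, drop_eligible, vlan_id):
    # TCI layout: 3-bit Priority, 1-bit Drop Eligible, 12-bit VLAN id.
    assert 0 <= priority < 8 and 0 <= vlan_id < 4096
    return (priority << 13) | (int(drop_eligible) << 12) | vlan_id

tci = make_tci(priority=5, drop_eligible=True, vlan_id=100)
print(hex(tci))  # 0xb064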

It is also possible to combine the embodiments of Figs. 6 and 7, i.e. to map the scalability description and content preference description to obtain content-aware coding within specific service classes.

An audio example:

The draft RTP transport format update for MPEG-4 AAC-BSAC [14] proposes a mode for RFC 3640 to support the MPEG-4 AAC-BSAC audio codec format with an optional attached bit-stream description. The bit-stream description employs the MPEG-21 generalized Bitstream Syntax Description Language (gBSDL). The description is attached as an auxiliary header and can be used to support adaptation. An example gBSDL description is given in APPENDIX 1 (corresponding to Example 2 from Annex C in reference [16]). The gBSDL conveys information on how the AAC-BSAC layering is composed.

An accompanying Adaptation QoS Description in XML is given in APPENDIX 2 (corresponding to the example from Annex B of reference [16]). The "AdaptationQoS" conveys information on the bit rates of the layers ("BANDWIDTH"), the channel dropping possibility ("NUMBER_OF_CHANNELS"), and, for SNR scalability, the "ODG" (Objective Difference Grade) relation given in the ODG operator.

Fig. 8 is an audio example illustrating how an importance identifier may be stored in an IP header. Assume that a version of the gBSDL Scalability Description (APPENDIX 1) is transported in each RTP frame according to [14] (note: the chatty gBSD could be compressed). An example solution would be to use only the gBSD Scalability Description transported in the RTP frames, together with a statistical analysis of the gBSD values from previous RTP frames. This is required since the example gBSD Scalability Description does not reveal any details about what specific enhancement is provided by each layer (SNR, bandwidth, channels/spatial). Thus, the Importance Identifier Mapping cannot rely on the intra-RTP-packet information alone to set the importance (priority).

To set the Importance Identifier, one may instead use the rate of change of some features of the Scalability Description (i.e., inter-RTP-packet information may be used). Two examples follow, with a small sketch after the list:

• Base layer change analysis: A higher number of bits allocated to the base layer than the long-term average of base-layer bits may be used as an indication of increased importance.

• Overall rate change analysis: Demanding parts of the media could be encoded with temporarily higher total bit rates, i.e. packets with a temporarily increased size may be given higher importance than other RTP packets.
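Both heuristics amount to comparing per-packet features against a long-term running average, as in the following Python sketch; the smoothing factor and the 1.5x surge thresholds are arbitrary illustration values, and here a higher returned value means a more important packet.

class ImportanceEstimator:
    def __init__(self, smoothing=0.05):
        self.smoothing = smoothing
        self.avg_base_bits = None
        self.avg_packet_size = None

    def _update(self, average, value):
        if average is None:
            return value
        return (1 - self.smoothing) * average + self.smoothing * value

    def estimate(self, base_layer_bits, packet_size):
        importance = 0
        # Base layer change analysis: a surge in base-layer bits.
        if self.avg_base_bits and base_layer_bits > 1.5 * self.avg_base_bits:
            importance += 1
        # Overall rate change analysis: a temporarily increased packet size.
        if self.avg_packet_size and packet_size > 1.5 * self.avg_packet_size:
            importance += 1
        self.avg_base_bits = self._update(self.avg_base_bits, base_layer_bits)
        self.avg_packet_size = self._update(self.avg_packet_size, packet_size)
        return importance

estimator = ImportanceEstimator()
for bits, size in [(200, 400), (210, 410), (600, 900)]:
    print(estimator.estimate(bits, size))  # 0, 0, 2: the last packet stands out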

Another example would be to use the gBSD Scalability Description transported in the RTP frame together with the external content information in the XML AdaptationQoS Description, and to use the mapping function to provide an Importance Identifier for the Network Layer (e.g., IP).

If fragmentation can be supported as well as a mapping, more elaborate intra-packet schemes are possible. Given that a channeling description file is available for the audio part (e.g. according to the QoS operator NUMBER_OF_CHANNELS in APPENDIX 2), it is possible to fragment the RTP packet according to the channel information given. For instance, a higher importance may be given to the RTP packet containing the layers giving at least mono output, and a reduced importance identifier to the fragmented packets containing higher layers.

Further, if there is external time-aligned audio information, the channeling importance could be given variable priority depending on whether the audio content is speech/news or music. E.g., if speech/news is indicated, the mono layers will get the highest Importance Identifier value (e.g., 7) and the enhancement layers will get a considerably lower Importance Identifier (e.g., 4). On the other hand, if the music content type is indicated, the mono layers will still get the highest Importance Identifier value (e.g., 7), while the enhancement layers will get a higher Importance Identifier than in the speech case (e.g., 6).

Fig. 9 is a block diagram of an embodiment of a media stream scaling apparatus in accordance with the present invention. Such an apparatus may be installed at a network node and includes a scaling unit 30, which may also include the routing function. Packets received on an input are scaled either by truncation or by discarding the entire packet (indicated by the dashed lines). The scaling is controlled by an importance identifier obtained from the IP header of the packets by an importance identifier extractor 32. As an alternative, this extractor may be included in unit 30.
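A minimal Python sketch of the scaling decision in scaling unit 30 follows, assuming the convention used later in this description that 0 denotes the highest priority; packet truncation would be handled analogously to the discard branch.

def scale_stream(packets, drop_threshold):
    # packets: iterable of (importance_identifier, payload) pairs.
    for importance, payload in packets:
        if importance < drop_threshold:
            yield payload  # forward the packet unchanged
        # else: discard (or truncate) the packet

stream = [(0, b"base layer"), (2, b"SNR enhancement"), (1, b"temporal enhancement")]
for payload in scale_stream(stream, drop_threshold=2):
    print(payload)  # forwards only the importance-0 and importance-1 packets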

The functionality of the various blocks of the described embodiments is typically achieved by one or several microprocessors or micro/signal processor combinations and corresponding software.

Fig. 10 is a flow chart illustrating the method of forming a packetized scalable media stream in accordance with the present invention. Step S1 determines a media scalability description. Step S2 determines a media content preference description. Step S3 maps the scalability description and the content preference description into an importance identifier included in and controlling the scalability of the media stream.

Fig. 11 is a flow chart illustrating a packetized media stream scaling method in accordance with the present invention. Step S4 receives packets with a content dependent importance identifier. Step S5 scales packets based on the value of the importance identifier.

Fig. 12 illustrates an example of a content-aware mapping in accordance with the present invention. This example is based on the video example described with reference to Figs. 6-7. The scalability description includes the layer identifiers L2, L1, L0, the temporal identifiers T2, T1, T0 and the SNR identifiers Q1, Q0. Each identifier is a binary digit (0 or 1), and only one identifier has the value 1 in each scalability dimension (layer, temporal, SNR). That is, a frame may have, for example, the following scalability description:

L2=0, L1=1, L0=0, T2=0, T1=0, T0=1, Q1=1, Q0=0

Assume that the content preference description includes the possible values "high motion", "high quality" and "high resolution". Furthermore, assume that the content preference description of the current frame indicates "high motion". This is used to order the different identifiers in the scalability description and form an 8-bit number (3+3+2 bits). This number could in principle be used directly as an importance identifier. However, it is likely that fewer bits are available for storing an importance identifier. Hence the 8 bits may have to be mapped into, for example, 3 groups, as indicated in Fig. 12. Here the importance identifier can only have the values 0, 1 and 2 (where 0 indicates the highest priority). In this example, frames having Q1=1 are given the lowest priority 2. Thus, for "high motion" SNR enhancements are sacrificed first. The next group (having priority 1) includes frames having temporal enhancements, i.e. T2=1 or T1=1, but no SNR enhancements, i.e. Q0=1. Thus, the two lowest priority groups include all frames with either SNR or temporal enhancements.

The third group, having the highest priority (0 in the example), includes frames in which no SNR or temporal enhancements are allowed (i.e. T0=1 and Q0=1), but where any one of L2, L1 and L0 may have the value 1. The other two content preference descriptions, "high quality" and "high resolution", would give a different ordering of L2, L1, L0, T2, T1, T0, Q1, Q0 and a different mapping of the resulting 8 bits into the 3 possible values of the importance identifier.
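The "high motion" grouping just described can be expressed directly as a small decision function. The following Python sketch reproduces the three groups of Fig. 12 from the one-hot scalability flags of a frame.

def importance_high_motion(flags):
    # Group 2 (lowest priority): frames with an SNR enhancement (Q1=1).
    if flags["Q1"]:
        return 2
    # Group 1: temporal enhancements (T2=1 or T1=1) without SNR ones.
    if flags["T2"] or flags["T1"]:
        return 1
    # Group 0 (highest priority): base temporal and SNR resolution;
    # any of L2, L1, L0 may be set.
    return 0

frame = {"L2": 0, "L1": 1, "L0": 0, "T2": 0, "T1": 0, "T0": 1, "Q1": 1, "Q0": 0}
print(importance_high_motion(frame))  # 2: an SNR-enhancement frame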

A QoS mapping function can map the possible values of an importance identifier to the possible classes of the QoS architecture. In the case of IP DiffServ, the QoS mapping could be the following, making use of the AF service class. Here we can distinguish 4 priority classes, in each of which 3 different drop precedences are defined. We may use class 1 for video and class 0 for audio (lower values correspond to higher priority). In class 1 we use the three drop precedences to achieve the content-aware scaling of the media. That is, the importance identifier must be mapped to 3 drop precedence values, referred to as AF12, AF11, AF10, where the first digit is the class and the second is the drop precedence (again, lower values correspond to higher priority; for dropping this means that AF12 is dropped first). Since we already have only 3 different importance identifiers, the QoS mapping is straightforward: importance identifier 0 -> AF10, 1 -> AF11, 2 -> AF12.

If there are fewer QoS classes available than importance identifier values, this mapping may group certain importance values together to a single QoS class. However, it should be noted that this may lead to sub-optimal scaling.
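The following Python sketch transcribes the mapping stated above together with the grouping fallback; the class labels are those used in the text.

QOS_MAPPING = {0: "AF10", 1: "AF11", 2: "AF12"}  # labels as used in the text

def to_qos_class(importance, available_classes=3):
    # With fewer QoS classes than importance values, group the least
    # important values together (which may lead to sub-optimal scaling).
    return QOS_MAPPING[min(importance, available_classes - 1)]

print(to_qos_class(2))                       # AF12: dropped first
print(to_qos_class(2, available_classes=2))  # AF11: grouped with importance 1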

By using the content-aware scalability-mapping of the present invention, user-perceived quality can be maintained at good levels even under transient network performance degradations. By using a mapping, the multidimensional codec-specific importance of packets can be handled flexibly and possibly compressed to permit an efficient encoding even into the service differentiation fields of the IPv4 or IPv6 header. Moreover, by considering the audiovisual content and the scalability properties of the codecs together, a better trade-off between quality and transport and transmission cost can be achieved. Furthermore, utilizing the mechanisms described above, a sophisticated differentiation among services is possible.

It will be understood by those skilled in the art that various modifications and changes may be made to the present invention without departure from the scope thereof, which is defined by the appended claims.

APPENDIX 1

<?xml version="1.0" encoding="UTF-8"?>
<gBSD xmlns="urn:mpeg:mpeg21:dia:schema:gBSD:2003"
      xmlns:bt="urn:mpeg:mpeg21:dia:schema:gBSDatatypes:2003"
      xmlns:gBSD="urn:mpeg:mpeg21:dia:schema:gBSD:2003"
      xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
      xsi:schemaLocation="gBSD ../Schemas/gBSSchema.xsd">
  <Header>
    <ClassificationAlias alias="MA4" href="urn:mpeg:mpeg4:audio:cs:syntacticalLabels"/>
    <DefaultValues addressUnit="bit" addressMode="Absolute" globalAddressInfo="test.bsac"/>
  </Header>
  <gBSDUnit syntacticalLabel=":MA4:BSAC:BSAC_frame_element" start="23528" length="2640">
    <gBSDUnit syntacticalLabel=":MA4:BSAC:BSAC_base_element" start="0" length="631" addressMode="Consecutive">
      <Parameter length="11" name=":MA4:BSAC:BSAC_base_element:framelength">
        <Value xsi:type="bt:11b">330</Value>
      </Parameter>
      <gBSDUnit syntacticalLabel="dummy1" length="5"/>
      <Parameter length="6" name=":MA4:BSAC:BSAC_base_element:toplayer">
        <Value xsi:type="bt:6b">48</Value>
      </Parameter>
      <gBSDUnit syntacticalLabel="dummy2" length="609"/>
    </gBSDUnit>
    <gBSDUnit syntacticalLabel=":MA4:BSAC_layer_element" length="2009" marker="bitrate" addressMode="Consecutive">
      <gBSDUnit length="35" marker="el1"/>
      <gBSDUnit length="35" marker="el2"/>
      <gBSDUnit length="35" marker="el3"/>
      <gBSDUnit length="35" marker="el4"/>
      <gBSDUnit length="35" marker="el5"/>
      <gBSDUnit length="35" marker="el6"/>
      <gBSDUnit length="35" marker="el7"/>
      <gBSDUnit length="35" marker="el8"/>
      <gBSDUnit length="35" marker="el9"/>
      <!-- skipped lines -->
      <gBSDUnit length="16" marker="el146"/>
      <gBSDUnit length="32" marker="el147"/>
      <gBSDUnit length="814" marker="el148"/>
    </gBSDUnit>
  </gBSDUnit>
</gBSD>

APPENDIX 2

<AdaptationQoSModule xsi:type="UtilityFunctionType">
  <Constraint xsi:type="FloatVectorType" iOPinRef="BANDWIDTH">
    <FloatVector>16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
      35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
      59 60 61 62 63 64 66 68 70 72 74 76 78 80 82 84 86</FloatVector>
  </Constraint>
  <AdaptationOperator xsi:type="IntegerVectorType" iOPinRef="LAYERS_OF_SCALABLE_AUDIO">
    <IntegerVector>27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 27 27 26
      26 25 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 16 15 15 14
      14 13 13 12 12 11 10 9 8 7 6 5 4 3 2 1 0</IntegerVector>
  </AdaptationOperator>
  <AdaptationOperator xsi:type="IntegerVectorType" iOPinRef="NUMBER_OF_CHANNELS">
    <IntegerVector>1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0</IntegerVector>
  </AdaptationOperator>
  <AdaptationOperator xsi:type="NMTokenType" iOPinRef="CHANNEL_CONFIGURATION">
    <NMToken>M M M M M M M M M M M M M M M M L,R L,R L,R L,R L,R L,R L,R L,R
      L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R
      L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R L,R</NMToken>
  </AdaptationOperator>
  <Utility xsi:type="FloatVectorType" iOPinRef="ODG">
    <!-- -4 = BAD ... 0 = GOOD -->
    <FloatVector>-3.86 -3.85 -3.84 -3.82 -3.8 -3.78 -3.74 -3.7 -3.66 -3.59 -3.55
      -3.52 -3.46 -3.44 -3.43 -3.43 -3.63 -3.63 -3.57 -3.57 -3.44 -3.44 -3.28 -3.28
      -3.19 -3.19 -3.11 -3.11 -2.98 -2.98 -2.94 -2.94 -2.91 -2.91 -2.86 -2.86 -2.88
      -2.88 -2.84 -2.72 -2.72 -2.65 -2.65 -2.54 -2.54 -2.45 -2.45 -2.37 -2.22 -2.03
      -1.91 -1.73 -1.55 -1.44 -1.28 -1.14 -1.06 -1.0 -0.42</FloatVector>
  </Utility>
</AdaptationQoSModule>

REFERENCES

[1] ITU-T and ISO/IEC JTC 1, "Advanced Video Coding for Generic Audiovisual Services", ITU-T Recommendation H.264 & ISO/IEC 14496-10 (MPEG-4 AVC), version 3: 2005.

[2] T. Wiegand et al., "Overview of the H.264/AVC video coding standard", IEEE Trans. CSVT, vol. 13, no. 7, pp. 560-576, July 2003.

[3] ISO/IEC JTC 1/SC29/WG11, "MPEG Press Release", ISO/IEC JTC1/SC29/WG11 Doc. N6874, Hong Kong (China), Jan. 2005.

[4] ITU-T and ISO/IEC JTC 1, "Working Draft 1.0 of 14496-10:200x/AMD1 Scalable Video Coding", Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, JVT-N022, Hong Kong, Jan. 2005.

[5] S. Wenger, M. M. Hannuksela, T. Stockhammer, M. Westerlund, and D. Singer, "RTP Payload Format for H.264 Video", RFC 3984, February 2005.

[6] ISO/IEC JTC 1/SC 29/WG 11/M11657, "Performance and functionality of existing MPEG-4 technology in the context of CfI on Scalable Speech and Audio Coding", Jan. 2005.

[7] H. Schwarz, D. Marpe, and T. Wiegand, "MCTF and Scalability Extension of H.264/AVC," Proc. of PCS, San Francisco, CA, USA, Dec. 2004.

[8] H. Schwarz, D. Marpe, and T. Wiegand, "Combined Scalability Support for the Scalable Extension of H.264/AVC", Proc. IEEE ICME, Amsterdam, The Netherlands, July 6-8, 2005.

[9] J. van der Meer et al., "RTP Payload Format for Transport of MPEG-4 Elementary Streams", IETF RFC 3640, November 2003.

[10] W. Kumwilaisak, Y. T. Hou, Q. Zhang, W. Zhu, C.-C. Kuo, and Y.-Q. Zhang, "A Cross-Layer Quality-of-Service Mapping Architecture for Video Delivery in Wireless Networks", IEEE Journal on Selected Areas in Communications, vol. 21, no. 10, pp. 1685-1697, Dec. 2003.

[11] Toufik Ahmed, Ahmed Mehaoua, Raouf Boutaba, and Youssef Iraqi, "Adaptive Packet Video Streaming Over IP Networks: A Cross-Layer Approach", IEEE Journal on Selected Areas in Communications, vol. 23, no. 2, pp. 385-401, Feb. 2005.

[12] Jitae Shin, Jong Won Kim, and C.-C. Jay Kuo, "Quality-of-Service Mapping Mechanism for Packet Video in Differentiated Services Network", IEEE Transactions on Multimedia, vol. 3, no. 2, June 2001.

[13] ISO/IEC TR 21000-1:2004, Information technology — Multimedia framework (MPEG-21) — Part 1: Vision, Technologies and Strategy

[14] Feiten, Wolf et al., "New mode for RFC 3640: AAC-BSAC with MPEG-21 gBSD", http://www.ietf.org/internet-drafts/draft-feiten-avt-bsacmode-for-rfc3640-00.txt, February 11, 2005.

[15] Philips, FhG, Samsung, Coding Technologies, NEC, France Telecom, I2R, "Performance and functionality of existing MPEG-4 technology in the context of CfI on Scalable Speech and Audio Coding", ISO/IEC JTC 1/SC 29/WG 11/M11657, January 2005, Hong Kong, CN.

[16] Samsung AIT, T-Systems, "Report of Core Experiment on Audio AdaptationQoS", MPEG 2003, ISO/IEC JTC 1/SC 29/WG 11/M9730, Trondheim, July 2003.

[17] Panis et al., "Bitstream Syntax Description: A Tool for Multimedia Resource Adaptation within MPEG-21", Signal Processing: Image Communication 18 (2003), pp. 721-747.

[18] US Patent No. 6,789,123.