

Title:
MANAGING NETWORK JITTER FOR MULTIPLE AUDIO STREAMS
Document Type and Number:
WIPO Patent Application WO/2021/255327
Kind Code:
A1
Abstract:
There is inter alia disclosed an apparatus for managing a jitter buffer for multiple audio streams. The apparatus at least receives a message (503) associated with a first of at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first and a second audio stream. The apparatus can instruct the decoding of the second and the first encoded audio streams and the playing out of a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time (505).

Inventors:
LAAKSONEN LASSE (FI)
Application Number:
PCT/FI2021/050326
Publication Date:
December 23, 2021
Filing Date:
May 03, 2021
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04M3/56; G10L21/055; H04J3/06
Foreign References:
US 5623483 A (1997-04-22)
EP 3013013 A1 (2016-04-27)
Other References:
BORONAT, F. ET AL.: "Multimedia group and inter-stream synchronization techniques: A comparative study", INFORMATION SYSTEMS, ELSEVIER, vol. 34, no. 1, 1 March 2009 (2009-03-01), pages 108 - 131, XP025644936, DOI: 10.1016/j.is.2008.05.001
ORANGE: "3rd Generation Partnership Project (3GPP)", 3GPP DRAFT; S4-191212 ON PASS-THROUGH MODE FOR SCENE MANIPULATION, 15 October 2019 (2019-10-15), XP051799489, Retrieved from the Internet [retrieved on 2021-08-27]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:
CLAIMS:

1. A method comprising: receiving a first of at least two encoded audio streams and a message associated with the first of the at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams; determining whether the second of the at least two encoded audio streams has been received within the maximum delay time; instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time; and instructing a decoding of the first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams when the second of the at least two encoded audio streams has not been received within the maximum delay time.

2. The method as claimed in Claim 1, wherein determining whether the second of the at least two encoded audio streams has been received within the maximum delay time comprises: receiving the second of the at least two encoded audio streams; determining a time difference between the received second of the at least two encoded audio streams and the received first of the at least two encoded audio streams; and determining that the time difference is less than the maximum delay time.

3. The method as claimed in Claim 2, wherein instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time further comprises: delaying the instructing of the decoding of a frame of the first of the at least two encoded audio streams relative to the instructing of the decoding of a frame of the second of the at least two audio streams by the time difference.

4. The method as claimed in Claims 1 to 3, wherein the message further comprises information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, and wherein the method further comprises: determining from the information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, wherein the first and second decoded audio streams form at least part of a multi-channel audio output signal.

5. The method as claimed in Claims 1 to 4, wherein the message comprises: a first data field comprising an audio identifier of the first encoded audio stream; a second data field comprising an audio identifier of the second encoded audio stream; and a third data field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams.

6. The method as claimed in Claim 5, wherein the third field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two audio streams expresses the maximum delay time by an offset time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams in the form of either a physical time or a number of time samples.

7. An apparatus comprising: means for receiving a first of at least two encoded audio streams and a message associated with the first of the at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams; means for determining whether the second of the at least two encoded audio streams has been received within the maximum delay time; means for instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time; and means for instructing a decoding of the first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams when the second of the at least two encoded audio streams has not been received within the maximum delay time.

8. The apparatus as claimed in Claim 7, wherein the means for determining whether the second of the at least two encoded audio streams has been received within the maximum delay time comprises: means for receiving the second of the at least two encoded audio streams; means for determining a time difference between the received second of the at least two encoded audio streams and the received first of the at least two encoded audio streams; and means for determining that the time difference is less than the maximum delay time.

9. The apparatus as claimed in Claim 7, wherein the means for instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time further comprises: means for delaying the instructing of the decoding of a frame of the first of the at least two encoded audio streams relative to the instructing of the decoding of a frame of the second of the at least two audio streams by the time difference.

10. The apparatus as claimed in Claims 7 to 9, wherein the message further comprises information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, and wherein the apparatus further comprises: means for determining from the information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, wherein the first and second decoded audio streams form at least part of a multi-channel audio output signal.

11. The apparatus as claimed in Claims 7 to 10, wherein the message comprises: a first data field comprising an audio identifier of the first encoded audio stream; a second data field comprising an audio identifier of the second encoded audio stream; and a third data field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams.

12. The apparatus as claimed in Claim 11, wherein the third field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two audio streams expresses the maximum delay time by an offset time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams in the form of either a physical time or a number of time samples.

Description:
MANAGING NETWORK JITTER FOR MULTIPLE AUDIO STREAMS

Field

The present application relates to apparatus and methods for managing network jitter when receiving multiple encoded audio instances relating to a sound-field related audio representation.

Background

Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR) as well as spatial voice communication including teleconferencing. This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.

The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, it is expected that the decoder can output the audio in supported formats. In this regard the IVAS decoder is also expected to handle multiple output audio streams and audio elements which each may arrive with varying degrees of delay as a result of jitter conditions in a packet based network.

Summary

There is provided according to a first aspect a method comprising: receiving a first of at least two encoded audio streams and a message associated with the first of the at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams; determining whether the second of the at least two encoded audio streams has been received within the maximum delay time; instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time; and instructing a decoding of the first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams when the second of the at least two encoded audio streams has not been received within the maximum delay time.

The determining whether the second of the at least two encoded audio streams has been received within the maximum delay time may comprise: receiving the second of the at least two encoded audio streams; determining a time difference between the received second of the at least two encoded audio streams and the received first of the at least two encoded audio streams; and determining that the time difference is less than the maximum delay time.

Instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time may further comprise: delaying the instructing of the decoding of a frame of the first of the at least two encoded audio streams relative to the instructing of the decoding of a frame of the second of the at least two audio streams by the time difference.

The message may further comprise information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, and wherein the method further may comprise: determining from the information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, wherein the first and second decoded audio streams form at least part of a multi-channel audio output signal.

The message may comprise: a first data field comprising an audio identifier of the first encoded audio stream; a second data field comprising an audio identifier of the second encoded audio stream; and a third data field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams.

The third field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two audio streams may express the maximum delay time by an offset time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams in the form of either a physical time or a number of time samples.

There is provided according to a second aspect an apparatus comprising: means for receiving a first of at least two encoded audio streams and a message associated with the first of the at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams; means for determining whether the second of the at least two encoded audio streams has been received within the maximum delay time; means for instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time; and means for instructing a decoding of the first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams when the second of the at least two encoded audio streams has not been received within the maximum delay time.

The means for determining whether the second of the at least two encoded audio streams has been received within the maximum delay time may comprise: means for receiving the second of the at least two encoded audio streams; means for determining a time difference between the received second of the at least two encoded audio streams and the received first of the at least two encoded audio streams; and means for determining that the time difference is less than the maximum delay time.

The means for instructing a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time may further comprise means for delaying the instructing of the decoding of a frame of the first of the at least two encoded audio streams relative to the instructing of the decoding of a frame of the second of the at least two audio streams by the time difference.

The message may further comprise information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, and wherein the apparatus may further comprise: means for determining from the information indicating that the first of the at least two encoded audio streams is associated with the second of the at least two encoded audio streams, wherein the first and second decoded audio streams form at least part of a multi-channel audio output signal.

The message may comprise: a first data field comprising an audio identifier of the first encoded audio stream; a second data field comprising an audio identifier of the second encoded audio stream; and a third data field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams.

The third field comprising the maximum delay time between the first of the at least two encoded audio streams and a second of the at least two audio streams may express the maximum delay time by an offset time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams in the form of either a physical time or a number of time samples.

According to a third aspect there is an apparatus for spatial audio encoding comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: receive a first of at least two encoded audio streams and a message associated with the first of the at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams; determine whether the second of the at least two encoded audio streams has been received within the maximum delay time; instruct a decoding of the second and first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time; and instruct a decoding of the first of the at least two encoded audio streams and playing out a decoded first of the at least two encoded audio streams when the second of the at least two encoded audio streams has not been received within the maximum delay time.

A computer program comprising program instructions for causing a computer to perform the method as described above.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein.

An electronic device may comprise apparatus as described herein.

A chipset may comprise apparatus as described herein.

Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures

For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;

Figures 2a and 2b show example server and peer-to-peer teleconferencing systems within which embodiments may be implemented;

Figure 3 shows schematically an example encoder-decoder configuration for a server based teleconferencing system as shown in Figure 2a according to some embodiments;

Figure 4 shows schematically a receiver system deploying a jitter buffer management scheme according to embodiments;

Figure 5 shows a flow diagram of the operation of at least part of a receiver system deploying a jitter buffer management scheme according to embodiments;

Figure 6 shows an example device suitable for implementing the apparatus shown; and

Figure 7 shows an example schematic block diagram of a jitter buffer manager.

Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms to compensate for the effect of network jitter in order to control the relative playback timing of multiple audio streams and/or elements by at least one IVAS decoder instance.

In general, network jitter and packet loss conditions can cause degradation in quality for example in conversational speech services in packet networks, such as IP networks, and mobile networks such as fourth generation (4G LTE) and fifth generation (5G) networks. The nature of the packet switched communications can introduce variations in the transmission times of the packets (containing frames), known as jitter, which can be seen by the receiver as packets arriving at irregular intervals. However, an audio playback device requires a constant input with no interruptions in order to maintain good audio quality. Thus, if some packets/frames arrive at the receiver after they are required for playback, the decoder may have to consider those frames as lost and perform error concealment.

Typically, a jitter buffer can be utilised to manage network jitter by storing incoming frames for a predetermined amount of time (specified e.g. upon reception of the first packet of a stream) in order to hide the irregular arrival times and provide constant input to the decoder and playback components.

Nowadays, most audio or speech decoding systems deploy an adaptive jitter buffer management scheme in order to dynamically control the balance between short enough delay and low enough numbers of delayed frames. In this approach, an entity controlling the jitter buffer constantly monitors the incoming packet stream and adjusts the buffering delay (or buffering time, these terms are used interchangeably) according to observed changes in the network delay behaviour. If the transmission delay seems to increase or the jitter becomes worse, the buffering delay may need to be increased to meet the network conditions. In the opposite situation, where the transmission delay seems to decrease, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
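
As an illustration of this adaptation principle, the sketch below derives a target buffering delay from a window of recent packet delay observations. It is a minimal, hypothetical rule, not the IVAS or EVS jitter buffer algorithm; the class name, window size, percentile and clamping limits are all assumptions chosen for the example.

    from collections import deque

    class AdaptiveDelayEstimator:
        """Buffer long enough to absorb most of the recently observed jitter,
        within configured limits (illustrative adaptation rule)."""

        def __init__(self, window=100, percentile=0.94, min_ms=20, max_ms=200):
            self.transit = deque(maxlen=window)   # relative transit times (ms)
            self.percentile = percentile
            self.min_ms = min_ms
            self.max_ms = max_ms

        def observe(self, arrival_time_ms, rtp_timestamp_ms):
            # The sender/receiver clock offset is constant, so the spread of
            # these relative transit times reflects the network jitter.
            self.transit.append(arrival_time_ms - rtp_timestamp_ms)

        def target_buffering_delay_ms(self):
            if len(self.transit) < 2:
                return self.min_ms
            ordered = sorted(self.transit)
            jitter_span = ordered[int(self.percentile * (len(ordered) - 1))] - ordered[0]
            return max(self.min_ms, min(self.max_ms, jitter_span))

If the observed jitter span grows, the returned buffering delay grows with it (up to the cap); when the network settles down, the delay shrinks again, which is the trade-off described above.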

As such the embodiments discussed herein are concerned with a codec, for example an IVAS codec, configured to support a multi-input mode of operation wherein the codec is configured to provide a framework for decoding/rendering of multiple input streams, each typically originating from a different encoder. This framework also comprises the facility to compensate for network jitter and packet loss in IP based networks.

According to these embodiments there is provided at least one codec input audio stream which can comprise multiple audio inputs.

In some embodiments the audio inputs within the signal can be allocated, for example into separate encoder instances based on a parameter (for example a track-group allocation parameter). This parameter may furthermore be encoded as metadata and be transmitted/stored with the audio signal.

Additionally, in some embodiments the encoded metadata together with the transmitted multiple audio streams/elements can be accompanied by a signalling message which provides information on the relative timings of the input audio signals being encoded by the IVAS encoder. The relative timing of each input audio signal may then be conveyed to the receiving modem (along with the encoded audio elements) and used within a jitter buffer manager (JBM) to ensure that the decoded audio stream signals are time aligned (relative to each other) in accordance with the relative timings of the audio streams at the encoder.

The IVAS codec can have the provision for effective encoding of spatial analysis derived metadata parameters of a multi-channel system utilising a multi-channel microphone implementation. However, as discussed above, the input format may be any suitable input format, such as multi-channel loudspeaker, ambisonic (FOA/HOA) etc. The channel location may be based on a location of the microphone or may be a virtual location or direction. The output may be a multi-channel loudspeaker arrangement in which the loudspeaker signals may be generalised to be two or more playback audio signals. Additionally, the output may be rendered to the user via means other than loudspeakers. Such a system is currently being standardised by the 3GPP standardization body as the Immersive Voice and Audio Service (IVAS). IVAS is intended to be an extension to the existing 3GPP Enhanced Voice Service (EVS) codec in order to facilitate immersive voice and audio services over existing and future mobile (cellular) and fixed line networks. An application of IVAS may be the provision of immersive voice and audio services over 3GPP fourth generation (4G) and fifth generation (5G) networks. In addition, the IVAS codec as an extension to EVS may be used in store and forward applications in which the audio and speech content is encoded and stored in a file for playback. It is to be appreciated that IVAS may be used in conjunction with other audio and speech coding technologies which have the functionality of coding the samples of audio and speech signals.

As mentioned previously IVAS may use (side streamed) metadata in order to assist in the encoding and transmission of spatial parameters, that is, spatial parameters which may represent the spatial properties of at least some of the audio inputs. IVAS is proposed to have metadata consisting at least of spherical directions (elevation, azimuth), at least one energy ratio of a resulting direction, a spread coherence, and a surround coherence independent of the direction, for each considered time-frequency (TF) block or tile, in other words a time/frequency sub band. In total IVAS may have a number of different types of metadata parameters for each time-frequency (TF) tile.

In this regard Figure 1 depicts an example apparatus and system for implementing embodiments of the application. The system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131. The ‘analysis’ part 121 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal and the ‘synthesis’ part 131 is the part from a decoding of the encoded metadata and downmix signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).

The input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102. In the following examples a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments. For example, in some embodiments the spatial analyser and the spatial analysis may be implemented external to the encoder. For example, in some embodiments the spatial metadata associated with the audio signals may be provided to an encoder as a separate bit-stream. In some embodiments the spatial metadata may be provided as a set of spatial (direction) index values. These are examples of a metadata-based audio input format.

The multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105.

In some embodiments the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable transport signal comprising a determined number of channels and output the transport signals 104. For example, the transport signal generator 103 may be configured to generate a 2-audio channel downmix of the multi-channel signals. The determined number of channels may be any suitable number of channels. The transport signal generator in some embodiments is configured to otherwise select or combine, for example, by beamforming techniques the input audio signals to the determined number of channels and output these as transport signals.
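
For concreteness, a minimal two-channel downmix of the kind mentioned above could look as follows; the even/odd channel grouping and equal weighting are purely illustrative assumptions and not the specification of the transport signal generator 103.

    import numpy as np

    def stereo_downmix(multi_channel: np.ndarray) -> np.ndarray:
        """multi_channel has shape (n_channels, n_samples); returns (2, n_samples).
        Even-indexed input channels are averaged into the left transport channel,
        odd-indexed channels into the right (an assumed grouping)."""
        left = multi_channel[0::2].mean(axis=0)
        right = multi_channel[1::2].mean(axis=0) if multi_channel.shape[0] > 1 else left
        return np.stack([left, right])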

In some embodiments the transport signal generator 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the transport signals are in this example.

In some embodiments the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104. The analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 108 and an energy ratio parameter 110 and a coherence parameter 112 (and in some embodiments a diffuseness parameter). The direction, energy ratio and coherence parameters may in some embodiments be considered to be spatial audio parameters. In other words, the spatial audio parameters comprise parameters which aim to characterize the sound-field created/captured by the multi-channel signals (or two or more audio signals in general). In some embodiments the parameters generated may differ from frequency band to frequency band. Thus, for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons. The transport signals 104 and the metadata 106 may be passed to an encoder 107.
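
A compact way to picture the metadata produced per time-frequency tile is sketched below; the field names are illustrative stand-ins, and the optional fields reflect the band-dependent parameter selection described above (some bands may carry fewer parameters).

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TFTileMetadata:
        """Spatial audio parameters for one time-frequency (TF) tile."""
        azimuth_deg: float                         # direction parameter
        elevation_deg: float
        direct_to_total_ratio: float               # energy ratio for the direction
        spread_coherence: Optional[float] = None   # may be omitted in some bands
        surround_coherence: Optional[float] = None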

The encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport (for example downmix) signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The encoding may be implemented using any suitable scheme. The encoder 107 may furthermore comprise a metadata encoder/quantizer 111 which is configured to receive the metadata and output an encoded or compressed form of the information. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the metadata within encoded downmix signals before transmission or storage shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.

In the following description terms such as encoded audio stream may be taken to mean an encoded audio signal such as an encoded downmix signal and accompanying encoded metadata.

In the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport extractor 135 which is configured to decode the audio signals to obtain the transport signals. Similarly, the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata. The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.

Within the context of IVAS a jitter buffer management may be part of the transport extractor 135.

The decoded metadata and transport audio signals may be passed to a synthesis processor 139.

The system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the transport and the metadata and re-creates in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the transport signals and the metadata.

Therefore, in summary first the system (analysis part) is configured to receive multi-channel audio signals.

Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting or downmixing some of the audio signal channels) and the spatial audio parameters as metadata.

The system is then configured to encode for storage/transmission the transport signal and the metadata.

After this the system may store/transmit the encoded transport and metadata.

The system may retrieve/receive the encoded transport and metadata.

Then the system is configured to extract the transport and metadata from encoded transport and metadata parameters, for example demultiplex and decode the encoded transport and metadata parameters.

The system (synthesis part) is configured to synthesize an output multi-channel audio signal based on extracted transport audio signals and metadata.

An example system in which an IVAS coding system may be implemented is shown in Figures 2a and 2b.

Figure 2a for example shows a teleconferencing system within which some embodiments can be implemented. In this example there are shown three sites or rooms, Room A 201, Room B 203, and Room C 205. Room A 201 comprises three ‘talkers’, Talker 1 211, Talker 2 213, and Talker 3 215. Room B 203 comprises one ‘talker’, Talker 4 221. Room C 205 comprises one ‘talker’, Talker 5 231.

In the following example within room A is a suitable teleconference apparatus 202 configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. Within each of the other rooms may be a suitable teleconference apparatus 204, 206 configured to render a spatial audio signal to the room and furthermore is configured to capture and encode at least a mono audio and optionally configured to spatially capture and encode the audio environment. In the following examples each room is provided with the means to spatially capture, encode spatial audio signals, receive spatial audio signals and render these to a suitable listener. It would be understood that there may be other embodiments where the system comprises some apparatus configured to only capture and encode audio signals (in other words the apparatus is a ‘transmit’ only apparatus), and other apparatus configured to only receive and render audio signals (in other words the apparatus is a ‘receive’ only apparatus). In such embodiments the system within which embodiments may be implemented may comprise apparatus with varying abilities to capture/render audio signals.

The teleconference apparatus (for each site or room) 202, 204, 206 is further configured to call into a teleconference controlled by and implemented over a server or multipoint control unit (MCU) 207.

Figure 2b shows a further (peer-to-peer) teleconferencing system within which some embodiments can be implemented. In this example there are shown three sites or rooms, Room A 201, Room B 203, and Room C 205. Room A 201 comprises three ‘talkers’, Talker 1 211, Talker 2 213, and Talker 3 215. Room B 203 comprises one ‘talker’, Talker 4 221. Room C 205 comprises one ‘talker’, Talker 5 231. Within each of the rooms is a suitable teleconference apparatus 202, 204, 206 configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. The teleconference apparatus (for each site or room) 202, 204, 206 is further configured to communicate with each other to implement a teleconference function. As shown in Figure 2b, the IVAS decoder/renderer for each of the teleconference apparatus 202 can be configured to handle multiple input streams that may each originate from a different encoder. For example, the apparatus 206 in room C 205 is configured to simultaneously decode/render audio streams from room A 201 and room B 203.

In some embodiments there can be advantages in terms of complexity of the decoding/rendering and synchronization of the presentation of the streams when this simultaneous decoding/rendering can be done using the same (IVAS) decoder/renderer instance. Alternatively, in some embodiments the handling of multiple input streams that may each originate from a different encoder could be implemented by two or more (IVAS) decoder instances and the audio outputs from the instances mixed in the rendering operations. In the latter case, it may be preferable to use an external rendering operation instead of, or in addition to, the integrated (IVAS) rendering in order to allow for the manipulation of the audio outputs relative to each other.

Figure 3 shows the system of Figure 2a where the MCU implements an encoding for the downstream of one RX user 381 (within room C 205). In this example each talker is represented as a separate audio object for the (IVAS) encoder. For example, each ‘talker’ may be wearing a lavalier microphone for individual voice pick-up. Thus, for example as shown in Figure 2, the talkers for room A 201 could be talker 1 211 who, using a first lavalier microphone, generates audio object 200, talker 2 213 who, using a second lavalier microphone, generates audio object 302, and talker 3 215 who, using a third lavalier microphone, generates audio object 304. The teleconference apparatus can comprise an (IVAS) encoder 301 which is configured to receive the audio objects and encode the objects based on a suitable encoding to generate a bitstream 306. The bitstream 306 is passed to the MCU 321.

Additionally, for example as shown in Figure 3, the talkers for room B 203 could be talker 4 221 who, using a fourth lavalier microphone, generates audio object 310. The teleconference apparatus can comprise a further (IVAS) encoder 311 which is configured to receive the audio objects from room B 203 and encode the objects based on a suitable encoding to generate a bitstream 316. The bitstream 316 is passed to the MCU 321.

The MCU (conferencing server) 321 comprises an (IVAS) decoder / (IVAS) encoder and is configured to decode the two (IVAS) bitstreams from rooms A (bitstream 306) and B (bitstream 316). The MCU may then be configured to perform some mixing of the two streams (for example based on audio activity or by any suitable means) and then encode a downmix bitstream 318 for the RX user 381 in room C 205. Some examples of the MCU may also comprise the facility to perform transcoding in which the bitstreams may be decoded from one encoding format and re-encoded to another coding format.

The room C 205 may comprise apparatus comprising an (IVAS) decoder 331 which is configured to receive the downmix bitstream 318, decode the bitstream 318 and render a suitable spatial audio signal output to the RX user 381.

Figure 4 shows how the IVAS codec may be connected to a jitter buffer management system. The receiver modem 401 can receive packets through a network socket such as an IP network socket which may be part of an ongoing Real-time Transport Protocol (RTP) session. The received packets may be pushed to an RTP depacketizer module 403, which may be configured to extract the encoded audio stream frames (payload) from the RTP packet. The RTP payload may then be pushed to a jitter buffer manager (JBM) 405 where various housekeeping tasks may be performed, such as updating frame receive statistics. The jitter buffer manager 405 may also be arranged to store the received frames. The jitter buffer manager 405 may be configured to pass the received frames to an IVAS decoder instance 407 for decoding. Accordingly, the IVAS decoder instance passes the decoded frames back to the jitter buffer manager 405 in the form of digital samples (PCM samples). Also depicted in Figure 4 is an Acoustic player 409 which may be viewed as the module performing the playing out (or playback) of the decoded audio streams. The function performed by the Acoustic player 409 may be regarded as a pull operation in which it pulls the necessary PCM samples from the JBM buffer in order to provide uninterrupted audio playback of the audio streams.
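
The push/pull data flow of Figure 4 can be summarised with the following sketch; the class and method names (push_payload, pull_pcm, decode) are hypothetical stand-ins and not the real modem, depacketizer, JBM, IVAS decoder or acoustic player interfaces.

    class JitterBufferManagerSketch:
        """Stores encoded frames pushed by the depacketizer and serves PCM
        samples pulled by the acoustic player, decoding on demand."""

        def __init__(self, decoder):
            self.decoder = decoder   # object with a decode(payload) -> list of samples
            self.encoded = []        # buffered (rtp_timestamp, payload) tuples
            self.pcm = []            # decoded samples awaiting playback

        def push_payload(self, rtp_timestamp, payload):
            # Housekeeping such as receive statistics would be updated here.
            self.encoded.append((rtp_timestamp, payload))

        def pull_pcm(self, n_samples):
            # Pull operation performed by the acoustic player.
            while len(self.pcm) < n_samples and self.encoded:
                _, payload = self.encoded.pop(0)
                self.pcm.extend(self.decoder.decode(payload))
            out, self.pcm = self.pcm[:n_samples], self.pcm[n_samples:]
            return out + [0] * (n_samples - len(out))   # pad with silence (concealment stub)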

Figure 7 depicts a schematic block diagram of an example jitter buffer manager 405, which may be used in conjunction with an IVAS decoder as shown in Figure 4. The jitter buffer manager 405 may comprise a jitter buffer 800, a network analyzer 802, an adaptation control logic 803 and an adaptation unit 804.

Jitter buffer 800 is configured to at least temporarily store one or more audio stream (element) frames, which are received via a (wired or wireless) network for instance in the form of packets 806. These packets 806 may for instance be Real-time Transport Protocol (RTP) packets, which are unpacked by buffer 800 to obtain the audio stream frames. Buffer 800 may be linked to IVAS decoder 407 to output audio stream frames when they are requested for decoding.

Buffer status information 808, such as for instance information on a number of frames contained in buffer 800, or information on a time span covered by a number of frames contained in the buffer, or a buffering time of a specific frame (such as an onset frame), is transferred between buffer 800 and adaptation control logic 803. Network analyzer 802 monitors the incoming packets 806 from the RTP depacketizer 403, for instance to collect reception statistics (e.g. jitter, packet loss). Corresponding network analyzer information 807 is passed from network analyzer 802 to adaptation control logic 803.

Adaptation control logic 803, inter alia, controls buffer 800. This control comprises determining buffering times for one or more frames received by buffer 800, and is performed based on network analyzer information 807 and/or buffer status information 808. The buffering delay of buffer 800 may for instance be controlled during comfort noise periods, during active signal periods or in-between. For instance, a buffering time of an onset signal frame may be determined by adaptation control logic 803, and IVAS decoder 407 may (for instance via adaptation unit 804, signals 809 and the signal 810 to control the IVAS decoder) then be triggered to extract this onset signal frame from buffer 800 when this determined buffering time has elapsed. The IVAS decoder 407 can be arranged to pass the decoded audio samples to the adaptation unit 804 via the connection 811.

Adaptation unit 804, if necessary, shortens or extends the output audio signal 812 according to requests given by adaptation control logic 803 to enable buffer delay adjustment in a transparent manner. The output audio signal may be connected to the Acoustic front end 409.

As explained previously the jitter buffer manager 405 may not only be required to compensate for network jitter but also maintain and control the relative playback timing of multiple audio streams when decoded by an IVAS decoder instance. To this end the jitter buffer manager 405 may also be required to adjust the relative playback times of the multiple stream audio elements once the effect of network jitter (which can be different for the packets of different audio streams) has been catered for. This may be required in order that the relative timings of the various multiple audio streams are maintained as they were when encoded, such as in the teleconferencing system example of Figures 2a and 2b. To that end it is proposed that as part of the encoded audio streams passing from the encoding system to decoding system there is a signalling message which conveys information relating to the relative timing of each audio stream at the encoding system. This audio stream jitter management signalling message may then be used at a receiver system, such as that depicted in Figure 4, to ensure that the decoded multiple audio streams are each time aligned in accordance with their relative time alignment at the transmitting end of the communications chain.

In some embodiments the above signalling message may be transmitted as part of the IVAS (side streamed) metadata which accompanies the encoded audio streams.

In other embodiments the audio stream jitter management message may be contained within the header information of an RTP packet.

The signalling message may contain three parameters (or fields). The first parameter, termed the ID, may be used to uniquely identify the audio stream or element to which the corresponding time alignment parameter may apply. Alternatively, in other embodiments the ID field may be used to identify a group of audio streams (or elements) instead.

The second parameter (or field) may be used to signify if an audio stream (or element) is connected or related to another audio stream (or element). In other words, this field can signify whether an audio stream is required to be played within a relative time to another audio stream under jitter conditions. In the context of this description the term connected is used to signify that there is a playback relationship between the audio streams such that for the full audio experience the playing out of one audio stream should be accompanied with the playing out of another connected audio stream. In other words, the audio streams may be associated with each other. An example of a connected (or associated) audio experience may be the teleconferencing system described in Figures 2a and 2b, in which the audio streams from each of the rooms will need to be played out together within a tolerable delay in order to maintain the overall audio experience at the receiving end. The second parameter (or field) may be denoted by the term IDconn and has a value comprising the ID of the audio stream with which the “current” received audio stream (element) is connected, that is, the audio stream with which the “current” received audio stream would ideally be played back.

The third parameter may be used to specify an offset between the “current” received audio stream, given by the audio element identifier ID, and the audio stream to which it is connected in terms of playback, given by the identifier IDconn. In embodiments, the offset parameter may be either expressed in physical time, such as milliseconds (ms), or as a number of time samples (at a specific sampling frequency). In other words, the offset time specifies the maximum time limit (due to network jitter) that may be tolerated in order that the “current” received audio stream and its corresponding connected audio stream may be played back together. Outside of this time limit the jitter buffer manager 405 may be arranged to play the “current” received audio stream rather than the combination of the “current” received audio stream and its corresponding connected audio stream.
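
Put together, one entry of the signalling message could be modelled as below; the field names mirror the ID, IDconn and Offset parameters just described, while the concrete wire encoding (IVAS metadata or RTP header) is left open and the class itself is only an illustrative sketch.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class StreamJitterMessage:
        stream_id: str                      # ID of the "current" audio stream
        connected_id: Optional[str] = None  # IDconn, or None if not connected
        offset_ms: Optional[float] = None   # maximum tolerated delay (milliseconds)

        def offset_samples(self, sample_rate_hz: int) -> Optional[int]:
            # The offset may equivalently be carried as a number of time samples.
            if self.offset_ms is None:
                return None
            return round(self.offset_ms * sample_rate_hz / 1000)

    # Example: a 40 ms offset corresponds to 1920 samples at 48 kHz.
    assert StreamJitterMessage("0002", "0001", 40).offset_samples(48000) == 1920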

An example of a signalling message may take the form of:

ID      IDconn    Offset
0001    0         -
0002    0001      40ms
0003    0         -

In this example we have three audio streams (or elements) with IDs 0001, 0002 and 0003. For the case of the audio stream with ID 0002, this audio stream is “connected” to audio stream 0001. That is, ideally taking into account the jitter condition of the network, audio stream 0002 should be played back together with audio stream 0001. The offset field associated with this pairing of audio streams indicates that audio stream 0002 can wait up to a maximum delay of 40ms (due to network jitter conditions) for the audio stream 0001 to arrive in order that both streams can be played out together. If for instance audio stream 0001 is not received within the 40ms window then audio stream 0002 may be played out without audio stream 0001.

With respect to a JBM 405 which is integrated with an IVAS decoding/rendering system (such as that depicted in Figure 4), the JBM 405 may be configured to read the RTP time stamp related to each received packet, so that upon receiving an audio stream, the JBM may check the above audio stream jitter management message accompanying the audio stream in order to determine whether the audio stream is firstly associated with a “connected” audio stream, and secondly whether the “connected” audio stream is available. If the “connected” audio stream is available then the IVAS decoder may be instructed (by the JBM) to perform the decoding/rendering of the audio stream and the associated “connected” audio stream. However, if it is determined by the JBM that the associated “connected” audio stream is not currently available (due to the RTP packet conveying the audio stream frames being delayed in the network), the JBM 405 may commence delaying the playback of the current stream by up to the offset time in order to give the “connected” (delayed or missing) audio stream a window of opportunity to be received (by the receiver modem).
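
Applied to the example above (stream 0002 connected to 0001 with a 40 ms offset), the JBM's decision for a newly received frame might be sketched as follows; the function signature and the decode_and_play callback are hypothetical, not part of the IVAS interfaces.

    def on_stream_frame(stream_id, connected_id, offset_ms,
                        waited_ms, arrived, decode_and_play):
        """Sketch of the JBM decision for one received frame.
        arrived: set of stream IDs whose corresponding frame is already buffered;
        waited_ms: how long this frame has already been held; decode_and_play:
        callback taking the list of stream IDs handed to the IVAS decoder instance."""
        if connected_id is None:
            decode_and_play([stream_id])                  # no connected stream
        elif connected_id in arrived:
            decode_and_play([stream_id, connected_id])    # both available: play together
        elif waited_ms < offset_ms:
            return False       # e.g. 0002 keeps waiting up to 40 ms for 0001
        else:
            decode_and_play([stream_id])                  # window expired: play alone
        return True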

A jitter buffer manager may therefore be configured to control a delay time by receiving a first of at least two encoded audio streams and a message associated with the first of the at least two encoded audio streams, wherein the message comprises information relating to a maximum delay time between the first of the at least two encoded audio streams and a second of the at least two encoded audio streams. The jitter buffer manager may then be configured to determine whether the second of the at least two encoded audio streams has been received within the maximum delay time, and then instruct a decoder to decode the second and first of the at least two encoded audio streams and play out a decoded first of the at least two encoded audio streams and a decoded second of the at least two encoded audio streams when the second of the at least two encoded audio streams has been received within the maximum delay time, or instruct a decoder to decode the first of the at least two encoded audio streams and play out a decoded first of the at least two encoded audio streams when the second of the at least two encoded audio streams has not been received within the maximum delay time. The jitter buffer manager may determine whether the second of the at least two encoded audio streams has been received within the maximum delay time by determining a time difference between the received second of the at least two encoded audio streams and the received first of the at least two encoded audio streams, and then determining that the time difference is less than the maximum delay time. If the jitter buffer manager determines that the time difference is less than the maximum delay time then there may be a delaying of the decoding of a frame of the first of the at least two encoded audio streams relative to the instructing of the decoder to decode a frame of the second of the at least two audio streams by the time difference.
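
A compact sketch of this decision follows; arrival times are in milliseconds, the return value simply names which streams to decode and by how much to delay the first stream's frame, and the direct comparison of arrival times is an assumption made for illustration.

    def playout_decision(t_first_ms, t_second_ms, max_delay_ms):
        """t_second_ms is None if the second stream has not (yet) been received."""
        if t_second_ms is not None:
            time_difference = t_second_ms - t_first_ms
            if 0 <= time_difference < max_delay_ms:
                # Received in time: decode both, delaying the first stream's frame
                # by the time difference so the decoded outputs play out together.
                return ["first", "second"], time_difference
        # Not received within the maximum delay time: decode the first stream alone.
        return ["first"], 0.0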

Figure 5 depicts a flow diagram of how a JBM 405 may implement and manage the receiving and decoding of an audio stream and its associated “connected” audio stream over a network experiencing network jitter.

A JBM 405 may be configured to receive at least two encoded audio streams (elements) from an RTP depacketizer. Each encoded audio stream may be conveyed/transmitted as a separate RTP stream, and the RTP depacketizer may be arranged to unpack the RTP frame in order to obtain the encoded audio streams. In addition, the RTP depacketizer may also be configured to read the RTP timestamp in order to assist in any delaying and synchronizing operation performed by the JBM. The operation of receiving the RTP streams is shown as processing step 501 in Figure 5.
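
For reference, extracting the payload and timestamp from an RTP packet only needs the fixed 12-byte header defined in RFC 3550; the minimal parser below ignores header extensions and padding for brevity and is a sketch rather than the actual depacketizer implementation.

    import struct

    def depacketize_rtp(packet: bytes) -> dict:
        """Return the RTP fields the JBM needs from a raw packet (RFC 3550
        fixed header only; extensions and padding are not handled here)."""
        if len(packet) < 12:
            raise ValueError("packet shorter than the RTP fixed header")
        b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
        csrc_count = b0 & 0x0F
        header_len = 12 + 4 * csrc_count
        return {
            "sequence_number": seq,
            "timestamp": timestamp,          # in units of the payload clock rate
            "ssrc": ssrc,
            "payload_type": b1 & 0x7F,
            "payload": packet[header_len:],  # the encoded audio frame(s)
        }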

In embodiments an RTP depacketizer may be arranged to push the encoded audio stream to a jitter buffer management module, once the process of extracting the encoded audio stream has been completed. This can be performed for each encoded audio stream as and when they arrive at the receiver. The operation of pushing encoded audio streams to a jitter buffer management module is depicted in Figure 5 by the processing step 502.

A JBM may then be arranged to read an audio stream jitter management signalling message associated with each received encoded audio stream in order to determine whether a particular audio stream is “connected” with other encoded audio streams. That is, whether the decoded audio streams should be played out together, and if so, what delay time window can be tolerated between the audio streams should one of the audio streams fall foul of network jitter. As mentioned above the delay time window is given by the offset field in the above outlined message structure. The operation may be depicted in Figure 5 by the processing step of 503.

The JBM may be arranged to determine a target playout time for each received audio stream. For so called connected audio streams this target playout time may be determined with reference to the maximum delay that can be tolerated between two asynchronously received encoded audio streams, as given by the offset field. This operation may be depicted in Figure 5 by the processing step of 504.

The JBM is then arranged to check whether the “connected” audio streams have all been received within the required delay time window as given by the offset field. In the case that all so called “connected” audio streams have been received within the specified delay time window the JBM may be arranged to play the connected audio streams together. In the case of a delay between the “connected” audio streams the JBM may time align the “connected” decoded audio streams by either delaying one decoded audio stream relative to another or utilizing techniques of time scale modification such as time warping. However, should the JBM determine that a “connected” audio stream has not been received within the specified delay time window, then the JBM may be arranged to simply push the decoded audio stream for playback without the associated “connected” audio stream. These operations are captured in Figure 5 by the processing steps of 505, 506 and 507.
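
The simpler of the two alignment options, delaying one decoded stream relative to the other, can be pictured as follows for mono PCM buffers; this is only an illustration of the idea and not the JBM's actual time alignment routine.

    import numpy as np

    def align_connected_streams(pcm_a, pcm_b, delay_b_samples):
        """Delay stream B by the given number of samples and zero-pad both
        streams to a common length so they can be played out together."""
        a = np.asarray(pcm_a, dtype=float)
        b = np.concatenate([np.zeros(delay_b_samples), np.asarray(pcm_b, dtype=float)])
        n = max(len(a), len(b))
        return np.pad(a, (0, n - len(a))), np.pad(b, (0, n - len(b)))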

In some embodiments, the adaptation logic may be configured to be used only in non-DTX operation, i.e., it may be utilized in regular active encoding only. In further embodiments, the apparatus may be configured to ignore the playback time adaptation logic when at least one of the at least two encoded audio streams is under active DTX operation. For example, when a first packet relating to DTX operation is received, the time adaptation logic may ignore audio stream jitter management signalling messages relating to said audio stream. When a first packet relating to regular active encoding (i.e., non-DTX operation) is again received, the time adaptation logic may continue to follow the audio stream jitter management signalling messages relating to said audio stream.
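
A possible (assumed) bookkeeping for this gating is sketched below: alignment messages are only honoured while every stream they refer to is in regular active encoding, and are ignored again whenever one of them enters DTX. The function and the dtx_state map are hypothetical helpers, not part of the codec.

    def should_apply_alignment(stream_ids, dtx_state):
        """stream_ids: IDs named by a jitter management message (the stream and
        any connected stream); dtx_state: maps stream ID -> True while that
        stream is under active DTX operation (hypothetical bookkeeping)."""
        return not any(dtx_state.get(sid, False) for sid in stream_ids)

    # Example: ignore the 0002/0001 alignment message while 0001 sends SID frames.
    assert should_apply_alignment(["0002", "0001"], {"0001": True}) is False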

With respect to Figure 6 an example electronic device is shown which can be used to implement embodiments of the invention. The device may be any suitable electronics device or apparatus. For example, in some embodiments the device 1700 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.

In some embodiments the device 1700 comprises at least one processor or central processing unit 1707. The processor 1707 can be configured to execute various program codes such as the methods such as described herein.

In some embodiments the device 1700 comprises a memory 1711. In some embodiments the at least one processor 1707 is coupled to the memory 1711. The memory 1711 can be any suitable storage means. In some embodiments the memory 1711 comprises a program code section for storing program codes implementable upon the processor 1707. Furthermore, in some embodiments the memory 1711 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1707 whenever needed via the memory-processor coupling.

In some embodiments the device 1700 comprises a user interface 1705. The user interface 1705 can be coupled in some embodiments to the processor 1707. In some embodiments the processor 1707 can control the operation of the user interface 1705 and receive inputs from the user interface 1705. In some embodiments the user interface 1705 can enable a user to input commands to the device 1700, for example via a keypad. In some embodiments the user interface 1705 can enable the user to obtain information from the device 1700. For example the user interface 1705 may comprise a display configured to display information from the device 1700 to the user. The user interface 1705 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1700 and further displaying information to the user of the device 1700.

In some embodiments the device 1700 comprises an input/output port 1709. The input/output port 1709 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1707 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

The transceiver can communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).

The transceiver input/output port 1709 may be configured to receive the signals and in some embodiments obtain the focus parameters as described herein.

In some embodiments the device 1700 may be employed to generate a suitable audio signal using the processor 1707 executing suitable code. The input/output port 1709 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones (which may be headtracked or non-tracked headphones) or similar.

In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.

The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.

Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.