

Title:
AUDIO REPRESENTATION AND ASSOCIATED RENDERING
Document Type and Number:
WIPO Patent Application WO/2020/152394
Kind Code:
A1
Abstract:
An apparatus comprising means for: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

Inventors:
LAAKSONEN LASSE (FI)
LAITINEN MIKKO-VILLE (FI)
RÄMÖ ANSSI (FI)
PIHLAJAKUJA TAPANI (FI)
VASILACHE ADRIANA (FI)
Application Number:
PCT/FI2020/050014
Publication Date:
July 30, 2020
Filing Date:
January 09, 2020
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04S7/00; G10L19/008; H04R3/12; H04S1/00; G10L19/02; G10L19/16; G10L25/18; H04S5/00
Domestic Patent References:
WO2018193163A12018-10-25
WO2017148526A12017-09-08
Foreign References:
GB2559199A2018-08-01
GB2559200A2018-08-01
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:
CLAIMS:

1. An apparatus comprising means for:

obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

2. The apparatus as claimed in claim 1, wherein the input format further comprises a definition configured to control an encoder.

3. The apparatus as claimed in any of claims 1 to 2, wherein the means for is further for:

encoding a bit-exact mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.

4. The apparatus as claimed in claim 3 when dependent on claim 2, wherein the encoding a bit-exact mono audio signal based on the mono audio signal is based on the definition configured to control an encoder.

5. The apparatus as claimed in any of claims 1 to 2, the input format further comprising a multichannel audio signal, wherein the means for is further for:

encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.

6. The apparatus as claimed in any of claims 1 to 5, wherein the multichannel audio signal is a stereo audio signal.

7. The apparatus as claimed in any of claims 1 to 6, wherein the encoded mono audio signal is an enhanced voice system encoded mono audio signal.

8. The apparatus as claimed in any of claims 1 to 7, wherein the encoded multichannel audio signal is one of:

an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.

9. The apparatus as claimed in any of claims 1 to 8, wherein the encoded mono audio signal is an encoded bit-exact mono audio signal.

10. The apparatus as claimed in any of claims 1 to 9, wherein the metadata signal comprises:

two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and

direct-to-total energy ratios associated with the two directional parameters, wherein the sum of direct-to-total energy ratios for the two directions is 1.

11. A method comprising obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

12. The method as claimed in claim 11, wherein the input format further comprises a definition configured to control an encoding.

13. The method as claimed in any of claims 11 to 12, further comprising:

encoding a mono audio signal based on the mono audio signal;

encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.

14. The method as claimed in claim 13 when dependent on claim 12, wherein encoding a mono audio signal based on the mono audio signal further comprises encoding based on the definition.

15. The method as claimed in any of claims 11 to 12, wherein the input format further comprises a multichannel audio signal, wherein the method further comprises:

encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.

Description:
AUDIO REPRESENTATION AND ASSOCIATED RENDERING

Field

The present application relates to apparatus and methods for sound-field related audio representation and associated rendering, but not exclusively for audio representation for an audio encoder and decoder.

Background

Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.

Furthermore parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and an effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.

Summary

There is provided according to a first aspect an apparatus comprising means for: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

The input format may further comprise a definition configured to control an encoder.

The means for may be further for: encoding a mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.

The encoding a mono audio signal based on the mono audio signal may be based on the definition configured to control an encoder.

The input format may further comprise a multichannel audio signal, wherein the means for may be further for: encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.

The multichannel audio signal may be a stereo audio signal.

The encoded mono audio signal may be an enhanced voice system encoded mono audio signal.

The encoded multichannel audio signal may be one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.

The metadata signal may comprise: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of direct-to-total energy ratios for the two directions is 1.

According to a second aspect there is provided a method comprising obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

The input format may further comprise a definition configured to control an encoding.

The method may further comprise: encoding a mono audio signal based on the mono audio signal; encoding a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.

Encoding a mono audio signal based on the mono audio signal may further comprise encoding based on the definition.

The input format may further comprise a multichannel audio signal, wherein the method may further comprise: encoding a mono audio signal based on the mono audio signal; and encoding a multichannel audio signal based on the multichannel audio signal.

The multichannel audio signal may be a stereo audio signal.

The encoded mono audio signal may be an enhanced voice system encoded mono audio signal.

The encoded multichannel audio signal may be one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.

The metadata signal may comprise: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of direct-to-total energy ratios for the two directions is 1.

According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

The input format may further comprise a definition configured to control an encoding.

The apparatus may further be caused to: encode a mono audio signal based on the mono audio signal; encode a multichannel audio signal based on the mono audio signal and the metadata signal associated with the mono signal.

The apparatus caused to encode a mono audio signal based on the mono audio signal may further be caused to encode based on the definition.

The input format may further comprise a multichannel audio signal, wherein the apparatus may further be caused to: encode a mono audio signal based on the mono audio signal; and encode a multichannel audio signal based on the multichannel audio signal.

The multichannel audio signal may be a stereo audio signal.

The encoded mono audio signal may be an enhanced voice system encoded mono audio signal.

The encoded multichannel audio signal may be one of: an enhanced voice system encoded multichannel audio signal; and an Immersive Voice and Audio Services multichannel audio signal.

The metadata signal may comprise: two directional parameters for each time-frequency tile, wherein the direction parameters are limited to a single plane of elevation; and direct-to-total energy ratios associated with the two directional parameters, wherein the sum of direct-to-total energy ratios for the two directions is 1.

According to a fourth aspect there is provided an apparatus comprising obtaining circuitry configured to obtain an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

According to a sixth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

According to a seventh aspect there is provided an apparatus comprising: means for obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

According to an eighth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining an input format for generating an encoded mono audio signal and/or a multichannel audio signal, the input format comprising: a mono audio signal and a metadata signal associated with the mono signal, the metadata signal configured to enable the generation of the encoded multichannel audio signal from the mono audio signal.

Encoding a mono audio signal may be encoding a bit-exact mono audio signal.

The encoded mono audio signal may be an encoded bit-exact mono audio signal.

An apparatus comprising means for performing the actions of the method as described above.

An apparatus configured to perform the actions of the method as described above.

A computer program comprising program instructions for causing a computer to perform the method as described above.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein.

An electronic device may comprise apparatus as described herein.

A chipset may comprise apparatus as described herein.

Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures

For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;

Figure 2 shows schematically a system of apparatus for an IVAS encoder architecture of Figure 1 including a mono signal input;

Figure 3 shows schematically a first example IVAS encoder architecture of Figure 1 including a mono signal input according to some embodiments;

Figure 4 shows schematically a second example IVAS encoder architecture of Figure 1 including a mono signal input according to some embodiments;

Figure 5 shows example bit distribution for bitstream examples based on the first example IVAS encoder architecture shown in Figure 3;

Figures 6 to 9 show example voice conference systems employing some embodiments;

Figure 10 shows a third example IVAS encoder architecture of Figure 1 including a mono signal input according to some embodiments;

Figure 11 shows an example voice conference system employing the third example IVAS encoder architecture as shown in Figure 10 according to some embodiments; and

Figure 12 shows an example device suitable for implementing the apparatus shown.

Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms for the provision of efficient representation of audio in immersive systems which implement an embedded stereo (or spatial) mode of operation and which further support a bit-exact enhanced voice services (EVS) compatible mono downmix bitstream in an efficient way. These examples enable the immersive encoding system to produce an EVS mono downmix bitstream that remains bit-exact against a standalone EVS implementation despite possible differences in mono and stereo (or spatial) pre-processing within the immersive encoding system. For example, there may be different filtering prior to encoding (including the stereo-to-mono downmix) between the immersive system and the standalone EVS. Such a pre-processing operation typically introduces a delay into the signal path, which may affect the outcome of the coding, e.g., due to different framing.

The above may assume a well-understood downmix from stereo to mono. For example, 'Mono = 0.5 x L + 0.5 x R' is a well-understood downmix. However, a practical stereo encoding may utilize an adaptive, smart downmix, for example to compensate inter-channel time/phase differences, which may maintain quality and produce a mono signal that reproduces the original stereo signal as faithfully as possible.
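By way of illustration only, the following sketch contrasts the well-understood passive downmix above with a simple adaptive downmix that compensates an inter-channel time difference before mixing; the integer-lag estimation and the whole-signal (rather than per-frame, per-band) operation are assumptions made for brevity and do not reflect the actual codec downmix.

```python
# Illustrative sketch only (not the codec's actual downmix). The adaptive
# variant aligns the right channel to the left by an integer lag chosen by
# cross-correlation, then averages; a practical encoder would work per frame
# and per band and also handle phase, which is why two implementations can
# produce different mono signals from the same stereo input.
import numpy as np

def passive_downmix(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """The well-understood downmix: Mono = 0.5 x L + 0.5 x R."""
    return 0.5 * left + 0.5 * right

def adaptive_downmix(left: np.ndarray, right: np.ndarray, max_lag: int = 32) -> np.ndarray:
    """Compensate an inter-channel time difference (integer lag) before mixing."""
    lags = range(-max_lag, max_lag + 1)
    # pick the lag that maximises the correlation between L and the shifted R
    best = max(lags, key=lambda d: float(np.dot(left, np.roll(right, d))))
    return 0.5 * left + 0.5 * np.roll(right, best)
```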

Although the examples shown herein are described with respect to the IVAS codec any other codec or coding can implement some embodiments as described herein. For example an embedded bit-exact stereo or spatial extension can be of great interest for interoperability and conformance reasons (particularly in managed services with quality of service (QoS) requirements).

The concept as discussed in further detail hereafter allows, in some embodiments, an embedded stereo (or spatial) extension to feature bit-exact legacy mono operation in an embedded encoding structure while providing freedom of high-quality stereo-to-mono (or spatial-to-mono) downmix. Additionally the embodiments can extend Metadata-assisted spatial audio (MASA), which may be understood (at least) as an ingest format intended for the 3GPP IVAS audio codec, for (embedded) stereo encoding in a "spatial" MASA-compatible way.

The embodiments may thus define an EETU-MASA (Embedded EVS Stereo Using Metadata-Assisted Spatial Audio) method allowing embedded stereo (and by regular MASA extension, of course, immersive/spatial) operation on top of the legacy EVS codec in a way where the mono downmix can be guaranteed to be bit-exact with EVS operation as specified by the standards such as TS 26.445, TS 26.442, and TS 26.444. This allows for straightforward conformance in 3GPP services (e.g., MTSI).

In some embodiments EETU-MASA is implemented as stereo as 1-channel + metadata input: in such an example the MASA format is configured with a channel configuration option 'stereo using mono input + MASA metadata'. This configuration information can be provided for example via a specific metadata field (e.g., called 'Channel configuration' or 'Channel audio format'). An IVAS encoder on receiving this input can be configured to select EVS as the core coding of the mono stream (treating the input as mono without metadata), while any MASA metadata is fed into the MASA metadata encoder. The bitstream from the IVAS encoder can in some embodiments include a bit-exact EVS mono and additional MASA metadata providing a stereo extension. This can be transmitted to a suitable IVAS decoder as is. The IVAS recipient receives the IVAS mono stream and decodes stereo (or spatial audio). Alternatively, in some embodiments a suitable network element (such as an MCU) can, for a legacy EVS recipient, drop or strip the additional metadata. Thus, the network element performs a transcoding from IVAS to EVS which is lossless for the mono part. In such embodiments the EVS recipient receives the EVS mono stream and decodes mono.
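As a sketch of the network-element behaviour described above, the following assumes a deliberately simplified packet layout (a two-byte length prefix followed by the EVS payload and then the metadata); the actual IVAS payload and RTP formats are not defined here, so the layout and function names are purely hypothetical.

```python
# Hypothetical packet layout for illustration only, not the real IVAS format:
# [2-byte EVS payload length][bit-exact EVS payload][MASA metadata].
# A network element serving a legacy EVS recipient simply drops everything
# after the EVS payload; because the EVS part is bit-exact, this "transcoding"
# is lossless for the mono stream.
import struct

def pack_ivas(evs_payload: bytes, masa_metadata: bytes) -> bytes:
    """Concatenate the embedded EVS payload and the stereo-extension metadata."""
    return struct.pack(">H", len(evs_payload)) + evs_payload + masa_metadata

def strip_to_evs(ivas_packet: bytes) -> bytes:
    """Return only the embedded EVS payload for a legacy EVS recipient."""
    (evs_len,) = struct.unpack_from(">H", ivas_packet, 0)
    return ivas_packet[2:2 + evs_len]
```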

The stereo mode of the MASA spatial metadata can be achieved using the definition below:

Provide two direction parameters per time-frequency (TF) tile;

Direction index is limited to Left and Right with no elevation;

Sum of Direct-to-total energy ratio for the two Directions is always 1.0;

All the other parameters can be omitted or set to zero/default.

The above definition may be particularly useful in situations with independent stereo channels, for example in a multipoint control unit (MCU) stereo audio use case.
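A minimal data-structure sketch of this restricted stereo mode is given below; the azimuth values and field names are assumptions chosen for illustration and are not the normative MASA parameter encoding.

```python
# Sketch of the restricted stereo-mode metadata for one time-frequency tile:
# two directions fixed to left and right in a single plane (no elevation), and
# two direct-to-total energy ratios that always sum to 1.0. All other MASA
# parameters are omitted or left at zero/default. Values are illustrative.
from dataclasses import dataclass

LEFT_AZIMUTH_DEG = 90.0    # assumed left direction
RIGHT_AZIMUTH_DEG = -90.0  # assumed right direction

@dataclass
class StereoTFTile:
    ratio_left: float  # direct-to-total energy ratio of the left direction, 0..1

    @property
    def ratio_right(self) -> float:
        return 1.0 - self.ratio_left      # the two ratios always sum to 1.0

    def directions(self):
        # (azimuth, elevation, direct-to-total ratio) for the two directions
        return [(LEFT_AZIMUTH_DEG, 0.0, self.ratio_left),
                (RIGHT_AZIMUTH_DEG, 0.0, self.ratio_right)]
```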

In such embodiments the capture and audio processing system before the (IVAS) encoder, in other words the part of the end-to-end system that creates the (MASA) input for the encoder, is free to apply any stereo-to-mono downmix that is suitable for best possible quality. The (EVS/IVAS) encoder in such embodiments is configured to see the single mono downmix, and bit-exactness of the mono signal is therefore maintained.

The methods as described in the embodiments hereafter can furthermore be used for core codecs other than EVS and for extensions/immersive codecs other than IVAS.

In some embodiments an additional advantage may be that a bit-exact embedded stereo is allowed also on top of EVS adaptive multi-rate wideband (AMR-WB) interoperable (IO) modes (and therefore not only the EVS primary modes).

In some embodiments there may be defined an EETU-MASA with stereo input as 3-channel + metadata example. In such embodiments, the MASA format comprises a channel configuration option of 'stereo using combination of mono input and stereo input + MASA metadata'. This configuration information can be provided for example via a specific metadata field (e.g., called 'Channel configuration' or 'Channel audio format'). In an (IVAS) encoder, this input can configure the encoder to select EVS as the core coding of the mono stream (treating the input as mono without metadata), while at least the stereo stream with the MASA metadata is fed into the IVAS MASA encoder (including the metadata encoding). This mode of operation is thus a parallel stereo/spatial mono downmix encoding for bit-exact backwards interoperability.

The bitstream from the (IVAS) encoder will comprise a bit-exact (EVS) mono and additional (IVAS) bitstream with (MASA) metadata providing a stereo extension. This can be transmitted to an (IVAS) decoder as is (or with the EVS payload dropped). The (IVAS) recipient receives the (IVAS) stream and decodes stereo (or spatial audio). Alternatively in some embodiments a suitable network element (such as a MCU) can drop or strip, for a legacy EVS recipient, everything beyond the EVS bitstream. Thus, the network element can be configured to perform a transcoding from IVAS to EVS which is lossless for the mono part. The EVS recipient can be configured to receive the EVS mono stream and decodes the mono signals.

In some embodiments, the IVAS encoder (or any stereo/spatial encoder) can provide the EVS bitstream (or any mono bitstream) and the IVAS bitstream (or any stereo/spatial bitstream) as separate packets.

Before discussing the embodiments further we initially discuss the systems for obtaining and rendering spatial audio signals which may be used in some embodiments.

With respect to Figure 1 is shown an example apparatus and system for implementing the obtaining and encoding an audio signal (in the form of audio capture in this example) and rendering (the encoded audio signals).

The system 100 is shown with an 'analysis' part 121 and a 'synthesis' part 131. The 'analysis' part 121 is the part from receiving the multi-channel signals up to an encoding of the metadata and transport signal and the 'synthesis' part 131 is the part from a decoding of the encoded metadata and transport signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).

The input to the system 100 and the 'analysis' part 121 is the multi-channel signals 102. In the following examples a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments. For example in some embodiments the spatial analyser and the spatial analysis may be implemented external to the encoder. For example in some embodiments the spatial metadata associated with the audio signals may be provided to an encoder as a separate bit-stream. In some embodiments the spatial metadata may be provided as a set of spatial (direction) index values.

The multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105. In some embodiments the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable transport signal comprising a determined number of channels and output the transport signals 104. For example the transport signal generator 103 may be configured to generate a 2 audio channel downmix of the multi-channel signals. The determined number of channels may be any suitable number of channels. The transport signal generator in some embodiments is configured to otherwise select or combine, for example, by beamforming techniques the input audio signals to the determined number of channels and output these as transport signals.
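Purely as a sketch of the behaviour described for the transport signal generator, and not its actual implementation, the following reduces an N-channel capture to a fixed number of transport channels either by selecting channels or by a simple static downmix; the group-averaging stand-in for downmixing or beamforming is an assumption.

```python
# Assumed behaviour, for illustration only: reduce n input channels to a fixed
# number of transport channels either by selecting the first channels or by
# averaging groups of channels (a trivial stand-in for a real downmix or
# beamforming stage).
import numpy as np

def generate_transport(signals: np.ndarray, n_transport: int = 2,
                       mode: str = "downmix") -> np.ndarray:
    """signals: (n_channels, n_samples) array; returns (n_transport, n_samples)."""
    n_channels = signals.shape[0]
    if mode == "select" or n_channels <= n_transport:
        return signals[:n_transport]
    groups = np.array_split(np.arange(n_channels), n_transport)
    return np.stack([signals[group].mean(axis=0) for group in groups])
```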

In some embodiments the transport signal generator 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the transport signals are in this example.

In some embodiments the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104. The analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 108 and an energy ratio parameter 110 and a coherence parameter 112 (and in some embodiments a diffuseness parameter). The direction, energy ratio and coherence parameters (and diffuseness parameter) may in some embodiments be considered to be spatial audio parameters. In other words the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general).

In some embodiments the parameters generated may differ from frequency band to frequency band. Thus for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons. The transport signals 104 and the metadata 106 may be passed to an encoder 107. In some embodiments, the spatial audio parameters may be grouped or separated into directional and non-directional (such as, e.g., diffuse) parameters.
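The band-dependent parameter selection can be pictured with the sketch below; the band split, the thresholds and the field names are assumptions made for illustration rather than values taken from any codec.

```python
# Illustrative only: different frequency bands may carry different parameter
# sets, e.g. the full set in low bands, a reduced set higher up, and nothing
# at all in the highest band for perceptual reasons. Thresholds are arbitrary.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BandParameters:
    direction_azimuth: Optional[float] = None  # degrees, None = not transmitted
    energy_ratio: Optional[float] = None       # direct-to-total, 0..1
    coherence: Optional[float] = None          # 0..1

def prune_for_band(full: BandParameters, band_index: int, n_bands: int) -> BandParameters:
    if band_index == n_bands - 1:                              # "band Z": nothing sent
        return BandParameters()
    if band_index >= n_bands - 3:                              # "band Y": subset only
        return BandParameters(energy_ratio=full.energy_ratio)
    return full                                                # "band X": full set
```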

The encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport (for example downmix) signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The encoding may be implemented using any suitable scheme. The encoder 107 may furthermore comprise a metadata encoder/quantizer 111 which is configured to receive the metadata and output an encoded or compressed form of the information. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the metadata within encoded downmix signals before transmission or storage shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.

In the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport extractor 135 which is configured to decode the audio signals to obtain the transport signals. Similarly the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata. The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.

The decoded metadata and transport audio signals may be passed to a synthesis processor 139.

The system 100 'synthesis' part 131 further shows a synthesis processor 139 configured to receive the transport and the metadata and re-creates in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be multichannel loudspeaker format or in some embodiments any suitable output format such as binaural signals for headphone listening or Ambisonics signals, depending on the use case) based on the transport signals and the metadata.

Therefore in summary first the system (analysis part) is configured to receive multi-channel audio signals.

Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting or downmixing some of the audio signal channels).

The system is then configured to encode for storage/transmission the transport signal and the metadata.

After this the system may store/transmit the encoded transport and metadata.

The system may retrieve/receive the encoded transport and metadata.

Then the system is configured to extract the transport and metadata from encoded transport and metadata parameters, for example demultiplex and decode the encoded transport and metadata parameters.

The system (synthesis part) is configured to synthesize an output multi-channel audio signal based on extracted transport audio signals and metadata.

In some embodiments the apparatus and methods can be implemented as part of a MASA format definition, encoder functionality, and bitstream format (including, e.g., RTP header). These embodiments are relevant for the audio codec standard as well as various network functionalities (e.g., MCU operation).

With respect to Figure 2 is shown a high-level view of an example IVAS encoder including the various inputs which may, as non-exclusive examples, be expected for the codec. The underlying idea is that mono signals are handled by a bit-exact implementation of the EVS codec, while any stereo, spatial or immersive input is handled by the IVAS core tools complemented in some cases by a metadata encoder.

The system as shown in Figure 2 can comprise input format generators 203. The input format generators 203 may be considered in some examples to be the same as the transport signal generator 103 and the analysis processor 105 from Figure 1. The input format generators 203 may be configured to generate suitable audio signals and metadata for capturing the audio and spatial audio qualities of the input signals, which may originate from microphone capture, some other source (such as a file) or a combination thereof. For example, a relevant microphone capture may be a multi-microphone audio capture on a mobile device (such as a smartphone), while a relevant other source may be a channel-based music file (such as a 5.1 music mix file). Any other suitable microphone array capture or source can also be used.

The input format generators 203 can comprise a mono audio signal generator 205 configured to generate a suitable mono audio signal.

The input format generators 203 can also comprise a multichannel or spatial format generator 207.

The multichannel or spatial format generator 207 in some embodiments comprises a metadata-assisted spatial audio generator 209. The metadata-assisted spatial audio generator 209 is configured to generate audio signals (such as the transport audio signals in the form as a stereo-channel audio signal) and metadata associated with the audio signals.

The multichannel or spatial format generator 207 in some embodiments comprises a multichannel format generator 211 configured to generate suitable multichannel audio signals (for example stereo channel format audio signals and/or 5.1 channel format audio signals).

The multichannel or spatial format generator 207 in some embodiments comprises an ambisonics generator 213 configured to generate a suitable ambisonics format audio signal (which may comprise first order ambisonics and/or higher order ambisonics).

The multichannel or spatial format generator 207 in some embodiments can comprise an independent mono streams with metadata generator 215 configured to generate mono audio signals and metadata.

In some embodiments the apparatus comprises encoders 221. The encoders are configured to receive the output of the input format generators 203 and encode these into a suitable format for storage and/or transmission. The encoders may be considered to be the same as the encoder 107. The encoders 221 may comprise a bit-exact EVS encoder 223. The bit-exact EVS encoder 223 may be configured to receive the mono audio signal from the input format generators 203 and generate a bit-exact EVS mono audio signal.

In some embodiments the encoders 221 may comprise IVAS core encoder 225. The IVAS core encoder 225 may be configured to receive the audio signals generated by the input format generators 203 and encode these according to the IVAS standard.

In some embodiments the encoders comprise a metadata encoder 227. The metadata encoder is configured to receive the spatial metadata and encode it or compress it in any suitable manner.

The encoders 221 in some embodiments can be configured to combine or multiplex the datastreams generated by the encoders prior to being transmitted and/or stored.

The system furthermore comprises a transmitter configured to transmit or store the bitstream 231.

With respect to Figure 3 furthermore it is shown how an embedded EVS stereo generated signal can be implemented within the system shown in Figure 2.

Thus in this example there is a mono input 301 and a stereo (and immersive audio) input 303. The mono input 301 is passed to the encoder 311 and the bit-exact EVS encoder 317 in the same manner as shown in Figure 2.

The stereo and immersive audio input 303 is passed to the encoder 311 and a pre-processor 315.

The encoder 311 in some embodiments comprises a pre-processor 315. The pre-processor 315 may be configured to receive the stereo and immersive inputs and pre-process the signal before being passed to the downmixer 313 and to the IVAS core encoder 319. The metadata output of the pre-processor 315 can be passed to the metadata encoder 321.

The encoder 311 furthermore comprises a downmixer 313. The downmixer 313 is configured to process the pre-processed audio signal and output a downmixed or mono channel audio signal to the bit-exact EVS encoder 317. The downmixer 313 in some embodiments is further configured to output metadata associated with the downmixed audio signal to the metadata encoder 321.

The encoder 311 may comprise a bit-exact EVS encoder 317. The bit-exact EVS encoder 317 may be configured to receive the mono audio signal from the mono input 301 and the downmixer 313 and generate an EVS mono audio signal.

In some embodiments the encoder 311 may comprise the IVAS core encoder 319. The IVAS core encoder 319 may be configured to receive the audio signals generated by the pre-processor 315 and encode these according to the IVAS standard.

In some embodiments the encoder 311 comprises the metadata encoder 321. The metadata encoder 321 is configured to receive the spatial metadata from the downmixer 313 and pre-processor 315 and encode it or compress it in any suitable manner.

With respect to Figure 4 is shown how an embedded EVS stereo generated signal can be implemented within the system shown in Figure 2 according to a first example embodiment. This example improves over the example shown in Figure 3 in that, although the apparatus in Figure 3 implements an embedded EVS stereo, it is not a bit-exact output when compared to a mono downmix of the same stereo signal into a legacy EVS mono encoder. This is because there is a signal delay due to pre-processing (such as any highpass or lowpass filtering) affecting among other things the exact framing of the signal encoding. For example, if the input framing in the encoder is changed even by introducing a one-sample delay, the resulting bitstream will be different. In addition, the pre-processing itself can change the signal characteristics (such as removal of low-frequency or high-frequency components). Another example is if an active downmix is performed to deal with certain time/phase alignment effects, and this downmix processing differs from the downmix performed outside the codec. Although the apparatus in Figure 3 may be modified such that the pre-processing is skipped when the embedded stereo mode is used, this complicates the apparatus and introduces mode switching issues.

The embodiments further improve over the apparatus as shown in Figure 3 in that the downmix inside the codec is not limited to a simple downmix to be able to produce the same downmix outside the codec and inside the codec (as could be required for any managed system conformance test, where the requirement to be tested is providing "an embedded bit-exact EVS mono downmix bitstream").

The example shown in Figure 4 features the same mono input 301 and a stereo (and immersive audio) input 303. The mono input 301 is passed to the encoder 311 and the bit-exact EVS encoder 317 in the same manner as shown in Figure 3.

The stereo and immersive audio input 303 is passed to the encoder 311 and a pre-processor 315.

The encoder 311 in some embodiments comprises the pre-processor 315. The pre-processor 315 may be configured to receive the stereo and immersive inputs and pre-process the signal before being passed to the IVAS core encoder 319. The metadata output of the pre-processor 315 can be passed to the metadata encoder 321.

As shown in Figure 4 the apparatus differs from the example shown in Figure 3 in that the format generator/inputs include a further input. In this example the further input is designated the Embedded EVS stereo using MASA (EETU-MASA) input 401 . In this example a mono-downmixed parametric stereo representation of the stereo input is thus used which removes the need for passing the stereo or other multichannel audio signals through the pre-processor, for the inclusion of the downmixer prior to the EVS encoder, and allows the use of the metadata encoding as is.

The mono-downmixed parametric stereo representation in some embodiments is an extension of the MASA format. The extension is compatible with the MASA format parameter set. In principle, it is straightforward to allow encoding mode switching with this input, however, in some embodiments the mode is primarily used for the embedded bit-exact EVS stereo operation.

In some embodiments, the EETU-MASA input can be defined as (or additionally support) the following:

one or two Direction parameters per time-frequency (TF) tile;

a direction index limited to planar front sector (left-front-right) or any equivalent sector;

a sum of direct-to-total energy ratio for the two Directions < 1.0; and

other parameters may also have non-zero values.
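A minimal validity check for this relaxed definition might look as follows; the sector bounds and the tuple layout are assumptions made for illustration, not part of any specification.

```python
# Sketch of checking the relaxed EETU-MASA tile constraints listed above:
# one or two directions per TF tile, azimuths restricted to a planar front
# sector with no elevation, and the summed direct-to-total ratio at most 1.0.
FRONT_SECTOR_DEG = (-90.0, 90.0)   # assumed left-front-right sector

def tile_is_valid(directions) -> bool:
    """directions: list of (azimuth_deg, elevation_deg, direct_to_total_ratio)."""
    if not 1 <= len(directions) <= 2:
        return False
    for azimuth, elevation, _ in directions:
        if elevation != 0.0:                                   # planar only
            return False
        if not FRONT_SECTOR_DEG[0] <= azimuth <= FRONT_SECTOR_DEG[1]:
            return False
    return sum(ratio for _, _, ratio in directions) <= 1.0
```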

The stereo-to-mono downmix may be determined based on a capture or device implementation preference.

The EETU-MASA input is configured to pass the audio signal 441 to the bit-exact EVS encoder 317 and to pass the metadata 443 to the metadata encoder 321.

The encoder 311 may comprise a bit-exact EVS encoder 317. The bit-exact EVS encoder 317 may be configured to receive the mono audio signal from the mono input 301 and the EETU-MASA input 401 and attempt to generate a bit-exact EVS mono audio signal.

In some embodiments the encoder 311 may comprise the IVAS core encoder 319. The IVAS core encoder 319 may be configured to receive the audio signals generated by the pre-processor 315 and encode these according to the IVAS standard.

In some embodiments the encoder 311 comprises the metadata encoder 321. The metadata encoder 321 is configured to receive the spatial metadata from the EETU-MASA input 401 and pre-processor 315 and encode it or compress it in any suitable manner.

In some embodiments the rendering at the decoder is configured to provide a stereo signal. It is understood this stereo is preferably a head-locked stereo (in other words no head-tracking is needed and should not affect the rendering).

It is possible to implement the two above modes in a switching system, where a mode selection selects, based on a relevant criterion, one of the two modes for each frame of audio. Typically, fluctuation from one mode to another and back on a frame-to-frame basis would be avoided. The mode selection in this case is part of the front-end processing and seen on the format level by the audio encoder.
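To make the frame-wise selection concrete, here is a sketch of a mode selector with a simple hold time so the mode does not fluctuate from frame to frame; the mode names, the selection criterion and the hold length are placeholders, not values taken from the codec.

```python
# Front-end mode selection sketch with hysteresis: switch only after the
# current mode has been held for a minimum number of 20 ms frames. The hold
# length, mode labels and the externally supplied preference are assumptions.
class ModeSelector:
    def __init__(self, hold_frames: int = 50):   # e.g. 1 second at 20 ms frames
        self.mode = "mode_a"
        self.frames_in_mode = 0
        self.hold_frames = hold_frames

    def update(self, preferred_mode: str) -> str:
        """Return the active mode for the current frame, given the preferred one."""
        self.frames_in_mode += 1
        if preferred_mode != self.mode and self.frames_in_mode >= self.hold_frames:
            self.mode = preferred_mode
            self.frames_in_mode = 0
        return self.mode
```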

In some embodiments the EETU-MASA format comprises a channel configuration parameter which may be defined as a channel configuration specifying 'stereo input as mono + restricted MASA metadata'. In some embodiments this configuration information when detected by the encoder 411 configures the EVS encoder 317 to automatically trigger EVS mono encoding and configures the metadata encoder 321 to generate a separate metadata (stream) encoding for the stereo extension.

Example outputs from the encoder are shown in Figure 5. Figure 5a (the upper block) shows an example where the full IVAS payload is allocated between the EVS BE bitstream and the stereo (spatial) extension metadata. Thus, for example, where the available bitrate is 13.2 kbps the EVS BE allowance may be 9.6 kbps and the metadata 3.6 kbps; where the available bitrate is 16.4 kbps the EVS BE allowance may be 13.2 kbps and the metadata 3.2 kbps; where the available bitrate is 24.4 kbps the EVS BE allowance may be 16.4 kbps and the metadata 8.0 kbps; and where the available bitrate is 32.0 kbps the EVS BE allowance may be 24.4 kbps and the metadata 7.6 kbps.

Figure 5b (the middle block) illustrates an option where the extension bit rate is reduced to allow the first bit in the IVAS payload to indicate the extension usage, as shown by the small 0.05 kbps block preceding the EVS BE blocks. Thus, for example, where the available bitrate is 13.2 kbps the extension usage is 0.05 kbps, the EVS BE allowance may be 9.6 kbps and the metadata 3.55 kbps; where the available bitrate is 16.4 kbps the extension usage is 0.05 kbps, the EVS BE allowance may be 13.2 kbps and the metadata 3.15 kbps; where the available bitrate is 24.4 kbps the extension usage is 0.05 kbps, the EVS BE allowance may be 16.4 kbps and the metadata 7.95 kbps; and where the available bitrate is 32.0 kbps the extension usage is 0.05 kbps, the EVS BE allowance may be 24.4 kbps and the metadata 7.55 kbps.

Figure 5c (the lower block) shows a further illustration for a 32-kbps packet, which is similar to the middle block but utilizing the first bit of each embedded stream for increased packet flexibility. In this example the 32 kbps packet can be divided into extension usage of 4x 0.05 kbps, 9.6 kbps EVS BE, 3.55 kbps metadata, 3.15 kbps metadata, 7.95 kbps metadata and 7.55 kbps metadata. The 32 kbps packet can also be divided into extension usage of 3x 0.05 kbps, 13.2 kbps EVS BE, 3.15 kbps metadata, 7.95 kbps metadata and 7.55 kbps metadata. The 32 kbps packet can also be divided into extension usage of 2x 0.05 kbps, 16.4 kbps EVS BE, 7.95 kbps metadata and 7.55 kbps metadata. Additionally, the 32 kbps packet can be divided into extension usage of 1x 0.05 kbps, 24.4 kbps EVS BE and 7.55 kbps metadata. This illustrates the flexibility of the embedded packetization.
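The Figure 5a allocations can be restated compactly; the values below are copied from the description above, and the helper is merely a sketch of splitting a total payload into its embedded parts, not an encoder interface.

```python
# Bit allocations of Figure 5a (kbps), restated from the text above:
# total IVAS payload -> (embedded bit-exact EVS part, stereo/spatial metadata).
FIGURE_5A_ALLOCATION = {
    13.2: (9.6, 3.6),
    16.4: (13.2, 3.2),
    24.4: (16.4, 8.0),
    32.0: (24.4, 7.6),
}

def split_payload(total_kbps: float) -> tuple:
    """Return (EVS BE kbps, extension metadata kbps) for a supported total."""
    evs_kbps, metadata_kbps = FIGURE_5A_ALLOCATION[total_kbps]
    assert abs((evs_kbps + metadata_kbps) - total_kbps) < 1e-6
    return evs_kbps, metadata_kbps
```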

In the examples shown in Figure 5, part of the bits used for the metadata can in some embodiments be used for residual coding, a differential extra layer on top of the core EVS coded downmix. The difference can be applied on top of sub-blocks of the core codec, for example algebraic code-excited linear prediction (ACELP) sub-blocks, TCX sub-blocks, etc. These methods can in some embodiments extend the usage of the methods to non-bit-exact embedded mono encoding systems.

In these embodiments the EETU-MASA input is a straightforward new extension of the MASA metadata definition providing an additional audio representation/mode based on a limitation applied on parameter usage (parameters used and allowed values). It is designed to be fully compatible with MASA format.

In some embodiments the EETU-MASA enables IVAS stereo operation with an embedded bit-exact EVS mono downmix bitstream. According to some embodiments the IVAS operation can also be a spatial operation with an embedded bit-exact EVS mono downmix bitstream. The embodiments furthermore allow a switching between stereo and spatial IVAS operation based on the input metadata while providing an embedded bit-exact EVS mono downmix bitstream.

With respect to Figures 6 to 9 are shown a series of example use cases implementing embodiments. Figure 6 presents a first voice conferencing scenario between three participants with a wide range of device capabilities. The system shows a legacy EVS upstream 602 implementation on user equipment 601 with mono capture and playback via earpiece (user A), an IVAS upstream 604 implementation on user equipment 603 with spatial audio capture and playback via headphones (user B), and an IVAS upstream 606 implementation on a conference room setup 605 using stereo audio capture and multi-channel loudspeaker presentation (user C). In this example the common codec that can be negotiated between these users is the EVS codec (either legacy EVS for all or with two users using EVS in IVAS). However, two users would have full IVAS capability with the first of them being able to provide spatial audio upstream (IVAS MASA) with preference for stereo/binaural downstream/presentation and with the second of the two users being able to provide stereo audio upstream (IVAS stereo) with preference for multi-channel spatial/immersive audio playback.

As the legacy EVS user 601 requires an EVS mono downstream, it seems there are two ways to handle the downstream audio (when the legacy EVS user 601 is silent). Where there is no mixing, the two ways are:

to produce a single EVS mono downstream for all participants; or to produce an EVS mono downstream for the legacy EVS user and a suitable IVAS downstream for the other user.

For this use case, it is understood that an embedded mode can be very desirable.

This can be further shown via Figure 7, which presents the same scenario as Figure 6 for the downstream, and Figure 8, which adds an additional fourth user D using user equipment 807 who is always in a listening-only mode. For example, fourth user D has joined the audio conference through a separate number or link allowing user D only to listen in.

As shown in Figure 7 each user is delivered an audio representation that is relevant to the user equipment with a reduced number of re-encodings and different bitstreams. Thus, for example, a transmitting user (for example user equipment 603 or 605) sends an IVAS payload consisting of 'EVS+stereo metadata' to the network. For receiving user equipment associated with user A (user equipment 601) and user B (user equipment 603), the spatial metadata is stripped and legacy EVS is delivered. Thus for example the MCU may be configured to transmit EVS mono with stereo/spatial metadata stripped out 702 to the user equipment 601 and may be further configured to transmit EVS mono with stereo metadata (with any spatial metadata stripped out) 704 to the user equipment 603. Immersive participants, for example user C operating user equipment 605, may be configured to receive from the MCU 607 an EVS mono and spatial metadata downlink 706. Furthermore as shown in Figure 8 user D operating user equipment 807 may be configured to receive from the MCU 607 an EVS mono and spatial metadata downlink 808.

In such an example it is possible to see that overall delivery load in the network is reduced, where a single bitstream is suitable for all receiving users (or for as many users as possible).

It is understood that in the examples of Figures 6 to 8 user B 603 could at least in some embodiments also receive a bitstream describing EVS mono and spatial metadata instead of EVS mono and stereo metadata. This is because a spatial signal can be presented over headphones, e.g., via means of binauralization.

With respect to Figure 9 is shown a further example wherein user B (user equipment 903) uploads or transmits to an MCU node 915 of a network an IVAS payload consisting of EVS mono 906 and stereo metadata 904. The MCU nodes 915, 917 are shown passing the IVAS payload (EVS mono 906 and stereo metadata 904). The receiving user equipment associated with receiver 1 (user equipment 901) is configured to receive from the MCU node 915 signals where the metadata is stripped and legacy EVS 906 is delivered. Similarly the MCU node 917 may be configured to transmit EVS mono 906 with stereo/spatial metadata stripped out to the user equipment 905.

Immersive participants, for example receiver 3 (user equipment 907) may be configured to receive from the MCU 917 an IVAS payload (in the form of an EVS mono 906 and spatial metadata 904) downlink.

With respect to Figure 10 is shown a further example of some embodiments. In this example it differs from the example shown in Figure 4 in that the EETU-MASA input 401 (and a stereo or multichannel signal) is furthermore passed 1045 to the pre-processor 315. In such embodiments backward compatible embedded EVS encoding in the IVAS codec is achieved by representing the stereo input as a combination of a mono downmix and a stereo (or more generally multichannel) representation. In such an example the input is thus a 3-channel input. The input can furthermore include full MASA metadata. In preferred embodiments, this can be considered to be a special case of MASA input format.

The mono downmix can be generated using any suitable means. The resulting mono signal is utilized as one component of the 3-channel input. The original stereo, or stereo with MASA metadata, is similarly used as one component of the 3-channel input. In some embodiments the 3-channel input for the IVAS encoder can be created for example based on a mixing of at least two audio streams (e.g., an operation on an MCU). At least in some embodiments, any delay incurred by the mono downmix can be taken into account for the stereo signal of the 3-channel input. The mono and stereo audio signals can thus be fully aligned in time.
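A sketch of assembling such a 3-channel input is shown below; the downmix function, its delay and the channel ordering are placeholders chosen for illustration, not part of the format definition.

```python
# Sketch only: build the 3-channel input [mono downmix, L, R], compensating
# any delay introduced by the downmix chain so that all three channels remain
# time aligned. The downmix function and its delay are placeholders.
import numpy as np

def build_three_channel_input(left: np.ndarray, right: np.ndarray,
                              downmix_fn, downmix_delay_samples: int = 0) -> np.ndarray:
    mono = downmix_fn(left, right)            # any suitable stereo-to-mono downmix
    if downmix_delay_samples:
        pad = np.zeros(downmix_delay_samples)
        # delay the stereo pair by the same amount as the downmix path
        left = np.concatenate([pad, left])[: mono.shape[0]]
        right = np.concatenate([pad, right])[: mono.shape[0]]
    return np.stack([mono, left, right])      # assumed channel order: [mono, L, R]
```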

In such an example as shown in Figure 10, the mono channel of the 3-channel input is fed (without metadata) into a bit-exact EVS encoder. In some embodiments the EVS codec may be instructed to encode the signal at a fixed bit rate. In other words the EVS encoding can be without bit rate switching. This produces a fixed bit rate EVS bitstream. The stereo (+ metadata) is encoded using the IVAS core encoding (and metadata encoding). In some embodiments the mono stream may be fed also to the IVAS core encoder (and may in some embodiments be always provided to the EVS encoder).

The substantially simultaneously encoded EVS and IVAS frames (both of length 20 ms, although they may have a slight relative offset due to potential mismatch in core encoder lookahead) are packed together in some embodiments into a common package for transmission.

For example, there may be various coding and transmission modes with package-embedded EVS or embedded scalable EVS enabled, such as for example:

Above, by "package-embedded" it is understood that the EVS bitstream is part of the IVAS package in a special operation mode. The EVS bitstream can be provided to a legacy EVS user. However, when IVAS is being decoded, the EVS bitstream may be simply discarded. The first two examples may be implemented in this way. By "embedded scalable" it is understood "regular" embedded operation, for example resembling Figure 5. The third example may be implemented in such a manner. This package in some embodiments includes three separate encodings: EVS at 13.2 kbps, EVS-based IVAS stereo at 16.4 kbps, and a 47.6 kbps IVAS encoding (that may be, for example, a high-quality stereo or a spatial signal).

Figure 11 presents a further use example associated with the further example shown in Figure 10. In this example a fixed packet size 1115 may be used to communicate between MCUs such as MCU 1121 and MCU 1111. While it may seem wasteful to deliver several encodings in a single package, there may be systems where a fixed prioritized packet size (e.g., a 64-kbps channel or some other size channel) for voice communication is implemented. The "package-embedded" delivery can in this case be used to provide various levels of service, e.g., to conference call participants with different capabilities.

Thus for example an IVAS mobile device 1101 and user may establish an IVAS connection 1102 (for example with a MASA input) with the MCU 1121. A legacy EVS mobile device 1105 and user may establish an EVS only connection 1106 with the MCU 1121. A further legacy EVS mobile device 1103 and user may establish an EVS only connection 1104 with the MCU 1111. Also as shown in Figure 11 a fixed line device 1107 and user may additionally establish a fixed package size 1108 (for example 64 kbps) connection with the MCU 1111.

With respect to Figure 12 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.

In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.

In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.

In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).

The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.

In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones (which may be head-tracked or non-tracked headphones) or similar.

In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.

The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.

Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.