Title:
DECODER SPATIAL COMFORT NOISE GENERATION FOR DISCONTINUOUS TRANSMISSION OPERATION
Document Type and Number:
WIPO Patent Application WO/2021/255328
Kind Code:
A1
Abstract:
An apparatus comprising means configured to: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; and generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

Inventors:
RÄMÖ ANSSI (FI)
LAAKSONEN LASSE (FI)
Application Number:
PCT/FI2021/050364
Publication Date:
December 23, 2021
Filing Date:
May 20, 2021
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
G10L19/012; G10L19/18; G10L25/21; H04W76/28; G06N20/00; G10L19/028; G10L25/78; H04S7/00
Domestic Patent References:
WO2014143582A1 (2014-09-18)
WO2017202680A1 (2017-11-30)
Foreign References:
EP2866228A1 (2015-04-29)
EP2936486B1 (2018-07-18)
US9865274B1 (2018-01-09)
US20150248889A1 (2015-09-03)
Other References:
NOKIA CORPORATION: "Description of the IVAS MASA C Reference Software", 3GPP Draft S4-191167, 3GPP TSG-SA4 #106 Meeting, 25 October 2019 (2019-10-25), pages 1-16, XP051799447. Retrieved from the Internet [retrieved on 2021-09-08]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:
CLAIMS:

1. An apparatus comprising means configured to: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

2. The apparatus as claimed in claim 1, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is configured to generate at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

3. The apparatus as claimed in any of claims 1 and 2, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is configured to: obtain or update a spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal; and generate the at least one spatial comfort noise parameter based on the spatial comfort noise generator model.

4. The apparatus as claimed in claim 3, wherein the means configured to obtain or update the spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is configured to determine: at least one comfort noise directional component, wherein the comfort noise directional component is associated with a frequency band and time window; and at least one comfort noise energy ratio, wherein the at least one comfort noise energy ratio is associated with one of the at least one comfort noise directional component.

5. The apparatus as claimed in claim 4, wherein the means configured to determine at least one comfort noise directional component, wherein the comfort noise directional component is associated with the frequency band and time window is configured to determine a time and/or frequency smoothed directional component based on a combination of a determined comfort noise directional component and a directional component of the metadata parameters.

6. The apparatus as claimed in any of claims 4 or 5, wherein the means configured to determine at least one comfort noise energy ratio is configured to determine at least one time and/or frequency smoothed comfort noise energy ratio based on a combination of a determined comfort noise energy ratio directional component and an energy ratio of the metadata parameters.

7. The apparatus as claimed in any of claims 1 to 6, wherein the means is further configured to obtain, during the discontinuous transmission mode of operation of the apparatus, at least one comfort noise audio signal, wherein the means configured to generate, during the discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter is configured to generate the at least one spatial comfort noise audio signal based on the at least one comfort noise audio signal.

8. The apparatus as claimed in claim 7, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is further configured to generate at least one spatial comfort noise parameter based on the at least one comfort noise audio signal.

9. The apparatus as claimed in any of claims 1 to 8, wherein the means is further configured to: determine, during the active mode of operation of the apparatus, a time window or period based on a source activity determination, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is configured to generate at least one spatial comfort noise parameter further based on the time window or period.

10. The apparatus as claimed in claim 9, wherein the means configured to generate at least one spatial comfort noise parameter further based on the time window or period is configured to track the at least one spatial comfort noise parameter more accurately during the time window or period when the source activity determination determines substantially background noise.

11. The apparatus as claimed in any of claims 1 to 10, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is configured to: determine a statistical model for spatial comfort noise; and generate the at least one spatial comfort noise parameter based on the statistical model of directional occurrences.

12. The apparatus as claimed in claim 11, wherein the means configured to determine the statistical model for spatial comfort noise is configured to determine the statistical model based on at least one of: a directional component of the metadata; an energy component of the metadata; a source activity determination.

13. The apparatus as claimed in any of claims 11 or 12, wherein the means configured to determine the statistical model for spatial comfort noise is configured to determine at least one of: a direction; a directional sector occurrence or hit-rate; a directional sector frequency or hit-ratio; an energy ratio; a spread coherence; a distance; a mean of any of the above; and a variance of any of the above.

14. The apparatus as claimed in any of claims 1 to 13, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus is configured to generate at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

15. The apparatus as claimed in any of claims 1 to 14, wherein the means configured to generate, during the discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter is configured to spatially render a spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

16. A method for an apparatus comprising: obtaining at least one audio signal and metadata parameters associated with the at least one audio signal; generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; and generating, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

Description:
DECODER SPATIAL COMFORT NOISE GENERATION FOR DISCONTINUOUS TRANSMISSION OPERATION

Field

The present application relates to apparatus and methods for decoder spatial comfort noise generation for discontinuous transmission operation, but not exclusively for immersive audio codec based decoder spatial comfort noise generation for discontinuous transmission operation.

Background

Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.

Voice Activity Detection (VAD), also known as speech activity detection or more generally as signal activity detection is a technique used in various speech processing algorithms, most notably speech codecs, for detecting the presence or absence of human speech. It can be generalized to detection of active signal, i.e., a sound source other than background noise. Based on a VAD decision, it is possible to utilize, e.g., a certain encoding mode in a speech encoder.
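
As a simple illustration of the principle (a minimal sketch only; the VAD algorithms in codecs such as EVS are considerably more elaborate, and all names and constants here are assumptions), a frame-energy detector that compares each frame against a tracked noise floor could look as follows:

import numpy as np

def simple_vad(frame, noise_floor, threshold_db=9.0, alpha=0.95):
    """Toy energy-based VAD for one 20 ms frame of PCM samples.

    Compares the frame energy to a running noise-floor estimate and
    returns (is_active, updated_noise_floor).
    """
    energy = np.mean(frame.astype(np.float64) ** 2) + 1e-12
    snr_db = 10.0 * np.log10(energy / (noise_floor + 1e-12))
    is_active = snr_db > threshold_db
    if not is_active:
        # Track the background noise only during inactive frames.
        noise_floor = alpha * noise_floor + (1.0 - alpha) * energy
    return is_active, noise_floor

Practical VAD implementations add refinements such as per-band SNR weighting and a hangover period to avoid clipping speech offsets.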

Discontinuous Transmission (DTX) is a technique utilizing VAD intended to temporarily shut off parts of active signal processing (such as speech coding according to certain modes) and the frame-by-frame transmission of encoded audio. For example, rather than transmitting normal encoded frames, a simplified update frame is sent to drive a comfort noise generator (CNG) at the decoder. The use of DTX can help reduce interference and/or preserve or reallocate capacity in a practical mobile network. Furthermore, the use of DTX can also help extend the battery life of the device.
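
At the receiver, DTX operation amounts to dispatching on the received frame type. The sketch below illustrates this control flow; the frame-type labels and the decoder/cng helper objects are illustrative assumptions, not any actual codec API:

from dataclasses import dataclass

@dataclass
class Frame:
    type: str          # "ACTIVE", "SID" or "NO_DATA" (illustrative labels)
    payload: bytes = b""

def handle_received_frame(frame, decoder, cng):
    """Dispatch one received frame under DTX operation (sketch)."""
    if frame.type == "ACTIVE":
        return decoder.decode(frame.payload)   # normal frame-by-frame decoding
    if frame.type == "SID":
        cng.update_from_sid(frame.payload)     # refresh comfort noise parameters
    # Both SID and NO_DATA frames are output as locally generated comfort noise.
    return cng.generate()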

Comfort Noise Generation (CNG) is a technique for creating synthetic background noise to fill silence periods that would otherwise be observed. For example, comfort noise generation can be implemented under DTX operation.

Silence Descriptor (SID) frames can be sent during speech inactivity to keep the receiver CNG decently well aligned with the background noise level at the sender side. This is of particular importance at the onset of each new talk spurt; thus, SID frames should not be too old when speech starts again. Commonly, SID frames are sent regularly, e.g., every 8th frame, but some codecs also allow variable-rate SID updates. SID frames are typically quite small: e.g., a 2.4 kbit/s SID bitrate corresponds to 48 bits per 20 ms frame.

Summary

There is provided according to a first aspect an apparatus comprising means configured to: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

The means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be configured to generate at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

The means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be configured to: obtain or update a spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal; and generate the at least one spatial comfort noise parameter based on the spatial comfort noise generator model.

The means configured to obtain or update the spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be configured to determine: at least one comfort noise directional component, wherein the comfort noise directional component is associated with a frequency band and time window; and at least one comfort noise energy ratio, wherein the at least one comfort noise energy ratio is associated with one of the at least one comfort noise directional component.

The means configured to determine at least one comfort noise directional component, wherein the comfort noise directional component is associated with the frequency band and time window may be configured to determine a time and/or frequency smoothed directional component based on a combination of a determined comfort noise directional component and a directional component of the metadata parameters.

The means configured to determine at least one comfort noise energy ratio may be configured to determine at least one time and/or frequency smoothed comfort noise energy ratio based on a combination of a determined comfort noise energy ratio directional component and an energy ratio of the metadata parameters.

The means may be further configured to obtain, during the discontinuous transmission mode of operation of the apparatus, at least one comfort noise audio signal, wherein the means configured to generate, during the discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter may be configured to generate the at least one spatial comfort noise audio signal based on the at least one comfort noise audio signal. The means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be further configured to generate at least one spatial comfort noise parameter based on the at least one comfort noise audio signal.

The means may be further configured to: determine, during the active mode of operation of the apparatus, a time window or period based on a source activity determination, wherein the means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be configured to generate at least one spatial comfort noise parameter further based on the time window or period.

The means configured to generate at least one spatial comfort noise parameter further based on the time window or period may be configured to track the at least one spatial comfort noise parameter more accurately during the time window or period when the source activity determination determines substantially background noise.

The means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be configured to: determine a statistical model for spatial comfort noise; and generate the at least one spatial comfort noise parameter based on the statistical model of directional occurrences.

The means configured to determine a statistical model for spatial comfort noise may be configured to determine the statistical model based on at least one of: a directional component of the metadata; an energy component of the metadata; a source activity determination.

The means configured to determine a statistical model for spatial comfort noise may be configured to determine at least one of: a direction; a directional sector occurrence or hit-rate; a directional sector frequency or hit-ratio; an energy ratio; a spread coherence; a distance; a mean of any of the above; and a variance of any of the above.

The means configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be configured to generate at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

The means configured to generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter may be configured to spatially render a spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to a second aspect there is provided a method, the method comprising: obtaining at least one audio signal and metadata parameters associated with the at least one audio signal; generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation; and generating, during a discontinuous transmission mode of operation, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

Generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation may comprise generating at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

Generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation may comprise: obtaining or updating a spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal; and generating the at least one spatial comfort noise parameter based on the spatial comfort noise generator model.

Obtaining or updating the spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal and during an active mode may comprise determining: at least one comfort noise directional component, wherein the comfort noise directional component is associated with a frequency band and time window; and at least one comfort noise energy ratio, wherein the at least one comfort noise energy ratio is associated with one of the at least one comfort noise directional component.

Determining at least one comfort noise directional component, wherein the comfort noise directional component is associated with the frequency band and time window may comprise determining a time and/or frequency smoothed directional component based on a combination of a determined comfort noise directional component and a directional component of the metadata parameters.

Determining at least one comfort noise energy ratio may comprise determining at least one time and/or frequency smoothed comfort noise energy ratio based on a combination of a determined comfort noise energy ratio directional component and an energy ratio of the metadata parameters.

The method may further comprise obtaining, during the discontinuous transmission mode of operation, at least one comfort noise audio signal, wherein generating, during the discontinuous transmission mode of operation, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter may comprise generating the at least one spatial comfort noise audio signal based on the at least one comfort noise audio signal.

Generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal may further comprise generating at least one spatial comfort noise parameter based on the at least one comfort noise audio signal.

The method may further comprise: determining, during the active mode of operation, a time window or period based on a source activity determination, wherein generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation may comprise generating at least one spatial comfort noise parameter further based on the time window or period. Generating at least one spatial comfort noise parameter further based on the time window or period may comprise tracking the at least one spatial comfort noise parameter more accurately during the time window or period when the source activity determination determines substantially background noise.

Generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation may comprise: determining a statistical model for spatial comfort noise; and generating the at least one spatial comfort noise parameter based on the statistical model of directional occurrences.

Determining the statistical model for spatial comfort noise may comprise determining the statistical model based on at least one of: a directional component of the metadata; an energy component of the metadata; a source activity determination.

Determining the statistical model for spatial comfort noise may comprise determining at least one of: a direction; a directional sector occurrence or hit-rate; a directional sector frequency or hit-ratio; an energy ratio; a spread coherence; a distance; a mean of any of the above; and a variance of any of the above.

Generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation may comprise generating at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

Generating, during the discontinuous transmission mode of operation, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter may comprise spatially rendering a spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

The apparatus caused to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be caused to generate at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

The apparatus caused to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be caused to: obtain or update a spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal; and generate the at least one spatial comfort noise parameter based on the spatial comfort noise generator model.

The apparatus caused to obtain or update the spatial comfort noise generator model based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be caused to determine: at least one comfort noise directional component, wherein the comfort noise directional component is associated with a frequency band and time window; and at least one comfort noise energy ratio, wherein the at least one comfort noise energy ratio is associated with one of the at least one comfort noise directional component.

The apparatus caused to determine at least one comfort noise directional component, wherein the comfort noise directional component is associated with the frequency band and time window may be caused to determine a time and/or frequency smoothed directional component based on a combination of a determined comfort noise directional component and a directional component of the metadata parameters. The apparatus caused to determine at least one comfort noise energy ratio may be caused to determine at least one time and/or frequency smoothed comfort noise energy ratio based on a combination of a determined comfort noise energy ratio directional component and an energy ratio of the metadata parameters.

The apparatus may be further caused to obtain, during the discontinuous transmission mode of operation of the apparatus, at least one comfort noise audio signal, wherein the apparatus caused to generate, during the discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter may be caused to generate the at least one spatial comfort noise audio signal based on the at least one comfort noise audio signal.

The apparatus caused to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be further caused to generate at least one spatial comfort noise parameter based on the at least one comfort noise audio signal.

The apparatus may be further caused to: determine, during the active mode of operation of the apparatus, a time window or period based on a source activity determination, wherein the apparatus caused to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be caused to generate at least one spatial comfort noise parameter further based on the time window or period.

The apparatus caused to generate at least one spatial comfort noise parameter further based on the time window or period may be caused to track the at least one spatial comfort noise parameter more accurately during the time window or period when the source activity determination determines substantially background noise.

The apparatus caused to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be caused to: determine a statistical model for spatial comfort noise; and generate the at least one spatial comfort noise parameter based on the statistical model of directional occurrences.

The apparatus caused to determine a statistical model for spatial comfort noise may be caused to determine the statistical model based on at least one of: a directional component of the metadata; an energy component of the metadata; a source activity determination.

The apparatus caused to determine a statistical model for spatial comfort noise may be caused to determine at least one of: a direction; a directional sector occurrence or hit-rate; a directional sector frequency or hit-ratio; an energy ratio; a spread coherence; a distance; a mean of any of the above; and a variance of any of the above.

The apparatus caused to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during the active mode of operation of the apparatus may be caused to generate at least one spatial comfort noise parameter based on an analysis of the at least one audio signal.

The apparatus caused to generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter may be caused to spatially render a spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to a fourth aspect there is provided an apparatus comprising: obtaining circuitry configured to obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generating circuitry configured to generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generating circuitry configured to generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to a seventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to an eighth aspect there is provided an apparatus comprising: means for obtaining at least one audio signal and metadata parameters associated with the at least one audio signal; means for generating at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; means for generating, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

According to a ninth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain at least one audio signal and metadata parameters associated with the at least one audio signal; generate at least one spatial comfort noise parameter based on the metadata parameters associated with the at least one audio signal and during an active mode of operation of the apparatus; generate, during a discontinuous transmission mode of operation of the apparatus, at least one spatial comfort noise audio signal based on the at least one spatial comfort noise parameter.

An apparatus comprising means for performing the actions of the method as described above.

An apparatus configured to perform the actions of the method as described above.

A computer program comprising program instructions for causing a computer to perform the method as described above.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein.

An electronic device may comprise apparatus as described herein.

A chipset may comprise apparatus as described herein.

Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures

For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;

Figure 2 shows schematically an example IVAS decoder according to some embodiments;

Figure 3 shows schematically a further example IVAS decoder according to some embodiments;

Figure 4 shows a flow diagram of decoding according to some embodiments;

Figure 5 shows a further flow diagram of decoding according to some embodiments; and

Figure 6 shows schematically an example device suitable for implementing the apparatus shown.

Embodiments of the Application

The concept as discussed in the embodiments relates to speech and audio codecs and in particular immersive audio codecs supporting a multitude of operating points ranging from a low bit rate operation to transparency, as well as a range of service capabilities, e.g., from mono to stereo to fully immersive audio encoding/decoding/rendering. An example of such a codec is the 3GPP IVAS codec discussed above.

The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, it is expected that the decoder can output the audio in a number of supported formats. A pass-through mode has been proposed, where the audio could be provided in its original format after transmission (encoding/decoding).

For example a mono audio signal (without metadata) may be encoded using an Enhanced Voice Service (EVS) encoder. Other input formats may utilize new IVAS encoding tools. One input format proposed for IVAS is the Metadata-assisted spatial audio (MASA) format, where the encoder may utilize, e.g., a combination of mono and stereo encoding tools and metadata encoding tools for efficient transmission of the format. MASA is a parametric spatial audio format suitable for spatial audio processing. Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound (or sound scene) is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the relative energies of the directional and non-directional parts of the captured sound in frequency bands, expressed for example as a direct-to-total ratio or an ambient-to-total energy ratio in frequency bands. These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or for other formats, such as Ambisonics. For example, there can be two channels (stereo) of audio signals and spatial metadata. The spatial metadata may furthermore define parameters such as: Direction index, describing a direction of arrival of the sound at a time-frequency parameter interval; level/phase differences; Direct-to-total energy ratio, describing an energy ratio for the direction index; Diffuseness; Coherences such as Spread coherence, describing a spread of energy for the direction index; Diffuse-to-total energy ratio, describing an energy ratio of non-directional sound over surrounding directions; Surround coherence, describing a coherence of the non-directional sound over the surrounding directions; Remainder-to-total energy ratio, describing an energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of energy ratios is 1; Distance, describing a distance of the sound originating from the direction index in meters on a logarithmic scale; covariance matrices related to a multi-channel loudspeaker signal, or any data related to these covariance matrices; other parameters for guiding or controlling a specific decoder, e.g., VAD/DTX/CNG/SID parameters. Any of these parameters can be determined in frequency bands.
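
A minimal container for such a per-tile parameter set could look as follows; this is a sketch whose field names simply follow the list above, not a normative MASA structure definition:

from dataclasses import dataclass

@dataclass
class MasaTfTile:
    """Spatial metadata for one time-frequency tile (illustrative sketch)."""
    azimuth_deg: float          # direction of arrival, horizontal plane
    elevation_deg: float        # direction of arrival, vertical plane
    direct_to_total: float      # energy ratio for this direction, 0..1
    spread_coherence: float     # spread of energy for the direction, 0..1
    diffuse_to_total: float     # non-directional energy ratio, 0..1
    surround_coherence: float   # coherence of the non-directional sound, 0..1
    remainder_to_total: float   # e.g. microphone noise; the ratios sum to 1
    distance_log_m: float       # distance on a logarithmic scale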

As discussed above Voice Activity Detection (VAD) may be employed in such a codec to control Discontinuous Transmission (DTX), Comfort Noise Generation (CNG) and Silence Descriptor (SID) frames.

Furthermore, as discussed above, CNG is a technique for creating synthetic background noise to fill silence periods that would otherwise be observed, e.g., under DTX operation. However, complete silence can be confusing or annoying to a receiving user. For example, the listener could judge that the transmission may have been lost and then unnecessarily say "hello, are you still there?" to confirm, or simply hang up. On the other hand, sudden changes in sound level (from total silence to active background and speech, or vice versa) could also be very annoying. Thus, CNG is applied to prevent a sudden silence or sudden change. Typically, the CNG audio signal output is based on a highly simplified transmission of noise parameters.
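
As an illustration of this principle (a sketch under assumed parameter names, not the actual EVS CNG algorithm, which operates on linear-prediction-domain parameters), comfort noise can be produced by shaping random noise with a coarse per-band level envelope taken from the latest SID update:

import numpy as np

def generate_comfort_noise(band_levels_db, band_edges, n_fft, rng):
    """Shape spectrally flat noise with a coarse envelope (illustrative).

    band_levels_db: per-band target levels from the latest SID update.
    band_edges: FFT bin indices delimiting the bands (length = n_bands + 1).
    """
    n_bins = n_fft // 2 + 1
    spectrum = rng.normal(size=n_bins) + 1j * rng.normal(size=n_bins)
    for lo, hi, level_db in zip(band_edges[:-1], band_edges[1:], band_levels_db):
        spectrum[lo:hi] *= 10.0 ** (level_db / 20.0)
    return np.fft.irfft(spectrum, n=n_fft)

For example, generate_comfort_noise([-30.0, -36.0, -42.0], [0, 64, 160, 257], 512, np.random.default_rng()) returns one block of noise whose spectrum falls off towards the higher bands.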

There are currently no proposed spatial audio DTX, CNG and SID implementations. In particular, implementing DTX operation for spatial audio is likely to change the "feel" and level of the background noise that will be observed by the user. These changes in the spatialization of the background noise may be perceived by the listener as annoying or confusing. For example, in some embodiments as discussed herein the background noise is provided such that it is experienced as coming from the same direction(s) during both active and inactive speech periods.

For example, a user is talking in front of a spatial capture device with a busy road behind the device. The spatial audio capture then contains a constant traffic hum (that is somewhat diffuse) and specific traffic noises (e.g., car horns) coming mainly from behind, and of course the talker's voice coming from the front. When DTX is active and the user is not talking, both the N audio channels and the spatial metadata transmission can be shut off to save transmission bandwidth. In the absence of regular spatial audio coding, CNG provides a static background hum that is not too different from the original captured background noise. The embodiments as discussed herein attempt to generate spatial metadata during inactive periods. This avoids simply repeating the most recently received values, which would result in an annoying "stuck" spatial image.

Crude background noise description (SID) updates may be transmitted during inactive periods (with EVS this is mono and with IVAS typically stereo SID/CNG) to keep the signal properties (spectrum and energy) aligned between encoder and decoder. The embodiments as discussed herein attempt to define how to transmit spatial image SID updates.

Furthermore, in the example above, upon the local VAD indicating a speech onset, the user's voice returns; the traffic hum and other traffic noises again receive regular updates, and spatial metadata is again sent at the normal bitrate. The listener thus hears a significant change in the spatial reproduction. The embodiments as described herein are configured to consider the spatial dimension of the background noise during the CNG periods and SID updates. Thus, in such embodiments, the DTX operation is made as transparent and pleasant to the user as possible.

As such the embodiments as described herein attempt to provide an optimal DTX / CNG / SID system for parametric spatial audio such as MASA. Additionally the embodiments as described herein are configured to provide an optimal CNG system for parametric spatial audio such as MASA based on a mono or stereo DTX system.

Thus, some embodiments comprise an IVAS apparatus or system configured to implement a DTX / CNG system where the parametric spatial audio CNG is based either on a spatial audio DTX or a mono/stereo DTX. This means that spatial audio parameters are updated substantially synchronously with the core audio DTX. Also the embodiments as discussed herein are configured to implement CNG and possible SID updates such that they can work substantially synchronously with the core audio codec.

Additionally, in some embodiments the proposed apparatus is configured such that it may be capable of meeting the backward interoperability constraints expected of IVAS. Interoperability with EVS is an important feature. The full EVS codec algorithm shall be part of the IVAS codec solution. EVS bit-exact processing shall be used when the input to the IVAS codec is a simple mono signal without spatial metadata and should also be applied whenever possible. When multiple mono audio channels without spatial metadata are negotiated, they shall all be bit-exact with EVS.

In particular the IVAS codec as implemented in some embodiments herein may be configured to support certain stereo modes of operation which include an embedded bit-exact EVS mono downmix bitstream at the bit-rates from 9.6 kbit/s to 24.4 kbit/s SWB (9.6/13.2/16.4/24.4 kbit/s).

This requirement for an embedded bit-exact EVS mono downmix bitstream delivered as part of certain stereo modes of operation means that, for such stereo modes of operation, an EVS mono encoding with some additional separate encoded data needs to provide stereo audio playback for a stereo input. The additional separate encoded data can be removed/stripped/ignored to obtain an EVS mono bitstream. By bit-exact operation it is generally meant that the EVS mono bitstream needs to fully comply with one encoded and decoded with an external "legacy" EVS codec (i.e., the EVS standard). The embodiments may furthermore be configured such that when the embedded stereo/spatial IVAS codec backwards-compatible mode of operation is in DTX operation, it is compliant with EVS DTX operation. Thus, any additional DTX data should come at minimal cost for optimal operation; transmitting DTX data twice would complicate the system.

The embodiments as discussed herein relate to a decoder side spatial CNG update system for spatial audio codec utilizing a parametric spatial audio description. The embodiments may be implemented within or apply to codecs such as IVAS, and its MASA or MASA/DirAC coding, when DTX functionality is used.

The embodiments as described herein have the additional advantage that they can be utilized for embedded spatial extensions of mono/stereo codecs without additional SID updates for the spatial part. For example, some embodiments as described herein may be implemented within an embedded stereo codec. Some embodiments furthermore present a significant simplification of the current VAD/DTX/CNG solution for embedded systems.

In the embodiments described herein, spatial metadata is collected, analyzed and statistically modelled only on the decoder side. This mode of operation allows zero-bit additional transmission of SID spatial metadata. Also, no complexity is added to the encoder. This zero-bit spatial CNG is especially well suited for low-bitrate applications. In some embodiments additional spatial SID metadata may be analysed on the encoder side, transmitted to the decoder side and used to refine the spatial CNG model obtained with the methods as described herein. For example, a zero-bit spatial CNG can be employed in frames where no SID updates are received. For example, a zero-bit spatial CNG model can be updated based on decoder-side algorithms between SID update frames that then provide (at least partially) a new ground truth or state for the model.
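
One plausible decoder-side realization of such a zero-bit model, assuming MASA-style per-band azimuth and direct-to-total metadata plus a per-band activity judgment, is a recursive (exponentially smoothed) estimate that is updated only during active decoding. The class and constants below are an illustrative sketch, not the codec's specified algorithm:

import numpy as np

class SpatialCngModel:
    """Per-band smoothed direction and energy ratio (illustrative sketch)."""

    def __init__(self, n_bands, alpha=0.98):
        self.alpha = alpha                       # per-frame smoothing factor
        # Directions are tracked as unit vectors so azimuth wrap-around
        # (e.g. 359 degrees vs 1 degree) is averaged correctly.
        self.dir_vec = np.zeros((n_bands, 2))    # (cos az, sin az) per band
        self.energy_ratio = np.zeros(n_bands)

    def update(self, azimuth_deg, direct_to_total, noise_like):
        """Update the model from one decoded active frame's metadata.

        noise_like: per-band boolean; the model adapts faster in bands
        judged to contain mostly background noise (cf. the source
        activity determination discussed above).
        """
        az = np.radians(np.asarray(azimuth_deg))
        vec = np.stack([np.cos(az), np.sin(az)], axis=-1)
        a = np.where(noise_like, self.alpha * 0.9, self.alpha)
        self.dir_vec = a[:, None] * self.dir_vec + (1.0 - a[:, None]) * vec
        self.energy_ratio = a * self.energy_ratio + (1.0 - a) * np.asarray(direct_to_total)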

The following embodiments furthermore describe methods for synthesising spatial metadata during CNG periods. The spatial metadata generation relies on the spatial CNG model built during active signal transmission.
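
During CNG frames the stored model can then be read out to synthesize per-band metadata. The sketch below continues the illustrative SpatialCngModel above; the jitter magnitude is an assumption, added so that the rendered noise field does not appear frozen (the "stuck" spatial image discussed earlier):

import numpy as np

def generate_cng_metadata(model, rng, az_jitter_deg=3.0):
    """Synthesize per-band azimuth and energy ratio for one CNG frame."""
    az = np.degrees(np.arctan2(model.dir_vec[:, 1], model.dir_vec[:, 0]))
    # A small random perturbation avoids a perceptually static image.
    az = az + rng.normal(0.0, az_jitter_deg, size=az.shape)
    ratio = np.clip(model.energy_ratio, 0.0, 1.0)
    return az, ratio

Called once per CNG frame, e.g. generate_cng_metadata(model, np.random.default_rng()), the result would be fed to the spatial renderer together with the mono/stereo comfort noise audio.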

In some embodiments there is apparatus configured to acquire a spatial audio model and generate and render spatial audio under DTX operation. For a receiving user, it is annoying and confusing to be presented with complete silence every now and then. Thus the embodiments are configured such that, under DTX operation, CNG processing is utilized to fill in the silence using artificial background noise. Updates relating to the captured noise level during speech inactivity are sent with SID frames to keep the receiver CNG decently well aligned with the background noise level at the sender side. The embodiments are furthermore configured to overcome the issue where the spatial characteristics of the conversational spatial audio (under DTX) are not aligned between the sender-side capture and receiving-side presentation. In some embodiments the DTX SID update frame analysis and description and/or the decoder-side analysis for the CNG is extended to include the parameters needed for spatial rendering. Therefore, in some embodiments, means are provided whereby the decoder creates spatial CNG information from a decoder-side model that has been established during active signal periods.

In some embodiments a MASA input to the IVAS encoder comprises a suitable number of audio signals (for example 1 to 4 audio signals) and metadata. It should be noted that MASA encoding can be an efficient representation also for other spatial inputs besides a dedicated MASA input. For example, channel-based inputs or Ambisonics (FOA, HOA) inputs could be transformed into a MASA or DirAC format representation inside the audio encoder.

Figure 1 presents a high-level overview of a suitable system or apparatus for IVAS coding and decoding which is suitable for implementing embodiments as described herein. The system 100 is shown comprising an (IVAS) input 111. The IVAS input 111 can comprise any suitable input format. For example, as shown in Figure 1, there is a mono audio signal input 112. The mono audio signal input 112 may in some embodiments be passed to the encoder 121 and specifically to an Enhanced Voice Services (EVS) encoder 123. Furthermore there is shown a stereo and binaural audio signal input 113. The stereo and binaural audio signal input 113 in some embodiments is passed to the encoder 121 and specifically to the (IVAS) spatial audio encoder 125. Figure 1 also shows a Metadata-Assisted Spatial Audio (MASA) signal input 114. The Metadata-Assisted Spatial Audio (MASA) signal input in some embodiments is passed to the encoder 121. Specifically the audio component of the MASA input is passed to the (IVAS) spatial audio encoder 125 and the metadata component is passed to a metadata quantizer/encoder 127. Another input format shown in Figure 1 is an ambisonic audio signal, which may comprise a first order Ambisonics (FOA) and/or higher order Ambisonics (HOA) audio signal 115. The first order Ambisonics (FOA) and/or higher order Ambisonics (HOA) audio signal 115 in some embodiments is passed to the encoder 121 and specifically to the (IVAS) spatial audio encoder 125. Furthermore shown in Figure 1 is a channel-based audio signal input 116. This may be any suitable input audio channel format, for example 5.1 channel format, 7.1 channel format etc. The channel-based audio signal input 116 in some embodiments is passed to the encoder 121 and specifically to the (IVAS) spatial audio encoder 125. The final example input shown in Figure 1 is an object (or audio object) signal input 117. The object signal input in some embodiments is passed to the encoder 121. Specifically the audio component of the object signal input is passed to the (IVAS) spatial audio encoder 125 and the metadata component is passed to the metadata quantizer/encoder 127.

Figure 1 furthermore shows an (IVAS) encoder 121. The (IVAS) encoder 121 is configured to receive the audio signal from the input and encode it to produce a suitably formatted encoded bitstream 131. The (IVAS) encoder 121 in some embodiments as shown in Figure 1 comprises an EVS encoder 123 configured to receive any mono audio signals 112 and encode them according to an EVS codec definition.

Furthermore the (IVAS) encoder 121 is shown comprising an (IVAS) spatial audio encoder 125. The (IVAS) spatial audio encoder 125 is configured to receive the audio signals or audio signal components and encode the audio signals based on a suitable definition or coding mechanism. In some embodiments the spatial audio encoder 125 is configured to reduce the number of audio signals being encoded before the signals are encoded. For example in some embodiments the spatial audio encoder is configured to combine or otherwise downmix the input audio signals. In some embodiments, for example when the input type is MASA signals, the spatial audio encoder is configured to encode the audio signals as a mono or stereo (downmix) signal.

The spatial audio encoder 125 may comprise an audio encoder core which is configured to receive the downmix or the audio signals directly and generate a suitable encoding of these audio signals. The encoder 125 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.

In some embodiments the encoder 121 comprises a metadata quantizer/encoder 127. The metadata quantizer/encoder 127 is configured to receive the metadata, for example from the MASA input or the objects, and generate a suitable quantized and/or encoded metadata bitstream suitable for being combined with or associated with the encoded audio signal bitstream and being output as part of the (IVAS) bitstream 131.

Furthermore, as shown in Figure 1, there is an (IVAS) decoder 141. The decoder 141 in some embodiments comprises a metadata dequantizer/decoder 147. The metadata dequantizer/decoder 147 is configured to receive the encoded metadata, for example from the IVAS bitstream 131, and generate a metadata bitstream suitable for rendering the audio signals within the stereo and spatial audio decoder 145.

Figure 1 furthermore shows the (IVAS) decoder 141 comprising an EVS decoder 143. The EVS decoder 143 is configured to receive the EVS encoded mono audio signals as part of the IVAS bitstream 131 and decode them to generate a suitable mono audio signal which can be passed to an internal renderer (for example the stereo and spatial decoder) or a suitable external renderer.

Additionally in some embodiments the (IVAS) decoder 141 comprises a stereo and spatial audio signal decoder 145. The stereo and spatial audio signal decoder 145 in some embodiments is configured to receive the encoded audio signals and generate a suitable decoded spatial audio signal which can be rendered internally (for example by the stereo and spatial audio signal decoder) or by a suitable external renderer.

Therefore, in summary, the system is first configured to receive a suitable audio signal format. In some embodiments the system is configured to generate audio signals (a downmix, or more generally, transport audio signals). The system is then configured to encode the audio signals for storage/transmission. After this the system may store/transmit the encoded audio signals and metadata. The system may retrieve/receive the encoded audio signals and metadata. Then the system is configured to extract the audio signals and metadata from the encoded audio signals and metadata parameters, for example demultiplex and decode the encoded audio signals and metadata parameters.

The system may furthermore be configured to synthesize an output multi-channel audio signal based on the extracted audio signals and metadata.

With respect to Figure 2, the decoder shown in Figure 1 is presented in further detail according to some embodiments. The decoder 241 in some embodiments comprises a mode determiner 200 configured to determine whether the decoder is operating in a DTX or active mode.

Furthermore in some embodiments the decoder 241 comprises an (active mode) EVS decoder 201. The (active mode) EVS decoder 201 is configured to decode any suitable EVS encoded datastream and pass this to the IVAS renderer (or output to external renderer) 209.

In some embodiments the decoder 241 comprises an (active mode) IVAS stereo decoder 203. The (active mode) IVAS stereo decoder 203 is configured to decode any suitable IVAS stereo encoded datastream and pass this to the IVAS renderer (or output to external renderer) 209.

Furthermore the decoder 241 comprises a metadata dequantizer/decoder 205. The metadata dequantizer/decoder 205 is configured to decode any suitable encoded metadata datastream and pass this to the IVAS renderer (or output to external renderer) 209.

In some embodiments the decoder 241 comprises an EVS Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 211. The EVS Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 211 is configured to receive and decode any EVS DTX or SID information and pass this to an EVS comfort noise generator 215.

The decoder may further comprise an EVS comfort noise generator 215 which is configured to receive the decoded EVS DTX or SID information, generate comfort noise and pass this to the IVAS renderer (or output this to the external renderer) 209.

The decoder 241 furthermore in some embodiments comprises a Stereo Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 213. The stereo Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 213 is configured to receive and decode any stereo DTX or SID information and pass this to a stereo comfort noise generator 217.

The decoder may further comprise a stereo comfort noise generator 217 which is configured to receive the decoded stereo DTX or SID information, generate comfort noise and pass this to the IVAS renderer (or output this to the external renderer) 209.

In some embodiments the decoder 241 comprises a spatial comfort noise determiner 207. The spatial comfort noise determiner 207 may comprise a spatial comfort noise generator model updater 221 configured to receive information from the metadata dequantizer/decoder 205 and update the comfort noise generator model and pass this to comfort noise generator model storage 223.

The spatial comfort noise determiner 207 comprises comfort noise generator model storage 223 configured to receive the updated comfort noise generator model and store it. The stored comfort noise model can then be supplied to a spatial parameter generator 225.

The spatial comfort noise determiner 207 furthermore in some embodiments may comprise a spatial parameter generator 225 which receives the updated comfort noise generator model, generates spatial parameters and passes these to the IVAS renderer (or external renderer output) 209.

In some embodiments the decoder 241 comprises an (IVAS) renderer (or output to an external renderer) 209. The (IVAS) renderer (or output to an external renderer) 209 is configured to receive the decoded EVS audio from the EVS mono decoder 201, decoded stereo audio from the stereo decoder 203, decoded metadata from the metadata dequantizer/decoder 205, the CNG spatial parameters from the spatial comfort noise determiner 207, the EVS comfort noise from the EVS CNG 215 and the stereo comfort noise from the stereo CNG 217. The renderer is then configured to generate the spatial audio signals (with comfort noise when required) and output the rendered spatial audio signal (or provide the parameters to an external renderer to do the same).

For spatial (MASA) audio, no spatial update data is received during DTX operation; this results in zero-bit operation and no added complexity for the encoder. Instead, the spatial (MASA) CNG relies on mono (at least EVS, potentially also non-EVS mono modes) or stereo DTX updates, such as mono or stereo SID frames for the channel(s), to provide CNG updates. The decoder is thus configured to obtain, during active signal decoding, a set of spatial metadata parameters (MASA/DirAC etc.).

With respect to Figure 3 is shown a further embodiment of the decoder shown in Figure 1 according to some embodiments. The decoder 341 in some embodiments comprises a mode determiner 300 configured to determine whether the decoder is operating in a DTX or active mode.

Furthermore in some embodiments the decoder 341 comprises an (active mode) EVS decoder 301. The (active mode) EVS decoder 301 is configured to decode any suitable EVS encoded datastream, pass this to the IVAS renderer (or output to external renderer) 309 and furthermore pass some information to the spatial CNG model updater 321.

In some embodiments the decoder 341 comprises an (active mode) IVAS stereo decoder 303. The (active mode) IVAS stereo decoder 303 is configured to decode any suitable IVAS stereo encoded datastream, pass this to the IVAS renderer (or output to external renderer) 309 and furthermore pass some information to the spatial CNG model updater 321.

Furthermore the decoder 341 comprises a metadata dequantizer/decoder 305. The metadata dequantizer/decoder 305 is configured to dequantize and decode any suitable encoded spatial metadata, pass this to the IVAS renderer (or output to external renderer) 309 and furthermore pass some information to the spatial CNG model updater 321.

In some embodiments the decoder 341 comprises an EVS Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 311. The EVS Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 311 is configured to receive and decode any EVS DTX or SID information and pass this to an EVS comfort noise generator 315.

The decoder may further comprise an EVS comfort noise generator 315 which is configured to receive the decoded EVS DTX or SID information, generate comfort noise, pass this to the IVAS renderer (or output this to the external renderer) 309 and furthermore pass information to the spatial parameter generator 325.

The decoder 341 furthermore in some embodiments comprises a Stereo Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 313. The stereo Discontinuous Transmission/Silence Descriptor (DTX/SID) decoder 313 is configured to receive and decode any stereo DTX or SID information and pass this to a stereo comfort noise generator 317.

The decoder may further comprise a stereo comfort noise generator 317 which is configured to receive the decoded stereo DTX or SID information, generate comfort noise, pass this to the IVAS renderer (or output this to the external renderer) 309 and furthermore pass information to the spatial parameter generator 325.

In some embodiments the decoder 341 comprises a spatial comfort noise determiner 307. The spatial comfort noise determiner 307 may comprise a spatial comfort noise generator model updater 321 configured to receive information from the metadata dequantizer/decoder 305, EVS mono decoder 301 and Stereo decoder 303 and update the comfort noise generator model and pass this to comfort noise generator model storage 323.

The spatial comfort noise determiner 307 comprises comfort noise generator model storage 323 configured to receive the updated comfort noise generator model and store it. The stored comfort noise model can then be supplied to a spatial parameter generator 325.

The spatial comfort noise determiner 307 furthermore in some embodiments may comprise a spatial parameter generator 325 which receives the updated comfort noise generator model and information from the EVS CNG 315 and the stereo CNG 317, generates spatial parameters and passes these to the IVAS renderer (or external renderer output) 309.

In some embodiments the decoder 341 comprises an (IVAS) renderer (or output to an external renderer) 309. The (IVAS) renderer (or output to an external renderer) 309 is configured to receive the decoded EVS audio from the EVS mono decoder 301, decoded stereo audio from the stereo decoder 303, decoded metadata from the metadata dequantizer/decoder 305, the CNG spatial parameters from the spatial comfort noise determiner 307, the EVS comfort noise from the EVS CNG 315 and the stereo comfort noise from the stereo CNG 317. The renderer is then configured to generate the spatial audio signals (with comfort noise when required) and output the rendered spatial audio signal (or provide the parameters to an external renderer to do the same).

Figure 4 shows the operation of the spatial comfort noise determiner 207 as shown in Figure 2 according to some embodiments. In some embodiments the spatial CNG model updater 221 is configured to update the model for spatial CNG based on spatial metadata tracking. For example the spatial CNG model updater 221 is configured to initialize a spatial CNG model for spatial metadata either to zero or to any other initial state, for example a randomized state according to a seed value, so that uniform incoherent spatial audio can be rendered.

Thus in some embodiments during active mode signal decoding the spatial CNG model updater 221 is configured to obtain spatial metadata for each time frame (e.g., a 20-ms time frame as used for EVS/IVAS). For example, in IVAS MASA a frame of metadata may consist of N frequency bands F, where N <= M and M denotes the maximum number of frequency bands, and 4 temporal subframes of 5 ms each. (In some embodiments other values for the frequency bands and temporal subframes can be implemented.) The obtaining of the spatial metadata for the current frame is shown in Figure 4 by step 401.
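
For illustration, such a metadata frame could be represented as follows (a minimal Python sketch; the class and field names are assumptions for illustration, not the IVAS data structures):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TFTile:
        azimuth_deg: float       # direction parameter for this tile
        elevation_deg: float
        direct_to_total: float   # Direct-to-total-energy ratio, 0.0..1.0
        spread_coherence: float  # optional further spatial parameter

    @dataclass
    class MetadataFrame:
        # subframes[t][f]: TF tile for temporal subframe t (4 x 5 ms)
        # and frequency band f (N bands, N <= M)
        subframes: List[List[TFTile]]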

In some embodiments the CNG state (the spatial model) is then updated successively for each subframe T. Alternatively, in some embodiments, the spatial CNG model can be updated once per frame based on a (weighted) average of the subframe metadata values.

For each TF tile (in the subframe T), the spatial CNG model updater 221 is configured to obtain at least one Direction and at least one corresponding Energy ratio parameter, or any other parameter (such as spread coherence) used in the particular spatial audio representation. For example, the Energy ratio parameter can be a Direct-to-total-energy ratio describing the amount of the directional energy relative to the total energy in the TF tile. The reading/determining of the time-frequency tile direction, energy ratio, etc. parameters (for the current time slot T) is shown in Figure 4 by step 403.

For the directional component D (in frequency band F) an Energy ratio E(F, D) is thus obtained. Depending on the bit rate, the number of frequency bands and the accuracy of the directional information can vary over time. At least for the frequency bands (though this can also be done for the direction information), a mapping therefore needs to be determined between the received band F and the spatial CNG model frequency resolution. (This mapping is denoted by a hat for both the frequency band and the directional information, i.e., F̂ and D̂.) It is here proposed to use a lower frequency and directional resolution (e.g., corresponding to the lowest resolution of the active signal spatial information) for the CNG state, for two reasons: the memory consumption of the spatial model and the uncertainty of the directional information over time. In addition, with a lower frequency resolution the spatial CNG model parameters are updated more often for any particular Direction sector.

The next operation is determining which direction sector is hit for this frequency F; in other words, obtaining the direction based on the metadata and, from this direction, identifying a sector or region within which a first processing operation is to be carried out (as compared to outside of this sector or region, where a second processing operation is to be carried out). The operation of determining which direction sector is being hit is shown in Figure 4 by step 405.
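
One way to realise this sector determination is a nearest-midpoint search over a set of candidate sectors (a sketch; the midpoints below merely echo the example sector layout given later in the description):

    import math

    # Hypothetical sector midpoints (azimuth, elevation) in degrees
    SECTORS = [(0, 0), (-30, 0), (30, 0), (-90, 0), (90, 0),
               (-135, 0), (135, 0), (0, 90)]

    def hit_sector(azimuth_deg, elevation_deg):
        """Return the index of the sector whose midpoint is closest
        (by great-circle angle) to the decoded direction."""
        def angle(a, e, b, f):
            a, e, b, f = map(math.radians, (a, e, b, f))
            # spherical law of cosines for the angular distance
            c = (math.sin(e) * math.sin(f)
                 + math.cos(e) * math.cos(f) * math.cos(a - b))
            return math.acos(max(-1.0, min(1.0, c)))
        return min(range(len(SECTORS)),
                   key=lambda i: angle(azimuth_deg, elevation_deg,
                                       *SECTORS[i]))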

The next operation is one of updating the spatial CNG state Hit-rate or hit-ratio, Energy ratio, and other parameters for the current F band and Direction Sector, as shown in Figure 4 by step 407. The hit-ratio may for example be the ratio of the number of times a sector is identified based on the directional metadata against the total over all sectors for a particular frequency. The hit-ratios may furthermore in some embodiments be limited so that no particular sector has a ratio of 0 and no sector has a ratio close to 1. For example, for a ratio with values between 0 and 1, limit bounds of 0.2 and 0.8 may be implemented; however any suitable limits may be implemented. The hit-rate may be the hit-ratio over some suitable time window or period. For example the window may be 200 frames. Furthermore the hit-rate values may have limits similar to those described with respect to the hit-ratio. For example no sector is allowed to have more than 80% of the total rate value, and no sector can have below 2% of the total rate value.
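
A simple clamp-and-renormalize pass could enforce such limits (a sketch under the 2%/80% example bounds; a single pass is used for brevity, although strict bounds may require iterating):

    def clamp_hit_rates(rates, lo=0.02, hi=0.80):
        """Limit each sector's share of the total hit-rate to [lo, hi]
        and renormalize so the shares again sum to 1.0."""
        total = sum(rates) or 1.0          # guard against an all-zero state
        shares = [r / total for r in rates]
        shares = [min(max(s, lo), hi) for s in shares]
        norm = sum(shares)
        return [s / norm for s in shares]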

The energy update for the obtained Direction sector D can be of the form:

E_CNG(F̂, D̂) = (1 − α) · E_CNG(F̂, D̂) + α · E(F, D)

Furthermore the model updater may be configured to update the spatial CNG state Hit-rate (in other words the frequency or number of determinations that the sector or region is identified based on the directional component of the metadata), Energy ratio, and other suitable parameters for the current F band and for the direction sectors other than the obtained direction sector, as shown in Figure 4 by step 409. For example, for all other Directions D̂′ (not corresponding to D̂), the update can be:

E_CNG(F̂, D̂′) = (1 − β) · E_CNG(F̂, D̂′)

Parameters α and β control the influence of the latest temporal directional value on the long-term energy ratio tracking and the corresponding smearing of the values. Thus when no directional component is present, the long-term energy ratio slides toward a more diffuse state. In case the update is carried out independently 4 times per frame (i.e., once for each 5-ms subframe), α may be, e.g., 0.05 and β may be, e.g., 0.01. In the earlier embodiments, since only energy ratios are considered, E_CNG is limited between 0.0 and 1.0; the value E_CNG is thus itself an energy ratio.
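
In code, the two updates above amount to exponential smoothing of the per-sector energy ratios (a sketch; E_cng is indexed by the mapped band F̂ and sector D̂, and the variable names are illustrative):

    def update_energy_ratios(E_cng, band, hit, ratio,
                             alpha=0.05, beta=0.01):
        """Per-subframe update of the long-term energy ratio state.
        E_cng[band][sector] is the spatial CNG state; 'hit' is the
        sector obtained for this TF tile and 'ratio' the decoded
        Direct-to-total-energy ratio E(F, D)."""
        for sector in range(len(E_cng[band])):
            if sector == hit:
                E_cng[band][sector] = ((1.0 - alpha) * E_cng[band][sector]
                                       + alpha * ratio)
            else:
                # sectors that were not hit decay toward a diffuse state
                E_cng[band][sector] *= (1.0 - beta)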

The method then loops for all of the remaining frequencies of the time slot as shown by the arrow from step 409 to step 405.

Furthermore when all of the remaining frequencies have been processed the method then loops to the next time slot as shown by the arrow from step 409 to step 403 and the next time slot T is then analysed and the parameters updated.

Once all of the parameters have been updated then these values may then be employed.

With respect to Figure 5 is shown a further example method for the operation of the spatial comfort noise determiner 307 as shown in Figure 3. In this example the metadata updater for the spatial CNG model is configured to consider both the decoded audio waveform and the spatial metadata. Thus, it can be said that in such embodiments the spatial audio evolution is being tracked.

The main difference between the spatial comfort noise determiner 307 shown in Figure 3 and the spatial comfort noise determiner 207 shown in Figure 2 is that at least one audio waveform is also considered when updating the CNG state. For example, a spatial audio signal (MASA) transport may be based on 1 or 2 audio waveforms corresponding to mono and stereo (or binaural) audio. In some cases, there could be even more component signals, e.g., if an Ambisonics based transport is utilized.

Thus the decoded waveform(s) for the current frame corresponding to the spatial signal are obtained as shown in Figure 5 by step 501.

Then the time-frequency tile signal energies for the current frame are obtained as shown in Figure 5 by step 503.

The obtaining of the spatial metadata for the current frame is then shown in Figure 5 by step 505.

In some embodiments the CNG state (the spatial model) is then updated successively for each subframe T. Alternatively, in some embodiments, the spatial CNG model can be updated once per frame based on a (weighted) average of the subframe metadata values. For each TF tile (in the subframe T), the spatial CNG model updater 321 is configured to obtain at least one Direction and at least one corresponding Energy ratio parameter, or any other parameter (such as spread coherence) used in the particular spatial audio representation. For example, the Energy ratio parameter can be a Direct-to-total-energy ratio describing the amount of the directional energy relative to the total energy in the TF tile. The reading/determining of the time-frequency tile direction, energy ratio, etc. parameters (for the current time slot T) is shown in Figure 5 by step 507.

The next operation is determining which direction sector is hit for this frequency F, in other words obtaining the direction based on the metadata. The operation of determining which direction sector is being hit is shown in Figure 5 by step 509.

The next operation is one of calculating a time-frequency tile directional energy based on the time-frequency tile signal energy and the Energy ratio, as shown in Figure 5 by step 511. As such the audio waveform(s) can be considered by calculating the Directional energy instead of relying solely on the Energy ratio. The Directional energy of a TF tile may be obtained by a filtering operation giving the signal corresponding to the TF tile (e.g., this could be generated using the Complex Low Delay Filter Bank (CLDFB) from EVS). The energy of this signal is calculated and then multiplied by the corresponding Energy ratio value (e.g., the Direct-to-total-energy ratio).
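
The calculation can be sketched as follows (an ordinary windowed FFT stands in here for the EVS CLDFB, which is an assumption made purely for illustration):

    import numpy as np

    def tile_directional_energy(frame, band_bins, direct_to_total):
        """Energy of the filter-bank signal for one TF tile, multiplied
        by the tile's Direct-to-total-energy ratio."""
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        tile_energy = float(np.sum(np.abs(spectrum[band_bins]) ** 2))
        return tile_energy * direct_to_total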

The next operation is one of updating spatial CNG state Hit-rate, Energy ratio, and other parameters for current F band and Direction Sector as shown in Figure 5 by step 513. As discussed above the hit-ratio may for example be the ratio of the number of times a sector is identified based on the directional metadata against all sectors summed for a particular frequency. The hit-ratios may furthermore in some embodiments be limited. The hit-rate may be the hit-ratio over some suitable time window or period and may have some limits similar to those described with respect to the hit-ratio.

The energy update for the obtained Direction sector D can be of a form similar to the example equations above, but using the absolute Directional energy instead of the energy ratios. For the energy ratios and other parameters a sliding average like those shown above with α and β may be implemented.

The spatial CNG model parameters are thus not limited between 0.0 and 1.0.

However, in such embodiments, when the spatial model is applied to waveform signals during DTX processing, the energy ratios need to be provided as well. At that stage, this can be achieved by normalizing the values (between 0.0 and 1.0). Alternatively, in some embodiments the energy can be calculated from the core audio CNG output as shown in Figure 3.
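
One plausible normalization simply divides each sector's tracked directional energy by the total over the sectors of the band (a sketch; the exact normalization used is an assumption):

    def energies_to_ratios(sector_energies, eps=1e-12):
        """Map tracked per-sector directional energies for one band
        back to 0.0..1.0 energy ratios for use during DTX."""
        total = sum(sector_energies) + eps  # eps avoids divide-by-zero
        return [e / total for e in sector_energies]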

The spatial CNG state Hit-rate, Energy ratio, and other suitable parameters are likewise updated for the current F band and for the direction sectors other than the obtained direction sector, as shown in Figure 5 by step 513.

The method then loops for all of the remaining frequencies of the time slot as shown by the arrow from step 513 to step 509.

Furthermore when all of the remaining frequencies have been processed the method then loops to the next time slot as shown by the arrow from step 513 to step 507 and the next time slot T is then analysed and the parameters updated.

Once all of the parameters have been updated then these values may then be employed.

In some embodiments a local decoder side "fast VAD" algorithm can be utilised in conjunction with the spatial CNG model metadata updater functionality. Specifically, the local "fast VAD" is performed on the decoder side to detect when the signal and full metadata are being received during short background noise segments. As there is some amount of full bitrate coding of the background noise even when DTX functionality is enabled, there is furthermore a so-called hang-over time that is needed to make the codec more robust against unwanted attenuation of the active signal.

The hang-over time is a term used in DTX functions to describe the "safety" period applied to avoid an overly "aggressive" DTX. Thus, for example, after the speech energy attenuates below the noise floor or some energy threshold at the end of a spoken sentence, a few frames are still sent with full speech coding before entering a DTX period. These few frames with full speech coding but with 'no speech' can be exploited to estimate a "pure" background noise profile at the decoder during these times.

The encoder side VAD can thus be said to be "slow". Based on the decoder side "fast VAD" decision the background noise spatial properties can be tracked more accurately than without it. Parameters α and β can in such 'fast VAD' embodiments be made larger when the "fast VAD" detects that the decoder and/or renderer is receiving mostly background noise.
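
Such an adaptation could be as simple as scaling the smoothing parameters while the fast VAD flags background noise (a sketch; the scale factor is an assumption for illustration):

    def adapt_smoothing(fast_vad_says_noise, alpha=0.05, beta=0.01,
                        scale=4.0):
        """Track the background faster while mostly noise is received."""
        if fast_vad_says_noise:
            return min(1.0, alpha * scale), min(1.0, beta * scale)
        return alpha, beta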

In some embodiments a background statistical spatial CNG model may comprise a number (2..N) of spatial sectors (the number is more than 1 only because a value of 1 implies that the sound always comes from all directions). For each spatial sector there are 1..M frequency ranges that may or may not correspond to IVAS MASA frequency ranges. For example if N=8 and M=5, there may be 5 frequency ranges (e.g. 0–400 Hz, 400–1600 Hz, 1600–3200 Hz, 3200–6400 Hz, and 6400–24000 Hz) and 8 different direction sectors (e.g. the sector middle points could point to: centre (0,0), left (-30,0), right (+30,0), side left (-90,0), side right (+90,0), back left (-135,0), back right (+135,0), and up (0,90)). In the methods described herein the down direction is not usually relevant, but e.g. HVAC noise may come from above. Thus, in some embodiments there may be a total of 40 "Direction-Frequency sectors" to update.
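
The example grid can be written out directly (a sketch; the per-sector record fields are illustrative assumptions):

    # 5 frequency ranges x 8 direction sectors = 40 sectors to update
    BAND_EDGES_HZ = [0, 400, 1600, 3200, 6400, 24000]
    SECTOR_MIDPOINTS = [(0, 0), (-30, 0), (30, 0), (-90, 0), (90, 0),
                        (-135, 0), (135, 0), (0, 90)]  # (azimuth, elevation)

    def band_index(freq_hz):
        """Map a frequency in Hz to one of the model's frequency ranges."""
        for i in range(len(BAND_EDGES_HZ) - 1):
            if BAND_EDGES_HZ[i] <= freq_hz < BAND_EDGES_HZ[i + 1]:
                return i
        return len(BAND_EDGES_HZ) - 2  # clamp to the topmost range

    # one running-statistics record per Direction-Frequency sector
    model = [[{"hit_rate": 0.0, "direction": mid, "ratio": 0.0,
               "coherence": 0.0, "distance": 0.0}
              for mid in SECTOR_MIDPOINTS]
             for _ in range(len(BAND_EDGES_HZ) - 1)]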

Whenever MASA spatial information is received by the decoder, the direction for the frequency may be decoded as described in the embodiments shown in Figures 2 and 4, and it is checked which "Direction-Frequency sector" to update. Additionally, energy can be calculated (as shown in the embodiments of Figures 3 and 5), and VAD can be calculated in a manner described herein. For each sector and frequency the averaged MASA spatial parameters such as Direction, Direct-to-total energy ratio, Spread coherence and distance are collected with time-adaptive α and β parameters, which may be adaptive according to the embodiments described herein. Also, derived information such as Hit-rate or hit-ratio (i.e. how many times each sector is being "hit") can be collected and updated for each "Direction-Frequency sector". As explained earlier, depending on which of the one or more embodiments produces the better results, a more accurate background noise statistical spatial model can be established by the decoder.

In addition to mean values for each CNG model parameter, variance and higher-order statistics can also be collected and updated.
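
Variance can be tracked online alongside the mean, for example with Welford's algorithm (a sketch; this is one standard way to collect such statistics, not necessarily the one used):

    def welford_update(state, x):
        """Online update of (count, mean, M2) for one model parameter;
        the variance is M2 / (count - 1) once count > 1."""
        n, mean, m2 = state
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        return n, mean, m2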

When DTX is on and no spatial MASA updates are coming from the channel, the decoder has to generate MASA/DirAC spatial parameters for the CNG frames. The MASA parameters are generated based on the statistical spatial CNG model parameters collected earlier and stored by the simple statistical spatial CNG model. The stored spatial model parameters and derived information such as hit-rate or hit-ratio are then used as weights in some embodiments to randomly select which direction-sector statistic to use for each TF tile. Then the averaged parameters from the selected sector are used to generate Direction, Direct-to-total energy ratio, Spread coherence and distance parameters. The same method is used to generate the MASA coefficients for all TF tiles. Thus, each individual generated CNG MASA frame of spatial parameters has somewhat randomized values (i.e. they are not the same for consecutive frames or subframes), but their overall distribution matches that of the earlier collected simple statistical CNG model properties.
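
A sketch of that generation step, assuming per-sector records like those above with a stored hit-rate and averaged parameters:

    import random

    def generate_cng_tile(band_model, rng=random):
        """Pick one direction sector for a TF tile with the hit-rates
        as weights, then emit that sector's averaged parameters; the
        per-tile randomness keeps consecutive CNG frames from repeating
        while matching the collected distribution."""
        weights = [s["hit_rate"] for s in band_model]
        idx = rng.choices(range(len(band_model)), weights=weights, k=1)[0]
        s = band_model[idx]
        return {"direction": s["direction"],
                "direct_to_total": s["ratio"],
                "spread_coherence": s["coherence"],
                "distance": s["distance"]}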

With respect to Figure 6 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.

In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.

In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.

In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.

In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IrDA).

The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code.

In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.

The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.

Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.