SYSTEM AND METHOD FOR ENHANCEMENT OF A DEGRADED AUDIO SIGNAL

Title:

SYSTEM AND METHOD FOR ENHANCEMENT OF A DEGRADED AUDIO SIGNAL

Document Type and Number:

WIPO Patent Application WO/2021/022079

Kind Code:

Abstract:

The present disclosure relates to the field of audio enhancement, and in particular to methods, devices and software for supervised training of a machine learning model, MLM, the MLM trained to enhance a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal. The present disclosure further relates to methods, devices and software for use of such a trained MLM.

More Like This:

WO/2019/083130	ELECTRONIC DEVICE AND CONTROL METHOD THEREFOR
WO/2018/227169	OPTIMAL HUMAN-MACHINE CONVERSATIONS USING EMOTION-ENHANCED NATURAL SPEECH
JP2024033382	Instrument identification method, instrument identification device, and instrument identification program

Inventors:

DAI JIA (US)
LI KAI (US)
CARTWRIGHT RICHARD J (US)

Application Number:

PCT/US2020/044324

Publication Date:

February 04, 2021

Filing Date:

July 30, 2020

Export Citation:

Click for automatic bibliography generation Help

Assignee:

DOLBY LABORATORIES LICENSING CORP (US)

International Classes:

G10L25/30; G10L21/0232

Foreign References:

US20160111107A1	2016-04-21
USPP62676095P

Other References:

HOTHDANIEL, THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, vol. 12, 1941, pages 499, Retrieved from the Internet

Attorney, Agent or Firm:

MA, Xin et al. (US)

Download PDF:

View/Download PDF PDF Help

Claims:

CLAIMS

1 . A method for supervised training of a machine learning model, MLM, to enhance a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal, the method comprising the steps of:

receiving a degraded audio signal and a clean audio signal for training of the MLM; extracting a first set of features from the received degraded audio signal, and a second set of features from the received clean audio signal, each feature corresponding to a frequency band of the respective received audio signals;

comparing each feature of the first set of features to a corresponding feature of the second set of features to derive a set of gains, each gain corresponding to a respective feature among the first set of features, and used as ground truth when training the MLM; using the first set of features and the derived set of gains as a training set for training the MLM;

wherein the method further comprises at least one of:

a pre-processing step performed prior to deriving the set of gains, wherein the pre-processing step comprises adjusting the frequency energy distribution of the first and/or the second set of features such that the frequency energy distribution of the first set of features is more similar to the frequency energy distribution of the second set of features, and

defining a loss function of the MLM which is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain.

2. A method according to claim 1 , wherein only one of the pre-processing step and the step of defining a loss function of the MLM is used.

3. A method according to claim 1 , wherein both the pre-processing step and the step of defining a loss function of the MLM are used.

4. A method according to any one of claims 1 -3, wherein the loss function is further weighted according to the frequency band of the features of the training set, such that an error for a feature corresponding to a relatively higher frequency band is weighted with a relatively higher weight.

5. A method according to claim 4, wherein an error for a feature corresponding to a frequency band exceeding 6kHz is weighted with a higher weight compared to an error for a feature corresponding to a frequency band below 6kHz.

6. A method according to any one of claims 1 -5, wherein the first and second sets of features are extracted by converting the received degraded audio signal and clean audio signal into the frequency domain.

7. A method according to claim 6, wherein the conversion is performed using one from the list of: a short time Fourier transform, SFTF, a modified discrete cosine transform,

MDCT, and a shifted discrete frequency transform, MDXT.

8. A method according to claim 7, wherein the first and second set of features are extracted by, for each frequency band of a plurality of frequency bands,

for frequency bins of the frequency band, combining complex features of the frequency domain representation of the respective audio signal corresponding to the frequency bins into a single feature corresponding to that frequency band.

9. A method according to claim 8, wherein the features of the first and second set of features corresponds to Mel-frequency band powers, Bark Scale band powers, log- frequency band powers or ERB band powers.

10. A method according to any one of claims 1 -9, wherein the step of pre-processing comprises balancing a frequency energy distribution of the second set of features across the entire frequency band of the received clean audio signal.

1 1 . A method according to claim 10, wherein the pre-processing comprises:

fitting a polynomial curve to the second set of features,

defining a filter based on a difference between the polynomial curve and a constant function,

applying the filter to the second set of features.

12. A method according to claim 10, wherein the pre-processing comprises: fitting a polynomial curve to the second set of features,

calculating a difference between a minimum value and a maximum value of the polynomial curve,

upon determining that the difference exceeds a threshold value:

defining a filter based on the difference between the polynomial curve and a constant function,

applying the filter to the second set of features.

13. A method according to claim 12, wherein the threshold value corresponds to a 3 dB difference in a frequency energy distribution of the second set of features across the entire frequency band of the received clean audio signal.

14. A method according to any one of claims 1 1 -13, wherein a value of the constant function is set to the maximum value of the polynomial curve.

15. A method according to any one of claims 1 1 -14, wherein the polynomial curve is one from the list of: a linear curve, a quadratic curve and a cubic curve.

16. A method according to any one of claims 1 -15, wherein the loss function is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain by:

multiplying a distance measurement between the predicted gain and the ground truth with a weight, the weight being relatively higher when:

the predicted gain is lower than the ground truth gain, and

the predicted gain is negative,

the weight being relatively lower when:

the predicted gain is higher than or equal to the ground truth gain, or the predicted gain is positive.

17. A method according to claim 16, wherein the ratio between the relatively higher weight and the relatively lower weight is between 3-7.

18. A method according to claim 17, wherein the ratio between the relatively higher weight and the relatively lower weight is 5.

19. A method according to any one of claims 1 -18, wherein the first and second sets of features are extracted by:

converting the received degraded audio signal and clean audio signal into the frequency domain,

for each frequency band, j, of a plurality of frequency bands

combining frequency components of the frequency domain representation of the degraded audio signal into a feature, f1 ,j, corresponding to the frequency band, and adding log(f1 ,j) to the first set of features;

combining frequency components of the frequency domain representation of the clean audio signal into a feature, f2,j, corresponding to the frequency band and adding log(f2,j) to the to the second set of features.

20. The method of claim 19, wherein the steps of combining frequency components of the frequency domain representation of the degraded audio signal into a feature, f1 ,j, comprises weighting the frequency components with different weights, wherein the steps of combining frequency components of the frequency domain representation of the clean audio signal into a feature, f2,j, comprises weighting the frequency components with said different weights.

21 . The method of any one of claims 19-20, wherein the plurality of frequency bands are equally spaced in Mel frequency.

22. The method of claim 21 , wherein the first and second set of features are extracted by combining extracted features from a plurality of audio frames of the repective audio signals.

23. The method of claim 22, wherein the extracted first and second set of features are further normalized prior to being used for deriving the set of gains.

24. The method of any one of claims 1 -23, further comprising adding artificial pairs of features to the first and second sets of features, wherein an artificial pair of feature comprises a first feature added to the first set of features and a second feature added to the second set of features, the first and second feature having a same value and corresponding to a same frequency band.

25. The method of any one of claims 1 -24, further comprising the step of, before comparing each feature of the first set of features to a corresponding feature of the second set of features to derive a set of gains, adding noise to the first set of features.

26. The method of claim 25, wherein the noise is added only for a first threshold number of epochs when training the MLM.

27. The method of any one of claims 1 -24, further comprising the step of, before comparing each feature of the first set of features to a corresponding feature of the second set of features to derive a set of gains, adjusting the first and/or the second set of features, wherein the adjustment comprises using distinct adjustment parameters during each training pass, epoch and/or minibatch of a training loop of the MLM.

28. The method of claim 27, wherein the adjustment parameters are drawn from a plurality of probability distributions.

29. The method of any one of claims 27-28, wherein the adjusting of the first set of features comprises at least one from the list of: adding fixed spectrum stationary noise, adding variable spectrum stationary noise, adding reverberation, adding non-stationary noise, adding simulated echo residuals, simulating microphone equalization, simulating microphone cutoff, and varying broadband level.

30. A method according to any one of claims 1 -29, wherein the received degraded audio signal is generated from the received clean audio signal.

31 . A method according to claim 30, wherein generation of the degraded audio signal comprises applying at least one codec to the clean audio signal.

32. A method according to claim 31 , wherein the at least one codec comprises a voice codec.

33. A method according to any one of claims 30-32, wherein generation of the degraded audio signal comprises applying an Intermediate Reference System, IRS, filter to the clean audio signal.

34. A method according to any one of claims 30-33, wherein generation of the degraded audio signal comprises applying a low pass filter to the clean audio signal.

35. A method according to any one of claims 30-34, wherein generation of the degraded audio signal comprises convolving a generated degraded audio signal with a narrow band impulse response.

36. A method according to any one any one of claims 1 -35, wherein the MLM is one from a list of: an artificial neural network, a decision tree, a support vector machine, a mixture model, and a Bayesian network.

37. A method for enhancing a degraded audio signal, comprising the steps of:

receiving a degraded audio signal;

extracting a first set of features from the received degraded audio signal;

inputting the extracted first set of features to a machine learning model, MLM, trained according to any one of claims 1 -36; and

using output gains from the MLM for enhancing the received degraded audio signal.

38. A method according to claim 37, further comprising the step of post-processing the output gains before using the gains for reducing coding artefacts of the received degraded audio signal.

39. A method according to claim 38, wherein the post-processing comprises at least one of:

limiting a range of the output gains to a predefined range,

limiting a difference between a gain for a frequency band of an audio frame of the received degraded audio signal and a gain for the frequency band of a previous audio frame of the received degraded audio signal, and limiting a difference between a gain for a frequency band of an audio frame of the received degraded audio signal and a gain for a neighbouring frequency band of the audio frame or another audio frame of the received degraded audio signal.

40. A method according to any one of claims 37-39, wherein the degraded audio signal is a public switched telephone network, PSTN, call, wherein the steps of extracting a first set of features and inputting the extracted first set of features to the trained MLM is performed for at least one audio frame of the PSTN call.

41 . A method according to any one of claims 37-40, implemented in an end point of an audio conference system for enhancing incoming audio signals.

42. A method according to any one of claims 37-41 , implemented in a server of an audio conference system for enhancing incoming audio signals before being transmitted to an end point.

43. A device configured for supervised training of a machine learning model, MLM, to enhance a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal, the device comprising circuity configure to:

compare each feature of the first set of features to a corresponding feature of the second set of features to derive a set of gains, each gain corresponding to a respective feature among the first set of features, and used as ground truth when training the MLM; use the first set of features and the derived set of gains as a training set for training the MLM;

wherein the circuity is further configured for at least one of:

prior to deriving the set of gains, performing pre-processing comprising adjusting the frequency energy distribution of the first and/or the second set of features such that the frequency energy distribution of the first set of features is more similar to the frequency energy distribution of the second set of features, and defining a loss function of the MLM configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain. 44. A device configured for enhancing a degraded audio signal, the device comprising circuity configure to:

receive a degraded audio signal;

extract a first set of features from the received degraded audio signal;

input the extracted first set of features to a machine learning model, MLM, trained according to any one of claims 1 -36; and

use output gains from the MLM for enhancing the received degraded audio signal.

45. A computer program product comprising a non-transitory computer-readable storage medium with instructions adapted to carry out the method of any one of claims 1 -42 when executed by a device having processing capability.

Description:

SYSTEM AND METHOD FOR ENHANCEMENT OF A DEGRADED AUDIO

SIGNAL

Cross-reference to related applications

This application claims priority to PCT Patent Application No. PCT/CN2019/098896, filed August 1 , 2019, United States Provisional Patent Application No. 62/889,748, filed August 21 , 2019 and European Patent Application No. 1921 1731 .5, filed November 27,

2019, each of which is hereby incorporated by reference in its entirety.

Technical field

Background

An audio signal may be submitted to a variety of compression, transcoding and processing steps before being listened to. This may result in a reduced listening experience for a user, where the audio quality of the played audio signal is not satisfactory. For example, a telephone conference service provider may find that there are significant degradations of audio quality before the audio signal is received by the telephone conference service. For example, a mobile phone conversation may often have GSM encoded voice which is transcoded to G.71 1 before being received by the telephone conference service provider.

The audio signal may thus be referred to as a degraded audio signal and

enhancement of such a signal may advantageously be performed to reduce codec artefacts and improve the listening experience.

There are three main challenges for enhancing a degraded audio signal discussed herein. The first difficulty is that various encoding/trans-coding may be applied to an audio signal before being received to be enhanced, which often are unknown for the enhancement system. Consequently, an algorithm used for enhancement is expected to handle various codec chains. Another problem is that besides distortion resulting from the encoding/ transcoding, there is typically noise and reverberation in the degraded audio signal. The third difficulty is that, since the algorithm may be implemented at the endpoints, and/or be required to handle enhancement in real time, the complexity of the algorithm may be an issue and is advantageously kept low.

There is thus a need for improvements in this context.

Summary of the invention

In view of the above, it is thus an object of the present invention to overcome or mitigate at least some of the problems discussed above. In particular, it is an object of the present disclosure to provide a low-complexity method for enhancing a degraded audio signal, wherein the method is robust for the cause of the distortion in the degraded audio signal. Further and/or alternative objects of the present invention will be clear for a reader of this disclosure.

According to a first aspect of the invention, there is provided a method for supervised training of a machine learning model, MLM, the MLM trained to enhance a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal. The method comprises the steps of:

In the present method, over suppression in the trained MLM is reduced by at least one of:

- a pre-processing step performed prior to deriving the set of gains, wherein the pre-processing step comprises adjusting the frequency energy distribution of the first and/or the second set of features such that the frequency energy distribution of the first set of features is substantially equal to the frequency energy distribution of the second set of features, and

defining a loss function of the MLM which is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain. By the term“over suppression” should, in the context of present specification, be understood that when enhancing the degraded audio signal (e.g. reducing transcoding artifacts or removing noise etc.), some frequency bands of the degraded audio signal may be attenuated rather than amplified or attenuated to a higher degree than what is required. This should advantageously be avoided to improve the listening experience of the enhanced audio signal.

By the term“clean audio signal” should, in the context of present specification, be understood an audio signal without or with little defects that degrades the audio quality. The clean audio signal may be recorded using a high-end studio, or otherwise recorded to have a high quality.

By the term“degraded audio signal” should, in the context of present specification, be understood an audio signal having artefacts such as coding artefacts (due to e.g.

compression), noise, reverberations, etc., that negatively influences the audio quality.

The inventors have realized that the different causes for a degraded audio quality result in that a traditional signal processing method may not be suitable for modelling the degradation. In order to make the algorithm for enhancement of the degraded audio signal robust to different causes of distortion and improve the perceptual quality, a machine learning model, MLM, is implemented and trained as defined herein. The MLM is trained by receiving a clean audio signal (with no/little distortion), and a corresponding degraded audio signal (having distortion). From these audio signals, a first set of features is extracted from the received degraded audio signal, and a second set of features is extracted from the received clean audio signal. Each feature corresponds to a frequency band of the respective received audio signals. Gains for the plurality of frequency bands are derived by comparing each feature of the first set of features to a corresponding feature of the second set. The first set of features and the gains are input to the MLM, and used for training the MLM.

The gains are thus used for reference, or as ground truth. Advantageously, by using the first set of features and the set of gains as input/output when training the MLM, instead of using Pulse code modulation, PCM, values of the degraded audio signal and the clean audio signal as input/output, the risk for unexpected errors in the enhancement process is reduced. Using gains and the first set of features as described herein facilitates a robustly trained MLM.

Over suppression (in audio of one or more of the frequency bands) in the enhanced signal is avoided by implementing at least one of: 1 ) a pre-processing method to minimize the difference of frequency energy distribution of clean audio and the degraded audio e.g. to solve a timbre quality issue (over suppression in high/low frequency part) or other types of over suppression issues;

2) a loss function of the MLM with over suppression punishment, i.e. a loss function that punish over suppression more.

To this end, the method facilitates reduction of over suppression in the trained MLM by at least one of:

a pre-processing step performed prior to deriving the set of gains, wherein the preprocessing step comprises adjusting the frequency energy distribution of the first and/or the second set of features such that the frequency energy distribution of the first set of features is substantially equal to the frequency energy distribution of the second set of features, and defining a loss function of the MLM that is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain.

Frequency energy distribution of the clean audio signal and the degraded audio signal often differs. If the distribution differs, this may lead to over suppression. For example, if energy tends to decrease from low frequency to high frequency in the clean audio signal but for the degraded audio signal, frequency energy distribution is more balanced (not decreasing as much as the clean audio signal), this may lead to over suppression in high frequencies. By employing the pre-processing step described herein, such over suppression may be avoided.

Using an MLM, which may be inherently difficult to control and manage in detail, the training may result in over suppression. To avoid this, a loss function may be defined that specifically is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain.

According to some embodiments, over suppression is reduced using only one of the pre-processing step and the defined loss function of the MLM. In other embodiments, both the pre-processing step and the defined loss function of the MLM is employed. An advantage of the present method is the flexibility and that over suppression can be handled differently depending on the context and e.g. the available computational resources, available training data etc.

According to some embodiments, the loss function is further weighted according to the frequency band of the features of the training set, such that an error for a feature corresponding to a relatively higher frequency band is weighted with a relatively higher weight. Distortion due to codecs may be more likely to happen in high frequencies, which may make it more important to avoid over suppression in such frequency bands. For example, an error for a feature corresponding to a frequency band exceeding 6 kHz is weighted with a higher weight compared to an error for a feature corresponding to a frequency band below 6 kHz. Other threshold frequencies may be employed depending on the context. In some embodiments, errors for features corresponding to frequency band(s) between two threshold frequencies, or above or below a threshold frequency are weighted with a relatively higher weight, based on a perceptual importance according to a

psychoacoustic model.

According to some embodiments, the first and second sets of features are extracted by converting the received degraded audio signal and clean audio signal into the frequency domain. For example, the conversion may be performed using one from the list of: a short time Fourier transform, SFTF, a modified discrete cosine transform, MDCT, and a shifted discrete frequency transform, MDXT.

To reduce the computational complexity, and/or to improve quality, the complex features resulting from the conversion to the frequency band (e.g. DCT components) may be banded (combined within a frequency band). To this end, the first and second set of features may be extracted by, for each frequency band of a plurality of frequency bands, for frequency bins of the frequency band, combining complex features of the frequency domain representation of the respective audio signal corresponding to the frequency bins into a single feature corresponding to that frequency band.

In some embodiments, the features of the first and second set of features corresponds to Mel-frequency band powers, Bark Scale band powers, log-frequency band powers or ERB band powers.

Put differently, according to some embodiments, the first and second sets of features are extracted by:

converting the received degraded audio signal and clean audio signal into the frequency domain,

for each frequency band, j, of a plurality of frequency bands

combining frequency components of the frequency domain representation of the degraded audio signal into a feature, fi _j, corresponding to the frequency band, and adding log(fi _j) to the first set of features; combining frequency components of the frequency domain representation of the clean audio signal into a feature, f _2j, corresponding to the frequency band and adding log(f _2j) to the to the second set of features.

In some embodiments, the steps of combining frequency components of the frequency domain representation of the degraded audio signal into a feature, fi _j, comprises weighting the frequency components with different weights.

According to some embodiments, the plurality of frequency bands are equally spaced in Mel frequency. Consequently, the extracted features may advantagously approximating the human auditory system's response more closely than if a linearly-spaced frequency bands are used.

According to some embodiments, the first and second set of features are extracted by combining extracted features from a plurality of audio frames of the repective audio signals. Advantagously, the MLM may get more input data to work with.

According to some embodiments, the extracted first and second set of features are further normalized prior to being used for deriving the set of gains. Advantageously, the trained MLM may be less sensitive to to differences in speech level and equalisation that arise from different microphones in different acoustic scenarios.

According to some embodiments, the step of pre-processing comprises balancing a frequency energy distribution of the second set of features to be substantially equally distributed across the entire frequency band of the received clean audio signal. In some embodiments, in particular in the context of audio conference services, frequency energy distribution of a degraded audio signal may be more balanced or not decrease as much as a clean audio signal typically does. Consequently, the frequency energy distribution of the second set of features may be balanced to avoid over suppression. Balancing the frequency energy distribution may be less computationally complex compared to adjusting the frequency energy distribution of the first set of features to be more similar to the frequency energy shape of the clean audio signal.

According to some embodiments, the pre-processing comprises: fitting a polynomial curve to the second set of features, defining a filter based on a difference between the polynomial curve and a constant function, and applying the filter to the second set of features. Advantageously, this is a low complexity embodiment for adjusting the frequency energy distribution such that the frequency energy distribution of the first set of features is substantially equal to the frequency energy distribution of the second set of features. According to some embodiments, adjustment of the frequency energy distribution of the second set of features is only done if the shape of the frequency energy distribution of the second set of features fulfils certain requirements. Consequently, unnecessary adjustments are avoided, and computational resources may be saved. To this end, in these embodiments, the pre-processing comprises:

fitting a polynomial curve to the second set of features,

calculating a difference between a minimum value and a maximum value of the polynomial curve,

upon determining that the difference exceeds a threshold value: defining a filter based on the difference between the polynomial curve and a constant function and applying the filter to the second set of features. The threshold value may correspond to a 3 dB difference in a frequency energy distribution of the second set of features across the entire frequency band of the received clean audio signal.

According to some embodiments, the value of the constant function is set to the maximum value of the polynomial curve. In other embodiments, the mean value is chosen.

According to some embodiments, the polynomial curve is one from the list of: a linear curve, a quadratic curve and a cubic curve.

In some embodiments, the loss function of the MLM is used to avoid over suppression. Accordingly, in some embodiments the loss function is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain by multiplying a distance measurement between the predicted gain and the ground truth with a weight, the weight being relatively higher when: the predicted gain is lower than the ground truth gain, and the predicted gain is negative,

the weight being relatively lower when the predicted gain is higher than or equal to the ground truth gain, or the predicted gain is positive.

In some embodiments, the ratio between the relatively higher weight and the relatively lower weight is between 3-7. In some embodiments, the ratio between the relatively higher weight and the relatively lower weight is 5.

To further improve the robustness of the training of the MLM, according to some embodiments, the method further comprises adding artificial pairs of features to the first and second sets of features, wherein an artificial pair of features comprises a first feature added to the first set of features and a second feature added to the second set of features, the first and second features having a same value and corresponding to a same frequency band. To further improve the robustness of the training of the MLM, according to some embodiments, noise is added to the first set of features. The noise may be added only for a first threshold number of epochs when training the MLM. Consequently, a same pair of a degraded audio signal and a corresponding clean audio signal may result in slightly different gains for a same frequency band throughout the training of the MLM, to facilitate a robust MLM, with a reduced number of audio signals used for training.

According to some embodiments, the received degraded audio signal is generated from the received clean audio signal. Consequently, a same clean audio signal may be used for producing a plurality of degraded audio signals, simulating different transcoding chains. A reduced number of clean audio signals may thus be required to train the MLM.

According to some embodiments, generation of the degraded audio signal comprises applying at least one codec to the clean audio signal.

According to some embodiments, the at least one codec comprises a voice codec.

An MLM trained for a teleconferencing system may thus advantageously be trained.

According to some embodiments, the method further comprising the step of, before comparing each feature of the first set of features to a corresponding feature of the second set of features to derive a set of gains, adjusting the first and/or the second set of features, wherein the adjustment comprises using distinct adjustment parameters during each training pass, epoch and/or minibatch of a training loop of the MLM. Advantageously, robustness of the trained MLM may be increased, and overfitting problems of the training process of the MLM may be avoided or reduced.

According to some embodiments, the adjustment parameters are drawn from a plurality of probability distributions. Advantageously, the robustness may be further increased.

According to some embodiments, the adjusting of the first set of features comprises at least one from the list of: adding fixed spectrum stationary noise, adding variable spectrum stationary noise, adding reverberation, adding non-stationary noise, adding simulated echo residuals, simulating microphone equalization, simulating microphone cutoff, and varying broadband level.

According to some embodiments, generation of the degraded audio signal comprises applying an Intermediate Reference System, IRS, filter to the clean audio signal.

According to some embodiments, generation of the degraded audio signal comprises applying a low pass filter to the clean audio signal. According to some embodiments, generation of the degraded audio signal comprises convolving a generated degraded audio signal with a narrow band impulse response.

Reverbation in the degraded audio signal may thus advantagously be simulated.

According to some embodiments, the MLM is one from a list of: an artificial neural network, ANN, a decision tree, a support vector machine, a mixture model, and a Bayesian network. The ANN may be a deep neural network, DNN, a shallow neural network, a convolutional neural network, CNN, etc., The mixture model may be a Gaussian Mixture model. The Bayesian network may be a Hidden Markov Model, HMM.

In a second aspect of the invention, there is provided a device configured for supervised training of a machine learning model, MLM, the MLM being trained to reduce codec artefacts in a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal, the device comprising circuity configured to perform the method according to any embodiments of the first aspect.

In a third aspect of the invention, there is provided a computer program product comprising a non-transitory computer-readable storage medium with instructions adapted to carry out the method of the first aspect when executed by a device having processing capability.

The second and third aspect may generally have the same features and advantages as the first aspect.

According to a fourth aspect of the invention, there is provided a method for enhancing a degraded audio signal, comprising the steps of:

receiving a degraded audio signal;

extracting a first set of features from the received degraded audio signal;

inputting the extracted first set of features to a machine learning model, MLM, trained according to any embodiments of the first aspect; and

using output gains from the MLM for enhancing the received degraded audio signal.

The enhancement may comprise reducing coding artefacts of the received degraded audio signal.

The first set of features is advantageously extracted in a same way as the extraction of features from the degraded audio signal used in the training of the MLM, excluding any adding of noise.

According to some embodiments, the method further comprises the step of postprocessing the output gains before using the gains for reducing coding artefacts of the received degraded audio signal. The post-processing may advantageously facilitate the output gains being in a reasonable range.

For example, the post-processing comprises at least one of:

limiting a range of the output gains to a predefined range;

limiting a difference between a gain for a frequency band of an audio frame of the received degraded audio signal and a gain for a neighbouring frequency band of the audio frame or another audio frame of the received degraded audio signal.

According to some embodiments, the degraded audio signal is a public switched telephone network, PSTN, call, wherein the steps of extracting a first set of features and inputting the extracted first set of features to the trained MLM is performed for at least one audio frame of the PSTN call. According to some embodiments, each audio frame is used for producing gains. According to some embodiments, every N:th audio frame is used for producing gains. In these embodiments, the intermediate frames are enhanced using the gains from a previous audio frame for which the gains have been determined.

The device may enhance the degraded audio signal in real time, i.e. the degraded audio signal may be streamed to the device. In other embodiments, the device enhances a recorded audio signal received by the device.

According to some embodiments, the method is implemented in an end point of an audio conference system for enhancing incoming audio signals.

According to some embodiments, the method is implemented in a server of an audio conference system for enhancing incoming audio signals before being transmitted to an end point.

In a fifth aspect of the invention, there is provided a device configured for enhancing a degraded audio signal, the device comprising circuity configure to perform the method according to any embodiments of the fourth aspect.

In a sixth aspect of the invention, there is provided a computer program product comprising a non-transitory computer-readable storage medium with instructions adapted to carry out the method of the fourth aspect when executed by a device having processing capability.

It is further noted that the invention relates to all possible combinations of features unless explicitly stated otherwise. Brief Description of the Drawings

The above, as well as additional objects, features and advantages of the present invention, will be better understood through the following illustrative and non-limiting detailed description of preferred embodiments of the present invention, with reference to the appended drawings, where the same reference numerals will be used for similar elements, wherein:

Fig. 1 shows a method for supervised training of a machine learning model, MLM, according to some embodiments,

Fig. 2 shows a method for balancing a frequency energy distribution of a second set of features to be substantially equally distributed according to some embodiments,

Fig. 3 shows a device configured for supervised training of a machine learning model, MLM, according to some embodiments,

Fig. 4 shows a method for enhancing a degraded audio signal using the MLM trained as described in Fig. 1 ,

Fig. 5 shows a device configured for enhancing a degraded audio signal using the MLM trained as described in Fig. 1 ,

Fig. 6 shows a device for multi-style training of a machine learning model, MLM, according to some embodiments,

Fig. 7 is a diagram showing an example of fixed spectrum stationary noise addition (augmentation) according to some embodiments, and

Fig. 8 is a diagram showing an example of microphone equalization augmentation according to some embodiments.

Detailed description of embodiments

The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown. The systems and devices disclosed herein will be described during operation.

The present disclosure generally relates to the problem of enhancing an audio signal. As described above, a quality of an audio signal may be degraded, due to e.g. artefacts caused by encoding and/or transcoding of the audio signal, and due to noise added to the audio signal during recording and/or transmission of the audio signal. In the following, the degraded audio signal is sometimes exemplified as a public switched telephone network, PSTN, call. However, this is just by way of example and the methods and systems described herein may be employed for enhancing the quality of any other suitable type of audio signals, such as for example a voice over IP signal (VoIP), audio in streaming media, or an analogue or digital recording of audio.

As described herein, the enhancing of the degraded audio signal is facilitated by training a machine learning model, MLM. The MLM may be embodied by a one from the list of: an artificial neural network, ANN, a decision tree, a support vector machine, SVM, a mixture model, and a Bayesian network. The ANN may be a Deep Neural network, DNN, a Convolutional Neural network, CNN, a shallow neural network or any other suitable type of ANN. In the following, a DNN is used by way of example when describing the invention.

Figure 1 shows by way of example a method 100 for supervised training of a machine learning model, MLM, to enhance a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal. Different embodiments of the method will now be described in conjunction with Figs. 2 and 3.

The method 100 comprises receiving a degraded audio signal 308 and a clean audio signal 310 for training of the MLM. The degraded audio signal 308 and a clean audio signal 310 is thus received by a device 301 configured for training an MLM to enhance a degraded audio signal by calculating gains to be applied to frequency bands of the degraded audio signal. The device 301 comprises circuity, for example in the form of one or more processors, configured to receive the degraded audio signal 308 and the clean audio signal 310.

In one embodiment, the degraded audio signal 308 is generated from the clean audio signal 310 in a degraded audio creator unit 312. The degraded audio creator unit 312 may be part of a same device 300 as the enhancing device 301 , or may be a device separate from the enhancing device 301 and wired or wirelessly connected to the enhancing device 301 . The degraded audio creator may be implemented using one or more processors. The functionality of the degraded audio creator unit 312 will now be described.

The degraded audio creator unit 312 may be seen as embodying a plurality of simulated transcoding chains. The degraded audio creator unit 312 receives a clean audio signal 310 and outputs one or more degraded audio signals 308. Advantageously, one clean audio signal may result in a plurality of clean-degraded audio signal pairs, where the input clean audio signal 310 is part of each pair, and where the degraded audio signal 308 in each pair comprises different types of artefacts.

Each simulated transcoding chain in the degraded audio creator unit 312 contains a series of codecs and filters. For example, the generation of the degraded audio signal may comprise applying at least one codec (e.g. a voice codec) to the clean audio signal. The generation of the degraded audio signal may alternatively or additionally comprise applying an Intermediate Reference System, IRS, filter to the clean audio signal. The generation of the degraded audio signal may alternatively or additionally comprise applying a low pass filter to the clean audio signal.

Below follow 1 1 examples of transcoding chains which have been proved advantageous for training an MLM as described herein. The details of the 1 1 transcoding chains are:

(1 ) Clean audio signal Low pass filter & IRS8 AMR-NB (5.1 ) G.71 1 VSV degraded audio signal,

(2) Clean audio signal Low pass filter & IRS8 AMR-NB (12.20) G.71 1

degraded audio signal,

(3) Clean audio signal Low pass filter & IRS8 G.729 G.729 (delayed by 12

samples) G.71 1 VSV degraded audio signal,

(4) Clean audio signal Low pass filter & IRS8 dynamic range compression Opus Narrowband (6 Kbps) G.71 1 VSV degraded audio signal,

(5) Clean audio signal Low pass filter & IRS8 Opus Narrowband (6 kbps) AMR- NB (6.70) G.71 1 VSV degraded audio signal,

(6) Clean audio signal Low pass filter & IRS8 dynamic range compression AMR- NB (6.70) G.71 1 VSV degraded audio signal,

(7) Clean audio signal Low pass filter & IRS8 AMR-NB (5.1 ) MNRU G.71 1 VSV (MOS = 3.0) degraded audio signal,

(8) Clean audio signal Low pass filter & IRS8 AMR-NB (5.1 ) MNRU G.71 1 VSV (MOS = 2.5) degraded audio signal

(9) Clean audio signal Low pass filter & IRS8 CVSD dynamic range

compression AMR-NB G.71 1 (Simulating GSM mobile on Bluetooth) VSV degraded audio signal,

(10) Clean audio signal Low pass filter & IRS8 iLBC G.71 1 (simulating iLBC SIP truck) VSV degraded audio signal,

(1 1 ) Clean audio signal Low pass filter & IRS8 speex G.71 1 (simulating speex SIP truck) VSV degraded audio signal.

The degraded audio signals outputted from the 1 1 transcoding chains may further be convolved with a narrow band impulse response before being used for training the MLM to simulate reverberations. The dynamic range compression may be performed by any suitable compressor, depending on the context and requirements. For example, the dynamic range may be used to mimic the compression in a PSTN transcoding chain.

Returning now to Fig. 1 . After receiving the degraded 308 and clean 310 audio signal, the method 100 comprises extracting S104 a first set of features from the received degraded audio signal 308, and extracting S106 a second set of features from the received clean audio signal 310. Each feature of the first and second sets of features corresponds to a frequency band of the respective received audio signals 308, 310.

Embodiments of the feature extraction S104, S106 will now be described.

The received degraded audio signal 308 and clean audio signal 310 is converted into the frequency domain. The frequency domain refers to the analysis of the audio signals with respect to frequency, rather than time. Any suitable mathematical transforms (Fourier transform, Wavelet transform, etc.,) for the conversation may be employed. Advantageous examples comprise a short time Fourier transform, SFTF, a modified discrete cosine transform, MDCT, and a shifted discrete frequency transform, MDXT. A reason for using MDXT instead of MDCT or DFT is that it provides both the energy compaction property of the MDCT and the phase information similar to DFT.

The features of the first and second set of features are advantageously banded features, meaning that a feature corresponds to a frequency band rather than a frequency bin. This will reduce the complexity of the training of the MLM, since less input values will be used for training.

For this reason, according to some embodiments, the first and second set of features are extracted by, for each frequency band of a plurality of frequency bands, for frequency bins of the frequency band, combining complex features of the frequency domain representation of the respective audio signal corresponding to the frequency bins into a single feature corresponding to that frequency band. The combination of the complex features may comprise calculating an absolute value of the complex values of the bins. The logarithm of the combined value may then be added to the first/second set of features. In some embodiments, the features of the of the first and second set of features corresponds to Mel-frequency band powers, Bark Scale band powers, log-frequency band powers or ERB band powers.

Putting it differently, the first and second sets of features may be extracted by:

converting the received degraded audio signal and clean audio signal into the frequency domain, for each frequency band, j, of a plurality of frequency bands

combining (as described above) frequency components (i.e. corresponding to the bins of the frequency band j) of the frequency domain representation of the degraded audio signal into a feature, fi _j, corresponding to the frequency band, and adding log(fi _j) to the first set of features;

combining (as described above) frequency components (i.e. corresponding to the bins of the frequency band j) of the frequency domain representation of the clean audio signal into a feature, f _2j, corresponding to the frequency band and adding log(f _2,j) to the to the second set of features.

In some embodiments, the combining of frequency components of the frequency domain representation of the degraded/clean audio signal into a feature comprises weighting the frequency components with different weights.

The frequency bands may be determined such that each comprise a same number of bins (such as 100, 160, 200, 320, etc., bins).

In one embodiment, the log of the power in of a number of spectral bands equally spaced in Mel frequency (hereinafter referred to as“logmelspec” values) is computed, for example every 10ms. In one embodiment, for the most recent 5-20 frames of the audio signal (clean, degraded) is used and such features are“stacked up” into an extended 230- 460 dimensional feature vector (first and second set of features). In other words, the first and second set of features are extracted by combining extracted features from a plurality of audio frames of the repective audio signals.

A method for normalisation of“logmelspec” features that are“stacked up” over multiple analysis frames will now be described. It should be noted that this normalization scheme is entirely optional to include in the method for supervised training of a machine learning model, MLM, as described herein. In summary, the optional normalisation technique:

1 ) only normalizing among the features in the stack (e.g. over a 100-200ms time window of the audio signal). This means it is helpful in accurately recognizing speech or detecting specific words (in a speech recognizing system) even when a user suddenly begins speaking in never-before-heard acoustic conditions.

2) separates macro-scale spectral shape from micro-scale detail. This is helpful in creating systems which are robust to, for example, microphone with different equalization curves or room impulse responses (RIRs) with different spectral shape. It also means the system is more robust to background noise. 3) accommodates weighting (certainty) in the features. This means the system can take additional input from pre-processing systems such as echo suppressors, noise suppressors and non-linear beamforming systems which greatly aid in robustness.

The normalization scheme comprises the following equations.

where

+ Denotes Moore-Penrose pseudoinverse

Notes:

Equation 200: In this step, the mean spectrum over all frames in the stack is computed. • Equation 201 : In this step, the mean level over all frames in the input stack 108 is computed (in this example by taking the mean of the spectrum over all frequencies).

• Equation 202: In this step, the mean level-independent spectrum is computed.

• Equation 203: In this step, a smooth cepstral approximation of the mean level- independent spectrum is computed for a small number of cepstral dnabs. The cepstral components except for the one corresponding to the flat basis function (usually, this means excluding the first cepstral component) are taken as a set of cepstral output components 1 10B which summarise the general spectral shape of the audio data in the stack in a format that is convenient for a speech recogniser to use.

• Equation 204: In this step, the smooth cepstral approximation is transformed back into a smooth spectrum.

• Equation 205: In this step, the normalised spectral output is computed by removing the smoothed mean spectrum from the input.

• x[f,t] is an unnormalized input feature at a particular time t (in the range [1 ,T], index 1 corresponding to the most recent data) in the past and a particular frequency band f (in the range [1 ,F]).

• m is the mean spectrum averaged over all frames [1 ,T]

• L is mean broadband level.

• /I is the mean level-independent spectrum averaged over all frames [1 ,T]

• c[b] is a cepstral decomposition of Ji obtained by taking the truncated Discrete

Cosine Transform (DCT) of Ji w.r.t. the DCT basis matrix C[b,f] for each of the cepstral dnabs b=[1..B] (a typical value for B is 3).

• s[f] is the spectral resynthesis of c[b] obtained by taking the truncated Inverse

Discrete Cosine Transform (IDCT) of c w.r.t. the IDCT basis matrix S[f,b]

• y[f,t] are the normalised output features

• c[b] for b > 1 (that is, excluding the first cepstral dnab) are the cepstral output features

• output from the normalisation scheme comprises y[f,t] for f in [1 ,F] and t in [1 ,T] as well as c[b] for b in [2,B]

• The DCT basis C[b,f] is computed using the equations 206 and 207.

• The IDCT basis S[f,b] is computed by taking the Moore-Penrose pseudoinverse of C[b,f]

In some embodiments, the mean across time (mu) and the mean across time and frequency (L) can both be taken as weighted means if a confidence weighting w[f,t] is available for each input feature x[f,t]. This provides added noise robustness. In this extension, equation 200 would be replaced with 200A and equation 201 would be replaced with equation 201 A.

o 200A: o 201 A:

• The mean across time has be implemented as an HR (recursive) mean instead of the

FIR mean shown: m[ί,ί] = am[ί,M] + (1 -a)x[f,tj. An example value of a is 0.96.

Note that in we use the term cepstrum to mean the discrete cosine transform of the logmelspec data. It is common to reverse the characters in part of the word in order to come up with a cepstral term that corresponds to a spectral term. For example, filtering

implemented in the cepstral domain is commonly known as“Mftering”. Therefore, we herein term the cepstral equivalent of a spectral band a“dnab”.

A system for implementing the above described normalization method may comprises a speaker speaking into a microphone. The audio signal recorded by the microphone is sent to an Analogue-to-digital converter. The pulse code modulated data (PCM data) may be sent to a digital pre-processing unit (may include echo suppression, noise suppression, beam-forming for example). The PCM data is then used for feature extraction as described above. For example, the log of power in 25 frequency bands of equal width in Mel-frequency space are computed, resulting in a feature vector (eg 25 real numbers) The features from one audio frame is then sent to a stacking unit, which has a a history buffer and stores or“stacks up” multiple feature vectors into a 2 dimensional array of features in time and frequency. For example, for every 10ms a new 25 band feature vector may be computed and the stacking unit keeps the most recent 10 such vectors so that its output is a 25 (in frequency) by 10 (in time) array of feature history. Subsequently, the normalization as described above in conjunction with equations 201 -208 are performed. The normalization feature set comprising:

- A two-dimensional“stack” of normalised features. This two-dimensional array will typically have the same size as the unnormalized feature stack.

- A small number of cepstral features describing the overall average spectral shape in the feature stack. This would, for example, encompass 2 real numbers. The normalization feature set will then be used for input (optionally pre-processed to adjust the frequency energy distribution as described herein) to the MLM, e.g. a Deep Neural Network (DNN), Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN).

Returning again to Fig. 1 . When the first and second sets of features have been extracted S104, S106, in some embodiments, pre-processing is applied to the sets of features. The pre-processing is adapted to adjust the frequency energy distribution of the first and/or the second set of features such that the frequency energy distribution of the first set of features is substantially equal to the frequency energy distribution of the second set of features. This is done to reduce over suppression of certain frequency bands when the trained MLM is used for enhancing a degraded audio signal. As described above, the coding/transcoding of the degraded audio signal may result in a timbre quality issue, where e.g. a high/low frequency bands of the degraded audio signal 308 have a substantially different frequency energy distribution compared to the clean audio signal 310. In other embodiments, other frequency bands other than the high or low frequency part may show similar problems. In other words, the frequency energy distribution of degraded audio signal 308 and clean audio signal 310 is different.

To mitigate this problem, the method 100 for training an MLM may comprise a preprocessing step S108 that adjusts the frequency energy distribution of the first and/or the second set of features such that the frequency energy distribution of the first set of features is substantially equal to the frequency energy distribution of the second set of features. The adjustment may be applied to the first set of features, or the second set of features, or the first and the second set of features. In the following, pre-processing of the second set of features is described by way of example.

In some embodiments, the pre-processing comprises balancing a frequency energy distribution of the second set of features to be substantially equally distributed across the entire frequency band of the received clean audio signal. This may be done, since for a clean audio signal 310, energy trends to decrease from low frequency to high frequency. But for degraded audio signal 308, frequency energy may be more balanced or not decrease as much as for the clean audio signal 310. That means that their frequency energy shape is different, and if the first and second set of features is used as is to train the MLM, this may lead to over suppression in high frequencies. Such an embodiment is shown in Fig. 2.

The balancing of the frequency energy distribution of the second set of features may comprise: fitting S202 a polynomial curve to the second set of features, defining S208 a filter based on difference between the polynomial curve and a constant function, and applying S210 the filter to the second set of features. The adjusted second set of features may then be used for calculating S1 10 the gains for training (which will be further described below).

In some embodiments, the frequency energy distribution of the second set of features is only balanced in case the original frequency energy shape fulfils some prerequisites. For example, a difference between a minimum value and a maximum value of the polynomial curve may need to exceed a threshold value such as 3 dB, 5 dB or any other suitable threshold. The threshold value may thus correspond to e.g. a 3 dB difference in a frequency energy distribution of the first set of features across the entire frequency band of the received clean audio signal 310. In these embodiments, the balancing of the frequency energy distribution of the second set of features may comprise: fitting S202 a polynomial curve to the second set of features, calculating S204 a difference between a minimum value and a maximum value of the polynomial curve, upon determining (in step S206) that the difference exceeds a threshold value: defining S208 a filter based on difference between the polynomial curve and a constant function, and applying S210 the filter to the second set of features. The adjusted second set of features may then be used for calculating S1 10 the gains for training (which will be further described below). In case the difference does not exceed the threshold (as determined in S206), the second set of features are used as is for calculating S1 10 the gains for training (which will be further described below).

The polynomial curve may be one from a list of: a linear curve, a quadratic curve and a cubic curve.

The value of the constant function may be set to the maximum value, minimum value or mean value of the polynomial curve.

It should be noted that in other embodiments, the first set of features may be adjusted to be more similar (in energy distribution) to the second set of features using similar processes as described above.

It should further be noted that in some embodiments, both the first and second set of features may be adjusted to meet a target energy distribution. Such target energy distribution may be based on e.g. a perpetual importance of the different frequency bands.

Returning again to Fig. 1 . In some embodiment, multi-style training S109 of the MLM is employed. Training of an MLM is advantageously employed using a lot of data, and the data need to be various to make sure the trained MLM robust. Multi-style training is a method of data augmentation, which is used for making data diversification. This step S109 may comprise adding random noise to the first and/or second set of features (adjusted or not according to step S108) before used of calculating the gains. In some embodiments, the noise is added to the first set of features. The noise may have a limit of 5 %, 10 %, 20 %, etc., of the value of the feature it is added to. In some embodiments, a same amount of noise is added to both a feature pair of the first and second set of features (i.e. one feature from the first set of features and a corresponding feature, relating to the same frequency band, of the second set of features), to simulate a change of energy level for the frequency band corresponding to the feature pair. In some embodiments, the noise is added only for a first threshold number of epochs when training the MLM.

In some embodiments, the multi style training comprises using distinct

adjustment/augmentation parameters during each training epoch (e.g., each epoch, minibatch, and/or pass, of or in the training loop), and such embodiments will now be described in conjunction with figure 6. For example, such procedure efficiently adds different noise and reverb in the feature domain on each training pass/epoch. This may increases the ability of speech technology such as speech recognisers, wakeword detectors and noise suppressors to operate in real-world far-field conditions without many of the overfitting problems that traditionally plague multi-style training approaches.

In figure 6, the degraded audio signal 308 and the clean audio signal 310 is received by a feature extraction unit 602. The feature extraction unit extracts a first set of features 604 from the degraded audio signal 308 and a second set of features from the clean audio signal 310 as described above.

It should be noted that, for simplicity, figure 6 does not include the optional feature of adjusting of the frequency energy distribution of the first and/or the second set of features, but such extension of the multi style training of figure 6 would mean that a pre-processing unit would receive the first 604 and second 606 set of features and perform the adjusting of the frequency energy distribution as described herein.

The first 604 and second 606 set of features are received by a data augmentation unit 608 that implements the multi-style training (S109 in figure 1 ). The data augmentation unit 608 takes the first 604 and second 606 set of features and adjusts a set of features 604, 606 by applying augmentation (e.g., addition of reverberation, addition of stationary noise, addition of non-stationary noise, and/or addition of simulated echo residuals) thereto, thus generating augmented set of features 604 ^*, 606 ^*. The data augmentation unit 608 may operate on both or one of the received sets of features 604, 606. It should be noted that the data augmentation unit 608 operates:

• in the feature domain. For this reason, implementations may be fast and efficiently implemented on a GPU as part of a deep learning training procedure; and • inside the pass/epoch loop of the training of the MLM 612, which means that different augmentation conditions (e.g., distinct room/reverberation models, distinct noise levels, distinct noise spectra, distinct patterns of non-stationary noise or music residuals) can be chosen on each training epoch of the MLM 612.

In case a stopping criterion 616 is not fulfilled (i.e. a defined number of epochs of training of the MLM 612 or a convergence criterion of the MLM 612), the data augmentation unit will again augment the sets of features 604, 606 and the MLM 612 will be trained based on new augmented set(s) of features 604 ^*, 606 ^*. In case the stopping criterion 616 is fulfilled, the feature extraction unit will operate on a next audio frame (if any) of the degraded audio signal 308 and the clean audio signal 310 to proceed with the training of the MLM.

Examples of types of augmentations (using adjustment parameters) that the data augmentation unit 608 may perform in the feature domain (on the first and/or second set of features 604, 606) include (but are not limited to) the following:

• Fixed spectrum stationary noise: For each corresponding utterance in the clean and degraded audio signal 308, 310, draw a random signal to noise ratio (SNR) from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) and apply stationary noise with a fixed spectrum (e.g., white noise, pink noise, Hoth noise) at the chosen level below the incoming speech signal. When the input features are band powers in dB, adding noise corresponds to taking the bandwise maximum of noise power and signal power. An example of fixed spectrum stationary noise augmentation will be described with reference to Fig. 7;

• Variable spectrum stationary noise: Draw an SNR as for the fixed spectrum stationary noise addition, and also draw a random stationary noise spectrum from a distribution (for example, a distribution of linear slope values in dB/octave, a distribution over DCT values of the log mel spectrum (cepstral)). Apply noise at the chosen SNR with the chosen shape;

• Non-stationary noise: Add noise that is localized to random locations in the spectrogram in time and/or in frequency. For example, for each training utterance, draw ten rectangles, each rectangle having a random start time and end time and a random start frequency band and end frequency band and a random SNR. Within each rectangle, add noise at the given SNR;

• Reverberation: Draw a reverberation model (e.g., with a random RT60, mean free path and distance from source to microphone). Apply this reverberation model to the input features (e.g., as described in US Provisional Patent Application No.

62/676,095);

• Simulated echo residuals: To simulate leakage of music through an echo canceller (smart speakers, and some other smart audio devices and other devices, must routinely recognize speech incident at their microphones while music is playing from their speakers, and typically use an echo canceller or echo suppressor to partially remove echo), add music-like noise. An example of simulated echo residuals augmentation will be described with reference to the code listing below;

• Microphone equalization: Speech recognition systems often need to operate without complete knowledge of the equalization characteristics of their microphone hardware. Therefore, it can be beneficial to apply a range of microphone equalization characteristics during training. For example, choose a random microphone tilt in dB/octave (e.g., from a normal distribution mean of OdB/octave, standard deviation of 1 dB/octave) and apply a filter which has a response which has a linear magnitude response. When the feature domain is log (e.g., dB) band power, this may correspond to adding an offset to each band proportional to distance from some reference band in octaves. An example of microphone equalization augmentation will be described with reference to Fig. 8;

• Microphone cutoff: Another microphone frequency response characteristic which is not necessarily known ahead of time is the low frequency cutoff. For example, one microphone may pick up signal down to 200 Hz, while another microphone may pick up speech down to 50 Hz. Therefore, augmenting the input features by applying a random low frequency cutoff (highpass) filter may improve performance across a range of microphones; and/or

• Level: A further parameter which may vary from microphone to microphone and from acoustic situation to acoustic situation is the level or bulk gain. For example, some microphones may be more sensitive to other microphones and some talkers may sit closer to a microphone than other talkers. Further some talkers may talk louder than other talkers. Speech recognition systems must therefore deal with speech at a range of input levels. It may therefore be beneficial to vary the level of the input features during training. When the features are band power in dB, this can be accomplished by drawing a random level offset from a distribution (e.g., uniform distribution over [-20, +20] dB) and adding that offset to all band powers. The adjustment/augmentation parameters thus are derived using one or more of the above strategies. The adjustment/augmentation parameters may be drawn from one or more probability distributions.

Another embodiment of the invention, which includes fixed spectrum stationary noise augmentation, will be described with reference to Fig. 7. Fig. 7 is a diagram showing an example of fixed spectrum stationary noise addition (augmentation) in accordance with an embodiment of the invention. Elements of Fig. 7 include the following:

• 210: example noise spectrum;

• 21 1 A: flat portion of example spectrum 210. Spectrum 210 is flat below reference frequency f _peak. An example value of f _peak is 200Hz.;

• 21 1 B: portion of example spectrum 210 above frequency f _peak. Portion 21 1 B of spectrum 210 rolls off at a constant slope in dB / octave. According to experiments by Hoth (see Hoth, Daniel, The Journal of the Acoustical Society of America 12, 499 (1941 ); htps://doi.Org/ 10.1 121 /1 .1916129), a typical mean value to represent such roll off of noise in real rooms is 5 dB / octave;

• 212: Reference frequency (f _peak) below which average spectrum is modelled as flat;

• 213: Example mean speech spectrum 214 and example equivalent noise spectrum 215;

• 214: Example mean speech spectrum over one training utterance;

• 215: Equivalent noise spectrum. This is formed by shifting the noise spectrum 210 by the equivalent noise power so that the mean power over all frequency bands of the equivalent noise spectrum 215 is equal to the mean power over all bands of the mean speech spectrum 214. The equivalent noise power can be computed using the following formula:

where

X _i is the mean speech spectrum in band i in decibels (dB),

^■ n _i is the prototype noise spectrum in band i in decibels (dB), and

^■ There are N bands;

• 216: Added noise spectrum. This is the spectrum of the noise to be added to the training vector (in the feature domain). It is formed by shifting the equivalent noise spectrum 215 down by the Signal to Noise Ratio, which is drawn from the SNR distribution 217. Once created, the noise spectrum 216 is added to all frames of the training vector in the feature domain by taking the maximum of the signal band power and the noise spectrum in each time-frequency tile; and • 217: Signal to Noise Ratio (SNR) distribution. An SNR is drawn, from the distribution 217, for each training vector in each epoch/pass. In the example, SNR distribution 217 is a normal distribution with a mean of 45 dB and a standard deviation of 10 dB. Another embodiment of the invention, which includes microphone equalization augmentation, will be described with reference to Fig. 8. Elements of Fig. 8 include the following:

• 220: Example microphone equalization spectrum (a curve, indicating power in dB as a function of frequency) to be added to all frames of one training vector for one epoch/pass of training. In this example the microphone equalization curve 220 is linear in dB / octaves;

• 221 : Reference point (of curve 220). In the band corresponding to (i.e., including) a reference frequency f _ref (e.g., f _ref = 1 kHz), the power (indicated by equalization spectrum 220) is OdB; and

• 222: A point of curve 220, at an arbitrary frequency, f. At point 222, the microphone equalization curve 220 has gain,“g” dB, where g = T log ₂(f - f _ref) for a randomly selected tilt T in dB / octave. For example, T can be drawn for each training vector for each epoch/pass.

Below there is provided an example of simulated echo residuals augmentation with reference to the following code listing. The code listing (in Julia programming language) implements music residual addition. In this listing:

• coef.fband is a vector of band centre frequencies in Hz;

• coef.meandifflog_fband is mean(diff(log.(fband))); and

• coef.dt_ms is the audio frame size in milliseconds.

The examplary code listing is:

function batch_generate_residual (nframe : : Int , nvector : : Int , coef : : NoiseAdditionCoef { X, P } ) where {X<:Real, P<:Real}

nband = length (coef . fband)

tempo_bpm = X(100) .+ rand(X, 1, 1, nvector) *X ( 80 ) pitchiness = (X(l) .+ rand(X, 1, 1, nvector) . *X (10) ) .*

X(0.07) ./ coef .meandifflog_fband

melody = randn (X, 1, 1, nvector) . * X(0.01) ./

coef . meandifflog_fband

Cl = rand (X, 1, 1, nvector) .* X(20)

C2 = randn (X, 1, 1, nvector) .* X(10) X(5) f = l:nband

t 1 : nframe spectrum = Cl . * cos. (pi . * X . ( f) ./ X(nband)) .+ C2 . * cos . (X ( 2 ) . * pi . * X . ( f) ./ X(nband))

spectrum = spectrum mean ( spectrum; dims=l) parti = sin. (X(2) .* pi .* (f .+ t' .* melody) ./ pitchiness )

part2 = cos. (X(2) .* pi .* t' .* X(60 * 4) .* coef.dt_ms

./ (tempo_bpm [ 1 ] *X (1000) ) ) spectrum .+ X(10) . * parti . * part2

end

To improve the robustness of the training data, the multi-style training S109 may further comprise adding artificial pairs of features to the first and second sets of features, wherein an artificial pair of feature comprises a first feature added to the first set of features and a second feature added to the second set of features, the first and second feature having a same value and corresponding to a same frequency band.

The first and second sets of features (adjusted via processing and/or multi style training, or original) is then used for deriving a set of gains. This is done by comparing S1 10 each feature of the first set of features to a corresponding feature of the second set of features to derive a set of gains, each gain corresponding to a respective feature among the first set of features, and used as ground truth when training the MLM. The comparison comprises for each feature pair, subtracting the value of the feature of the first set of features from the value of the feature of the second set of features.

In some embodiments, over suppression in the trained MLM is reduced by defining S1 1 1 a loss function of the MLM that is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain. This embodiment will now be described.

In some embodiments, the loss function is configured to punish a predicted gain being lower than the ground truth gain more than a predicted gain being higher than the ground truth gain by: multiplying a distance measurement between the predicted gain and the ground truth with a weight, the weight being relatively higher when:

- the predicted gain is lower than the ground truth gain, and

- the predicted gain is negative,

the weight being relatively lower when:

- the predicted gain is higher than or equal to the ground truth gain, or

- the predicted gain is positive.

The ratio between the relatively higher weight and the relatively lower weight may be between 3-7, for example 5.

In one embodiment, the equation for loss function is:

where i is a frame index; j is a band index; a is a punishment coefficient, which according to experiments, a = 5, gets the best result, but other values may be used depending on the context and requirements, y ^pre is the predicted gain from the MLM and y ^true is the ground truth gain. Other suitable ways of defining the loss function where a predicted gain being lower than the ground truth gain is punished more than a predicted gain being higher than the ground truth may be used. For example, the weight w may be multiplied with the second term in a) or with the sum of both the first (L2) term and the second term.

In some embodiments, a further weight, z, is added to the equation, which weight depends on the frequency band j of the features of the training set, such that an error for a feature corresponding to a relatively higher frequency band is weighted with a relatively higher weight. The equation for the loss function may be

where b > m.

For example, an error for a feature corresponding to a frequency band exceeding 6 kHz is weighted with a higher weight compared to an error for a feature corresponding to a frequency band below 6 kHz.

Other suitable ways of defining the loss function where an error for a feature corresponding to a relatively higher frequency band is weighted with a relatively higher weight may be used. For example, the weight w may be multiplied with the second term in a) or with the sum of both the first (L2) term and the second term.

Returning again to Fig. 1 , the method 100 continues by using the first set of features and the derived set of gains as a training set for training the MLM.

As described above, the MLM may be a DNN. By way of example, an architecture of such a DNN will now be described. This architecture has proven to be advantageous for the task of training the MLM for the purpose described herein. The architecture of the DNN may be a typical feed-forward full-connected deep neural network. It may have one input layer, six hidden layers and one output layer. The architecture may be summarized as follows:

(1 ) Layer 1 : input layer, number of DNN nodes is 320, include e.g. 8 frame band feature stacked together (as described above), 7 history frames and one current frame (9 history frames and one current frame, 15 history frames and one current frame, etc.,).

(2) Layer 2: hidden layer 1 , number of DNN nodes is 320, activation function is