Title:
ROBUST INTRUSIVE PERCEPTUAL AUDIO QUALITY ASSESSMENT BASED ON CONVOLUTIONAL NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2022/112594
Kind Code:
A2
Abstract:
Described herein is a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame. The system comprises at least one inception block configured to receive at least one representation of an input audio frame and to map the at least one representation of the input audio frame into a feature map; and at least one fully connected layer configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, wherein the at least one fully connected layer is configured to determine the indication of the audio quality of the input audio frame. Described are further respective methods of operating and training said system.

Inventors:
BISWAS ARIJIT (DE)
JIANG GUANXIN (DE)
Application Number:
PCT/EP2021/083531
Publication Date:
June 02, 2022
Filing Date:
November 30, 2021
Assignee:
DOLBY INT AB (NL)
International Classes:
G10L25/30; G06N3/04; G10L25/60
Other References:
P. M. Delgado and J. Herre, "Can we still use PEAQ? A performance analysis of the ITU standard for the objective assessment of perceived audio quality," 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), 2020, pp. 1-6, XP033784024, DOI: 10.1109/QoMEX48832.2020.9123105
Attorney, Agent or Firm:
DOLBY INTERNATIONAL AB PATENT GROUP EUROPE (NL)
Claims:
CLAIMS

1. A computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, the system comprising: at least one inception block configured to receive at least one representation of an input audio frame and to map the at least one representation of the input audio frame into a feature map; and at least one fully connected layer configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, wherein the at least one fully connected layer is configured to determine the indication of the audio quality of the input audio frame; wherein the at least one inception block comprises: a plurality of parallel paths of convolutional layers, wherein at least one parallel path includes a convolutional layer with an m x n sized kernel, wherein the integer m is different from the integer n.

2. The system of claim 1, wherein the at least one representation of the input audio frame corresponds to a gammatone spectrogram with a first axis representing time and a second axis representing frequency.

3. The system of claim 1 or 2, wherein the plurality of parallel paths of convolutional layers includes at least one convolution layer with a horizontal kernel and at least one convolutional layer with a vertical kernel.

4. The system of claim 3 when depending on claim 2, wherein the horizontal kernel is an m x n sized kernel with m > n, so that the horizontal kernel is configured to probe temporal dependencies of the input audio frame.

5. The system of claim 3 or 4 when depending on claim 2, wherein the vertical kernel is an m x n sized kernel with m < n, so that the vertical kernel is configured to probe timbral dependencies of the input audio frame.

6. The system of any one of claims 1 to 5, wherein the at least one inception block further comprises a path with a pooling layer.

7. The system of claim 6, wherein the pooling layer comprises an average pooling.

8. The system of any one of claims 1 to 7, wherein the system further comprises at least one squeeze-and-excitation, SE, layer.

9. The system of claim 8, wherein the squeeze-and-excitation layer follows a last convolutional layer of the plurality of parallel paths of convolutional layers of the at least one inception block.

10. The system of claim 8 or 9, wherein the squeeze-and-excitation layer comprises a convolutional layer, two fully connected layers and sigmoid activation.

11. The system of claim 10, wherein, in the squeeze-and-excitation layer, the convolutional layer followed by a scaling operation with the two fully connected layers generates a respective attention weight for each channel of the feature map output by the at least one inception block, and applies said attention weights to the channels of the feature map and performs concatenation of the weighted channels.

12. The system of any one of claims 1 to 11, wherein the system comprises two or more inception blocks and two or more squeeze-and-excitation layers, and wherein the inception blocks and the squeeze-and-excitation layers are alternately arranged.

13. The system of any one of claims 1 to 12, wherein the input audio frame is derived from a mono audio signal, and wherein the at least one representation of the input audio frame comprises a representation of a clean reference input audio frame and a representation of a degraded input audio frame.

14. The system of any one of claims 1 to 12, wherein the input audio frame is derived from a stereo audio signal comprising a left channel and a right channel, and wherein the at least one representation of the input audio frame comprises for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels.

15. The system of any one of claims 1 to 14, wherein the indication of the audio quality comprises at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA.

16. The system of any one of claims 1 to 15, wherein the at least one fully connected layer comprises a feed forward neural network.

17. A method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal, wherein the system comprises at least one inception block and at least one fully connected layer, the method including the steps of: receiving, by the at least one inception block, at least one representation of the input audio frame of the mono audio signal comprising a representation of a clean reference input audio frame of the mono audio signal and a representation of a degraded input audio frame of the mono audio signal; mapping, by the at least one inception block, the at least one representation of the input audio frame into a feature map; and predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the feature map.

18. The method of claim 17, wherein the indication of the audio quality comprises at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA.

19. The method of claim 17 or 18, wherein the system further comprises at least one squeeze-and-excitation layer subsequent to the inception block and the method further comprises applying, by the squeeze-and-excitation layer, respective attention weights to the channels of the feature map output by the at least one inception block.

20. The method of any one of claims 17 to 19, wherein the at least one inception block comprises a plurality of parallel paths of convolutional layers, and wherein at least one parallel path includes a convolutional layer with an m x n sized kernel, wherein the integer m is different from the integer n.

21. A method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio signal, wherein the system comprises at least one inception block and at least one fully connected layer, the method including the steps of: receiving, by the at least one inception block, at least one representation of the input audio frame comprising for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels; mapping, by the at least one inception block, the at least one representation of the input audio frame into feature maps; and predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the feature maps.

22. The method of claim 21, wherein the indication of the audio quality comprises at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA.

23. The method of claim 21 or 22, wherein the system further comprises at least one squeeze-and-excitation layer subsequent to the inception block and the method further comprises applying, by the squeeze-and-excitation layer, respective attention weights to the channels of the feature map output by the at least one inception block.

24. The method of any one of claims 21 to 23, wherein the at least one inception block comprises a plurality of parallel paths of convolutional layers, and wherein at least one parallel path includes a convolutional layer with an m x n sized kernel, wherein the integer m is different from the integer n.

25. The method of any one of claims 21 to 24, wherein the method further includes prior to receiving the at least one representation of the input audio frame, receiving one or more weight coefficients of at least one inception block that had been obtained for a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal and initializing one or more weight coefficients of the at least one inception block based on said received one or more weight coefficients.

26. A method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one inception block and at least one fully connected layer, the method including the steps of: receiving, by the at least one inception block, at least one representation of an input audio frame of an audio training signal comprising a representation of a clean reference input audio frame of the audio training signal and a representation of a degraded input audio frame of the audio training signal; mapping, by the at least one inception block, the at least one representation of the input audio frame of the audio training signal into a feature map; predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame of the audio training signal based on the feature map; and tuning one or more parameters of the computer-implemented deep-learning-based system based on a comparison of the predicted indication of the audio quality and an actual indication of the audio quality.

27. The method of claim 26, wherein the comparison of the predicted indication of the audio quality and the actual indication of the audio quality is based on a smooth L1 loss function.

28. A method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio training signal, wherein the system comprises at least one inception block and at least one fully connected layer, the method including the steps of: initializing one or more weight coefficients of the at least one inception block based on one or more weight coefficients that had been obtained for at least one inception block of a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio training signal; receiving, by the at least one inception block, at least one representation of an input audio frame of a stereo audio training signal comprising for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels; mapping, by the at least one inception block, the at least one representation of the input audio frame of the stereo audio training signal into feature maps; predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame of the stereo audio training signal based on the feature maps; and tuning one or more parameters of the computer-implemented deep-learning-based system based on a comparison of the predicted indication of the audio quality and an actual indication of the audio quality.

29. The method of claim 28, wherein the comparison of the predicted indication of the audio quality and the actual indication of the audio quality is based on a smooth L1 loss function.

Description:
ROBUST INTRUSIVE PERCEPTUAL AUDIO QUALITY ASSESSMENT BASED ON CONVOLUTIONAL NEURAL NETWORKS

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority of the following priority application: US provisional application 63/119,318 (reference: D20118USP1), filed 30 November 2020.

TECHNICAL FIELD

The present disclosure relates generally to a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame. In particular, the system comprises at least one inception block and at least one fully connected layer. The present disclosure relates further to respective methods of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal or a stereo audio signal and to respective methods of training said system.

BACKGROUND

Audio quality perceived by humans is a core performance metric in many audio and multimedia networks and devices, such as voice over internet protocol (VoIP), digital audio broadcasting (DAB) systems and streaming services. Steady, continuous and fast transmission of audio files from a server to a remote client is bounded by many technical restrictions, such as limited bandwidth, congested networks or overloaded client devices. An audio codec is a computer program designed to encode and decode a digital audio stream. To be more precise, it compresses and decompresses digital audio data to and from a compressed format with the help of codec algorithms. An audio codec intends to reduce the storage space and bandwidth while keeping a high fidelity of broadcast or transmitted signals. Due to lossy compression methods, the audio quality can be noticeably inferior and affect the user experience. In order to authentically reflect the audio quality that a human has perceived, listening tests in which audio excerpts are rated by a group of trained listeners are conducted, and the resulting average scores represent the quality of the corresponding audio excerpts. However, listening tests for a huge number of audio files are impractical to perform, because they are tedious and require experienced listeners to carry out repetitive work.

Engineers seek algorithms and techniques to avoid the heavy workload of listening tests. Audio quality evaluation methods can be roughly categorized into objective and subjective methods. Subjective methods usually refer to listening tests, whereas objective evaluations are numerical measures produced by machines and devices and serve as a computational proxy for listening tests. Classic objective audio quality assessment methods such as Perceptual Evaluation of Audio Quality (PEAQ), Perceptual Objective Listening Quality Analysis (POLQA), and the Virtual Speech Quality Objective Listener (ViSQOL) are designed for specific sound codecs, i.e., speech or audio codecs, and/or specific bitrate operating points. These objective methods share a common problem: they become obsolete with the emergence of new scenarios. For instance, service providers constantly update their codecs to optimize the encoding and decoding process. Under these circumstances, the codec alterations need to be frequently validated by performing subjective or objective tests. However, massive listening tests are impractical, and objective evaluation of a targeted specific codec or bitrate could be beyond the scope of their capacity. Deep learning approaches offer a new perspective to derive an audio quality assessment model that is accurate, rapidly re-trainable and easily expandable to new scenarios and applications.

SUMMARY

In accordance with a first aspect of the present disclosure there is provided a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame. The system may comprise at least one inception block configured to receive at least one representation of an input audio frame and to map the at least one representation of the input audio frame into a feature map. And the system may comprise at least one fully connected layer configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, wherein the at least one fully connected layer is configured to determine the indication of the audio quality of the input audio frame. The at least one inception block may comprise a plurality of parallel paths of convolutional layers, wherein at least one parallel path includes a convolutional layer with an m x n sized kernel, wherein the integer m is different from the integer n. In some embodiments, the at least one representation of the input audio frame may correspond to a gammatone spectrogram with a first axis representing time and a second axis representing frequency.

In some embodiments, the plurality of parallel paths of convolutional layers may include at least one convolution layer with a horizontal kernel and at least one convolutional layer with a vertical kernel.

In some embodiments, the horizontal kernel may be an m x n sized kernel with m > n, so that the horizontal kernel may be configured to probe temporal dependencies of the input audio frame.

In some embodiments, the vertical kernel may be an m x n sized kernel with m < n, so that the vertical kernel may be configured to probe timbral dependencies of the input audio frame.

In some embodiments, the at least one inception block may further comprise a path with a pooling layer.

In some embodiments, the pooling layer may comprise an average pooling.

In some embodiments, the system may further comprise at least one squeeze-and-excitation, SE, layer.

In some embodiments, the squeeze-and-excitation layer may follow a last convolutional layer of the plurality of parallel paths of convolutional layers of the at least one inception block.

In some embodiments, the squeeze-and-excitation layer may comprise a convolutional layer, two fully connected layers and sigmoid activation.

In some embodiments, in the squeeze-and-excitation layer, the convolutional layer followed by a scaling operation with the two fully connected layers may generate a respective attention weight for each channel of the feature map output by the at least one inception block, and may apply said attention weights to the channels of the feature map and may perform concatenation of the weighted channels.
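
By way of illustration only, a minimal PyTorch sketch of such a squeeze-and-excitation layer is given below. The channel reduction ratio, the 1x1 kernel size of the convolutional layer, and the global average pooling used for the squeeze step are assumptions of this sketch and are not taken from the present disclosure.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Sketch of an SE layer: a convolutional layer followed by per-channel
    attention weights produced by two fully connected layers and a sigmoid."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # assumed 1x1 convolution
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) feature map from an inception block
        y = self.conv(x)
        w = y.mean(dim=(2, 3))            # squeeze: global average pooling per channel
        w = torch.relu(self.fc1(w))       # excitation: bottleneck fully connected layer
        w = torch.sigmoid(self.fc2(w))    # per-channel attention weights in (0, 1)
        return y * w[:, :, None, None]    # scale (re-weight) each channel of the feature map
```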

In some embodiments, the system may comprise two or more inception blocks and two or more squeeze-and-excitation layers, and the inception blocks and the squeeze-and-excitation layers may be alternately arranged.

In some embodiments, the input audio frame may be derived from a mono audio signal, and the at least one representation of the input audio frame may comprise a representation of a clean reference input audio frame and a representation of a degraded input audio frame.

In some embodiments, the input audio frame may be derived from a stereo audio signal comprising a left channel and a right channel, and the at least one representation of the input audio frame may comprise for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels.

In some embodiments, the indication of the audio quality may comprise at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA.

In some embodiments, the at least one fully connected layer may comprise a feed forward neural network.

In accordance with a second aspect of the present disclosure there is provided a method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal, wherein the system comprises at least one inception block and at least one fully connected layer. The method may include the step of receiving, by the at least one inception block, at least one representation of the input audio frame of the mono audio signal comprising a representation of a clean reference input audio frame of the mono audio signal and a representation of a degraded input audio frame of the mono audio signal. The method may further include the step of mapping, by the at least one inception block, the at least one representation of the input audio frame into a feature map. And the method may include the step of predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the feature map.

In some embodiments, the indication of the audio quality may comprise at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA.

In some embodiments, the system may further comprise at least one squeeze-and-excitation layer subsequent to the inception block and the method may further comprise applying, by the squeeze-and-excitation layer, respective attention weights to the channels of the feature map output by the at least one inception block.

In some embodiments, the at least one inception block may comprise a plurality of parallel paths of convolutional layers, and at least one parallel path may include a convolutional layer with an m x n sized kernel, wherein the integer m may be different from the integer n.

In accordance with a third aspect of the present disclosure there is provided a method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio signal, wherein the system comprises at least one inception block and at least one fully connected layer. The method may include the step of receiving, by the at least one inception block, at least one representation of the input audio frame comprising for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels. The method may further include the step of mapping, by the at least one inception block, the at least one representation of the input audio frame into feature maps. And the method may include the step of predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the feature maps.

In some embodiments, the indication of the audio quality may comprise at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA.

In some embodiments, the system may further comprise at least one squeeze-and-excitation layer subsequent to the inception block and the method may further comprise applying, by the squeeze-and-excitation layer, respective attention weights to the channels of the feature map output by the at least one inception block.

In some embodiments, the at least one inception block may comprise a plurality of parallel paths of convolutional layers, and at least one parallel path may include a convolutional layer with an m x n sized kernel, wherein the integer m may be different from the integer n.

In some embodiments, the method may further include, prior to receiving the at least one representation of the input audio frame, receiving one or more weight coefficients of at least one inception block that had been obtained for a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal, and initializing one or more weight coefficients of the at least one inception block based on said received one or more weight coefficients.

In accordance with a fourth aspect of the present disclosure there is provided a method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one inception block and at least one fully connected layer. The method may include the step of receiving, by the at least one inception block, at least one representation of an input audio frame of an audio training signal comprising a representation of a clean reference input audio frame of the audio training signal and a representation of a degraded input audio frame of the audio training signal. The method may further include the step of mapping, by the at least one inception block, the at least one representation of the input audio frame of the audio training signal into a feature map. The method may further include the step of predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame of the audio training signal based on the feature map. And the method may include the step of tuning one or more parameters of the computer-implemented deep-learning-based system based on a comparison of the predicted indication of the audio quality and an actual indication of the audio quality.

In some embodiments, the comparison of the predicted indication of the audio quality and the actual indication of the audio quality may be based on a smooth L1 loss function.

In accordance with a fifth aspect of the present disclosure there is provided a method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio training signal, wherein the system comprises at least one inception block and at least one fully connected layer. The method may include the step of initializing one or more weight coefficients of the at least one inception block based on one or more weight coefficients that had been obtained for at least one inception block of a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio training signal. The method may further include the step of receiving, by the at least one inception block, at least one representation of an input audio frame of a stereo audio training signal comprising for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels. The method may further include the step of mapping, by the at least one inception block, the at least one representation of the input audio frame of the stereo audio training signal into feature maps. The method may further include the step of predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame of the stereo audio training signal based on the feature maps. And the method may include the step of tuning one or more parameters of the computer-implemented deep-learning-based system based on a comparison of the predicted indication of the audio quality and an actual indication of the audio quality.

In some embodiments, the comparison of the predicted indication of the audio quality and the actual indication of the audio quality may be based on a smooth L1 loss function.

According to a further aspect, there is presented a deep-learning-based system for determining an indication of an audio quality of an input audio frame. The system may include at least one inception block configured to receive at least one representation of an input audio frame and map the at least one representation of the input audio frame into a feature map, where the at least one inception block may include a plurality of stacked convolutional layers configured to operate in parallel paths. The at least one convolutional layer of the plurality of stacked convolutional layers may include an m x n sized kernel, wherein the integer m may be different from the integer n. The system may further include at least one fully connected layer configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, where the at least one fully connected layer may be configured to determine the indication of the audio quality of the input audio frame.

In some instances, the plurality of stacked convolutional layers may include at least one convolution layer comprising a horizontal kernel and at least one convolutional layer comprising a vertical kernel.

In some instances, the horizontal kernel may be configured to learn temporal dependencies of the input audio frame.

In some instances, the vertical kernel may be configured to learn timbral dependencies of the input audio frame.

In some instances, the inception block further includes a squeeze-and-excitation (SE) layer.

In some instances, the squeeze-and-excitation layer may be applied after the last stacked convolutional layer of the plurality of stacked convolutional layers.

In some instances, the inception block further includes a pooling layer.

In some instances, the at least one representation of an input audio frame includes a representation of a clean reference input audio frame and a representation of a degraded input audio frame.

In some instances, the indication of the audio quality comprises at least one of a mean opinion score, MOS, or a multiple stimuli with hidden reference and anchor, MUSHRA.

In some instances, the at least one fully connected layer includes a feed forward neural network.

According to a still further aspect, there is presented a method for operating a deep-learning-based system for determining an indication of an audio quality of an input audio frame, where the system includes at least one inception block and at least one fully connected layer. The method may include mapping, by the at least one inception block, the input audio frame into a feature map, and predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the feature map.

In some instances, the at least one representation of the input audio frame comprises a representation of a clean reference input audio frame and a representation of a degraded input audio frame.

In some instances, the indication of the audio quality comprises at least one of a mean opinion score, MOS, or a multiple stimuli with hidden reference and anchor, MUSHRA.

The methods described herein may be implemented as computer program products comprising a computer-readable storage medium with instructions adapted to cause a device to carry out the respective methods.

BRIEF DESCRIPTION OF THE DRAWINGS

Example embodiments of the disclosure will now be described, by way of example only, with reference to the accompanying drawings in which:

Fig. 1.1 illustrates an example of a workflow of a ViSQOL v3 system.

Fig. 1.2 illustrates an example of a workflow of a ViSQOL v3 system (left) and an example of a workflow of an InceptionSE model (right) in case of a mono audio signal as input.

Fig. 1.3 illustrates an example of a workflow of an InceptionSE model in case of a mono audio signal as input (left) and an example of a workflow of an InceptionSE model in case of a stereo audio signal as input (right).

Fig. 1.4 illustrates an example of a method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal, wherein the system comprises at least one inception block and at least one fully connected layer.

Fig. 1.5 illustrates an example of a method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one inception block and at least one fully connected layer.

Fig. 1.6 illustrates an example of a method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio signal, wherein the system comprises at least one inception block and at least one fully connected layer.

Fig. 1.7 illustrates an example of a method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio training signal, wherein the system comprises at least one inception block and at least one fully connected layer.

Figs. 2.1 to 2.5 illustrate schematically an example of convolution with square shaped kernels.

Fig. 2.6 illustrates schematically an example of convolution with a horizontal kernel with stride (2,2).

Fig. 2.7 illustrates schematically an example of convolution with a vertical kernel with stride (2,1).

Figs. 2.8 to 2.10 illustrate examples of activation functions: rectified linear unit (ReLU, 2.8), sigmoid function (2.9) and hyperbolic tangent function (tanh, 2.10).

Fig. 2.11 illustrates schematically an example of non-overlapping pooling.

Fig. 2.12 illustrates schematically an example of overlapping pooling.

Fig. 2.13 illustrates schematically an example of a fully connected layer for classification.

Fig. 2.14 illustrates schematically an example of a fully connected layer for regression.

Figs. 2.15 and 2.16 illustrate schematically an example of the procedure of dropout.

Fig. 2.17 illustrates schematically an example of an unrolled recurrent neural network.

Fig. 2.18 illustrates schematically an example of a long-short term memory unit.

Figs. 2.19 and 2.20 illustrate schematically the basic structure of self-attention by a simple example of a sentence.

Fig. 2.21 illustrates schematically an example of core operations in a squeeze-and-excitation layer.

Fig. 2.22 illustrates schematically an example of a naive inception block.

Fig. 2.23 illustrates schematically an example of an improved version of an inception block.

Fig. 2.24 illustrates schematically an example of an inception block with rectangular kernels.

Fig. 3.1 illustrates schematically an example of a CNN Model (vanilla).

Fig. 3.2A to 3.2D are to be combined in this order to illustrate schematically an example of an Inception model (naive).

Fig. 3.3A to 3.3D are to be combined in this order to illustrate schematically an example of an Inception model without head layer.

Fig. 3.4A to 3.4F are to be combined in this order to illustrate schematically an example of an InceptionSE model (naive).

Fig. 3.5A to 3.5F are to be combined in this order to illustrate schematically an example of an InceptionSE model without head layer.

Fig. 4.1 illustrates schematically an example of a pipeline of data generation and labeling.

Fig. 4.2 illustrates a gammatone spectrogram example of 01 Angry Mono.

Fig. 4.3 illustrates a MOS-LQO score distribution in a training dataset.

Fig. 4.4 illustrates an average MOS-LQO score based on bitrates in a training dataset.

Fig. 4.5 illustrates a spectrogram of original WADmus047 sample.

Fig. 4.6 illustrates a spectrogram of WADmus047 coded at 128kbps.

Fig. 4.7 illustrates a spectrogram of WADmus047 coded at 96kbps.

Fig. 4.8 illustrates a spectrogram of WADmus047 coded at 64kbps.

Fig. 4.9 illustrates a spectrogram of WADmus047 coded at 48kbps.

Fig. 4.10 illustrates a spectrogram of WADmus047 coded at 32kbps.

Fig. 4.11 illustrates a spectrogram of WADmus047 coded at 24kbps.

Fig. 4.12 illustrates a spectrogram of WADmus047 coded at 20kbps.

Fig. 4.13 illustrates a spectrogram of WADmus047 coded at 16kbps.

Fig. 4.14 illustrates an average MOS-LQO score based on bitrates in a modified training dataset.

Fig. 4.15 illustrates a spectrogram of original CO 02 OMensch.

Fig. 4.16 illustrates a spectrogram of CO 02 OMensch coded at high bitrate.

Fig. 5.1 illustrates a prediction of sc01 (trumpet) when trained without noise and silence.

Fig. 5.2 illustrates a prediction of sc01 (trumpet) when trained with noise and silence.

Fig. 6.1 illustrates a prediction of 09-Applaus-5-1 20 trained without noise and silence.

Fig. 6.2 illustrates a prediction of 09-Applaus-5-1 20 trained with noise and silence.

Fig. 6.3 illustrates a prediction of KoreanM1 trained without noise and silence.

Fig. 6.4 illustrates a prediction of KoreanM1 trained with noise and silence.

Fig. 6.5 illustrates a prediction of SpeechOverMusic 1 trained without noise and silence.

Fig. 6.6 illustrates a prediction of SpeechOverMusic 1 trained with noise and silence.

DETAILED DESCRIPTION

Subjective Quality Evaluation Metrics

Mean opinion score (MOS) is a standardized measure used in Quality of Experience (QoE). It is expressed as a rational number on a scale from 1 to 5, where 1 represents the lowest perceived quality and 5 represents the highest perceived quality. Another ITU-R Recommendation methodology in codec listening tests is Multiple Stimuli with Hidden Reference and Anchor (MUSHRA). The intuitive difference from MOS is that MUSHRA scales from 0 (bad) to 100 (excellent) and allows participants to rate audio excerpts with small differences. In addition, MUSHRA requires fewer listeners to obtain statistically significant results compared to MOS. Listeners are presented with the reference, some anchors, and a set of test samples. A low-range and a mid-range anchor, typically a 3.5 kHz and a 7.0 kHz low-pass filtered reference signal, are recommended to be included in the listening tests. The purpose of the anchors and the reference is to calibrate the scale when comparing results across different studies. MUSHRA is used in subjective listening tests, and MOS scores are used to evaluate the QoE in POLQA and ViSQOL.

Objective Quality Evaluation Tools

Objective audio quality evaluation can be classified as either parameter-based or signal-based. Parameter-based models predict quality by modeling characteristics of the transmission channel of the audio, such as packet loss rate and delay jitter.

Signal-based models estimate quality based on information taken from the signals rather than the medium of their transmission. Signal-based methods can be further divided into intrusive and non-intrusive, i.e., with or without a clean reference signal. In non-intrusive approaches, the algorithms take in only the degraded or contaminated signals for evaluating the excerpt quality. Intrusive algorithms, in contrast, take both the clean reference and the degraded signals as input, and the correlation between reference and degraded signals is taken into consideration in the algorithm. Intrusive methods are considered to be more accurate than non-intrusive methods. PEAQ, POLQA, PEMO-Q and ViSQOL are four examples of such intrusive models, which are used to rate the quality of full-band coded audio.

Early intrusive models focus mainly on speech and narrow frequency bands. PESQ evaluates broader frequency bands of speech and fixed the weaknesses of its predecessor. POLQA, as the successor of PESQ, has been extended to super-wideband (50 - 14000 Hz) speech excerpts. PEAQ, on the contrary, is designed for evaluating encoded audio. However, the output of PEAQ is a set of variables and coefficients rather than an intuitive score such as MOS. This set of coefficients and variables is then input into a machine learning model to obtain a distortion index. This distortion index is mapped to an Objective Difference Grade (ODG), where the grade 1 represents very annoying and 5 represents imperceptible quality degradation. PEMO-Q is also a perceptually motivated intrusive model that computes an error estimation including three components: distortion, interference and artifact. The weights of these components are mapped to generate an Overall Perception Score (OPS) ranging from 0 (very bad) to 100 (excellent quality).

ViSQOL is a speech quality evaluation model that was later adapted to audio quality evaluation (ViSQOLAudio). In short, ViSQOL takes in both the coded, degraded signal and its corresponding original reference and predicts Mean Opinion Score-Listening Quality Objective (MOS-LQO) scores for the degraded signal. The latest version, ViSQOL v3, which is illustrated in Figure 1.1, is a combined release of the old ViSQOL and ViSQOLAudio. The basic structure of ViSQOL consists of four phases: pre-processing, pairing, comparison, and mapping from similarity to quality. In the pre-processing stage, the mid-channel is extracted from the reference and degraded signals, considering that the input audio could be either stereo or mono. A global alignment, such as removing the initial zero padding in the signals, is then performed, and gammatone spectrograms with 32 bands and a minimum frequency of 50 Hz are extracted from the reference and degraded signals, respectively. In the pairing stage, the reference signal is first segmented into consecutive patches, each composed of 30 frames with a frame length of 20 ms. The degraded signal is then scanned frame by frame to find the set of most similar patch-pairs between the reference and degraded signals. In the comparison stage, the similarity score of each patch-pair across every frequency band is measured and averaged per frequency band and patch, which creates a Neurogram Similarity Index Measure (NSIM) score. In the mapping phase, this NSIM score is fed to a support vector regression (SVR) model that outputs a corresponding MOS-LQO value. ViSQOL v3 contains incremental improvements to the existing framework and is re-implemented in C++. It unites the old ViSQOL for speech and ViSQOLAudio by sharing most of the common components. The ViSQOL v3 system is shown in Fig. 1.1 with the newly added components emphasized by thick edges.
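
For illustration, a toy Python sketch of the pairing stage described above is given below. The patch length of 30 frames follows the description; the similarity measure used here (a negative mean absolute difference) is only a placeholder for the NSIM measure of the comparison stage, and the function name is hypothetical.

```python
import numpy as np

def pair_patches(ref_spec: np.ndarray, deg_spec: np.ndarray,
                 frames_per_patch: int = 30):
    """Toy pairing sketch: split the reference gammatone spectrogram
    (bands x frames, 20 ms frames) into consecutive patches and, for each,
    scan the degraded spectrogram for the most similar patch. The similarity
    used here is a simple negative mean absolute difference, standing in for
    the NSIM measure used by ViSQOL's comparison stage."""
    n_bands, n_frames = ref_spec.shape
    pairs = []
    for start in range(0, n_frames - frames_per_patch + 1, frames_per_patch):
        ref_patch = ref_spec[:, start:start + frames_per_patch]
        best_patch, best_score = None, -np.inf
        for d in range(0, deg_spec.shape[1] - frames_per_patch + 1):
            deg_patch = deg_spec[:, d:d + frames_per_patch]
            score = -np.mean(np.abs(ref_patch - deg_patch))
            if score > best_score:
                best_patch, best_score = deg_patch, score
        pairs.append((ref_patch, best_patch))
    return pairs
```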

Despite the fact that previous versions of ViSQOL have included two levels of alignment (global and patch), there were still issues with the patch alignment due to the spectrogram frames being misaligned at a fine scale. ViSQOL v3 introduces an additional alignment step to address this issue. Furthermore, ViSQOL v3 also introduces silence thresholds on the gammatone spectrogram. ViSQOL was too sensitive to different levels of ambient noise, and silence and low-level ambient noise excerpts would be rated with a low MOS-LQO despite being perceptually inaudible to humans. The silence threshold introduces an absolute floor which filters interference signals such as noise and silence.
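
A minimal sketch of such an absolute floor on a gammatone spectrogram might look as follows; the -80 dB threshold is an arbitrary example value, not the threshold used by ViSQOL v3.

```python
import numpy as np

def apply_silence_floor(gammatone_db: np.ndarray, floor_db: float = -80.0) -> np.ndarray:
    """Illustrative absolute floor on a gammatone spectrogram (in dB):
    energies below the threshold are clipped to the floor so that silence and
    low-level ambient noise no longer dominate the similarity comparison."""
    return np.maximum(gammatone_db, floor_db)
```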

In conclusion, ViSQOL achieves a high and stable performance for all items overall. Its degradation index NSIM is so far the best performing feature compared with the degradation indices of PEMO-Q and PEAQ, because ViSQOL NSIM presents the most balanced and high performance across all signal categories. Furthermore, all objective measures except for ViSQOL showed a weak correlation with subjective scores. However, ViSQOL does not propose a solution to evaluate a generative model that can be well analyzed by existing intrusive methods. Nevertheless, ViSQOL is thus far the most suitable coded audio quality evaluation model, covering a wide range of content types and quality levels. Future versions of ViSQOL are also within the scope of the disclosure.

Perceptually-inspired Representation

One of the salient characteristics of ViSQOL is that it analyses the audio quality based on the gammatone spectrogram. The human brain describes the information collected from the ears by visualizing sound as a time-varying distribution of energy in frequency. The important difference between the traditional spectrogram and how sound is actually analysed by the ear is that the ear's frequency sub-bands get wider for higher frequencies, whereas the spectrogram has a constant bandwidth across all frequency channels. Gammatone filters are a popular linear approximation to the filtering performed by the ear. The gammatone-based spectrogram is constructed by first calculating a conventional, fixed-bandwidth spectrogram, and then combining the fine frequency resolution of the fast Fourier transform (FFT)-based spectra into the coarser, smoother gammatone responses via a weighting function. In short, a gammatone-based spectrogram can be considered a more perceptually driven representation than the traditional spectrogram.

ViSQOL has built the same gammatone spectrogram function into its C++ implementation; it constructs the gammatone-based spectrogram with a window size of 80 ms, a hop size of 20 ms and 32 frequency bands, with a minimum frequency of 50 Hz and a maximum frequency equal to half of the sampling rate. In this disclosure, the originally published MATLAB implementation of the gammatone-based spectrogram, with the same parameter settings as used in ViSQOL, was used to generate the experimental input to the systems and methods for determining an indication of an audio quality of an input audio frame, as described herein.
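
For orientation, the following Python sketch approximates the construction described above: a fixed-bandwidth FFT spectrogram with an 80 ms window and a 20 ms hop whose bins are pooled into 32 ERB-spaced bands between 50 Hz and half the sampling rate. The nearest-band pooling is a crude stand-in for the gammatone weighting function and does not reproduce the MATLAB or C++ implementations.

```python
import numpy as np
from scipy.signal import stft

def erb_space(fmin: float, fmax: float, n_bands: int) -> np.ndarray:
    """Centre frequencies equally spaced on the ERB scale (Glasberg-Moore)."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return inv(np.linspace(erb(fmin), erb(fmax), n_bands))

def gammatone_like_spectrogram(x: np.ndarray, fs: int, n_bands: int = 32,
                               fmin: float = 50.0) -> np.ndarray:
    """Crude stand-in for the gammatone spectrogram: an 80 ms / 20 ms-hop FFT
    spectrogram whose bins are pooled into 32 ERB-spaced bands. The true
    construction weights the FFT bins with gammatone filter responses;
    nearest-band pooling is used here only to keep the sketch short."""
    win, hop = int(0.080 * fs), int(0.020 * fs)
    f, _, Z = stft(x, fs=fs, nperseg=win, noverlap=win - hop)
    power = np.abs(Z) ** 2                                  # FFT bins x frames
    centres = erb_space(fmin, fs / 2.0, n_bands)
    band_idx = np.abs(f[:, None] - centres[None, :]).argmin(axis=1)
    spec = np.zeros((n_bands, power.shape[1]))
    for b in range(n_bands):
        spec[b] = power[band_idx == b].sum(axis=0)          # pool bins into band b
    return 10.0 * np.log10(spec + 1e-12)                    # bands x frames, in dB
```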

Deep-Learning Based System for Audio Quality Assessment

Referring to the examples of Figures 1.2 and 1.3, there is provided a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, 100, 200, 300. In this disclosure, a system (model), 100, 200, 300, is proposed that is a full-reference audio quality assessment network with a backbone of adapted Inception blocks and optional Squeeze-and-Excitation (SE) layers, which may take gammatone spectrograms of the reference, 101a, 201a, 301a, and degraded audio, 101b, 201b, 301b, as input and predict quality scores, 104, 208, 308, e.g. MOS-LQO scores, 104, 208, for those degraded excerpts. Hereinafter, the model, 100, 200, 300, is also denoted as InceptionSE model. The similarities and differences between the workflow of ViSQOL v3 and the InceptionSE model are presented by comparing Fig. 1.1 with Figs. 1.2 and 1.3. In the example of Figure 1.2, the workflow of ViSQOL v3 is illustrated as compared to the workflow of the InceptionSE model, 100, in general. In the examples of Figure 1.3, the workflows of the InceptionSE model for the cases of a mono, 200, and a stereo, 300, audio signal as input are compared in more detail, wherein In denotes an Inception block, 203a, 203b, 205, 303a, 303b, 305, SE denotes a squeeze-and-excitation layer, 204, 206, 304, 306, and FCL denotes a fully connected layer, 207, 307. In the case of the stereo model, 300, L denotes left channel, R denotes right channel, M denotes mid-channel and S denotes side-channel. Both workflows will be described in more detail below.

Referring to the examples of Figures 1.2 and 1.3, a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, 100, 200, 300, is illustrated schematically. It is to be noted that in Figure 1.2 the model, 100, includes an optional global alignment block, 102. With both mono and stereo models, it may be assumed that the input reference and degraded signals are time-aligned. If they are not time-aligned, they may be manually aligned. For example, assuming an encoder-decoder delay of a codec is 1600 samples, then the reference and degraded signals may be aligned by chopping off the first 1600 samples from the degraded signal.
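
A minimal sketch of such a manual alignment, assuming the encoder-decoder delay is known in samples, is shown below; the 1600-sample default simply follows the example above.

```python
import numpy as np

def align_by_codec_delay(reference: np.ndarray, degraded: np.ndarray,
                         delay_samples: int = 1600):
    """Illustrative manual alignment for a known encoder-decoder delay:
    drop the first `delay_samples` from the degraded signal and trim both
    signals to a common length."""
    degraded = degraded[delay_samples:]
    n = min(len(reference), len(degraded))
    return reference[:n], degraded[:n]
```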

Referring to the example of Figure 1.3, the system, 200, 300, may comprise at least one inception block, 203a, 203b, 205, 303a, 303b, 305, configured to receive at least one representation of an input audio frame and to map the at least one representation of the input audio frame into a feature map, wherein the at least one inception block, 203a, 203b, 205, 303a, 303b, 305, may comprise a plurality of parallel paths of convolutional layers, wherein at least one parallel path may include a convolutional layer with an m x n sized kernel, wherein the integer m may be different from the integer n.
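
Purely as an illustration, a PyTorch sketch of an inception block with parallel convolutional paths, including kernels elongated along the time axis and along the frequency axis as well as an average-pooling path, is given below. The kernel sizes, channel counts and activation are assumptions of this sketch, not the configuration of the disclosed models.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch: parallel convolutional paths whose outputs are concatenated
    along the channel axis. One path uses a kernel elongated along the time
    axis (probing temporal dependencies), another a kernel elongated along
    the frequency axis (probing timbral dependencies), plus a 1x1 path and
    an average-pooling path."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 4
        # input layout assumed as (batch, channels, freq, time)
        self.p1x1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.temporal = nn.Conv2d(in_ch, branch_ch, kernel_size=(1, 5), padding=(0, 2))
        self.timbral = nn.Conv2d(in_ch, branch_ch, kernel_size=(5, 1), padding=(2, 0))
        self.pool = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        paths = [self.p1x1(x), self.temporal(x), self.timbral(x), self.pool(x)]
        return self.act(torch.cat(paths, dim=1))   # feature map with out_ch channels
```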

The system, 200, 300, may further comprise at least one fully connected layer, 207, 307, configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, 203a, 203b, 205, 303a, 303b, 305. The at least one fully connected layer, 207, 307, may be configured to determine the indication of the audio quality of the input audio frame.

In some embodiments, the indication of the audio quality may comprise at least one of a mean opinion score, MOS, 104, 208, and a multiple stimuli with hidden reference and anchor, MUSHRA, 308.

Referring to the example of Figure 1.4, a method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal, 200, wherein the system comprises at least one inception block, 203a, 203b, 205, and at least one fully connected layer, 207, is illustrated.

In step S101, at least one representation of the input audio frame of the mono audio signal comprising a representation of a clean reference input audio frame of the mono audio signal and a representation of a degraded input audio frame of the mono audio signal is received by the at least one inception block. The at least one inception block may comprise a plurality of parallel paths of convolutional layers, and at least one parallel path may include a convolutional layer with an m x n sized kernel, wherein the integer m may be different from the integer n.

In step S102, the at least one inception block then maps the at least one representation of the input audio frame into a feature map. Steps S101 and S102 may be performed by an inception block as described below with reference to Figure 2.24, for example.

And in step S103, the indication of the audio quality of the input audio frame is predicted by the at least one fully connected layer based on the feature map. Step S103 may be performed by a fully connected layer as described below with reference, for example, to Figure 2.14.

As described above, in an embodiment, the indication of the audio quality may comprise at least one of a mean opinion score, MOS, 104, 208, and a multiple stimuli with hidden reference and anchor, MUSHRA.

In addition, a training dataset for training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame, 100, 200, 300, was built. The training dataset covered, for example, 10 hours of mono music excerpts, 2 hours of mono speech excerpts and 45 minutes of noise and silence excerpts, which were encoded and decoded by High-Efficiency Advanced Audio Coding (HE-AAC) and Advanced Audio Coding (AAC) codecs with bitrates ranging from 16 kbps to 128 kbps. In this example, both AAC and HE-AAC were chosen as codecs in the training, so that both waveform coding (AAC) and parametric coding tools (Spectral Band Replication, a.k.a. SBR, in HE-AAC) were considered. To refrain from massive listening tests and the need to label each audio excerpt manually, MOS-LQO scores predicted by ViSQOL v3 were used as ground truth to train and derive the model, 100, 200, 300.
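
The labeling pipeline can be sketched as follows. The helper functions encode_decode() and visqol_mos_lqo() are hypothetical placeholders standing in for an AAC/HE-AAC coding chain and for ViSQOL v3, respectively; they do not correspond to any real tool's API. The bitrate list mirrors the bitrates shown in Figs. 4.5 to 4.13.

```python
from pathlib import Path

# Hypothetical helpers (to be implemented around the chosen codec binaries and
# the ViSQOL v3 executable); the names are placeholders for illustration only.
BITRATES_KBPS = [16, 20, 24, 32, 48, 64, 96, 128]

def build_training_pairs(reference_files: list, codec: str):
    """Sketch of the labeling pipeline: each reference excerpt is coded at several
    bitrates and the ViSQOL v3 MOS-LQO prediction is used as the ground-truth label,
    standing in for listening-test scores."""
    dataset = []
    for ref in reference_files:
        for bitrate in BITRATES_KBPS:
            degraded = encode_decode(ref, codec=codec, bitrate_kbps=bitrate)  # hypothetical
            label = visqol_mos_lqo(ref, degraded)                              # hypothetical
            dataset.append({"reference": Path(ref), "degraded": degraded,
                            "codec": codec, "bitrate": bitrate, "mos_lqo": label})
    return dataset
```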

Referring to the example of Figure 1.5, a method of training a computer-implemented deep-learning-based system, 100, 200, for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one inception block, 203a, 203b, 205, and at least one fully connected layer, 207, is illustrated.

In step S201, at least one representation of an input audio frame of an audio training signal comprising a representation of a clean reference input audio frame of the audio training signal and a representation of a degraded input audio frame of the audio training signal is received by the at least one inception block.

In step S202, the at least one inception block maps the at least one representation of the input audio frame of the audio training signal into a feature map. Steps S201 and S202 may be performed by an inception block as described below with reference to Figure 2.24, for example.

In step S203, the indication of the audio quality of the input audio frame of the audio training signal is predicted by the at least one fully connected layer based on the feature map. Step S203 may be performed by a fully connected layer as described below with reference, for example, to Figure 2.14.

And in step S204, one or more parameters of the computer-implemented deep-learning-based system are then tuned based on a comparison of the predicted indication of the audio quality and an actual indication of the audio quality.

As described further below with reference to equation 2.11, in an embodiment, the comparison of the predicted indication of the audio quality and the actual indication of the audio quality may be based on a smooth L1 loss function.
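
A minimal PyTorch sketch of one training pass implementing steps S201 to S204 with a smooth L1 loss might look as follows; the data loader, model and optimizer are assumed to be set up elsewhere.

```python
import torch
import torch.nn as nn

def train_epoch(model: nn.Module, loader, optimizer, device: str = "cpu") -> None:
    """Sketch of one training epoch: predict the quality indication from the
    two-channel (reference + degraded) gammatone input and tune the parameters
    against the ViSQOL-derived label with a smooth L1 loss (steps S201-S204)."""
    criterion = nn.SmoothL1Loss()
    model.train()
    for spectrograms, target_score in loader:          # spectrograms: (batch, 2, freq, time)
        spectrograms = spectrograms.to(device)
        target_score = target_score.to(device)
        prediction = model(spectrograms).squeeze(-1)   # predicted quality score per excerpt
        loss = criterion(prediction, target_score)     # compare predicted vs. actual indication
        optimizer.zero_grad()
        loss.backward()                                # tune parameters (S204)
        optimizer.step()
```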

In an embodiment, the at least one representation of the input audio frame (received by the at least one inception block) may correspond to a gammatone spectrogram, 103, 202, 302, with a first axis representing time and a second axis representing frequency.

The InceptionSE model, 100, 200, 300, takes in, for example, two channels of frequency x time-sized gammatone spectrograms, 103, 202, 302, where each channel represents the gammatone spectrogram of the clean reference, 101a, 201a, 301a, and degraded, 101b, 201b, 301b, signals, respectively. After several layers of adapted Inception blocks, 203a, 203b, 205, 303a, 303b, 305, and optional SE blocks (SE layers), 204, 206, 304, 306, the feature maps are flattened and fed to three fully connected layers, 207, 307, to project the features to a continuous MOS-LQO score, 104, 208, between 1 and 5, or a MUSHRA score, 308. These predicted MOS scores (or MUSHRA scores) may be compared with ViSQOL and finally evaluated on USAC verification listening tests in order to calibrate the performance of the InceptionSE model on listening tests.
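
For illustration, the overall arrangement can be sketched as follows, reusing the InceptionBlock and SqueezeExcitation sketches from above. The channel counts, the interleaved max pooling, the hidden sizes of the three fully connected layers, and the assumed input resolution (32 gammatone bands by 300 frames) are all illustrative assumptions rather than the actual model configuration.

```python
import torch
import torch.nn as nn

class InceptionSE(nn.Module):
    """Sketch of the mono InceptionSE arrangement: alternating inception blocks
    and SE layers, followed by three fully connected layers projecting the
    flattened features to a single continuous quality score."""
    def __init__(self, in_channels: int = 2, bands: int = 32, frames: int = 300):
        super().__init__()
        self.backbone = nn.Sequential(
            InceptionBlock(in_channels, 32), SqueezeExcitation(32), nn.MaxPool2d(2),
            InceptionBlock(32, 64), SqueezeExcitation(64), nn.MaxPool2d(2),
        )
        feat = 64 * (bands // 4) * (frames // 4)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),                  # continuous MOS-LQO / MUSHRA-type score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, bands, frames) - reference and degraded gammatone spectrograms
        return self.head(self.backbone(x))
```

As a usage sketch, a batch of shape (batch, 2, 32, 300) would yield one predicted score per excerpt.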

Referring again to the examples of Figs. 1.2 and 1.4, in an embodiment, the input audio frame may be derived from a mono audio signal, wherein the at least one representation of the input audio frame may comprise a representation of a clean reference input audio frame, 201a, and a representation of a degraded input audio frame, 201b.

In a nutshell, the InceptionSE model has achieved a comparable performance on the training dataset summarized in Table 4.1 below, on the test dataset "Set of Critical Excerpts for Codecs" also described further below, and on USAC verification listening tests including both mono and stereo audio. Moreover, it can be easily adapted to non-intrusive methods as well by simply removing the reference from the input and re-training the model. As will be described below, the model can be adapted to applications in stereo and multi-channel signals.

Referring to the example of Fig. 1.3, in an embodiment, the input audio frame may be derived from a stereo audio signal comprising a left channel and a right channel, wherein the at least one representation of the input audio frame may comprise for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame, 301a, and a representation of a degraded input audio frame, 301b, the middle and side channel corresponding to a sum and a difference of the left and right channels.

That is:

M = (L + R) / 2 and S = (L - R) / 2,

with M = middle channel; L = left channel; R = right channel; S = side channel. By performing ablation studies, it was found that stereo audio quality prediction accuracy improves with the inclusion of mid- and side-channels (of reference and degraded signals). The inclusion of the side-channel improves the prediction accuracy towards lower bitrates. To save complexity, it was found that it is sufficient to exclude the mid-channel (without significantly degrading the prediction accuracy).
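
A direct implementation of these definitions is straightforward:

```python
import numpy as np

def mid_side(left: np.ndarray, right: np.ndarray):
    """Mid and side channels as defined above: M = (L + R) / 2, S = (L - R) / 2."""
    return (left + right) / 2.0, (left - right) / 2.0
```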

Referring further to the example of Figure 1.6, a method of operating a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio signal, 300, wherein the system comprises at least one inception block, 303a, 303b, 305, and at least one fully connected layer, 307, is illustrated.

In step S301, at least one representation of the input audio frame, comprising for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels, is received by the at least one inception block. In an embodiment, the at least one inception block may comprise a plurality of parallel paths of convolutional layers, and at least one parallel path may include a convolutional layer with an m x n sized kernel, wherein the integer m may be different from the integer n. In a further embodiment, the method may further include, prior to receiving the at least one representation of the input audio frame, receiving one or more weight coefficients of at least one inception block that had been obtained for a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio signal, and initializing one or more weight coefficients of the at least one inception block based on said received one or more weight coefficients. In this way, the stereo use case may be conveniently implemented based on a model for determining an indication of an audio quality of an input audio frame of a mono audio signal. In other words, the weights from the at least one inception block of the mono model can be re-used in the stereo model. This concept of transfer learning is further described below.

In step S302, the at least one inception block then maps the at least one representation of the input audio frame into feature maps. Steps S301 and S302 may be performed by an inception block as described below with reference to Figure 2.24, for example.

And in step S303, the indication of the audio quality of the input audio frame is predicted by the at least one fully connected layer based on the feature maps. Step S303 may be performed by a fully connected layer as described below with reference, for example, to Figure 2.14.

Also in this case, in an embodiment, the indication of the audio quality may comprise at least one of a mean opinion score, MOS, and a multiple stimuli with hidden reference and anchor, MUSHRA, 308.

In addition, referring to the example of Figure 1.7, a method of training a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a stereo audio training signal, 300, wherein the system comprises at least one inception block, 303a, 303b, 305, and at least one fully connected layer, 307, is illustrated.

In step S401, one or more weight coefficients of the at least one inception block are initialized based on one or more weight coefficients that had been obtained for at least one inception block of a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio training signal. Step S401 may thus follow the concept of transfer learning as described further below.

In step S402, at least one representation of an input audio frame of a stereo audio training signal comprising for each of a middle channel, a side channel, a left channel and a right channel a representation of a clean reference input audio frame and a representation of a degraded input audio frame, the middle and side channel corresponding to a sum and a difference of the left and right channels, is received by the at least one inception block.

In step S403, the at least one representation of the input audio frame of the stereo audio training signal is mapped by the at least one inception block into feature maps. Steps S402 and S403 may be performed by an inception block as described below with reference to Figure 2.24, for example.

In step S404, the indication of the audio quality of the input audio frame of the stereo audio training signal is predicted by the at least one fully connected layer based on the feature maps. Step S404 may be performed by a fully connected layer as described below with reference, for example, to Figure 2.14.

And in step S405, one or more parameters of the computer-implemented deep-learning-based system are tuned based on a comparison of the predicted indication of the audio quality and an actual indication of the audio quality.

Also in this case, as described further below with reference to equation 2.11, in an embodiment, the comparison of the predicted indication of the audio quality and the actual indication of the audio quality may be based on a smooth L1 loss function.

In general, deep learning is based on artificial neural networks and is part of a broader family of machine learning methods. Machine learning is the study of computer algorithms which improve automatically through experience and build a mathematical model based on sample data. Artificial neural networks (ANNs) mimic the information processing and distributed communication nodes of biological systems. Deep learning algorithms use multiple layers to progressively extract higher-level features from the raw input. Layers in deep learning architectures such as the convolutional layer, pooling layer, batch normalisation layer, fully connected layer, dropout layer and activation layer will be described in more detail in the following. Advanced layers and modules including the attention mechanism, the long short-term memory (LSTM) layer, the squeeze-and-excitation (SE) layer and the Inception module (block) will be described as well.

Deep learning architecture

Convolutional Layer

Convolution is a mathematical operation that expresses how the shape of one function (f) is modified by another (g). In convolutional layers, the convolution operation works to reduce the data dimensions while preserving the distinctive information. The parameters of convolutional layers consist of a set of filters (or kernels). 1-dimensional (1D) kernels may be used in tasks such as audio processing. The design of the kernels depends on the input size. In audio processing tasks, waveform signals can be seen as a 1-dimensional matrix along the temporal axis, and a kernel (a vector) is therefore 1-dimensional, moving along the time axis of the audio signal. In this disclosure, for example, gammatone spectrograms as input into the CNN classifier (convolutional neural network, inception block) are 2-dimensional matrices. 2D kernels move along the frequency- and temporal-axis of the gammatone spectrograms.

A convolutional layer (Conv) may be the core of a CNN. The elements of a convolutional layer are an input, 1001, filters (kernels), 1002, and an output, 1003, as illustrated in the example of Figs. 2.1-2.4. A convolution operation in Conv layers can be interpreted as kernels looking at a small region of the input. This small region has the same size as the kernels and is also referred to as the receptive field. Each kernel generates a corresponding feature map, and therefore n different kernels are able to extract n different features and build n-dimensional feature maps as the output of this Conv layer.

In this disclosure, the inputs may also be 3-dimensional. In addition to the temporal- and frequency-axis, there may be 2 channels along the z-axis, which represent the gammatone spectrograms of the reference and degraded signals. In this example, the size of the input W_i x H_i is 7 x 7, and the number of input channels C_i is 2. The kernel size F is 3. The number of kernels K is 3. The output size W_o x H_o can be calculated according to Equation 2.1, where P is the number of zero paddings and S is the stride. The number of output channels C_o is equal to K. The parameters in the filters are the weights and the bias. The bias is not mandatory and can therefore be set to 0. The output at one specific position is the sum of the element-wise multiplication of the weights and the inputs in the receptive field, plus the bias. Kernels slide along the x- and y-axis of the input with a stride S, and the value V_i at one specific position in the output can be calculated as:

V_i = Σ X_i · W + b, (2.2)

where W and b are the weights and bias of the kernels, respectively, and X_i is the local input in the receptive field.

An example is illustrated in Figs. 2.1-2.4. It is worth noting that each kernel always has the same number of channels as its corresponding input. The input size may be 7 x 7 x 2, in which 2 denotes the number of channels, and the kernel of size 3 x 3 x 2 has 2 channels as well. The size of the output is calculated according to Equation 2.1 with P = 0 and S = 2. The kernel moves along the x- and y-axis of the input from the upper left to the bottom right. According to the values given in Fig. 2.2, the output at the corresponding position is calculated as follows:

With the element-wise products taken from the values shown in Fig. 2.2, the convolution of the first kernel channel yields O_c1 = 6 and the convolution of the second kernel channel yields O_c2 = 0, so that the output at the upper-left position is

O_1 = O_c1 + O_c2 + bias = 6 + 0 + (-1) = 5,

where O_c1 and O_c2 denote the convolution results of the first and second channels depicted in diagonal and horizontal hatchings, respectively, and O_1 is the output at the upper-left position convolved with the first kernel. The kernel keeps moving along the x-axis direction with a step size of 2, as shown in Fig. 2.3 and Fig. 2.4, and the same calculation procedure as in Fig. 2.1 and Fig. 2.2 is repeated.
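For illustration only, the following sketch reproduces the setting of this example with a standard deep learning framework (PyTorch is assumed here and is not mandated by the present disclosure): a 7 x 7 input with 2 channels is convolved with three 3 x 3 kernels using stride 2 and no padding, and the resulting output size follows the relation referenced as Equation 2.1.

import torch
import torch.nn as nn

# Two input channels (reference and degraded), three 3 x 3 kernels, stride 2, no padding.
conv = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3, stride=2, padding=0)

x = torch.randn(1, 2, 7, 7)   # (batch, channels, height, width)
y = conv(x)

# Output spatial size: (W_i - F + 2P) / S + 1 = (7 - 3 + 0) / 2 + 1 = 3
print(y.shape)                # torch.Size([1, 3, 3, 3])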

This convolutional computation through the input is repeated by all the kernels, and each kernel generates a feature map, which builds a channel in the output. As in Fig. 2.5, the differently illustrated layers in the output, 1003, are the feature maps extracted by the corresponding kernels, 1002, and therefore the number of channels in the output, 1003, is equal to the number of kernels, 1002.

Rectangular kernels

A spectrogram is a “visual” representation of the spectrum of frequencies of a signal as it varies with time. The x- (horizontal) and y- (vertical) axis of a spectrogram represent temporal resolution and frequency bands, respectively. Therefore, wider (horizontal) kernels, 1005, may be capable of probing (and in consequence, e.g., learning) longer temporal dependencies in the audio domain, while higher (vertical) kernels, 1008, may be capable of probing (and in consequence, e.g., learning) more spread timbral features. Horizontal kernels probing longer temporal dependencies in the audio domain may thus be said to refer to the horizontal kernels being sensitive towards features that extend along the temporal axis of the spectrogram, while vertical kernels may be said to be sensitive towards features that extend along the frequency axis of the spectrogram. In other words, horizontal kernels learn/scan/look at longer temporal dependencies in the audio domain, while vertical kernels learn/scan/look at the more spread timbral features. In yet other words, horizontal kernels may be capable of detecting (and mapping to respective feature maps) patterns extending along the horizontal axis, whereas vertical kernels may be capable of detecting (and mapping to respective feature maps) patterns along the vertical axis. Needless to say, it is understood that the above assignment of time and frequency to horizontal (e.g., x) and vertical (e.g., y) axes is a mere example and that other assignments may in principle be chosen as well.

Referring to the examples of Figures 2.6 and 2.7, in an embodiment, the plurality of parallel paths of convolutional layers (in the at least one inception block) may include at least one convolution layer with a horizontal kernel, 1005, and at least one convolutional layer with a vertical kernel, 1008.

In an embodiment, the horizontal kernel, 1005, may be an m x n sized kernel with m > n, so that the horizontal kernel, 1005, may be configured to probe temporal dependencies of the input audio frame.

In a further embodiment, the vertical kernel, 1008, may be an m x n sized kernel with m < n, so that the vertical kernel, 1008, may be configured to probe timbral dependencies of the input audio frame. Possible properties of such m x n sized kernels are set out in the examples below.

Rectangular kernels (m x n kernels) are capable of learning time and frequency features at the same time. This kind of kernel is typically used in the music technology literature. Such kernels are able to extract different musical features depending on the scale of m and n. For example, a bass or a kick could be well analyzed with a small kernel, which represents a sub-band for a short time. Sources such as cymbals or snare drums, which have a broad frequency band with a fixed decay time, could be learned by vertical kernels with a wider span over frequency bands. Even though a bass or a kick could also be modeled with this kind of kernel, it would not be an optimal choice for the following reasons. A kernel per note could best characterize the timbre for the whole pitch range of an instrument. Larger kernels will lead to a less efficient representation, because most of the weights would be zero, wasting the representational power of the CNN kernels.

• temporal kernels (1 x n) learn relevant rhythmic/tempo patterns within the analyzed bin.

• frequency kernels (n x 1) learn timbre or equalization setups.
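By way of example, a minimal sketch of such rectangular kernels is given below (PyTorch is assumed; the kernel sizes, channel counts and the spectrogram shape are illustrative only):

import torch
import torch.nn as nn

# Rectangular kernels on a (channels, frequency, time) spectrogram.
temporal_conv = nn.Conv2d(2, 16, kernel_size=(1, 7), padding=(0, 3))   # 1 x n: spans time
frequency_conv = nn.Conv2d(2, 16, kernel_size=(7, 1), padding=(3, 0))  # n x 1: spans frequency

spec = torch.randn(1, 2, 32, 360)   # e.g. reference and degraded gammatone spectrograms
t_features = temporal_conv(spec)    # sensitive to rhythmic/tempo patterns along time
f_features = frequency_conv(spec)   # sensitive to timbral patterns across frequency bands
print(t_features.shape, f_features.shape)   # both: torch.Size([1, 16, 32, 360])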

Activation Layer

Activation layers introduce non-linearity into the neural network. In biologically inspired neural networks, the activation function represents the rate of action potential firing in the cell. In its simplest form, the activation function is binary: firing or not. Mathematical functions with a positive slope may be considered as activation functions. However, functions of the form f(x) = ax are linear and not capable of decision making. Therefore, activation functions are designed to be non-linear. Examples of activation functions are the rectified linear unit (ReLU), the sigmoid function, the hyperbolic tangent function (tanh) and the softmax function; their equations can be found in equation group 2.3, and the ReLU, sigmoid and tanh functions are plotted in Figs. 2.8-2.10.

f(x) = max(0, x), (2.3a)

ReLU has replaced the sigmoid and tanh functions and has become the mainstream choice in recent years. The biggest advantage of ReLU is the non-saturation of its gradient, which greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid and tanh functions. Moreover, ReLU also introduces some sparsity effect in the network. Another useful property of ReLU is that it avoids expensive operations such as the exponentials in the sigmoid and tanh functions; ReLU can be implemented by simply thresholding a matrix at zero. The softmax function maps its inputs into the interval (0, 1) such that the outputs sum to 1. The softmax function is often used as the last activation function of a neural network to normalize the output of the network to a probability distribution over predicted classes.
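For illustration, the activation functions mentioned above can be evaluated as follows (a PyTorch-like framework is assumed; the input values are arbitrary):

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu = torch.relu(x)              # max(0, x)
sig = torch.sigmoid(x)            # 1 / (1 + exp(-x)), saturates for large |x|
tanh = torch.tanh(x)              # (exp(x) - exp(-x)) / (exp(x) + exp(-x))
probs = torch.softmax(x, dim=0)   # non-negative values that sum to 1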

Pooling Layer

Referring to the examples of Figures 2.11 and 2.12, in an embodiment, the at least one inception block may further comprise a path with a pooling layer, 2002, 2003, 2005, 2006. The pooling layer may comprise an average pooling, 2003, 2006. Possible properties of the pooling layer are set out in the examples below.

Convolutional layers are sensitive to the location of features in the input, and one approach to address this sensitivity is to down-sample the feature maps. This can improve the robustness of the network to local translations and help in extracting high-level patterns from the input. Other advantages of pooling layers are that they help to reduce the spatial dimension of the feature maps, increase the computational efficiency and prevent overfitting. A pooling layer may be added after a convolutional layer and an activation layer to summarize the presence of features in the feature maps. A pooled feature map is generated much like a sliding window applied to the feature maps. Max pooling, average pooling, global max pooling (GMP) and global average pooling (GAP) are the four most widely used pooling operations in neural networks. As the names convey, max pooling shrinks the region of the feature map within the sliding window to the maximal value in that region, and average pooling shrinks it to the average value. Global max pooling and global average pooling calculate the maximum or mean of the entire input rather than of a local neighbourhood. The window (kernel) size in pooling layers may be 2 x 2 with a stride of 2 pixels. The output size can be calculated by the same equation as in convolutional layers without taking padding into consideration, and the equations are shown in 2.4.

Examples of 2D max and average pooling are shown in Figs. 2.11 and 2.12. The difference between overlapping and non-overlapping pooling is whether the stride step is smaller than the window (kernel) size or not. In these examples, non-overlapping pooling uses a stride of 2 and a window (kernel) size of 2 x 2, while overlapping pooling uses a stride of 2 and a window (kernel) size of 3 x 3. Different hatchings denote stride steps and corresponding results.

The pooling regions are disjoint in non-overlapping pooling, and more spatial information is lost in each pooling layer. Even though pooling operations help to improve robustness against spatial aliasing, they can at the same time be detrimental if abused, because the network will focus on a few dominant features, which could lead to overfitting in some high-capacity and deep models. It is also worth noting that max pooling tends to focus on the brighter pixels, while average pooling smooths out the input and cannot detect sharp features.

Once the pooled feature map is obtained, it is then transformed by a flattening layer into a single column. The flattened feature map is passed to fully connected (FC) layers.
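A minimal sketch of the pooling and flattening steps described above is given below (PyTorch is assumed; the feature map sizes are illustrative only):

import torch
import torch.nn as nn

feature_maps = torch.randn(1, 4, 6, 6)              # (batch, channels, height, width)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)    # non-overlapping: window 2 x 2, stride 2
avg_pool = nn.AvgPool2d(kernel_size=3, stride=2)    # overlapping: window 3 x 3, stride 2

pooled = max_pool(feature_maps)                     # (1, 4, 3, 3)
overlapped = avg_pool(feature_maps)                 # (1, 4, 2, 2)
flattened = torch.flatten(pooled, start_dim=1)      # (1, 36), ready for fully connected layers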

Fully Connected Layer

Referring to the examples of Figure 1.3, the systems, 200, 300, as described herein may comprise at least one fully connected layer, 207, 307, configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, 203a, 203b, 205, 303a, 303b, 305. The at least one fully connected layer, 207, 307, may be configured to determine the indication of the audio quality of the input audio frame. In an embodiment, the at least one fully connected layer, 207, 307, may comprise a feed forward neural network. Possible properties of the at least one fully connected layer, 207, 307, are set out in the examples below.

Fully connected (FC) layers may be simple feed forward neural networks. It is worth noting that the only difference between fully connected layers and convolutional layers is that neurons in convolutional layers have only local connections to the input and several neurons in a convolutional layer share parameters. A fully connected layer may be said to be a function mapping from R^n to R^m with the following equation:

O = W · I, (2.5)

where W denotes the weights matrix including the bias. The number of learnable parameters in a fully connected layer is equal to the size of the weights matrix, which is the number of input nodes plus an extra bias, multiplied by the number of output nodes. Each neuron in a fully connected layer has a full connection to every neuron in the following layer. Pictorially, a fully connected layer is represented as in Fig. 2.13. Fully connected layers may be added at the end of a CNN. They tune the weight parameters to create a stochastic likelihood representation of each class for classification tasks; the output layer therefore has the same number of neurons as the number of predicted classes. Regression tasks, such as in this disclosure, predict a continuous value. A single neuron is constructed in the output layer, and the weight matrix of the fully connected layers can be seen as the polynomial matrix of the targeted regression line.
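As a sketch only, a fully connected regression head of the kind described above may look as follows (PyTorch is assumed; the layer dimensions loosely follow the tables further below and are purely illustrative):

import torch
import torch.nn as nn

fc_head = nn.Sequential(
    nn.Linear(512 * 1 * 34, 1536), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1536, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 1),                        # single output neuron for the continuous score
)

pooled = torch.randn(8, 512, 1, 34)           # batch of pooled feature maps
score = fc_head(torch.flatten(pooled, 1))     # shape (8, 1)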

Batch Normalization

An internal covariate shift occurs when the input distribution to the network changes. The layers in a CNN learn to adapt to the new distribution and thus it slows down the training progress and convergence speed to a global minimum. Batch normalization (also known as batch norm) is a technique for training deep neural networks that standardizes the inputs to a layer by re-scaling and re-centering. Furthermore, batch normalization also has a more fundamental impact on the training progress. It makes the optimization landscape significantly smoother, which induces a more predictive and stable behavior of the gradients, allowing for faster training.

Another advantage of batch normalization is that it improves the robustness of the neural network against ill-conditioned parameters initialization. As the network goes deeper, it becomes more sensitive to the initial random weights and configuration of the learning algorithm.
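Purely as an illustration, a batch normalization layer can be attached to a convolutional output as follows (PyTorch is assumed; the sizes are illustrative):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=64)   # one learnable scale/shift pair per channel

x = torch.randn(8, 64, 15, 179)        # a batch of 64-channel feature maps
y = bn(x)                              # per-channel standardization, then re-scaling and re-centering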

Dropout

Overfitting is a modeling error that occurs when a function fits too closely to a limited set of data. Regularization reduces overfitting and fits approximately on the given training dataset by adding a penalty to the loss function. Dropout refers to randomly dropping or ignoring a certain set of neurons during training. These neurons are not taken into consideration during a forward or backward pass. Dropout offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and improve the generalization error (out-of-sample error).

Dropout may be applied to most types of layers, such as convolutional layers, fully connected layers and recurrent layers. It introduces a new hyperparameter, the probability (p), which determines how many nodes in this layer are retained or dropped out during the training procedure. A common choice for p is between 0.5 and 0.8. For example, p = 0.8 means that 80 percent of the nodes are retained and 20 percent of the nodes are dropped out. Dropout is deactivated during prediction/inference. The procedure of dropout is illustrated in Fig. 2.16. Compared to the full connection between all nodes in Fig. 2.15, the second node, h_2, is deactivated in Fig. 2.16 and its connection to the next layer is cut after applying dropout, so that its parameters will not be updated during the current training stage.

Dropout has proven to work more effectively in practice than other regularization methods, such as weight decay, filter norm constraints and sparse activity regularization.
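The sketch below illustrates the retention convention described above (PyTorch is assumed; note that nn.Dropout takes the probability of dropping a node, so retaining 80 percent corresponds to a drop probability of 0.2):

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.2)    # drop 20% of activations, i.e. retain 80%

x = torch.ones(4, 10)
train_out = dropout(x)         # some entries zeroed, the rest scaled by 1 / (1 - 0.2)

dropout.eval()                 # dropout is deactivated during prediction/inference
eval_out = dropout(x)          # identical to x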

Advanced Layers

In the following, four advanced modules and structures are described; they build the backbones of the experimented and proposed models described herein. First, a classic recurrent neural network (RNN), the long short-term memory (LSTM) network, is introduced. Then, the attention mechanism is described, together with its extensions: self-attention and the squeeze-and-excitation (SE) network. In the last part, the Inception module and its variation used in the proposed model are described.

Long Short-Term Memory Network

Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. The motivation for the RNN structure is intuitive: human thinking is coherent, and the understanding of the current state builds on the previous one. Traditional neural networks cannot preserve previous states. RNNs address this issue by persisting previous information in loops. A loop allows information to be passed from one time step of the network to the next. A recurrent neural network is unrolled in Fig. 2.17.

The chain-like structure of RNNs reveals that they are ideally suited for handling sequences and time series. Another advantage of RNNs is that they take no predetermined, limited size of input, while the model size does not increase with the size of the input. In the last decade, RNNs have been successfully applied to a variety of problems such as speech recognition and translation. However, the drawbacks of RNNs are also impossible to neglect. The computation of RNNs is slow, and RNNs poorly retain information from long ago owing to the so-called "vanishing gradient problem". The vanishing gradient problem refers to the phenomenon that the gradient shrinks as it back-propagates through time. If a gradient value becomes extremely small, the RNN will forget the early information in a long sequence and thus has a short-term memory.

The long short-term memory (LSTM) network was created as a solution to this short-term memory. It has internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence are important to keep or throw away. The internal design of an LSTM unit is shown in Fig. 2.18.

The dotted box, 3001, marks the forget gate. This gate decides what information should be thrown away or kept. Information from the previous hidden state and from the current input is passed through the sigmoid (σ) function. Values come out between 0 (forget) and 1 (remember). The short-dashed box, 3002, marks the input gate, which updates the new input in every time step. The dash-dotted box, 3004, marks the cell state, and the long-dashed box, 3003, around the output gate decides what information the next hidden state should carry. Different applications are summed up in Tables 2.1-2.2:

Table 2.1: Applications of RNNs

Table 2.2: Applications of RNNs

Gammatone spectrograms are a visual representation of audio signals and inherit the time dimension of audio. The inputs in this disclosure are 8 seconds long gammatone spectrograms with a temporal resolution of 20 milliseconds, and the output is a single continuous score between 1 and 5. Therefore, the estimation is formulated as a regression problem, and the solution fits the many-to-one RNN prototype.
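Purely as an illustration of this many-to-one formulation, a minimal LSTM regressor over spectrogram frames might look as follows (PyTorch is assumed; all dimensions and names are illustrative):

import torch
import torch.nn as nn

class ManyToOneLSTM(nn.Module):
    """Many-to-one sketch: a sequence of spectrogram frames in, one continuous score out."""
    def __init__(self, n_bands=32, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_bands, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames):              # frames: (batch, time_steps, n_bands)
        _, (h_n, _) = self.lstm(frames)     # h_n: last hidden state per sequence
        return self.head(h_n[-1])           # single score per input sequence

model = ManyToOneLSTM()
frames = torch.randn(4, 400, 32)            # 8 s at 20 ms resolution -> 400 frames, 32 bands
score = model(frames)                        # shape (4, 1)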

Attention Mechanism

Attention is motivated by the psychological attention, which is the cognitive process of selectively concentrating on one or a few things while ignoring others. Attention mechanism allows models to focus on a certain region or component, which has greater importance for decision making, and skip the rest. In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: features with higher weights contribute more to the decision-making process and vice versa.

The attention mechanism was first introduced to help memorize long source sentences in neural machine translation (NMT). The Seq2Seq model in the field of language modeling aims to transform an input sequence (source) into a new one (target), and both sequences can be of arbitrary length. The Seq2Seq model is usually composed of an encoder and a decoder, which are both based on RNN architectures, such as LSTM units. A critical and apparent disadvantage of such encoders and decoders is the fixed length of the context vector and the resulting inability to remember long sentences. Although LSTM is supposed to capture long-range dependencies better than a plain RNN, it tends to become forgetful in specific cases. The attention mechanism was created to resolve this problem.

With the assistance of the attention mechanism, the dependencies between source and target sequences are not restricted by the in-between distance anymore. In broad terms, Attention may be one component of a network's architecture, and may be in charge of managing and quantifying the interdependence. General attention manages the interdependence between inputs and outputs, while self-attention works within the input elements.

Compared to previous architectures, the main advantages of self-attention are the ability of parallel computation (compared to RNNs) and no need for deep networks when handling long sequences (compared to CNNs). The basic structure of self-attention is illustrated by a simple example of a sentence in Figs. 2.19 and 2.20.

Every input word is first embedded into a feature vector a_i. The concepts of query, key and value are introduced into the computation of an attention matrix, which can be calculated based on those feature vectors as in Equation 2.6. The key/value/query concepts originate from retrieval systems. The attention operation turns out to be a retrieval process as well, so the key/value/query concepts are applied in order to help build an interdependence matrix within the inputs. The corresponding weights W^q, W^k and W^v are the targets to be trained.

q_i = W^q · a_i, k_i = W^k · a_i, v_i = W^v · a_i. (2.6)

An attention matrix A is generated as the inner product of the query of the ith input and the key of the jth input according to Equation 2.7:

A_{i,j} = (q_i · k_j) / √d, (2.7)

where d represents the dimension of query and key. Softmax functions are then applied to the attention matrix A row-wise in order to re-scale the weights between 0 and 1 and to ensure that the weights sum to 1. The output b_i of the whole self-attention module is the sum of the attention weights multiplied by the values, as in Equation 2.8:

b_i = Σ_j â_{i,j} · v_j, (2.8)

where â_{i,j} denotes the re-scaled attention weight between the ith and jth input.

Self-attention layers may be a potential solution to deal with spectrograms. Self-attention can be applied along either temporal axis or frequency axis to build attention matrices between time steps or frequency bands.
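As a sketch of Equations 2.6-2.8, a bare-bones self-attention computation is given below (PyTorch is assumed; the scaling by √d follows common practice, and all dimensions are illustrative):

import torch

def self_attention(a, W_q, W_k, W_v):
    """Minimal self-attention over a sequence of feature vectors a of shape (seq_len, d_in)."""
    q, k, v = a @ W_q, a @ W_k, a @ W_v                 # queries, keys, values (Equation 2.6)
    d = q.shape[-1]
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)    # row-normalized attention matrix
    return attn @ v                                     # weighted sum of values (Equation 2.8)

seq_len, d_in, d_model = 10, 16, 8
a = torch.randn(seq_len, d_in)
W_q, W_k, W_v = (torch.randn(d_in, d_model) for _ in range(3))
b = self_attention(a, W_q, W_k, W_v)                    # (seq_len, d_model)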

Squeeze-and-Excitation Network

Referring again to the examples of Figure 1.3, in an embodiment, the system may further comprise at least one squeeze-and-excitation, SE, layer, 204, 206, 304, 306. The squeeze-and-excitation layer, 204, 206, 304, 306, may follow a last convolutional layer of the plurality of parallel paths of convolutional layers of the at least one inception block, 203a, 203b, 205, 303a, 303b, 305. The squeeze-and-excitation layer, 204, 206, 304, 306, may comprise a convolutional layer, two fully connected layers and sigmoid activation. In the squeeze-and-excitation layer, 204, 206, 304, 306, the convolutional layer followed by a scaling operation with the two fully connected layers may generate a respective attention weight for each channel of the feature map output by the at least one inception block, 203a, 203b, 205, 303a, 303b, 305, and may apply said attention weights to the channels of the feature map and may perform concatenation of the weighted channels.

In an embodiment, the system may comprise two or more inception blocks, 203a, 203b, 205, 303a, 303b, 305, and two or more squeeze-and-excitation layers, 204, 206, 304, 306, and the inception blocks, 203a, 203b, 205, 303a, 303b, 305, and the squeeze-and-excitation layers, 204, 206, 304, 306, may be alternately arranged.

Possible features of Squeeze and excitation layers (networks) are set out in the examples below.

Squeeze-and-Excitation Networks (SENets) introduce a building block for, for example, CNNs that may improve channel interdependencies with negligible computational overhead and may be added to any baseline architectures. During the procedure of the feature extraction in CNN, channels are those feature maps extracted by different kernels and stacked up along z-axis.

In CNNs, the convolutional kernels are responsible for constructing the feature maps based on the learned weights within those kernels. Kernels are able to learn features such as edges, corners and texture. Collectively they learn different feature representations of the target class, and the number of channels therefore denotes the number of convolutional kernels. However, these feature maps also have different magnitudes of importance, which means that some feature maps are more helpful than others in the targeted task. Thus, those feature maps should gain more attention than others by re-scaling the important channels with a higher weight. This is exactly what Squeeze-and-Excitation Networks propose.

The Squeeze-and-Excitation block (SE-block) includes three core operations (shown in Fig. 2.21), which are squeeze, 4002, excite, 4003, and scale, 4004. The feature map set (C, H, W), 4001, is essentially the output tensor from the previous convolutional layer. The initials represent channels, height and width, respectively. Each feature map therefore has to operate on H x W data, and the need to decompose the information of each feature map into a single value is crucial in reducing the computational complexity of the whole operation. This is the so-called "squeeze" procedure. The method used for the reduction of the spatial size in the SE-block is global average pooling (GAP), and the output tensor is transformed into C x 1 x 1 after the GAP operation.

The excitation operation follows the squeeze operation and consists of 2 fully connected layers and a sigmoid layer to learn the adaptive scaling weights for these channels. The first fully connected layer reduces the "squeezed" tensor by a reduction factor r, and the second fully connected layer projects it back to the dimension C x 1 x 1. The sigmoid helps to scale the "excited" tensor between 0 and 1. Different shadings in Fig. 2.21 represent channels with different attention weights learnt from the excitation procedure. Subsequently these weights are applied directly to the input by a simple broadcast element-wise multiplication and re-scaled back to the same dimension as the input, C x H x W.

The standard SE block may be applied directly after the final convolutional layer of the architecture. There are several other integration strategies of SE. For example, in a residual network, the SE block may be plugged in after the final convolutional layer, which is prior to the addition of the residual in the skip connection. In an inception network, the SE blocks may be inserted in every inception block after the final convolutional layer.
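A minimal sketch of the squeeze, excite and scale operations is given below (PyTorch is assumed; the channel count and reduction factor are illustrative):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation sketch: global average pooling, two FC layers, sigmoid, re-scaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                    # x: (batch, C, H, W)
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))               # squeeze: global average pooling -> (batch, C)
        w = torch.relu(self.fc1(s))          # excite: bottleneck FC with reduction factor r
        w = torch.sigmoid(self.fc2(w))       # per-channel attention weights in (0, 1)
        return x * w.view(b, c, 1, 1)        # scale: broadcast element-wise multiplication

se = SEBlock(channels=64)
features = torch.randn(2, 64, 15, 179)
recalibrated = se(features)                  # same shape, channels re-weighted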

Inception Module

Instead of merely stacking convolutional layers, the Inception network builds a wider architecture rather than a deeper one. Starting from the original Inception network, there are four versions of the Inception block. The present disclosure mainly relates to Inception v1 and v2 and their variants. It is also to be noted that the present disclosure proposes the use of a modified version of the inception block which, to the inventors' best knowledge, has not been investigated up to now.

A short recap of convolutional layers: the kernels used in a single convolutional layer are fixed for all inputs. However, the salient components in different inputs can vary extremely in size. Smaller kernels are preferred for identifying local features, and larger kernels are suited to detecting global patterns. Therefore, a single, fixed-size kernel cannot perform universally well on all inputs, and the solution that the Inception network offers is a parallel architecture with multiple kernel sizes operating on the same level.

Fig. 2.22 shows an example of a “naive” Inception module (inception block). It performs convolution on an input, 5001, with 3 different sizes of kernels (1 x 1, 3 x 3 and 5 x 5), 5002, 5003, 5004. Additionally, max pooling, 5005, is also performed. The feature maps extracted by the different kernels are concatenated along the channel-axis and sent to the next layer, 5006.

An extra 1 x 1 convolution is added before the 3 x 3 and 5 x 5 convolutions in order to limit the number of input channels, because a 1 x 1 convolution is far cheaper computationally than a 5 x 5 convolution. It is worth noting that the 1 x 1 convolution is introduced after the max pooling layer rather than before.

A variation is illustrated in the example of Fig. 2.23, which is to factorize a 5 x 5 convolution, 5004, into two 3 x 3 convolution operations, 5004a, 5004b, to improve computational speed and decrease the number of parameters that need to be trained. Similarly, any m x n kernel can be factorized into a combination of 1 x n and m x 1 convolutions. This operation also attempts to decrease the number of parameters by decomposing a large kernel into two smaller ones.

The most obvious characteristic of the Inception module is its adaptation to various receptive fields. In this disclosure, the conventional square-shaped kernels are replaced with rectangular ones, which are more adaptive to the input spectrograms.

The systems and methods for determining an indication of an audio quality of an input audio frame take advantage of this point and fit Inception blocks (Figure 1.3) with horizontal and vertical kernels as depicted in Fig. 2.24. Since audio signals may contain both tonal and transient signals, exhibited by horizontal and vertical lines on (gammatone) spectrograms, respectively, it is optimal to apply both horizontal and vertical kernels for modeling audio.

The far-left branch is composed of two kernels, which jointly build a 3 x 7 vertical kernel, 6002, 6006. The second branch from the left is equivalently a 7 x 3 horizontal kernel, 6003, 6007. The right two branches inherit the classic Inception module branches with a 1 x 1 Conv, 6005, and a pooling operation, 6004. In this example, the max pooling is replaced with average pooling, considering that spectrograms do not possess sharp features and average pooling can smooth out the input. This Inception module forms the backbone of the proposed system (model) and displays a significant learning ability in detecting features of various sizes.
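Purely for illustration, a sketch of such an adapted inception block with rectangular kernels and average pooling is given below (PyTorch is assumed; the exact branch factorizations, channel counts and paddings are illustrative and may differ from the block of Fig. 2.24):

import torch
import torch.nn as nn

class AdaptedInceptionBlock(nn.Module):
    """Sketch of an inception block with rectangular kernels and an average pooling branch."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branch_vertical = nn.Sequential(             # ~3 x 7 receptive field
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=(3, 7), padding=(1, 3)),
        )
        self.branch_horizontal = nn.Sequential(            # ~7 x 3 receptive field
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=(7, 3), padding=(3, 1)),
        )
        self.branch_pool = nn.Sequential(                  # average pooling instead of max pooling
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )
        self.branch_1x1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)

    def forward(self, x):
        branches = [self.branch_vertical(x), self.branch_horizontal(x),
                    self.branch_pool(x), self.branch_1x1(x)]
        return torch.cat(branches, dim=1)                  # concatenate along the channel axis

block = AdaptedInceptionBlock(in_ch=2, branch_ch=16)
spec = torch.randn(1, 2, 32, 360)                          # reference + degraded gammatone input
out = block(spec)                                          # (1, 64, 32, 360)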

The Process of Learning

In the following, a standard training workflow including data pre-processing, parameter initialization, loss functions and optimization methods is introduced. In the last section, transfer learning is described in contrast to conventional model building and training.

Data Pre-processing

Machines do not understand data such as audio files themselves; they take in 1s and 0s, and the loss function therefore calculates the overall error based on these numeric representations. Considering that the data might derive from different sources, a perfect dataset is almost impossible, because data could be insufficient, or in different formats and sizes. Necessary measures to unify and adjust such messy data are taken before feeding it to the models. General measures are checking missing or inconsistent values, resizing, normalization, down-sampling or up-sampling (data augmentation) of the dataset, transforming data into the expected format (e.g. audio files into spectrograms) and splitting the dataset into training, validation and test sets. Further details about the data processing in this disclosure will be elaborated below.

Parameter Initialization

Parameters need to be initialized when training the whole network from scratch. There are two popular techniques to initialize the parameters in a neural network, which are zero initialization and random initialization. As a general practice, biases are initialized with zero and weights are initialized with random numbers, for example with a normal distribution.

However, as the network gets deeper and more complicated, the whole training procedure could last weeks and a better initialization is preferred. Larger models also expose vanishing/exploding gradient problems and slow down the convergence speed. In this case, more elaborate initialization strategies such as He and Xavier initialization are needed, depending on the activation functions used in the network.

Another strategy to shorten the training cycle is to train the model with the pre-trained parameters, which will be further explained in detail in the transfer learning section.

Loss Function

Examples of loss functions and possible properties thereof will be described next. The loss function and optimization methods are the bread and butter of deep learning. The loss function, sometimes also called the objective function, is a method of evaluating how well an algorithm models the given dataset. If predictions deviate too much from actual results, the loss function will generate a relatively large error. Optimization methods assist the loss function in reducing this error by adjusting the weights and biases in the network towards the optimum throughout the whole training procedure.

Broadly, loss functions can be divided into two major categories: regression losses and classification losses. Since this disclosure addresses a regression task, the loss functions used in regression are explained, which are the L2 loss, the L1 loss and the smooth L1 loss.

L2 Loss/Mean Square Error

As the name conveys, the mean square error (MSE) is measured as the mean of the squared differences between predictions and actual observations, as in Equation 2.9. L2 loss is more sensitive to outliers, because it squares the error. Generally speaking, L2 offers an empirically more stable solution than L1 loss and usually converges faster than L1 loss to a global minimum.

L1 Loss/Mean Absolute Error

The mean absolute error (MAE) is measured as the mean of the sum of absolute differences between predictions and actual observations, as in Equation 2.10. The L1 norm is used as a penalty function in mathematics because it promotes sparsity, and it is robust to outliers. L1 loss might converge to a local minimum compared with L2 and output multiple solutions rather than the optimal one.

Smooth L1 Loss

Generally, the loss function to be used for the comparison of the predicted indication of the audio quality and the actual indication of the audio quality, as described above, is not limited. However, in an embodiment, said comparison may be based on a smooth L1 loss function, an example of which will be described next.

Smooth L1 loss can be interpreted as a combination of the L1 and L2 losses, as in Equation 2.11. It behaves like L2 loss when the predictions and actual observations are close enough, and like L1 loss when the difference between predictions and actual observations is beyond a preset threshold, so as to avoid over-penalizing outliers.
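For illustration, a smooth L1 criterion can be applied to predicted and actual quality scores as follows (PyTorch is assumed; beta denotes the assumed switch-over threshold between the quadratic and linear regimes, and the scores are made up):

import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss(beta=1.0)   # quadratic for |error| < beta, linear beyond it

predicted = torch.tensor([3.2, 4.8, 1.9], requires_grad=True)   # e.g. predicted MOS scores
actual = torch.tensor([3.0, 4.0, 2.5])                          # e.g. listening-test labels

loss = criterion(predicted, actual)
loss.backward()   # gradients of the loss are then used to tune the model parameters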

Optimization

For a given architecture, the parameters determine how accurately the model implements the task. A loss function evaluates how big the difference is between the predictions and the actual observations, and the goal of the optimizer is to minimize this difference and to figure out the best parameter set that matches the prediction with reality. In this section, four optimization methods are exemplarily introduced: gradient descent, stochastic gradient descent, mini-batch gradient descent and Adam.

Gradient Descent

Gradient descent, for example, is the most basic and most commonly used optimization strategy, e.g. for backpropagation in neural networks. Gradient descent calculates the first-order derivative of the loss function as in Equation 2.12, and the weights are altered towards the global minimum of this loss function:

θ = θ - α · ∇L(θ), (2.12)

where α denotes the learning rate, θ denotes the updated parameters and L(θ) denotes the loss function used in the network. Although gradient descent is easy to compute and implement, it often gets trapped in local minima. Moreover, it often requires large memory and relatively long computation time when tackling large datasets. Since it updates the weights only after calculating the gradient on the whole dataset, it might take weeks to converge to a minimum.

Stochastic Gradient Descent

Stochastic gradient descent (SGD), for example, is a variant of gradient descent that updates the parameters after the computation of each training sample rather than after the whole dataset. The equation of SGD is formulated in Equation 2.13:

θ = θ - α · ∇L(θ; x_i, y_i), (2.13)

where x_i and y_i represent an individual training sample. SGD requires less memory and converges in less time owing to the frequent updates, but the variance of the model parameters is quite high.

Mini-batch Gradient Descent

Mini-batch gradient descent, for example, is a method between SGD and gradient descent. Mini-batch gradient descent updates the weights after every batch, as in Equation 2.14, instead of after a single sample or the whole dataset. A batch refers to a subset of the data that is processed together in each computation:

θ = θ - α · ∇L(θ; B_i). (2.14)

The advantages of mini-batch gradient descent are obvious: it occupies a medium amount of memory and has less variance while keeping a relatively high update frequency of the parameters. However, it does not address the issues that gradient descent and SGD face. It can be trapped at local minima and does not have an adaptive learning rate for different parameters. Furthermore, the convergence can take very long if the learning rate is small.

Adam

Adam, for example, is an abbreviated name for adaptive moment estimation. As its name suggests, it works with the first and second order momentums. Momentum is a term that is introduced in optimization methods to reduce the high variance in methods such as SGD and to smooth the convergence curve. It accelerates the gradient vector towards the right direction and reduces the oscillations in irrelevant directions. The motivation behind Adam is, however, to slow down the gradient descent velocity a little bit for a careful search in the relevant direction and to avoid jumping over the minima. Adam introduces exponentially decaying averages of the previous gradients m^(k) and the previous squared gradients v^(k). In other words, m^(k) is the first moment, which is the mean, and v^(k) is the second moment, which is the uncentered variance of the gradients. Adam is expressed with the equation group 2.15:

m^(k) = β1 · m^(k-1) + (1 - β1) · ∇L(θ)
v^(k) = β2 · v^(k-1) + (1 - β2) · (∇L(θ))²
m̂ = m^(k) / (1 - β1^k), v̂ = v^(k) / (1 - β2^k)
θ = θ - η · m̂ / (√v̂ + ε) (2.15)

In the suggested configuration settings, β1 is usually equal to 0.9, β2 equal to 0.999 and η takes 0.001. Even though Adam is computationally more expensive than the other methods, it converges rapidly to solutions of better quality, such as a global minimum or a better local minimum.
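Merely as a sketch, an Adam-based training step with the configuration cited above might look as follows (PyTorch is assumed; the tiny stand-in model and data are illustrative only):

import torch
import torch.nn as nn

model = nn.Linear(16, 1)                 # stand-in for the quality-prediction network
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x, y = torch.randn(8, 16), torch.randn(8, 1)
for _ in range(10):                      # a few illustrative update steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()                     # Adam update using first/second moment estimates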

Transfer Learning

Transfer learning is a machine learning method that can inherit previous knowledge and transfer it across tasks. It utilizes pre-trained models as a starting point instead of building and training a model from scratch. One of the obstacles in deep learning is to collect and build a well-labeled, reasonably structured and sufficiently large dataset. Data collection is tremendously time-consuming, and a “new-born” model would also take very long to look through such a large dataset and optimize its parameters accordingly. Many training configurations are restricted by the hardware of computers, and it is therefore cost-inefficient to train a model in the traditional way.

Another reason to utilize existing models is that the parameters can be initialized from previous models rather than by random initialization. To be more specific, the top few layers of CNN models always learn some basic features, such as frequency bands in audio. These basic elements aggregate and build up the complex features in deeper layers, which are task-dependently distinctive and are able to differentiate classes (in classification) or values (in regression). Transfer learning accommodates pre-trained models to the new dataset by fine-tuning. Fine-tuning in transfer learning has two basic steps: (1) cast off the previous output layers and add a new one for the current task; (2) retain part of the parameters of the pre-trained model, initialize the parameters of the newly added layers randomly and retrain this model on the new dataset. The most common operation is to freeze (retain) the parameters in the top few layers, because those features are generic across the source and target tasks.

Referring to the example of Figure 1.3 and the method illustrated in Figure 1.7, transfer learning in the present disclosure refers to initializing one or more weight coefficients of the at least one inception block of the stereo model based on one or more weight coefficients that had been obtained for at least one inception block of a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame of a mono audio training signal.
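Merely as an illustration of this weight re-use, the sketch below copies the parameters of an "inception" sub-module from a mono stand-in model into a stereo stand-in model and freezes them (PyTorch is assumed; the tiny models and the parameter-name prefix are illustrative only and do not reflect the actual InceptionSE architectures):

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Illustrative stand-in with an 'inception' part and a task-specific head."""
    def __init__(self, in_ch=2):
        super().__init__()
        self.inception = nn.Conv2d(in_ch, 8, kernel_size=3, padding=1)   # stand-in block
        self.head = nn.Linear(8, 1)

mono = TinyModel()     # assumed to have been trained on mono data (training loop omitted)
stereo = TinyModel()   # new model for the stereo task with identically shaped blocks

# Re-use the mono inception weights to initialize the stereo model, then freeze them.
transferred = {k: v for k, v in mono.state_dict().items() if k.startswith("inception")}
stereo.load_state_dict(transferred, strict=False)

for name, param in stereo.named_parameters():
    if name.startswith("inception"):
        param.requires_grad = False      # fine-tune only the newly added layers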

Models

Examples of neural networks for use in accordance with the present disclosure will be described next. These architectures can be roughly categorized into CNN-based networks and Inception-based networks. As an example of the models proposed by the present disclosure, the InceptionSE model, which evolves from an Inception-based backbone, will be elaborated in detail below.

CNN-based Architectures

RNNs, including LSTM and GRU, are adept at handling audio signals due to their chain-like structure. CNNs, on the other hand, are proficient in handling visualised feature representations, such as spectrograms. The attention mechanism was integrated into neural networks due to the inefficiency of RNN architectures when dealing with long sequences. In this section, a fundamental CNN model is built, and LSTM layers, self-attention layers and SE layers are integrated separately into the backbone of this CNN to form the other three CNN-based models.

CNN Model

Task complexity and the size of the dataset determine the depth of a CNN. The CNN depth and kernel sizes are therefore determined experimentally on a vanilla CNN architecture with a grid search for the optimal depth and kernel sizes.

The vanilla CNN architecture is shown in Fig. 3.1. It includes 7 convolutional layers (Conv1 to Conv7) with batch normalization, followed by 3 fully connected layers (FCL1 to FCL3) with a dropout rate of 0.5. ReLU is used as the activation function throughout the network.

Architectures deeper than this show a tendency to overfit during the experiments, and the error stops decreasing on the validation set. Therefore, this vanilla model is used as the backbone for the other experimented architectures. Further experiments confirm that kernel sizes larger than 7 cannot effectively capture tiny features in spectrograms, while kernel sizes 3 x 3, 5 x 5 and 7 x 7 do not show a significant difference in the learning process. A kernel size of 5 x 5 was chosen for the first convolutional layer and 3 x 3 for the rest of the layers, considering that the first layer can be viewed as a T/F-slot-level computation and larger kernels could better summarize patterns.


Layer      Filter Size/Stride    Output Size
Input                            2 x 32 x 360
Conv1      5                     64 x 32 x 360
MaxPool    2                     64 x 31 x 359
Conv2      3                     128 x 16 x 180
Conv3      3                     128 x 16 x 180
MaxPool    2                     128 x 15 x 179
Conv4      3                     256 x 15 x 179
Conv5      3                     256 x 15 x 179
MaxPool    2                     256 x 14 x 178
Conv6      3                     512 x 14 x 178
Conv7      3                     512 x 14 x 178
AvgPool    10 x 10 / (5, 5)      512 x 1 x 34
FCL1                             1536 x 1
FCL2                             512 x 1
FCL3                             1 x 1

Table 3.1: Architectures and parameters of CNN model (vanilla)

CNN-LSTM Model

LSTM is capable of learning long-term dependencies and remembering information for a long period of time. The spectrogram inherits the time dimension from the audio signal, and classic RNN structures such as LSTM and GRU are therefore also applied in many speech quality assessment tasks based on spectrograms. Models such as NISQA transform signals into log-mel-spectrograms as the actual input to the network. NISQA consists of a CNN head and an LSTM tail. The CNN helps to capture the features from the spectrograms and predicts a per-frame quality. However, the overall quality cannot be viewed as a simple summation of the per-frame qualities, because interference such as short interruptions has been proven to sound more annoying than steady background noise. Therefore, a bidirectional LSTM (bLSTM) is used after the CNN to model dependencies between frames and forecast an overall speech quality.

In this disclosure, a bLSTM layer was applied along the time/frequency dimension to analyse the association between frames and frequency bands (shown in Table 3.2). It is added behind the last convolutional layer of the vanilla CNN model and before the global average pooling layer. This model, however, was observed in later experiments to have an exponentially higher computation cost compared to the other models. Therefore, other, more efficient layers that bypass the LSTM are sought.

Layer                     Filter Size/Stride    Output Size
Input                                           2 x 32 x 360
bLSTM (time/frequency)                          512 x 14 x 178
AvgPool                   10 x 10 / (5, 5)      512 x 1 x 34
FCL1                                            1536 x 1
FCL2                                            512 x 1
FCL3                                            1 x 1

Table 3.2: Architectures and parameters of CNN-LSTM model

CNN-Attention Model

The attention mechanism has been found to be a better alternative to LSTM for long sequences. It can capture longer dependencies than LSTM and computes the input in parallel rather than sequentially. Self-attention is able to compute an attention matrix between all frames or frequency bands regardless of the length of the input. In the present disclosure, the LSTM units are replaced with a lightweight self-attention layer to reduce the computational complexity.

Table 3.3 shows a CNN-Attention architecture. Self-attention layers are applied to both frequency dimension and time dimension and inserted between convolutional layers of the vanilla CNN model.


Layer                              Filter Size/Stride    Output Size
Input                                                    2 x 32 x 360
Self-Attention (time/frequency)                          512 x 14 x 178
AvgPool                            10 x 10 / (5, 5)      512 x 1 x 34
FCL1                                                     1536 x 1
FCL2                                                     512 x 1
FCL3                                                     1 x 1

Table 3.3: Architectures and parameters of CNN-Attention model

CNN-SE Model

The squeeze-and-excitation mechanism aims to boost meaningful features while suppressing weak ones by applying extra attention weights to the channels. The SE module was thus used for channel recalibration of the feature maps. In one implementation, these SE modules were incorporated into the vanilla CNN model with minimal computational complexity, and the performance of this CNN-SE model was monitored. Two extra SE layers may be integrated into the backbone of the vanilla CNN model as shown in Table 3.4. They compute 256 x 1 and 512 x 1 attention matrices, respectively, which are applied to the feature maps by a broadcast element-wise multiplication.

Inception-based Models

The input size 2 x 32 x 360 (an example input size used in the training, explained later in the training dataset preparation) is a long, narrow rectangular matrix. Classic kernels such as 3 x 3 and 5 x 5 would not optimally adapt to this rectangular input. A rectangular horizontal kernel, acting as a temporal kernel, can capture rhythmic and tempo patterns. A vertical kernel (frequency kernel) can learn timbre and equalization setups over broader frequency bands. These rectangular kernels were fitted into the structure of the Inception block, and a new backbone for audio quality evaluation was found. The Inception network has been successfully transfer-learned in many audio classification tasks. This indicates that the Inception network can effectively capture the latent features in audio. Therefore, four variants were constructed based on the backbone of Inception blocks with rectangular kernels, which are the Inception model (naive), the InceptionSE model (naive), the Inception model without head layer and the InceptionSE model without head layer. The InceptionSE model without head layer is the preferred architecture.

Layer      Filter Size/Stride    Output Size
Input                            2 x 32 x 360
Conv1      5 x 5 / (1, 1)        64 x 32 x 360
MaxPool    2 x 2 / (1, 1)        64 x 31 x 359
Conv2      3 x 3 / (2, 2)        128 x 16 x 180
Conv3      3 x 3 / (1, 1)        128 x 16 x 180
MaxPool    2 x 2 / (1, 1)        128 x 15 x 179
Conv4      3 x 3 / (1, 1)        256 x 15 x 179
Conv5      3 x 3 / (1, 1)        256 x 15 x 179
MaxPool    2 x 2 / (1, 1)        256 x 14 x 178
AvgPool    10 x 10 / (5, 5)      512 x 1 x 34
FCL1                             1536 x 1
FCL2                             512 x 1
FCL3                             1 x 1

Table 3.4: Architectures and parameters of CNN-SE model

Inception Model

Kernel sizes larger than 9 x 9 show inefficiency in capturing features within this input size. Therefore, the current Inception model is restricted to the kernel sizes 1 x 1, 1 x 3, 3 x 1, 1 x 5, 5 x 1, 3 x 7, 7 x 3, 3 x 5 and 5 x 3. The Inception module forms a combination of three convolutional layers with different kernel sizes and a parallel pooling path, with their output feature maps concatenated into a single output vector. As features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of smaller kernel sizes should increase when moving to higher layers.

In general, this Inception network includes modules of the above types stacked upon each other, while keeping the first layer in a traditional convolutional fashion. The head convolutional layer is not strictly necessary, which reflects the inefficiencies of this naive Inception model.

Further, the auxiliary classifier is removed considering that Inception-based networks as described herein are relatively shallow, but one head layer is kept in our naive Inception model to examine how it will impact the learning process.

Table 3.5: Architectures and parameters of Inception model (naive)

The combination of Figs. 3.2A to 3.2D illustrates the architecture of the naive Inception model, and its parameters are listed in Table 3.5. Its basic Inception block derives from the adapted Inception block depicted in Fig. 2.24, with a ReLU and a BatchNorm after each convolution operation.

InceptionSE Model

Similarly, SE layers were introduced into the backbone of Inception network motivated by improving the resistance of the model against the so-called “adversarial examples”.

Taking the output of generative models as an example, objective audio quality estimators have so far failed to predict the quality of such generated samples. Adversarial examples are a major obstruction to constructing a robust audio quality predictor because machines only process matrices of 0s and 1s (rather than hearing the audio). An excerpt with carefully fine-tuned inaudible noise added would heavily disturb a machine's algorithms, yet a human would not sense any difference. SE layers are believed to help the model focus on generic features and reduce the influence of random variations at the instance level.

The naive InceptionSE model is depicted in the combination of Figs. 3.4A to 3.4F and its parameters are listed in Table 3.6, where SE layers are inserted between Inception blocks and produce an attention weight on each channel of the Inception blocks.
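As an illustration of the channel recalibration described above, the following is a minimal PyTorch sketch of a squeeze-and-excitation layer; the module name, the reduction ratio of 16 and its placement between blocks are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel recalibration: squeeze (global average pool) then excite (gating)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: one value per channel
        w = self.fc(w).view(b, c, 1, 1)  # excite: per-channel attention weights
        return x * w                     # broadcast element-wise multiplication
```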

Table 3.6: Architectures and parameters of InceptionSE model (naive)

Inception and InceptionSE Model without Head Layer

Features captured by the top few layers in CNNs are usually regarded as primary features, such as frequency and temporal patterns in audio-related tasks. Gammatone spectrograms, as a visual frequency-time representation of audio signals, are extracted by gammatone filters in advance, so an extra convolutional head layer before the Inception blocks is intuitively redundant for the current incarnations. The head layer has been removed from both the naive Inception model and the naive InceptionSE model, and the performance of the modified model prototypes was examined. The parameters of the InceptionSE model without head layer are listed in Table 3.7; the table for the Inception model without head layer is omitted since the two share the same parameters apart from the additional SE layers. Their architectures are depicted in the combination of Figs. 3.3A to 3.3D and the combination of Figs. 3.5A to 3.5F, respectively.

Table 3.7: Architectures and parameters of InceptionSE model without head layer

In a nutshell, the InceptionSE model without head layer (from now on abbreviated as the InceptionSE model) is chosen as the proposed model of this disclosure due to its overall performance as well as its compact architecture. It has been observed in the later experiments that it offers a rapid training process and stable performance on both training and test datasets. Further details of systems and methods for determining an indication of an audio quality of an input audio frame with different architectures and fine-tuning parameters are elaborated below.

Datasets

The diversity of tasks in audio processing is the major reason for the lack of a unified database. Datasets used in previous audio and speech quality evaluation research are collected and annotated by individuals and not published for other research. Therefore, own data covering music, speech and speech mixed with music excerpts was used to create a 48 kHz sampling rate training dataset. In the last section, two test datasets are introduced to evaluate the proposed model.

Training Dataset Preparation

The duration of excerpts used in related audio and speech quality evaluation tasks varies from 6 to 15 seconds, and at least 1000 clean reference samples are typically generated for training. Therefore, a corpus of 10 hours of clean music excerpts and 2 hours of clean speech excerpts at a sampling rate of 48 kHz was used. The data generation and labeling procedure is outlined in Fig. 4.1.

The clean reference is first segmented into 5400 excerpts of approximately 8 seconds, and the left channel is extracted from each excerpt as a mono signal. These mono clips are then encoded and decoded with the High-Efficiency Advanced Audio Coding (HE-AAC) and Advanced Audio Coding (AAC) codecs at the following bitrates: 16, 20, 24, 32, 48, 64, 96, and 128 kbps. Bitrates 16, 20, 24, 32, and 48 kbps were encoded with HE-AAC, and bitrates 64, 96, and 128 kbps were encoded with plain AAC. Coding above 128 kbps is hardly audibly different from un-coded signals, and coding below 16 kbps greatly reduces the audio quality. Thus, 43,200 degraded signals are generated from the 5400 clean reference excerpts.
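A minimal sketch of the segmentation and coding loop described above is given below; encode_decode is a hypothetical placeholder for the HE-AAC/AAC encode-decode round trip (the actual codec invocation is not shown), and the soundfile package is assumed for reading the 48 kHz audio.

```python
import numpy as np
import soundfile as sf  # assumed available for reading the 48 kHz source material

HE_AAC_BITRATES = [16, 20, 24, 32, 48]  # kbps, coded with HE-AAC
AAC_BITRATES = [64, 96, 128]            # kbps, coded with plain AAC
SEGMENT_SECONDS = 8

def encode_decode(mono: np.ndarray, sr: int, bitrate_kbps: int, codec: str) -> np.ndarray:
    """Hypothetical codec round trip (encode then decode); a real implementation
    would call the external encoder/decoder. This placeholder returns the input."""
    return mono

def generate_degraded(path: str):
    audio, sr = sf.read(path)                        # (samples, channels)
    mono = audio[:, 0] if audio.ndim > 1 else audio  # left channel as mono signal
    seg_len = SEGMENT_SECONDS * sr
    for start in range(0, len(mono) - seg_len + 1, seg_len):
        ref = mono[start:start + seg_len]
        for br in HE_AAC_BITRATES:
            yield ref, encode_decode(ref, sr, br, codec="he-aac"), br
        for br in AAC_BITRATES:
            yield ref, encode_decode(ref, sr, br, codec="aac"), br
```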

The reference and degraded signals are aligned and paired and later fed into ViSQOL v3 to produce MOS-LQO as the corresponding ground truth labels instead of human-annotated MOS scores. Gammatone spectrograms of reference and degraded signals are extracted based on the MATLAB implementation, whose C++ port runs inside ViSQOL v3. It was verified that both the MATLAB and C++ versions generate the same gammatone spectrograms. The gammatone spectrogram of the audio signal is calculated with a window size of 80 ms, a hop size of 20 ms, and 32 frequency bands ranging from 50 Hz up to half of the highest sampling rate, i.e. 24 kHz. The resulting gammatone spectrograms of reference and degraded signals are paired and stacked along the channel dimension, which results in an input size of 2 x 32 x 360 to the neural network. Note that this is an input size used only for training. Since this is a convolutional model, it can operate on sequences of any input length. That is, the model can accept a smaller time frame as input and predict quality on a frame-by-frame basis (e.g. every 600 ms) or over the whole music duration (e.g. 5 minutes or longer, as permitted by memory).

Here, 2 denotes 2 channels, 32 denotes 32 frequency bands, and the time resolution is calculated according to Equation 4.1:
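Equation 4.1 is not reproduced here; assuming the standard framing relation for a spectrogram with the window and hop sizes given above, the number of time columns T for an excerpt of duration t_dur would be

T = \left\lfloor \frac{t_{\mathrm{dur}} - t_{\mathrm{win}}}{t_{\mathrm{hop}}} \right\rfloor + 1 = \left\lfloor \frac{t_{\mathrm{dur}} - 0.08\,\mathrm{s}}{0.02\,\mathrm{s}} \right\rfloor + 1

which is consistent with roughly 360 columns for excerpts somewhat shorter than 8 seconds; the exact form of Equation 4.1 in the source may differ.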

Since not all excerpts are exactly 8 seconds long, the number of time columns varies slightly; the input size was unified to 2 x 32 x 360 by discarding any columns beyond the 360th. The gammatone spectrogram of one training sample is illustrated in Fig. 4.2.
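The pairing and truncation step can be sketched as follows, assuming the gammatone spectrograms of the reference and degraded signals have already been computed as 32 x T arrays; the function name is illustrative.

```python
import numpy as np

TARGET_FRAMES = 360  # fixed number of time columns used for training

def pair_and_crop(ref_spec: np.ndarray, deg_spec: np.ndarray) -> np.ndarray:
    """Stack reference and degraded gammatone spectrograms (32 x T each) along a
    channel dimension and truncate to the fixed 2 x 32 x 360 training input."""
    stacked = np.stack([ref_spec, deg_spec], axis=0)  # (2, 32, T)
    return stacked[:, :, :TARGET_FRAMES]              # discard columns beyond the 360th
```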

The resulting training set is illustrated in Figs. 4.3 and 4.4. Two obvious characteristics of this training dataset can be observed from the distribution over MOS and bitrates. First, almost 80% of the samples are rated above 4.0 by ViSQOL, and the highest-rated samples are concentrated in the region of high bitrates, according to Fig. 4.3. Second, according to Fig. 4.4, the overall trend of MOS-LQO increases with bitrate, except at the bitrate points 48 kbit/s and 64 kbit/s. ViSQOL rates 64 kbps excerpts lower in quality than 48 kbps excerpts, which is contrary to empirical experience and intuition. One possible reason for this anomaly could be that the 48 kbps excerpts are coded with HE-AAC, which creates a higher and extended bandwidth with spectral band replication (SBR), whereas the 64 kbps excerpts are coded with AAC. It can also be observed from the spectrograms in Figs. 4.8 and 4.9 that the sample WADmus047 has a wider bandwidth at 48 kbps than at 64 kbps, while the other bandwidths (shown in Figs. 4.5 to 4.13) widen as the bitrate increases. Another possibility could be that the 64 kbps encoder operating point is not optimally fine-tuned. Thus, the 48 kbps and 64 kbps excerpts in the training set could mislead the model into learning a wrong pattern, and those excerpts were therefore excluded from the training set.

In order to balance this biased training set towards low MOS-LQO and to calibrate the scores against subjective listening tests in the evaluation, the clean reference as well as two anchors, namely 3.5 kHz and 7.0 kHz low-pass filtered reference signals, were included in the degraded signals. In addition, excerpts coded at 40 kbps, 48 kbps (but bandwidth-limited to 18 kHz) and 80 kbps were included in the degraded signals and labeled with the ViSQOL v3 estimation as well. One severe drawback of ViSQOL v3 is its inaccuracy in predicting the quality of clean original signals. When ViSQOL v3 takes reference-reference (ref-ref) pairs to predict the quality of the clean original excerpts, the estimated MOS score is constrained to 4.73 rather than the highest score of 5. Therefore, all ref-ref pairs were manually labeled with the highest MOS score of 5 as ground truth, in an attempt to push the models to rate reference signals with the highest score of 5.

Fig. 4.14 plots the new training set (except for the reference). In the figure, 48 refers to the excerpts coded at 48 kbps with the bandwidth limited to 18 kHz. By excluding the 64 kbps excerpts from the training set, the MOS valley is skipped and a monotonically increasing MOS-LQO is obtained in the training set. This special care was taken to make sure that the quality increases with bitrate. This helps ensure that the model is capable of ranking correctly (learning-to-rank strategy); i.e. if a signal x_j is a programmatically degraded version of the same original signal x_i, then their scores should reflect that relation, i.e. MOS_i > MOS_j.

Noise and Silence

During evaluation, the model has been observed to fail in accurately predicting the quality of excerpts which have very low-energy content in high frequency bands.

When such audio excerpts are coded at high bitrates, listeners do not perceive the slight defects existing in the high frequency bands and would rate these excerpts with high scores. However, machines are more sensitive than humans because they "see" the matrices of spectrograms instead of actually "hearing" them. Therefore, the model treats every T/F slot alike and is not able to ignore those defects in high frequency bands the way humans do.

Figs. 4.15 and 4.16 display one such sample with low-energy content in the high frequency region. Most of its content is concentrated below 4 kHz, and when this excerpt is coded at higher bitrates, a large spectral hole appears between 10 kHz and 16 kHz in its spectrogram (marked by the dashed box in Fig. 4.16). Even though listeners cannot perceive it, computers can see this defect in the spectrogram and rate this excerpt with an unexpectedly low score.

Table 4.1: Training dataset composition

ViSQOL v3 sets up a spectrogram threshold to address this issue, whereas here it is attempted to include fine-tuned noise and silence excerpts in the training set to push up the quality scores predicted by the model for this type of defect. These noise and silence excerpts are low-energy (lower than -108 dB) high-pass filtered signals that are then encoded and decoded at high bitrates (80 kbps, 96 kbps and 128 kbps), which are inaudible to listeners with normal hearing. The motivation behind this idea is to train the model with visibly different (on spectrograms), but perceptually equivalent, pairs of (unencoded, coded) signals. These reference-degraded (ref-deg) pairs are labeled manually with a MOS score of 5, considering that their addition to audio excerpts would hardly harm the listening experience. By adding these noise and silence excerpts to the training set, it is expected that the model is forced to learn the pattern of noise in high frequency bands and to be more tolerant of such imperfections.

In total, 5 minutes of silence and 90 minutes of noise excerpts, including 60 minutes of white noise, 15 minutes of pink noise and 15 minutes of brown noise, were generated and coded at 80 kbps, 96 kbps and 128 kbps, respectively. These excerpts were then segmented into 8 seconds long excerpts and paired with their corresponding references.
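A minimal sketch of how such a low-energy, high-pass filtered noise excerpt might be generated is shown below, using SciPy; the filter order, the cutoff frequency and the interpretation of -108 dB as a peak level are illustrative assumptions, pink/brown shaping is omitted, and the subsequent codec round trip at 80/96/128 kbps is not shown.

```python
import numpy as np
from scipy.signal import butter, sosfilt

SR = 48_000  # sampling rate in Hz

def make_noise_excerpt(seconds: float = 8.0, cutoff_hz: float = 10_000.0,
                       level_db: float = -108.0) -> np.ndarray:
    """Low-energy, high-pass filtered white noise excerpt (pink/brown shaping omitted)."""
    noise = np.random.randn(int(seconds * SR))
    sos = butter(4, cutoff_hz, btype="highpass", fs=SR, output="sos")
    filtered = sosfilt(sos, noise)
    filtered /= np.max(np.abs(filtered)) + 1e-12  # peak-normalise
    return filtered * 10 ** (level_db / 20.0)     # scale down to the target level
```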

In conclusion, the overall composition of the final training dataset is listed in Table 4.1. Only the ref-deg pairs of music and speech signals use the ViSQOL estimation as ground truth; the other pairs are annotated manually with the highest score of 5. The paired signals are transformed into gammatone spectrogram matrices of shape 2 x 32 x 360 as input to the models.

Test Datasets

The test dataset is a vital link in verifying the utility of trained models. Excellent performance of a model on the training set does not necessarily mean the training succeeded. The test set is a collection of samples that the models have never seen, and it therefore helps to rectify the model (according to the model's performance on it). A collection of 56 critical excerpts typically used to evaluate audio codecs and the USAC Verification Listening Tests are the two test sets.

Set of Critical Excerpts for Codecs

This set includes one applause excerpt, speech excerpts in different languages and from different genders, music excerpts, as well as excerpts which are a mix of speech and music. However, this test set lacks MOS scores, and it is therefore used to evaluate how well the model correlates with ViSQOL v3 on unseen data. Samples in the test set are processed exactly as in the training set. These 56 excerpts are encoded and decoded with the same codecs at the following bitrates: 16, 20, 24, 32, 40, 48 (bandwidth limited to 18 kHz), 80, 96, and 128 kbps. The ref-deg pairs are fed into ViSQOL v3 for MOS-LQO scores (marked as MOS-v) and transformed into gammatone spectrogram matrices, which are then fed into the InceptionSE model for new MOS-LQO scores (marked as MOS-i). MOS-v and MOS-i are compared and illustrated to evaluate the performance of the models with respect to ViSQOL.

USAC Verification Listening Tests

The Unified Speech and Audio Coding (USAC) technology was developed to provide coding of signals having an arbitrary mix of speech and audio content, with consistent quality across content types, in particular at intermediate and low bitrates. The USAC Verification Listening Tests contain 27 items coded with USAC and with the two best codecs tailored for general audio or speech (i.e. HE-AACv2 and AMR-WB+), over the whole range of bitrates from 8 kbps mono to 96 kbps stereo.

The verification tests were designed to provide information on the subjective performance of USAC in mono and stereo over a wide range of bitrates from 8 to 96 kbps, compared to the subjective performance of the other codecs (i.e. HE-AACv2 and AMR-WB+). Depending on the listening test, at least 8 listeners at 6 to 13 test sites participated in each test, and more than 38000 individual scores on three different codecs were collected. These verification tests provide a standardised quality metric to evaluate the performance of the InceptionSE model and to correlate its predictions with subjective quality scores.

Table 4.2: Conditions for Mono Listening Test 1 at low bitrates

Three separate listening tests have been performed, covering mono signals at low bitrates, and stereo signals at low and high bitrates. The conditions included in each test are given in Tables 4.2 to 4.4. Along with USAC, HE-AACv2 and AMR-WB+ are the codecs evaluated in the tests.

All tests used the MUSHRA method, whose quality scale ranges from 0 to 100 with no decimal digits. All test items are about 8 seconds in duration. The scores are excluded if the following criteria are not satisfied (a sketch of this post-screening check is given after the list):

• The listener's score for the hidden reference is greater than or equal to 90 (i.e. HR >= 90)

• The listener's scores for the hidden reference, the 7.0 kHz low-pass anchor and the 3.5 kHz low-pass anchor are monotonically decreasing (i.e. HR >= LP70 >= LP35).
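The post-screening check sketched below (a hypothetical helper, written in Python for illustration) encodes these two criteria for a single listener and item.

```python
def keep_listener_scores(hr: float, lp70: float, lp35: float) -> bool:
    """Keep a listener's scores only if the hidden reference is rated at least 90
    and the HR >= LP70 >= LP35 ordering holds."""
    return hr >= 90 and hr >= lp70 >= lp35
```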

Mono Listening Test 1

The items in this test are mono signals coded at low bitrates. The InceptionSE model is designed and trained on mono signals as well. Therefore, the performance of the InceptionSE model on this mono listening test is used for fine-tuning the model at low bitrates with respect to the subjective quality scores.

Stereo Listening Test 2 and Test 3

Listening test 2 and test 3 are conducted on stereo signals at low and high bitrates, respectively. They are supplementary tests to examine the performance of the models on stereo signals. The mid-channel is calculated as the average of the left and right channels of a stereo audio excerpt, and this mid-channel is input into the InceptionSE model to obtain an objective MOS score for the stereo audio excerpt. However, this simple summation of multiple channels is not how the ear processes stereo signals. Hence, the results on the stereo listening tests serve merely as an indicator of the potential of InceptionSE for stereo signals.

Table 4.3: Conditions for Stereo Listening Test 2 at low bitrates

Table 4.4: Conditions for Stereo Listening Test 3 at high bitrates

Experiments

The models constructed in the previous sections were evaluated using the data corpus described above. The performance of each model is assessed under the following criteria: computational efficiency, number of parameters, mean squared error (MSE), Spearman's correlation coefficient (R_s) and Pearson's correlation coefficient (R_p). In the final section, the proposed model is examined on the test sets, and the results give an insight into how the proposed model performs on unseen data coded with other codecs.
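A minimal sketch of how these criteria can be computed from paired predictions and reference scores, using SciPy, is shown below; the function name is illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluation_metrics(predicted, ground_truth):
    """MSE, Pearson's R_p and Spearman's R_s between predicted and reference MOS-LQO."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    mse = float(np.mean((predicted - ground_truth) ** 2))
    r_p, _ = pearsonr(predicted, ground_truth)
    r_s, _ = spearmanr(predicted, ground_truth)
    return mse, r_p, r_s
```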

Experiments with Architecture Variants

In this section, five models with distinct backbones are evaluated by training on the dataset including the noise and silence excerpts. These models are the vanilla CNN model, the CNN-LSTM model, the CNN-Attention model, the CNN-SE model and the naive Inception model. Further, the advantages of the Inception-based model are discussed.

Comparison between CNN-based and Inception Models

The training dataset including noise and silence results in a total of 67624 pairs (more precisely, ref-ref and ref-deg pairs). The data were partitioned randomly into 80% for training and 20% for validation. A 5-fold cross-validation is applied to ensure that the model can make full use of the limited data. The average R_p and MSE per fold and epoch are calculated to represent the overall performance of each model. The mean and variance of the input features were normalised using estimates from the training set. All experiments are implemented with PyTorch and were trained for 50 epochs on an Nvidia GTX 1080 Ti GPU using the Adam optimizer. Grid searches were performed over learning rates and batch sizes, and a learning rate of 0.0004 and a batch size of 32 were selected as training parameters.

Batch normalization was applied after all convolutional layers, and all models use the MSE loss as the loss function. In view of the training dataset size, dropout was used as the regularization technique to prevent overfitting on the limited dataset.
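The following is a minimal PyTorch sketch of the training configuration described above (Adam, learning rate 0.0004, batch size 32, 50 epochs, MSE loss); the model construction, the 5-fold splitting and the dropout placement are abbreviated or assumed.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model: nn.Module, train_set: TensorDataset, epochs: int = 50,
          lr: float = 4e-4, batch_size: int = 32, device: str = "cuda"):
    """Train with Adam and MSE loss, as selected by the grid search described above."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # Smooth L1 loss is evaluated as an alternative later
    for _ in range(epochs):
        for spectrograms, mos in loader:  # spectrograms: (B, 2, 32, 360), mos: (B,)
            spectrograms, mos = spectrograms.to(device), mos.to(device)
            optimizer.zero_grad()
            loss = criterion(model(spectrograms).squeeze(-1), mos)
            loss.backward()
            optimizer.step()
    return model
```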

The number of parameters (counted in millions, M) and the training time per epoch are listed in Table 5.1. The CNN-LSTM model requires prohibitively high memory unless the input size is downsampled, and it was excluded from further work due to its computational inefficiency. Another finding is that the extra self-attention layers and SE layers do not add much computational burden. Furthermore, the Inception-based network streamlines the architecture with fewer parameters and layers, and shortens the training time needed per epoch.

Table 5.1: Computation efficiency of experimented models

Table 5.2 summarizes the performance of the remaining four models. The progress was monitored via R_p and MSE over a validation subset. Interestingly, the naive Inception model has achieved a performance comparable to the CNN-Attention model with merely half the parameters and less training time. SE layers, on the contrary, sacrifice the accuracy of the CNN model for higher robustness (as expected).

Table 5.2: R_p and MSE of experimented models

In short, the naive Inception model has outperformed the others in terms of a high correlation with the ground truth and a low average error on the validation set, with merely half of the training parameters.

Further performance gains could be achieved by fine-tuning the Inception-based model, and further experiments were therefore conducted on the variants of the Inception-based model.

Comparison between Inception-based Variants

Three further variants were derived from the Inception-based models by inserting SE layers and removing the head layer. The progress of these models was recorded and their performance is listed in Table 5.3.

Table 5.3: R_p and MSE of Inception-based variants

A significant performance improvement has been achieved by removing the head layer in the Inception-based models, and the number of trainable parameters is further reduced. This modification depends on the input format fed to the models. Gammatone spectrograms are a visual representation of a set of frequency bands over the entire time scale. The top layers in CNNs usually learn primary features, such as frequency bands and patterns over time, which have already been extracted by the gammatone filters in the data pre-processing. Therefore, an extra convolutional layer at the top would be computationally redundant in this case. Inception-based models could also be adapted to analysing raw audio waveforms; under this circumstance, it is hypothesized that such a head layer may be necessary for extracting the primary features.

On the contrary, SE layers do not increase the correlation of the experimented models with ViSQOL as expected. One property of SE layers is to improve the robustness of Inception-based models against adversarial examples. This could sacrifice the model's accuracy on the targeted task while achieving better adaptability on a broader range of samples.

Experiments with Loss Functions

Experiments in search of a suitable loss function are conducted, and the influence of the loss function on the InceptionSE model without head layer is evaluated.

Table 5.4: Performance of InceptionSE model with various loss functions

As can be observed from Table 5.4, all these loss functions achieve an average R_p over 0.9 with respect to ViSQOL v3 on the validation set. The best model of each configuration exceeds an R_p of 0.99. However, the Smooth L1 loss has been observed to stabilize the training progress with a slightly faster convergence speed. For the final proposal, the best model trained with the Smooth L1 loss was chosen. Alternatively, models trained with the other two loss functions may be feasible as well.

Experiments with Reference, Noise and Silence

One apparent deficiency of ViSQOL v3 when generating scores is its limitation in predicting the quality scores of clean reference signals. The highest prediction score produced by ViSQOL is bounded at 4.73, even though the audio excerpts are perceptually transparent to listeners. All ref-ref pairs were therefore manually labeled as 5 rather than the 4.73 predicted by ViSQOL v3, which enables the models to learn the correct upper boundary of 5.

Excerpts such as the sample CO 02 OMensch and sc01 (a trumpet excerpt from the MPEG-12 set) feature low-energy content in high frequency bands. The models in the early stage would estimate this type of sample, when coded at high bitrates, with far lower scores than ViSQOL v3 and the subjective listening test scores. This issue is addressed by including low-energy high-pass filtered noise and silence excerpts in the training set. The positive impact of the noise and silence is confirmed by the new predictions for CO 02 OMensch and sc01.

CO 02 OMensch in Table 5.5 is underestimated by the early model trained without noise and silence, while ViSQOL v3 rates this excerpt with its highest score of 4.73, and in subjective listening tests it is rated almost as high as 5. The new estimation by the model trained with noise and silence surpasses the performance of ViSQOL v3 and matches the subjective quality scores better.

Table 5.5: Performance of InceptionSE model on the excerpt CO 02 OMensch

Sc01 is another typical example that disturbs the early model trained without noise and silence. As depicted in Figs. 5.1 and 5.2, a cross represents the prediction from the neural networks and a bullet the prediction from ViSQOL v3. Compared to the early estimation of the model on the left, the predicted MOS-LQO scores have significantly improved, especially at high bitrates. It is also worth noting that the models trained with manually annotated ref-ref pairs successfully predict the original quality of sc01 as 5 (instead of 4.73 as predicted by ViSQOL v3).

Results on the Set of Critical Excerpts for Codecs

The test set of critical excerpts for codecs consists of 56 items including speech, music and applause excerpts. The correlation between our predictions and those of ViSQOL v3 is computed on this test set. The InceptionSE (without head layer) model is examined under two conditions, namely training with noise and silence and training without noise and silence. R_p and R_s are listed in Table 6.1, including the two anchors (3.5 kHz and 7.0 kHz) and the reference. As can be seen, both R_p and R_s have improved after training with the specially designed noise and silence, and the proposed InceptionSE (without head layer) model achieves a strong correlation with ViSQOL v3, with a correlation coefficient over 0.97 on the unseen data.

Table 6.1: Performance of Inception-based model on the set of critical excerpts for codecs

Apart from sc01 (trumpet), two more samples from the MPEG-USAC-43 set and an extra applause excerpt are illustrated in Figs. 6.1 to 6.6. KoreanM1 is a Korean male speech excerpt, 09-Applaus-5-1 2 0 is an applause excerpt and SpeechOverMusic 1 is an English female speech over stadium noise. In these figures, different degrees of improvement of the predicted MOS-LQO scores can be observed in the high bitrate regions. The prediction for KoreanM1 shows an overall improvement including its low bitrate regions. KoreanM1 is also observed to have low-energy content in its high frequency bands, and therefore the model trained with noise and silence has a more obvious positive influence on this type of excerpt.

Results on USAC Verification Listening Tests

In addition to examining the correlation between the model and ViSQOL v3, it is also critical to verify the utility of the proposed model on unseen data labeled with subjective quality scores. The USAC verification tests form another test set, which provides subjective quality scores on 27 items coded with AMR, USAC as well as HE-AAC, covering a large range of bitrates. Three separate tests are conducted: mono excerpts at low bitrates, and stereo excerpts at low and high bitrates. R_p and R_s are calculated for mono listening test 1 between the subjective quality scores and the predictions of different objective audio quality methods, including PEAQ, ViSQOL with the MATLAB implementation, ViSQOL v3 with the C++ implementation and the proposed model, the InceptionSE model.

Tests on stereo excerpts are supplementary experiments, and their results display the potential for extending the current model towards stereo.

Results on Mono Listening Test-1

USAC Verification Listening Test 1 is designed for mono signals encoded at low bitrates. The corresponding combination of codecs and bitrates can be found in Table 4.2. Reference is made to the results in P. M. Delgado and J. Herre, "Can we still use PEAQ? A performance analysis of the ITU standard for the objective assessment of perceived audio quality", in 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), pages 1-6, 2020, where the authors examined the correlation between different objective scores (PEAQ and ViSQOL with the MATLAB implementation) and subjective scores. The results for PEAQ (Opticom implementation of the PEAQ-Advanced version) and ViSQOL (MATLAB) are listed together with those of ViSQOL v3 and the InceptionSE model in Table 6.2.

Table 6.2: Performance of PEAQ, ViSQOL (MATLAB), ViSQOL v3 and InceptionSE on mono listening test-1

The proposed model achieves a result comparable to ViSQOL v3 and outperforms the old ViSQOL (MATLAB) version and PEAQ. Furthermore, the performance of the model and ViSQOL v3 is examined on each individual codec, namely AMR, HE-AAC and USAC. Similarly, the reference and the two anchors are included in each examination, and the corresponding R_p and R_s are listed in Table 6.3.

Table 6.3: Performance of InceptionSE model on various codecs

As can be seen in Table 6.3, the proposed model results in a slightly better R_p than ViSQOL v3 and an equivalent R_s. The overall performance of the model on the different codecs is consistent with that of ViSQOL v3. Both the proposed model and ViSQOL v3 perform homogeneously well on the experimented codecs, with R_p over 0.83 and R_s over 0.79. The estimation of the quality degraded by HE-AAC is, under this comparison, unexpectedly the worst among the three experimented codecs, even though the model is trained on excerpts encoded with HE-AAC and AAC. One possible reason is that the model takes the ground truth labeled by ViSQOL v3 and therefore captures exactly the pattern of how ViSQOL v3 evaluates the quality of signals encoded with HE-AAC. Therefore, the performance of the InceptionSE model on different codecs is very similar to that of ViSQOL v3.

In total, 24 items are listed in Table 6.4, excluding the 3 items that were used for training the listeners. The excerpt Siegfried02 is another example with low-energy content in the high frequency region. Compared to the estimation of ViSQOL v3, InceptionSE is clearly more robust to this type of excerpt and displays a more stable and reliable prediction than ViSQOL v3. In general, the model presents a comparatively high and robust performance across all signal categories.

Results on Stereo Listening Test-2 & Test-3

Listening test 2 and test 3 are the tests conducted with stereo excerpts coded at low and high bitrates, respectively. The corresponding combination of codecs and bitrates can be found in Tables 4.3 and 4.4. The performance of ViSQOL v3 and the InceptionSE model was evaluated against subjective quality scores, and the corresponding results are shown in Tables 6.5 and 6.6. The InceptionSE model results in a slightly better R_p than ViSQOL on the two stereo listening tests.

Moreover, the InceptionSE model displays a higher accuracy in estimating the quality of excerpts encoded at high bitrates. Albeit only supplementary tests, the results on the stereo tests display a strong correlation between the predictions of the InceptionSE model and the subjective quality scores.

Table 6.4: Performance of ViSQOL v3 and InceptionSE Model on items

Table 6.5: Performance of ViSQOL v3 and InceptionSE Model on stereo low bitrates test

Table 6.6: Performance of ViSQOL v3 and InceptionSE Model on stereo high bitrates test

Performance of the Stereo Model on USAC Verification Listening Tests

In the following Tables 6.7 to 6.9, the performance of ViSQOL v3 and the InceptionSE model (mono and stereo) on the mono low bitrates test, the stereo low bitrates test and the stereo high bitrates test is shown. The codecs included in the MUSHRA tests were AMR-WB+, HE-AAC-v1, and USAC. In the mono listening test, the stereo signal fed to the stereo model for comparison was dual mono (L = R). In the stereo listening tests, the stereo signal fed to ViSQOL v3 and the mono model for comparison was the mid signal: M = 1/2 (L + R).
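The two downmixing conventions mentioned above can be sketched as follows with NumPy; the function names are illustrative.

```python
import numpy as np

def mid_downmix(stereo: np.ndarray) -> np.ndarray:
    """Mid signal M = (L + R) / 2 fed to ViSQOL v3 and the mono model; stereo: (samples, 2)."""
    return 0.5 * (stereo[:, 0] + stereo[:, 1])

def dual_mono(mono: np.ndarray) -> np.ndarray:
    """Duplicate a mono signal into two identical channels (L = R) for the stereo model."""
    return np.stack([mono, mono], axis=1)
```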

Table 6.7: Performance of ViSQOL v3 and InceptionSE Model mono and stereo on mono low bitrates test

Table 6.8: Performance of ViSQOL v3 and InceptionSE Model mono and stereo on stereo low bitrates test

Table 6.9: Performance of ViSQOL v3 and InceptionSE Model mono and stereo on stereo high bitrates test

Further Applications in Audio Quality Assessment

Another embodiment may include migrating the current model from ViSQOL-labeled data to subjective listening test data. This may improve consistency with the perceived audio quality. Considering that organizing a giant dataset labeled with subjective quality scores is time-consuming and not realistic, transfer learning from the current model could be a possible solution confined to a limited labeled dataset. Transfer learning also enables re-training the InceptionSE model for new scenarios such as newly updated codecs. Transfer learning from the current model to similar tasks would greatly shorten the training time as well as the development cycle.

Furthermore, samples from generative models are still one of the toughest challenges that all current objective audio/speech quality assessment methods are confronted with.

Generative models generate new data instances that resemble the training data. Their application in audio coding is so far relatively limited, for example coded audio quality enhancement and neural-network-based audio decoding. The generated samples from GANs are so-called "adversarial examples", which interfere with an existing neural network and result in an incorrect output of the network. For example, an audio excerpt with carefully fine-tuned added noise would be perceptually identical to listeners.

However, this type of attack could fool state-of-the-art algorithms in audio processing and deep learning, which demonstrates a fundamental difference between the human auditory system and machines.

Further Applications in Generative Adversarial Networks (GANs)

In some embodiments, the InceptionSE model may be utilized as a discriminator in a GAN to fine-tune the generator. The InceptionSE model would be configured to discriminate between the real signal and the fake signal generated by the generator during the training process of the GAN.

Furthermore, with transfer learning, the InceptionSE model may be configured to be fine-tuned for specific content types and/or codec types.

Interpretation

Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.

In a similar manner, the term “processor” may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory to transform that electronic data into other electronic data that, e.g., may be stored in registers and/or memory. A “computer” or a “computing machine” or a “computing platform” may include one or more processors. The methodologies described herein are, in one example embodiment, performable by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken are included. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including main RAM and/or a static RAM, and/or ROM. A bus subsystem may be included for communicating between the components. The processing system further may be a distributed processing system with processors coupled by a network. If the processing system requires a display, such a display may be included, e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT) display. If manual data entry is required, the processing system also includes an input device such as one or more of an alphanumeric input unit such as a keyboard, a pointing control device such as a mouse, and so forth. The processing system may also encompass a storage system such as a disk drive unit. The processing system in some configurations may include a sound output device, and a network interface device. The memory subsystem thus includes a computer-readable carrier medium that carries computer-readable code (e.g., software) including a set of instructions to cause performing, when executed by one or more processors, one or more of the methods described herein. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. The software may reside in the hard disk, or may also reside, completely or at least partially, within the RAM and/or within the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute computer-readable carrier medium carrying computer-readable code. Furthermore, a computer-readable carrier medium may form, or be included in a computer program product.

In alternative example embodiments, the one or more processors operate as a standalone device or may be connected, e.g., networked to other processor(s), in a networked deployment, the one or more processors may operate in the capacity of a server or a user machine in server-user network environment, or as a peer machine in a peer-to-peer or distributed network environment. The one or more processors may form a personal computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

Note that the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

Thus, one example embodiment of each of the methods described herein is in the form of a computer-readable carrier medium carrying a set of instructions, e.g., a computer program that is for execution on one or more processors, e.g., one or more processors that are part of web server arrangement. Thus, as will be appreciated by those skilled in the art, example embodiments of the present disclosure may be embodied as a method, an apparatus such as a special purpose apparatus, an apparatus such as a data processing system, or a computer- readable carrier medium, e.g., a computer program product. The computer-readable carrier medium carries computer readable code including a set of instructions that when executed on one or more processors cause the processor or processors to implement a method. Accordingly, aspects of the present disclosure may take the form of a method, an entirely hardware example embodiment, an entirely software example embodiment or an example embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of carrier medium (e.g., a computer program product on a computer- readable storage medium) carrying computer-readable program code embodied in the medium.

The software may further be transmitted or received over a network via a network interface device. While the carrier medium is in an example embodiment a single medium, the term “carrier medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “carrier medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by one or more of the processors and that cause the one or more processors to perform any one or more of the methodologies of the present disclosure. A carrier medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical, magnetic disks, and magneto optical disks. Volatile media includes dynamic memory, such as main memory. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise a bus subsystem. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications. For example, the term “carrier medium” shall accordingly be taken to include, but not be limited to, solid-state memories, a computer product embodied in optical and magnetic media; a medium bearing a propagated signal detectable by at least one processor or one or more processors and representing a set of instructions that, when executed, implement a method; and a transmission medium in a network bearing a propagated signal detectable by at least one processor of the one or more processors and representing the set of instructions.

It will be understood that the steps of methods discussed are performed in one example embodiment by an appropriate processor (or processors) of a processing (e.g., computer) system executing instructions (computer-readable code) stored in storage. It will also be understood that the disclosure is not limited to any particular implementation or programming technique and that the disclosure may be implemented using any appropriate techniques for implementing the functionality described herein. The disclosure is not limited to any particular programming language or operating system.

Reference throughout this disclosure to “one embodiment”, “some embodiments” or “an example embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrases “in one embodiment”, “in some embodiments” or “in an example embodiment” in various places throughout this disclosure are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more example embodiments.

As used herein, unless otherwise specified the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common object, merely indicate that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.

It should be appreciated that in the above description of example embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single example embodiment, Fig., or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the Description are hereby expressly incorporated into this Description, with each claim standing on its own as a separate example embodiment of this disclosure.

Furthermore, while some example embodiments described herein include some but not other features included in other example embodiments, combinations of features of different example embodiments are meant to be within the scope of the disclosure, and form different example embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed example embodiments can be used in any combination.

In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Thus, while there has been described what are believed to be the best modes of the disclosure, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the disclosure, and it is intended to claim all such changes and modifications as fall within the scope of the disclosure. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added or deleted to methods described within the scope of the present disclosure.

In the following, further details of the present disclosure will be described in a non-limiting manner by a set of enumerated example embodiments, EEEs:

EEE1. A deep-learning-based system for determining an indication of an audio quality of an input audio frame, the system comprising: at least one inception block configured to receive at least one representation of an input audio frame and map the at least one representation of the input audio frame into a feature map, wherein the at least one inception block comprises: a plurality of stacked convolutional layers configured to operate in parallel paths, wherein at least one convolutional layer of the plurality of stacked convolutional layers comprises an m x n sized kernel, wherein the integer m is different from the integer n; and at least one fully connected layer configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, wherein the at least one fully connected layer is configured to determine the indication of the audio quality of the input audio frame.

EEE2. The system of EEE1, wherein the plurality of stacked convolutional layers comprises at least one convolutional layer comprising a horizontal kernel and at least one convolutional layer comprising a vertical kernel.

EEE3. The system of EEE2, wherein the horizontal kernel is configured to learn temporal dependencies of the input audio frame.

EEE4. The system of EEE2, wherein the vertical kernel is configured to learn timbral dependencies of the input audio frame.

EEE5. The system of EEE1, wherein the at least one inception block further comprises a squeeze-and-excitation, SE, layer.

EEE6. The system of EEE5, wherein the squeeze-and-excitation layer is applied after the last stacked convolutional layer of the plurality of stacked convolutional layers.

EEE7. The system of EEE1, wherein the at least one inception block further comprises a pooling layer.

EEE8. The system of EEE7, wherein the pooling layer comprises an average pooling.

EEE9. The system of EEE1, wherein the at least one representation of an input audio frame comprises a representation of a clean reference input audio frame and a representation of a degraded input audio frame.

EEE10. The system of EEE1, wherein the indication of the audio quality comprises at least one of a mean opinion score, MOS, or a multiple stimuli with hidden reference and anchor, MUSHRA.

EEE11. The system of EEE1, wherein the at least one fully connected layer comprises a feed forward neural network.

EEE12. A method of operating a deep-learning-based system for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one inception block and at least one fully connected layer, the method comprising: mapping, by the at least one inception block, at least one representation of the input audio frame into a feature map; and predicting, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the feature map.

EEE13. The method of EEE12, wherein the at least one representation of the input audio frame comprises a representation of a clean reference input audio frame and a representation of a degraded input audio frame.

EEE14. The method of EEE12, wherein the indication of the audio quality comprises at least one of a mean opinion score, MOS, or a multiple stimuli with hidden reference and anchor, MUSHRA.

Alternatively, or additionally, there is described as a further enumerated example embodiment a deep-learning-based system for determining an indication of an audio quality of an input audio frame, the system comprising: at least one inception block configured to receive and process at least one representation of the input audio frame; and at least one fully connected layer configured to determine the indication of the audio quality of the input audio frame based on the output of the at least one inception block.

Alternatively, or additionally, there is described as a further enumerated example embodiment a method of operating a deep-learning-based system for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one inception block and at least one fully connected layer, the method comprising: receiving and processing, by the at least one inception block, at least one representation of the input audio frame; and determining, by the at least one fully connected layer, the indication of the audio quality of the input audio frame based on the output of the at least one inception block.

Alternatively, or additionally, there is described as a further enumerated example embodiment a deep-learning-based system for determining an indication of an audio quality of an input audio frame, the system comprising: at least one processing block configured to receive and process at least one representation of the input audio frame, and to determine and output the indication of the audio quality of the input audio frame.

Alternatively, or additionally, there is described as a further enumerated example embodiment a method of operating a deep-learning-based system for determining an indication of an audio quality of an input audio frame, wherein the system comprises at least one processing block, the method comprising: receiving and processing at least one representation of the input audio frame; and determining and outputting the indication of the audio quality of the input audio frame.