

Title:
HARMONICS BASED TARGET SPEECH EXTRACTION NETWORK
Document Type and Number:
WIPO Patent Application WO/2022/204612
Kind Code:
A1
Abstract:
An apparatus may include a processor and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to generate a weighting vector based on a feature map of reference audio of a speaker, generate a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components includes one or more harmonic frequencies of the voice of the speaker, and extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal that includes the voice of the speaker and other sounds.

Inventors:
ZHANG YI (US)
LIN YUAN (US)
Application Number:
PCT/US2022/025476
Publication Date:
September 29, 2022
Filing Date:
April 20, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G10L21/0208; G10L21/0224; G10L21/0232; G10L21/0272; G10L25/84; G10L25/87; G10L25/90; G10L25/93
Foreign References:
US20210082438A12021-03-18
Other References:
TU YOUZHI: "Deep Speaker Embedding for Robust Speaker Verification", THESIS, HONG KONG POLYTECHNIC UNIVERSITY, 1 October 2021 (2021-10-01), Hong Kong Polytechnic University, XP055976842, Retrieved from the Internet [retrieved on 20221101]
RIKHYE RAJEEV; WANG QUAN; LIANG QIAO; HE YANZHANG; MCGRAW IAN: "Multi-User Voicefilter-Lite via Attentive Speaker Embedding", 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), IEEE, 13 December 2021 (2021-12-13), pages 275 - 282, XP034076961, DOI: 10.1109/ASRU51503.2021.9687870
Attorney, Agent or Firm:
BRATSCHUN, Thomas D. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A computer-implemented method comprising: generating a weighting vector based on a feature map of reference audio of a speaker; generating a speaker embedding based on the feature map and the weighting vector; wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components includes one or more harmonic frequencies of the voice of the speaker; and extracting, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal that includes the voice of the speaker and other sounds.

2. The method of claim 1, wherein the reference audio is a two-dimensional time- frequency domain spectrogram of the voice of the speaker.

3. The method of claim 1, wherein extracting audio of the speaker further comprises: filtering the speech components of the voice of the speaker from the speech mixture; and generating a masked spectrogram of the speech components filtered from the speech mixture.

4. The method of claim 3, further comprising: generating time domain enhanced speech based, at least in part, on a combination of the masked spectrogram and a spectrogram of the speech mixture.

5. The method of claim 1, wherein generating the weighting vector comprises: aggregating information in each frequency bin of the feature map to produce the weighting vector having a dimension equal to the number of frequency bins (F) x 1 ; and determining weights for each element of the weighting vector, wherein each element corresponds to a respective frequency bin.

6. The method of claim 5, wherein aggregating information in each frequency bin of the feature map includes global average pooling of the frequency bins of the feature map, and wherein determining weights further comprises applying a self-attention function to the weighting vector.

7. The method of claim 1, wherein generating speaker embedding further comprises: generating a weighted feature map by pointwise multiplication of the weighting vector and feature map; and generating a transformed feature map based on the weighted feature map, wherein generating the transformed feature map includes applying a frequency transform matrix to each of a plurality of time slices of the weighted feature map, wherein the frequency transformation matrix is a trainable matrix that includes information regarding global frequency correlations of the feature map.

8. The method of claim 7, further comprising: optimizing one or more elements of the frequency transformation matrix by adjusting one or more elements of the frequency transformation matrix such that a mean- squared error of the audio of the speaker extracted from the speech mixture is reduced.

9. An apparatus, comprising: a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: generate a weighting vector based on a feature map of reference audio of a speaker; generate a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components includes one or more harmonic frequencies of the voice of the speaker; and extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal that includes the voice of the speaker and other sounds.

10. The apparatus of claim 9, wherein the reference audio is a two-dimensional time-frequency domain spectrogram of the voice of the speaker.

11. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: filter the speech components of the voice of the speaker from the speech mixture; generate a masked spectrogram of the speech components filtered from the speech mixture.

12. The apparatus of claim 11, wherein the set of instructions is further executable by the processor to: generate time domain enhanced speech based, at least in part, on a combination of the masked spectrogram and a spectrogram of the speech mixture.

13. The apparatus of claim 9, wherein generating the weighting vector comprises: aggregating information in each frequency bin of the feature map to produce the weighting vector having a dimension equal to the number of frequency bins (F) x 1 ; and determining weights for each element of the weighting vector, wherein each element corresponds to a respective frequency bin.

14. The apparatus of claim 13, wherein aggregating information in each frequency bin of the feature map includes global average pooling of the frequency bins of the feature map, and wherein determining weights further comprises applying a self- attention function to the weighting vector.

15. The apparatus of claim 9, wherein generating speaker embedding further comprises: generating a weighted feature map by pointwise multiplication of the weighting vector and feature map; and generating a transformed feature map based on the weighted feature map, wherein generating the transformed feature map includes applying a frequency transform matrix to each of a plurality of time slices of the weighted feature map, wherein the frequency transformation matrix is a trainable matrix that includes information regarding global frequency correlations of the feature map.

16. A non-transitory computer readable medium having encoded thereon a set of instructions executable by a processor to: generate a weighting vector based on a feature map of reference audio of a speaker; and generate a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components includes one or more harmonic frequencies of the voice of the speaker; and extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal that includes the voice of the speaker and other sounds.

17. The non-transitory computer readable medium of claim 16, wherein generating the weighting vector comprises: aggregating information in each frequency bin of the feature map to produce the weighting vector having a dimension equal to the number of frequency bins (F) x 1 ; and determining weights for each element of the weighting vector, wherein each element corresponds to a respective frequency bin.

18. The non-transitory computer readable medium of claim 17, wherein aggregating information in each frequency bin of the feature map includes global average pooling of the frequency bins of the feature map, and wherein determining weights further comprises applying a self-attention function to the weighting vector.

19. The non-transitory computer readable medium of claim 16, wherein generating speaker embedding further comprises: generating a weighted feature map by pointwise multiplication of the weighting vector and feature map; and generating a transformed feature map based on the weighted feature map, wherein generating the transformed feature map includes applying a frequency transform matrix to each of a plurality of time slices of the weighted feature map, wherein the frequency transformation matrix is a trainable matrix that includes information regarding global frequency correlations of the feature map.

20. The non-transitory computer readable medium of claim 19, wherein the set of instructions is further executable by the processor to: optimize one or more elements of the frequency transformation matrix by adjusting one or more elements of the frequency transformation matrix such that a mean-squared error of the audio of the speaker extracted from the speech mixture is reduced.

Description:
HARMONICS BASED TARGET SPEECH EXTRACTION NETWORK

COPYRIGHT STATEMENT

[0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD

[0002] The present disclosure relates, in general, to methods, systems, and apparatuses for speech extraction.

BACKGROUND

[0003] Isolating and extracting speech from audio of a crowded environment or with multiple speakers presents various challenges. Existing techniques for target speech extraction utilize convolution structures to extract speaker embeddings, which inherently focus on local correlations. However, such models do not exploit the harmonics of a target speaker's voice. Thus, a harmonics based target speech extraction network is provided.

SUMMARY

[0004] Tools and techniques are provided for a harmonics based target speech extraction network.

[0005] A method includes generating a weighting vector based on a feature map of reference audio of a speaker, generating a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components include one or more harmonic frequencies of the voice of the speaker, and extracting, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal in which the voice of the speaker is mixed with other sounds.

[0006] An apparatus includes a processor and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executable by the processor to generate a weighting vector based on a feature map of reference audio of a speaker, and generate a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components include one or more harmonic frequencies of the voice of the speaker. The instructions may further be executable by the processor to extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal in which the voice of the speaker is mixed with other sounds.

[0007] A system may include a speaker encoder and a speech extractor. The speaker encoder may be configured to generate a speaker embedding based on reference audio of a speaker. The speaker encoder may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executable by the processor to generate a weighting vector based on a feature map of the reference audio of the speaker, and generate the speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components include one or more harmonic frequencies of the voice of the speaker. The speech extractor may be coupled to the speaker encoder and configured to extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal in which the voice of the speaker is mixed with other sounds.

[0008] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided therein.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.

[0010] Fig. 1 is a schematic block diagram of a system for a harmonics based speech extraction network, in accordance with various embodiments;

[0011] Fig. 2 is a functional block diagram of a harmonic block of a harmonics based speech extraction network, in accordance with various embodiments;

[0012] Fig. 3 is a flow diagram of a method for harmonics based speech extraction, in accordance with various embodiments;

[0013] Fig. 4 is a schematic block diagram of a computer system for a harmonics based speech extraction network, in accordance with various embodiments.

DETAILED DESCRIPTION OF EMBODIMENTS

[0014] Various embodiments provide tools and techniques for a harmonics based speech extraction network.

[0015] In some embodiments, a method for harmonics based speech extraction is provided. A method includes generating a weighting vector based on a feature map of reference audio of a speaker, generating a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components include one or more harmonic frequencies of the voice of the speaker, and extracting, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal in which the voice of the speaker is mixed with other sounds.

[0016] In some examples, the reference audio is a two-dimensional time-frequency domain spectrogram of the voice of the speaker. In some examples, extracting audio of the speaker may further include filtering the speech components of the voice of the speaker from the speech mixture, and generating a masked spectrogram of the speech components filtered from the speech mixture. In some examples, the method may further include generating time domain enhanced speech based, at least in part, on a combination of the masked spectrogram and a spectrogram of the speech mixture.

[0017] In some examples, generating the weighting vector may include aggregating information in each frequency bin of the feature map to produce the weighting vector having a dimension equal to the number of frequency bins (F) x 1 , and determining weights for each element of the weighting vector, wherein each element corresponds to a respective frequency bin. In some examples, aggregating information in each frequency bin of the feature map may include global average pooling of the frequency bins of the feature map, and wherein determining weights further comprises applying a self-attention function to the weighting vector.

[0018] In some examples, generating speaker embedding may further include generating a weighted feature map by pointwise multiplication of the weighting vector and feature map, and generating a transformed feature map based on the weighted feature map, wherein generating the transformed feature map includes applying a frequency transform matrix to each of a plurality of time slices of the weighted feature map, wherein the frequency transformation matrix is a trainable matrix that includes information regarding global frequency correlations of the feature map.

[0019] In some examples, the method may further include optimizing one or more elements of the frequency transformation matrix by adjusting one or more elements of the frequency transformation matrix such that a mean-squared error of the audio of the speaker extracted from the speech mixture is reduced.

[0020] In some embodiments, an apparatus for a harmonics based speech extraction network is provided. An apparatus includes a processor and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executable by the processor to extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal in which the voice of the speaker is mixed with other sounds.

[0021] In some examples, the reference audio may be a two-dimensional time- frequency domain spectrogram of the voice of the speaker. In some examples, the set of instructions may further be executable by the processor to filter the speech components of the voice of the speaker from the speech mixture, and generate a masked spectrogram of the speech components filtered from the speech mixture. In some examples, the set of instructions may further be executable by the processor to: generate time domain enhanced speech based, at least in part, on a combination of the masked spectrogram and a spectrogram of the speech mixture.

[0022] In some examples, generating the weighting vector may further include aggregating information in each frequency bin of the feature map to produce the weighting vector having a dimension equal to the number of frequency bins (F) x 1 , and determining weights for each element of the weighting vector, wherein each element corresponds to a respective frequency bin. In further examples, aggregating information in each frequency bin of the feature map includes global average pooling of the frequency bins of the feature map, and wherein determining weights further comprises applying a self-attention function to the weighting vector.

[0023] In some examples, generating speaker embedding further includes generating a weighted feature map by pointwise multiplication of the weighting vector and feature map, and generating a transformed feature map based on the weighted feature map, wherein generating the transformed feature map includes applying a frequency transform matrix to each of a plurality of time slices of the weighted feature map, wherein the frequency transformation matrix is a trainable matrix that includes information regarding global frequency correlations of the feature map.

[0024] In some further examples, the set of instructions may further be executable by the processor to optimize one or more elements of the frequency transformation matrix by adjusting one or more elements of the frequency transformation matrix such that a mean- squared error of the audio of the speaker extracted from the speech mixture is reduced.

[0025] In some embodiments, a non-transitory computer readable medium having encoded thereon a set of instructions for a harmonics based speech extraction network is provided. The non-transitory computer readable medium may have encoded thereon a set of instructions executable by a processor to generate a weighting vector based on a feature map of reference audio of a speaker, and generate a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components include one or more harmonic frequencies of the voice of the speaker. The set of instructions may further be executable by the processor to extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal in which the voice of the speaker is mixed with other sounds.

[0026] In some examples, generating the weighting vector may include aggregating information in each frequency bin of the feature map to produce the weighting vector having a dimension equal to the number of frequency bins (F) x 1, and determining weights for each element of the weighting vector, wherein each element corresponds to a respective frequency bin. In some examples, aggregating information in each frequency bin of the feature map may include global average pooling of the frequency bins of the feature map, and wherein determining weights further comprises applying a self-attention function to the weighting vector.

[0027] In some examples, generating the speaker embedding may further include generating a weighted feature map by pointwise multiplication of the weighting vector and feature map, and generating a transformed feature map based on the weighted feature map, wherein generating the transformed feature map includes applying a frequency transform matrix to each of a plurality of time slices of the weighted feature map, wherein the frequency transformation matrix is a trainable matrix that includes information regarding global frequency correlations of the feature map.

[0028] In some examples, the set of instructions may further be executable by the processor to optimize one or more elements of the frequency transformation matrix by adjusting one or more elements of the frequency transformation matrix such that a mean- squared error of the audio of the speaker extracted from the speech mixture is reduced.

[0029] In the following description, for the purposes of explanation, numerous details are set forth to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments may be practiced without some of these details. In other instances, structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.

[0030] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.

[0031] The various embodiments include, without limitation, methods, systems, apparatuses, and/or software products. Merely by way of example, a method might comprise one or more procedures, any or all of which may be executed by a computer system. Correspondingly, an embodiment might provide a computer system configured with instructions to perform one or more procedures in accordance with methods provided by various other embodiments. Similarly, a computer program might comprise a set of instructions that are executable by a computer system (and/or a processor therein) to perform such operations. In many cases, such software programs are encoded on physical, tangible, and/or non-transitory computer readable media (such as, to name but a few examples, optical media, magnetic media, and/or the like).

[0032] Various embodiments described herein, embodying software products and computer-performed methods, represent tangible, concrete improvements to existing technological areas, including, without limitation, extraction of target speech from audio. Specifically, implementations of various embodiments provide a way to identify and extract speech originating from a specific speaker utilizing a harmonics-based approach. Thus, the framework for a harmonics based speech extraction network, set forth below, allows for a more robust technique for target speech extraction.

[0033] Fig. 1 is a schematic block diagram of a system 100 for a harmonics based speech extraction network, in accordance with various embodiments. The system 100 includes a reference audio input 105, speaker encoder 110, two dimensional (2D) convolution layer 115, one or more harmonic blocks 120, linear layer and softmax function 125, probability vector 130, cross-entropy (CE) loss function 135, speech mixture input 140, speech extractor 145, one dimensional (1D) convolution layer 150, concatenation block 155, fully-connected layer 160, enhanced speech output 165, and mean-squared error (MSE) loss function 170. It should be noted that the various components of the system 100 are schematically illustrated in Fig. 1, and that modifications to the various components and other arrangements of system 100 may be possible and in accordance with the various embodiments.

[0034] In various embodiments, reference audio 105 may be provided to the speaker encoder 110. Speaker encoder 110 may include a 2D convolution layer 115 and one or more harmonic blocks 120. The output of the speaker encoder 110 may be coupled to a linear layer and softmax function 125, which may output probability vector 130 to a CE loss function 135. The output of the speaker encoder 110 may further be provided to speech extractor 145 as an input. A speech mixture 140 may be provided to the speech extractor 145. The speech extractor 145 may include a 1D convolution layer 150, concatenation block 155, and a fully-connected (FC) layer 160. The output of the FC layer 160 may be combined with the speech mixture 140 to produce an enhanced speech output 165, which may be provided to an MSE loss function 170.

[0035] In various embodiments, the system 100 and/or the components of the system 100, including, without limitation, speaker encoder 110 and its sub-components (e.g., 2D convolution layer 115 and one or more harmonic blocks 120), linear layer and softmax function 125, CE loss function 135, speech extractor 145 and its sub-components (e.g., 1D convolution layer 150, concatenation block 155, and FC layer 160), and MSE loss function 170 may be implemented in hardware, software, or a combination of both hardware and software. For example, in some embodiments, the system 100 may include one or more signal processors executing logic. Thus, the speaker encoder 110, speech extractor 145, CE loss function 135, and MSE loss function 170 may be implemented in logic executed by the one or more signal processors. In further embodiments, the system 100 may be implemented in dedicated hardware (e.g., digital logic circuits) or custom integrated circuits (ICs).

[0036] In various embodiments, the reference audio 105 may include audio (e.g., an audio clip or recording) of a speaker. In various examples, the reference audio 105 may include a spectrogram representation of an analog audio signal (e.g., time-frequency (T-F) domain spectrogram), over time. For example, in some embodiments, the reference audio 105 may include a T-F spectrogram representation of a decoded digital audio file. Specifically, the spectrogram may represent audio of the speaker's speech visually, as a 2D image, with frequency along one axis (e.g., the y-axis), and time along the other axis (e.g., the x-axis). In some examples, a spectrogram may be produced via Fourier transform (such as a fast Fourier transform (FFT) and/or short time Fourier transform (STFT)) of an analog audio signal, which may be sampled over time. The Fourier transform of each sample of the audio signal may capture the frequency components of the respective sample. Accordingly, the frequency components of each sample may be plotted sequentially in time, with frequency oriented along the y-axis, and time (e.g., sample) along the x-axis.
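Merely as a non-limiting illustration of the spectrogram computation described above, the following sketch shows one way a T-F magnitude spectrogram might be produced with a short time Fourier transform. The use of PyTorch, the 512-sample frame, 128-sample hop, Hann window, and 16 kHz sample rate are illustrative assumptions and are not specified by this disclosure.

```python
# Minimal sketch of producing a time-frequency magnitude spectrogram with an STFT.
# The frame length, hop size, and sample rate are illustrative assumptions.
import torch

def magnitude_spectrogram(waveform: torch.Tensor,
                          n_fft: int = 512,
                          hop_length: int = 128) -> torch.Tensor:
    """waveform: (num_samples,) mono audio. Returns an (F, T) magnitude spectrogram."""
    window = torch.hann_window(n_fft)
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)   # (F, T) complex STFT
    return stft.abs()                                         # magnitude per T-F bin

# Example: one second of audio at an assumed 16 kHz sample rate.
reference_audio = torch.randn(16000)
spec = magnitude_spectrogram(reference_audio)
print(spec.shape)  # torch.Size([257, 126]) -> F frequency bins x T frames
```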

[0037] In various examples, the reference audio 105 may include audio (or a spectrogram of the audio) of the speaker speaking utterances, such as, without limitation, words and phrases, or other sounds of their voice. Reference audio 105 may, in some examples, include a script or passage that is read by the speaker. In further examples, the reference audio 105 may further capture different manners of speech, such as, without limitation, yelling, whispering, singing, shifts in tone, or other various manners of speech.

[0038] In various embodiments, the speaker encoder 110 may be configured to generate a speaker embedding based on the reference audio 105. In some examples, the speaker encoder 110 may convert the reference audio 105 into a speaker embedding by passing the reference audio 105 through the 2D convolution layer and one or more cascaded harmonic blocks 120. A speaker embedding of a speaker may include voiceprint information of the speaker, and may be configured to direct the attention of the speaker extraction network (e.g., speech extractor 145) to the voice of the speaker.

[0039] In various embodiments, the 2D convolution layer 115 may be configured to generate a feature map based on a target feature. For example, the 2D convolution layer 115 may be configured to perform a 2D convolution on the reference audio 105 with the target feature. The target feature may include one or more local features of a T-F spectrogram. Specifically, the 2D convolution layer 115 may perform a 2D convolution with the target feature, where the target feature(s) may be a spectrogram (e.g., image) representing a localized audio feature, such as a sound, word or part of a word, or a phrase. Thus, the 2D convolution layer 115 may identify local features of the reference audio 105 (e.g., spectrogram) visually corresponding to a target feature. In this way, in some examples, it may be identified when the speaker makes a target sound or speaks a target word in the reference audio 105.
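As a hedged sketch of the kind of 2D convolution described in paragraph [0039], the following shows a single 2D convolution layer producing a feature map from a one-channel T-F spectrogram. The channel count, kernel size, and padding are illustrative assumptions rather than parameters taken from this disclosure.

```python
# Sketch of a 2D convolution layer producing a feature map from a T-F spectrogram.
# Channel count and kernel size are illustrative assumptions.
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

spec = torch.randn(1, 1, 257, 126)        # (batch, channel, F, T) spectrogram
feature_map = conv2d(spec)                # (1, 32, 257, 126) local T-F features
print(feature_map.shape)
```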

[0040] The output of the 2D convolution layer 115 may be provided to the one or more harmonic blocks 120. In some examples, the output of the 2D convolution layer 115 may be a feature map of the reference audio 105 spectrogram, based on the target feature. In various examples, the one or more harmonic blocks 120 may be cascaded, and correspond to harmonics of the speaker's voice. Typical approaches to speaker recognition utilize a convolutional neural network (CNN) or recurrent neural network (RNN) structure to extract speaker embeddings. Conventional CNN or RNN kernels, however, do not capture the harmonics in the time-frequency (T-F) spectrogram. The reason is that feature correlations in images are mostly local, whereas correlations along the frequency axis of a speech spectrogram are mostly non-local.

[0041] Thus, in various embodiments, the one or more harmonic blocks 120 may be used to generate a voice print of the speaker. Specifically, each of an N-number of harmonic blocks may correspond to an N-number of harmonics corresponding to the overtones (e.g., higher frequency harmonics) of a base frequency (e.g., a fundamental tone) of the speaker's voice. At any given point of time, the value at a base frequency is strongly correlated with the values at its overtones. Therefore, in various examples, the one or more harmonic blocks 120 may capture global correlations between harmonics of the speaker's voice in the reference audio 105, thereby producing a voice print. The one or more harmonic blocks 120 are described in greater detail below, with respect to Fig. 2.

[0042] Accordingly, in various embodiments, the speaker encoder 110, and the one or more harmonic blocks 120, may output an N-dimensional speaker embedding corresponding to the number of harmonic blocks of the one or more harmonic blocks 120. For example, in some embodiments, the speaker embedding may be a 256-dimensional embedding. In some examples, the speaker embedding may be a voice print of the speaker specific to the target feature used in the 2D convolution layer 115. In further examples, the speaker embedding may be a voice print generally related to the speaker, and generated based on a plurality of target features. In some embodiments, the softmax loss of the output of the one or more harmonic blocks 120 (e.g., the speaker embedding) may further be determined. For example, in some embodiments, the output of the one or more harmonic blocks 120 may be provided to the linear layer and softmax function 125. The linear layer and softmax function 125 may comprise a softmax output layer configured to output a probability vector 130. The probability vector 130 may, in some examples, correspond to a probability for a given class (e.g., harmonic or set of one or more harmonics) corresponding to each respective harmonic block of the one or more harmonic blocks. A CE loss of the probability vector 130 may then, in some embodiments, be determined via the CE loss function 135 and used to optimize the one or more harmonic blocks 120, such as a squeeze and excitation network of the one or more harmonic blocks 120, as will be described with respect to Fig. 2.
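One plausible, non-authoritative realization of the classification head and CE loss described above is sketched below. The 256-dimensional embedding follows the example in the text, while the number of speaker classes, the linear head, and the batch size are assumptions for illustration.

```python
# Sketch of the linear + softmax head over a speaker embedding, trained with a CE loss.
# The number of speaker classes and the batch size are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding_dim, num_speakers = 256, 1000              # 256-dim embedding per the text
head = nn.Linear(embedding_dim, num_speakers)

speaker_embedding = torch.randn(4, embedding_dim)    # batch of embeddings from the encoder
logits = head(speaker_embedding)
prob_vector = F.softmax(logits, dim=-1)              # probability vector (element 130 of Fig. 1)

target_ids = torch.randint(0, num_speakers, (4,))    # ground-truth speaker labels
ce_loss = F.cross_entropy(logits, target_ids)        # CE loss (element 135 of Fig. 1)
ce_loss.backward()                                   # gradients would flow back into the encoder
```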

[0043] In various embodiments, the speech extractor 145 may be configured to receive, as an input, a speech mixture 140. As with the reference audio 105, in various examples, the speech mixture 140 may be a spectrogram of mixed audio that includes the speaker's voice. In some examples, the speech mixture 140 may include audio in which multiple people, including the speaker, are speaking concurrently. In other examples, the speaker's voice may be mixed with music or environmental noise.

[0044] The speech extractor 145 may further receive, as an input, the speaker embedding from the one or more harmonic blocks 120 of the speaker encoder 110. The speaker embedding may be used, by the speech extractor 145, to predict a soft mask to filter out the target speech component. In some examples, to generate the soft mask, the speech mixture 140 may be passed through a 1D convolution layer 150, which may produce an output vector at each sample (e.g., at each time). The output of the 1D convolution layer 150 at each sample may then be passed to the concatenation block 155, at which the output of the 1D convolution layer 150 is concatenated with the speaker embedding. In some examples, the concatenation block 155 may ensure the speaker embedding has the same dimensions (e.g., in the T-F domain) as the speech mixture 140.

[0045] The output of the concatenation block may then be passed to the fully-connected layer 160, to produce a masked spectrogram. For example, each sample output by the 1D convolution layer 150 may be masked by the speaker embedding, via the fully-connected layer 160, to filter out target speech components (e.g., harmonic information of the speaker's voice) and produce a masked spectrogram. In some examples, the masked spectrogram (also referred to as a masked magnitude spectrogram) may be combined with a noisy phase spectrogram (e.g., the speech mixture 140), to generate time-domain enhanced speech 165 via inverse short time Fourier transform (ISTFT). In some examples, an MSE loss of the spectrogram of the enhanced speech 165 may further be calculated, via the MSE loss function 170, which may be used to optimize the speech extractor 145, and/or one or more harmonic blocks 120 of the speaker encoder 110.
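The extractor path of Fig. 1 (1D convolution over the mixture spectrogram, per-frame concatenation with the speaker embedding, and a fully-connected layer predicting a soft mask) might be sketched as follows. All layer sizes, the kernel width, and the sigmoid mask activation are assumptions; this is one possible reading of the figure, not the disclosed implementation.

```python
# Sketch of the speech extractor: 1D conv over the mixture spectrogram, per-frame
# concatenation with the speaker embedding, and an FC layer predicting a soft mask.
# All layer sizes and the sigmoid activation are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechExtractor(nn.Module):
    def __init__(self, num_freq_bins: int = 257, embed_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.conv1d = nn.Conv1d(num_freq_bins, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden + embed_dim, num_freq_bins)

    def forward(self, mixture_spec: torch.Tensor, speaker_embedding: torch.Tensor):
        # mixture_spec: (batch, F, T) magnitude spectrogram of the speech mixture
        # speaker_embedding: (batch, embed_dim) from the speaker encoder
        h = self.conv1d(mixture_spec)                              # (batch, hidden, T)
        emb = speaker_embedding.unsqueeze(-1).expand(-1, -1, h.shape[-1])
        h = torch.cat([h, emb], dim=1)                             # concatenate at every frame
        mask = torch.sigmoid(self.fc(h.transpose(1, 2)))           # (batch, T, F) soft mask
        mask = mask.transpose(1, 2)                                # (batch, F, T)
        return mask * mixture_spec                                 # masked magnitude spectrogram

extractor = SpeechExtractor()
masked = extractor(torch.randn(2, 257, 126).abs(), torch.randn(2, 256))
print(masked.shape)  # torch.Size([2, 257, 126])
```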

[0046] Fig. 2 is a schematic diagram of a harmonic block 200 of a harmonics based speech extraction network, in accordance with various embodiments. The harmonic block 200 includes input signal 205, squeeze and excitation network (SENet) 210, global pooling block 215, first fully-connected (FC) layer 220, rectified linear unit (ReLU) 225, second FC layer 230, sigmoid function 235, frequency transformation matrix (FTM) 240, and concatenation block 245. It should be noted that the various components of harmonic block 200 are schematically illustrated in Fig. 2, and that modifications to the various components and other arrangements of harmonic block 200 may be possible and in accordance with the various embodiments.

[0047] In various embodiments, the input signal 205 may include the output of a 2D convolution layer, such as 2D convolution layer 115 of Fig. 1. For example, the input signal 205 may include, without limitation, a feature map of a target feature (e.g., a sound, word, part of a word, or a phrase) in a reference audio spectrogram. In various examples, the input signal 205 may be a 2D matrix having F x T dimensions, where F is the number of sampled frequencies (e.g., frequency bins) and T is time.

[0048] The SENet 210 may be configured to receive the input signal 205 and model interdependencies between channels, in this example harmonics. Specifically, the SENet 210 may exploit channel interdependencies so that the network, in this case the harmonic block, can increase its sensitivity to informative features with access to global information.

[0049] Accordingly, in various embodiments, the SENet 210, and more specifically the global pooling block 215, may squeeze the temporal information into a single value (or weight) for each frequency, and then rescale the input feature with the per-frequency weights. As such, informative frequencies like overtones may be given larger weights and correspondingly enhance the harmonic structure of the target feature. In some examples, a first FC layer 220, followed by a ReLU layer 225, and a second FC layer 230, followed by sigmoid activation function 235, may be used to learn the weights and better fit complex nonlinearity between channels.
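A minimal sketch of the frequency-wise squeeze-and-excitation behavior described in paragraph [0049] follows: time is pooled away to give one value per frequency bin, two FC layers with ReLU and sigmoid activations learn per-frequency weights, and the feature map is rescaled by those weights. The bottleneck reduction ratio and the (batch, F, T) tensor layout are illustrative assumptions.

```python
# Sketch of a frequency-wise squeeze-and-excitation block (cf. elements 215-235 of Fig. 2):
# temporal information is pooled per frequency bin, FC-ReLU-FC-sigmoid learns per-frequency
# weights, and the input feature map is rescaled by those weights.
# The bottleneck reduction ratio is an illustrative assumption.
import torch
import torch.nn as nn

class FrequencySE(nn.Module):
    def __init__(self, num_freq_bins: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(num_freq_bins, num_freq_bins // reduction)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(num_freq_bins // reduction, num_freq_bins)
        self.sigmoid = nn.Sigmoid()

    def forward(self, feature_map: torch.Tensor):
        # feature_map: (batch, F, T)
        squeezed = feature_map.mean(dim=-1)                  # global average pool over time -> (batch, F)
        weights = self.sigmoid(self.fc2(self.relu(self.fc1(squeezed))))  # (batch, F) weighting vector
        weighted = feature_map * weights.unsqueeze(-1)       # pointwise rescale of each frequency bin
        return weighted, weights

se = FrequencySE(num_freq_bins=256)
weighted_map, weighting_vector = se(torch.randn(2, 256, 126))
print(weighted_map.shape, weighting_vector.shape)  # (2, 256, 126) (2, 256)
```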

[0050] The SENet 210 may, in various embodiments, generate a 1D vector (e.g., having dimension F x 1) from the 2D feature map of the input signal, in which each element of the 1D vector corresponds to a respective frequency, and the value of each element corresponds to a weight at the respective frequency. In some examples, the output of the SENet 210 may be referred to as a weighting vector.

[0051] In some embodiments, a pointwise multiplication may be performed of the 2D feature map (e.g., input signal 205) and the weighting vector output by the SENet 210. In some examples, the weighted feature map may then undergo a frequency transform via FTM 240. Accordingly, in various embodiments, the harmonic block 200 may include a trainable FTM 240, which may be applied to the weighted input signal 205. Specifically, FTM 240 may be applied to each slice of the feature map (e.g., input signal 205) corresponding to a respective point in time. A transformed feature map may be produced by the FTM 240, where each time-frequency (T-F) bin of the transformed feature map contains information from all the frequency bands, and therefore captures global information.

[0052] In various examples, the transformed feature map may be concatenated with the feature map of the input signal 205, and fused together and normalized. In some examples, fusing of the concatenated feature maps may include downsampling via 1x1 convolution. In some examples, a skip-connection scheme may be used to concatenate input signal 205 (e.g., feature map) with the output of the FTM 240 (e.g., transformed feature map), which may in turn speed learning and reduce the impact of vanishing gradients. In some further examples, a CE loss of the probability vector may be used to train the SENet 210, and an MSE loss (for example relative to a ground truth spectrogram) of an enhanced target speech spectrogram may be used to train the FTM 240.
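The trainable FTM and the skip-connection fusion of paragraphs [0051]-[0052] might be sketched as below, where an F x F learnable matrix is applied to every time slice of the weighted feature map and the result is concatenated with the original map and fused by a 1x1 convolution. The identity initialization, batch normalization, and tensor layout are assumptions for illustration.

```python
# Sketch of a trainable frequency transformation matrix (FTM) with skip-connection fusion:
# an F x F matrix mixes all frequency bins at each time slice, and the transformed map is
# concatenated with the original map and fused with a 1x1 convolution (cf. 240-245 of Fig. 2).
# The exact tensor layout and initialization are illustrative assumptions.
import torch
import torch.nn as nn

class HarmonicFTM(nn.Module):
    def __init__(self, num_freq_bins: int):
        super().__init__()
        # Trainable F x F matrix capturing global frequency (harmonic) correlations.
        self.ftm = nn.Parameter(torch.eye(num_freq_bins))
        # 1x1 convolution that fuses [original, transformed] back to one map.
        self.fuse = nn.Conv1d(2 * num_freq_bins, num_freq_bins, kernel_size=1)
        self.norm = nn.BatchNorm1d(num_freq_bins)

    def forward(self, weighted_map: torch.Tensor):
        # weighted_map: (batch, F, T); each time slice is multiplied by the FTM, so every
        # T-F bin of the output mixes information from all frequency bands.
        transformed = torch.einsum("gf,bft->bgt", self.ftm, weighted_map)
        fused = self.fuse(torch.cat([weighted_map, transformed], dim=1))  # skip connection
        return self.norm(fused)

block = HarmonicFTM(num_freq_bins=256)
out = block(torch.randn(2, 256, 126))
print(out.shape)  # torch.Size([2, 256, 126])
```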

[0053] Fig. 3 is a flow diagram of a method 300 for harmonics based speech extraction, in accordance with various embodiments. The method 300 begins, at block 305, by obtaining reference audio. As previously described, in various embodiments, the reference audio may be a T-F spectrogram of the audio of a speaker. The audio of the speaker may itself be an analog audio signal, or decoded digital audio. The reference audio may include audio (or a spectrogram of the audio) of the speaker speaking utterances, such as, without limitation, words and phrases, or other sounds of their voice. Reference audio 105 may, in some examples, include a script or passage that is read by the speaker. In further examples, the reference audio 105 may further capture different manners of speech, such as, without limitation, yelling, whispering, singing, shifts in tone, or other various manners of speech.

[0054] The method 300 may continue, at block 310, by generating a feature map of the reference audio using a target feature. As previously described, in some examples, the target feature may be a spectrogram (e.g., image) representing a localized audio feature, such as a sound, word or part of a word, or a phrase. Thus, the feature map may identify local features of the reference audio spectrogram visually corresponding to a target feature. In this way, in some examples, it may be identified when the speaker makes a target sound or speaks a target word in the reference audio. In some examples, one or more feature maps may be generated of one or more target features.

[0055] At block 315, the method 300 continues by generating a weighting vector based on the feature map. As previously described, in some examples, the weighting vector may be generated by transforming the 2D feature map into a 1D vector. In some examples, as previously described, a SENet may be used to "squeeze" the 2D feature map (or multiple 2D feature maps) into a 1D weighting vector. In some examples, global average pooling may be used to generate the weighting vector. In other embodiments, other algorithms may be used to "squeeze" global harmonic information into the weighting vector. Accordingly, the weighting vector may further include harmonic information relative to the target feature (or multiple target features).

[0056] At block 320, the method 300 continues by generating a weighted feature map. For example, in some embodiments, a pointwise multiplication of the feature map and the weighting vector may be performed. In various examples, this may be considered an "excitation" of the feature map using the squeezed global information. Thus, the weighted feature map may be considered a form of self-attention being applied to the feature map.

[0057] The method 300 continues, at block 325, by applying an FTM to the weighted feature map. As previously described, the FTM may be applied to the weighted feature map slice at each point in time. Stacking the transformed feature map slices in time (e.g., in time sequentially), a transformed feature map may be produced. In various embodiments, the FTM may be a trainable matrix that includes information regarding global frequency correlations of the feature map.

[0058] The method 300 includes, at block 330, generating a speaker embedding based on the transformed feature map. The speaker embedding may, in some examples, be generated by concatenating the transformed feature map with the feature map of the reference audio (e.g., the original feature map of reference audio spectrogram), and fusing them with a 1 x 1 convolution. Thus, the fused feature map may be used downstream, at a speech extractor, as a soft mask for filtering the desired speech components (e.g., a voice of the speaker). In some examples, a skip-connection scheme may be used to concatenate the feature map with the transformed feature map.

[0059] In further embodiments, the method 300 may further include, at block 335, optimizing the network by minimizing one or more loss functions. In some examples, a CE loss of the probability vector may be used to train the weights of the SENet. In further examples, an MSE loss (for example relative to a ground truth spectrogram) of an enhanced target speech spectrogram may be used to train the FTM. In some examples, one or more elements of the FTM may be adjusted such that the MSE is reduced.
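One plausible way to wire the two losses of block 335 into a training step is sketched below with simple stand-in modules; the stand-in encoder, head, and mask predictor, as well as the Adam optimizer and its learning rate, are hypothetical and not taken from this disclosure. A CE loss on the speaker-classification logits and an MSE loss on the enhanced spectrogram are summed and back-propagated.

```python
# Sketch of joint optimization with the two losses described above: CE loss on the
# speaker-classification probabilities and MSE loss on the enhanced spectrogram.
# The stand-in encoder/extractor modules and the optimizer settings are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(257 * 126, 256))   # stand-in speaker encoder
head = nn.Linear(256, 1000)                                        # speaker classification head
extractor = nn.Linear(256 + 257, 257)                              # stand-in per-frame mask predictor

params = list(encoder.parameters()) + list(head.parameters()) + list(extractor.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

ref_spec = torch.randn(4, 257, 126).abs()        # reference audio spectrograms
mix_spec = torch.randn(4, 257, 126).abs()        # speech-mixture spectrograms
clean_spec = torch.randn(4, 257, 126).abs()      # ground-truth target spectrograms
speaker_ids = torch.randint(0, 1000, (4,))

embedding = encoder(ref_spec)                                      # (4, 256)
ce_loss = F.cross_entropy(head(embedding), speaker_ids)            # trains the encoder (SENet in the text)

frames = torch.cat([mix_spec.transpose(1, 2),                      # (4, 126, 257)
                    embedding.unsqueeze(1).expand(-1, 126, -1)], dim=-1)
mask = torch.sigmoid(extractor(frames)).transpose(1, 2)            # (4, 257, 126) soft mask
mse_loss = F.mse_loss(mask * mix_spec, clean_spec)                 # trains the extractor/FTM in the text

(ce_loss + mse_loss).backward()
optimizer.step()
```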

[0060] The method 300 further includes, at block 340, extracting speech based on the speaker embedding. In various embodiments, the speaker embedding may be used to generate a masked spectrogram from a speech mixture. For example, the speech mixture may include audio in which the speaker's voice is mixed or present with other sounds, such as other speakers, ambient noise, music, or static. The speech mixture may be masked with the speaker embedding to filter speech components of the speaker's voice from the speech mixture, thereby producing a masked spectrogram. In some examples, the masked spectrogram (also referred to as a masked magnitude spectrogram) may be combined with a noisy phase spectrogram of the speech mixture to generate time-domain enhanced speech.
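The reconstruction step of block 340 (combining the masked magnitude spectrogram with the noisy phase of the speech mixture and inverting the STFT) might look like the following sketch; the STFT parameters and the random stand-in mask are illustrative assumptions.

```python
# Sketch of reconstructing time-domain enhanced speech: the masked magnitude spectrogram
# is combined with the noisy phase of the speech mixture and inverted with an ISTFT.
# The STFT parameters and the stand-in mask are illustrative assumptions.
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

mixture = torch.randn(16000)                                   # mixed time-domain audio
mix_stft = torch.stft(mixture, n_fft, hop_length=hop, window=window, return_complex=True)

mask = torch.rand_like(mix_stft.real)                          # stands in for the predicted soft mask
masked_magnitude = mask * mix_stft.abs()                       # masked magnitude spectrogram
noisy_phase = torch.angle(mix_stft)                            # phase taken from the mixture

enhanced_stft = torch.polar(masked_magnitude, noisy_phase)     # magnitude + phase -> complex STFT
enhanced_speech = torch.istft(enhanced_stft, n_fft, hop_length=hop,
                              window=window, length=mixture.shape[-1])
print(enhanced_speech.shape)  # torch.Size([16000])
```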

[0061] The techniques and processes described above with respect to various embodiments may be performed by one or more computer systems. Fig. 4 is a schematic block diagram of a computer system 400 for a harmonics based speech extraction network, in accordance with various embodiments. Fig. 4 provides a schematic illustration of one embodiment of a computer system 400, such as the system 100, harmonic block 200, or subsystems thereof, which may perform the methods provided by various other embodiments, as described herein. It should be noted that Fig. 4 only provides a generalized illustration of various components, of which one or more of each may be used as appropriate. Fig. 4, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.

[0062] The computer system 400 includes multiple hardware elements that may be electrically coupled via a bus 405 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 410, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 415, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 420, which can include, without limitation, a display device, and/or the like.

[0063] The computer system 400 may further include (and/or be in communication with) one or more storage devices 425, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.

[0064] The computer system 400 might also include a communications subsystem 430, which may include, without limitation, a modem, a network card (wireless or wired), an IR communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, a Z-Wave device, a ZigBee device, cellular communication facilities, etc.), and/or a low-power wireless device. The communications subsystem 430 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein. In many embodiments, the computer system 400 further comprises a working memory 435, which can include a RAM or ROM device, as described above.

[0065] The computer system 400 also may comprise software elements, shown as being currently located within the working memory 435, including an operating system 440, device drivers, executable libraries, and/or other code, such as one or more application programs 445, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.

[0066] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 425 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 400. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 400 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 400 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.

[0067] It will be apparent to those skilled in the art that substantial variations may be made in accordance with specific requirements. For example, customized hardware (such as programmable logic controllers, single board computers, FPGAs, ASICs, and SoCs) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.

[0068] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer system 400) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 400 in response to processor 410 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 440 and/or other code, such as an application program 445) contained in the working memory 435. Such instructions may be read into the working memory 435 from another computer readable medium, such as one or more of the storage device(s) 425. Merely by way of example, execution of the sequences of instructions contained in the working memory 435 might cause the processor(s) 410 to perform one or more procedures of the methods described herein.

[0069] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using the computer system 400, various computer readable media might be involved in providing instructions/code to processor(s) 410 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 425. Volatile media includes, without limitation, dynamic memory, such as the working memory 435. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 405, as well as the various components of the communication subsystem 430 (and/or the media by which the communications subsystem 430 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radio wave and infra-red data communications).

[0070] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.

[0071] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 410 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 400. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.

[0072] The communications subsystem 430 (and/or components thereof) generally receives the signals, and the bus 405 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 435, from which the processor(s) 410 retrieves and executes the instructions. The instructions received by the working memory 435 may optionally be stored on a storage device 425 either before or after execution by the processor(s) 410.

[0073] While some features and aspects have been described with respect to the embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while some functionality is ascribed to one or more system components, unless the context dictates otherwise, this functionality can be distributed among various other system components in accordance with the several embodiments.

[0074] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with or without some features for ease of description and to illustrate aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.