

Title:
METHOD AND APPARATUS FOR ACOUSTIC SOURCE SEPARATION
Document Type and Number:
WIPO Patent Application WO/2013/030134
Kind Code:
A1
Abstract:
A method of separating a plurality of source audio signals from a mixed signal. The method comprises: selecting respective training data for each source audio signal; for respective segments of data representing the mixed signal, determining which combination of a respective segment from each of the selected training data matches the respective segment of the data representing said mixed signal; and reconstructing each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing the mixed signal. The method facilitates separating component speech signals from a single channel mixture of multiple speech signals.

Inventors:
MING JI (GB)
SRINIVASAN RAMJI (GB)
CROOKES DANIEL (GB)
Application Number:
PCT/EP2012/066549
Publication Date:
March 07, 2013
Filing Date:
August 24, 2012
Assignee:
UNIV BELFAST (GB)
MING JI (GB)
SRINIVASAN RAMJI (GB)
CROOKES DANIEL (GB)
International Classes:
G10L21/02
Foreign References:
DE102007030209 A1 (2009-01-08)
US5983178 A (1999-11-09)
US5598507 A (1997-01-28)
Other References:
JI MING ET AL: "A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 19, no. 4, 1 May 2011 (2011-05-01), pages 822 - 836, XP011352002, ISSN: 1558-7916, DOI: 10.1109/TASL.2010.2064312
G-J JANG ET AL: "Single-channel signal separation using time-domain basis functions", IEEE SIGNAL PROCESSING LETTERS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 10, no. 6, 1 June 2003 (2003-06-01), pages 168 - 171, XP011433600, ISSN: 1070-9908, DOI: 10.1109/LSP.2003.811630
Attorney, Agent or Firm:
WALLACE, Alan (4 Mount CharlesBelfast, Antrim BT7 1NZ, GB)
CLAIMS:

1. A method of separating a plurality of source audio signals from a mixed signal comprising said source audio signals in an audio signal processing apparatus, said method comprising: selecting respective training data for each source audio signal; for respective segments of data representing said mixed signal, determining which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.

2. A method as claimed in claim 1, further including organizing said training data into a plurality of classes; and associating each source audio signal with one of said classes, wherein said selecting training data involves selecting training data from the class associated with the respective audio source signal.

3. A method as claimed in claim 2, wherein each class comprises at least one training data component, each training data component comprising, or being derived from, a respective audio training signal.

4. A method as claimed in any preceding claim, wherein said training data comprises, or is derived from, audio signals generated by one or more sources other than the sources of said audio source signals.

5. A method as claimed in any preceding claim, further including transforming said mixed signal to produce a time sequence of spectral data representing said mixed signal in the frequency domain, and wherein said segments of said mixed signal comprise segments of said transformed mixed signal.

6. A method as claimed in any preceding claim, further including modeling a plurality of audio training signals, or derivatives thereof, to produce said training data.

7. A method as claimed in claim 6, wherein said training data comprises a plurality of training data components, each training data component comprising, or being derived from, a respective audio training signal, and wherein said modeling involves producing a respective model component for use as each training data component.

8. A method as claimed in claim 6 or 7, further including statistically modeling the respective audio training signals, or derivatives thereof, to produce at least one statistical model.

9. A method as claimed in claim 8 when dependent on claim 2, further including statistically modeling the respective audio training signals, or derivatives thereof, of each class to produce a respective statistical model for each class.

10. A method as claimed in claim 8 or 9, wherein said statistical model comprises a probabilistic model, preferably a mixture model, and most preferably a Gaussian mixture model.

11. A method as claimed in any one of claims 8 to 10 when dependent on claim 7, further including producing respective model components for use as said training data components by fitting said at least one statistical model to the respective audio training signals, or derivatives thereof.

12. A method as claimed in claim 11 when dependent on claim 9, further including, for each class, producing respective model components for use as the training data components of the respective class by fitting the respective statistical model for the class to the respective audio training signals, or derivatives thereof.

13. A method as claimed in claim 11 or 12, wherein said fitting involves finding the component of the respective statistical model that is statistically most likely to match the respective audio training signal, or derivative thereof.

14. A method as claimed in any one of claims 6 to 13, further including transforming said audio training signals to produce a respective time sequence of spectral data representing the respective audio training signal in the frequency domain, and modeling said spectral data to produce said training data.

15. A method as claimed in claim 3, wherein selecting respective training data for each source audio signal involves selecting at least some and preferably all of the training data components of the respective class.

16. A method as claimed in any preceding claim, wherein said training data comprises a plurality of training data components, each training data component being derived from a respective audio training signal, and wherein each of said segments of training data comprises a segment of any one of said training data components.

17. A method as claimed in any preceding claim, wherein determining which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal involves, for each segment of said data representing said mixed signal:

creating a plurality of composite training segments, each composite training segment comprising a different combination of training data segments, one from each of the respective selected training data; calculating, for each composite training segment, a measure of the similarity of said composite training segment to said respective mixed signal data segment;

selecting the composite training segment with the highest similarity measure as a match for said respective mixed signal data segment.

18. A method as claimed in claim 17, further including determining, for each selected composite training segment, the maximum length of said respective mixed signal data segment by increasing the length of said respective mixed signal data segment while matching its constituent training segments, up to the respective maximum length.

19. A method as claimed in claim 17 or 18, wherein calculating said measure of similarity involves calculating a probability that the respective test segment is matched by the respective composite training segment.

20. A method as claimed in any one of claims 17 to 19, wherein calculating said measure of similarity involves calculating a posterior probability of said composite training segment being a match given said respective mixed signal data segment.

21. A method as claimed in claim 20 when dependent on claim 18, wherein determining the maximum length of one or more of said constituent training data segments involves determining the maximum length of said mixed signal data segment that maximizes said posterior probability.

22. A method as claimed in any preceding claim, wherein said reconstructing involves, for each audio source signal, temporally aligning the segments of the respective selected training data that form a respective matching combination for each mixed signal data segment; combining the temporally aligned segments to produce a time sequence of data components representing said audio source signal.

23. A method as claimed in claim 22, wherein said temporally aligned segments are combined by applying an averaging or smoothing function.

24. A method as claimed in claim 22 or 23, wherein the beginning of each segment is aligned with a respective frame of said mixed signal data.

25. A method as claimed in claim 24, wherein said combining of aligned segments involves combining respective portions of one or more of said segments that are aligned with respective frames of said mixed signal data.

26. A method as claimed in any preceding claim, wherein said training data comprises spectral data.

27. A method as claimed in claim 26, when dependent on any one of claims 22 to 25, wherein said time sequence of data components comprises a time sequence of spectral features, and wherein said reconstruction of said respective audio source signal involves performing an inverse frequency transform on said time sequence of spectral features.

28. A method as claimed in any preceding claim, further including adjusting the gain of said segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal based on a set of one or more gain update values.

29. A method as claimed in claim 28, wherein said gain update values are derived from said mixed signal data.

30. A method as claimed in any preceding claim, wherein said mixed signal comprises a single channel mixture of simultaneous source audio signals from respective audio sources.

31. A method as claimed in claim 30, wherein said single channel mixture is created by an acousto-electric transducer from a plurality of simultaneous acoustic signals.

32. A method as claimed in any preceding claim, performed in a digital signal processing system, especially an audio signal separation system.

33. An apparatus for separating a plurality of source audio signals from a mixed signal comprising said source audio signals, said apparatus comprising: training data selecting means configured to select respective training data for each source audio signal; a segment analyzer configured to, for respective segments of data representing said mixed signal, determine which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing means configured to reconstruct each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.

34. An audio signal separation system comprising an apparatus as claimed in claim 33 and an acousto-electric transducer for creating a said single channel mixture of audio signals from a plurality of simultaneous acoustic signals.

35. An audio signal separation system as claimed in claim 34, further including a respective electro-acoustic transducer for rendering each separated audio signal.

36. A computer program product comprising computer usable code for causing a computer to perform the method of any one of claims 1 to 32.

Description:
Method and Apparatus for Acoustic Source Separation

Field of the Invention

The present invention relates to the separation of component acoustic sources from a single channel mixture of acoustic sources. The invention relates particularly to the separation of simultaneous multi-talker speech from a single channel recording.

Background to the Invention

In real-world scenarios, speech rarely occurs in isolation and is usually accompanied by a background of many other acoustic sources. One very common scenario is multi-talker speech, in which two or more speakers speak simultaneously. A single microphone recording of this multi-talker speech will result in a mixed speech signal, and is generally termed single-channel mixed speech. Separating the component speech signals from a single channel mixture is a challenge.

Current state-of-the-art approaches addressing the problem of single-channel speech separation can be broadly grouped into three major categories: basis-function based decomposition, computational auditory scene analysis (CASA), and model-based approaches. In basis-function based decomposition, a set of bases is used to represent the short-time spectra of each component speech signal. Separation is usually performed on a frame-by-frame basis, by finding a linear combination of the component basis sets that matches a given mixed speech frame. However, a frame of speech is generally very short, ranging from 10 ms to 30 ms. The change in sound properties over frames (i.e., over time) is referred to as "temporal dynamics". Analysis over a small number of frames gives short-term temporal dynamics, while analysis over a larger number of frames gives long-term temporal dynamics. The long-term temporal dynamics of a speech utterance are one of the most important characteristics that distinguish the utterance from noise, including other speakers' utterances. Separation of speech on a frame-by-frame basis without considering the temporal dynamics is not effective. CASA is another category of algorithms, which segments psychoacoustic cues (e.g., pitch, harmonic structures, modulation correlation, etc.) of the mixed speech into different component sources, and performs separation by masking the interfering sources. The method can only capture short-term speech temporal dynamics and hence suffers from ambiguity in correctly classifying the psychoacoustic cues.

In model-based approaches, each component speech signal is first represented by a statistical model using training samples from the component speaker. The mixed speech signal can then be represented by a model obtained by combining the statistical models of the component speech signals. Model-based algorithms can be divided into two groups: those assuming independence between speech frames, and those attempting to model variable levels of temporal dynamics. Techniques capable of capturing long-term temporal dynamics of speech demonstrate good performance for separation. However, in current techniques for speech separation, modeling long-term temporal dynamics (for example, on the subword, word or sentence level) requires knowledge of the task (vocabulary and grammar) and transcription of the training data. This limits both the applicability and the accuracy of the current techniques for everyday speech processing.

Separating speech without this knowledge remains an open research question.

It would be desirable therefore to provide improved techniques for acoustic source separation, and in particular for the separation of individual speech sources.

Summary of the Invention

A first aspect of the invention provides a method of separating a plurality of source audio signals from a mixed signal comprising said source audio signals, said method comprising: selecting respective training data for each source audio signal; for respective segments of data representing said mixed signal, determining which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.

A second aspect of the invention provides an apparatus for separating a plurality of source audio signals from a mixed signal comprising said source audio signals, said apparatus comprising: training data selecting means configured to select respective training data for each source audio signal; a segment analyzer configured to, for respective segments of data representing said mixed signal, determine which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing means configured to reconstruct each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.

A third aspect of the invention provides an audio signal separation system comprising the apparatus of the second aspect of the invention and an acousto-electric transducer for creating a said single channel mixture of audio signals from a plurality of simultaneous acoustic signals.

A fourth aspect of the invention provides a computer program product comprising computer usable code for causing a computer to perform the method of the first aspect of the invention.

Preferred features are recited in the dependent claims.

Preferred embodiments of the invention provide a method for separating component acoustic sources from a single channel mixture of said component acoustic sources, with particular application to single-channel speech separation.

Preferred embodiments are referred to herein as CLOSE (Composition of LOngest SEgments). In the context of speech separation, given a single-channel mixed speech signal and training data for one or more relevant component speaker classes, the preferred CLOSE method finds the longest segment compositions between the mixed speech signal and the training data for performing separation. This maximizes the extraction of the temporal dynamics for separation, without requiring knowledge of the task vocabulary, grammar or transcribed training data. Two different possible ways to realize the CLOSE method, CLOSE-1 and CLOSE-2, are disclosed hereinafter by way of illustration. In CLOSE-1, all the acoustic sources are separated at the same time, while in CLOSE-2 one source is separated at a time. In terms of computation speed, CLOSE-2 is faster than CLOSE-1.

Brief Description of the Drawings

Embodiments of the invention are now described by way of example and with reference to the accompanying drawings in which:

Figure 1 is a block diagram illustrating a system for separating multiple acoustic sources, in particular speech;

Figure 2 is a block diagram illustrating a preferred acoustic source separation process embodying the invention;

Figure 3 is a schematic illustration of said preferred acoustic source separation process in the context of two-talker speech separation;

Figure 4 presents an algorithm suitable for implementing said preferred acoustic source separation process in the context of two-talker speech separation (CLOSE-1);

Figure 5 presents an alternative algorithm for implementing said preferred acoustic source separation process in the context of two-talker speech separation (CLOSE-2); and

Figure 6 is a schematic illustration of a preferred reconstruction process suitable for use with embodiments of the invention.

Detailed Description of the Drawings

Referring now to Figure 1 of the drawings, there is shown, generally indicated as 10, a digital signal processing system in the form of an acoustic source separation system, and in particular a speech signal separation system. The system 10 receives acoustic signals 12 from multiple (K) acoustic sources S_1, ..., S_K, which in this example are assumed to be human speakers although they could be any source of speech signals. The speech signals 12 are received by a microphone 14 (or other acousto-electric transducer or audio signal recording device) to produce a single channel mixture (SCM) of the speech signals (or more particularly of an electronic representation of the acoustic speech signals as produced by the microphone). The single channel mixture is fed into a separator A10, which separates the component acoustic sources to produce acoustic output signals O_1, ..., O_K. The separation is performed on the electronic representation of the mixed signals and reproduced as acoustic signals by speakers 16, or other suitable transducers. Typically the SCM comprises digital audio data organized in frames. The output signals of the separator A10 typically also comprise digital audio data organized in frames.

The separation process performed by separator A10 is a multistage process which is illustrated in Figure 2. First, the SCM is subjected to a pre-processing stage A11. In this stage, each frame of the input data undergoes extraction of its spectral features (which may be referred to as spectral or frequency transformation). This results in a time sequence of spectral features (TSSF). Spectral features typically comprise data representing the frequency component(s) of the signal, e.g. data identifying the frequency component(s) and the respective magnitude and/or phase of those component(s). For the SCM the TSSF is denoted by TSSF_SCM. By way of example, in a specific implementation of the pre-processing stage A11 for speech separation, the SCM may first be divided into short-time frames of, for example, 20 ms, with a frame period of, for example, 10 ms (or otherwise sampled). Each frame may then be represented in the form of log spectral magnitudes. The invention is not limited to any specific spectral feature set or time domain to frequency domain transformation. Some suitable known methods for representing the spectral features include: squared amplitude; cepstrum; Mel-frequency cepstral coefficients (MFCC); log magnitude spectrum; and linear predictive coding (LPC).

Further, the phase information of the SCM is also extracted and stored for later use in the source reconstruction stage A14. This is conveniently achieved as part of the spectral transformation of the SCM. Typically, a Fourier transform is used to produce the respective amplitude and phase spectrum. The phase spectrum obtained for the test speech (SCM) is used during the reconstruction process to produce the separated component speech signals from the mixed speech signal. The separator A10 is operable with training data, which may be received from an external source or may be stored by the system 10 in any suitable storage device 20 accessible by the separator A10. The training data comprises audio data signals, or derivatives thereof, against which the SCM can be compared.
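By way of non-limiting illustration only, the pre-processing stage A11 described above might be sketched in Python as follows. The SciPy STFT call, the 20 ms/10 ms framing and the default window are illustrative assumptions rather than features of the invention, which is not limited to any particular feature set or transform.

```python
import numpy as np
from scipy.signal import stft

def preprocess_scm(scm_waveform, sample_rate, frame_ms=20, hop_ms=10):
    """Illustrative pre-processing (stage A11): frame the single channel
    mixture and return log-magnitude spectra (the TSSF_SCM) together with
    the phase spectrum retained for the reconstruction stage A14."""
    nperseg = int(sample_rate * frame_ms / 1000)   # e.g. 20 ms frames
    hop = int(sample_rate * hop_ms / 1000)         # e.g. 10 ms frame period
    _, _, spec = stft(scm_waveform, fs=sample_rate,
                      nperseg=nperseg, noverlap=nperseg - hop)
    log_mag = np.log(np.abs(spec) + 1e-10)  # spectral features, one column per frame
    phase = np.angle(spec)                  # stored for use during reconstruction
    return log_mag, phase
```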

Typically, the training data comprises, or is derived from, digital audio data signals, preferably organized in frames.

The training data is organized into a plurality of acoustic source classes T_1, ..., T_N. Each acoustic source class comprises one or more respective training audio data signals, or data derived therefrom, from one or more acoustic sources.

Advantageously, signals from acoustic sources regarded as being sufficiently similar to one another are included in the same acoustic source class (for example, a single speaker; a collection of speakers of the same gender; a collection of speakers from a particular ethnic group; a particular musical instrument; a particular music genre; etc.). In typical embodiments, the acoustic source(s) for the training data are not the same as the acoustic sources of the signals that make up the SCM, i.e. the separator A10 does not require training data from the actual acoustic sources whose mixed signals it is separating. The training data for each acoustic source class are subjected to a modelling process A12, to yield, for each acoustic class, a respective model of the time sequences of spectral features for the respective training data: TSSF-M_1, ..., TSSF-M_N. Typically, the training data comprises multiple audio training signals (also referred to as training utterances or training data components) for each class. In the preferred embodiment, the respective audio training signals (or more particularly a frequency representation thereof) for each class are modeled collectively (per class) to provide a single model for each class. The model for each class comprises a respective model component for each of the respective audio training signals. Typically, the modelling process A12 comprises a combination of statistical modelling and data-driven modelling. The following description of the preferred modeling process A12 is set in the context of speech separation. However, those skilled in the art will recognize that the present invention can be employed for separating any kind of simultaneous acoustic sources from a single channel mixture containing multiple simultaneous acoustic sources.

In typical embodiments, the modeler A12 applies a mixture model, preferably a Gaussian Mixture Model (GMM), to the training data of each class. Alternatively, other models, especially probabilistic models, may be used. The preferred modeling process involves subjecting the training utterances for each class to spectral (frequency) transformation (typically producing a respective spectral feature vector for each frame), and applying mixture modeling to all of the resulting spectral data to produce a respective model (in this case a GMM) for each acoustic training class. This typically involves fitting a set of multi-dimensional Gaussian, or other, functions to the spectral data for each class to produce the mixture model. This is the aforementioned statistical modeling. Then data-driven modeling is used to produce a respective model component (utterance model) for each of the training utterances of the class. Let x_λ = {x_{λ,t} : t = 1, 2, ..., T_λ} represent a training utterance 34 for speaker class λ, where T_λ is the number of frames in the utterance and x_{λ,t} is the spectral feature vector of the frame at time t. Denote by G_λ the Gaussian mixture model (GMM) for speaker class λ, of M_λ Gaussian components, trained using all the training utterances x_λ. This can be expressed as

G_λ(x) = Σ_{m=1}^{M_λ} w_λ(m) g_λ(x | m)   (1)

where g_λ(x | m) is the m-th Gaussian component and w_λ(m) is the corresponding weight, for speaker class λ. By way of example, each speaker-class GMM may contain 512 Gaussian components with diagonal covariance matrices. Next, based on G_λ, a data-driven model for each training utterance x_λ may be built by taking each frame from x_λ and finding the Gaussian component in G_λ that produces maximum likelihood for the frame, i.e. identifying the Gaussian component that is the most likely to match the spectral features of the frame. Thus, x_λ can alternatively be represented by a corresponding time sequence of Gaussian components {g_λ(x | m_{λ,t}) : t = 1, 2, ..., T_λ} (or other mixture model components), where m_{λ,t} is the index of the Gaussian component producing maximum likelihood for the frame x_{λ,t}. This Gaussian sequence representation of x_λ can thus be fully characterized by the corresponding index sequence m_λ:

m_λ = {m_{λ,t} : t = 1, 2, ..., T_λ}   (2)

Equation (2) is called an utterance model, which can be considered as part of the TSSF-M_λ of the training data of speaker class λ. In the modeling stage A12, an utterance model m_λ is created for each training utterance x_λ of each speaker class λ. All the training utterance models m_λ for speaker class λ together form the TSSF-M_λ for the speaker class.
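Purely as an illustrative sketch of the statistical and data-driven modelling of equations (1) and (2), and not as a definitive implementation, the per-class GMM and the utterance models might be produced as follows. scikit-learn's GaussianMixture stands in for the GMM training, and the 512-component, diagonal-covariance configuration mirrors the example above but is an assumption, not a requirement.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmm(training_utterance_frames, n_components=512):
    """Statistical modelling: fit one GMM per speaker class to the spectral
    feature frames of all of that class's training utterances [eq. (1)]."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(np.vstack(training_utterance_frames))
    return gmm

def utterance_model(gmm, utterance_frames):
    """Data-driven modelling: map each frame of one training utterance to the
    index of its best-matching Gaussian component, giving the index sequence
    m_lambda of eq. (2). predict() picks the component with the largest
    weighted likelihood, used here as a stand-in for the maximum-likelihood
    component described above."""
    return gmm.predict(utterance_frames)
```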

Given the SCM, containing the mixture of K simultaneous signals from acoustic sources S_1, ..., S_K, the TSSF-M_1, ..., TSSF-M_K required for the source separation are extracted from the TSSF-M_1, ..., TSSF-M_N. Extraction can be manual or by means of algorithms. In the absence of knowledge of the acoustic source classes in the SCM, any source identification and clustering algorithms known in the art can be used to identify TSSF-M_1, ..., TSSF-M_K. For example, known speaker clustering methods such as those disclosed in United States Patents US5983178 and US5598507 may be used.

The single channel mixture TSSF_SCM and the training data TSSF-M_1, ..., TSSF-M_K are then analyzed in order to identify the longest segment compositions (of the training data), and the corresponding matching component segments (of the TSSF_SCM), for use in separation. In Figure 2, this analysis is performed by analyzer A13, which is labeled CLOSE (Composition of LOngest SEgments). As is described in more detail hereinafter, the identified matching component segments are used by a reconstruction process A14, along with the phase of the SCM, to reconstruct the component acoustic sources O_1, ..., O_K.

There follows a description of a preferred embodiment of the analyzer A13, set in the context of speech separation from an SCM containing audio signals generated from simultaneous speech by two human speakers. It will be understood, however, that the invention may be employed for separating signals from any kind of simultaneous acoustic sources from a single channel mixture containing signals from the multiple simultaneous acoustic sources. This is achievable provided training data for each relevant acoustic source class is available. The preferred method for two-talker speech separation is illustrated in Figure 3 and, in the preferred embodiment, is implemented by the analyzer A13. A single-channel mixed speech signal 30 (which may also be referred to as the test utterance) is composed of two simultaneous speech utterances (audio signals) from two speakers. Training data 32 comprising a respective plurality of training utterances 34 for speaker classes λ and γ is available. It is assumed that the speaker classes λ and γ are relevant to the speakers and have been selected in the manner described above.

Figure 3 illustrates how a test segment y_{t:τ} of the test utterance 30 is subjected to signal separation by finding respective segments m_{λ,i:j} and m_{γ,u:v} from the training utterances 34 of speaker classes λ and γ such that the selected training segments, when combined, match (or substantially match, i.e. are deemed to constitute an acceptable match) the test segment. Preferably, matching in this context involves finding the training segments that, when combined, give the highest probability (or maximum likelihood) of representing the test segment. This is referred to as segment composition.

In preferred embodiments, each speaker is associated with a respective speaker class. Hence, N speaker classes are used for N speakers. However, the same speaker class can be used for more than one speaker and so N different speaker classes need not necessarily be used. In the general case, if there are N speakers, N respective training utterances are combined during segment composition, each training utterance belonging to the class of the respective speaker.

The preferred CLOSE method of A13 maximizes the length of the segment composition as a means of minimizing the uncertainty of the component segments in the composition and hence the error of separation. Advantageously, it is the length of the test segment y_{t:τ} that is maximized, and this in turn maximizes the lengths of the composite training segments m_{λ,i:j} and m_{γ,u:v} that match y_{t:τ} [see equation (8) hereinafter]. In preferred embodiments, a probabilistic approach is used as one possible way of finding the matching score for the longest segment composition. However, the method described for finding the matching score should not be considered exclusive.

By way of example, two different embodiments for determining longest segment compositions are now described, being referred to as CLOSE-1 and CLOSE-2. These are described in the context of two-talker speech separation for illustrative purposes.

CLOSE-1 is described first, set in the context of speech separation from an SCM containing simultaneous speech from two speakers. Consider y = {y_t : t = 1, 2, ..., T} as a test utterance with T frames, composed of two simultaneous speech utterances spoken by two different speakers, and let λ and γ denote the corresponding speaker classes selected from the training data. Let y_{t:τ} = {y_ε : ε = t, t+1, ..., τ} represent a test segment taken from the test utterance y and consisting of consecutive frames from time t to τ. Let m_{λ,i:j} = {m_{λ,t} : t = i, i+1, ..., j} represent a training segment taken from the model m_λ and modelling consecutive frames from i to j in the training utterance x_λ of speaker class λ. Similarly, let m_{γ,u:v} = {m_{γ,t} : t = u, u+1, ..., v} represent a training segment taken from the model m_γ and modelling consecutive frames from u to v in the training utterance x_γ of speaker class γ. Given y_{t:τ}, two matching training component segments m_{λ,i:j} and m_{γ,u:v} are identified by using the posterior probability P(m_{λ,i:j}, m_{γ,u:v} | y_{t:τ}). In this context the posterior probability is the (new) probability given a prior probability and some test evidence, in this case a test segment, and wherein the probability is a measure of the likelihood that a combination of the training component segments matches the test segment.

Assume an equal prior probability P for all possible component segments (s_λ, s_γ) from the two speaker classes. This posterior probability can be expressed as

P(m_{λ,i:j}, m_{γ,u:v} | y_{t:τ}) = P p(y_{t:τ} | m_{λ,i:j}, m_{γ,u:v}) / [ P Σ_{(s_λ, s_γ)} p(y_{t:τ} | s_λ, s_γ) + P p(y_{t:τ} | Φ_{λγ}) ]   (3)

where the sum in the denominator runs over all possible component segment pairs (s_λ, s_γ), and p(y_{t:τ} | m_{λ,i:j}, m_{γ,u:v}) is the likelihood that the given test segment y_{t:τ} is matched by combining the two training component segments m_{λ,i:j}, m_{γ,u:v}.

Assuming, for convenience, independence between the frames within a segment, this segmental likelihood function can be written as

p(y_{t:τ} | m_{λ,i:j}, m_{γ,u:v}) = Π_{ε=t}^{τ} p(y_ε | m_{λ,ς(ε)}, m_{γ,η(ε)})   (4)

where p(y_ε | m_{λ,ς(ε)}, m_{γ,η(ε)}) is the likelihood that the given test frame y_ε is matched by combining the two training component frames m_{λ,ς(ε)}, m_{γ,η(ε)}. In (4), ς(ε) and η(ε) represent time warping functions between the test segment y_{t:τ} and the two training segments m_{λ,i:j} and m_{γ,u:v} in forming the match. The use of time warping functions is optional and allows for variation in the rate of speech.

In equation (3), the denominator is expressed as the sum of two terms. The first term is the average likelihood that the given test segment y_{t:τ} is matched by combining two training speech segments; this likelihood is calculated over all possible training segment combinations between the two speaker classes. The second term, denoted by p(y_{t:τ} | Φ_{λγ}), represents the average likelihood that the test segment y_{t:τ} is matched by two speech segments which, either or both, are not seen in the training data. This likelihood associated with unseen component segments can be expressed by using a mixture model, allowing for arbitrary, temporally independent combinations of the training frames to simulate arbitrary unseen speech segments. Combining the two speaker classes' GMMs [i.e., equation (1)], the following expression can be used:

p(y_{t:τ} | Φ_{λγ}) = Π_{ε=t}^{τ} [ Σ_{m_λ=1}^{M_λ} Σ_{m_γ=1}^{M_γ} w_λ(m_λ) w_γ(m_γ) p(y_ε | m_λ, m_γ) ]   (5)

The sums inside the brackets form a mixture likelihood for the test frame y_ε, taking into consideration all possible training frame combinations between the two speaker classes; equation (5) assumes independence between consecutive frames, to simulate arbitrary component/test segments. In other words, if the segmental temporal dynamics are regarded as "text" dependence, then equation (4) gives a "text-dependent" likelihood of the test segment, dependent on the temporal dynamics of both training component segments, while (5) gives a "text-independent" likelihood of the test segment. In this context, "text-independence" may be regarded as matching single frames while "text-dependence" may be regarded as matching multi-frame segments. Test segments with mismatching training component segments will result in low "text-dependent" likelihoods [i.e., (4)] but not necessarily low "text-independent" likelihoods [i.e., (5)], and hence low posterior probabilities of match [i.e., (3)].

For a matching composition of the test segment y_{t:τ} with the training component segments m_{λ,i:j} and m_{γ,u:v}, it can be assumed that the "text-dependent" likelihood is greater than the "text-independent" likelihood, i.e., p(y_{t:τ} | m_{λ,i:j}, m_{γ,u:v}) ≥ p(y_{t:τ} | Φ_{λγ}). This is because

p(y_{t:τ} | Φ_{λγ}) ≈ Π_{ε=t}^{τ} max_{m_λ} max_{m_γ} w_λ(m_λ) w_γ(m_γ) p(y_ε | m_λ, m_γ)
               ≤ Π_{ε=t}^{τ} p(y_ε | m_{λ,ς(ε)}, m_{γ,η(ε)})   (6)

The second approximation is based on the assumption that the matching, and hence highly likely, test-training composition dominates the mixture likelihood. Therefore, with (3) and (5), a larger posterior probability can be obtained for a composition between matching test/training segments, and a smaller posterior probability for a composition between mismatching test/training segments.

The posterior probability formulation (3) has another advantageous characteristic: it favours the continuity of match and produces larger probabilities for compositions between longer matching segments. Assume that the test segment y_{t:τ} and the two training component segments m_{λ,i:j} and m_{γ,u:v} are matching, in the sense that the segmental likelihoods of composition satisfy p(y_{t:τ} | m_{λ,i:j}, m_{γ,u:v}) ≥ p(y_{t:τ} | m'_{λ,i':j'}, m'_{γ,u':v'}) for any (m'_{λ,i':j'}, m'_{γ,u':v'}) ≠ (m_{λ,i:j}, m_{γ,u:v}), and p(y_{t:τ} | m_{λ,i:j}, m_{γ,u:v}) > p(y_{t:τ} | Φ_{λγ}). Then the inequality concerning the posterior probabilities for compositions between matching segments with different lengths can be expressed as:

P(m_{λ,i:ς(ε)}, m_{γ,u:η(ε)} | y_{t:ε}) ≤ P(m_{λ,i:j}, m_{γ,u:v} | y_{t:τ})   (7)

where y_{t:ε}, with ε ≤ τ, is a test segment starting at the same time as y_{t:τ} but not lasting as long, and m_{λ,i:ς(ε)} and m_{γ,u:η(ε)} are the corresponding training component subsegments matching the shorter test segment y_{t:ε}. The inequality indicates that larger posterior probabilities are obtained when longer matching segments are being composed. At each frame time t of the test utterance y = {y_t : t = 1, 2, ..., T}, a longest segment composition can be found, denoted by the test segment y_{t:τ_t^max} starting at t and the corresponding matching training component segments m̂_{λ,i:j} and m̂_{γ,u:v}, by maximizing the posterior probability. That is

(τ_t^max, m̂_{λ,i:j}, m̂_{γ,u:v}) = argmax_{τ, m'_{λ,i:j}, m'_{γ,u:v}} P(m'_{λ,i:j}, m'_{γ,u:v} | y_{t:τ})   (8)

That is, at time t, the longest segment composition may be obtained by first finding the most-likely training component segments for each fixed-length test segment y_{t:τ}, and then finding the maximum test-segment length (i.e., τ_t^max) that results in the maximum posterior probability.
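A much-simplified sketch of the CLOSE-1 search of equations (3) to (8) is given below for illustration only; it is not the algorithm of Figure 4. The helper functions frame_loglik (the per-frame likelihood p(y_ε | m_λ, m_γ), for example the log-max model described later) and unseen_frame_loglik (the per-frame mixture likelihood inside equation (5)) are hypothetical, the time warping functions ς and η are omitted, and candidate training segments are assumed to be supplied as lists of Gaussian index sequences.

```python
import numpy as np
from scipy.special import logsumexp

def close1_at_frame(test_frames, segs_lambda, segs_gamma,
                    frame_loglik, unseen_frame_loglik, max_len=40):
    """At one test frame time, search over test-segment lengths and pairs of
    training segments for the composition maximising the posterior of eq. (3),
    i.e. a simplified form of the maximisation in eq. (8)."""
    best_post, best_pair, best_len = -np.inf, None, 1
    for length in range(1, min(max_len, len(test_frames)) + 1):
        y = test_frames[:length]
        # eq. (5): "text-independent" likelihood of the unseen-segment term
        log_unseen = sum(unseen_frame_loglik(frame) for frame in y)
        # eq. (4): segmental likelihood for every candidate segment pair
        pair_lls = []
        for sl in segs_lambda:
            for sg in segs_gamma:
                if len(sl) < length or len(sg) < length:
                    continue
                ll = sum(frame_loglik(y[e], sl[e], sg[e]) for e in range(length))
                pair_lls.append((ll, (sl[:length], sg[:length])))
        if not pair_lls:
            break
        lls = np.array([ll for ll, _ in pair_lls])
        # eq. (3) with equal priors: best pair against the average over all
        # pairs plus the unseen term, evaluated in the log domain
        log_avg_pairs = logsumexp(lls) - np.log(len(lls))
        log_post = lls.max() - np.logaddexp(log_avg_pairs, log_unseen)
        if log_post > best_post:
            best_post, best_len = log_post, length
            best_pair = pair_lls[int(lls.argmax())][1]
    return best_len, best_pair, best_post
```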

CLOSE-1 for two-talker speech separation is outlined as an algorithm in Figure 4. In CLOSE-1, both training component frames/segments for each test frame/segment are constrained temporally by their corresponding longest matching training component segments. To find the two matching training component segments for a test segment, this system needs to search N_λ × N_γ possible combinations, where N_λ and N_γ represent the number of training segments from speaker class λ and speaker class γ, respectively.

The second embodiment, CLOSE-2, is now described, set in the context of speech separation from an SCM containing simultaneous speech from two speakers.

Reconsider the segmental likelihood function (4) of a test segment y_{t:τ}, now associated with a temporally constrained training component segment m_{λ,i:j} from speaker class λ, and a temporally unconstrained training component segment from speaker class γ. Denote the unconstrained component segment as *_{γ,t:τ}. This likelihood function can be expressed as

p(y_{t:τ} | m_{λ,i:j}, *_{γ,t:τ}) = Π_{ε=t}^{τ} max_{m_γ} p(y_ε | m_{λ,ς(ε)}, m_γ)   (9)

The unconstrained component segment is formed by choosing the frames freely from the training data (mapped to GMM G_γ), to maximize the likelihood with the constrained component segment. Thus, *_{γ,t:τ} = (m̂_{γ,t}, m̂_{γ,t+1}, ..., m̂_{γ,τ}), where each m̂_{γ,ε} = argmax_{1≤m_γ≤M_γ} p(y_ε | m_{λ,ς(ε)}, m_γ). Equation (9) gives a "text-dependent" likelihood of the test segment, dependent on the temporal dynamics of the training component segment m_{λ,i:j}. Substituting p(y_{t:τ} | m_{λ,i:j}, *_{γ,t:τ}) into (3) we can obtain the posterior probability expressed as:

P(m_{λ,i:j}, *_{γ,t:τ} | y_{t:τ}) = P p(y_{t:τ} | m_{λ,i:j}, *_{γ,t:τ}) / [ P Σ_{s_λ} p(y_{t:τ} | s_λ, *_{γ,t:τ}) + P p(y_{t:τ} | Φ_{λγ}) ]   (10)

Equation (10) is only a function of the temporally constrained training segment m_{λ,i:j}. Similar to (8), the longest segment compositions can be located between the test segments and the temporally constrained training segments, by maximizing the posterior probabilities. At each frame time t, the longest matching test segment y_{t:τ_t^max} and the corresponding temporally constrained training segment m̂_{λ,i:j} are obtained by first finding the most-likely training segment for each fixed-length test segment y_{t:τ}, and then finding the maximum test-segment length (i.e., τ_t^max) that results in the maximum posterior probability:

(τ_t^max, m̂_{λ,i:j}) = argmax_{τ, m'_{λ,i:j}} P(m'_{λ,i:j}, *_{γ,t:τ} | y_{t:τ})   (11)

Equation (11) shows the estimation of the temporally constrained component segments for speaker class λ. By switching the temporal constraint from speaker class λ to speaker class γ, the same system can be used to identify the temporally constrained component segments for speaker class γ. Figure 5 outlines CLOSE-2 for two-talker speech separation. As described above, each time only one of the matching training segments forming the segment composition is subjected to temporal constraints. CLOSE-2 has a search complexity of about N_λ + N_γ possible combinations, which can be significantly less than the N_λ × N_γ required for CLOSE-1, for large numbers of training segments N_λ and N_γ.
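Under the same illustrative assumptions as the earlier sketch, the unconstrained inner maximisation of equation (9) used by CLOSE-2 might be expressed as follows, with frame_loglik again a hypothetical per-frame likelihood helper:

```python
import numpy as np

def constrained_segment_loglik(y_frames, seg_lambda, n_gamma_components, frame_loglik):
    """Eq. (9): segmental log-likelihood for a temporally constrained lambda
    segment, with the gamma component chosen freely frame-by-frame to form
    the unconstrained segment *_{gamma,t:tau}."""
    total, star_gamma = 0.0, []
    for e, frame in enumerate(y_frames):
        # pick the gamma component that best explains this frame jointly with
        # the constrained lambda frame seg_lambda[e]
        lls = [frame_loglik(frame, seg_lambda[e], m_g)
               for m_g in range(n_gamma_components)]
        m_best = int(np.argmax(lls))
        star_gamma.append(m_best)
        total += lls[m_best]
    return total, star_gamma
```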

In the above approach [(4), (5), (9)], the likelihood of a test frame associated with two training component frames, p(y_t | m_λ, m_γ), is calculated (i.e. the likelihood that a combination of the training frames matches the test frame), where m_λ and m_γ each correspond to a Gaussian component in the appropriate speaker class's GMM, i.e., g_λ(x | m_λ) and g_γ(x | m_γ), which model the probability distributions of the training component frames. Given the probability distributions of the training component frames, and given the assumption that the test frame y_t is an additive mixture of the training component frames, there are several methods, for example log-max, Algonquin, lifted max, or parallel model combination, that can be used to derive the likelihood for the test frame, i.e. the likelihood that a combination of the training frames matches the test frame.

Conveniently, a simple method, the log-max model, may be used to obtain this likelihood. However, any algorithm known in the art can be used to calculate the likelihood p(y_t | m_λ, m_γ). For each frame, its log power spectrum is calculated as the spectral feature. Assume that the log power spectrum of y_t can be expressed in F distinct frequency channels, i.e., y_t = {y_{t,f} : f = 1, 2, ..., F}, where y_{t,f} is the log power of the f-th channel. Then p(y_t | m_λ, m_γ) can be expressed as

p(y_t | m_λ, m_γ) = Π_{f=1}^{F} p(y_{t,f} | m_λ, m_γ)   (12)

where p(y_{t,f} | m_λ, m_γ) is the likelihood of the log power of the f-th channel. For simplicity, in (12) independence between the frequency channels is assumed. Let x_{λ,f} and x_{γ,f} represent the log powers of the same channel of the two component frames, subject to probability distributions g_λ(x | m_λ) and g_γ(x | m_γ). Assume y_{t,f} = max(x_{λ,f}, x_{γ,f}). Thus, p(y_{t,f} | m_λ, m_γ) can be written as

p(y_{t,f} | m_λ, m_γ) = g_λ(y_{t,f} | m_λ) P_γ(y_{t,f} | m_γ) + g_γ(y_{t,f} | m_γ) P_λ(y_{t,f} | m_λ)   (13)

where P_γ(y_{t,f} | m_γ) = ∫_{-∞}^{y_{t,f}} g_γ(x_{γ,f} | m_γ) dx_{γ,f}, and likewise for P_λ(y_{t,f} | m_λ).
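As one concrete illustration of equations (12) and (13) only (other combination models such as Algonquin or parallel model combination could equally be used), the log-max channel likelihood for a pair of diagonal-covariance Gaussians might be computed as follows, with SciPy's normal pdf and cdf standing in for g and its cumulative distribution:

```python
import numpy as np
from scipy.stats import norm

def logmax_frame_loglik(y_frame, mean_l, var_l, mean_g, var_g):
    """Eq. (12)-(13): log-likelihood of a log-power-spectrum test frame under
    the log-max model y_f = max(x_lambda_f, x_gamma_f), assuming independent
    frequency channels and one diagonal Gaussian per speaker class."""
    sd_l, sd_g = np.sqrt(var_l), np.sqrt(var_g)
    # eq. (13): g_lambda * CDF_gamma + g_gamma * CDF_lambda, per channel
    p = (norm.pdf(y_frame, mean_l, sd_l) * norm.cdf(y_frame, mean_g, sd_g) +
         norm.pdf(y_frame, mean_g, sd_g) * norm.cdf(y_frame, mean_l, sd_l))
    return float(np.sum(np.log(p + 1e-300)))  # eq. (12): product over channels
```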

Once the respective matching training segments are identified, they may be regarded as comprising, or being representative of, the separated speech signals. In this example, two training segments are identified (one for each speaker class), and these segments may be considered as constituting the separated speech signals (or more particularly as constituting data from which the respective audio source signal can be reconstructed).

For the purposes of separation, component speaker classes/frames are modelled with gains different from the training data. Rewrite the component-frame Gaussians as g_λ(x | m_λ, a_λ) and g_γ(x | m_γ, a_γ), where a_λ and a_γ are the gain updates (in dB) for speaker class λ and speaker class γ, respectively, and g_λ(x | m_λ, a_λ) = N(x; μ_{m_λ} + a_λ, Σ_{m_λ}), where μ_{m_λ} and Σ_{m_λ} are the training-data based mean vector and covariance matrix of the appropriate Gaussian. For any given test utterance, the gain updates a_λ and a_γ are calculated at the frame level on a frame-by-frame basis, by maximizing the test frame likelihood p(y_t | m_λ, m_γ) against a set of predefined gain update values for each component frame. The gain-optimized test frame likelihood can be expressed as

p(y_t | m_λ, m_γ) = max_{a_λ ∈ A_λ, a_γ ∈ A_γ} Π_{f=1}^{F} p(y_{t,f} | m_λ, m_γ, a_λ, a_γ)   (14)

where A_λ and A_γ are the predefined gain-update value sets for speaker classes λ and γ, and p(y_{t,f} | m_λ, m_γ, a_λ, a_γ) is the local channel likelihood (13) with each component Gaussian including a corresponding gain update. It is to be noted again that the above-described probabilistic approach (including gain identification) is only one of the approaches that can be used for realising the CLOSE method.
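For illustration only, the gain-optimized likelihood of equation (14) might be sketched as a simple grid search over predefined gain-update sets; the gain values below are arbitrary assumptions and, for simplicity, are applied as offsets to the Gaussian means in the same log domain as the features:

```python
import numpy as np

def gain_optimised_loglik(y_frame, mean_l, var_l, mean_g, var_g, frame_loglik,
                          gains_l=(-6.0, -3.0, 0.0, 3.0, 6.0),
                          gains_g=(-6.0, -3.0, 0.0, 3.0, 6.0)):
    """Eq. (14): maximise the frame likelihood over predefined gain updates
    applied to each class's Gaussian mean (frame_loglik could be the log-max
    sketch above)."""
    best = -np.inf
    for a_l in gains_l:
        for a_g in gains_g:
            best = max(best, frame_loglik(y_frame, mean_l + a_l, var_l,
                                          mean_g + a_g, var_g))
    return best
```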

When separation is complete, the analyzer A13 produces, as an output, a respective plurality, or set, of selected training segments for each audio source contributing to the SCM. Each set of training segments may be said to form a sequence in that it comprises a respective training segment for successive segments, and in particular frames, of the SCM. This is illustrated schematically in Figure 6, where successive frames of the SCM (and in particular the TSSF_SCM) are represented by vertical lines 60. For each frame, there is a training segment 62 selected by the analyzer A13 as the best matching segment starting from that frame.

As described above, each training segment is selected from one or more of the training audio signals 34 of the training data 32. In preferred embodiments, the respective set of training segments for each audio source comprises a sequence of training segments taken from at least one, and typically only one, acoustic class associated with the respective audio source. A respective training segment is typically selected (in the "best matching" manner described above) for each frame, or other segment, of the SCM (test utterance). Hence, in the preferred embodiment, for each audio source the analyzer produces a respective sequence of training data segments taken from one or more of the training audio signals of the acoustic class associated with the respective audio source, a respective training data segment being selected for each frame of the SCM by determining the best matching segment composition for that frame. In Figure 6, the training segments 62 for a given audio source are aligned temporally, conveniently by frame. As can be seen from Figure 6, each training segment 62 is aligned according to its start time, conveniently the start time (as indicated by the vertical lines 60) of the frame for which it is selected. The training segments 62 may be of different lengths and may, for example, extend across more than one frame. As a result, one or more of the training segments 62 may overlap with one another when aligned temporally, in this case by frame. The training segments 62 may therefore be said to overlap temporally when aligned. Accordingly, a respective portion of one or more of the training segments 62 may be aligned with each frame of the TSSF_SCM (as represented by the vertical slices between lines 60 in Figure 6).

The output from the analyzer A13 is fed to the reconstruction module A14 to estimate the individual audio component sources in the SCM. Figure 6 shows the preferred reconstruction process schematically. The best matching training segments 62 produced by the analyzer A13 are first aligned with one another, conveniently by starting frame. For each frame of the TSSF_SCM, the respective spectral feature(s) of the portion of the, or each, training segment 62 that is aligned with said frame are combined to produce corresponding composite spectral feature(s) for the corresponding frame of the separated audio signal (this is represented in Figure 6 by the merging module A50). Conveniently, the composite spectral feature(s) are produced by averaging the, or each, training segment spectral feature(s) across each frame, although other combining functions may be used. This results in a respective set of composite spectral feature(s) for each frame of the separated signal. The spectral features typically comprise data representing the frequency component(s) of the segment portion, e.g. data identifying the frequency component(s) and the respective magnitude of those component(s).

In order to reconstruct the separated audio signal in the time domain, frequency to time domain conversion is required (post-processing module A51). Conversion from the composite spectral features to a corresponding time domain audio signal conveniently uses the previously obtained phase information in the inverse spectral transformation corresponding to the spectral transformation used in the pre-processing module A11.
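By way of non-limiting illustration, the merging module A50 and post-processing module A51 might be sketched as follows: the linear magnitude spectra of the selected training segments are aligned by starting frame, averaged where they overlap (the simple unweighted form of the combining function; the posterior-weighted form of equation (15) is described below), and inverted using the stored SCM phase. SciPy's istft is an assumed stand-in for the inverse transform.

```python
import numpy as np
from scipy.signal import istft

def reconstruct_source(selected_segments, scm_phase, sample_rate, nperseg, noverlap):
    """selected_segments: one (start_frame, magnitudes) pair per SCM frame,
    where magnitudes has shape (freq_bins, segment_length) and holds linear
    magnitude spectra of the best matching training segment for that frame."""
    n_frames = scm_phase.shape[1]
    acc = np.zeros(scm_phase.shape)
    count = np.zeros(n_frames)
    for start, mags in selected_segments:
        end = min(start + mags.shape[1], n_frames)
        acc[:, start:end] += mags[:, :end - start]   # align by starting frame (A50)
        count[start:end] += 1
    avg_mag = acc / np.maximum(count, 1)             # average overlapping portions
    complex_spec = avg_mag * np.exp(1j * scm_phase)  # reuse the SCM phase (A51)
    _, waveform = istft(complex_spec, fs=sample_rate,
                        nperseg=nperseg, noverlap=noverlap)
    return waveform
```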

The following description provides the details of a specific implementation of the reconstruction process, set in the context of speech separation, for rebuilding component utterances (from the audio sources contributing to the SCM) based on the longest matching segments found at all the frame times corresponding to the source. Given test utterance y = {y_t : t = 1, 2, ..., T}, after finding the matching segments m̂_{λ,i:j} and m̂_{γ,u:v} at all t [i.e., (8) or (11)], m̂_{λ,i:j} and m̂_{γ,u:v} are used respectively to form estimates of the two component speech utterances forming the test utterance. In the following, the algorithm that uses m̂_{λ,i:j} to estimate the component utterance from speaker class λ is described. The same algorithm can be used to estimate the component utterance from the other speaker class, by replacing m̂_{λ,i:j} with m̂_{γ,u:v}.

Let s_{λ,ε} represent the component frame of the component utterance of speaker class λ at time ε, ε = 1, 2, ..., T, and S_{λ,ε} be the magnitude spectrum of the frame. An estimate of S_{λ,ε} can be obtained by taking all the matching training segments that contain s_{λ,ε} and averaging over the corresponding training frames (a way of merging the matching segments) A50. The following equation can be used:

Ŝ_{λ,ε} = (1 / P̄) Σ_{y_{t:τ} ∋ s_{λ,ε}} A(m̂_{λ,ς(ε)}) exp(â_{λ,ς(ε)}) P(m̂_{λ,i:j}, m̂_{γ,u:v} | y_{t:τ})   (15)

where the sum is over all test segments y_{t:τ} that contain component frame s_{λ,ε}; m̂_{λ,ς(ε)} is the training frame, and â_{λ,ς(ε)} is the associated gain [obtained using (14)] corresponding to s_{λ,ε}, taken from the longest matching training segment m̂_{λ,i:j}; A(m̂_{λ,ς(ε)}) represents a magnitude spectrum associated with training frame m̂_{λ,ς(ε)}. As shown in (15), each component frame is estimated through identification of a longest matching component segment, and each estimate is smoothed over successive longest matching component segments. This improves both the accuracy of frame estimation and robustness to imperfect segment matches.

Frames within the same segment share a common confidence score, which is the posterior probability of the segment. In (15), P̄ is a normalization term. In typical embodiments, the following expression is suitable:

P̄ = Σ_{y_{t:τ} ∋ s_{λ,ε}} P(m̂_{λ,i:j}, m̂_{γ,u:v} | y_{t:τ}) if this sum is at least 1, and P̄ = 1 otherwise   (16)

where the sum runs over the same test segments as in (15). The last condition prevents small posterior probabilities being scaled up to give a false emphasis. If CLOSE-2 (11) is used, the posterior probabilities in (15) and (16) should be replaced by P(m̂_{λ,i:j}, *_{γ,t:τ} | y_{t:τ}), as defined in (11). The merging process A50 for each component source results in the formation of the TSSF of the source, which is then passed to a post-processing module A51 for reconstructing the source signal, as shown in Figure 6. By way of example, the DFT magnitudes of the training frames may be used as the magnitude spectra A(m̂_{λ,ς(ε)}), with the phase of the SCM, to form the estimate of the source signal.
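A minimal sketch of the posterior-weighted merge of equations (15) and (16), under the simplifying assumption that the per-segment contributions for one component frame (training magnitude spectrum, gain and segment posterior) have already been gathered, might read:

```python
import numpy as np

def merge_component_frame(contributions):
    """contributions: list of (magnitude_spectrum, gain, posterior) tuples, one
    for every longest matching segment containing this component frame.
    Returns the estimate of eq. (15) with the normalisation rule of eq. (16);
    the gain is applied as a simple exponential factor, as in (15)."""
    total_post = sum(p for _, _, p in contributions)
    norm = total_post if total_post >= 1.0 else 1.0          # eq. (16)
    est = sum(mag * np.exp(gain) * p for mag, gain, p in contributions)
    return est / norm
```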

However, the configuration mentioned here should not be considered exclusive.

One skilled in the art would recognize other possible configurations for the post-processing stage to estimate the component source signal. The separator A10 may be implemented in hardware and/or computer program code as is convenient. In particular, any one or more of the pre-processing A11, modeling A12, CLOSE A13 and reconstruction A14 stages may be implemented in hardware and/or computer program code as is convenient or best suits a given application.

The invention is not limited to the embodiments described herein which may be modified or varied without departing from the scope of the invention.