


Title:
A SOUND PROCESSING APPARATUS AND METHOD
Document Type and Number:
WIPO Patent Application WO/2020/083479
Kind Code:
A1
Abstract:
The invention relates to a sound processing apparatus (100) comprising a plurality of nodes (101a-f), wherein each node (101a-f) is configured to receive a sound signal from one or more sound sources. The plurality of nodes define a tree with a root node (101a), a plurality of non-root nodes (101b-f), including one or more leaf nodes, and a plurality of edges connecting the plurality of nodes, wherein each leaf node defines a branch of the tree and each node (101a-f) is configured to communicate via one or more of the plurality of edges with one or more other nodes of the plurality of nodes (101a-f). Each node (101a-f) is configured for each of the one or more sound sources (i) to determine a filtering coefficient on the basis of one or more relative transfer functions associated with the respective node and the one or more sound sources and (ii) apply the filtering coefficient to the respective sound signal for obtaining a respective filtered sound signal. The root node (101a) is configured for each of the one or more sound sources to determine an output sound signal by aggregating the plurality of filtered sound signals from each node (101a-f). Moreover, the invention relates to a corresponding sound processing method.

Inventors:
JIN WENYU (DE)
SETIAWAN PANJI (DE)
SHERSON THOMAS (NL)
KLEIJN WILLEM BASTIAAN (NL)
Application Number:
PCT/EP2018/079141
Publication Date:
April 30, 2020
Filing Date:
October 24, 2018
Assignee:
HUAWEI TECH CO LTD (CN)
JIN WENYU (DE)
International Classes:
H04R3/00
Domestic Patent References:
WO2016033364A1 (2016-03-03)
WO2017063706A1 (2017-04-20)
Other References:
None
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A sound processing apparatus (100) comprising a plurality of nodes (101a-f), each node (101a-f) being configured to receive a sound signal from one or more sound sources, the plurality of nodes defining a tree with a root node (101a), a plurality of non-root nodes (101b-f), including one or more leaf nodes, and a plurality of edges connecting the plurality of nodes, each leaf node defining a branch of the tree, and each node being configured to communicate via one or more of the plurality of edges with one or more other nodes of the plurality of nodes (101a-f), wherein each node (101a-f) is configured for each of the one or more sound sources to (i) determine a filtering coefficient on the basis of one or more relative transfer functions associated with the respective node and the one or more sound sources and (ii) apply the filtering coefficient to the respective sound signal for obtaining a respective filtered sound signal; and wherein the root node (101a) is configured for each of the one or more sound sources to determine an output sound signal by aggregating the plurality of filtered sound signals from each node (101a-f).

2. The apparatus (100) of claim 1, wherein the root node (101a) is configured to aggregate the plurality of filtered sound signals from each node (101a-f) by combining a plurality of output sound signal contributions provided by each branch of the tree, wherein each output sound signal contribution is based on a sum of the respective filtered sound signals of the nodes located along the respective branch of the tree, starting at the respective leaf node of the respective branch of the tree.

3. The apparatus (100) of claim 1 or 2, wherein the root node (101a) is further configured for each of the one or more sound sources to determine a plurality of global parameters and to provide the plurality of global parameters to each non-root node (101b-f) and wherein each non-root node (101b-f) is configured for each of the one or more sound sources to determine the filtering coefficient on the basis of the one or more relative transfer functions associated with the respective node and the one or more sound sources and on the basis of the plurality of global parameters provided by the root node (101a).

4. The apparatus (100) of claim 3, wherein each node (101a-f) is further configured for each of the one or more sound sources to determine a matrix $M_{j,s}$ on the basis of the following equation:

$$M_{j,s} = A_{j,s}^H A_{j,s},$$

wherein $j$ denotes a node index, $s$ denotes a sound source index, $A^H$ denotes the Hermitian transpose of the matrix $A$ and the matrix $A_{j,s}$ is given by:

$$A_{j,s} = \begin{bmatrix} a_{j,s} & a_{j,1} & \cdots & a_{j,s-1} & a_{j,s+1} & \cdots & a_{j,S} \end{bmatrix},$$

wherein $a_{j,s}$ denotes a vector for the node $j$ and the sound source $s$ and $S$ denotes the total number of sound sources, wherein the plurality of components of the vector $a_{j,s}$ are defined by the relative transfer function associated with the node $j$ and the sound source $s$ for a plurality of frequency bins indexed by the frequency bin index $k$.

5. The apparatus (100) of claim 4, wherein the root node (101a) is configured for each of the one or more sound sources to aggregate the matrices $M_{j,s}$ determined by all nodes (101a-f) and to determine the plurality of global parameters on the basis of the following equation:

$$m_s^* = \left( \sum_{j=1}^{N} M_{j,s} \right)^{-1} \begin{bmatrix} 1 & \mathbf{0}^T \end{bmatrix}^T,$$

wherein $m_s^*$ denotes a vector defining the plurality of global parameters, $N$ denotes the total number of nodes (101a-f), $A^T$ denotes the transpose of the matrix $A$, and $\mathbf{0}^T$ denotes the transpose of a column null vector of size $S-1$.

6. The apparatus (100) of claim 4 or 5, wherein each non-root node $j$ (101b-f) is configured for each of the one or more sound sources to determine the relative transfer function associated with the node $j$ and the sound source $s$ for a plurality of frequency bins indexed by the frequency bin index $k$ on the basis of the following equation:

$$H_{j,s}^k = \frac{x_j^k}{H_{i,s}^k \, x_r^k},$$

wherein $x_j^k$ denotes the sound signal received by the node $j$ in the frequency bin $k$, $x_r^k$ denotes the sound signal received by the root node $r$ in the frequency bin $k$ and $H_{i,s}^k$ denotes the relative transfer function associated with the node $i$ and the sound source $s$ in the frequency bin $k$, wherein the node $i$ is the parent node of node $j$.

7. The apparatus (100) of claim 4 or 5, wherein each non-root node $j$ (101b-f) is configured to determine the relative transfer function $H_{j,s}^k$ associated with the node $j$ and the sound source $s$ for a plurality of frequency bins indexed by the frequency bin index $k$ on the basis of the following equation:

$$H_{j,s}^k = \alpha \, \frac{x_j^k}{H_{i,s}^k \, x_r^k} + (1 - \alpha) \, \bar{H}_{j,s}^k,$$

wherein $x_j^k$ denotes the sound signal received by the node $j$ in the frequency bin $k$, $x_r^k$ denotes the sound signal received by the root node $r$ in the frequency bin $k$, $H_{i,s}^k$ denotes the relative transfer function associated with the node $i$ and the sound source $s$ in the frequency bin $k$, $\bar{H}_{j,s}^k$ denotes a stored relative transfer function associated with the node $j$ and the sound source $s$ in the frequency bin $k$ and $\alpha$ denotes a parameter in the range $[0, 1]$, wherein the node $i$ is the parent node of node $j$.

8. The apparatus (100) of claim 6 or 7, wherein, in case the parent node of node $j$ is the root node (101a), $H_{i,s}^k$ is equal to a constant, in particular equal to 1.

9. The apparatus (100) of any one of the preceding claims, wherein each non-root node $j$ (101b-f) is configured to determine a second cross moment, in particular a correlation measure, between the sound signal received by the node $j$ and the sound signal received by a neighbouring node $i$ and to make, on the basis of the second cross moment, a local decision whether the sound signal received by the node $j$ originates from one sound source or from more than one sound source.

10. The apparatus (100) of claim 9, wherein the root node (101a) is configured to aggregate the plurality of local decisions of the plurality of non-root nodes (101b-f) and to make, on the basis of the plurality of local decisions, a global decision whether the sound signal received by the node $j$ originates from one sound source or from more than one sound source.

11. The apparatus (100) of claim 10, wherein the root node (101a) is further configured to distribute information about a result of the global decision through the tree to the non-root nodes (101b-f).

12. The apparatus (100) of any one of claims 9 to 11, wherein each non-root node $j$ (101b-f) is configured to determine the second cross moment as a correlation measure $R_j$ between the sound signal received by the node $j$ and the sound signal received by a neighbouring node $i$ on the basis of the following equation:

$$R_j = \sum_{k} x_j^k \left( x_i^k \right)^{*},$$

wherein $x_j^k$ denotes the sound signal received by the node $j$ in the frequency bin $k$ and $x_i^k$ denotes the sound signal received by the node $i$ in the frequency bin $k$.

13. The apparatus (100) of any one of the preceding claims, wherein each node (101a-f) comprises at least one microphone configured to receive the one or more sound signals from the one or more sound sources.

14. The apparatus (100) of any one of the preceding claims, wherein the root node (101a) of the tree is configured to generate the tree having the plurality of nodes (101a-f).

15. The apparatus (100) of any one of the preceding claims, wherein the tree having the plurality of nodes is a spanning tree.

16. A sound processing method (800) using a plurality of nodes (101a-f), each node (101a-f) being configured to receive a sound signal from one or more sound sources, the plurality of nodes (101a-f) defining a tree with a root node (101a), a plurality of non-root nodes (101b-f), including one or more leaf nodes, and a plurality of edges connecting the plurality of nodes (101a-f), each leaf node defining a branch of the tree, and each node (101a-f) being configured to communicate via one or more of the plurality of edges with one or more other nodes of the plurality of nodes (101a-f), wherein the method (800) comprises: by each node (101a-f) for each of the one or more sound sources, determining (801) a filtering coefficient on the basis of one or more relative transfer functions associated with the respective node (101a-f) and the one or more sound sources and applying the filtering coefficient to the respective sound signal for obtaining a respective filtered sound signal; and by the root node (101a) for each of the one or more sound sources, determining (803) an output sound signal by aggregating the plurality of filtered sound signals from each node (101a-f).

17. A computer program product comprising program code configured to control a sound processing apparatus to perform the method (800) according to claim 16 when executed on a computer or processor.

Description:
DESCRIPTION

A SOUND PROCESSING APPARATUS AND METHOD

TECHNICAL FIELD

The present invention relates to sound processing. In particular, the present invention relates to a sound processing apparatus comprising a plurality of spatially separated sound processing nodes or sensors as well as a corresponding method.

BACKGROUND

In the field of sound or acoustic processing, a major challenge is how to separate a particular source signal from interference. A natural approach to address this challenge is to combine multiple spatially diverse recordings to favour particular source locations over others. The advantage of such multi-channel processing over single-channel processing is that it can enhance the signal-to-interference ratio without imposing distortion on the source signal.

Two main approaches are known from the prior art, namely (a) beam-forming, which assumes knowledge of the desired source location, and (b) blind source separation, which assumes source independence. Blind source separation (BSS) approaches offer the advantage that they do not require prior knowledge of the target source location or the structure of the room.

State-of-the-art blind source separation approaches can be divided into different classes based on different assumptions, in particular independent component analysis (ICA) based approaches and sparse component analysis (SCA) based approaches.

ICA based approaches aim to separate a mixture of target sources observed by microphones by designing separation filters. The design of these filters is based on the underlying assumption that the source signals are statistically independent.

SCA based approaches require the signals of interest to be sparse in some domain. For multichannel audio, it is common for SCA based approaches to require that the source signals are sparse in time or in time-frequency. A common example is that persons speak individually at least part of the time.

The strength of ICA based source separation approaches is that they perform well for instantaneous mixtures. However, audio signals are subject to convolutive mixing. The common approach is to transform the signals to the time-frequency domain and apply methods for instantaneous mixing. This leads to the problem of ambiguous permutation over frequency bins, which is difficult to solve. ICA based approaches are typically utilized in off-line contexts. SCA based approaches can perform exact separation for sparse signals by employing temporal masking methods. This results in low complexity methods. However, when multiple sources are active simultaneously, separation is not possible.

The known BSS approaches described above are based on centralized processing. However, BSS is also possible in the context of distributed signal processing, where spatially separated nodes or sensors collaborate, each contributing computational effort, to solve the BSS problem. Distributed BSS approaches can be applied, for instance, in scenarios such as meetings, where the mobile phones of the participants define the distributed nodes. In such scenarios, the participants, i.e. the speakers, sometimes talk simultaneously, but they also often talk individually. In particular for such scenarios, there is a need for improved devices and methods implementing a distributed BSS scheme with low computational and communication requirements.

SUMMARY

It is an object of the invention to provide improved devices and methods implementing a distributed BSS scheme with low computational and communication requirements.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

Generally, embodiments of the invention can be based on a combination of a SCA based approach (having low computational complexity and also low communication requirements) and an ICA based approach (allowing the simultaneous handling of several active sources) as well as distributed processing. Embodiments of the invention solve the BSS problem in a distributed fashion in a network of sensors, wherein each sensor forms a node of the network. The network can, for example, be defined by a plurality of microphone-equipped devices, such as smartphones.

Embodiments of the invention are implemented to aggregate synchronized discrete-time observations in the network of sensors, i.e. nodes. According to embodiments of the invention, the observed sensor data are not all transmitted to a central node of a tree defined by the network of nodes, but instead are aggregated along the tree. This means that the overall number of transmissions required to transmit sensor information grows linearly with the number of nodes. For data blocks that do not originate from a single source, the number of transmissions required is N, where N is the number of nodes. According to embodiments of the invention each node can make one transmission per time-frequency point. According to embodiments of the invention the effort can be reduced further for single-source data blocks, where only the data of one node or of a small set of nodes is used. For such embodiments, nodes that are not downstream of the selected nodes in the tree do not need to transmit.
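For illustration only, the linear-transmission aggregation along a tree can be sketched as follows. The tree layout, node names and signal values are invented toy data, and the scalar per-node contributions stand in for the locally filtered sound signals.

```python
# Hypothetical sketch: each non-root node sends a single partial sum to its
# parent, so the number of transmissions grows linearly with the node count.

def aggregate_along_tree(children, filtered, root):
    """Post-order aggregation; returns (aggregate at root, transmission count)."""
    transmissions = 0

    def up(node):
        nonlocal transmissions
        total = filtered[node]
        for child in children.get(node, []):
            total += up(child)
            transmissions += 1  # one transmission per edge: child -> parent
        return total

    return up(root), transmissions

# Toy tree with root 'a'; values are invented per-node contributions.
children = {'a': ['b', 'c'], 'c': ['d'], 'd': ['e', 'f']}
filtered = {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 4.0, 'e': 5.0, 'f': 6.0}
total, tx = aggregate_along_tree(children, filtered, 'a')
# total == 21.0 and tx == 5, i.e. N - 1 transmissions for N = 6 nodes
```

Each branch contribution is summed on the way up, so no raw sensor data ever needs to travel more than one hop.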

More specifically, according to a first aspect the invention relates to a BSS sound processing apparatus comprising a plurality of spatially separated nodes or sensors, wherein each node is configured to receive a sound signal from one or more sound sources. The plurality of nodes define a tree with a root node, a plurality of non-root nodes, including one or more leaf nodes, and a plurality of edges, i.e. communication links connecting the plurality of nodes, wherein each leaf node defines a branch of the tree and each node is configured to communicate via one or more of the plurality of edges with one or more other neighbouring nodes of the plurality of nodes.

Each node is configured for each of the one or more sound sources (i) to determine at least one local, frequency-dependent filtering coefficient $w_{j,s}^k$ on the basis of one or more relative transfer functions associated with the respective node and the one or more sound sources and (ii) to apply the local filtering coefficient $w_{j,s}^k$ to the respective sound signal received by the respective node for obtaining a respective locally filtered sound signal.

The root node is configured for each of the one or more sound sources to determine a blind-source separated output sound signal by aggregating the plurality of locally filtered sound signals from each node. Thus, an improved sound processing apparatus is provided that allows the BSS problem to be addressed efficiently in a distributed fashion, i.e. with a reduced number of overall transmissions (compared to the conventional growth as the square of the number of sensors/nodes).

In a further possible implementation form of the first aspect, the root node is configured to aggregate the plurality of locally filtered sound signals from each node by combining, in particular summing, a plurality of output sound signal contributions provided by each branch of the tree, wherein each output sound signal contribution is based on a sum of the respective locally filtered sound signals of the nodes located along the respective branch of the tree, starting at the respective leaf node of the respective branch of the tree.

In a further possible implementation form of the first aspect, the root node is further configured for each of the one or more sound sources to determine a plurality of global parameters $m_s$ and to provide the plurality of global parameters to each non-root node, wherein each non-root node is configured for each of the one or more sound sources to determine the local, frequency-dependent filtering coefficient $w_{j,s}^k$ on the basis of the one or more relative transfer functions associated with the respective node and the one or more sound sources and on the basis of the plurality of global parameters provided by the root node.
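For illustration only, the following sketch shows one way a node could derive its local filter coefficients from the global parameters. The text does not spell out this mapping; forming $w_j = A_j m_s$ per frequency bin is an assumption that is merely consistent with the surrounding implementation forms, and all names and values here are invented.

```python
import numpy as np

def local_filter(A_j, m_s):
    """Hypothetical mapping from the global parameter vector m_s to a node's
    local, frequency-dependent filter coefficients: one coefficient per
    frequency bin, obtained by mixing the node's RTF columns with m_s."""
    return A_j @ m_s  # length-K vector of per-bin coefficients

K, S = 4, 3
A_j = np.arange(K * S, dtype=complex).reshape(K, S)  # toy K x S RTF matrix
m_s = np.array([1.0, 0.0, 0.0], dtype=complex)       # toy global parameters
w_j = local_filter(A_j, m_s)
# w_j equals the first column of A_j, since this m_s simply selects it
```

The point of the sketch is only that each node needs just the small vector $m_s$ from the root, not the other nodes' signals, to form its local filter.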

In a further possible implementation form of the first aspect, each node is further configured for each of the one or more sound sources to determine a matrix $M_{j,s}$ on the basis of the following equation:

$$M_{j,s} = A_{j,s}^H A_{j,s},$$

wherein $j$ denotes a node index, $s$ denotes a sound source index, $A^H$ denotes the Hermitian transpose of the matrix $A$ and the matrix $A_{j,s}$ is given by:

$$A_{j,s} = \begin{bmatrix} a_{j,s} & a_{j,1} & \cdots & a_{j,s-1} & a_{j,s+1} & \cdots & a_{j,S} \end{bmatrix},$$

wherein $a_{j,s}$ denotes a vector for the node $j$ and the sound source $s$ and $S$ denotes the total number of sound sources, wherein the plurality of components of the vector $a_{j,s}$ are defined by the relative transfer function associated with the node $j$ and the sound source $s$ for a plurality of frequency bins indexed by the frequency bin index $k$.
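A minimal sketch of this construction, assuming the matrix $M_{j,s}$ takes the Gram form $A_{j,s}^H A_{j,s}$ with the target source's RTF vector placed in the first column of $A_{j,s}$ (an assumption matching the unit/zero constraint vector used for the global parameters); the RTF vectors below are random toy data.

```python
import numpy as np

def build_M(a_vectors, s):
    """a_vectors: list of S complex RTF vectors a_{j,1..S} (one per source),
    each of length K (number of frequency bins). Returns the S x S Hermitian
    matrix M_{j,s} = A^H A, with the target source's vector first."""
    others = [a for t, a in enumerate(a_vectors) if t != s]
    A = np.column_stack([a_vectors[s]] + others)  # K x S
    return A.conj().T @ A                          # S x S, Hermitian

K, S = 4, 3
rng = np.random.default_rng(0)
a_vecs = [rng.standard_normal(K) + 1j * rng.standard_normal(K) for _ in range(S)]
M = build_M(a_vecs, s=1)
# M is S x S and Hermitian (M equals its own conjugate transpose)
```

Because $M_{j,s}$ is only $S \times S$, each node can transmit it cheaply regardless of the number of frequency bins $K$.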

In a further possible implementation form of the first aspect, the root node is configured for each of the one or more sound sources to aggregate, i.e. sum, the matrices $M_{j,s}$ determined by all nodes and to determine the plurality of global parameters $m_s$ on the basis of the following equation:

$$m_s^* = \left( \sum_{j=1}^{N} M_{j,s} \right)^{-1} \begin{bmatrix} 1 & \mathbf{0}^T \end{bmatrix}^T,$$

wherein $m_s^*$ denotes a vector defining the plurality of global parameters, $N$ denotes the total number of nodes, $A^T$ denotes the transpose of the matrix $A$, and $\mathbf{0}^T$ denotes the transpose of a column null vector of size $S-1$. Aggregation can comprise that each leaf node sends the matrix $M_{j,s}$ to its parent node. Once an intermediate node has received the matrices from all of its child nodes, it can forward the aggregated sum matrix to its parent node.
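A sketch of the root node's global-parameter computation, assuming the aggregated matrices are summed and the right-hand side is a unit/zero constraint vector (unit response for the target source, nulls for the other $S-1$ sources); the aggregate matrix below is invented toy data.

```python
import numpy as np

def global_parameters(M_sum):
    """Solve (sum_j M_{j,s}) m_s = [1, 0, ..., 0]^T for the global parameter
    vector m_s. The solver form is an assumption consistent with the stated
    unit/zero right-hand side of size S."""
    S = M_sum.shape[0]
    e1 = np.zeros(S, dtype=complex)
    e1[0] = 1.0
    return np.linalg.solve(M_sum, e1)

# Toy aggregate for S = 3 sources: a diagonal (trivially Hermitian) matrix.
M_sum = np.diag([2.0, 4.0, 5.0]).astype(complex)
m = global_parameters(M_sum)
# m == [0.5, 0, 0]
```

Only this small vector then needs to be distributed back down the tree to the non-root nodes.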

In a further possible implementation form of the first aspect, each non-root node $j$ of the tree is configured for each of the one or more sound sources to determine the relative transfer function associated with the node $j$ and the sound source $s$ for a plurality of frequency bins indexed by the frequency bin index $k$ on the basis of the following equation:

$$H_{j,s}^k = \frac{x_j^k}{H_{i,s}^k \, x_r^k},$$

wherein $x_j^k$ denotes the sound signal received by the node $j$ in the frequency bin $k$, $x_r^k$ denotes the sound signal received by the root node $r$ in the frequency bin $k$ and $H_{i,s}^k$ denotes the relative transfer function associated with the node $i$ and the sound source $s$ in the frequency bin $k$, wherein the node $i$ is the parent node of node $j$.

In a further possible implementation form of the first aspect, each non-root node $j$, i.e. each node except the root node of the tree, is configured to determine the relative transfer function associated with the node $j$ and the sound source $s$ for a plurality of frequency bins indexed by the frequency bin index $k$ on the basis of the following equation:

$$H_{j,s}^k = \alpha \, \frac{x_j^k}{H_{i,s}^k \, x_r^k} + (1 - \alpha) \, \bar{H}_{j,s}^k,$$

wherein $x_j^k$ denotes the sound signal received by the node $j$ in the frequency bin $k$, $x_r^k$ denotes the sound signal received by the root node $r$ in the frequency bin $k$, $H_{i,s}^k$ denotes the relative transfer function associated with the node $i$ and the sound source $s$ in the frequency bin $k$, $\bar{H}_{j,s}^k$ denotes a stored relative transfer function associated with the node $j$ and the sound source $s$ in the frequency bin $k$ and $\alpha$ denotes a parameter in the range $[0, 1]$, wherein the node $i$ is the parent node of node $j$. The parameter $\alpha$ can be chosen small for more robustness and large for rapid adaptation.
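For illustration only, one plausible reading of this recursive, smoothed relative-transfer-function update can be sketched as follows: an instantaneous estimate formed from the node's own signal, the root's signal and the parent's RTF is blended with a stored value. The exact form of the instantaneous estimate is an assumption, and all signal values below are invented.

```python
import numpy as np

def update_rtf(x_j, x_r, H_parent, H_stored, alpha):
    """Per-bin smoothed RTF update: blend the instantaneous estimate
    x_j / (H_parent * x_r) with the stored value H_stored. A large alpha
    adapts quickly; a small alpha is more robust to estimation noise."""
    inst = x_j / (H_parent * x_r)
    return alpha * inst + (1.0 - alpha) * H_stored

# Toy bins; the parent is the root, so H_parent = 1 per the constant-RTF case.
x_r = np.array([1.0 + 0j, 2.0 + 0j])
x_j = np.array([2.0 + 0j, 4.0 + 0j])       # true per-bin ratio of 2
H_stored = np.array([1.0 + 0j, 1.0 + 0j])
H = update_rtf(x_j, x_r, np.ones(2, dtype=complex), H_stored, alpha=0.5)
# H == [1.5, 1.5]: halfway between the stored value 1 and the estimate 2
```

With alpha = 1 the node would trust only the current single-source data block; with alpha = 0 it would keep its stored estimate unchanged.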

In a further possible implementation form of the first aspect, in case the parent node of node $j$ is the root node, $H_{i,s}^k$ is equal to a constant, in particular equal to 1.

In a further possible implementation form of the first aspect, each non-root node $j$, i.e. each node except the root node of the tree, is configured to determine a second cross moment, in particular a correlation measure, between the sound signal received by the node $j$ and the sound signal received by a neighbouring node $i$ and to make, on the basis of the second cross moment, a local decision whether the sound signal received by the node $j$ originates from one sound source or from more than one sound source.

In a further possible implementation form of the first aspect, the root node is configured to aggregate the plurality of local decisions of the plurality of non-root nodes and to make, on the basis of the plurality of local decisions, a global decision whether the sound signal received by the node $j$ originates from one sound source or from more than one sound source.

In a further possible implementation form of the first aspect, the root node is further configured to distribute information about a result of the global decision through the tree to the non-root nodes.

In a further possible implementation form of the first aspect, each non-root node $j$, i.e. each node except the root node of the tree, is configured to determine the second cross moment as a correlation measure $R_j$ between the sound signal received by the node $j$ and the sound signal received by a neighbouring node $i$ on the basis of the following equation:

$$R_j = \sum_{k} x_j^k \left( x_i^k \right)^{*},$$

wherein $x_j^k$ denotes the sound signal received by the node $j$ in the frequency bin $k$ and $x_i^k$ denotes the sound signal received by the node $i$ in the frequency bin $k$.
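For illustration only, the cross-moment-based local decision can be sketched as below. The normalisation to $[0, 1]$ and the threshold value are illustrative assumptions, not taken from the text, and the toy spectra are invented.

```python
import numpy as np

def local_decision(x_j, x_i, threshold=0.9):
    """Second cross moment between the spectra of neighbouring nodes j and i,
    normalised so it can be thresholded. Returns (measure, True if the data
    block looks like it originates from a single active source)."""
    num = abs(np.vdot(x_i, x_j))                      # |sum_k x_j^k (x_i^k)^*|
    den = np.linalg.norm(x_j) * np.linalg.norm(x_i)
    r = num / den if den > 0 else 0.0
    return r, r >= threshold

# Single-source toy case: x_j is a scaled copy of x_i, so the measure is 1.
x_i = np.array([1.0 + 1j, 2.0 - 1j, 0.5 + 0j])
r, single = local_decision(2.0 * x_i, x_i)
# r == 1.0 and single is True for this perfectly correlated pair
```

Each node's boolean decision, not its raw signal, is then what travels up the tree toward the root for the global decision.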

In a further possible implementation form of the first aspect, each node comprises at least one microphone configured to receive the one or more synchronized sound signals from the one or more sound sources.

In a further possible implementation form of the first aspect, the root node of the tree is configured to generate the tree having the plurality of nodes.

In a further possible implementation form of the first aspect, the tree having the plurality of nodes is a spanning tree.

According to a second aspect the invention relates to a corresponding BSS sound processing method using a plurality of spatially separated nodes or sensors, wherein each node is configured to receive a sound signal from one or more sound sources. The plurality of nodes define a tree with a root node, a plurality of non-root nodes, including one or more leaf nodes, and a plurality of edges, i.e. communication links connecting the plurality of nodes, wherein each leaf node defines a branch of the tree and each node is configured to communicate via one or more of the plurality of edges with one or more other neighbouring nodes of the plurality of nodes. The method comprises the steps of: by each node, for each of the one or more sound sources, determining at least one local, frequency-dependent filtering coefficient $w_{j,s}^k$ on the basis of one or more relative transfer functions associated with the respective node and the one or more sound sources and applying the local filtering coefficient $w_{j,s}^k$ to the respective sound signal received by the respective node for obtaining a respective locally filtered sound signal; and by the root node, for each of the one or more sound sources, determining a blind-source separated output sound signal by aggregating the plurality of locally filtered sound signals from each node. Thus, an improved sound processing method is provided that allows the BSS problem to be addressed efficiently in a distributed fashion, i.e. with a reduced number of overall transmissions (compared to the conventional growth as the square of the number of sensors/nodes).

The sound processing method according to the second aspect of the invention can be performed by the sound processing apparatus according to the first aspect of the invention. Further features of the sound processing method according to the second aspect of the invention result directly from the functionality of the sound processing apparatus according to the first aspect of the invention and its different implementation forms described above and below.

According to a third aspect the invention relates to a computer program product comprising program code for performing the method according to the second aspect when executed on a computer.

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the invention are described in more detail with reference to the attached figures and drawings, in which:

Fig. 1 is a schematic diagram showing an example of a sound processing apparatus according to an embodiment of the invention;

Fig. 2 is a schematic diagram illustrating in more detail different processing blocks and/or steps implemented in different nodes of a sound processing apparatus according to an embodiment of the invention;

Fig. 3 is a schematic diagram illustrating in more detail different aspects implemented in a sound processing apparatus according to an embodiment of the invention;

Fig. 4 is a schematic diagram illustrating in more detail different aspects implemented in a sound processing apparatus according to an embodiment of the invention;

Fig. 5 is a schematic diagram illustrating in more detail different aspects implemented in a sound processing apparatus according to an embodiment of the invention;

Fig. 6 is a schematic diagram illustrating in more detail different aspects implemented in a sound processing apparatus according to an embodiment of the invention;

Fig. 7 is a schematic diagram illustrating in more detail different aspects implemented in a sound processing apparatus according to an embodiment of the invention; and

Fig. 8 is a flow diagram showing an example of a sound processing method according to an embodiment of the invention.

In the following, identical reference signs refer to identical or at least functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the invention or specific aspects in which embodiments of the invention may be used. It is understood that embodiments of the invention may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the invention is defined by the appended claims.

For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 1 shows a sound processing apparatus or system 100 according to an embodiment of the invention. The sound processing apparatus 100 comprises a plurality of spatially separated nodes or sensors 101a-f. In the exemplary embodiment shown in figure 1, the plurality of nodes 101a-f of the sound processing apparatus 100 are implemented in the form of smartphones 101a-f having a microphone and a processor as well as being configured to communicate with each other. The person skilled in the art, however, will appreciate that the sound processing apparatus 100 can comprise other types of electronic devices having one or more microphones and a processor as well as being configured to communicate with each other.

Each of the nodes 101a-f of the sound processing apparatus 100 shown in figure 1 is configured to receive a sound signal from one or more sound sources, such as from one or more participants of a meeting. As indicated in figure 1 by the arrows and as will be described in more detail below in the context of figures 3 to 7, the plurality of nodes 101a-f define a tree with the exemplary root node 101a, a plurality of non-root nodes 101b-f, including one or more leaf nodes 101b,e,f, and a plurality of edges, i.e. communication links connecting the plurality of nodes 101a-f. Each node 101a-f is configured to communicate via one or more of the plurality of edges with one or more other neighbouring nodes of the plurality of nodes 101a-f. Each leaf node 101b,e,f defines a branch of the tree (e.g. the leaf node 101e defines a branch of the tree comprising the nodes 101a, 101c and 101e).

In an embodiment, the root node 101a of the tree is configured to generate the tree having the plurality of nodes 101a-f. In an embodiment, the tree having the plurality of nodes 101a-f is a spanning tree.

As will be described in more detail further below, each node 101a-f is configured for each of the one or more sound sources (i) to determine one or more local, frequency-dependent filtering coefficients $w_{j,s}^k$ on the basis of one or more relative transfer functions associated with the respective node and the one or more sound sources and (ii) to apply the one or more local filtering coefficients $w_{j,s}^k$ to the respective sound signal received by the respective node for obtaining a respective locally filtered sound signal.

The root node 101 a is configured for each of the one or more sound sources to determine a blind-source separated output sound signal by aggregating the plurality of locally filtered sound signals from each node. As will be described in more detail further below, the root node 101 a can be configured to aggregate the plurality of locally filtered sound signals from each node by combining, in particular summing a plurality of output sound signal contributions provided by each branch of the tree, wherein each output sound signal contribution is based on a sum of the respective locally filtered sound signals of the nodes located along the respective branch of the tree, starting at the respective leaf node of the respective branch of the tree.

Embodiments of the invention are based on the following assumptions: the nodes/sensors 101a-f are configured to make synchronized discrete-time observations and the nodes/sensors 101a-f have the ability to process data. The BSS problem can be initiated at a single node, in particular the root or query node 101a, and the separated signals can be made available to that node 101a. According to embodiments of the invention, the total number of microphones implemented in the plurality of nodes 101a-f is larger than the total number of sound sources. According to embodiments of the invention, each sound source should be active alone at least some of the time (typically satisfied in the meeting scenario described above). According to embodiments of the invention, the signal processing can be done in the time-frequency domain, where the window length used is advantageously set to be longer than the T60 reverberation time of the acoustic environment.
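As an illustration of the last assumption, the minimum window length can be derived from the T60 time; the sample rate and T60 value below are arbitrary examples, not values taken from the description:

```python
def min_window_length(t60_seconds, sample_rate):
    """Return the smallest power-of-two window length (in samples)
    that is longer than the T60 reverberation time."""
    samples = int(t60_seconds * sample_rate) + 1
    n = 1
    while n < samples:
        n *= 2
    return n

# Example: a moderately reverberant room, 16 kHz sampling (assumed values).
window = min_window_length(0.3, 16000)
```

A power of two is chosen here only because it suits a radix-2 FFT; any window longer than the T60 time satisfies the assumption.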

Under further reference to figures 2 to 7 some further details, implementation forms and embodiments of the sound processing apparatus 100 will be described in the following.

Figure 2 is a schematic diagram illustrating in more detail different processing blocks and/or steps implemented in the root node 101a and the exemplary non-root nodes 101b-d of the sound processing apparatus 100 according to an embodiment of the invention.

According to embodiments of the invention, the sound processing apparatus 100 can be configured to perform one or more of the following steps illustrated in figure 2:

(a) Receive a BSS request at the query node 101a.

(b) In response to the BSS request, generate the tree having the plurality of nodes, advantageously a spanning tree. For generating the tree, conventional tree-building algorithms can be used.

(c) Obtain a data block, wherein a data block is a window of observations at each node (see 201 a, 201 b of figure 2).

(d) Locally perform FFT/DFT at each node for said data block (see 203a, 203b of figure 2).

(e) Globally detect if only one source is active (see 205a, 207a, 205b, 207b of figure 2; referred to as “Single Source Detector” and “Query Node Single Source Consensus” in figure 2).

If a single source state is detected:

(f1) Identify the source or designate it as new (see 209a, 211a, 209b, 211b of figure 2; referred to as “Source Labeler” and “Query Node Source Label Computer” in figure 2).

(f2) Locally refine or compute new relative transfer function, RTF, for each source (see 213a, 213b of figure 2).

(f3) Locally or globally refine or update the filter coefficient for the identified source (see 213a, 215a, 213b, 215b of figure 2).

(f4) Place only the active source in the transmission set, i.e. masking the other sources (see 217a, 217b of figure 2; referred to as “Masker” in figure 2).

If multiple sources are active:

(g1) For each source, locally multiply the observation with the source filter (coefficient) to obtain the local source contribution (see 219a, 219b of figure 2). Place all known sources in the transmission set.

(g2) For all sources in the transmission set, separately for each source, starting from the leaf nodes 101b,e,f and moving toward the root node 101a, aggregate, i.e. sum, the local contributions upstream (see 221a, 221b of figure 2).

(h) Perform inverse FFT/DFT and overlap-add for each source at root node 101 a (see 223 of figure 2).

(i) Output blind-source separated signals.

In the following the most important processing blocks/steps described above in the context of figure 2 will be described in more detail.

As already mentioned above, in step (b) of the above algorithm the tree comprising the plurality of nodes 101a-f can be generated by the root or query node 101a using conventional tree-building algorithms. Once the tree is established, the edges of the tree form the communication links between the nodes 101a-f of the network. The query node advantageously forms the root node 101a of the tree. Tree-building algorithms are ubiquitous in the literature and hence known to those skilled in the art. The tree can be optimal with a minimum number of edges, i.e. a spanning tree, but can also be a non-optimal tree. Non-optimal trees that can be built with simple greedy algorithms without central processing may be advantageous from a communication and computational viewpoint.
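The tree generation of step (b) can be sketched, for instance, as a breadth-first construction; the adjacency below merely mimics the six-node network of figure 1 and is an illustrative assumption:

```python
from collections import deque

def build_tree(adjacency, root):
    """Build a spanning tree over the node network by breadth-first search.

    adjacency maps each node id to the set of nodes it can reach directly;
    returns a dict mapping every non-root node to its parent in the tree.
    """
    parent = {}
    visited = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                parent[neighbour] = node
                queue.append(neighbour)
    return parent

# Network mimicking figure 1: query node "a" plus nodes "b"-"f" (assumed links).
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a", "e"},
    "d": {"a", "f"},
    "e": {"c"},
    "f": {"d"},
}
parents = build_tree(adjacency, "a")
```

The resulting parent pointers are exactly the communication links that the later leaf-to-root aggregation steps traverse.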

In step (e) of the above algorithm the sound processing apparatus 100 is configured to detect whether only a single source is active or whether several sources are currently active (referred to as “single source detector” in figure 2).

In an embodiment, each non-root node j 101b-f is configured to determine a second cross moment, in particular a correlation measure, between the sound signal received by the node j and the sound signal received by a neighbouring node i and to make, on the basis of the second cross moment, a local decision whether the sound signal received by the node j originates from one sound source or from more than one sound source.

In an embodiment, the root node 101a is configured to aggregate the plurality of local decisions of the plurality of non-root nodes 101b-f and to make, on the basis of the plurality of local decisions, a global decision whether the received sound signals originate from one sound source or from more than one sound source.

In an embodiment, the root node 101a is further configured to distribute information about a result of the global decision through the tree to the non-root nodes.

In an embodiment, each non-root node j 101b-f is configured to determine the second cross moment as a correlation measure R_j between the sound signal received by the node j and the sound signal received by a neighbouring node i on the basis of the following equation: wherein x_j^k denotes the sound signal received by the node j in the frequency bin k and x_i^k denotes the sound signal received by the node i in the frequency bin k.

An example for the above embodiments of the “single source detector” is shown in figure 3. Each node n computes a local binary decision, namely r_n = 1 if R_n ≥ γ (single source), else r_n = 0, where γ is a threshold value. A two-step consensus algorithm can be used, which comprises the following steps:

(i) Calculate a minimum neighbor value r_n^min = min_{i∈N(n)} r_i, where N(n) denotes the set of nodes neighboring node n.

(ii) Aggregate, i.e. sum, 1 − r_n^min from each leaf node towards the root node 101a.

(iii) A single source is considered active if the result of the previous step is zero.
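A minimal sketch of this two-step consensus on a small tree follows; taking the minimum over a node together with its neighbours is an assumption, since the exact neighbourhood convention is not spelled out above:

```python
def single_source_consensus(flags, neighbours, parent, order):
    """Two-step consensus: each node n first takes the minimum of the
    binary flags r over itself and its neighbours, then 1 - r_n^min is
    summed from the leaves towards the root.  A single source is
    declared active iff the aggregated sum at the root is zero."""
    r_min = {n: min(flags[i] for i in neighbours[n] | {n}) for n in flags}
    total = {n: 1 - r_min[n] for n in flags}
    root = None
    for n in order:                 # leaves first, root last
        p = parent.get(n)
        if p is None:
            root = n
        else:
            total[p] += total[n]    # push partial sum towards the root
    return total[root] == 0

# Small assumed tree: root "a", children "b" and "c", leaf "e" under "c".
neighbours = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "e"}, "e": {"c"}}
parent = {"a": None, "b": "a", "c": "a", "e": "c"}
order = ["b", "e", "c", "a"]

unanimous = single_source_consensus({"a": 1, "b": 1, "c": 1, "e": 1},
                                    neighbours, parent, order)
dissent = single_source_consensus({"a": 1, "b": 1, "c": 1, "e": 0},
                                  neighbours, parent, order)
```

The minimum step means a single dissenting node vetoes the single-source decision for its whole neighbourhood, which the zero-sum test at the root then detects.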

As already described above, the “source labeller” shown in 209a, 209b of figure 2 identifies a known source or designates a new source.

An exemplary implementation form of the source labeller implemented by the nodes 101a-f of the sound processing apparatus 100 can be based on the following algorithm, which is illustrated in figure 4. For each known source, e.g. speaker s, each node i 101a-f can initialize a figure of merit e_{s,i} = 0 and compute the following measure with a neighboring node j: wherein H_i^l denotes the relative transfer function associated with the node i and the frequency bin index l. Moreover, each node i can compute the cumulative sum of the measure:

In a next stage, a summation of the different e_{s,i} for each source s from the leaf nodes towards the root node is performed, i.e. Σ_i e_{s,i}.

In a final stage, the root node 101a is configured to determine the most likely source on the basis of the following equation: wherein S denotes the set of known sound sources. If Σ_i e_{s*,i} ≤ φN, where φ is a threshold and N the number of nodes in the network, then the known sound source s* is the current sound source. If, otherwise, Σ_i e_{s*,i} > φN, then the current sound source is new.
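The final labelling decision can be sketched as follows; because the measure e_{s,i} itself is derived from the RTFs, plain numbers stand in for it here (illustrative values only):

```python
def label_source(scores, phi):
    """Decide whether the active source is a known one or new.

    scores maps each known source id to a list of per-node figures of
    merit e_{s,i} (low values mean a good match); phi is the per-node
    threshold from the text.  Returns the known source id, or None to
    designate a new source."""
    totals = {s: sum(e) for s, e in scores.items()}
    n_nodes = len(next(iter(scores.values())))
    best = min(totals, key=totals.get)      # most likely known source s*
    if totals[best] <= phi * n_nodes:
        return best                         # known source s* is active
    return None                             # designate a new source

scores = {"s1": [0.1, 0.2, 0.1], "s2": [0.9, 0.8, 0.7]}
known = label_source(scores, phi=0.5)   # best match is good enough -> "s1"
fresh = label_source(scores, phi=0.05)  # even the best match is too poor
```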

As already described above, in step (f2) of the algorithm the sound processing apparatus 100 is configured to locally refine or compute a new relative transfer function for each node and sound source (see 213a, 213b of figure 2). Figure 5 illustrates in more detail how according to an embodiment the sound processing apparatus 100 is implemented to compute the relative transfer function for each node and sound source. For the root or query node 101a it can be assumed that H = 1. For a new single sound source the respective relative transfer function for each node progressing downward from the root node 101a to the leaves of the tree can be determined successively on the basis of the following equation: wherein i denotes the neighboring node of node j in the tree that is closer to the root node 101a. In other words, in case the current sound source is a new sound source, i.e. not known already, each non-root node j 101b-f can be configured for each of the one or more sound sources to determine the relative transfer function associated with the node j and the sound source s for a plurality of frequency bins indexed by the frequency bin index k on the basis of the following equation: wherein H_{i,s}^k denotes the relative transfer function associated with the node i and the sound source s in the frequency bin k and wherein the node i is the parent node of node j.

In case the current sound source is already known, i.e. a sound source already processed by the sound processing apparatus 100 and thus with a relative transfer function thereof stored in a memory of the sound processing apparatus 100, each non-root node j 101b-f can be configured to determine the relative transfer function associated with the node j and the sound source s for a plurality of frequency bins indexed by the frequency bin index k on the basis of the following equation: wherein H̃_{j,s}^k denotes a stored relative transfer function associated with the node j and the known sound source s and α denotes a parameter in the range [0, 1] (wherein the node i is the parent node of node j). The parameter α can be chosen small for more robustness and large for rapid adaptation. In case the parent node of node j is the root node, H_{i,s}^k can be chosen to be equal to a constant, in particular equal to 1.

As already described above, in step (f3) of the algorithm described above the sound processing apparatus 100 is configured to locally or globally refine or update the filter coefficient for each identified source (see 213a, 215a, 213b, 215b of figure 2). This computation is done on the basis of the determined or known RTFs. Each identified sound source has its own filter, as will be described in the following in the context of figures 6 and 7.
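The downward RTF propagation, including the blending of a fresh ratio estimate with a stored RTF for a known source, can be sketched as follows (the two-node chain, the spectra and the blending form alpha*fresh + (1-alpha)*stored are illustrative assumptions):

```python
import numpy as np

def propagate_rtfs(x, parent, order, stored=None, alpha=1.0):
    """Estimate relative transfer functions during a single-source block.

    x maps node id -> complex spectrum (one value per frequency bin),
    parent gives each node's parent in the tree (None for the root) and
    order lists the nodes root-first.  With stored RTFs and 0 <= alpha
    <= 1 the fresh ratio estimate is blended with the stored one;
    alpha = 1 reproduces the new-source case."""
    H = {}
    for j in order:                      # root first, then downward
        i = parent[j]
        if i is None:
            H[j] = np.ones_like(x[j])    # H = 1 at the root by convention
            continue
        fresh = (x[j] / x[i]) * H[i]     # chain the ratio along the edge
        if stored is not None and j in stored:
            # Known source: blend the fresh estimate with the stored RTF.
            H[j] = alpha * fresh + (1 - alpha) * stored[j]
        else:
            H[j] = fresh
    return H

x = {"a": np.array([1.0 + 0j, 2.0]), "b": np.array([2.0 + 0j, 4.0])}
parent = {"a": None, "b": "a"}
order = ["a", "b"]

H_new = propagate_rtfs(x, parent, order)
H_known = propagate_rtfs(x, parent, order,
                         stored={"b": np.array([1.0 + 0j, 1.0])}, alpha=0.5)
```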

Using the set of RTFs, a set of separating filters can be generated which recover one source while suppressing the others. According to an embodiment, these filters can be constructed via an equivalent distributed optimization problem. To this end, a matrix A_j can be defined by the following equation: wherein a_{j,s} denotes a vector for the node j and the sound source s and S denotes the total number of sound sources, wherein the plurality of components of the vector a_{j,s} are defined by the estimated relative transfer function associated with the node j and the sound source s for a plurality of frequency bins indexed by the frequency bin index k. The design of a filter which preserves the target signal and suppresses the other sources can be rephrased as a quadratic optimization problem of the following form:

Here the components of the vectors w_j correspond to the local filter coefficients or weights at node j for the given target source, as already described above. The computation of these filters can be equivalently expressed in the Lagrange dual domain. Specifically, for each constraint one can introduce the dual variables μ (which herein are also referred to as global parameter(s)) so that the following Lagrangian function can be formed: Embodiments of the invention are configured to determine the local filter coefficient(s) by solving the dual problem instead, namely:

The unique solution to this problem and the link between the dual variables and the filter coefficient(s) is given by: with the local filter coefficient vector for node j given by: w_j* = A_j μ*
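A compact numerical sketch of the dual-domain construction follows; the block sizes, the random stand-ins for the RTF vectors and the constraint vector d (preserve the first source, null the second) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stacked RTF-derived vectors for S = 2 sources over N = 3 nodes with one
# coefficient per node (illustrative sizes, not taken from the text).
A_blocks = [rng.standard_normal((1, 2)) + 1j * rng.standard_normal((1, 2))
            for _ in range(3)]

# Dual-domain solution: the root only needs the sum of A_j^H A_j.
M = sum(A.conj().T @ A for A in A_blocks)
d = np.array([1.0, 0.0])              # preserve source 1, null source 2
mu = np.linalg.solve(M, d)            # optimal dual variables (global params)
w_blocks = [A @ mu for A in A_blocks]  # local filters w_j* = A_j mu*

# The filters jointly satisfy the preserve/suppress constraints:
response = sum(A.conj().T @ w for A, w in zip(A_blocks, w_blocks))
```

Because the constraint residual equals M μ − d by construction, solving the small S × S dual system is equivalent to the original filter design, which is what makes the in-network aggregation of A_j^H A_j sufficient.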

As already described above, the dual variables can be computed by the root node 101a by aggregating the sum of A_j^H A_j at the root node 101a. Given the tree structure of the node network, this operation can be performed in finite time using message passing. Beginning at the leaf nodes (those nodes with only one neighbour, such as the nodes 101b,e,f shown in figure 1) each node can initialize the message:

M_j = A_j^H A_j

Each leaf node (e.g. the nodes 101b,e,f shown in figure 1) then transmits its message to its parent node (e.g. the nodes 101c and 101d shown in figure 1). Once a node has received messages from all its descendants, it generates the updated message: M_j ← A_j^H A_j + Σ_{i∈C(j)} M_i, wherein C(j) denotes the set of child nodes of the node j.

This message is then sent to its parent node and the process is repeated until all messages reach the root node 101a (which, by definition, has no parent node). This process, which is illustrated in figure 6, therefore takes finite time to perform. The root node 101a then computes the optimal dual variables, i.e. the global parameters, on the basis of the following equation:
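The leaf-to-root message passing can be sketched as follows; scalars stand in for the matrices A_j^H A_j purely for brevity, and the four-node tree is an illustrative assumption:

```python
def aggregate_messages(local, parent, order):
    """Leaf-to-root message passing: each node starts from its local
    quantity M_j (standing in for A_j^H A_j), adds the messages received
    from its children, and forwards the result to its parent.  Returns
    the fully aggregated message arriving at the root."""
    msg = dict(local)                  # each node initialises M_j locally
    root = None
    for n in order:                    # leaves first, root last
        p = parent.get(n)
        if p is None:
            root = n
        else:
            msg[p] += msg[n]           # forward the partial sum upstream
    return msg[root]

local = {"a": 1.0, "b": 2.0, "c": 3.0, "e": 4.0}
parent = {"a": None, "b": "a", "c": "a", "e": "c"}
order = ["b", "e", "c", "a"]
total = aggregate_messages(local, parent, order)
```

Each edge carries exactly one (partial-sum) message, which is why the aggregation finishes in finite time on any tree.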

These global parameters are then diffused back through the tree such that every node 101a-f receives the global parameters, i.e. learns the optimal duals (as illustrated in figure 7). This can be performed via a broadcast transmission scheme. Once a node knows the global parameters, it can compute its filter coefficient for the given source as: w_j* = A_j μ*

The filtered output for any frame of audio can then be computed by applying these filter coefficients and using data aggregation to form the estimated sources at the query node.

In other words, according to an embodiment, the root node 101a is configured for each of the one or more sound sources to determine a plurality of global parameters, i.e. the components of the dual variable/vector μ*, and to provide the plurality of global parameters to each non-root node 101b-f such that each non-root node 101b-f is configured for each of the one or more sound sources to determine the local, frequency-dependent filtering coefficient w*_{j,s} on the basis of the one or more relative transfer functions associated with the respective node and the one or more sound sources and on the basis of the plurality of global parameters provided by the root node 101a.

Moreover, in an embodiment, each node 101a-f is configured for each of the one or more sound sources to determine a (message) matrix M_{j,s} on the basis of the following equation: wherein s denotes a sound source index. The root node 101a, in turn, is configured for each of the one or more sound sources to aggregate, i.e. sum, the message matrices M_{j,s} determined by all nodes and to determine the plurality of global parameters, i.e. the components of the dual variable/vector μ*_s, on the basis of the following equation: wherein μ*_s denotes the vector defining the plurality of global parameters, N denotes the total number of nodes, A^T denotes the transpose of the matrix A, and 0^T denotes the transpose of a column null vector of size S − 1.

As already described above, according to an embodiment, in step (f4) of the algorithm described above the sound processing apparatus 100 is configured to place only the active source in the transmission set, i.e. masking the other sources (see 217a, 217b of figure 2; referred to as “Masker” in figure 2). According to an embodiment, the masker can be a local processor that selects for transmission only the active source, thus “masking” the non-active sources. If the masker is omitted, filtering for all source signals should be performed in all data blocks. In further embodiments, the masker additionally can set the active signals to zero in all but one selected node. The selected node can be the node where the observation is the loudest. In an embodiment, the selected node can be found by transmitting a node-index and node-loudness pair from the leaf nodes 101b,e,f to the root node 101a. At each node one pair can be selected for onward transmission from the upstream and local pairs. The selected pair is the pair for which the loudness is the highest. The root node 101a then can propagate back to the leaf nodes 101b,e,f the index of the selected node, which has the observation with the highest loudness. Many other variations on the node selection theme are possible, where more than one node provides non-zero information to the root node 101a.
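The loudest-node selection used by the masker can be sketched as follows (the tree and the loudness values are illustrative assumptions):

```python
def select_loudest_node(loudness, parent, order):
    """Leaf-to-root selection of the node with the loudest observation:
    each node forwards the (index, loudness) pair with the highest
    loudness among its own pair and those received from upstream; the
    winning index is then broadcast back down the tree."""
    best = {n: (n, loudness[n]) for n in loudness}
    root = None
    for n in order:                    # leaves first, root last
        p = parent.get(n)
        if p is None:
            root = n
        elif best[n][1] > best[p][1]:
            best[p] = best[n]          # upstream pair beats the local pair
    return best[root][0]

loudness = {"a": 0.1, "b": 0.9, "c": 0.3, "e": 0.5}
parent = {"a": None, "b": "a", "c": "a", "e": "c"}
order = ["b", "e", "c", "a"]
selected = select_loudest_node(loudness, parent, order)
```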

As already described above, according to an embodiment, in step (g1) of the algorithm described above the sound processing apparatus 100 is configured for each sound source to locally multiply the respective observation with the source filter coefficient to obtain the local source contribution (see 219a, 219b of figure 2). In other words, the filter operator is a local operator that, for each source, converts the observation in each time-frequency bin to a contribution for that source in that time-frequency bin. As mentioned, the conversion of the observation in a time-frequency bin to a contribution for that source in that time-frequency bin consists of the multiplication by a single complex filter coefficient. The filter operator for a particular source aims to increase the prominence of that source in the aggregated signal.

As already described above, according to an embodiment, in step (g2) of the algorithm described above the sound processing apparatus 100 is configured to aggregate, i.e. sum, the local contributions in an upstream direction starting from the leaf nodes 101b,e,f and moving towards the root node 101a (see 221a, 221b of figure 2). In an embodiment, this aggregation is performed for all sources in the transmission set and separately for each source. In other words, the data aggregation operation is a global operation that transfers the contributions of each node to each source signal to the root node 101a. At each branching node of the tree the upstream contributions for each source are added to the local contribution for that source. At the root node 101a, for each source, the sum of the contributions of all the nodes for that source is received. For data blocks that were identified as single-source, only the contributions for that source are received and the root node 101a can set the other source signals to zero. Upon completion of the aggregation, the root node 101a is able to perform an inverse FFT/DFT and overlap-add to obtain the output signals (see 223 of figure 2).
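The upstream aggregation of the locally filtered contributions can be sketched as follows (a two-node tree and short spectra, purely illustrative):

```python
import numpy as np

def aggregate_contributions(filtered, parent, order):
    """Sum the locally filtered spectra leaf-to-root, separately per
    source: at each node the upstream contributions are added to the
    local one, so the root ends up with the full per-source sum."""
    acc = {n: {s: np.asarray(v, dtype=complex) for s, v in srcs.items()}
           for n, srcs in filtered.items()}
    root = None
    for n in order:                    # leaves first, root last
        p = parent.get(n)
        if p is None:
            root = n
        else:
            for s, v in acc[n].items():
                acc[p][s] = acc[p][s] + v   # add upstream to local
    return acc[root]

filtered = {
    "a": {"s1": [1.0, 1.0]},           # locally filtered spectrum at the root
    "b": {"s1": [2.0, 3.0]},           # locally filtered spectrum at a leaf
}
outputs = aggregate_contributions(filtered, {"a": None, "b": "a"}, ["b", "a"])
```

An inverse FFT and overlap-add applied to each entry of the returned dictionary would then yield the time-domain output signals.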

Figure 8 is a flow diagram showing an example of a corresponding BSS sound processing method 800 according to an embodiment of the invention. The method 800 comprises a first step 801 : by each node 101 a-f for each of the one or more sound sources, determining at least one local, frequency-dependent filtering coefficient w * s on the basis of one or more relative transfer functions associated with the respective node and the one or more sound sources and applying the local filtering coefficient w * s to the respective sound signal received by the respective node for obtaining a respective locally filtered sound signal. Moreover, the method 800 comprises a second step 803: by the root node 101 a for each of the one or more sound sources, determining a blind-source separated output sound signal by aggregating the plurality of locally filtered sound signals from each node 101 a-f.

Embodiments of the invention provide a novel approach for performing BSS in wireless networks via a hybrid SCA/ICA approach. According to embodiments of the invention the observed sensor data are not all transmitted to a central node, but instead are aggregated along the tree. This means the overall number of transmissions required to transmit sensor information grows only linearly with the number of nodes. More specifically, the number of transmissions required is N, where N is the number of nodes. In contrast, for conventional (centralized) processing the number of transmissions grows with the square of the number of nodes as observations of each new node must be transmitted over many hops to the query node. The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the invention (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).
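The transmission counts can be illustrated on a worst-case chain topology with the query node at one end (an illustrative assumption; in-network aggregation sends one partial-sum message per tree edge):

```python
def chain_transmissions(n_nodes):
    """Compare transmission counts on a chain of nodes with the query
    node at one end: in-network aggregation forwards one partial-sum
    message per edge, whereas centralized collection relays every raw
    observation over every hop to the query node."""
    aggregated = n_nodes - 1               # one message per tree edge
    centralized = sum(range(1, n_nodes))   # node at hop h costs h relays
    return aggregated, centralized

counts = chain_transmissions(6)   # the six-node network of figure 1, if a chain
```

The centralized count grows quadratically with the number of nodes on such a chain, while the aggregated count stays linear, matching the scaling argument above.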

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.