Title:
METHOD FOR REPRODUCING SOUND SIGNALS AT A FIRST LOCATION FOR A FIRST PARTICIPANT WITHIN A CONFERENCE WITH AT LEAST TWO FURTHER PARTICIPANTS AT AT LEAST ONE FURTHER LOCATION
Document Type and Number:
WIPO Patent Application WO/2017/211447
Kind Code:
A1
Abstract:
The present invention relates to a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location. The object of the present invention is to present such a method that permits a listener at a first location to better differentiate between the participants at the further locations during the conversation. In accordance with the invention, such a method is proposed, wherein the sound signal of each further participant is recorded and reproduced at the first location, and wherein each participant who is not present at the first location is allocated a virtual position at the first location, and the sound signal of the respective participant is reproduced from this virtual position, and wherein each sound signal is reproduced along a principal radiation direction. The method is characterized in that

• in case of one talker, the principal radiation direction of the reproduced sound signal of the respective talker is directed towards the participant at the first location,

• in case of two simultaneous talkers, the principal radiation directions of the sound signals of both talkers are directed away from the participant at the first location, in particular at an angle of more than 40° between the principal radiation directions, and

• in case of more than two simultaneous talkers, the principal radiation direction of the sound signal of at least one talker, in particular of exactly one talker, is directed towards the participant at the first location, and the principal radiation directions of all other sound signals are directed away from the participant at the first location, in particular in different directions.

Inventors:
VALENZUELA CARLOS (DE)
Application Number:
PCT/EP2017/000648
Publication Date:
December 14, 2017
Filing Date:
June 06, 2017
Assignee:
VALENZUELA HOLDING GMBH (DE)
International Classes:
H04L12/18; H04M3/56
Domestic Patent References:
WO2007062840A1 (2007-06-07)
Foreign References:
JPH0410744A (1992-01-14)
US20160105758A1 (2016-04-14)
US20160127846A1 (2016-05-05)
US20130170678A1 (2013-07-04)
Other References:
SHOJI SHIMADA ET AL: "A New Talker Location Recognition Through Sound Image Localization Control in Multipoint Teleconferences System", Electronics & Communications in Japan, Part I - Communications, Wiley, Hoboken, NJ, US, vol. 72, no. 2, 1 February 1989, pages 20-27, XP000124912, ISSN: 8756-6621
Attorney, Agent or Firm:
MAHLER, Peter (DE)
Claims:
CLAIMS

1. Method for reproducing sound signals at a first location LOC1 for a first participant P within a conference with at least two further participants (P1 - P5) at at least one further location (LOC1' - LOC5'), wherein the sound signal of each further participant (P1 - P5) is recorded and is reproduced at the first location LOC1, and wherein each participant (P1 - P5) who is not present at the first location LOC1 is allocated a virtual position (LS1 - LS5) at the first location, and the sound signal of the respective participant is reproduced from this virtual position, and wherein each sound signal is reproduced along a principal radiation direction, characterized in that in case of one talker (S3), the principal radiation direction of the reproduced sound signal of the respective talker is directed towards the participant (P) at the first location, in case of two simultaneous talkers (S1, S3), the principal radiation directions of the sound signals of both talkers are directed away from the participant (P) at the first location, in particular at an angle of more than 40° between the principal radiation directions, and in case of more than two simultaneous talkers (S1, S3, S4), the principal radiation direction of the sound signal of at least one talker, in particular of exactly one talker, is directed towards the participant at the first location, and the principal radiation directions of all other sound signals are directed away from the participant at the first location, in particular in different directions.

2. Method in accordance with claim 1, characterized in that a visual representation (V1 - V5) of at least one participant of one of the further locations, preferably of all participants of the further locations, is arranged at the first location.

3. Method in accordance with claim 2, characterized in that the visual representation (V1 - V5) and the virtual position of the sound source (LS1 - LS5) which reproduces the sound signal of the corresponding participant are correlated with each other.

4. Method in accordance with claim 2 or 3, characterized in that factors concerning the attention of the participant at the first location are detected, and the principal radiation directions of the sound signals, in particular in case of several simultaneous talkers, are controlled at the first location depending on the detected data.

5. Method in accordance with claim 4, characterized in that a viewing direction of the participant at the first location is detected as a factor concerning the participant's attention.

6. Method in accordance with claim 5, characterized in that the sound signal of the talker at a further location, at whose visual representation the viewing direction of the participant at the first location is aimed, is directed towards the participant at the first location.

7. Method in accordance with claim 6, characterized in that the sound signals of all other talkers are directed into other directions, preferably into different directions.

8. Method in accordance with one of claims 1 to 7, characterized in that at a further location at least one visual representation (VP) of the participant of the first location is arranged, preferably visual representations (VP, VP2 - VP5) of all participants of the other locations including the first location, and factors concerning the attention of a talker at this further location are detected and the principal radiation directions of the sound signals, in particular in case of several simultaneous talkers, are controlled at the first location depending on the detected data.

9. Method in accordance with claim 8, characterized in that a viewing direction of the talker is detected as a factor concerning the talker's attention.

10. Method in accordance with claim 9, characterized in that at the first location the principal radiation direction of the sound signal of the talker is controlled such that the sound signal is directed towards the participant at the first location when the attention of the talker is aimed at the representation of the participant of the first location.

11. Method in accordance with one of claims 1 to 10, characterized in that no more than one sound signal is directed towards the participant of the first location, preferably the sound signal of the talker at which the attention of the participant is aimed.

12. Method in accordance with one of claims 1 to 11, characterized in that all sound signals which are not directed towards the participant of the first location are directed in different directions.

Description:
Method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location

[0001] The present invention relates to a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location.

[0002] Telephone or video conferences with several participants at different locations are oftentimes subject to acoustical problems because the quality of the sound reproduction is limited and no natural conversation can be achieved. In the simplest case of a telephone conference via an ordinary telephone set with three participants at three different locations, a listener can only distinguish the two other participants if their voices are sufficiently different. In particular when both participants speak simultaneously, a differentiation between the two talkers is sometimes impossible and misunderstandings are inevitable.

[0003] Systems are known where the reproduced sound sources are separated in space, such that each talker is represented, for example, by a dedicated loudspeaker or by a virtual sound source that may be generated by stereophonic phantom source techniques or Wave-Field-Synthesis. The listening participant, however, will still miss the acoustic cues relating to the varying speaking directions of the talkers, which are important for differentiation, so that distinguishing several simultaneous talkers remains difficult.

[0004] Only a few systems are known that reproduce the speaking direction of a talker, such as the system known from WO 2007/062 840 A1. All of these systems reproduce the voice direction in accordance with the detected facing angle of the talker at the recording site. Such systems, however, can neither be employed in teleconferencing systems, which do not have any video information at all, nor in typical videoconferencing systems, which have video information, but where - due to varying screen configurations at the different locations and the different camera locations - the detected facing angle is not directly correlated with the information about who the talker is talking to.

[0005] The object of the present invention is to present a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location that permits a listener at the first location to better differentiate between the participants at the further locations during the conversation.

[0006] In accordance with the invention, a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants in at least one further location is proposed, wherein the sound signal of each further participant is recorded and reproduced at the first location, and wherein each participant who is not present at the first location is allocated a virtual position at the first location, and the sound signal of the respective participant is reproduced from this virtual position, and wherein each sound signal is reproduced along a principal radiation direction. The term "sound signal" in singular form refers to all the sound signals produced by one participant whenever he is talking. Thus, the term "sound signals" in plural refers to the sound signals produced by different participants. The sound signals shall be recorded separately, but in case of several participants at the same location it is also possible to record the signals jointly and to separate them by appropriate means before they are reproduced.

[0007] The method in accordance with the invention shall be used for recording, transmission and reproduction of sound signals, in particular of spoken signals in the framework of a telephone or video conference. A distinction has to be made between a first location, where a first participant is located who acts as listener, and further locations, where further participants are located who become talkers once they start to speak. The terms listener and talker are consequently used interchangeably for the participants, depending on the context. The further participants at the further locations are hence potential talkers who may speak individually or simultaneously.

[0008] The goal is to enable the listener at the first location to understand the individual talkers as well as possible.

[0009] The described method provides excellent intelligibility for a listener at the first location. It is obvious that in a telephone or videoconference all participants may be talkers and listeners, and that the described method may be applied to any of the different locations in order to achieve the best acoustic quality for all participants.

[0010] Each participant who is not present at the first location is allocated a virtual position at the first location, for example by means of the position of a loudspeaker or the position of a virtual sound source which reproduces the sound signal of the respective participant. Each reproduced sound signal shall have a directivity (herein also called sound signal direction or principal radiation direction of the sound signal) in order to simulate the speaking direction of a talker. A human talker has a distinctive principal radiation direction which corresponds to the facing direction of the talker. The sound signal directivity is generated by reproducing the sound signal with a directional sound source that has a principal radiation direction, i.e. by reproducing the sound signal along the principal radiation direction of the reproducing sound source. Thus, the sound signal directivity (or direction) corresponds to the principal radiation direction. For the purpose of generating a sound signal having a directivity, either a directional loudspeaker may be employed or any known method (see below under "Voice Direction Production Unit") for simulating the principal radiation direction of a sound source may be employed.

[0011] The method in accordance with the present invention is characterized in that

• in case of one talker, the principal radiation direction of the reproduced sound signal of the respective talker is directed towards the participant at the first location,

• in case of two simultaneous talkers, the principal radiation directions of the sound signals of both talkers are directed away from the participant at the first location, in particular at an angle of more than 40° between the principal radiation directions, and

• in case of more than two simultaneous talkers, the principal radiation direction of the sound signal of at least one talker, in particular of exactly one talker, is directed towards the participant at the first location, and the principal radiation directions of all other sound signals are directed away from the participant at the first location, in particular in different directions.

[0012] The sound signals reproduced at the first location have directivity, i.e. a direction which is specified by the principal radiation direction of the reproduced sound signal. The principal radiation direction may change over time. To avoid any misunderstanding, the recorded signal may have directivity as well, but such directivity does not need to be detected and recorded. The directivity of the reproduced sound signal at the first location and the principal radiation direction may be determined independently of the original directivity and direction of emission.

[0013] As long as only one sound signal is emitted at the first location, i.e. as long as only one of the further participants of the other locations is speaking, the principal radiation direction of the reproduced sound signal is directed towards the participant at the first location.

[0014] When two participants are speaking simultaneously, the principal radiation directions of the sound signals of both talkers are directed away from the participant, preferably with an angle of more than 40°, and in particular of more than 90°, between the principal radiation directions. In this way, the listener can distinguish better between the two talkers than if both signals were directed towards the listener.

[0015] In case of more than two simultaneous talkers, the sound signal of at least one talker, preferably of exactly one talker, is directed towards the participant at the first location, and the principal radiation directions of all other sound signals are directed away from the participant. Again, the listener can distinguish better between the different signals than in the case where all signals are directed towards the listener. Preferably, all sound signals which are not directed towards the listener are directed in different directions, advantageously with the largest possible angles between them.

[0016] The preceding method may in particular be used in a telephone conference where no further information, e.g. about the interaction of the participants, is available.
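The base rules of paragraphs [0013] to [0015] can be summarized in a short decision routine. The following Python sketch is illustrative only: the function name, the time-order tie-break for selecting the one talker directed towards the listener, and the concrete angle values (which merely respect the required spread of more than 40°) are assumptions, not specifics of the claimed method.

```python
def set_radiation_directions(talkers_in_time_order):
    """Map each active talker to a principal radiation direction in degrees.

    Convention (cf. Fig. 9): 0 deg points towards the listener; other
    angles point away from the listener. Angle values are illustrative.
    """
    directions = {}
    n = len(talkers_in_time_order)
    if n == 1:
        # One talker: the signal is directed towards the listener.
        directions[talkers_in_time_order[0]] = 0.0
    elif n == 2:
        # Two simultaneous talkers: both away from the listener, with well
        # over 40 deg between the two principal radiation directions.
        directions[talkers_in_time_order[0]] = +90.0
        directions[talkers_in_time_order[1]] = -90.0
    elif n > 2:
        # More than two: exactly one talker (here the earliest one) towards
        # the listener, all others away, in different directions.
        first, *others = talkers_in_time_order
        directions[first] = 0.0
        for i, talker in enumerate(others):
            sign = 1 if i % 2 == 0 else -1
            directions[talker] = sign * (70.0 + 30.0 * (i // 2))
    return directions
```

For three simultaneous talkers this yields, for example, 0° for the first talker in time order and +70° and -70° for the other two.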

[0017] Experiments have shown that the hearing impression of a listener may be enhanced if a visual representation of at least one participant of one further location, and in particular of all participants of all further locations, is arranged at the first location. A visual representation may for example be a simple name tag, a photograph or, preferably, video images of a participant.

[0018] In particular, the position of the visual representation of a participant and the position of the sound source reproducing this participant are correlated with one another.

[0019] The method of the invention may still be enhanced by detecting factors concerning the attention of the participant at the first location and by controlling the principal radiation directions of the sound signals at the first location depending on the detected data.

[0020] Controlling the direction of the sound signals depending on the detected data may result in a deviation from the rules described above, or may lead to a definite setting in case several alternatives for the principal radiation direction of a sound signal are possible. For example, the sound signal of the talker to whom the listener directs his attention may be directed towards the listener, and all other sound signals may be turned away from the listener, preferably in different directions.

[0021] In this way, the listener can influence the reproduction of the sound signals. As a factor concerning the attention of the listener, for example, the viewing direction or facing angle of the listener or any other suitable factor may be used.
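A hedged sketch of how such listener control could override the base rules: if the detected viewing direction falls on the visual representation of an active talker, that talker's signal is turned towards the listener and all other signals away from him. Function and parameter names are illustrative assumptions.

```python
def apply_listener_attention(talkers_in_time_order, gazed_participant):
    """Directions in degrees (0 = towards the listener) under listener control."""
    directions = {}
    others = [t for t in talkers_in_time_order if t != gazed_participant]
    if gazed_participant in talkers_in_time_order:
        # The talker the listener looks at is turned towards the listener.
        directions[gazed_participant] = 0.0
    # All remaining signals are turned away from the listener in different
    # directions; per [0022], this also covers a listener who looks at no
    # talker at all, in which case every signal points away.
    for i, talker in enumerate(others):
        sign = 1 if i % 2 == 0 else -1
        directions[talker] = sign * (70.0 + 30.0 * (i // 2))
    return directions
```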

[0022] It may also be possible to control the principal radiation direction of a sound signal in case of one talker such that the signal is not directed towards the listener but away from the listener when the listener directs his attention elsewhere, e.g. is looking at another participant who is not talking. The same is possible for more than one talker: if the listener is not directing his attention to one of the talkers but elsewhere, none of the signals is directed towards the listener; instead, all signals are directed away from the listener, in particular in different directions.

[0023] Alternatively or in addition, the principal radiation direction of the reproduced sound signals may be influenced by factors at the talkers' end.

[0024] For this purpose, at least one visual representation of the participant of the first location may be arranged at at least one further location, and preferably representations of all participants of the respective other locations shall be arranged at this further location. Factors regarding the attention of a talker may be detected, e.g. the viewing direction or facing angle of the talker. The principal radiation directions of the reproduced sound signals at the first location may then be controlled depending on the detected data.

[0025] The principal radiation direction of the reproduced sound signal of a talker at the first location can e.g. be controlled in a way that it is directed towards the listener at the first location when the attention of the talker is directed at the visual representation of the listener at the respective location.

[0026] The principal radiation direction for one or more talkers can also be controlled in a way that the signals are only directed towards the listener if the talker's attention is focused on the listener, and e.g. in case none of the talkers is looking at the listener, none of the signals is directed towards the listener, even in case of only one talker.

[0027] It is preferable to direct only one sound signal at a time towards the listener at the first location.

[0028] Rules for the principal radiation directions of the reproduced sound signals at the first location may be defined for the case that the previously described factors of attention of the listener and/or talkers lead to contradictory or unfavorable settings. It can e.g. be defined that the control of the principal radiation direction by the listener prevails, i.e. that the sound signal of the talker to whom the attention of the listener is directed is directed towards the listener. It could e.g. also be defined that only up to two signals are directed towards the listener, or that the sound signal of the loudest talker or of the talker who first started to speak is directed towards the listener.

[0029] All sound signals not directed towards the listener can advantageously be directed in different directions.
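Paragraph [0028] leaves the precedence rules open; the sketch below shows one possible rule set (listener control first, then the earliest talker addressing the listener), purely as an assumption for illustration. The "listener" sentinel and all names are invented.

```python
def choose_talker_towards_listener(listener_gaze, talker_targets, start_times):
    """Select the single talker whose signal is directed towards the
    listener when listener and talker attention cues conflict.

    listener_gaze  : id of the talker the listener looks at, or None
    talker_targets : dict talker id -> id of the participant he addresses
    start_times    : dict talker id -> time at which he started talking
    """
    if listener_gaze is not None:
        return listener_gaze              # control by the listener prevails
    addressing_me = [t for t, target in talker_targets.items()
                     if target == "listener"]
    if addressing_me:
        # Among the talkers addressing the listener, take the earliest one.
        return min(addressing_me, key=start_times.get)
    return None                           # no signal towards the listener
```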

[0030] Several embodiments of the invention are described in more detail with reference to the attached drawings which show:

[0031] Figure 1 shows schematically the setting for a telephone conference with enhanced audio performance at one location;

[0032] Figures 2a to 2c show the system of Figure 1 with one, two or three participants speaking;

[0033] Figure 3 shows the system of Figure 1 with additional information at the first location;

[0034] Figure 4 shows the system of Figure 3 in a "control by listener" mode;

[0035] Figure 5 shows the system of Figure 3 with additional video information at one further location;

[0036] Figure 6 shows the system of Figure 5 in a "control by talker" mode;

[0037] Figure 7 shows the system of Figure 5 in a "control by listener" mode;

[0038] Figures 8 to 16 show further examples and embodiments of the present invention.

[0039] Figure 1 shows schematically a system that may be used for a telephone conference.

[0040] Participant P at the first location LOC1 is sitting in front of several loudspeakers LS1 to LS5 which reproduce the sound signals recorded separately from the five further participants P1 to P5, who are attending the conference from different remote locations LOC1' to LOC5'. The loudspeakers LS1 to LS5 may also be virtual sound sources that are generated by e.g. stereophonic phantom source techniques or Wave-Field-Synthesis, or virtual sound sources that are reproduced via headphones by using Head-Related Transfer Functions (HRTF) or other spatialization techniques. Data transmission between the different locations is made by means of a well-known data network NW. The sound signals are recorded at locations LOC1' to LOC5' without capturing the directivity of the sound. The loudspeakers LS1 to LS5, however, have directivity and a principal radiation direction which is controllable by the system.

[0041] Figure 2a shows the setup of figure 1 wherein participant P3, and only participant P3, is talking and becomes hence talker S3. The sound signal of talker S3, but not the speaking direction, is recorded. The sound signal is transmitted to the first location LOC1. The sound signal is reproduced via loudspeaker LS3 which is assigned to participant/talker P3/S3. The reproduced sound signal has directivity and the signal is turned directly towards participant P in the first location LOC1 who becomes listener L.

[0042] In Figure 2b, participants P1 and P3 are both talking, becoming talkers S1 and S3. The sound signals transmitted to the first location LOC1 are reproduced via loudspeakers LS1 and LS3, respectively. The principal radiation directions of both signals are turned away from the listener L, with an angle of more than 40° between the two signals. The listener can better distinguish between the two signals compared to a setting where both signals are directed towards the listener.

[0043] In Figure 2c, participants P1, P3 and P4 are all talking simultaneously, becoming talkers S1, S3 and S4. One of the reproduced sound signals is directed towards the listener L; in the present example it is the sound signal of talker S4, the first of the three talkers to start speaking, reproduced by means of loudspeaker LS4. Both other signals are directed in different directions away from the listener, again with the preferred angle of more than 40° between them.

[0044] Figure 3 shows the setting of Figure 1, wherein additional video information is available. Participant P at the first location LOC1 is sitting in front of video images V1 to V5 of each of the further participants.

[0045] As information regarding the participant's attention, the viewing direction D of participant P is detected.

[0046] The situation of Figure 2c is shown in Figure 4, where participants P1, P3 and P4 are talking. Participant P / listener L at the first location is looking at the video image of participant P1. The signal of participant P1, who is talker S1, is directed to participant P / listener L, since the listener's attention is directed to this talker. Both other sound signals are directed away from the listener.

[0047] Figure 5 shows a setup similar to the one of Figure 3, where information regarding a talker's attention is recorded. The further participants P1 to P5 also have video images of the other participants in front of them; to keep the drawing clear, this is shown schematically for participant P1 only, with video images VP and VP2 to VP5.

[0048] In this example, information concerning participant P1 is recorded, but the same may apply to all other participants. For clarity reasons, this is not represented in the present figure.

[0049] In Figure 6, again participants P1, P3 and P4, i.e. talkers S1, S3 and S4, are talking simultaneously. Participant P1 is looking at the screen with the video image of participant P, i.e. listener L at the first location, while participants P3/S3 and P4/S4 are discussing between the two of them and looking at each other. The sound signal of talker S1 reproduced by loudspeaker LS1 is directed to the listener L, and the signals of talkers S3 and S4, reproduced by loudspeakers LS3 and LS4, are turned away from the listener L in different directions.

[0050] In a "control by listener" mode or in a combined mode where the control by the listener prevails, the listener can influence the reproduction of the sound signals by looking at one of the talkers so that this signal is directed to the listener.

[0051] Rules may be implemented to limit changes in the signal direction. It is e.g. possible to change the direction of the reproduced signals only if the listener L focuses his viewing direction on one talker for a certain time.

[0052] Figure 7 shows a combination of the previously described setups, wherein information regarding both the listener's and the talkers' attention is available.

[0053] As in Figure 6, talker S1 is looking at the video image representing the listener L, and talkers S3 and S4 are looking at each other. The listener is looking at talker S1. Consequently, the sound signal of talker S1 is directed to the listener L.

[0054] However, after a certain time the attention of the listener is drawn to the discussion between talkers S3 and S4. The listener L turns his view towards talker S3. In accordance with the rules underlying the process, the sound signal of talker S3, which is reproduced by loudspeaker LS3, is turned towards the listener, whereas the sound signal of S1 reproduced by LS1 is turned away from the listener.

[0055] Depending on the defined rules, only one or several reproduced sound signals may be turned towards a listener. If only one reproduced sound signal may be directed to the listener, the viewing direction of the listener prevails and the sound signal of a talker looking at the listener is only directed towards the listener if the listener looks at this talker or looks at none of the talkers.

[0056] Further embodiments of the present invention are described in the following sections by way of example. The embodiments shall in no way limit the scope of the invention as described in the whole description and as claimed in the claims.

[0057] The main purpose of the following embodiments of the invention is to improve communication in a videoconference or a teleconference by providing the participants with

• spatially separated audio source positions and

• artificial speaking directions that enhance speech intelligibility.

[0058] In case of a typical videoconference, a sophisticated detection and transformation algorithm is necessary to determine which participant is being addressed by a talker and to transform this information into an artificial speaking direction; such a detection and transformation algorithm is part of the present embodiments of the invention.

[0059] The audio-enhancement system of the present embodiment of the invention differs from the state of the art in that artificial speaking directions are provided which are independent of the facing angle of the corresponding talkers. The technical effect of this difference is improved communication in conferencing systems, which goes beyond simple audio source separation improvements, even when no information about the facing angle of the talkers is available.

[0060] The audio-enhancement system can be operative at one or more receiving sites. This means that the audio-enhancement system is autonomous in the sense that its operation is independent of whether or not other participating sites are employing such a system.

[0061] The audio-enhancement system comprises the following means to provide the virtual audio source positions which are spatially separated:

• a Position Setting Unit, which specifies the spatial position from which a talker shall be perceived at a specific listener location, wherein the specification is based upon either the number of remote participants or the screen configuration, i.e. the specific configuration of the remote participants' representations on the display, at the specific listener location, and

• a Position Production Unit, which produces the virtual sound source at the specified spatial positions at the specific listener location.

[0062] Furthermore, the audio-enhancement system is characterized by the following means to provide artificial speaking directions that enhance speech intelligibility:

• a Voice Direction Setting Unit, which specifies the voice direction of a talker that shall be perceived at a specific listener location, wherein the specification is based upon either the number of remote talkers at the specific listener location, a control input from the listener, or a control input from the talker, and

• a Voice Direction Production Unit, which produces the artificial voice direction at the specific listener location.
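As a reading aid, the four units could be modelled as the following interfaces. This is a sketch under the assumption of a Python implementation; all signatures and type choices are invented for illustration and are not taken from the application.

```python
from typing import Optional, Protocol, Tuple

class PositionSettingUnit(Protocol):
    def specify_position(self, participant_id: str) -> Tuple[float, float]:
        """Azimuth/elevation of the virtual source for this participant."""

class PositionProductionUnit(Protocol):
    def produce_source(self, participant_id: str,
                       position: Tuple[float, float]) -> None:
        """Create the virtual sound source at the specified position."""

class VoiceDirectionSettingUnit(Protocol):
    def specify_direction(self, talker_id: str, mode: str,
                          control_input: Optional[str]) -> float:
        """Voice direction angle in degrees (0 = towards the local listener)."""

class VoiceDirectionProductionUnit(Protocol):
    def produce_direction(self, talker_id: str, angle_deg: float) -> None:
        """Render the artificial voice direction at the listener location."""
```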

[0063] The audio-enhancement system can be operated in different modes. It is noted that the mode of operation can be selected individually at a receiving site, i.e. independently of the other remote sites. The following modes of operation are possible, which have an effect on the Voice Direction Setting Unit:

• "Control by number of talkers"

• "Control by talker"

· "Control by listener"

[0064] Depending on the selected mode, the Voice Direction Setting Unit will specify the voice directions of the talkers based upon the chosen control input. The mode of operation can be selected by the local participant, i.e. the user of the audio-enhancement system, or it can be selected automatically by the system itself. The automatic selection by the system is based upon whether or not visual representations of the remote participants are available at the receiving location. If visual representations are available, the mode "control by talker" is automatically selected by the system. If no visual representations are available, the system automatically selects the mode "control by number of talkers". The mode of operation "control by listener" can only be selected by the user and overrides the automatic mode selection.

[0065] Furthermore, the Position Setting Unit specifies the virtual audio source positions based on the screen configuration of the specific listener location whenever visual representations of the remote participants are present on the screen. If no visual representations are available, the Position Setting Unit automatically switches to specifying the virtual audio source positions based on the number of remote participants or the number of connected sites.
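The selection logic of [0064] and [0065] amounts to a few conditionals. The following sketch uses invented function names; only the decision rules themselves come from the text.

```python
def select_mode(user_choice=None, visuals_available=False):
    """Mode selection per [0064]: an explicit user choice always prevails
    (only the user can select "control by listener"); otherwise the system
    decides based on the available visual representations."""
    if user_choice is not None:
        return user_choice
    return ("control by talker" if visuals_available
            else "control by number of talkers")

def position_basis(visuals_on_screen):
    """Basis used by the Position Setting Unit per [0065]."""
    return ("screen configuration" if visuals_on_screen
            else "number of remote participants")
```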

[0066] The necessary processing for the audio-enhancement system may be implemented in a centralized, in a distributed or in a hybrid manner. In a centralized implementation, the processing of the audio-enhancement system takes place at a central location, such as in the Cloud, a centralized Multipoint Control Unit (MCU), a central server, etc. In a distributed implementation, the necessary processing takes place at the local site that uses the system, i.e. at one or more locations of participants who have the audio-enhancement system implemented at their site. In a hybrid implementation the necessary processing is distributed between a central location and the local site(s) in order to optimize network delays, errors, dropped frames etc.

[0067] In Embodiment 1, communication enhancement is based on control input from the talker.

[0068] In Figure 8A, the voice direction of talker T is identified by the system, and at the remote locations the Voice Direction Setting Unit sets the voice direction of the talker T according to the control input by the talker. The voice direction of the talker T at the remote locations is assigned according to the distribution of the positions of the remote listeners on the screen.

[0069] According to Figure 8B, the voice direction of talker T is identified by the system, and at the remote locations the Voice Direction Setting Unit sets the voice direction of the talker T according to the control input by the talker. Only three possible voice directions per location are used.

[0070] When the audio-enhancement system is operated in the "control by talker" mode, a talker T of a videoconference, as shown in Fig. 8A, can address different remote participants (L1 to L5) which are displayed on his screen according to the screen configuration that he chose. In the example shown in Fig. 8A, the chosen screen configuration shows all remote participants on the left side of the screen, positioned in a vertical row, to leave space on the right side of the screen to show e.g. a presentation. As shown by the arrows from the talker T to the different remote listeners Ln, the talker T can address the different remote listeners Ln individually, either by directing his voice to the person he wants to address, just like he would do in a real meeting, or by manually selecting the person he wants to address.

[0071] The manual selection may be accomplished by selecting the image of the target person on the screen (with a cursor, via touch-screen, by typing the name, or by any other suitable manual selection method that is known in the art). The selection is maintained until the talker selects another target person. The audio-enhancement system transmits the selection, i.e. the object identifier that identifies the "current target listener", as meta-data.

[0072] The selection of a listener by directing the talker's voice to the image of the person he wants to address is accomplished as follows:

[0073] In a first step, the audio-enhancement system detects (a) the visual gazing direction of the talker by means of known gaze tracking techniques (methods for measuring the point where one is looking), and/or (b) the acoustic speaking direction of the talker by means of known acoustically-based tracking techniques (see e.g. WO2007/062840). For the visually-based detection of the gaze direction, the most popular SW-based techniques use 2D-video images from which the eye position or the facing direction is extracted. Other techniques are based on analyzing 3D-camera video images. Any state of the art technique for detecting the visual gazing direction of the talker may be employed. Also head-tracking techniques may be employed to detect the speaking direction of the talker.

[0074] In a second step, the audio-enhancement system transforms the detected visual gazing direction and/or the detected acoustic speaking direction into an object identifier which identifies the person the talker is talking to, i.e. the "current target listener" Lx. The transformation is accomplished by matching the detected visual gazing direction and/or the detected acoustic speaking direction, i.e. the measured point where the talker is looking, with the visual distribution of the remote participant images on the screen of the talker. The visual distribution of the remote participants' images depends on the screen configuration that the talker has chosen.

[0075] In a third step, the audio-enhancement system transmits the object identifier "current target listener" as meta-data.
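The matching step of [0074] can be pictured as a point-in-rectangle test against the talker's screen layout. The layout format (participant id mapped to a bounding box) and all names are assumptions for illustration.

```python
def gaze_to_target(gaze_point, screen_layout):
    """Transform a measured gaze point into the object identifier of the
    'current target listener', or None if the talker looks at no participant.

    gaze_point    : (x, y) in screen coordinates, from gaze tracking
    screen_layout : dict participant id -> (x0, y0, x1, y1) bounding box
    """
    x, y = gaze_point
    for participant_id, (x0, y0, x1, y1) in screen_layout.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return participant_id    # transmitted as meta-data ([0075])
    return None
```

For example, `gaze_to_target((120, 310), {"L1": (0, 250, 200, 400)})` would return "L1".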

[0076] If there is no audio-enhancement system at the talker's site that can detect the control input by the talker, it is still possible to provide a listener at a remote location with the control input from the talker. For this purpose, the audio-enhancement system, which may be implemented either at the listener's site or in a centralized manner, executes the following processing steps:

[0077] In a first step, the audio-enhancement system detects the visual gazing direction of the talker from the video image received at the listener's location or at the central processing location. As mentioned earlier, this is done by means of known gaze tracking techniques (e.g. SW-based techniques using 2D-video images).

[0078] In a second step, the audio-enhancement system transforms the detected visual gazing direction into an object identifier which identifies the person the talker is talking to, i.e. the "current target listener" Lx. For the transformation, some kind of information concerning the screen configuration chosen by the talker has to be available to the audio-enhancement system. The following scenarios to gain that information are possible:

[0079] a. Information about the chosen screen configuration (conference layout and size of screen) at the talker's site is known because it is transmitted to the listener's site. For example, the information about the screen configuration at the talker's site can be transmitted as meta-data from the talker's site, or it is known to a central communication control unit, which organizes the audio and video streams to the connected sites, that also transmits this information as meta-data.

[0080] b. Information about the visual distribution of the remote participants' images on the screen at the talker's site is known based on a calibration process which is performed at all participating locations by a central control unit at the beginning of the meeting.

[0081] c. If no information about the screen configuration or the visual distribution of the remote participants' images on the screen at the talker's site is known, then the following indirect method to acquire the necessary information is employed: during the course of the meeting, the meeting scene at the listener's site is evaluated statistically, using fuzzy logic speech recognition, to determine the most probable distribution of the remote participants' images on the screens of the remote talkers.

[0082] The transformation is accomplished by matching the detected visual gazing direction, i.e. the measured point where the talker is looking, with the acquired information about the chosen screen configuration or about the visual distribution of the remote participants' images on the screen at the talker's site.

[0083] In a third step, the audio-enhancement system transmits the object identifier "current target listener" as meta-data and uses it to set the voice direction of the talker at the listener's site.

[0084] If there are two or more simultaneous talkers, the above explanations referring to the one talker T apply in the same manner to all the other simultaneous talkers. Accordingly, the audio-enhancement system will transmit an object identifier "current target for talker Tx" for each simultaneous talker.

Specification of audio source position based on screen configuration and specification of speaking direction based on input from the talker

[0085] At the talker's site, i.e. at the recording site, a camera captures the image of the talker T and a microphone or microphone array captures the voice of the talker T. The captured voice and image of the talker T are transmitted via a network (e.g. VoIP, Internet, Cloud-based network, telephone network, computer network, or any other kind of communication network) to the remote participants Ln.

[0086] In order to improve communication, the system of the present invention reproduces the voice of the talker T at the remote sites as follows:

[0087] (1) The spatial location from which the voice of the talker is perceived, i.e. the virtual audio source position at a remote site, is mapped to the position of the image of the talker T on the screen at the remote location (which may vary according to the chosen screen configuration), so that the voice of the talker T comes from the location that corresponds to his location on the screen. For example, at the remote location of listener L5 the talker is displayed in the middle of the screen. The system of the invention will therefore reproduce the transmitted voice of the talker T in such a way that the voice appears to originate from the middle of the screen.

[0088] (2) The artificial voice direction (speaking direction) at a remote location, i.e. the production of a directional characteristic of the virtual sound source at a remote location, is mapped according to the control input of the talker, namely in such a way as to reproduce the information about whom the talker is directing his speech to.

[0089] a. At the remote location of the participant who is being addressed by the talker, the voice direction of the talker is set to point to this addressed participant. For example, if the talker is addressing the listener L1 by directing his voice to the image of L1 on his screen, then the talker's voice at the remote location of listener L1 will be reproduced such that the sound source directivity pattern, i.e. the directional characteristic of the reproduced voice of the talker, is directed towards the listener L1.

[0090] b. At all the other remote locations, the directivity pattern of the reproduced voice of the talker is set to point away from the remote listener. Alternatively, the voice direction of the talker may be set in such a way as to correspond approximately (i.e. within perceivable direction variations) to the direction where the addressed participant is positioned on the screen. That means, if the talker is, for example, addressing the listener L1, then the talker's voice at all remote locations but the remote location of L1 will be directed towards the direction in which the image of L1 is displayed at the corresponding remote location. For example, as shown in Fig. 8A, at the remote location of listener L3, the voice of the talker T is directed to the right, where the listener L1 is positioned on the screen configuration of listener L3.

[0091] c. In order to provide the listeners, who are not addressed, with further cues to distinguish at whom the talker is directing his voice, the following alternative is provided by the system: Instead of providing only left and right voice directions according to the 2D representation of the participants on the screen, also voice directions in between the left and right direction, excluding a range of +/- 7° around the direction which points to the local listener position, are used by the system. As shown in Fig. 8A, the directions in between (shown by dashed lines and labeled with the listener who is being addressed by this direction) are assigned according to the distribution of the positions of the remote listeners on the screen. For example, at the location of listener L2, the voice direction that addresses the listener L1 is shown by the dashed arrow with the label L1. The number of such additional voice directions can be limited to the number of separately perceivable voice directions.

[0092] In Fig. 8A and Fig. 8B, the solid arrows pointing to the listener of the respective location represent the voice direction which is set when the listener of that location is being addressed by the talker. All other arrows represent the set voice direction when the listener of the respective location is not being addressed, but any one of the other listeners is being addressed. The labeling of the arrows indicates which listener is the addressed participant.

[0093] In order to simplify the voice direction production, it is possible to use only three different voice directions for any talker per location. Fig. 8B differs from Fig. 8A in that only three possible voice directions are provided at any location by the system:

[0094] If the local participant is being addressed by a talker, the voice of the talker is set to point to the local participant (shown by the solid arrows in Fig. 8B). For example, if the talker T addresses the listener L4, the voice of the talker is set to point to the listener L4 at the remote location of L4.

[0095] If the local participant is not being addressed by a talker, the voice of the talker is set to point away from the remote listener. Depending on where on the screen the person who is being addressed by the talker is displayed, the voice direction of the talker is set to correspond to the direction where the addressed participant is positioned on the screen. If the person is positioned to the right of the talker, the voice direction of the talker will be set to the right (e.g. listener L3 in the remote location of listener L4 in Fig. 8B), and vice versa if the person is positioned to the left of the talker (e.g. listeners L1, L2 and L5 in the remote location of listener L4 in Fig. 8B).

[0096] In the exceptional case that the talker is positioned at either the left or the right edge of the screen (e.g. at the remote locations of listener L3 and of listener L2 in Fig. 8B), instead of providing a left and a right voice direction, only two left or two right voice directions with different angles pointing away from the remote participant are provided (e.g. the left-pointing arrow (L4) and the left-pointing arrow (L3, L5, L1) at the remote location of listener L2 in Fig. 8B), since all displayed participants will be positioned only to one side of the talker.

[0097] If there are two or more simultaneous talkers, the above explanations referring to the one talker T apply in the same manner to all the other simultaneous talkers.
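A minimal sketch of the three-direction scheme of Fig. 8B, assuming image positions given as x-coordinates on the local screen; the function name, parameters and the 60° away-angle are illustrative assumptions.

```python
def three_direction_voice(addressed_is_local, talker_x, addressed_x,
                          away_angle_deg=60.0):
    """Return the voice direction angle (0 = towards the local listener).

    Only three directions are used: towards the local participant, or a
    left/right direction matching the side on which the addressed
    participant's image lies relative to the talker's image."""
    if addressed_is_local:
        return 0.0                   # talker addresses the local participant
    if addressed_x > talker_x:
        return +away_angle_deg       # addressed participant shown to the right
    return -away_angle_deg           # addressed participant shown to the left
```

In the edge case of [0096], where all images lie on one side of the talker, two distinct angles of the same sign would be used instead of a left/right pair.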

[0098] The audio production unit of the system produces the virtual sound source positions and the artificial voice directions at each remote location. The reproduction methods and accuracy may vary between the remote locations depending on such factors as the available hardware at the remote locations, the employed reproduction technology, the preferences of the user, etc.

[0099] For the reproduction of a virtual sound source position any known method may be employed, such as for example: (a) stereophonic sound source reproduction techniques, including normal two-channel stereo systems as well as multi-channel surround sound systems, (b) wave-field synthesis, (c) ambisonics, (d) 2D- and 3D-loudspeaker clusters or loudspeaker arrays (e.g. multi-speaker display systems), or (e) spatial reproduction techniques for headphones.

[0100] The accuracy with which the virtual audio source location is reproduced by the audio reproduction unit of the system can be set to one of the following levels: (a) the perceived position of the virtual sound source matches the image of the corresponding talker in azimuth, elevation and distance, (b) the position matches the image only in azimuth and elevation, and a generic distance which remains the same for any talker is used, or (c) the position matches the image only in azimuth, and a generic distance and elevation which remain the same for any talker are used. The generic elevation would most preferably be chosen such that it matches the middle of the screen in the vertical direction.

[0101] Furthermore, it is often sufficient to provide a limited number of different possible azimuths and/or elevations for different talkers. For example, it may be enough to provide 3 to 5 distinctly perceivable acoustic positions along the azimuth. If more than 3 to 5 remote participants, who are potential talkers, are displayed along the azimuth on the screen, then a dynamic mapping of the current talker or talkers to the closest possible perceivable acoustic position is employed by the system. The same applies to the distribution of perceivable acoustic positions along the elevation, where, for example, it is often sufficient to provide only 1-3 distinctly perceivable acoustic positions.

[0102] The dynamic distribution of currently talking participants to a limited number of perceivable azimuth positions and elevation positions that can be produced by the system is accomplished by the Dynamic Position Distribution Unit as follows: each visual position on the screen is assigned to the closest possible perceivable audio source position that the system can create. If, however, two participants at two visual positions which are assigned to the same audio source position are talking simultaneously, then the talker that started first is mapped to its assigned audio source position, and the second talker, who started talking later (even if only a millisecond later), is assigned to the next closest audio source position that is possible. The same applies if more than two participants are talking simultaneously.
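One possible reading of this first-come, first-served assignment in Python; the data layout (positions as azimuth values, at least as many positions as talkers) and all names are assumptions.

```python
def distribute_positions(talkers_in_onset_order, visual_azimuth, audio_positions):
    """Assign each current talker to the free perceivable audio source
    position closest to his visual position; a later talker whose preferred
    position is already taken moves on to the next closest free one ([0102])."""
    assignment, taken = {}, set()
    for talker in talkers_in_onset_order:          # earliest talker first
        free = [p for p in audio_positions if p not in taken]
        best = min(free, key=lambda p: abs(p - visual_azimuth[talker]))
        assignment[talker] = best
        taken.add(best)
    return assignment
```

With producible positions at -30°, 0° and +30°, two simultaneous talkers whose images both sit near +5° would be assigned 0° (the earlier talker) and +30° (the later one).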

[0103] For the reproduction of an artificial voice direction any known method for simulating the principal radiation direction of a sound source may be employed, such as for example: (a) 2D- and 3D-loudspeaker clusters or loudspeaker arrays (e.g. multi-speaker display systems), (b) wave-field synthesis which either employs monopole synthesis or appropriate directivity filters, (c) directivity reproduction as described in WO2007/062840, or (d) two-channel-based directivity reproduction for loudspeakers or headphones (as disclosed in a parallel patent application), or any other known method.

[0104] The principal radiation direction of a sound source is specified as the main direction of emission, i.e. the direction in which, averaged over all relevant frequencies (e.g. all frequencies below 1 kHz), the strongest sound signal is emitted.

[0105] Figure 9 shows the definition of the voice direction angle of a talker T with respect to the listener L.

[0106] As shown in Fig. 9, the voice direction angle of a talker with respect to the listener is defined as the angle between the connection line "talker to listener" and the arrow representing the voice direction of the talker, with the talker being the vertex. This means that a voice direction with an angle of 0° corresponds to the voice direction pointing at the listener. A voice direction with an angle of +45° corresponds to the voice direction of a talker which points with +45° to the right of the listener as shown in Fig. 9.

[0107] The accuracy with which the artificial voice directions have to be produced depends on the sensitivity of the human ear. It is often sufficient to provide a limited number of different possible voice direction angles for one talker as perceived by the listener. For example, it may be enough to provide 12 to 18 distinctly perceivable voice directions in a range of 360° around the talker, i.e. to provide a voice direction every 20° to 30°.
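Snapping a desired direction to such a limited set of producible directions is a one-line quantization; the step size of 20° to 30° follows [0107], while the function itself is only an illustrative sketch.

```python
def quantize_direction(angle_deg, step_deg=30.0):
    """Snap a desired voice direction to the nearest producible direction."""
    return round(angle_deg / step_deg) * step_deg
```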

[0108] In Embodiment 2, communication enhancement is based on the number of remote participants and the number of remote talkers. In conference settings which do not provide any visual information, as for example in teleconferences, or in which no additional video processing shall be employed, the present invention improves communication by detecting the number of remote participants as well as the number of active remote talkers in order to provide the spatially separated virtual audio source positions and the artificial speaking directions.

[0109] The virtual audio source positions at the site of a specific listener are set as follows:

[0110] If no visual information is available, such as in a teleconference, the Position Setting Unit determines the number of remote participants or the number of connected sites, and assigns one virtual audio source position to each remote participant. The assigned virtual audio source positions remain the same throughout the conversation. Whenever a remote participant becomes a talker, the Position Production Unit produces the voice of the talker from the virtual audio source position that is assigned to this remote participant. If only a limited number of virtual audio source positions is available along the azimuth or elevation, the Dynamic Position Distribution Unit can be used (as described earlier) to dynamically distribute the current talkers to the best possible virtual audio source positions. The possible virtual audio source positions along the azimuth may be evenly distributed on a semicircle in front of the listener or may be distributed according to the minimum audible angle in azimuth, leading to uneven spacing between the virtual audio source positions.

[0111] If visual information is available, such as in a normal videoconference, the Position Setting Unit sets the virtual audio source position for each remote participant to correspond with the position of his image, so that the voice of any talking remote participant comes from the location that corresponds to his location on the screen (in azimuth and elevation). If only a limited number of virtual audio source positions is available along the azimuth or elevation, the Dynamic Position Distribution Unit can be used (as described earlier) to dynamically distribute the current talkers to the best possible virtual audio source positions.

[0112] The artificial voice directions at the site of a specific listener are set based on the number of active remote talkers. The Voice Direction Setting Unit, in a first step, determines the number of active remote talkers while keeping track of which talker was first in time, which second, and so on. Even small time deviations between two talkers that start talking at the same time are used by the system to determine which talker was first and which second. Based on the determined number of active remote talkers and their time order, the voice directions are set as follows:

[0113] As shown in Figure 10, the voice direction of the talker T is set based on the number of talking participants. If only one participant is talking, his voice direction at all remote listener sites is set to point to the listener at the remote site.

[0114] If only one remote participant is talking, i.e. if there is only one active remote talker T, then the voice direction of that talker T is set to point to the listener at the specific site. Fig. 10 shows, for example, that the voice direction of the talker T is set to point to the listener L1 and also to the listener L2, who are both at different locations. This means that both listeners, L1 and L2, who are at different remote sites, will perceive the voice direction of the talker as being directed to them. The same applies to all other remote listeners. This means, if only one remote participant is talking, the voice direction of this talker will be set at all other sites to be pointing to the remote listener. Thus, all remote listeners will perceive the voice direction of the talker as being directed to them, as if the talker were addressing each one of them at the same time. If any of the other remote participants L2, L3, L4 or L5 shown in Fig. 10 becomes the talker T, the voice direction will be pointed, at all remote listener sites, to the corresponding listener, as shown by the dashed arrows.

[0115] Figure 11 shows the setting of the voice direction of two remote participants who are talking simultaneously based on the number of talking participants. [0116] If two remote participants are talking at the same time, the voice direction of each talker will be set to point away from the listener at all remote listener locations. This means, that all remote listeners will perceive the voice directions of both talkers as being directed away from them. For example, in Fig. 11 the voice direction of talker T1 and the voice direction of simultaneous talker T2 are set to point away from the listener L1 in location 1 and the listener L2 in location 2.

[0117] Figure 12 shows the setting of the voice directions of three remote participants who are talking simultaneously, based on the number of talking participants. Talker T1 was the first in time to start talking, T2 the second, and T3 the third.

[0118] If a third remote participant T3 joins the two talkers T1 and T2, the voice direction of the first and second talkers will be set to point to the listener, and the voice direction of the third talker will be set to point away from the listener at all remote listener locations, as shown in Fig. 12. If one of the three talkers stops talking, the voice directions of the remaining two simultaneous talkers will be reset to point away from the listener again at all remote listener locations (see Fig. 11).

[0119] Figure 13 shows the setting of the voice directions of multiple remote participants who are talking simultaneously, based on the number of talking participants. Talker T1 was the first in time to start talking, T2 the second, T3 the third, and so on.

[0120] If a fourth, fifth, etc. remote participant joins the three talkers T1, T2, and T3, the voice direction of the fourth, fifth, etc. simultaneous talker will be set to point away from the listener at all remote listener locations, as shown in Fig. 13. If one of the first two talkers stops talking, the third talker becomes the second talker, and its voice direction is thus changed at all remote listener locations from pointing away from the listener to pointing to the listener. Note that a participant who is talking is always at the same time also a listener whenever other participants are talking.

[0121] Figure 14 shows an alternative for setting the voice direction of multiple remote participants who are talking simultaneously, based on the number of talking participants. Preferred voice directions are represented in the following table:

    Simultaneous talker    Example 1         Example 2
    T3                     +20° or -20°      +30° or -30°
    T4                     +40° or -40°      +60° or -60°

[0122] In order to further enhance the communication, especially when more than three participants are talking at the same time, the system provides the following alternative for setting the voice directions of the third, fourth, fifth, etc. simultaneous talkers at all remote listener locations:

[0123] As shown in Fig. 14, the voice direction of the third simultaneous talker is set to point away from the listener by an angle that is smaller than the angle chosen for the fourth simultaneous talker. The angle chosen for the fourth talker is smaller than the angle chosen for the fifth talker, and so on.

[0124] The table above shows two possible examples for voice direction angles that fulfill this requirement. The middle column shows an example where the angle of talker T3 is chosen to be either +20° or -20°. Accordingly, the angle of talker T4 is chosen to be larger, namely either +40° or -40°. In the second example, the angle of talker T3 is chosen to be either +30° or -30°, and the angle of talker T4 is therefore chosen to be either +60° or -60°. It does not matter whether the positive or the negative angle is chosen, i.e. talker T3 may take either sign, and talker T4 may likewise take either sign. For more than six talkers, the voice direction of the seventh, eighth, etc. simultaneous talker is set to +90° or -90°.
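The count-based rules of paragraphs [0114] to [0124] can be summarised in a few lines. The Python sketch below is a non-authoritative reading of those paragraphs that folds the basic rule (Figs. 10 to 13) together with the escalating-angle alternative of Fig. 14: 0 degrees stands for "pointing at the listener", the ±30° away-angles for two simultaneous talkers are an assumed illustrative value (the text only requires both to point away), the 20° step follows the first example in the table above, and the sign alternation is one arbitrary choice among those the text permits.

    def set_voice_directions(active_talkers):
        """active_talkers: talker ids ordered by onset time (earliest
        first), as determined by the Voice Direction Setting Unit.
        Returns {talker_id: angle in deg}, where 0 points at the
        listener and larger magnitudes point further away (cap: 90)."""
        if not active_talkers:
            return {}
        if len(active_talkers) == 1:
            # One talker: direction points to the listener (Fig. 10).
            return {active_talkers[0]: 0.0}
        if len(active_talkers) == 2:
            # Two simultaneous talkers: both point away from the
            # listener, on opposite sides (Fig. 11); +/-30 deg is an
            # illustrative value, not taken from the text.
            return {active_talkers[0]: 30.0, active_talkers[1]: -30.0}
        # Three or more: the first two (in time order) point to the
        # listener; the third, fourth, etc. point away by increasing
        # angles +/-20, +/-40, ... capped at +/-90 (Figs. 12 to 14).
        directions = {active_talkers[0]: 0.0, active_talkers[1]: 0.0}
        for k, talker in enumerate(active_talkers[2:]):
            angle = min(20.0 * (k + 1), 90.0)
            sign = 1.0 if k % 2 == 0 else -1.0
            directions[talker] = sign * angle
        return directions

    # Example: five simultaneous talkers, T1 started first
    print(set_voice_directions(["T1", "T2", "T3", "T4", "T5"]))
    # {'T1': 0.0, 'T2': 0.0, 'T3': 20.0, 'T4': -40.0, 'T5': 60.0}

When a talker stops or joins, re-evaluating the function with the updated ordered list reproduces the transitions described in paragraphs [0118] and [0120], e.g. the third talker moving from pointing away to pointing at the listener.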

[0125] The above algorithm for specifying the speaking direction based on the number of simultaneous remote talkers may also be employed in an alternative embodiment of the mode "control by talker" whenever two or more simultaneous talkers are addressing the same listener. Alternatively, it is also possible to apply only the algorithm rules mentioned for three or more simultaneous remote talkers in the mode "control by talker" whenever three or more simultaneous talkers are addressing the same listener.
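One possible reading of the first variant, reusing the set_voice_directions sketch above, is shown below. All names are hypothetical, and the paragraph leaves open how talkers addressing other listeners are handled; this sketch simply keeps their talker-controlled directions.

    def directions_control_by_talker(addressing, listener, requested):
        """addressing: {talker_id: listener_id being addressed}, in onset
        order (Python dicts preserve insertion order); requested: the
        voice directions chosen by the talkers themselves ('control by
        talker'). Falls back to the count-based rule when two or more
        simultaneous talkers address the same listener."""
        mine = [t for t, target in addressing.items() if target == listener]
        directions = dict(requested)
        if len(mine) >= 2:
            directions.update(set_voice_directions(mine))
        return directions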

[0126] In Embodiment 3, communication enhancement is based on control input from the listener. In situations in which more than one remote participant is talking, the system of the present invention provides the listener at any remote location with the option to choose for himself which talker he wants to have enhanced, in order to hear that talker better than all the other talkers. This mode of operation is called "control by listener" and can be selected individually at any remote location.

[0127] The system provides this option not only when multiple simultaneous talkers are present, but also in situations where only one talker is talking. A listener who is not being addressed by the talker then has the option to enhance the talker's voice by choosing the "control by listener" mode, thus forcing the Voice Direction Setting Unit to set the voice direction of the talker to point to himself, independent of any control input by the talker or the number of talkers.

[0128] In the "control by listener" mode, the listener is allowed to select a talker as a preferred talker. The selection can be made by any one of the following input methods:

[0129] Manual selection of a talker: The preferred talker is selected by a manual input from the listener, for example by selecting the video image of the talker, or by selecting an avatar or other visible representation of the talker, whereby the selection may be accomplished with a touch screen, by pointing the cursor to the selected talker, etc.

[0130] Visual selection of a talker: The preferred talker is selected by detecting which talker the listener is gazing at. For this purpose, the viewing direction of the listener is determined by means of gaze-tracking. Whenever the listener focuses his gaze within a given spatial range for a preset time span (e.g. 3-5 seconds regardless of the number of simultaneous talkers, or, alternatively, 3-5 seconds for one talker, 3-5 seconds for two simultaneous talkers, and 5-10 seconds for three or more simultaneous talkers), this spatial range is interpreted as the facing angle of attention. The detected facing angle of attention is then translated into the information at which remote participant the listener is focusing his attention. This translation is accomplished by correlating the detected facing angle of attention with information about the chosen screen configuration, i.e. with information about the spatial positions of the remote participants on the screen.

[0131] Head-tracking selection of a talker: The preferred talker is selected by tracking the head orientation of the listener and correlating the detected head orientation with the virtual audio source positions to determine at which remote talker the listener is focusing his attention. This method is especially useful for teleconferences where no video or other visual representations of the remote participants are provided to the listener.
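The dwell-time logic described in paragraph [0130] (and, equally, the head-tracking variant of paragraph [0131]) reduces to a small state machine. The sketch below assumes a hypothetical upstream stage that has already translated the gaze or head angle into the id of the talker currently looked at; the 4-second dwell time is one value from the 3-5 second range mentioned above.

    import time

    class PreferredTalkerSelector:
        """Selects a preferred talker once the listener's gaze (or head
        orientation) has stayed on the same talker for dwell_s seconds."""

        def __init__(self, dwell_s=4.0):
            self.dwell_s = dwell_s
            self._candidate = None   # talker currently looked at
            self._since = None       # when the gaze settled on that talker

        def update(self, looked_at, now=None):
            """looked_at: id of the talker the listener currently gazes
            at, or None. Returns the selected talker id once the dwell
            time is reached, otherwise None."""
            now = time.monotonic() if now is None else now
            if looked_at != self._candidate:
                # gaze moved to a different talker: restart the timer
                self._candidate, self._since = looked_at, now
                return None
            if looked_at is not None and now - self._since >= self.dwell_s:
                return looked_at
            return None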

[0132] For the purpose of enhancing communication based on the control input from the listener, the Voice Direction Setting Unit detects the control input from the listener, determines which talker is selected by the listener, and adjusts the voice directions of all the talkers as follows:

[0133] The voice direction of the selected talker is set to point to the listener.

[0134] The voice directions of all other talkers are set to point away from the listener.

[0135] Alternatively, the voice directions of all other talkers may be set to point away from the listener in such a way that the voice direction of each non-selected talker has a different voice direction angle with respect to the listener.
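Paragraphs [0133] to [0135] amount to the following small rule. In this sketch the concrete away-angles are illustrative assumptions, chosen only so that every non-selected talker receives a distinct direction as required by paragraph [0135].

    def directions_for_listener(selected, talkers):
        """selected: id of the preferred talker; talkers: all active
        talkers. The selected talker points to the listener (0 deg);
        all others point away, each at a distinct angle."""
        directions = {selected: 0.0}
        others = [t for t in talkers if t != selected]
        for k, talker in enumerate(others):
            sign = 1.0 if k % 2 == 0 else -1.0
            magnitude = min(30.0 + 20.0 * (k // 2), 90.0)
            directions[talker] = sign * magnitude
        return directions

    # Example: the listener has selected T4 out of five simultaneous talkers
    print(directions_for_listener("T4", ["T1", "T2", "T3", "T4", "T5"]))
    # {'T4': 0.0, 'T1': 30.0, 'T2': -30.0, 'T3': 50.0, 'T5': -50.0}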

[0136] In an alternative embodiment, the listener may select two preferred talkers, whose voice directions are then set by the Voice Direction Setting Unit to point to the listener. The selection of the two preferred talkers is accomplished by (a) a manual selection of both talkers, (b) a visual selection of one talker and a manual selection of the other talker, or (c) a head-tracking selection of one talker and a manual selection of the other talker.

[0137] Specification of audio source position based on screen configuration and specification of speaking direction based on input from listener

[0138] Figure 15 shows the setting of the voice directions of multiple simultaneous talkers based on the input of the listener.

[0139] Fig. 15 shows an example of a videoconference setup in which two listeners L1 and L2, who are at different locations, have chosen the option "control by listener" to control the voice direction of a preferred talker out of a multitude of simultaneous talkers.

[0140] The virtual sound source positions of the five simultaneous talkers T1-T5 are set by the Position Setting Unit to correspond with the images in the chosen screen configuration. At location 1, the listener L1 has chosen a screen configuration that displays all remote participants in a horizontal line next to each other. The starting points of the vectors representing the voice direction of each remote talker are mapped to the corresponding positions on the screen. At location 2, the listener has chosen a different screen configuration, which arranges the remote participants in both the horizontal and the vertical direction. Again, the virtual sound source positions are set by the Position Setting Unit to correspond with the images on the screen. One plausible mapping is sketched below.
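The text does not spell out how the Position Setting Unit converts an image position into a virtual source position; the following is one plausible linear mapping. The pixel-coordinate convention (origin at the top-left corner) and the assumed angular extent of the screen (±45° azimuth, ±15° elevation) are illustrative assumptions only.

    def screen_to_virtual_position(x_px, y_px, screen_w, screen_h,
                                   az_range=(-45.0, 45.0),
                                   el_range=(-15.0, 15.0)):
        """Map the centre of a participant's image (pixel coordinates)
        to a virtual audio source direction (azimuth, elevation) in
        degrees, matching azimuth and elevation to the screen layout."""
        azimuth = az_range[0] + (x_px / screen_w) * (az_range[1] - az_range[0])
        # pixel y grows downwards, so invert it for elevation
        elevation = el_range[1] - (y_px / screen_h) * (el_range[1] - el_range[0])
        return azimuth, elevation

    # Example: image centred in the upper-right area of a 1920x1080 screen
    print(screen_to_virtual_position(1440, 270, 1920, 1080))
    # (22.5, 7.5)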

[0141] In this example, the control input by the listeners L1 and L2 is accomplished with the visual selection method explained earlier. The dashed arrows at location 1 and location 2 show which remote participant the respective listener is gazing at. Listener L1 is gazing at talker T4; listener L2 is focusing his attention on talker T3. If the listener's gaze stays focused on a specific talker for longer than a preset time span, e.g. on talker T4 at location 1 or on talker T3 at location 2, then this talker is selected as the preferred talker at the respective location, and the voice direction of the preferred talker is set to point to the listener. As shown in Fig. 15, the Voice Direction Setting Unit at location 1 sets the voice direction of talker T4 to point to the listener L1, and at location 2 the voice direction of talker T3 is set to point to the listener L2. The voice directions of all other talkers are set to point away from the respective listener L1 or L2. As shown in Fig. 15, the voice direction angles of all other talkers at each location are chosen to be different.

[0142] Figure 16 shows the setting of the voice direction of the remote talker, who is the only currently talking participant, based on the input of the listener at location 1.

[0143] Fig. 16 shows an example of a videoconference where only one remote participant is talking and where one remote listener, who is not being addressed by the talker, has chosen the "control by listener" mode in order to hear the talking participant better. The talker T is the only participant currently talking. He is addressing listener L4, who is at location 4. Listener L1 at location 1 is focusing his attention on the talker T by gazing at the image of talker T (shown by the dashed arrow pointing from the listener L1 to the talker T).

[0144] At location 4, the mode of operation is set to "control by talker". Accordingly, the voice direction of the talker T is set to point to the listener L4 at this location. At location 1, the voice direction of the talker T would point towards the listener L4, as shown by the dashed arrow pointing towards L4, if the mode of operation "control by talker" were selected. However, because the listener L1 has selected the mode "control by listener", the Voice Direction Setting Unit rotates the voice direction vector of the talker T towards the listener L1 (shown by the dashed curved arrow) and sets it to point towards the listener L1 (shown by the solid voice direction vector pointing to the listener L1).
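The precedence illustrated by Fig. 16 can be stated compactly: at a site running "control by listener", the listener's selection overrides whatever direction "control by talker" would have produced. A minimal sketch, with all names hypothetical:

    def effective_voice_direction(mode, preferred_talker, talker,
                                  talker_controlled_angle):
        """Direction of a talker's voice at one site. In 'control by
        listener' mode the selected talker is rotated to point at this
        site's listener (0 deg); otherwise the talker-controlled
        direction is kept (as at location 4 in Fig. 16)."""
        if mode == "control by listener" and talker == preferred_talker:
            return 0.0
        return talker_controlled_angle

    # Location 1: listener L1 has selected talker T, so T points at L1
    print(effective_voice_direction("control by listener", "T", "T", 40.0))
    # 0.0
    # A site in 'control by talker' mode keeps the requested direction
    print(effective_voice_direction("control by talker", None, "T", 40.0))
    # 40.0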

[0145] Remarks

[0146] The following expressions are used synonymously: voice direction, speaking direction, principal radiation direction of the sound signal, sound signal directivity, sound signal direction, talker's or speaker's orientation, talker's facing or gazing direction, talker's facing or gazing angle, sound source orientation, voice directivity, principal radiation direction of the talker or the sound source, directivity.

[0147] Any listener can also be a talker, and any participant can be a listener and a talker at the same time.