

Title:
METHODS, SYSTEM AND COMMUNICATION DEVICE FOR HANDLING DIGITALLY REPRESENTED SPEECH FROM USERS INVOLVED IN A TELECONFERENCE
Document Type and Number:
WIPO Patent Application WO/2022/008075
Kind Code:
A1
Abstract:
Disclosed is a method performed by a system (120) of a communication network (100) for handling digitally represented speech from users involved in a teleconference. The method comprises obtaining digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (101, 102, 103) connected to the teleconference; and determining conversation discussions for the digital representations of speech based on speech analysis. The method further comprises determining gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device (101) based on conversation discussion preferences of a user of the first communication device (101) and the determined conversation discussions of the digital representations of speech, controlling the digital representations of speech based on the determined gain control values, and sending the controlled digital representations of speech to the first communication device (101) whereby the first communication device is able to play back the digital representations of speech according to the gain control values.

Inventors:
ARNGREN TOMMY (SE)
ÖKVIST PETER (SE)
FALK TOMMY (SE)
KRISTENSSON ANDREAS (SE)
Application Number:
PCT/EP2020/069579
Publication Date:
January 13, 2022
Filing Date:
July 10, 2020
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
H04M3/56
Domestic Patent References:
WO2016205296A1 (2016-12-22)
WO2016126819A1 (2016-08-11)
Attorney, Agent or Firm:
ERICSSON (SE)
Claims:
CLAIMS

1. A method performed by a system (120) of a communication network (100) for handling digitally represented speech from users involved in a teleconference, the method comprising: obtaining (202) digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (101, 102, 103) connected to the teleconference; determining (204) conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech; determining (212) gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device (101) based on conversation discussion preferences of a user of the first communication device (101) and the determined conversation discussions of the digital representations of speech; controlling (214) the digital representations of speech based on the determined gain control values; and sending (216) the controlled digital representations of speech to the first communication device (101), whereby the first communication device is able to play back the digital representations of speech according to the gain control values.

2. Method according to claim 1, wherein the determining (204) of conversation discussions for the received digital representations of speech comprises determining one or more features of each speech and grouping the digital representations of speech in one or more conversation discussions based on the one or more features.

3. Method according to claim 2, wherein the one or more features comprises conversation topics.

4. Method according to any of claims 1-3, further comprising: sending (205), to the first communication device (101), information on the determined conversation discussions for the received digital representations of speech, and receiving (206), from the first communication device, in response to the sent information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device (101).

5. Method according to any of the preceding claims, further comprising: receiving (207) user metadata of users of one or more of the plurality of communication devices, the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and conversation topic preferences of the user, wherein the determining (212) of gain control values is performed based on the received user metadata.

6. Method according to any of the preceding claims, wherein the determined (212) gain control values for the digital representations of speech are dependent on a priority of the conversation discussions, so that the digital representations of speech determined to be involved in a high-priority conversation discussion receive higher gain control values than the digital representations of speech determined to be involved in a conversation discussion having a lower priority than the high-priority conversation discussion.

7. Method according to any of the preceding claims, further comprising: determining (208) a first group and a second group of the plurality of communication devices based on the determined (204) conversation discussions for the received digital representations of speech, and sending (209), to the first communication device (101), information on the determined first and second group of communication devices.

8. Method according to claim 7, further comprising: receiving (210), from the first communication device, information that the first communication device wants to join the first group of communication devices, wherein the determining (212) of gain control values is performed based on the received information that the first communication device wants to join the first group of communication devices.

9. A method performed by a first communication device (101) connected to a teleconference provided by a communication network (100), for handling digitally represented speech from users involved in the teleconference, the method comprising: receiving (302), from a system (120) of the communication network, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (102, 103) connected to the teleconference, the digital representations of speech being individually gain controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device (101) and on conversation discussions of the digital representations of speech, and playing back (308), on a user interface, the received digital representations of speech.

10. Method according to claim 9, further comprising: receiving (303), from the system (120), information on the conversation discussions for the received digital representations of speech, and sending (304), to the system (120), in response to the received information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device.

11. Method according to claim 9 or 10, further comprising: sending (301), to the system (120), user metadata of the user of the first communication device (101), the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and the conversation discussion preferences of the user.

12. Method according to any of claims 9-11, wherein the teleconference is illustrated as a virtual meeting on a user interface of the first communication device (101), the method comprising: receiving, from the system (120), information on determined first and second groups of the plurality of communication devices (102, 103), the first and second groups being formed based on the conversation discussions for the digital representations of speech of the plurality of communication devices, and presenting, on the user interface of the first communication device, the virtual meeting so that the first group of communication devices is positioned in a first area on the user interface and the second group of communication devices is positioned in a second area on the user interface.

13. Method according to claim 12, further comprising: when receiving input from the user that its avatar is moved on the screen towards the first group, sending information to the system (120) that the first communication device (101) wants to join the first group, and receiving from the system (120) updated digital representations of speech gain-controlled with higher gain control values for the second communication devices (102, 103) of the first group than before the update.

14. Method according to claim 12 or 13, further comprising selecting the first group as the group to join for the user of the first communication device, based on user input, and giving the first group more focus on the user interface of the first communication device than the second group, based on the selection.

15. Method according to any of claims 9-14, further comprising: receiving (305), on a user interface of the first communication device (101), input on updated gain control values for individual ones of the digital representations of speech, sending (306), to the system (120), the updated gain control values, and receiving (307), from the system (120), updated individually gain-controlled digital representations of speech, based on the sent updated gain control values.

16. A system (120) operable in a communication network (100) for handling digitally represented speech from users involved in a teleconference, the system (120) comprising a processing circuitry (603) and a memory (604), said memory containing instructions executable by said processing circuitry, whereby the system (120) is operative for: obtaining digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (101, 102, 103) connected to the teleconference; determining conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech; determining gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device (101) based on conversation discussion preferences of a user of the first communication device (101) and the determined conversation discussions of the digital representations of speech, controlling the digital representations of speech based on the determined gain control values, and sending the controlled digital representations of speech to the first communication device (101) whereby the first communication device is able to play back the digital representations of speech according to the gain control values.

17. System (120) according to claim 16, operative for the determining of conversation discussions for the received digital representations of speech by determining one or more features of each speech and grouping the digital representations of speech in one or more conversation discussions based on the one or more features.

18. System (120) according to claim 17, wherein the one or more features comprises conversation topics.

19. System (120) according to any of claims 16-18, further being operative for: sending, to the first communication device (101), information on the determined conversation discussions for the received digital representations of speech, and receiving, from the first communication device, in response to the sent information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device (101).

20. System (120) according to any of claims 16-19, further being operative for: receiving user metadata of users of one or more of the plurality of communication devices, the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and conversation topic preferences of the user, and wherein the system is operative for the determining of gain control values based on the received user metadata.

21. System (120) according to any of claims 16-20, operative for determining the gain control values for the digital representations of speech dependent on a priority of the conversation discussions, so that the digital representations of speech determined to be involved in a high-priority conversation discussion receive higher gain control values than the digital representations of speech determined to be involved in a conversation discussion having a lower priority than the high-priority conversation discussion.

22. System (120) according to any of claims 16-21, further being operative for: determining a first group and a second group of the plurality of communication devices based on the determined conversation discussions for the received digital representations of speech, and sending, to the first communication device (101), information on the determined first and second group of communication devices.

23. System (120) according to any of claims 16-22, further being operative for: receiving, from the first communication device, information that the first communication device wants to join the first group of communication devices, wherein the system is operative for the determining of gain control values based on the received information that the first communication device wants to join the first group of communication devices.

24. A first communication device (101) operable to be connected to a teleconference provided by a communication network (100), the first communication device being arranged for handling digitally represented speech from users involved in the teleconference, the first communication device (101) comprising a processing circuitry (703) and a memory (704), said memory containing instructions executable by said processing circuitry, whereby the first communication device (101) is operative for: receiving, from a system (120) of the communication network, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (102, 103) connected to the teleconference, the digital representations of speech being individually gain controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device (101) and on conversation discussions of the digital representations of speech, and playing back, on a user interface (706), the received digital representations of speech.

25. First communication device (101) according to claim 24, further being operative for: receiving, from the system (120), information on the conversation discussions for the received digital representations of speech, and sending, to the system (120), in response to the received information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device.

26. First communication device (101) according to claim 24 or 25, further being operative for: sending, to the system (120), user metadata of the user of the first communication device (101), the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and the conversation discussion preferences of the user.

27. First communication device (101) according to any of claims 24-26, wherein the teleconference is illustrated as a virtual meeting on a user interface (706) of the first communication device (101), the first communication device (101) further being operative for: receiving, from the system (120), information on determined first and second groups of the plurality of communication devices (102, 103), the first and second groups being formed based on the conversation discussions for the digital representations of speech of the plurality of communication devices, and presenting, on the user interface (706) of the first communication device, the virtual meeting so that the first group of communication devices is positioned in a first area on the user interface and the second group of communication devices is positioned in a second area on the user interface.

28. First communication device (101) according to claim 27, further being operative for: when receiving input from the user that its avatar is moved on the user interface (706) towards the first group, sending information to the system (120) that the first communication device (101) wants to join the first group, and receiving, from the system (120), updated digital representations of speech gain-controlled with higher gain control values for the second communication devices (102, 103) of the first group than before the update.

29. First communication device (101) according to claim 27 or 28, further being operative for: selecting the first group as the group to join for the user of the first communication device, based on user input, and giving the first group more focus on the user interface (706) of the first communication device than the second group, based on the selection.

30. First communication device (101) according to any of claims 24-29, further being operative for: receiving, on a user interface (706) of the first communication device (101), input on updated gain control values for individual ones of the digital representations of speech, sending, to the system (120), the updated gain control values, and receiving, from the system (120), updated individually gain-controlled digital representations of speech, based on the sent updated gain control values.

31. A computer program (605) comprising instructions, which, when executed by at least one processing circuitry of a system (120) of a communication network (100), configured for handling digitally represented speech from users involved in a teleconference, causes the system (120) to perform the following steps: obtaining digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (101, 102, 103) connected to the teleconference; determining conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech; determining gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device (101) based on conversation discussion preferences of a user of the first communication device (101) and the determined conversation discussions of the digital representations of speech, controlling the digital representations of speech based on the determined gain control values, and sending the controlled digital representations of speech to the first communication device (101) whereby the first communication device is able to play back the digital representations of speech according to the gain control values.

32. A carrier containing the computer program (605) according to claim 31, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, an electric signal or a computer readable storage medium.

33. A computer program (705) comprising instructions, which, when executed by at least one processing circuitry of a first communication device (101) operable in a communication network, configured for handling digitally represented speech from users involved in a teleconference, causes the first communication device (101) to perform the following steps: receiving, from a system (120) of the communication network, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices (102, 103) connected to the teleconference, the digital representations of speech being individually gain controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device (101) and on conversation discussions of the digital representations of speech, and playing back, on a user interface, the received digital representations of speech.

34. A carrier containing the computer program (705) according to claim 33, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, an electric signal or a computer readable storage medium.

Description:
METHODS, SYSTEM AND COMMUNICATION DEVICE FOR HANDLING DIGITALLY REPRESENTED SPEECH FROM USERS INVOLVED IN A TELECONFERENCE

Technical Field

[0001] The present disclosure relates generally to methods, systems, and communication devices for handling digitally represented speech from users involved in a teleconference. The present disclosure further relates to computer programs and carriers corresponding to the above methods, systems, and devices.

Background

[0002] Many meetings today are held remotely via teleconference solutions. “Teleconference” in this disclosure also comprises video conference. Several different video conferencing solutions exist today, such as Microsoft® Teams™, Skype®, Zoom® etc. They all offer ways to have virtual meetings that are fully distributed, or that connect groups of participants in conference rooms equipped with cameras, screens and microphones, or combinations thereof.

[0003] Further, the cocktail party effect is well-known, i.e. the phenomenon of the human brain's ability to focus auditory attention on a particular stimulus while filtering out a range of other stimuli, which allows humans to sort out interesting discussions in a noisy room.

[0004] The video conferencing systems of today make use of advanced techniques in the fields of audio processing, video and image processing and Artificial Intelligence (AI). Examples of audio processing that are common today are acoustic echo cancellation, noise suppression, dynamic range control and automatic gain control. This processing is typically needed to improve intelligibility and provide a consistent speech level when users of the video conference system are in noisy environments and/or are using equipment of different quality. However, even with this audio processing, the audio experience is often not good enough for the users to follow a discussion without difficulties. This might be due to excessive noise, room reverberation, echoes, or the problem of making out individual voices when several persons are speaking at the same time.
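
The automatic gain control mentioned above can be illustrated with a minimal sketch; the target RMS level and smoothing factor below are illustrative assumptions rather than values taken from this disclosure.

```python
# Minimal automatic gain control (AGC) sketch: scale each frame of
# speech samples toward a target RMS level, smoothing the gain over
# time to avoid audible jumps. The target level and smoothing factor
# are illustrative assumptions, not values from the disclosure.
import math

def agc(frames, target_rms=0.1, smoothing=0.9):
    gain = 1.0
    out = []
    for frame in frames:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        # Gain that would bring this frame exactly to the target level
        desired = target_rms / rms if rms > 1e-9 else gain
        # Smooth the gain so it changes gradually between frames
        gain = smoothing * gain + (1.0 - smoothing) * desired
        out.append([s * gain for s in frame])
    return out
```

A higher smoothing factor reacts more slowly, trading responsiveness for fewer audible gain jumps.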

[0005] Today’s video and phone conference systems may use floor control or participant prioritization in order to avoid participants talking at the same time.

[0006] It is known today to make use of image processing in different visual applications, using techniques such as face recognition, face detection and gaze detection.

[0007] Also, there are systems that can interpret spoken words using e.g. Natural Language Processing (NLP) or Automatic Speech Recognition (ASR) and associate who says what in a teleconference. In a more general and related context, Amazon® Transcribe™, with its underlying speech-to-text interpretation, and Amazon® Comprehend™ make use of natural language processing to classify the language of a conversation or text; extract key phrases, places, people, brands, or events; understand how positive or negative a text is; analyze text using tokenization and parts of speech; and automatically organize a collection of text files by topic.

[0008] From that, typical capabilities of today’s state-of-the-art language understanding tools may provide further classifications on:

• Identify different speakers in an audio clip/session, i.e. speaker diarization or speaker identification, where each fragment can be labeled with the speaker that is identified;

• Dominant language; examine speech/text to determine the dominant language;

• Entities; detect textual references to the names of people, places, and items as well as references to dates and quantities;

• Key phrases; find key phrases such as "good morning"; for example, a discussion or transcribed document about a basketball game might return the names of the teams, the name of the venue, and the final score;

• Sentiment; determine a dominant sentiment, i.e. feeling, sense, sensation, emotion; and

• Topic modeling; determine common themes and conversation topics.

[0009] With topic modeling, Amazon® Comprehend™ may, given a set of e.g. news articles or transcribed discussions, determine subjects such as sports, politics, or entertainment, and more specifically also who in a discussion is discussing what with whom.
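
The topic-modeling capability described above suggests how speech from different speakers could be grouped into conversation discussions. The keyword-overlap similarity below is a deliberate simplification standing in for a full topic model such as the NLP tools mentioned above; the function name and threshold are hypothetical.

```python
# Sketch: group speakers into conversation discussions by comparing
# keyword sets extracted from their transcribed speech. A real system
# would use NLP-based topic modeling; plain keyword overlap is an
# illustrative simplification, and min_overlap is an assumed threshold.

def group_by_topic(transcripts, min_overlap=2):
    """transcripts: dict mapping speaker -> set of extracted keywords.
    Returns a list of speaker groups (conversation discussions)."""
    groups = []  # list of (merged_keyword_set, member_list) pairs
    for speaker, words in transcripts.items():
        for keywords, members in groups:
            if len(keywords & words) >= min_overlap:
                keywords |= words      # grow the discussion's topic set
                members.append(speaker)
                break
        else:
            groups.append((set(words), [speaker]))
    return [members for _, members in groups]
```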

[00010] https://glue.work/ is an example of a service where users can meet in a virtual environment, move around and speak to people they meet. Spatial audio rendering is sometimes used so that the position of other users is also reflected in the perceived sound from them. Spatial audio makes it easier to hear individual voices in a crowd, both because the direction of each voice will be different and because distance gain will make voices from users close to the listener be heard more strongly. This makes it possible to move around in a virtual crowd and find a group that you want to talk to, much in the same way as in a real meeting situation.
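
The distance gain mentioned above can be sketched as a simple inverse-distance attenuation; the reference distance, the attenuation law and the gain floor below are illustrative assumptions, not details of any particular service.

```python
# Sketch of distance gain for spatial audio: voices of users closer to
# the listener in the virtual room are played back louder. The
# inverse-distance law and the clamping bounds are assumptions made
# for illustration.
import math

def distance_gain(listener, speaker, ref_dist=1.0, min_gain=0.05):
    """listener, speaker: (x, y) positions in the virtual room."""
    d = math.dist(listener, speaker)
    if d <= ref_dist:
        return 1.0                    # full gain inside reference radius
    return max(min_gain, ref_dist / d)  # attenuate with distance, floored
```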

[00011] One example of such a service is Entropia Universe™ by Mindark AB, which is a so-called Massive Multiplayer Online Role-playing Game (MMORPG). Back in 2011, the present applicant provided a solution for in-game voice communication for this game, where spatial audio rendering was used in order to present the user with an immersive audio experience with low latency. The product was called Ericsson® In-Game Communication™.

[00012] The above presented systems allow multiple simultaneous speakers, but in practice it is often hard to handle more than one individual speaker. Further, the systems of today have difficulties managing the “coffee break room” situation where there are multiple simultaneous meetings in the same physical/virtual room, which creates an environment with crosstalk noise from several different 1-1 discussions and group discussions. Further, it is difficult today to be active in one conversation and still have the possibility to eavesdrop on any neighboring conversations at a pleasant sound level. Also, the systems of today have problems handling the cocktail party effect, depending on the number of people speaking and the amount of background noise. As shown, there is a need for an improved system for handling speech of users involved in a teleconference.

Summary

[00013] It is an object of the invention to address at least some of the problems and issues outlined above. It is possible to achieve these objects and others by using methods, systems and communication devices as defined in the attached independent claims.

[00014] According to one aspect, a method is provided that is performed by a system of a communication network for handling digitally represented speech from users involved in a teleconference. The method comprises obtaining digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices connected to the teleconference, and determining conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech. The method further comprises determining gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device based on conversation discussion preferences of a user of the first communication device and the determined conversation discussions of the digital representations of speech, controlling the digital representations of speech based on the determined gain control values, and sending the controlled digital representations of speech to the first communication device whereby the first communication device is able to play back the digital representations of speech according to the gain control values.
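
As a rough illustration of this aspect, the gain control values could be computed per listener from the determined conversation discussions. The two-level gain scheme and the exact matching rule below are assumptions for illustration only, not part of the claimed method.

```python
# Server-side sketch of the method of this aspect: per listener, boost
# the speech streams whose conversation discussion matches the
# listener's preference and attenuate the rest. The concrete gain
# values and the matching rule are illustrative assumptions.

PREFERRED_GAIN = 1.0   # assumed gain for the preferred discussion
OTHER_GAIN = 0.2       # assumed gain for all other discussions

def gains_for_listener(streams, preferred_discussion):
    """streams: dict of device_id -> discussion label (from speech analysis).
    Returns device_id -> gain control value for this listener."""
    return {
        device: PREFERRED_GAIN if disc == preferred_discussion else OTHER_GAIN
        for device, disc in streams.items()
    }

def apply_gains(samples, gains):
    """samples: dict of device_id -> list of speech samples.
    Scales each stream by its gain before sending to the listener."""
    return {dev: [s * gains[dev] for s in sig] for dev, sig in samples.items()}
```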

[00015] According to another aspect, a method is provided that is performed by a first communication device connected to a teleconference provided by a communication network, for handling digitally represented speech from users involved in the teleconference. The method comprises receiving, from a system of the communication network, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices connected to the teleconference, the digital representations of speech being individually gain controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device and on conversation discussions of the digital representations of speech, and playing back, on a user interface, the received digital representations of speech.

[00016] According to another aspect, a system is provided that is operable in a communication network for handling digitally represented speech from users involved in a teleconference. The system comprises a processing circuitry and a memory. Said memory contains instructions executable by said processing circuitry, whereby the system is operative for obtaining digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices connected to the teleconference, and determining conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech. The system is further operative for determining gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device based on conversation discussion preferences of a user of the first communication device and the determined conversation discussions of the digital representations of speech, controlling the digital representations of speech based on the determined gain control values, and sending the controlled digital representations of speech to the first communication device, whereby the first communication device is able to play back the digital representations of speech according to the gain control values.

[00017] According to another aspect, a first communication device is provided that is operable to be connected to a teleconference provided by a communication network. The first communication device is further operable for handling digitally represented speech from users involved in the teleconference. The first communication device comprises a processing circuitry and a memory. Said memory contains instructions executable by said processing circuitry, whereby the first communication device is operative for receiving, from a system of the communication network, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices connected to the teleconference, the digital representations of speech being individually gain controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device and on conversation discussions of the digital representations of speech, and playing back, on a user interface, the received digital representations of speech.

[00018] According to other aspects, computer programs and carriers are also provided, the details of which will be described in the claims and the detailed description.

[00019] Further possible features and benefits of this solution will become apparent from the detailed description below.

Brief Description of Drawings

[00020] The solution will now be described in more detail by means of exemplary embodiments and with reference to the accompanying drawings, in which:

[00021] Fig. 1 is a schematic block diagram of a communication system in which the present invention may be used.

[00022] Fig. 2 is a flow chart illustrating a procedure performed by a system, according to possible embodiments.

[00023] Fig. 3 is a flow chart illustrating a procedure performed by a communication device, according to possible embodiments.

[00024] Fig. 4 is a block diagram of a communication system and the inventive system according to possible embodiments.

[00025] Fig. 5 is a block diagram in more detail of an embodiment of the inventive system.

[00026] Fig. 6 is a block diagram illustrating a system in more detail, according to further possible embodiments.

[00027] Fig. 7 is a block diagram illustrating a communication device in more detail, according to further possible embodiments.

Detailed Description

[00028] Fig. 1 shows a communication network 100 in which embodiments of the present invention can be used. The communication network 100 comprises a teleconference system 110 and a system 120 for handling digitally represented speech from users involved in a teleconference provided by the teleconference system 110. The system 120 may be part of the teleconference system 110 or the system 120 may be separate from the teleconference system 110. As shown in fig. 1, there are a plurality of communication devices 101, 102, 103 with the ability to connect to the communication network 100 and to connect to the teleconference system 110. The communication network 100 may also comprise an automatic speech recognition (ASR) system 130 arranged for obtaining digital representations of speech detected from sound captured at a microphone of individual ones of the plurality of communication devices 101, 102, 103. Alternatively, the system 120 may be arranged for obtaining digital representations of speech detected from sound captured at a microphone of individual ones of the plurality of communication devices 101, 102, 103.

[00029] The communication network 100 may be any kind of wireline or wireless communication network that can provide access to communication devices. Examples of such wireless communication networks are Global System for Mobile communication (GSM), Enhanced Data Rates for GSM Evolution (EDGE), Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA 2000), Long Term Evolution (LTE), LTE Advanced, Wireless Local Area Networks (WLAN), Worldwide Interoperability for Microwave Access (WiMAX), WiMAX Advanced, as well as fifth generation wireless communication networks based on technology such as New Radio (NR). Examples of wireline communication networks are telephone networks, cable television networks, internet access networks, and fiber-optic communication networks.

[00030] The communication devices 101, 102, 103 may be any type of wireline or wireless communication device capable of communicating with the communication network 100 and of connecting to the teleconference system 110. Examples of wireless and wireline communication devices are a User Equipment (UE), a machine type UE or a UE capable of machine to machine (M2M) communication, a sensor, a tablet, a mobile terminal, a smart phone, a laptop embedded equipment (LEE), a laptop mounted equipment (LME), a USB dongle, a Customer Premises Equipment (CPE), local teleconference equipment etc.

[00031] Fig. 2, in conjunction with fig. 1, describes a method performed by a system 120 of a communication network 100 for handling digitally represented speech from users involved in a teleconference. The method comprises obtaining 202 digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices 101, 102, 103 connected to the teleconference, and determining 204 conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech. The method further comprises determining 212 gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device 101 based on conversation discussion preferences of a user of the first communication device 101 and the determined conversation discussions of the digital representations of speech, controlling 214 the digital representations of speech based on the determined gain control values, and sending 216 the controlled digital representations of speech to the first communication device 101, whereby the first communication device is able to play back the digital representations of speech according to the gain control values.
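As a purely illustrative sketch (the helper names and the keyword-based topic detection are assumptions for the example, not part of the application), the flow of steps 202-216 for one receiving device could look like:

```python
# Illustrative sketch of steps 202-216 for a single receiving device.
# All names and the trivial keyword matcher are assumptions.

def detect_topic(text, keywords=("budget", "release", "holiday")):
    # Minimal keyword-based topic detection (step 204's speech analysis).
    for kw in keywords:
        if kw in text.lower():
            return kw
    return "other"

def handle_teleconference_speech(streams, preferences):
    """streams: dict mapping device id -> {'text': ASR output, 'audio': samples}.
    preferences: set of topics the first device's user wants to hear."""
    # Step 204: determine a conversation discussion (here: a topic) per stream.
    discussions = {dev: detect_topic(s["text"]) for dev, s in streams.items()}
    # Step 212: a gain per stream, higher for preferred discussions.
    gains = {dev: (1.0 if topic in preferences else 0.2)
             for dev, topic in discussions.items()}
    # Step 214: control (scale) each digital representation of speech.
    controlled = {dev: [x * gains[dev] for x in streams[dev]["audio"]]
                  for dev in streams}
    # Step 216: the controlled streams would then be sent to the first device.
    return controlled, gains
```

The sketch collapses "conversation discussion" to a single topic keyword per stream; the application allows richer features, as described below.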

[00032] Hereby it is possible to play back the speech of the users of a teleconference to a user of the first communication device so that the user of the first communication device will better hear conversation discussions of his/her own interest compared to conversation discussions that the user of the first communication device is not interested in. Further, the user of the first communication device may participate in and move between discussions without the physical mobility demanded in real life, and may according to preferences emulate being close/distant to interesting/boring discussions. The method is performed for a plurality of the devices involved in the teleconference, wherein the plurality of devices may be all devices of the teleconference or a subset of the devices. The system 120 may be a part of the teleconference system 110 of fig. 1 or a separate system (as shown in the example of fig. 1). Alternatively, the system 120 may be spread out over different physical or virtual nodes in the communication network 100, a so-called cloud solution. The system 120 receives the digital representations of speech from the plurality of communication devices directly and/or via the ASR 130 of fig. 1. The first communication device 101 may or may not be a part of the plurality of communication devices. In case the first communication device is entering an ongoing teleconference, it is normally not a part of the plurality of communication devices whose digital representations of speech are analysed, as the first communication device just entered the teleconference. In case the first communication device is already part of the ongoing teleconference, it may or may not be part of the plurality of communication devices whose digital representations of speech are analysed. The conversation discussions that are determined relate to which users in the teleconference are involved in one and the same discussion.
One feature that can be used to find out which conversation discussions a user is involved in is to analyze conversation topics of each speech; another feature is to analyze time-domain aspects, i.e. who speaks after whom, who speaks at the same time etc.; another feature is who addresses whom in their speech; yet another feature is spatial placement in a virtual room, in case such a room is used. The system may detect matches of one or more of the above features to determine the conversation discussions. Determining conversation discussions may comprise grouping the digital representations of speech based on the matching detection so that users involved in the same discussion are grouped in the same conversation discussion. The determining 204 of conversation discussions may be accomplished based on speech analysis that searches for key words in a digital representation of speech from a user of one of the plurality of communication devices. The conversation discussion preferences may be which conversation topics the user of the first communication device is interested in, and/or which other users the user of the first communication device wants to listen to, etc. The conversation discussion preferences may have been sent from the first communication device 101 to the system 120 when the user subscribed to the service, or the conversation discussion preferences may be updated during use of the service.
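To make the matching-and-grouping step concrete, here is a minimal sketch, assuming the conversation topic is the only feature used; the function name and data shapes are illustrative, not from the application:

```python
# Illustrative grouping of users into conversation discussions by one
# feature (a detected topic keyword per user). Assumed names and shapes.
from collections import defaultdict

def group_by_topic(topics_per_user):
    """topics_per_user: dict user id -> detected topic keyword."""
    groups = defaultdict(list)
    for user, topic in topics_per_user.items():
        # Users sharing a topic are placed in the same conversation discussion.
        groups[topic].append(user)
    return dict(groups)
```

A fuller implementation would combine several features (timing, who addresses whom, virtual-room placement) before declaring a match.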

[00033] According to an embodiment, the determining 204 of conversation discussions for the received digital representations of speech comprises determining one or more features of each speech and grouping the digital representations of speech in one or more conversation discussions based on the one or more features. The one or more features may be one or more of: conversation topics, which may be found out from determined key words; time-domain aspects of the digital representations of speech; who addresses whom in their speech; and spatial placement in a virtual room.

[00034] According to an alternative of this embodiment, the one or more features comprise conversation topics. Determining that two digital representations of speech concern the same conversation topic is a good indication that the two users are involved in the same conversation discussion.

[00035] According to an embodiment, the method further comprises sending 205, to the first communication device 101, information on the determined conversation discussions for the received digital representations of speech, and receiving 206, from the first communication device, in response to the sent information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device. The information on the determined conversation discussions may comprise information on conversation topics and/or which users are involved in which conversation discussions. Hereby the user of the first device can give real-time conversation discussion preferences based on the existing conversation discussions in the teleconference, i.e. which of the existing conversations he/she would like to listen to. For example, the first communication device can, as information, send to the system 120 which of the other users the user of the first communication device would like to listen to, or which conversation topic the user of the first communication device would like to listen to. Further, the user of the first communication device can get information about ongoing discussions in the teleconference directly when entering the teleconference, e.g. a virtual environment of the teleconference.

[00036] According to another embodiment, the method further comprises receiving 207 user metadata of users of one or more of the plurality of communication devices, the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and conversation topic preferences of the user, wherein the determining 212 of gain control values is performed based on the received user metadata. Hereby, speech from individual users of the plurality of communication devices can be better controlled for the first communication device.

[00037] According to a further embodiment, the determined 212 gain control values for each of the digital representations of speech are dependent on a priority of the conversation discussions. Digital representations of speech determined to be involved in a high priority conversation discussion receive higher gain control values than the digital representations of speech determined to be involved in a conversation discussion having a lower priority than the high priority conversation discussion. A high priority of a conversation discussion may be due to the conversation discussion dealing with an important conversation topic, e.g. a work-related conversation topic. By increasing the gain control difference between high priority conversation discussions and lower prioritized conversation discussions, important conversation discussions can be heard better by the user of the first communication device. The priority may be pre-set.
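The priority-to-gain mapping described above might be sketched as follows; the specific priority levels and gain figures are illustrative assumptions, not values from the application:

```python
# Illustrative mapping from a pre-set discussion priority to a gain control
# value. Priority labels and gain figures are assumptions.

PRIORITY_GAINS = {"high": 1.0, "normal": 0.5, "low": 0.2}

def gain_for_discussion(priority):
    # Unknown priorities fall back to the normal gain.
    return PRIORITY_GAINS.get(priority, PRIORITY_GAINS["normal"])
```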

[00038] According to another embodiment, the method further comprises determining 208 a first group and a second group of the plurality of communication devices based on the determined 204 conversation discussions for each of the received digital representations of speech, and sending 209, to the first communication device 101, information on the determined first and second groups of communication devices. Hereby, when the teleconference is e.g. illustrated as a virtual meeting on a screen of the first communication device, the first communication device can illustrate the virtual meeting as having groups of people involved in different discussions at different parts of the screen. For example, the communication devices determined to be in the first group (and their users) can be positioned on a first part of the screen, and the communication devices determined to be in the second group (and their users) can be positioned on a second part of the screen different from the first part.

[00039] According to another embodiment, the method further comprises receiving 210, from the first communication device, information that the first communication device wants to join the first group of communication devices. Further, the determining 212 of gain control values is performed based on the received information that the first communication device wants to join the first group of communication devices. In case the user of the first communication device would like to join the discussion of a certain group, here called the first group of communication devices, the user can indicate that to its communication device and the first communication device informs the system, which will then calculate new gain control values, preferably giving higher gain to speech from communication devices of the first group. The user of the first communication device can indicate such a wish to join the first group, when a virtual meeting room is used, by moving its avatar to the first group.
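A minimal sketch of recomputing gain control values when the first device asks to join the first group; the function signature and the in-group/out-of-group gain values are assumptions for illustration:

```python
# Illustrative recomputation of gain control values after a join request
# (steps 210-212). Members of the joined group get the higher gain.

def gains_after_join(members_by_group, joined_group,
                     in_group_gain=1.0, out_group_gain=0.3):
    """members_by_group: dict group id -> list of device ids."""
    gains = {}
    for group, members in members_by_group.items():
        g = in_group_gain if group == joined_group else out_group_gain
        for dev in members:
            gains[dev] = g
    return gains
```

The out-of-group gain is kept non-zero here so some crosstalk from other discussions remains audible, in line with the crosstalk control described later.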

[00040] Fig. 3, in conjunction with fig. 1, describes a method performed by a first communication device 101 connected to a teleconference provided by a communication network 100, for handling digitally represented speech from users involved in the teleconference. The method comprises receiving 302, from a system 120 of the communication network 100, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices 102, 103 connected to the teleconference, the digital representations of speech being individually gain controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device 101 and on conversation discussions of the digital representations of speech, and playing back 308, on a user interface, the received digital representations of speech. Hereby, the user of the first communication device can better hear the conversations of the teleconference that he/she is more interested in compared to conversations that he/she is less interested in.

[00041] According to an embodiment, the method further comprises receiving 303, from the system 120, information on the conversation discussions for the received digital representations of speech, and sending 304, to the system 120, in response to the received information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device.

[00042] According to another embodiment, the method further comprises sending 301, to the system 120, user metadata of the user of the first communication device 101, the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and conversation topic preferences of the user.

[00043] According to yet another embodiment, the teleconference is illustrated as a virtual meeting on a user interface of the first communication device 101.

Further, the method comprises receiving, from the system 120, information on determined first and second groups of the plurality of communication devices 102, 103, the first and second groups being formed based on the conversation discussions for each of the digital representations of speech of the plurality of communication devices, and presenting, on the user interface of the first communication device, the virtual meeting so that the first group of communication devices is positioned in a first area on the user interface and the second group of communication devices is positioned in a second area on the user interface.

[00044] According to another embodiment, the method further comprises, when receiving input from the user that the user's avatar is moved on the user interface towards the first group, sending information to the system 120 that the first communication device 101 wants to join the first group, and receiving, from the system 120, updated digital representations of speech gain-controlled with higher gain control values for the second communication devices 102, 103 of the first group than before the update. Hereby the user of the first communication device will automatically hear the speech originating from the devices of the first group when the avatar of the user of the first communication device is moved to the first group.

[00045] According to yet another embodiment, the method further comprises selecting the first group as the group to join for the user of the first communication device, based on user input, and giving the first group more focus on the user interface of the first communication device than the second group, based on the selection. The first group may be given more focus on the user interface, i.e. a screen of the first device, than the second group by positioning the first group more centrally on the screen than the second group, or by presenting the first group in a larger scale than the second group.

[00046] According to still another embodiment, the method further comprises receiving 305, on the user interface of the first communication device 101, input on updated gain control values for individual ones of the digital representations of speech, sending 306, to the system 120, the updated gain control values, and receiving 307, from the system 120, updated individually gain-controlled digital representations of speech, based on the sent updated gain control values. Such an embodiment can be realized by having a user interface slider so that the user can control the volume of the individual digital representations of speech according to his/her own preference. For example, in case groups of communication devices are formed based on ongoing conversation discussions, the user can adapt the volume of crosstalk from other groups. The user interface may be a touch screen or similar. The input may be received from actions performed by the user on the touch screen.
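The slider-based control could, as one illustrative assumption, map slider positions linearly to gain control values before sending them to the system 120:

```python
# Illustrative slider-to-gain mapping for per-stream volume control.
# A 0-100 slider range mapped linearly to gain 0.0-1.0 is an assumption.

def sliders_to_gains(slider_positions):
    """slider_positions: dict stream id -> slider value in 0..100."""
    return {sid: max(0, min(100, v)) / 100.0   # clamp, then scale
            for sid, v in slider_positions.items()}
```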

[00047] Embodiments of the present invention describe a teleconference solution, i.e. an audio/video conference solution, that can detect and, when needed, group audio streams based on which conversation discussion the individual audio stream belongs to, using audio stream characteristics such as conversation topics, persons speaking or being mentioned, type of communication device, device capabilities, context etc. Embodiments of the teleconference solution can then suppress or amplify the individual audio streams, i.e. individual digital representations of speech, based on the determined conversation discussion and on conversation discussion preferences of individual participants. A concept is described where multiple discussions occurring simultaneously can be detected, analyzed and presented to the participants, using visual or audio representations, to aid the participant in choosing which preferred discussion(s) to join.

[00048] Also, according to an embodiment, the degree of audible crosstalk, i.e. how much a participant aka user hears from other discussions that he/she has not chosen to take part in, can be controlled. For example, the user can determine and balance the degree of audible crosstalk via a user interface, e.g. between 0 and 100 %.

[00049] Another aspect is for a speaking participant to control the degree of her voice contribution to the crosstalk, i.e. other participants' speech, depending on said criteria. It is also advantageous if there is a correlation between listening and speaking degrees of crosstalk, e.g. listen_overhearing + your_crosstalk = 100 %, i.e. a normal person cannot manage full duplex and talk and listen simultaneously. In a further aspect, the amount of incoming overhearing or outgoing crosstalk contribution may depend on e.g. the ambient noise situation; if you have a noisy background, your recommended crosstalk contribution may be smaller.
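The complementary relation listen_overhearing + your_crosstalk = 100 %, together with the noise-dependent reduction, can be sketched as follows; the linear noise adjustment is an assumption for illustration:

```python
# Illustrative crosstalk budget: the speaker's outgoing contribution is the
# complement of the listening share, reduced on a noisy background.
# The linear noise reduction is an assumption.

def crosstalk_budget(listen_overhearing_pct, ambient_noise_pct=0):
    """Return the speaker's recommended outgoing crosstalk contribution (%)."""
    outgoing = 100 - listen_overhearing_pct          # complement of listening share
    outgoing -= outgoing * ambient_noise_pct / 100   # noisy background -> contribute less
    return max(0, outgoing)
```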

[00050] According to an embodiment, depending on device capabilities, key words from crosstalk may be represented visually and/or audibly at the communication device.

[00051] Fig. 4 shows an overview of an embodiment of a teleconference solution. In this embodiment, N communication devices 401, 402, 403, 404 are connected to a teleconference via a teleconference system 410. Digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices are sent from each device to the teleconference system 410 as the user/users of each communication device 401, 402, 403, 404 speak. An Automatic Speech Recognition (ASR) unit 411 extracts words from each user's digital representation of speech. This is fed into a Discussion Detection and Grouping (DDG) unit 415. There may be one common ASR unit 411 for all digital representations of speech, one ASR unit 411 per digital representation of speech, or one ASR unit per plurality of digital representations of speech.

[00052] The DDG unit 415 determines conversation discussions for the received digital representations of speech/extracted words from the ASR, based on speech analysis. The DDG unit 415 further determines gain control values for each digital representation of speech for reproducing each digital representation of speech at individual ones of the communication devices 401, 402, 403, 404 based on conversation discussion preferences of a user/users of the individual communication devices and the determined conversation discussions of the digital representations of speech. For determining the gain control values, the DDG may also receive conversation discussion preferences and possibly other data relevant for determining gain control values from the individual communication devices as user metadata 417. The individual gain control values are sent to a gain control unit 418 that suppresses or amplifies the individual digital representations of speech according to the individual gain control values. Thereafter, a spatial audio mixer 420 mixes and renders the digital representations of speech before they are sent to the individual communication devices 401, 402, 403, 404. The gain control values, which are individual for each digital representation of speech, are also individual for each communication device 401, 402, 403, 404 that is to receive the digital representations of speech. This is illustrated by "To device 1" in fig. 4. In the same way, individually gain controlled digital representations of speech may be sent to Device 2, Device 3 ... Device N according to the conversation discussion preferences of the user/users of each of those communication devices. After receiving the gain-controlled digital representations of speech, the individual devices may act on individual ones of the received digital representations of speech, e.g. to suppress or amplify individual digital representations according to the user's listening preference. This is illustrated by the arrow 422, which stands for user interaction, in this example from Device 1.
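A highly simplified sketch of what the gain control unit 418 and the mixer 420 do for one receiving device; real spatial rendering (panning, HRTFs) is omitted, and the names and data shapes are assumptions:

```python
# Illustrative gain control + mixing for one receiving device: apply a
# per-stream gain, then sum the streams into a single mix. Assumed shapes.

def mix_for_device(streams, gains):
    """streams: dict dev -> list of samples; gains: dict dev -> float."""
    length = max(len(s) for s in streams.values())
    mix = [0.0] * length
    for dev, samples in streams.items():
        g = gains.get(dev, 1.0)          # default: unity gain
        for i, x in enumerate(samples):
            mix[i] += g * x              # suppress or amplify, then sum
    return mix
```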

[00053] Fig. 5 shows an embodiment of the DDG unit 415 in more detail. The DDG unit 415 receives as input: digital representations of speech from individual ones of the communication devices involved in a teleconference, and user metadata including conversation discussion preferences and possibly also user interactions. Based on this input, the DDG determines control data to be used to control the audio experience for the users of the teleconference according to the individual users' preferences. The DDG may also determine digital representations of ongoing discussions, which will be presented to the users. The user metadata may further comprise device type, device/user position, device/user orientation, discussion settings, user preferences, parameters, etc. The user interaction may comprise user interface interactions such as graphical gestures, volume changes, mute of a certain user etc. The DDG unit 415 may comprise one or more of: a metadata analyzing subunit 432 that analyzes received user metadata, and a meeting parameter analyzing subunit 434 that analyzes meeting parameters such as crosstalk level etc. The DDG unit 415 may comprise a speech analyzer 436 that analyzes the digital representations of speech in order to determine features such as conversation topics and which user says what, and possibly also to determine keyword representation, i.e. whether a certain keyword exists in the digital representation of speech and possibly also how frequent the usage of the keyword is. The speech analyzer 436 may group the digital representations of speech into one or more conversation discussions based on the determined features. The speech analyzer 436 may be based on a Machine Learning (ML) model. The metadata analyzing subunit 432, the meeting parameter analyzing subunit 434 and the speech analyzer 436 feed their results to a determining subunit 438 that determines control data, including gain values, to be used to control the audio experience for the users of the teleconference according to the individual users' preferences. The determining subunit 438 may also group the individual digital representations of speech into conversation groups depending on the received input.

[00054] According to an embodiment, the DDG unit 415 creates a digital representation of at least one ongoing discussion, based on one or more of user metadata and application data. User metadata may comprise one or more of: device type or device capabilities, such as whether the device comprises or is one or more of Extended reality/Mixed reality (XR) functionality, a tablet PC, a smartphone, and headphones. User metadata may further comprise user preferences such as topics of interest, liked persons/friends, etc., position/orientation of a user in a virtual meeting room and direction to other users, speech data and ASR data, i.e. spoken words from each participant, and timing of speech and/or words aspects. Application data may comprise one or more of: user data related to user interaction, such as interaction with a virtual meeting application, input from User Interface (UI) sliders to control heard crosstalk and the own speech's contribution to crosstalk, and captured gestures, touch, etc. on a UI. Application data may also comprise changes in volume, equalizer (EQ) input, or mute/un-mute input.

[00055] According to an embodiment, the DDG unit 415 creates a digital representation of the discussions, using a trained ML model, based on received metadata and application data. Said digital representation of the discussions includes information about users, i.e. participants of the teleconference, their locations and the different discussions. The DDG unit 415 generates steering/control data, based on the digital representation (visual and/or spatial audio), which will be used to control the audio experience, i.e. the audio suppression and spatial audio, for the participants.

[00056] According to an embodiment, each participant may store its personal discussions and associated participants in a personal digital representation logbook at the DDG unit 415.

[00057] According to an embodiment, a machine learning model (MLM) may be used for discussion detection and topic grouping. Such an MLM would identify, determine and use meeting metadata and users' metadata to determine and use a selected audio suppression feature, to increase/decrease a user's audial focus in respect of the alignment between a first user's personal preferences and the topics that are available for interaction within a teleconference, aka digital meeting, to be attended and/or already attended. In this context, the MLM may include one or more of:

- A cloud server that manages users' connections, user inbound/outbound data, and attributes associated with digital/virtual meetings;

- The managing server may also operate meeting-associated machine learning models, and may, in interaction with other servers that operate at least part of the required MLMs, receive ML-model data from said other servers;

- A user application like Microsoft® Teams, Skype®, or similar, for attending digital/virtual meetings;

- A communication device that runs the user application associated with the digital meeting, is equipped with a microphone, is equipped with a speaker/headphone, may be combined with or into a Head Mounted Display (HMD), XR glasses, etc. that renders the meeting’s digital environment, and can carry out head-position and gazing direction detection, and may furthermore be equipped with a user-facing camera e.g. enabling classification of user’s facial expressions/emotions.

[00058] According to an embodiment, an MLM may be operative at one or more of the managing servers and/or in the user application, where: at least one instance of the MLM may cater for ASR and speech transcription in general; at least one instance of the MLM may cater for textual interpretations and content classifications in general; and at least one instance of the MLM may cater for, per participating user, classification of user behavior patterns, likes/dislikes, preferred discussion subjects, and preferred other users to interact with in the context of a digital session. Here, an instance of an MLM may be regarded as one or more logical or physical entities, residing in at least one physical or logical node in the communication network.

[00059] The MLM(s) may be trained. The MLM training may include training ASR and to-text content classification so that the MLM(s) learn from multiple users' input to separate and classify speech and possibly transcribe it into textual form, in terms of e.g. labeling which speaker said what and which languages were used. Further, the MLM may label content with respect to names of people, places, and items as well as references to dates and quantities. Further, the MLM may learn to tag key phrases. For example, from a discussion about a basketball game, the MLM might tag the names of the teams, the name of the venue, and the final score as key phrases. Further, the MLM may learn to detect what sentiment, i.e. feeling, sense, sensation, emotion, was used, etc. This may be achieved using Amazon® Transcribe and Amazon® Comprehend.

[00060] The MLM training may also include training at least one instance for per-user personalization, i.e. establishing a personal MLM that reflects the user's personal behavior patterns, likes/dislikes, preferred discussion subjects, and preferred other persons to interact with in the context of a digital session during every-day activities. This may be accomplished via interactions with Amazon® Alexa®, Apple® Siri®, or Google® Assistant® or similar services. Further, per-user personalization training may accomplish e.g. mapping of a user's purchase patterns, media subscriptions in terms of e.g. sport events, taste in music, type of news media consumed, web pages visited, etc. The per-user personalization training may also comprise deriving user behavior/preferences/etc. in respect of: persons/individuals interacted with, such as gender, relation to the user, and age; environment, e.g. work, home, indoors, outdoors, public environment, domestic, abroad, silent or noisy; time, such as hour of day, month, winter or summer; and personal mood.

[00061] According to another embodiment, a method is provided operative in the managing server and/or in the communication device, or in the user application, or combinations thereof. The method may include the following steps:

- determining user device capabilities, such as capabilities of microphone, speakers, smartphone/tablet screen, XR, HMD, etc.,

- determining whether a user is present in a digital meeting, or is inbound to a digital meeting,

- determining, for the targeted meeting, upon readout from MLMs trained on multiple users’ input: which users are present in the meeting, e.g. whether they are known or unknown to the user; conversational patterns, i.e. who is speaking to whom; classification of spoken topics within those conversational patterns; and readout of content tagging.

- Further, the method may comprise determining, upon matching of “MLMs on Personal Preference of topics” (MLMpp) with meeting tagging based on “MLMs trained on Multiple Users” (MLMmu), a metric that comprises one or more of: a ranking of the match between MLMpp and MLMmu, e.g. with respect to interesting content and topics, given the involved users, the users’ current locality, and environment variables.
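The MLMpp/MLMmu matching step above could be sketched as follows. This is a minimal illustration only; the topic-weight representation, the function name, and the use of cosine similarity are assumptions for the sketch, not details taken from the application.

```python
import math

def match_metric(personal_topics: dict[str, float], meeting_tags: dict[str, float]) -> float:
    """Illustrative MLMpp/MLMmu match: cosine similarity between a user's
    topic-preference weights (MLMpp side) and a meeting's topic-tag weights
    (MLMmu side). Returns a value in [0, 1] for non-negative weights."""
    common = set(personal_topics) & set(meeting_tags)
    dot = sum(personal_topics[t] * meeting_tags[t] for t in common)
    norm_p = math.sqrt(sum(v * v for v in personal_topics.values()))
    norm_m = math.sqrt(sum(v * v for v in meeting_tags.values()))
    return dot / (norm_p * norm_m) if norm_p and norm_m else 0.0
```

In this sketch, a meeting tagged with topics the user prefers yields a metric close to 1, while a meeting with no overlapping topics yields 0.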

[00062] According to another embodiment, a method is provided that is operative in the managing server and/or in the communication device, or in the user application, or combinations thereof. The method may be based on the above-mentioned MLMpp/MLMmu matching metric. The method comprises determining whether the metric value is higher or lower than a threshold value for a certain digital representation of speech. If the metric value is higher than the threshold value, a first audio processing method is triggered, which may put more attention to speech associated with the topic of content spoken by persons in said meeting that generates a favorable metric value. If the metric value is lower than the threshold value, a second audio processing method is triggered, which may put less attention to speech associated with the topic of content spoken by persons in said meeting that generates an unfavorable metric value. To “put attention to” may signify to amplify speech having a metric value higher than the threshold value and to suppress speech having a metric value lower than the threshold value. Speech found neutral, i.e. with a metric value similar to the threshold, may be given a neutral amplification.
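The threshold logic above can be sketched as a simple mapping from metric value to gain. The concrete gain values (1.5, 0.5, 1.0) and the width of the “neutral” band around the threshold are illustrative assumptions, not values from the application.

```python
def gain_for_metric(metric: float, threshold: float, band: float = 0.05) -> float:
    """Map an MLMpp/MLMmu matching metric to a gain control value.
    Above the threshold, the first audio processing method amplifies
    (more attention); below it, the second method suppresses (less
    attention); within a small band around the threshold the speech
    is treated as neutral."""
    if metric > threshold + band:
        return 1.5   # put more attention: amplify favorable speech
    if metric < threshold - band:
        return 0.5   # put less attention: suppress unfavorable speech
    return 1.0       # neutral amplification: metric similar to the threshold
```

A caller would apply the returned gain when mixing that digital representation of speech into the stream sent to the listener.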

[00063] In Figures 4 and 5, steps associated with the DDG 415 may be executed in MLMs in one or more physical and/or logical nodes.

[00064] In a basic first approach, it may be practical to reuse current state-of-the-art methods for establishing digital tagging and mapping of speech content. In that scenario it may be typical that an MLMmu resides in one network node and that an MLMpp resides in the communication device, in the user application or in the managing server for the digital meeting session, but likely in a combination thereof, whereby personal preferences are gathered cooperatively through the user’s interaction with her communication device(s) and the cloud-based services, and further processed in either of her communication devices/user applications or said managing servers.

[00065] In a further approach, the MLMmu and the MLMpp may be catered for within the same server ecosystem, i.e. both the MLMmu and the MLMpp are hosted as incarnations of the same general MLM, which both classifies and identifies meeting content, spoken topics and involved users, performs the training of the per-user model representing personal preferences, and performs the step of determining a metric describing the match between personal preferences and meeting topics.

[00066] According to an embodiment, the functionality of using the personal preference matching metric to determine audio suppression may reside in any of the network nodes.

[00067] In an embodiment where the MLMmu and the MLMpp are separate entities, determination of the audio suppression metric may be done in the MLMpp or in any other node communicating with said MLMpp.

[00068] In another embodiment where the MLMmu and the MLMpp are parts of the same general MLM entity, determination of the audio suppression metric may be done in said general MLM entity or in any other node communicating with said general MLM.

[00069] In the following embodiment, user interaction with the user application is described. The user application, i.e. the teleconference application, may provide means to control an over-hearing/suppression slider for a user to steer/tune other-to-own (and vice versa own-to-other) crosstalk, depending on other speakers’ topic/voice characteristics. A user’s own crosstalk contribution to the overall crosstalk levels may depend on different aspects, such as the selected degree of add-on to overall crosstalk (0-100%) and the distance/direction from different discussion clusters. A user may experience crosstalk levels from other participants based on e.g. the other users’ own selected contributions or the distance/direction from other users. Also, certain topics may be more audible due to user preferences, etc. As an example, in a normal face-to-face coffee room discussion there is quite a high level of crosstalk and it is most common to discuss with people sitting close to you. There is also a tendency in such scenarios that the crosstalk level increases both when more people join the coffee room and as the discussions progress. In radio theory this is often denoted the “party or noise rise effect”, which is related also to the Lombard effect. In another example, imagine a situation where a person only focuses on one discussion at a time. In this scenario the crosstalk level is very low, but it is less plausible that you manage to identify or join other interesting discussions.

[00070] Also, according to an embodiment, the level of crosstalk may be controlled by a central server, such as the DGD, e.g. to adapt to the type of discussion. For example, highly technical discussions, e.g. at work, require a very low crosstalk level and good audibility, whereas casual discussions may allow a higher level of crosstalk, as the typical information density may be lower compared to the technical discussion.
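The crosstalk factors described in the two paragraphs above could be combined roughly as follows. The formula, the distance model, and the topic factor are assumptions for illustration; the application does not specify how the factors are combined.

```python
def crosstalk_gain(contribution_pct: float, distance: float,
                   topic_factor: float = 1.0) -> float:
    """Illustrative crosstalk level for one remote participant:
    - contribution_pct: the user's selected add-on degree (0-100 %),
    - distance: virtual distance to the discussion cluster,
    - topic_factor: scaling set e.g. by the central server
      (low for highly technical discussions, higher for casual ones)."""
    base = max(0.0, min(contribution_pct, 100.0)) / 100.0
    attenuation = 1.0 / (1.0 + max(0.0, distance))  # farther clusters heard less
    return base * attenuation * topic_factor
```

With this sketch, a nearby cluster at full contribution is fully audible, while a technical discussion with a low topic factor stays quiet even at zero distance.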

[00071] Also, different device capabilities, e.g. XR glasses versus headphones (which have no GUI), provide different abilities to digitally represent the discussions. The XR glasses may provide detailed information graphically and audibly about discussion topics, clusters, positions of participants, etc., while the headphones are solely dependent on audible information. Hence, useful means for suppression may differ.

[00072] In the following, topic selection using audio is described. Spatial representation of virtual rooms may be auto-created based on detected topics and topic preferences, using e.g. keywords, etc. For example, in a virtual XR room the most relevant virtual room may be positioned in front of the user, whereas less relevant virtual rooms are provided to the left or right of the user. By the user moving her head to the left or right, more of the audio from the less relevant virtual rooms will be picked up. If the user indicates that another virtual room is of more interest, the virtual room positioned in front of the user will be replaced with the room indicated by the user. A user with headphones may receive indications of interesting discussions (key words) represented in 3D audio. If the user turns his/her head towards a discussion with a certain topic in 3D, the system may be arranged so that the user can join that discussion.

[00073] According to another embodiment, a user may prefer to follow a person, a conversation topic, or a group of people. The system may then support the user in selecting which topics, clusters or persons to join when entering a teleconference, such as by providing recommendations, predicting and preparing a first discussion to join, and predicting and preparing (suppressing) what discussion(s) not to participate in.
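The head-direction pickup described above can be sketched with a simple angular gain model. The cosine law and the function names are assumptions for the sketch; the application does not prescribe a particular spatial audio formula.

```python
import math

def room_gain(head_yaw_deg: float, room_yaw_deg: float) -> float:
    """Illustrative head-direction gain for a virtual room: the room the
    user is facing plays at full level, rooms to the side are attenuated,
    and rooms behind the user are silent."""
    diff = math.radians(head_yaw_deg - room_yaw_deg)
    return max(0.0, math.cos(diff))
```

Turning the head towards a less relevant room to the side thus raises that room's gain, matching the behavior described for the XR and headphone cases.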

[00074] Fig. 6, in conjunction with fig. 1, shows a system 120 operable in a communication network 100 for handling digitally represented speech from users involved in a teleconference. The system 120 comprises a processing circuitry 603 and a memory 604. Said memory contains instructions executable by said processing circuitry, whereby the system 120 is operative for obtaining digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices 101, 102, 103 connected to the teleconference, and determining conversation discussions for the received digital representations of speech, based on speech analysis of the received digital representations of speech. The system 120 is further operative for determining gain control values for each of the digital representations of speech for reproducing each digital representation of speech at a first communication device 101 based on conversation discussion preferences of a user of the first communication device 101 and the determined conversation discussions of the digital representations of speech, controlling the digital representations of speech based on the determined gain control values, and sending the controlled digital representations of speech to the first communication device 101, whereby the first communication device is able to play back the digital representations of speech according to the gain control values.

[00075] According to an embodiment, the system 120 is operative for the determining of conversation discussions for the received digital representations of speech by determining one or more features of each digital representation of speech and grouping the digital representations of speech into one or more conversation discussions based on the one or more features.

[00076] According to another embodiment, the one or more features comprise conversation topics.

[00077] According to another embodiment, the system 120 is further operative for sending, to the first communication device 101, information on the determined conversation discussions for the received digital representations of speech, and receiving, from the first communication device, in response to the sent information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device 101.

[00078] According to another embodiment, the system 120 is further operative for receiving user metadata of users of one or more of the plurality of communication devices, the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, user position in a virtual meeting room, and conversation topic preferences of the user. Further, the system is operative for the determining of gain control values based on the received user metadata.

[00079] According to another embodiment, the system 120 is operative for determining the gain control values for the digital representations of speech dependent on a priority of the conversation discussions, so that the digital representations of speech determined to be involved in a high-priority conversation discussion receive higher gain control values than the digital representations of speech determined to be involved in a conversation discussion having a lower priority than the high-priority conversation discussion.
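The priority-dependent gain assignment above can be sketched as a lookup from discussion priority to gain. The priority levels and gain values in the table are assumed examples; the application only requires that higher-priority discussions receive higher gains.

```python
def gains_by_priority(discussion_priority: dict[str, int]) -> dict[str, float]:
    """Assign a gain control value to each digital representation of speech
    (keyed by an assumed speech identifier) so that speech in higher-priority
    conversation discussions receives a higher gain than speech in
    lower-priority ones."""
    gain_table = {2: 1.0, 1: 0.6, 0: 0.2}  # priority level -> gain (assumed)
    return {speech_id: gain_table.get(p, 0.2)
            for speech_id, p in discussion_priority.items()}
```

The system could then apply these per-speech gains before mixing and sending the controlled digital representations of speech to the first communication device.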

[00080] According to another embodiment, the system 120 is further operative for determining a first group and a second group of the plurality of communication devices based on the determined conversation discussions for the received digital representations of speech, and sending, to the first communication device 101, information on the determined first and second groups of communication devices.

[00081] According to another embodiment, the system 120 is further operative for receiving, from the first communication device 101, information that the first communication device wants to join the first group of communication devices. Further, the system is operative for the determining of gain control values based on the received information that the first communication device wants to join the first group of communication devices.

[00082] According to other embodiments, the system 120 may further comprise a communication unit 602, which may be considered to comprise conventional means for communication with the communication devices 101, 102, 103. The communication unit 602 may also comprise conventional means for communication with other units or systems of the communication system, such as the teleconference system 110, in case the system 120 is separate from the teleconference system, and the ASR 130. The instructions executable by said processing circuitry 603 may be arranged as a computer program 605 stored e.g. in said memory 604. The processing circuitry 603 and the memory 604 may be arranged in a sub-arrangement 601. The sub-arrangement 601 may be a microprocessor and adequate software and storage therefor, a Programmable Logic Device, PLD, or other electronic component(s)/processing circuit(s) configured to perform the methods mentioned above. The processing circuitry 603 may comprise one or more programmable processors, application-specific integrated circuits, field-programmable gate arrays or combinations of these adapted to execute instructions.

[00083] The computer program 605 may be arranged such that when its instructions are run in the processing circuitry, they cause the system 120 to perform the steps described in any of the described embodiments of the system 120 and its method. The computer program 605 may be carried by a computer program product connectable to the processing circuitry 603. The computer program product may be the memory 604, or at least arranged in the memory. The memory 604 may be realized as for example a RAM (Random Access Memory), ROM (Read-Only Memory) or an EEPROM (Electrically Erasable Programmable ROM). In some embodiments, a carrier may contain the computer program 605. The carrier may be one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium. The computer-readable storage medium may be e.g. a CD, DVD or flash memory, from which the program could be downloaded into the memory 604. Alternatively, the computer program may be stored on a server or any other entity to which the system 120 has access via the communication unit 602. The computer program 605 may then be downloaded from the server into the memory 604.

[00084] Further, the system 120 may be realized e.g. as a separate node or as a cloud solution, i.e. the system 120 may comprise functionality spread out over different nodes or networks.

[00085] Fig. 7, in conjunction with fig. 1, describes a first communication device 101 operable to be connected to a teleconference provided by a communication network 100. The first communication device is arranged for handling digitally represented speech from users involved in the teleconference. The first communication device 101 comprises a processing circuitry 703 and a memory 704. Said memory 704 contains instructions executable by said processing circuitry, whereby the first communication device 101 is operative for receiving, from a system 120 of the communication network, digital representations of speech detected from sound captured at a microphone of each of a plurality of communication devices 102, 103 connected to the teleconference, the digital representations of speech being individually gain-controlled based on gain control values determined based on conversation discussion preferences of a user of the first communication device 101 and on conversation discussions of the digital representations of speech, and playing back, on a user interface (UI) 706, the received digital representations of speech. The UI 706 may comprise one or more of any kind of loudspeaker and any kind of screen.

[00086] According to an embodiment, the first communication device 101 is further operative for receiving, from the system 120, information on the conversation discussions for the received digital representations of speech, and sending, to the system 120, in response to the received information on conversation discussions, information on the conversation discussion preferences of the user of the first communication device.

[00087] According to another embodiment, the first communication device 101 is further operative for sending, to the system 120, user metadata of the user of the first communication device 101, the user metadata comprising one or more of: conversation discussion preferences, user ID, communication device type, and user position in a virtual meeting room.

[00088] According to another embodiment, the teleconference is illustrated as a virtual meeting on a user interface 706 of the first communication device 101. Further, the first communication device 101 is operative for receiving, from the system 120, information on the determined first and second groups of the plurality of communication devices 102, 103, the first and second groups being formed based on the conversation discussions for the digital representations of speech of the plurality of communication devices, and presenting, on the user interface 706 of the first communication device, the virtual meeting so that the first group of communication devices is positioned in a first area on the user interface 706 and the second group of communication devices is positioned in a second area on the user interface 706.

[00089] According to another embodiment, the first communication device 101 is further operative for sending information to the system 120 that the first communication device 101 wants to join the first group, upon receiving input from the user that the user’s avatar is moved on the user interface 706 towards the first group, and for receiving, from the system 120, updated digital representations of speech gain-controlled with higher gain control values for the second communication devices 102, 103 of the first group than before the update.

[00090] According to yet another embodiment, the first communication device 101 is further operative for selecting the first group as the group to join for the user of the first communication device, based on user input, and giving the first group more focus on the user interface 706 of the first communication device than the second group, based on the selection.

[00091] According to another embodiment, the first communication device 101 is further operative for receiving, on the user interface 706 of the first communication device 101, input on updated gain control values for individual ones of the digital representations of speech, sending, to the system 120, the updated gain control values, and receiving, from the system 120, updated individually gain-controlled digital representations of speech based on the sent updated gain control values.

[00092] According to other embodiments, the first communication device 101 may further comprise a communication unit 702, which may be considered to comprise conventional means for communication with the system 120, as well as with the teleconference system 110 and the ASR 130, if needed. The instructions executable by said processing circuitry 703 may be arranged as a computer program 705 stored e.g. in said memory 704. The processing circuitry 703 and the memory 704 may be arranged in a sub-arrangement 701. The sub-arrangement 701 may be a microprocessor and adequate software and storage therefor, a Programmable Logic Device, PLD, or other electronic component(s)/processing circuit(s) configured to perform the methods mentioned above. The processing circuitry 703 may comprise one or more programmable processors, application-specific integrated circuits, field-programmable gate arrays or combinations of these adapted to execute instructions.

[00093] The computer program 705 may be arranged such that when its instructions are run in the processing circuitry, they cause the first communication device 101 to perform the steps described in any of the described embodiments of the first communication device 101 and its method. The computer program 705 may be carried by a computer program product connectable to the processing circuitry 703. The computer program product may be the memory 704, or at least arranged in the memory. The memory 704 may be realized as for example a RAM (Random Access Memory), ROM (Read-Only Memory) or an EEPROM (Electrically Erasable Programmable ROM). In some embodiments, a carrier may contain the computer program 705. The carrier may be one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium. The computer-readable storage medium may be e.g. a CD, DVD or flash memory, from which the program could be downloaded into the memory 704. Alternatively, the computer program may be stored on a server or any other entity to which the first communication device 101 has access via the communication unit 702. The computer program 705 may then be downloaded from the server into the memory 704.

[00094] Although the description above contains a plurality of specificities, these should not be construed as limiting the scope of the concept described herein, but as merely providing illustrations of some exemplifying embodiments of the described concept. It will be appreciated that the scope of the presently described concept fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of the presently described concept is accordingly not to be limited. Reference to an element in the singular is not intended to mean "one and only one" unless explicitly so stated, but rather "one or more." All structural and functional equivalents to the elements of the above-described embodiments that are known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed hereby. Moreover, it is not necessary for an apparatus or method to address each and every problem sought to be solved by the presently described concept, for it to be encompassed hereby. In the exemplary figures, a broken line generally signifies that the feature within the broken line is optional.