Title:
FIRST NODE, SECOND NODE, THIRD NODE, AND METHODS PERFORMED THEREBY, FOR HANDLING AUDIO INFORMATION
Document Type and Number:
WIPO Patent Application WO/2018/231106
Kind Code:
A1
Abstract:
Method, performed by a first node (101) for handling audio information. The first node (101) operates in a computer system (10). The first node (101) determines (201), automatically, a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech. The determining (201) is based on a first analysis of a linguistic content of the conversation. The first node (101) initiates (202) providing, to a second node (102) operating in the computer system (10), a first indication of the determined change in topic of conversation.

Inventors:
MOKRUSHIN LEONID (SE)
GRANCHAROV VOLODYA (SE)
VULGARAKIS FELJAN ANETA (SE)
Application Number:
PCT/SE2017/050629
Publication Date:
December 20, 2018
Filing Date:
June 13, 2017
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G10L15/18; G06F40/20; G10L15/183; G10L15/26; G10L17/26
Domestic Patent References:
WO2016089594A2, 2016-06-09
Foreign References:
US20120290950A1, 2012-11-15
US20160055533A1, 2016-02-25
US20060020473A1, 2006-01-26
US20090222257A1, 2009-09-03
US6236968B1, 2001-05-22
US20010021909A1, 2001-09-13
US20150332168A1, 2015-11-19
Attorney, Agent or Firm:
BLIDEFALK, Jenny (SE)
Claims:
1. A method, performed by a first node (101), for handling audio information, the first node (101) operating in a computer system (10), the method comprising:

- determining (201), automatically, a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech, the determining (201) being based on a first analysis of a linguistic content of the conversation, and

- initiating (202) providing, to a second node (102) operating in the computer system (10), a first indication of the determined change in topic of conversation.

2. The method according to claim 1, wherein the first analysis of the linguistic content is performed on a text transcript of the conversation, the text transcript being based on the audio signals.

3. The method of claim 2, wherein the first analysis comprises:

a) building a first semantic model as a first formal representation of a first meaning of a first set of one or more sentences,

b) building a second semantic model as a second formal representation of a second meaning of a subsequent second set of one or more sentences, and

c) detecting whether or not the second semantic model has a number of common elements with the first semantic model.

4. The method according to any of claims 1-3, further comprising:

- obtaining (203), based on the first analysis of the linguistic content, or on a second analysis of audio signals of the conversation, at least one of:

a) a name for a first topic of the conversation,

b) a set of keywords describing the conversation, and

c) an identification of at least one of the one or more speakers.

5. The method according to claim 4, wherein the initiating (202) providing further comprises initiating providing, to at least one of the second node (102) and a third node (103) operating in the computer system (10), a second indication of at least one of:

a) the obtained name for a first topic of the conversation,

b) the obtained set of keywords describing the conversation, and

c) the obtained identification of at least one of the one or more speakers.

6. A computer program (708), comprising instructions which, when executed on at least one processor (704), cause the at least one processor (704) to carry out the method according to any one of claims 1 to 5.

7. A computer-readable storage medium (709), having stored thereon a computer program (708), comprising instructions which, when executed on at least one processor (704), cause the at least one processor (704) to carry out the method according to any one of claims 1 to 5.

8. A method, performed by a second node (102), for handling audio information, the second node (102) operating in a computer system (10), the method comprising:

- obtaining (401), from a first node (101) operating in the computer system (10), a first indication of a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech, the first indication being based on a first analysis of a linguistic content of the conversation,

- determining (402) a name for a first topic of the conversation, based on the obtained first indication, and

- initiating (403) providing, to at least one of the first node (101) and a third node (103) operating in the computer system (10), a third indication of the determined name for the first topic of conversation.

9. The method of claim 8, wherein the first indication comprises a first semantic model as a first formal representation of a first meaning of a first set of one or more sentences, and wherein the determining (402) of the name for the first topic is based on performing automated logical reasoning about the first semantic model, based on knowledge stored in a database.

10. A computer program (808), comprising instructions which, when executed on at least one processor (804), cause the at least one processor (804) to carry out the method according to any of claims 8-9.

11. A computer-readable storage medium (809), having stored thereon a computer program (808), comprising instructions which, when executed on at least one processor (804), cause the at least one processor (804) to carry out the method according to any of claims 8-9.

12. A method, performed by a third node (103), for handling audio information, the third node (103) operating in a computer system (10), the method comprising:

- facilitating (502) searching, in an interface (105) of the third node (103), by a user of the third node (103), audio signals of a conversation by one or more speakers, the searching being based on at least one of:

a) a name for a first topic of the conversation,

b) a set of keywords describing the conversation, and

c) an identification of at least one of the one or more speakers.

13. The method according to claim 12, further comprising:

- obtaining (501), based on a first analysis of a linguistic content of the conversation by the one or more speakers, at least one of:

a) the name for the first topic of the conversation,

b) the set of keywords describing the conversation, and

c) the identification of at least one of the one or more speakers.

14. The method according to any of claims 12-13, wherein the searching is facilitated in a segment of audio signals comprising speech.

15. A computer program (907), comprising instructions which, when executed on at least one processor (903), cause the at least one processor (903) to carry out the method according to any one of claims 12 to 14.

16. A computer-readable storage medium (908), having stored thereon a computer program (907), comprising instructions which, when executed on at least one processor (903), cause the at least one processor (903) to carry out the method according to any one of claims 12 to 14.

17. A first node (101), configured to handle audio information, the first node (101) being further configured to operate in a computer system (10), the first node (101) being further configured to:

- determine, automatically, a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech, wherein to determine is configured to be based on a first analysis of a linguistic content of the conversation, and

- initiate providing, to a second node (102) configured to operate in the computer system (10), a first indication of the change in topic of conversation configured to be determined.

18. The first node (101) according to claim 17, wherein the first analysis of the linguistic content is configured to be performed on a text transcript of the conversation, the text transcript being configured to be based on the audio signals.

19. The first node (101) according to claim 18, wherein the first analysis is configured to comprise:

a) to build a first semantic model as a first formal representation of a first meaning of a first set of one or more sentences,

b) to build a second semantic model as a second formal representation of a second meaning of a subsequent second set of one or more sentences, and

c) to detect whether or not the second semantic model has a number of common elements with the first semantic model.

20. The first node (101) according to any of claims 17-19, being further configured to:

- obtain, based on the first analysis of the linguistic content, or on a second analysis of audio signals of the conversation, at least one of:

a) a name for a first topic of the conversation,

b) a set of keywords describing the conversation, and

c) an identification of at least one of the one or more speakers.

21. The first node (101) according to claim 20, wherein to initiate providing is further configured to comprise to initiate providing, to at least one of the second node (102) and a third node (103) configured to operate in the computer system (10), a second indication of at least one of:

a) the name, configured to be obtained, for a first topic of the conversation,

b) the set of keywords, configured to be obtained, describing the conversation, and

c) the identification, configured to be obtained, of at least one of the one or more speakers.

22. A second node (102), configured to handle audio information, the second node (102) being further configured to operate in a computer system (10), the second node (102) being further configured to:

- obtain, from a first node (101) configured to operate in the computer system (10), a first indication of a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech, the first indication being configured to be based on a first analysis of a linguistic content of the conversation,

- determine a name for a first topic of the conversation, based on the obtained first indication, and

- initiate providing, to at least one of the first node (101) and a third node (103) configured to operate in the computer system (10), a third indication of the name, configured to be determined, for the first topic of conversation.

23. The second node (102) according to claim 22, wherein the first indication is configured to comprise a first semantic model as a first formal representation of a first meaning of a first set of one or more sentences, and wherein to determine the name for the first topic is configured to be based on performing automated logical reasoning about the first semantic model, based on knowledge stored in a database.

24. A third node (103), configured to handle audio information, the third node (103) being further configured to operate in a computer system (10), the third node (103) being further configured to:

- facilitate searching, in an interface (105) of the third node (103), by a user of the third node (103), audio signals of a conversation by one or more speakers, the searching being configured to be based on at least one of:

a) a name for a first topic of the conversation,

b) a set of keywords describing the conversation, and

c) an identification of at least one of the one or more speakers.

25. The third node (103) according to claim 24, being further configured to:

- obtain, based on a first analysis of a linguistic content of the conversation by the one or more speakers, at least one of:

a) the name for the first topic of the conversation,

b) the set of keywords describing the conversation, and

c) the identification of at least one of the one or more speakers.

26. The third node (103) according to any of claims 24-25, wherein to search is configured to be facilitated in a segment of audio signals comprising speech.

Description:
FIRST NODE, SECOND NODE, THIRD NODE, AND METHODS PERFORMED THEREBY, FOR HANDLING AUDIO INFORMATION

TECHNICAL FIELD

The present disclosure relates generally to a first node and methods performed thereby for handling audio information. The present disclosure also relates generally to a second node, and methods performed thereby for handling audio information. The present disclosure additionally relates generally to a third node, and methods performed thereby for handling audio information. The present disclosure further relates generally to a computer program product, comprising instructions to carry out the actions described herein, as performed by the first node, the second node, or the third node. The computer program product may be stored on a computer-readable storage medium.

BACKGROUND

Computer systems may comprise one or more nodes. A node may comprise one or more processors which, together with computer program code, may perform different functions and actions, a memory, and a receiving and a sending port. Nodes may be comprised in a telecommunications network.

Nodes within a telecommunications network may be wireless devices, e.g., stations (STAs), User Equipments (UEs), mobile terminals, wireless terminals, terminals, and/or Mobile Stations (MS). Wireless devices are enabled to communicate wirelessly in a cellular communications network or wireless communication network, sometimes also referred to as a cellular radio system, cellular system, or cellular network. The communication may be performed e.g. between two wireless devices, between a wireless device and a regular telephone, and/or between a wireless device and a server via a Radio Access Network (RAN), and possibly one or more core networks, comprised within the telecommunications network. Wireless devices may further be referred to as mobile telephones, cellular telephones, laptops, or tablets with wireless capability, just to mention some further examples. The wireless devices in the present context may be, for example, portable, pocket-storable, hand-held, computer-comprised, or vehicle-mounted mobile devices, enabled to communicate voice and/or data, via the RAN, with another entity, such as another terminal or a server.

The telecommunications network may cover a geographical area which may be divided into cell areas, each cell area being served by another type of node, a network node or Transmission Point (TP), for example, an access node such as a Base Station (BS), e.g. a Radio Base Station (RBS), which sometimes may be referred to as e.g., evolved Node B ("eNB"), "eNodeB", "NodeB", "B node", or BTS (Base Transceiver Station), depending on the technology and terminology used. The base stations may be of different classes such as e.g. Wide Area Base Stations, Medium Range Base Stations, Local Area Base Stations and Home Base Stations, based on transmission power and thereby also cell size. A cell is the geographical area where radio coverage is provided by the base station at a base station site. One base station, situated on the base station site, may serve one or several cells. Further, each base station may support one or several communication technologies. The telecommunications network may also be a non-cellular system, comprising network nodes which may serve receiving nodes, such as wireless devices, with serving beams.

In 3rd Generation Partnership Project (3GPP) Long Term Evolution (LTE), base stations, which may be referred to as eNodeBs or even eNBs, may be directly connected to one or more core networks. All data transmission in LTE is controlled by the radio base station.

The amount of media content produced and available today is enormous, and production rates are only accelerating, making the problem of finding relevant parts of the media intractable for humans. Therefore, the ability to search for and navigate through audio-visual content, such as e.g., movies, sport games, lecture recordings or news programs, may be very useful. There is also a lot of commercial interest in extracting semantic information from the media for various purposes, e.g., for gathering statistics, predicting market demands or customizing advertisements to different target groups. Currently, however, the ability to search and navigate through audio or audio-visual content is rather limited.

SUMMARY

It is an object of embodiments herein to improve the handling of audio information in a computer system.

According to a first aspect of embodiments herein, the object is achieved by a method, performed by a first node. The method is for handling audio information. The first node operates in a computer system. The first node determines, automatically, a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech. The determining is based on a first analysis of a linguistic content of the conversation. The first node initiates providing, to a second node operating in the computer system, a first indication of the determined change in topic of conversation.

According to a second aspect of embodiments herein, the object is achieved by a method, performed by a second node. The method is for handling audio information. The second node operates in a computer system. The second node obtains, from the first node operating in the computer system, the first indication of the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech. The first indication is based on a first analysis of a linguistic content of the conversation. The second node then determines a name for a first topic of the conversation, based on the obtained first indication. The second node finally initiates providing, to at least one of the first node and a third node operating in the computer system, a third indication of the determined name for the first topic of conversation.

According to a third aspect of embodiments herein, the object is achieved by a method, performed by a third node. The method is for handling audio information. The third node operates in the computer system. The third node facilitates searching, in an interface of the third node, by a user of the third node, audio signals of a conversation by the one or more speakers. The searching is based on at least one of: a) a name for a first topic of the conversation, b) a set of keywords describing the conversation, and c) an identification of at least one of the one or more speakers.

According to a fourth aspect of embodiments herein, the object is achieved by the first node, configured to handle audio information. The first node is further configured to operate in the computer system. The first node is further configured to determine, automatically, the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech. To determine is configured to be based on the first analysis of the linguistic content of the conversation. The first node is also configured to initiate providing, to the second node configured to operate in the computer system, the first indication of the change in topic of conversation configured to be determined.

According to a fifth aspect of embodiments herein, the object is achieved by the second node, configured to handle audio information. The second node is further configured to operate in the computer system. The second node is further configured to obtain, from the first node configured to operate in the computer system, the first indication of the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech. The first indication is configured to be based on a first analysis of a linguistic content of the conversation. The second node is also configured to determine the name for the first topic of the conversation, based on the obtained first indication. The second node is further configured to initiate providing, to at least one of the first node and the third node configured to operate in the computer system, the third indication of the name, configured to be determined, for the first topic of conversation.

According to a sixth aspect of embodiments herein, the object is achieved by the third node, configured to handle audio information. The third node is further configured to operate in the computer system. The third node is further configured to facilitate searching, in the interface of the third node, by the user of the third node, the audio signals of the conversation by the one or more speakers. The searching is configured to be based on at least one of: a) the name for a first topic of the conversation, b) the set of keywords describing the conversation, and c) the identification of at least one of the one or more speakers.

According to a seventh aspect of embodiments herein, the object is achieved by a computer program. The computer program comprises instructions which, when executed on at least one processor, cause the at least one processor to carry out the method performed by the first node.

According to an eighth aspect of embodiments herein, the object is achieved by a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method performed by the first node.

According to a ninth aspect of embodiments herein, the object is achieved by a computer program. The computer program comprises instructions which, when executed on at least one processor, cause the at least one processor to carry out the method performed by the second node.

According to a tenth aspect of embodiments herein, the object is achieved by a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method performed by the second node.

According to an eleventh aspect of embodiments herein, the object is achieved by a computer program. The computer program comprises instructions which, when executed on at least one processor, cause the at least one processor to carry out the method performed by the third node.

According to a twelfth aspect of embodiments herein, the object is achieved by a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method performed by the third node.

By the first node determining the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech, and then initiating providing the first indication of the determined change to the second node, the first node enables the second node to determine the first topic of the conversation. The second node is further enabled to initiate providing the third indication of the determined first topic of the conversation to the third node. The third node is then enabled to facilitate the searching by the user of the audio signals of the conversation based on the name for the first topic of the conversation, the set of keywords describing the conversation, and/or the identification of at least one of the one or more speakers. Therefore, the ability to search for and navigate through audio or audio-visual content is improved, becoming a less time-consuming operation, taking less capacity and processing resources from the network, reducing power consumption of the devices involved, e.g., battery consumption in wireless devices, providing more accurate searches, and overall enhancing the satisfaction of the one or more users. Moreover, fully automated analysis of the audio content, in terms of statistics, etc., is greatly facilitated.

BRIEF DESCRIPTION OF THE DRAWINGS

Examples of embodiments herein are described in more detail with reference to the accompanying drawings, and according to the following description.

Figure 1 is a schematic diagram illustrating two non-limiting examples in a) and b), respectively, of a computer system, according to embodiments herein.

Figure 2 is a flowchart depicting embodiments of a method in a first node, according to embodiments herein.

Figure 3 is a schematic diagram illustrating some terms related to embodiments herein.

Figure 4 is a flowchart depicting embodiments of a method in a second node, according to embodiments herein.

Figure 5 is a flowchart depicting embodiments of a method in a third node, according to embodiments herein.

Figure 6 is a schematic diagram illustrating an example of the different components of the computer system and their interactions, according to embodiments herein.

Figure 7 is a schematic block diagram illustrating an example of a first node, according to embodiments herein.

Figure 8 is a schematic block diagram illustrating an example of a second node, according to embodiments herein.

Figure 9 is a schematic block diagram illustrating an example of a third node, according to embodiments herein.

DETAILED DESCRIPTION

As part of the development of embodiments herein, a problem will first be identified and discussed.

During a search and navigation through audio-visual content, one of the open problems is automated detection and separation of parts of content that cover certain topics and identification of those topics.

Practice shows that often more semantic information is contained in the audio part than in the video. For example, in a clip where a person is talking on camera for a long period, most of the semantic data is in the audio track, compared to an almost static picture. However, searching in audio is a challenging and currently unsolved problem. For comparison, manual search in video content with scrolling-like navigation gives quite satisfactory results, as at any given time instant, it is sufficient to display the corresponding video frame. It is not obvious how to mimic this functionality with audio content, as a particular sound or a word at a given time instant contains too little information. Even the keywords, which otherwise may nicely describe a large document, have little meaning in short-time audio segments.

In order to address this problem, several embodiments are comprised herein, which may be understood to relate to semantic scene detection and annotation of audio-visual content. As an overview, embodiments herein may be understood to relate to a system and a method that may allow searching or scrolling through a collection of audio content by topics, keywords and a set of speakers. As such, the system may be understood to be able to perform the following two functions:

1. Automatically detect boundaries and create basic audio units, which may be referred to as "audio scenes", that are best suited for content search; and

2. Annotate the identified audio scenes with searchable metadata, such as topic, keywords and speakers.

Embodiments herein may be applicable to, but not limited to, areas such as films, TV series, sports programs, music, news, books, online computer games and even placement of advertisements.

Embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which examples are shown. In this section, the embodiments herein will be illustrated in more detail by a number of exemplary embodiments. It should be noted that the exemplary embodiments herein are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present in another embodiment and it will be obvious to a person skilled in the art how those components may be used in the other exemplary embodiments.

Figure 1 depicts two non-limiting examples, in panels "a" and "b", respectively, of a computer system 10, in which embodiments herein may be implemented. In some example implementations, such as that depicted in the non-limiting example of Figure 1a, the computer system 10 may be a computer network. In the non-limiting example of Figure 1b, the computer system 10 is implemented in a telecommunications network 100, sometimes also referred to as a cellular radio system, cellular network or wireless communications system. The telecommunications network 100 may for example be a network such as a Long-Term Evolution (LTE), e.g. LTE Frequency Division Duplex (FDD), LTE Time Division Duplex (TDD), LTE Half-Duplex Frequency Division Duplex (HD-FDD), LTE operating in an unlicensed band, WCDMA, Universal Terrestrial Radio Access (UTRA) TDD, GSM network, GERAN network, Ultra-Mobile Broadband (UMB), EDGE network, a network comprising any combination of Radio Access Technologies (RATs) such as e.g. Multi-Standard Radio (MSR) base stations, multi-RAT base stations etc., any 3rd Generation Partnership Project (3GPP) cellular network, Wireless Local Area Network/s (WLAN) or WiFi network/s, Worldwide Interoperability for Microwave Access (WiMax), a 5G system or any cellular network or system. In some examples, the telecommunications network 100 may support Information-Centric Networking (ICN).

The computer system 10 comprises a plurality of nodes, whereof a first node 101, a second node 102, and a third node 103 are depicted in Figure 1.

Each of the first node 101, the second node 102, and the third node 103 may be understood, respectively, as a first computer system, a second computer system, and a third computer system. In some examples, each of the first node 101, the second node 102, and the third node 103 may be implemented, as depicted in the non-limiting example of Figure 1b for the first node 101 and the second node 102, as a standalone server in e.g., a host computer in the cloud. Each of the first node 101, the second node 102, and the third node 103 may in some examples be a distributed node or distributed server, with some of their respective functions being implemented locally, e.g., by a client manager, and some of their functions implemented in the cloud, by e.g., a server manager. Yet in other examples, each of the first node 101, the second node 102, and the third node 103 may also be implemented as processing resources in a server farm. Each of the first node 101, the second node 102, and the third node 103 may be under the ownership or control of a service provider, or may be operated by the service provider or on behalf of the service provider.

As depicted in Figure 1b, the third node 103 may be understood as e.g., a client computer in the telecommunications network 100. The third node 103 may be, for example, as depicted in the example of Figure 1, a wireless device as described below, e.g., a UE.

The third node 103 has an interface 105, such as e.g., a touch-screen, a remote control, etc...

In some embodiments, any or all of the first node 101, the second node 102, and the third node 103 may be co-located or be the same device. All the possible combinations are not depicted in Figure 1 to simplify the Figure.

When comprised in the telecommunications network 100, any of the first node 101, the second node 102, and the third node 103 may be a network node such as a Remote Radio Unit (RRU), a Remote Radio Head (RRH), a multi-standard BS (MSR BS), or a core network node, e.g., a Mobility Management Entity (MME), Self-Organizing Network (SON) node, a coordinating node, positioning node, Minimization of Driving Test (MDT) node, etc...

The telecommunications network 100 covers a geographical area which, in some embodiments, may be divided into cell areas, wherein each cell area may be served by a radio network node 110, although one radio network node may serve one or several cells. In the example of Figure 1b, the radio network node 110 serves a cell 120. The radio network node 110 may be a transmission point such as a radio base station, for example an eNB, an eNodeB, or a Home Node B, a Home eNode B or any other network node capable of serving a wireless device, such as a user equipment or a machine type node in the telecommunications network 100. The radio network node 110 may be of different classes, such as, e.g., macro eNodeB, home eNodeB or pico base station, based on transmission power and thereby also cell size. In some examples wherein the telecommunications network 100 may be a non-cellular system, the radio network node 110 may serve receiving nodes with serving beams. The radio network node 110 may support one or several communication technologies, and its name may depend on the technology and terminology used. In 3GPP LTE, any of the radio network nodes that may be comprised in the telecommunications network 100 may be directly connected to one or more core networks.

A plurality of wireless devices may be located in the wireless communication network 100. Any of the wireless devices comprised in the telecommunications network 100 may be a wireless node such as a UE, which may also be known as e.g., mobile terminal, wireless terminal and/or mobile station, a mobile telephone, cellular telephone, or laptop with wireless capability, just to mention some further examples. Any of the wireless devices comprised in the telecommunications network 100 may be, for example, portable, pocket-storable, hand-held, computer-comprised, or a vehicle-mounted mobile device, enabled to communicate voice and/or data, via the RAN, with another entity, such as a server, a laptop, a Personal Digital Assistant (PDA), or a tablet computer, sometimes referred to as a surf plate with wireless capability, a Machine-to-Machine (M2M) device, a device equipped with a wireless interface, such as a printer or a file storage device, a modem, or any other radio network unit capable of communicating over a wired or radio link in a communications system. Any of the wireless devices comprised in the telecommunications network 100 is enabled to communicate wirelessly in the telecommunications network 100. The communication may be performed e.g., via a RAN and possibly one or more core networks, comprised within the telecommunications network 100.

Any of the first node 101, the second node 102, and the third node 103 may also comprise a receiving device capable of detecting and collecting audio signals, such as a microphone.

The first node 101 is configured to communicate within the computer system 10 with the second node 102 over a first link 131, e.g., a radio link or a wired link. The first node 101 is configured to communicate within the computer system 10 with the third node 103 over a second link 132, e.g., another radio link or another wired link. The second node 102 may be further configured to communicate within the computer system 10 with the third node 103 over a third link 133, e.g., another radio link, a wired link, an infrared link, etc...

Any of the first link 131, the second link 132 and the third link 133 may be a direct link or it may go via one or more computer systems or one or more core networks in the telecommunications network 100, which are not depicted in Figure 1, or it may go via an optional intermediate network. The intermediate network may be one of, or a combination of more than one of, a public, private or hosted network; the intermediate network, if any, may be a backbone network or the Internet; in particular, the intermediate network may comprise two or more sub-networks (not shown). In general, the usage of "first", "second", and/or "third" herein may be understood to be an arbitrary way to denote different elements or entities, and may be understood to not confer a cumulative or chronological character to the nouns they modify.

Embodiments of a method, performed by a first node 101, for handling audio information, will now be described with reference to the flowchart depicted in Figure 2. As stated earlier, the first node 101 operates in the computer system 10.

The method may comprise the following actions. In Figure 2, an optional action is indicated with dashed lines. In some embodiments all the actions may be performed. One or more embodiments may be combined, where applicable. All possible combinations are not described to simplify the description.

Action 201

The method performed by the first node 101 may be understood to be aimed at ultimately enabling an ability of a user to search for and navigate through audio-visual content, such as e.g., movies, sport games, lecture recordings or news programs, video recorded with a video capturing device such as a video camera, or audio only content, such as digital media streamed from the Internet, based on topic of conversation. Herein, audio content is used to refer to either audio only content or audio-visual content. The audio content, either in audio-only form, or in audio-visual form, may have been obtained from an audio content source, such as a media storage, or the receiving device described earlier, either in the first node 101 itself, or in another node, such as e.g., the third node 103. The audio content, e.g., a radio content available online, may comprise or encode audio signals. The audio signals, understood as a representation of sound, may comprise speech, music, noises, etc... The audio content may be obtained by the first node 101 and may comprise audio signals comprising speech. The speech may comprise a conversation by one or more speakers, that is, a monologue or conversation with oneself, a dialogue, or a conversation among more than two speakers.

The audio signals in, e.g., an original audio stream, may have been converted, e.g., by another node, such as a transcription node, into a set of textual utterances and time codes for the corresponding segments of audio signals, also referred to herein as audio segments. For example, when a large audio recording may need to be transcribed, that is, to generate the corresponding text, it may not be the entire audio recording which is sent to, e.g., a speech recognition engine. First, smaller audio segments may be extracted, e.g., by an audio segmenter; these segments may be active speech between larger pauses, for example, one sentence spoken by one person, no longer than 30 seconds. These audio segments may then be sent or provided to be transcribed. A textual utterance may be understood as a transcript generated from an audio segment, as just described. The generated textual utterances, which may also be known as audio transcription utterances, may be obtained by the first node 101. It may be noted that these textual utterances may be based on an audio segment from a particular speaker.
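
As a purely illustrative sketch of such a segmentation step, the following Python fragment splits audio on pauses using a simple frame-energy threshold in place of a full Voice Activity Detector; the function name, the thresholds and the 30-second cap are assumptions for illustration, not taken from the application:

```python
import numpy as np

def segment_on_pauses(samples, rate, frame_ms=30, energy_thresh=1e-4,
                      min_pause_s=0.7, max_segment_s=30.0):
    """Split normalized mono audio into active-speech segments bounded
    by pauses, returning (start, end) time codes in seconds."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frame_s = frame_ms / 1000
    # A frame counts as speech if its mean energy exceeds the threshold.
    speech = [float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
              > energy_thresh for i in range(n_frames)]
    segments, start, silence = [], None, 0.0
    for i, active in enumerate(speech):
        t = i * frame_s
        if active:
            if start is None:
                start = t
            silence = 0.0
            if t - start >= max_segment_s:      # cap segment length (~30 s)
                segments.append((start, t))
                start = None
        elif start is not None:
            silence += frame_s
            if silence >= min_pause_s:          # a long pause closes a segment
                segments.append((start, t - silence + frame_s))
                start = None
    if start is not None:                       # audio ended mid-speech
        segments.append((start, n_frames * frame_s))
    return segments
```

Each returned (start, end) pair would then be transcribed separately, yielding one textual utterance with its time codes.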

In this Action, the first node 101 determines, automatically, a change in topic of conversation by one or more speakers in a segment of audio signals comprising speech. The determining in this Action 201 is based on a first analysis of a linguistic content of the conversation.

The topic of conversation may be understood as a short written description of the subject of the speech of an audio scene. The topic of conversation may be for example, football, movies, climate change, a political election, etc... The topic of conversation may be set to have different levels of granularity: e.g., politics as opposed to a particular topic within politics.

The change in topic of conversation may be understood as a switch to a new topic of conversation from a former topic of conversation. The new topic of conversation may be referred to herein as a first topic of conversation. The part of a set of one or more segments of audio signals comprising speech relating to a same topic of conversation may be referred to as a "scene" or an "audio scene", which herein may be understood as a "semantic scene". In general "audio scenes" may span a few minutes of audio, while an initial audio segmenter may normally produce blocks of 10-30 seconds of audio.

The segment of audio signals comprising speech may be referred to as well as an active speech segment. That is, a part or section of the audio signals during a time period, wherein speech is detected. Voice Activity Detectors (VAD) may be used to identify portions of the recorded signal where the speech may be present.

That the change in topic of conversation is determined in a segment of audio signals comprising speech may be understood as that the first analysis is based on one or more segments of audio signals comprising speech, excluding the remaining audio signals that may be comprised in the obtained audio content. Hence, wherever the change in topic of conversation takes place, it will be in a segment of audio signals comprising speech. In other words, the first analysis is not conducted based on an entire waveform corresponding to the obtained audio content.

The linguistic content may be understood to refer to the words, sentences, expressions, etc... comprised in the conversation. The first analysis of the linguistic content may be performed on a text transcript of the conversation, the text transcript being based on the audio signals. The text transcript of the conversation may be performed, by known methods, by the first node 101 itself based on the audio signals, the second node 102, the third node 103, or by another node, e.g., a "transcription node" in the computer system 10. That the text transcript is based on the audio signals may be understood as that the text transcript is of the audio signals.

The determining in this Action 201 may be understood as detecting, based on the first analysis. The first analysis may be understood as a linguistic analysis. The first analysis may be performed either by the first node 101 itself, or by another node, such as e.g., the second node 102, or the third node 103.

The first analysis may comprise gradually building a semantic model of the scene. A model may be understood herein as a formal representation of the meaning of a set of sentences. Formal may be understood herein as a representation that may be processed and analysed by a computer. An example of such a formalism may be Discourse Representation Structures (DRS), within Discourse Representation Theory (DRT). With the first analysis, the first node 101 may take written language sentences as input and build DRS from them, as, for example, described in http://www.let.rug.nl/bos/comsem/book2.html. The detection of a change of conversational topic may be performed by analysing whether a new incoming sentence fits with the DRS constructed so far. If it does not, the first node 101 may generate a textual description of the current DRS, and may output it as a conversation topic. It may then empty the DRS and start over. The DRS in this case may be understood as a formal model that various automated analysis techniques may be applied to.

As a particular non-limiting example, from the text transcript, by performing the first analysis, the first node 101 may generate a semantic graph of concepts being spoken about, for example, by relying on an Automatic Speech Recognition (ASR) engine. In the graph, the concepts may be represented as graphic nodes, and relations between them may be represented as links between the graphic nodes. By combining the generated concept graphs, which may also be referred to as semantic graphs, based on the matched concepts and relations from subsequent textual utterances, the first node 101 may find the boundary of the scene, e.g., by detecting that concepts from the current textual utterance have never been mentioned before, that is, that corresponding graphic nodes do not exist in the graph generated from the previous textual utterances.

According to the foregoing, the first analysis may comprise: a) building a semantic model, e.g., a DRS, as a formal representation of a meaning of a first set of one or more sentences, and b) detecting whether or not a subsequent second set of one or more sentences, e.g., one new incoming sentence, fits with the built semantic model. "Fits" may be understood here as meaning that, e.g., the DRS built from the subsequent second set of one or more sentences, e.g., an incoming sentence, has at least a number, e.g., N, of overlapping common elements with the, e.g., DRS built from the first set of one or more sentences, where N may be a configurable parameter. Alternatively, the knowledge base may contain special rules, such as if-then rules, which may express the conditions under which "fits" holds for two given DRSs.

That is, the first analysis may comprise a) building a first semantic model as a first formal representation of a first meaning of a first set of one or more sentences, b) building a second semantic model as a second formal representation of a second meaning of a subsequent second set of one or more sentences, and c) detecting whether or not the second semantic model has a number of common elements with the first semantic model.
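
To make steps a)-c) concrete, the following is a minimal, purely illustrative sketch in which each semantic model is reduced to a set of concept labels; the function names and the default overlap threshold are assumptions for illustration, with the threshold playing the role of the configurable parameter N mentioned above:

```python
def fits(scene_concepts, new_concepts, n_common=2):
    """Step c): the new model 'fits' if it shares at least n_common
    elements with the model built so far (n_common stands in for N)."""
    return len(scene_concepts & new_concepts) >= n_common

def detect_topic_changes(utterance_concepts, n_common=2):
    """Yield the indices of utterances at which a topic change is
    detected. Each element of utterance_concepts is a set of concept
    labels, a stand-in for a full DRS or concept graph."""
    scene = set()
    for i, concepts in enumerate(utterance_concepts):
        if scene and not fits(scene, concepts, n_common):
            yield i               # scene boundary: a new topic starts here
            scene = set(concepts)
        else:
            scene |= concepts     # the scene model grows gradually

# The third utterance shares no concepts with the scene built so far,
# so a boundary is reported at index 2.
boundaries = list(detect_topic_changes([
    {"match", "goal", "team"},
    {"match", "referee", "team"},
    {"election", "ballot", "candidate"},
]))
assert boundaries == [2]
```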

To describe Action 201 in other words, the first node 101 in this Action may detect one or more boundaries of an "audio scene", understood herein as a "semantic scene" in audio content, which may be understood to exclude from the first analysis noise, music or other non-semantic audio content. The first node may therefore also be referred to herein as a "scene detection node".

Action 202

Once a scene boundary may have been detected by the first node 101 having determined the change in topic of conversation, in this Action 202, the first node 101 initiates providing, to the second node 102 operating in the computer system 10, a first indication of the determined change in topic of conversation. The first indication may comprise information about the boundary of the scene, e.g., the time codes of the beginning and the end, and/or the semantic model of the scene, such as the first semantic model, e.g., the concept graph, generated so far. The first node 101 may also initiate providing the first indication to a memory device, e.g., a metadata database node, for storage.
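
As an illustration only, the first indication might be represented by a record like the following; the field names and types are assumptions for this sketch and are not taken from the application:

```python
from dataclasses import dataclass, field

@dataclass
class FirstIndication:
    """Illustrative payload for the first indication: the scene boundary
    time codes plus the semantic model generated so far."""
    scene_start: float   # seconds into the audio content
    scene_end: float
    concepts: set = field(default_factory=set)     # graphic nodes
    relations: list = field(default_factory=list)  # links between nodes

indication = FirstIndication(
    scene_start=12.4, scene_end=148.0,
    concepts={"election", "ballot"},
    relations=[("election", "decided_by", "ballot")])
```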

The providing in this Action 202 may be performed via the first link 131. In some embodiments, the second node 102 may be the third node 103. That is, the first indication may be directly provided on the device where the user may eventually search or navigate the audio content. Therefore, in some examples, the first indication may also be provided to the third node 103, e.g., via the second link 132, and presented to the user on the interface 105, e.g., a screen on a telephone or a computer.

Initiating providing may be understood as beginning or triggering outputting, or sending.

After performing Action 202, the first node 101 may then be reset to the initial state, ready to receive new textual utterances. Otherwise, or in addition, the first node 101 may perform Action 203.

Action 203

In this Action, the first node 101 may obtain, based on the first analysis of the linguistic content, or on a second analysis of audio signals of the conversation, at least one of: a) a name for a first topic of the conversation, b) a set of keywords describing the conversation, and c) an identification of at least one of the one or more speakers.

Obtaining in this Action 203 may be understood as the first node 101 itself deriving or determining, or as receiving from the second node 102 via the first link 131, or the third node 103 via the second link 132, or from another node, such as e.g., a "speaker identifier node".

The second analysis may be understood as a second type of processing of the audio signals, which is different than the first analysis. The second analysis may comprise an analysis of the audio signals by a speaker change detector to detect the various one or more speakers.

The name for the first topic, or newly detected topic of conversation, may be understood as a word that best describes the first topic. In some embodiments, the first node 101 may obtain an indication of the name for the first topic of conversation from the second node 102, which will be described later in Action 302 as a "third indication".

The keywords in the set of keywords may have been used or uttered during the conversation, or otherwise they may best describe the first topic of conversation even if they may not have been used during the conversation. In parallel with the operations described above, the first node 101 may continuously generate a set of keywords from the text transcript, based entirely on a statistical deviation of words used.
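
One way to read "statistical deviation of words used", purely as an illustrative sketch, is to rank words whose in-scene frequency deviates most from their frequency in a large background corpus; the function below and its scoring formula are assumptions for illustration, not the application's method:

```python
import math
from collections import Counter

def scene_keywords(scene_words, background_counts, background_total, k=5):
    """Rank the k words whose in-scene frequency deviates most from a
    background corpus. scene_words is a token list from the scene
    transcript; background_counts is a Counter over a large corpus."""
    counts = Counter(scene_words)
    n = len(scene_words)
    scores = {}
    for word, c in counts.items():
        p_scene = c / n
        # Add-one smoothing keeps unseen words from dividing by zero.
        p_background = (background_counts.get(word, 0) + 1) / (background_total + 1)
        scores[word] = p_scene * math.log(p_scene / p_background)
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```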

The identification of the one or more speakers may be performed by relying on a speaker change detector, and may be understood to comprise any of providing a name in e.g., a subtitle, a picture, or any alternative type of identification for each of the identified speakers, e.g., associate different labels/color codes to the corresponding parts of the audio transcription, in case the true speaker identity may not be known. In some examples, each of the speakers may be identified. The identification of the one or more speakers may have been performed in the second analysis by analysing the text transcript of the conversation, the text transcript being based on the audio signals.

Any of the obtained name for a first topic of the conversation, the obtained set of keywords describing the conversation, and the obtained identification of the at least one of the one or more speakers may then be stored in a memory in the first node 101 itself or in another node, e.g., a "metadata database node".

In some embodiments wherein Action 203 may have been performed, the initiating providing in Action 202 may further comprise initiating providing, to at least one of the second node 102 and a third node 103 operating in the computer system 10, a second indication of at least one of: a) the obtained name for a first topic of the conversation, b) the obtained set of keywords describing the conversation, and c) the obtained identification of at least one of the one or more speakers.

The second indication may be for example, a message comprising a code for the first topic of conversation, a list of codes or identifiers for each of the keywords, or a list of names, or pictures for the identified one or more speakers.

Figure 3 is a schematic diagram illustrating some of the concepts used herein. Audio-visual content may be, for example, a recording of a news television program that a user may want to navigate to search for a part of the program where a new law on taxation is discussed. The audio-visual content comprises audio signals, here illustrated as a waveform representing sound. The audio signals may be divided into audio segments, some of which may comprise speech, as indicated in the Figure, or other audio signals, such as music or sounds, or, background noise. Audio information is used to refer generally herein to any of audio content, audio signals, and/or audio segments.

Embodiments of a method, performed by the second node 102, for handling audio information, will now be described with reference to the flowchart depicted in Figure 4. As stated earlier, the second node 102 operates in the computer system 10.

The detailed description of some of the following corresponds to the same references provided above, in relation to the actions described for the first node 101, and will thus not be repeated here. For example, the audio content may be obtained as audio signals. The audio signals may comprise speech.

The method comprises the following actions.

Action 401

In some embodiments, the name for the first topic of the conversation may be determined or derived by the second node 102, which may be referred to herein as a "topic annotation node". The second node 102 may be understood to add annotations or metadata to the audio content to describe various aspects of the audio signals comprised therein, so that a user may ultimately be enabled to search for and navigate through the audio-visual content, e.g., a movie, based on topic of conversation.

For example, the annotations or metadata added by the second node 102 may associate topic names to the scene semantic models that may be received from the first node 101. In order to be able to determine the name for a first topic of the conversation and ultimately annotate the audio content, in this Action 401, the second node 102 first obtains, from the first node 101 operating in the computer system 10, the first indication of the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech. The first indication is based on the first analysis of the linguistic content of the conversation.

The first indication, as described above, may be a semantic model of the scene, such as the first semantic model, that is, the DRS, or the graphic nodes with the links described earlier.

Obtaining may be understood as receiving, e.g., via the first link 131.

That the first indication is based on the first analysis of the linguistic content of the conversation may be understood to mean that since the change in topic of conversation may have been determined based on the first analysis of the linguistic content of the conversation, the content of the first indication will so be based.

Action 402

In this Action, the second node 102 determines the name for the first topic of the conversation, based on the obtained first indication.

The determining in this Action 402 may be implemented by performing automated logical reasoning about the semantic model, based on knowledge stored in a knowledge database. The latter may be provisioned by e.g., experts in different content types, such as news, sports, movies etc... In other words, first a formal model, e.g., a DRS, may be built from text transcripts, as described above. In addition, a knowledge base may be available, wherein the base may comprise rules in a form [if xxx then yyy], which capture the knowledge on how to determine a topic of conversation based on the contents of the DRS. For example, a rule may state [if there is X and X is a city and there is Y and Y is a type of precipitation and there is Z and Z is an indication of the temperature value, then the topic of conversation is "weather in city X"]. The automated logical reasoning may be understood to try to match various rules from the knowledge base to the DRS. If some of them match, it may deduce the topic of conversation. It is important to note that rules may trigger, e.g., match their "if" part, on results of the application of other rules, produced by, e.g., the "then" part. Logical reasoning may be understood as involving searching through various combinations and sequences of rule invocations.
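
As a minimal, purely illustrative sketch of such rule matching, the following naive forward-chaining loop fires if-then rules over facts extracted from a DRS until a topic is deduced; the propositional fact encoding, the rule format and all names here are assumptions that flatten the application's variable-based rule into one concrete instance:

```python
def infer_topic(facts, rules):
    """Naive forward chaining: fire every rule whose conditions all hold,
    add its conclusion as a new fact, and repeat until nothing changes.
    Rules may thereby trigger on conclusions produced by other rules."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    # Return the first deduced topic, if any.
    return next((value for kind, value in facts if kind == "topic"), None)

# A concrete instance of the rule from the text: city + precipitation +
# temperature imply the topic "weather in city X".
rules = [
    ({("city", "Stockholm"), ("precipitation", "rain"), ("temperature", "5C")},
     ("topic", "weather in Stockholm")),
]
facts = {("city", "Stockholm"), ("precipitation", "rain"), ("temperature", "5C")}
assert infer_topic(facts, rules) == "weather in Stockholm"
```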

According to the foregoing, in some embodiments, the first indication may comprise the first semantic model as the first formal representation of the first meaning of the first set of one or more sentences, and the determining of Action 402 of the name for the first topic may be based on performing automated logical reasoning about the first semantic model, based on knowledge stored in a database.

Action 403

In this Action 403, the second node 102 initiates providing, to at least one of the first node 101, via e.g., the first link 131, and the third node 103, via e.g., the third link 133, operating in the computer system 10, the third indication of the determined name for the first topic of conversation. The third indication may be e.g., a textual representation of the scene topic. For example, a semantic annotation, or metadata, added to the audio content, e.g., to a frame of a movie. The semantic annotation or metadata may be understood as a description of which conversation topics, and, possibly, involving which speakers, appear at which time points in the annotated audio content. The annotation/s may be performed by segment of audio signals. In other examples, the third indication may be a message comprising a code or index for the determined name for the first topic of conversation.

The second node 102 may also initiate providing the third indication to a metadata database for storage.

Embodiments of a method, performed by a third node 103, for handling audio information will now be described with reference to the flowchart depicted in Figure 5. As stated earlier, the third node 103 operates in the computer system 10.

The detailed description of some of the following corresponds to the same references provided above, in relation to the actions described for the first node 101, and will thus not be repeated here. For example, the audio content may be obtained as audio signals comprising speech.

The method may comprise the following actions. In Figure 5, the optional action is indicated with dashed lines. In some embodiments all the actions may be performed. One or more embodiments may be combined, where applicable. All possible combinations are not described to simplify the description.

Action 501

The third node 103 may be understood as a node which may allow a user to search or navigate through the audio content. The third node 103 may be referred to herein as a "human interface node". In order to ultimately facilitate the searching of the audio content to the user, in this Action 501, the third node 103 may first obtain, based on the first analysis of the linguistic content of the conversation by the one or more speakers, the at least one of: a) the name for the first topic of the conversation, b) the set of keywords describing the conversation, and c) the identification of at least one of the one or more speakers.

The obtaining in this Action 501 may comprise a) determining autonomously, similarly to how it has been described for the first node 101 or the second node 102 above, b) retrieving from a memory, e.g., the metadata database node, or c) receiving from the first node 101, e.g., via the second link 132, or from the second node 102, e.g., via the third link 133. In some examples, the name for the first topic of the conversation may be obtained from the second node 102 as the third indication. Option c) may be the most typical in examples of embodiments herein. The first analysis has been described in relation to Action 201.

Action 502

In this Action 502, the third node 103 facilitates searching, in the interface 105 of the third node 103, by a user of the third node 103, the audio signals of the conversation by the one or more speakers. The searching is based on at least one of: a) the name for a first topic of the conversation, e.g., as obtained in Action 501, b) the set of keywords describing the conversation, e.g., as obtained in Action 501, and c) the identification of at least one of the one or more speakers, e.g., as obtained in Action 501.

In some embodiments, the searching may be facilitated in a segment of audio signals comprising speech. The searching of the audio signals may be facilitated in the audio content. That is, the user may be enabled to perform the searching in e.g., a recording of a TV program or an audio stream.

The third node 103 may facilitate searching in the interface 105 by any of the following two functionalities.

The first functionality may be understood as searching of the scene in a media database by a combination of all or some of the parameters described above: a) name of the topic, b) known speaker identities, and c) set of keywords. The third node 103 may enable the user to perform a syntactic search in the metadata database based on user input, and may output pointers to various audio content fragments that match a given topic. A syntactic search may be understood as searching for an appearance of a given text string in a topic of conversation taken from the metadata of an audio content.
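A minimal sketch of how such a syntactic search over the metadata database could be realized is given below, assuming metadata entries shaped as in the earlier illustrative annotation example; the function and field names are hypothetical.

    def search_scenes(metadata_db, query, speakers=None):
        # Return pointers (content id + time codes) to audio fragments whose
        # topic contains the query string and, if given, involve all listed
        # speakers.
        hits = []
        for entry in metadata_db:
            if query.lower() in entry["topic"].lower():
                if speakers and not set(speakers) <= set(entry["speakers"]):
                    continue
                hits.append((entry["content_id"], entry["segment"]))
        return hits

    # e.g., search_scenes(db, "weather", speakers=["jane_doe"])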

The second functionality may be understood as a scrolling-like navigation in a given audio stream, where at each scrolling position, a topic name at that position, speaker identities/labels, and keywords associated with that scene may be displayed.
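Under the same assumed metadata shape, the scrolling-like navigation could amount to a lookup of the scene covering a given playback position, as in the following hypothetical sketch.

    def scene_at(metadata_db, content_id, position_s):
        # Find the scene covering the scrolling position and return what the
        # interface may display there: topic name, speaker labels, keywords.
        for entry in metadata_db:
            seg = entry["segment"]
            if (entry["content_id"] == content_id
                    and seg["start_s"] <= position_s < seg["end_s"]):
                return entry["topic"], entry["speakers"], entry["keywords"]
        return None  # no annotated scene at this position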

To exemplify some of the foregoing in other words, particular examples of embodiments of the method and arrangement of the computer system 10 herein may relate to a system and one or more methods that may identify audio scenes in an audio content, semantically annotate them, and allow a user to search or scroll through a collection of audio content by the topics, keywords, and set of speakers associated with an audio scene. Embodiments herein may therefore relate to audio-visual content, semantic annotation, semantic graphs, a semantic network, Automatic Speech Recognition, audio analysis, linguistic analysis, audio to text, audio transcription or text transcript, and/or logical reasoning.

One advantage of embodiments herein is that the methods described improve the process of navigating through audio-visual content by automatically extracting audio semantic information, and using this data to identify audio segments, as well as annotate those segments with a set of semantic metadata, used for later search through the audio-visual content.

Figure 6 is a schematic block diagram of a particular non-limiting example of the computer system 10, according to embodiments herein. In Figure 6, the computer system 10 comprises the following components or nodes: the audio content source 601, the transcription node 602, the first node 101 or scene detection node, the knowledge database node 603, the speaker identifier node 604, the second node 102 or topic annotation node, the metadata database node 605, the third node 103 or human interface node, and the media storage 606. The nodes described herein may be understood to be logical and may be placed in a distributed fashion across a network, such as the computer system 10, which may be e.g., the telecommunications network 100.

In the example of Figure 6, the input to the computer system 10 comes from the audio content source 601, e.g., digital media streamed from the Internet, or a video capturing device such as a video camera. The same media may be later presented to the user via the third node 103 or human interface node. The storage in, e.g., the media storage 606 and the fetching of media content that takes place in between is not described herein. The transcription node 602 may comprise the initial audio segmenter described above, which may also include a speaker change detector, and an Automatic Speech Recognition (ASR) engine. This transcription node 602 may convert an audio stream into a set of textual utterances and time codes for the corresponding audio segments. The generated audio transcription utterances may be passed to the first node 101 or scene detection node and the speaker identifier node 604. It may be noted that these audio transcription utterances may be based on an audio segment from a particular speaker, and the time codes for those utterances should not be mistaken for the time codes of the "audio scenes" described next. In general, "audio scenes" may span a few minutes of audio, while the initial audio segmenter may normally produce blocks of 10-30 seconds of audio.
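For illustration only, the output of the transcription node may be pictured as a list of utterance records such as the following; the record structure is an assumption made for this example, not a required format.

    from dataclasses import dataclass

    @dataclass
    class Utterance:
        text: str           # ASR transcript of one initial audio segment
        start_s: float      # time code where the segment begins
        end_s: float        # time code where the segment ends
        speaker_label: str  # label from the speaker change detector

    # Typically blocks of 10-30 seconds of audio, as noted above.
    stream = [
        Utterance("Heavy snow is expected in Stockholm tonight.", 0.0, 12.4, "spk_A"),
        Utterance("Temperatures may drop to minus five degrees.", 12.4, 21.0, "spk_B"),
    ]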

The first node 101 or scene detection node may gradually build a semantic model of the scene as described above. Once a scene boundary has been detected in Action 201, information about it, e.g., the time codes of its beginning and end, may be sent to the metadata database node 605 for storage. The semantic model of the scene, e.g., the concept graph, generated so far may be sent to the second node 102, or topic annotation node, in the first indication, as described in Action 202.
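A hedged sketch of this gradual model building and boundary detection follows, with the semantic models reduced to sets of concepts; the concept-extraction step is assumed rather than shown, and the threshold is illustrative.

    def detect_scene_boundaries(blocks, extract_concepts, min_common=1):
        # blocks: transcript blocks in order; extract_concepts: assumed
        # helper mapping a block to a set of concepts (e.g., from a DRS).
        model = set()                      # semantic model of the current scene
        for i, block in enumerate(blocks):
            concepts = extract_concepts(block)
            if model and len(model & concepts) < min_common:
                yield i                    # boundary: a new scene starts here
                model = set()              # start a fresh semantic model
            model |= concepts              # grow the current scene's model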

In parallel with the operations described above, the first node 101 may continuously generate a set of keywords from the transcript, e.g., with a keywords generator module. The task of the second node 102 or topic annotation node may be understood to be to associate topic names with the received scene semantic models. The second node 102 may rely on knowledge stored in the knowledge database node 603. The latter may be provisioned by, e.g., experts 607 in different content types. The output of the topic annotation node may be a textual representation of the scene topic, which may be sent to the metadata database node 605 for storage, as described in Action 403.

The speaker identifier node 604 may perform analysis of audio data to detect various speakers of the one or more speakers, and either provide their identity, in case an enrolled speaker is detected, or associate different labels/color codes to the corresponding parts of the audio transcription, in case the true speaker identity is not known. These associations may be sent to the metadata database node 605 for storage along with the corresponding audio transcriptions and detected keywords.
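How the speaker identifier node might assign identities or labels can be sketched as follows; the voiceprint matching function is assumed, and all names are hypothetical.

    import itertools

    _unknown = itertools.count(1)  # running counter for anonymous labels

    def label_speaker(voiceprint, enrolled, match):
        # Return the identity of an enrolled speaker if the assumed match
        # function accepts the voiceprint; otherwise assign a label that
        # stays consistent within the content.
        for identity, reference in enrolled.items():
            if match(voiceprint, reference):
                return identity
        return "speaker_#%d" % next(_unknown)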

The third node 103 or human interface node may then obtain, according to Action 501, the name for the first topic of the conversation, the set of keywords describing the conversation, and/or the identification of at least one of the one or more speakers, and may then offer the two functionalities described above in relation to Action 502.


To perform the method actions described above in relation to Figure 2, the first node 101, configured to handle audio information, may comprise the following arrangement depicted in Figure 7. As stated earlier, the first node 101 is further configured to operate in the computer system 10.

The detailed description of some of the following corresponds to the same references provided above, in relation to the actions described for the first node 101, and will thus not be repeated here. For example, the audio content may be obtained as audio signals. The audio signals may comprise speech.

The first node 101 is further configured to, e.g., by means of a determining module 701 configured to, determine, automatically, the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech, wherein to determine is configured to be based on the first analysis of the linguistic content of the conversation.

In some embodiments, the first analysis of the linguistic content may be configured to be performed on the text transcript of the conversation, the text transcript being configured to be based on the audio signals.

In some embodiments, the first analysis may be configured to comprise: a) to build the first semantic model as the first formal representation of the first meaning of the first set of one or more sentences, b) to build the second semantic model as the second formal representation of the second meaning of the subsequent second set of one or more sentences, and c) to detect whether or not the second semantic model has the number of common elements with the first semantic model.

The first node 101 is further configured to, e.g., by means of an initiating module 702 configured to, initiate providing, to the second node 102 configured to operate in the computer system 10, the first indication of the change in topic of conversation configured to be determined.

In some embodiments, to initiate providing may be further configured to comprise to initiate providing, to at least one of the second node 102 and a third node 103 configured to operate in the computer system 10, the second indication of at least one of: a) the name, configured to be obtained, for the first topic of the conversation, b) the set of keywords, configured to be obtained, describing the conversation, and c) the identification, configured to be obtained, of at least one of the one or more speakers.

The first node 101 may be further configured to, e.g., by means of an obtaining module 703 configured to, obtain, based on the first analysis of the linguistic content, or on a second analysis of audio signals of the conversation, at least one of: a) the name for the first topic of the conversation, b) the set of keywords describing the conversation, and c) the identification of at least one of the one or more speakers.

The embodiments herein may be implemented through one or more processors, such as a processor 704 in the first node 101 depicted in Figure 7, together with computer program code for performing the functions and actions of the embodiments herein. The program code mentioned above may also be provided as a computer program product, for instance in the form of a data carrier carrying computer program code for performing the embodiments herein when being loaded into the first node 101. One such carrier may be in the form of a CD ROM disc. It is however feasible with other data carriers such as a memory stick. The computer program code may furthermore be provided as pure program code on a server and downloaded to the first node 101.

The first node 101 may further comprise a memory 705 comprising one or more memory units. The memory 705 is arranged to be used to store obtained information, store data, configurations, and applications etc. to perform the methods herein when being executed in the first node 101.

In some embodiments, the first node 101 may receive information from the second node 102, the third node 103, another node and/or any of the pertinent databases described above, through a receiving port 706. In some examples, the receiving port 706 may be, for example, connected to one or more antennas in the first node 101. In other embodiments, the first node 101 may receive information from another structure in the computer system 10 through the receiving port 706. Since the receiving port 706 may be in communication with the processor 704, the receiving port 706 may then send the received information to the processor 704. The receiving port 706 may also be configured to receive other information from other communication devices or structures in the computer system 10.

The processor 704 in the first node 101 may be further configured to transmit or send information to e.g., the second node 102, the third node 103, another node and/or any of the pertinent databases described above, through a sending port 707, which may be in communication with the processor 704, and the memory 705.

Those skilled in the art will also appreciate that the determining module 701 , the initiating module 702, and the obtaining module 703 described above may refer to a combination of analog and digital modules, and/or one or more processors configured with software and/or firmware, e.g., stored in memory, that, when executed by the one or more processors, such as the processor 704, perform as described above. One or more of these processors, as well as the other digital hardware, may be included in a single Application-Specific Integrated Circuit (ASIC), or several processors and various digital hardware may be distributed among several separate components, whether individually packaged or assembled into a System-on-a-Chip (SoC).

Also, in some embodiments, the different modules 701-703 described above may be implemented as one or more applications running on one or more processors such as the processor 704.

Thus, the methods according to the embodiments described herein for the first node 101 may be respectively implemented by means of a computer program 708 product, comprising instructions, i.e., software code portions, which, when executed on at least one processor 704, cause the at least one processor 704 to carry out the actions described herein, as performed by the first node 101. The computer program 708 product may be stored on a computer-readable storage medium 709. The computer-readable storage medium 709, having stored thereon the computer program 708, may comprise instructions which, when executed on at least one processor 704, cause the at least one processor 704 to carry out the actions described herein, as performed by the first node 101. In some embodiments, the computer-readable storage medium 709 may be a non-transitory computer-readable storage medium, such as a CD ROM disc, or a memory stick. In other embodiments, the computer program 708 product may be stored on a carrier containing the computer program 708 just described, wherein the carrier is one of an electronic signal, optical signal, radio signal, or the computer-readable storage medium 709, as described above.

To perform the method actions described above in relation to Figure 4, the second node 102, configured to handle audio information, may comprise the following arrangement depicted in Figure 8. As stated earlier, the second node 102 is further configured to operate in the computer system 10.

The detailed description of some of the following corresponds to the same references provided above, in relation to the actions described for the second node 102, and will thus not be repeated here. For example, the audio content may be obtained as audio signals. The audio signals may comprise speech.

The second node 102 is configured to, e.g., by means of an obtaining module 801 configured to, obtain, from the first node 101 configured to operate in the computer system 10, the first indication of the change in topic of conversation by the one or more speakers in the segment of audio signals comprising speech. The first indication is configured to be based on the first analysis of the linguistic content of the conversation.

The second node 102 is further configured to, e.g., by means of a determining module 802 configured to, determine the name for the first topic of the conversation, based on the obtained first indication.

In some embodiments, the first indication may be configured to comprise the first semantic model as the first formal representation of the first meaning of the first set of one or more sentences, and to determine the name for the first topic may be configured to be based on performing automated logical reasoning about the first semantic model, based on knowledge stored in the database.

The second node 102 is further configured to, e.g., by means of an initiating module 803 configured to, initiate providing, to at least one of the first node 101 and the third node 103 configured to operate in the computer system 10, the third indication of the name, configured to be determined, for the first topic of conversation.

The embodiments herein may be implemented through one or more processors, such as a processor 804 in the second node 102 depicted in Figure 8, together with computer program code for performing the functions and actions of the embodiments herein. The program code mentioned above may also be provided as a computer program product, for instance in the form of a data carrier carrying computer program code for performing the embodiments herein when being loaded into the second node 102. One such carrier may be in the form of a CD ROM disc. It is however feasible with other data carriers such as a memory stick. The computer program code may furthermore be provided as pure program code on a server and downloaded to the second node 102.

The second node 102 may further comprise a memory 805 comprising one or more memory units. The memory 805 is arranged to be used to store obtained information, store data, configurations, schedulings, and applications etc. to perform the methods herein when being executed in the second node 102.

In some embodiments, the second node 102 may receive information from the first node 101, the third node 103, another node, and/or any of the pertinent databases described above, through a receiving port 806. In some examples, the receiving port 806 may be, for example, connected to one or more antennas in the second node 102. In other embodiments, the second node 102 may receive information from another structure in the computer system 10 through the receiving port 806. Since the receiving port 806 may be in communication with the processor 804, the receiving port 806 may then send the received information to the processor 804. The receiving port 806 may also be configured to receive other information.

The processor 804 in the second node 102 may be further configured to transmit or send information to e.g., the first node 101 , the third node 103, another node, and/or any of the pertinent databases described above, through a sending port 807, which may be in communication with the processor 804, and the memory 805.

Those skilled in the art will also appreciate that the obtaining module 801, the determining module 802, and the initiating module 803 described above may refer to a combination of analog and digital modules, and/or one or more processors configured with software and/or firmware, e.g., stored in memory, that, when executed by the one or more processors such as the processor 804, perform as described above. One or more of these processors, as well as the other digital hardware, may be included in a single Application-Specific Integrated Circuit (ASIC), or several processors and various digital hardware may be distributed among several separate components, whether individually packaged or assembled into a System-on-a-Chip (SoC). Also, in some embodiments, the different modules 801-803 described above may be implemented as one or more applications running on one or more processors such as the processor 804.

Thus, the methods according to the embodiments described herein for the second node 102 may be respectively implemented by means of a computer program 808 product, comprising instructions, i.e., software code portions, which, when executed on at least one processor 804, cause the at least one processor 804 to carry out the actions described herein, as performed by the second node 102. The computer program 808 product may be stored on a computer-readable storage medium 809. The computer-readable storage medium 809, having stored thereon the computer program 808, may comprise instructions which, when executed on at least one processor 804, cause the at least one processor 804 to carry out the actions described herein, as performed by the second node 102. In some embodiments, the computer-readable storage medium 809 may be a non-transitory computer-readable storage medium 809, such as a CD ROM disc, or a memory stick. In other embodiments, the computer program 808 product may be stored on a carrier containing the computer program 808 just described, wherein the carrier is one of an electronic signal, optical signal, radio signal, or the computer-readable storage medium 809, as described above.

To perform the method actions described above in relation to Figure 5, the third node 103, configured to handle audio information, may comprise the following arrangement depicted in Figure 9. As stated earlier, the third node 103 is further configured to operate in the computer system 10.

The detailed description of some of the following corresponds to the same references provided above, in relation to the actions described for the third node 103, and will thus not be repeated here. For example, the audio content may be obtained as audio signals. The audio signals may comprise speech.

The third node 103 is configured to, e.g., by means of a facilitating module 901 configured to, facilitate searching, in the interface 105 of the third node 103, by the user of the third node 103, the audio signals of the conversation by the one or more speakers. The searching is configured to be based on at least one of: a) the name for a first topic of the conversation, b) the set of keywords describing the conversation, and c) the identification of at least one of the one or more speakers.

In some embodiments, to search may be configured to be facilitated in a segment of audio signals comprising speech.

The third node 103 is further configured to, e.g., by means of an obtaining module 902 configured to, obtain, based on the first analysis of the linguistic content of the conversation by the one or more speakers, at least one of: a) the name for the first topic of the conversation, b) the set of keywords describing the conversation, and c) the identification of at least one of the one or more speakers.

The embodiments herein may be implemented through one or more processors, such as a processor 903 in the third node 103 depicted in Figure 9, together with computer program code for performing the functions and actions of the embodiments herein. The program code mentioned above may also be provided as a computer program product, for instance in the form of a data carrier carrying computer program code for performing the embodiments herein when being loaded into the third node 103. One such carrier may be in the form of a CD ROM disc. It is however feasible with other data carriers such as a memory stick. The computer program code may furthermore be provided as pure program code on a server and downloaded to the third node 103.

The third node 103 may further comprise a memory 904 comprising one or more memory units. The memory 904 is arranged to be used to store obtained information, store data, configurations, schedulings, and applications etc. to perform the methods herein when being executed in the third node 103.

In some embodiments, the third node 103 may receive information from the first node 101, the second node 102, another node, and/or any of the pertinent databases described above, through a receiving port 905. In some examples, the receiving port 905 may be, for example, connected to one or more antennas in the third node 103. In other embodiments, the third node 103 may receive information from another structure in the computer system 10 through the receiving port 905. Since the receiving port 905 may be in communication with the processor 903, the receiving port 905 may then send the received information to the processor 903. The receiving port 905 may also be configured to receive other information.

The processor 903 in the third node 103 may be further configured to transmit or send information to e.g., the first node 101, the second node 102, another node, and/or any of the pertinent databases described above, through a sending port 906, which may be in communication with the processor 903, and the memory 904.

Those skilled in the art will also appreciate that the facilitating module 901 , and the obtaining module 902 described above may refer to a combination of analog and digital modules, and/or one or more processors configured with software and/or firmware, e.g., stored in memory, that, when executed by the one or more processors such as the processor 903, perform as described above. One or more of these processors, as well as the other digital hardware, may be included in a single Application-Specific Integrated Circuit (ASIC), or several processors and various digital hardware may be distributed among several separate components, whether individually packaged or assembled into a System-on-a-Chip (SoC).

Also, in some embodiments, the different modules 901-902 described above may be implemented as one or more applications running on one or more processors such as the processor 903.

Thus, the methods according to the embodiments described herein for the third node 103 may be respectively implemented by means of a computer program 907 product, comprising instructions, i.e., software code portions, which, when executed on at least one processor 903, cause the at least one processor 903 to carry out the actions described herein, as performed by the third node 103. The computer program 907 product may be stored on a computer-readable storage medium 908. The computer-readable storage medium 908, having stored thereon the computer program 907, may comprise instructions which, when executed on at least one processor 903, cause the at least one processor 903 to carry out the actions described herein, as performed by the third node 103. In some embodiments, the computer-readable storage medium 908 may be a non-transitory computer-readable storage medium 908, such as a CD ROM disc, or a memory stick. In other embodiments, the computer program 907 product may be stored on a carrier containing the computer program 907 just described, wherein the carrier is one of an electronic signal, optical signal, radio signal, or the computer-readable storage medium 908, as described above.

According to the foregoing, some examples of embodiments herein may also comprise a carrier comprising any of the second indication and the third indication, as respectively described above, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

When using the word "comprise" or "comprising" it shall be interpreted as non-limiting, i.e. meaning "consist at least of".

The embodiments herein are not limited to the above described preferred embodiments. Various alternatives, modifications and equivalents may be used.

Therefore, the above embodiments should not be taken as limiting the scope of the invention.