Title:
EXTRACTING AND CONTEXTUALIZING TEXT
Document Type and Number:
WIPO Patent Application WO/2022/256864
Kind Code:
A1
Abstract:
A computer-implemented method of extracting and contextualising text from a transcript of an interaction between two or more participants includes extracting segments of text from the transcript. Context items, related to respective said segments of text, are extracted from the transcript. The segments of text together with their related context items are displayed in a graphic user interface.

Inventors:
MCCOWAN IAIN (AU)
DU PLESSIS JACO (AU)
JOSHI ADITYA (AU)
CRIPWELL LIAM (AU)
Application Number:
PCT/AU2022/050560
Publication Date:
December 15, 2022
Filing Date:
June 08, 2022
Assignee:
PINCH LABS PTY LTD (AU)
International Classes:
G06F40/205; G06F17/16; G06F40/279; G06N3/08; G06N20/00; G10L15/16; G10L15/18; G10L15/183; G10L15/26; G10L17/00
Foreign References:
US20210099317A1 (2021-04-01)
US20170344535A1 (2017-11-30)
US20200243095A1 (2020-07-30)
US9336776B2 (2016-05-10)
US20190147882A1 (2019-05-16)
Attorney, Agent or Firm:
EAGAR, Barry (AU)
Claims:
CLAIMS

1. A computer-implemented method of extracting and contextualising text from a transcript of an interaction between two or more participants, the method comprising the steps of: extracting segments of text from the transcript; extracting context items, related to respective said segments of text, from the transcript; and displaying the segments of text together with their related context items in a graphic user interface.

2. The computer-implemented method as claimed in claim 1, which includes the steps of: receiving an audio input in the form of an audio stream or a recording of the interaction between the two or more participants; and generating the transcript with speech recognition software.

3. The computer-implemented method as claimed in claim 2, which includes the step of partitioning the audio input according to speaker identity and generating participant data including one or more of participant identification, timing of participant speech and duration of participant speech to be associated with respective partitions in the audio input.

4. The computer-implemented method as claimed in claim 3, which includes the step of tagging the segments of text with tags representing the participant data associated with the segments of text, the step of extracting the context items including the step of extracting the participant data so that it can be displayed together with the extracted segments of text.

5. The computer-implemented method as claimed in claim 1, wherein the step of extracting the segments of text from the transcript includes the step of extracting segments of text, in the form of action items, which indicate action to be taken as a result of the recorded interaction.

6. The computer-implemented method as claimed in claim 5, wherein the step of identifying and extracting the action items includes the steps of: extracting suggested action items from the transcript, the suggested action items including duplicate action items; and extracting the duplicate action items.

7. The computer-implemented method as claimed in claim 6, wherein the step of extracting the duplicate action items includes the steps of: converting each action item to a numerical vector in an embedding space; comparing the vectors to determine their similarity; and grouping those vectors that cannot be readily distinguished from each other together as potential duplicates of a common underlying action item.

8. The computer-implemented method as claimed in claim 5, wherein the step of identifying and extracting the action items includes the step of executing a classifier that is trained to at least predict action items.

9. The computer-implemented method as claimed in claim 8, which includes the step of training the classifier using a manually labelled dataset in which relevant phrases or sentences are manually labelled as action items.

10. The computer-implemented method as claimed in claim 9, wherein the step of training the classifier includes the step of augmenting the manually labelled dataset to increase a size of the dataset.

11. The computer-implemented method as claimed in claim 10, wherein the step of augmenting the manually labelled dataset includes one or more of the following steps:

(a) performing back translation on the manually labelled dataset to generate a back translated dataset; and

(b) masking certain words within the manually labelled dataset and generating words for masked positions in the dataset to obtain a mask-generated dataset.

12. The computer-implemented method as claimed in claim 9, which includes inputting data from a separately trained fluency predictor into the classifier.

13. The computer-implemented method as claimed in claim 5, wherein the step of extracting the context items from the transcript includes the step of searching for contextual information in segments of text other than action items.

14. The computer-implemented method as claimed in claim 13, wherein the step of searching for the contextual information is carried out in segments of text proximate the action items.

15. The computer-implemented method as claimed in claim 5, wherein the step of extracting the context items from the transcript includes the step of extracting participant data including one or more of participant identification, timing of participant speech and duration of participant speech.

16. The computer-implemented method as claimed in claim 5, wherein the step of extracting the context items from the transcript includes the step of extracting action verbs from the transcript.

17. The computer-implemented method as claimed in claim 16, wherein the step of extracting the action verbs includes the step of filtering out modal and subjunctive verbs.

18. The computer-implemented method as claimed in claim 5, wherein the step of extracting the action items includes the steps of: segmenting the transcript into output sentence segments; determining a similarity between each output sentence segment and an action item using a part-of-speech distribution model to generate a similarity score; predicting whether each output sentence segment is an action item with a language model based classifier to generate a prediction score; and classifying each output sentence as an action item or non-action item based on the similarity and prediction scores.

19. The computer-implemented method as claimed in claim 5, wherein the step of extracting the context items includes the steps of: searching for contextual information by finding verb and noun phrases as candidate phrases to be related to the action items; taking, as input, phrases or sentences containing co-referent terms; replacing said co-referent terms with the candidate phrases; and scoring an intelligibility of the resultant phrases or sentences.

20. A non-transitory computer-readable storage medium that stores computer instructions to be implemented on at least one computing device including at least one processor, the computer instructions when executed by the at least one processor cause the at least one computing device to: extract segments of text from a transcript of an interaction between two or more participants; extract context items, related to respective said segments of text, from the transcript; and display the segments of text together with their related context items.

21. A non-transitory computer-readable storage medium that stores computer instructions to be implemented on at least one computing device including at least one processor, the computer instructions when executed by the at least one processor cause the at least one computing device to perform the method as claimed in any one of claims 1 to 19.

22. A server for extracting and contextualising text from a transcript of an interaction between two or more participants, the computer server comprising: a memory device; at least one processor configured to perform a method that comprises: extracting segments of text from the transcript; extracting context items, related to respective said segments of text, from the transcript; and displaying the segments of text together with their related context items in a graphic user interface.

23. A server for extracting and contextualising text from a transcript of an interaction between two or more participants, the computer server comprising: a memory device; and at least one processor configured to perform the method as claimed in any one of claims 1 to 18.

Description:
EXTRACTING AND CONTEXTUALIZING TEXT

FIELD OF THE INVENTION

This invention relates to extracting and contextualizing text.

BACKGROUND OF THE INVENTION

A key outcome from a recorded or streamed interaction between two or more participants, such as a meeting, conversation, or conference, is a list of action items that require attention. These action items can be in the form of phrases or sentences that provide details of actions decided upon during the interaction and that require attention. Other useful outcomes are lists of text items that may represent moments, comments, and participants. These may or may not be in the context of the action items.

Currently, it is usual for a list of action items, and other textual items, to be manually created from a transcript of a meeting, conversation, or other recorded interaction, or taken live and manually during the interaction by someone in a dedicated minute taker role, such as a secretary. This is costly, requires a focussed person, and has the potential to be subjective and incomplete. When dealing with a transcript, a list of action items is either extracted and entered into a task reminder system or simply highlighted/underlined and circulated, as necessary. It will be appreciated that such a process can be time-consuming and inaccurate. Furthermore, action items that are generated manually from a meeting transcript can also often be subjective and incomplete. The same applies to other text items that might represent moments and comments attributable to participants.

SUMMARY OF THE INVENTION

According to a first aspect of the invention, there is provided a computer-implemented method of extracting and contextualising text from a transcript of an interaction between two or more participants, the method comprising the steps of: extracting segments of text from the transcript; extracting context items, related to respective said segments of text, from the transcript; and displaying the segments of text together with their related context items in a graphic user interface.

The computer-implemented method may include the steps of: receiving an audio input in the form of an audio stream or a recording of the interaction between the two or more participants; and generating the transcript with speech recognition software.

The computer-implemented method may include the step of partitioning the audio input according to speaker identity and generating participant data including one or more of participant identification, timing of participant speech and duration of participant speech to be associated with respective partitions in the audio input.

The computer-implemented method may include the step of tagging or labelling the segments of text with tags representing the participant data associated with the segments of text, the step of extracting the context items including the step of extracting the participant data so that it can be displayed together with the extracted segments of text.

The step of extracting the segments of text from the transcript may include the step of extracting segments of text, in the form of action items, which indicate action to be taken as a result of the recorded interaction.

The step of identifying and extracting the action items may include the steps of: extracting suggested action items from the transcript, the suggested action items including duplicate action items; and extracting the duplicate action items.

The step of extracting the duplicate action items may include the steps of: converting each action item to a numerical vector in an embedding space; comparing the vectors to determine their similarity; and grouping those vectors that cannot be readily distinguished from each other together as potential duplicates of a common underlying action item.

The step of identifying and extracting the action items may include the step of executing a classifier that is trained to at least predict action items.

The computer-implemented method may include the step of training the classifier using a manually labelled dataset in which relevant phrases or sentences are manually labelled as action items.

The step of training the classifier may include the step of augmenting the manually labelled dataset to increase a size of the dataset. The step of augmenting the manually labelled dataset may include one or more of the following steps:

(a) performing back translation on the manually labelled dataset to generate a back translated dataset; and

(b) masking certain words within the manually labelled dataset and generating words for masked positions in the dataset to obtain a mask-generated dataset.

The computer-implemented method may include inputting data from a separately trained fluency predictor into the classifier.

The step of extracting the context items from the transcript may include the step of searching for, and extracting, contextual information in segments of text other than action items.

The step of searching for the contextual information may be carried out in segments of text proximate the action items.

The step of extracting the context items from the transcript may include the step of extracting participant data including one or more of participant identification, timing of participant speech and duration of participant speech.

The step of extracting the context items from the transcript may include the step of extracting action verbs from the transcript.

The step of extracting the action verbs may include the step of filtering out modal and subjunctive verbs.

The step of extracting the action items may include the steps of: segmenting the transcript into output sentence segments; determining a similarity between each output sentence segment and an action item using a part-of-speech distribution model to generate a similarity score; predicting whether each output sentence segment is an action item with a language model based classifier to generate a prediction score; and classifying each output sentence as an action item or non-action item based on the similarity and prediction scores.

The step of extracting the context items may include the steps of: searching for contextual information by finding verb and noun phrases as candidate phrases to be related to the action items; taking, as input, phrases or sentences containing co-referent terms; replacing said co-referent terms with the candidate phrases; and scoring an intelligibility of the resultant phrases or sentences.

According to a second aspect of the invention, there is provided a non-transitory computer-readable storage medium that stores computer instructions to be implemented on at least one computing device including at least one processor, the computer instructions when executed by the at least one processor cause the at least one computing device to: extract segments of text from a transcript of an interaction between two or more participants; extract context items, related to respective said segments of text, from the transcript; and display the segments of text together with their related context items.

According to a third aspect of the invention, there is provided a non-transitory computer-readable storage medium that stores computer instructions to be implemented on at least one computing device including at least one processor, the computer instructions when executed by the at least one processor cause the at least one computing device to perform the method of the first aspect of the invention.

According to a fourth aspect of the invention, there is provided a server for extracting and contextualising text from a transcript of an interaction between two or more participants, the computer server comprising: a memory device; at least one processor configured to perform a method that comprises: extracting segments of text from the transcript; extracting context items, related to respective said segments of text, from the transcript; and displaying the segments of text together with their related context items in a graphic user interface.

According to a fifth aspect of the invention, there is provided a server for extracting and contextualising text from a transcript of an interaction between two or more participants, the computer server comprising: a memory device; and at least one processor configured to perform the method of the first aspect of the invention.

According to a sixth aspect of the invention, there is provided a method of automatically extracting and contextualizing text from a transcript of a recorded or streamed interaction between two or more participants, the method comprising the steps of: receiving text from the transcript as input text; extracting action items from the input text; extracting context items from the input text, the context items including participant data relating to the two or more participants and being related to the action items such that the action items are contextualised by the context items; and displaying the action items together with their related context items in a graphic user interface.

The step of extracting the action items may include the step of executing a classifier that is trained to identify and label phrases or sentences in the input text as action items.

The step of extracting the action items from the input text may include the step of extracting duplicate action items to be displayed as duplicates or discarded.

The method may include converting each sentence of the text into a numerical vector in an embedding space using a general language model and comparing the vectors to determine whether they are duplicate vectors or related vectors.

The step of displaying the action items together with their related context items may include the step of outputting, to an electronic display, parts of the input text together with the extracted action and context items.

The method may include the step of extracting duplicate action items to be displayed as duplicates or discarded.

According to a seventh aspect of the invention, there is provided a system for extracting and contextualizing text from a transcript of an interaction between two or more participants, the system comprising: an extraction module or extractor configured to extract segments of text from the transcript; a context item extraction module or extractor configured to extract context items related to respective said segments of text; and a graphic user interface for displaying the action items together with their related context items.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows a high-level block diagram representing an example of a method, in accordance with the invention, for extracting and contextualizing text from a transcript of an interaction between two or more participants.

Figure 2 shows a high-level block diagram representing an example of a method, in accordance with the invention, for extracting and contextualizing text from a transcript of the interaction.

Figure 3 shows a block diagram representing operation of text extractor modules used in the method of figures 1 and 2.

Figure 4 shows a further block diagram representing operation of text extractor modules used in the method of figures 1 and 2.

Figure 5 shows a block diagram representing an example of a text extractor module used in the method of figures 1 and 2, in a training mode.

Figure 6 shows a block diagram representing an example of a text extractor module used in the method of figures 1 and 2, in a deployment mode.

Figure 7 shows a block diagram representing an example of a text extraction method, in accordance with the invention, used by a text extractor module to extract duplicate action items from the transcript.

Figure 8 shows a block diagram representing an example of a method for extracting context items from the transcript.

Figure 9 shows an example of an output resulting from an execution of a method for extracting and contextualizing text from a transcript of an interaction between two or more participants.

Figure 10 shows an example of an interactive output interface generated on an electronic display resulting from an execution of a method for extracting and contextualizing text from an audio input in the form of an audio stream or a recording of an interaction between two or more participants.

Figure 11 shows an example of an interactive output interface generated on an electronic display resulting from an execution of a method for extracting and contextualizing text from an audio input in the form of an audio stream or a recording of an interaction between two or more participants.

Figure 12 shows an example of an expansion interface table resulting from the selection of an item from the interface of figure 11.

Figure 13 shows an example of an interactive output interface generated on an electronic display resulting from an execution of a method for extracting and contextualizing text from an audio input in the form of an audio stream or a recording of an interaction between two or more participants.

Figure 14 shows an example of a system, in accordance with the invention, for automatically extracting and contextualising text from a transcript of an interaction between two or more participants.

DETAILED DESCRIPTION

In figure 1, reference numeral 10 generally indicates a high-level block diagram representing a computer-implemented method of extracting and contextualizing text from a transcript 12 of an interaction between two or more participants. The transcript 12 is of an audio input in the form of an audio stream or a recording, shown at 14, of an interaction between two or more participants. The method is implemented by executing computer instructions on at least one computing device or server with at least one processor of the computing device or server. See below with reference to figure 14, which describes an example of a system for performing the method.

The interaction can take various forms. For example, the interaction can be a meeting. In other examples, the interaction can be recorded audio that references two or more participants that are not necessarily in a meeting.

The method uses speech recognition software to generate the transcript 12. The speech recognition software converts the audio stream or recording 14 into the transcript 12 and generates timing information associated with each word in the audio to link that timing back to each word in the transcript. The speech recognition software can thus generate time stamps associated with segments of speech. In this example, the audio stream or recording 14 is generated during the interaction, such as a meeting, conversation, conference, or any other recorded or streamed interaction between two or more participants.

Suitable speaker recognition software is used to generate tags or IDs that represent participant data related to respective participants in the interaction. To that end, the speaker recognition software can include suitable speaker diarization software which partitions or segments an input audio stream into segments according to an identity of a speaking participant. The participant data can be generated to include participant identification, timing of participant speech and duration of participant speech. The speaker recognition software can be configured to associate tags or IDs with participant data relating to respective participants. Thus, the transcript 12 can be labelled or tagged with the participant data allowing segments of text from the transcript 12, such as the action items described below, to be tagged and thus associated with respective participant data. The participant data indicates speaker turns 16 that can include the timing and duration of participant speech.
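
By way of illustration only, a transcript tagged with participant data in this way could be represented with a simple data structure along the following lines. The Python class and field names below are illustrative assumptions and do not form part of this disclosure.

```python
from dataclasses import dataclass

@dataclass
class SpeakerTurn:
    """One diarized partition of the audio input, tagged with participant data."""
    participant_id: str   # tag or ID generated by the speaker recognition software
    start_time: float     # timing of participant speech, in seconds from the start
    duration: float       # duration of participant speech, in seconds
    text: str             # transcript text for this partition

# The tagged transcript can then be handled as an ordered list of turns, so that
# later stages can associate extracted segments of text with participant data.
transcript = [
    SpeakerTurn("participant_A", 12.4, 6.1, "We should update the roadmap."),
    SpeakerTurn("participant_B", 18.7, 3.2, "Yes, I can do that next week."),
]
```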

The method makes use of a software product in the form of a text extraction module for extracting segments of text from the transcript 12. The text extraction module is hereinafter referred to as an Action Item and Context Extractor (AICE) 18. The AICE 18 is programmed or configured to receive the transcript 12, generated by the speech recognition software and tagged with the participant data generated by the speaker recognition software, and to identify and extract segments of text in the form of a number of action items 20. The action items 20 are in the form of phrases or sentences or other segments of text, which indicate action to be taken as a result of the interaction. The AICE 18 is also configured to identify and extract context items 22 that are related to the respective action items 20, in a contextual manner, so that the action items 20 can be read in context. Such context items 22 can include the participant data referred to above. The speaker diarization software can thus facilitate association of the action items 20 with corresponding participant data.

In figure 2, reference numeral 30 generally indicates a further high-level block diagram that represents the method of extracting and contextualizing text referred to above.

As can be seen in figure 2, the AICE 18 includes a software module, in the form of a text extractor, which is referred to hereinafter as an action item extractor 32. The action item extractor 32 is configured to extract suggested action items from the meeting transcript 12. The AICE 18 includes a software module in the form of an action item contextual phrase extractor, or context item extractor 34, that extracts the context items 22 that are related to the respective suggested action items. The AICE 18 further includes an action item duplication extractor 36. The action item duplication extractor 36 receives, as input, the suggested action items from the extractor 32 and identifies and extracts duplicate action items also as context items 22 that are related to the action items 20. Thus, the AICE 18 can output the action items 20 and their associated context items 22. This is indicated graphically in figure 2 with sun icons where cores represent the action items 20, and rays represent the context items 22. The cores and rays can be understood to represent at least part of a graphic user interface that displays the action items 20 and associated context items 22 on an electronic display.

In figure 3, reference numeral 40 generally indicates a block diagram that represents operation of the AICE 18 schematically. More particularly, the block diagram 40 shows operation of the extractors 32, 34, 36. At 41, the extractor 32 identifies and extracts the suggested action items using a trained classifier (described in further detail below with reference to figure 5). An example of a method of operation of the extractor 32 is described below with reference to figure 6. The classifier is trained using a set of sentences from real meetings that have been manually labelled for the presence of action items and a fluency score. At 43, the action item contextual extractor 34 extracts the context items 22. Details of an example of a method of operation of the contextual phrase extractor 34 are shown in figure 8. At 44, the extractor 34 extracts phrases from the transcript 12 that are related to the respective action items 20. The extractor 34 further extracts times, dates, and places from the related phrases at 46. This data can include the participant data referred to above such that the participant data can be associated with the action items 20 using, for example, the speaker diarization software.

The extractor 34 extracts action verbs from the related phrases at 48 and co-referent terms or entities from the related phrases at 50 (see below for an explanation of “co-referent”). At 52, the extractor 34 extracts assignees (the person(s) to whom the action item has been delegated) and assignors (the person(s) delegating the action item to the assignees). It will be appreciated that the assignee(s) and assignor(s) can be participants, such that the assignor(s) related to specific action items 20 can be identified using the speaker diarization software.

At 42, the action item duplicate extractor 36 identifies and extracts action item duplicates from the suggested action items.

The action item duplication extractor 36 can be configured to group the action item duplicates together, for example, using a user interface element that expands or collapses each group of duplicates. For example, the user interface element can be an accordion interface or a tree-type interface. Alternatively, the extractor 36 can be configured to disregard duplicates to avoid cluttering a user interface. Further detail regarding operation of the action item duplication extractor 36 is described below with reference to figure 7.

The AICE 18 is configured so that the action items 20, in the form of action item phrases, or sentences, can be displayed or presented linked with the context items 22 as related phrases or sentences by, for example, making use of a hover text function.

The action item extractor 32 is configured or trained to execute the trained classifier to label a sentence or a phrase as an action item in the form of a Boolean yes/no output. The action item extractor 32 is also configured or trained to output a fluency score, for example, in the form of a real value between 0.0 and 1.0. The fluency score is used to rank or filter suggested action item sentences or phrases according to intelligibility or the extent of information within the sentence. Speech in meetings is conversational and therefore contains regular dysfluencies. Examples of dysfluent speech include repeated words or syllables, false starts to phrases or sentences, corrections to mispronunciations, and the insertion of filler words that do not contribute any semantic value, such as “um” and “like”. Intelligibility can also be reduced by a transcript that is output from an automatic speech-to-text system that contains errors. Use of a fluency score is described in further detail below with reference to figure 5.

As is explained in further detail below with reference to figure 5, the classifier is trained using a set of sentences from real meetings that have been manually labelled for the presence of action items and given fluency scores. The manually labelled dataset is further augmented using several strategies to obtain a larger, richer dataset for classifier training. Examples of the strategies include augmentation via language model masking, for example, replacing nouns in a sentence to synthesise new action item examples. Back-translation, that is, automatically translating labelled sentences into a second language (pivot language) and then directly translating them back into their original language to get a paraphrased sentence with the same meaning but different word sequence, can also be used to augment the dataset.

The contextual phrase extractor or context extractor 34 is configured to identify contextual information or the context items 22 in other sentences within the transcript 12 that are relevant to a given action item 20 that has been detected by the action item extractor 32. This process can be carried out by numerical vectorisation. In that process, each sentence in the transcript is converted to a numerical vector in an embedding space using a general language model, such as any one of those known as BERT and GPT2, or similar. This can be implemented using a neural network architecture such as a transformer. A vector similarity measure such as the cosine similarity can then be used to compare vectors in the embedded space to determine whether they are duplicates or the related phrases referred to above.
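
By way of illustration only, the following Python sketch shows one possible form of this numerical vectorisation and similarity comparison, using the publicly available sentence-transformers and scikit-learn libraries as stand-ins for the general language model and vector similarity measure. The model name, example sentences, and threshold are illustrative assumptions, not taken from this disclosure.

```python
# Sketch: embed sentences and compare them by cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in general language model

sentences = [
    "I will send the report to the client by Friday.",
    "Let's make sure the client gets the report this week.",
    "The weather was terrible during the offsite.",
]

vectors = model.encode(sentences)        # one numerical vector per sentence
similarity = cosine_similarity(vectors)  # pairwise cosine similarity matrix

# Pairs whose similarity exceeds a chosen threshold are treated as duplicates
# or related phrases; the value 0.8 is purely illustrative.
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if similarity[i, j] > 0.8:
            print(f"Sentences {i} and {j} look like the same underlying item")
```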

If action item sentences have a high similarity such that they cannot readily be distinguished, they are grouped together as potential duplicates of the same underlying action item, as set out above. Detail of an example of a method of extracting action item duplicates is described below with reference to figure 7.

The AICE 18 is configured so that each action item 20 is tagged or labelled with a unique identification code (“ID”) that links it back to the sentence in the transcript. The context extractor 34 is configured to provide a list of IDs that are potential duplicates of a given action item 20. An example of the output for two groups of related actions is shown in figure 9, which illustrates an array of suggested action item sentences 20, their IDs, and a set of IDs of potential duplicate actions. As can be seen, the first action item 20.1 has been tagged with the ID “162”, the second action item 20.2 with the ID “167”, the third action item with the ID “207”, the fourth action item with the ID “210” and the fifth action item with the ID “218”. These IDs can be used to tag potential duplicate action items accordingly. For example, action item 20.2 is a potential duplicate of action item 20.1, and action items 20.4 and 20.5 are potential duplicates of action item 20.3, etc. The IDs can thus be used to display one of the action items 20, with a link to potential duplicates, on a user interface.
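
By way of illustration only, the kind of output described with reference to figure 9 could be represented as follows, using the IDs quoted above; the field names and placeholder text are illustrative and not taken from figure 9 itself.

```python
# Sketch: each suggested action item carries an ID linking it back to its
# transcript sentence, plus a list of IDs of potential duplicate actions.
suggested_action_items = [
    {"id": 162, "text": "<action item sentence 20.1>", "duplicate_ids": [167]},
    {"id": 167, "text": "<action item sentence 20.2>", "duplicate_ids": [162]},
    {"id": 207, "text": "<action item sentence 20.3>", "duplicate_ids": [210, 218]},
    {"id": 210, "text": "<action item sentence 20.4>", "duplicate_ids": [207, 218]},
    {"id": 218, "text": "<action item sentence 20.5>", "duplicate_ids": [207, 210]},
]
```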

The contextual phrase extractor 34 is configured so that, for each detected action item 20, a set of sentences that are proximate to that sentence containing the detected action item 20 is searched for related contextual information, which can be extracted. Thus, sentences that precede or follow the sentence containing the detected action item 20 can be searched. These sentences are parsed to find verb and noun phrases as candidate phrases to be related to the action items 20. The candidate phrases are compared to the detected action item 20 using the vector similarity measure described above. The “k” most similar phrases are classified as related phrases. Further detail of this method is described below with reference to figure 8.
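
By way of illustration only, the following sketch shows one possible form of this related-phrase search, using spaCy noun chunks to stand in for the verb and noun candidate phrases and an embedding-based cosine similarity for the comparison. The library choices, model names, and the function name are illustrative assumptions.

```python
# Sketch: find candidate phrases in sentences proximate to an action item and
# keep the "k" most similar ones as related phrases.
import spacy
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nlp = spacy.load("en_core_web_sm")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def related_phrases(action_item, proximate_sentences, k=3):
    # Parse the neighbouring sentences and collect candidate phrases.
    candidates = []
    for doc in nlp.pipe(proximate_sentences):
        candidates.extend(chunk.text for chunk in doc.noun_chunks)
    if not candidates:
        return []
    # Compare each candidate with the action item in the embedding space.
    vectors = embedder.encode([action_item] + candidates)
    scores = cosine_similarity(vectors[:1], vectors[1:])[0]
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]  # the "k" most similar phrases become related phrases
```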

It will be appreciated that action items 20 in the form of action item sentences or phrases will contain action verbs. In order to identify those action verbs, modal verbs and subjunctive verbs are filtered out and the highest occurring verb from the remaining verbs is classified as the action verb associated with the respective action item. Further detail of this method is described below with reference to figure 8.

It will also be appreciated that, in a particular transcript, some sentences or phrases containing action items may include nouns such as “that” or other terms that are defined by phrases or words occurring earlier in the transcript. These are called co-referent terms or items. The contextual item extractor 34 is configured to take as input each suggested action item sentence and to mask out occurrences of co-referent terms from a lexicon of common co-referent terms. The co-referent terms are replaced by the candidate phrases from the related phrases extracted by the contextual phrase extractor 34, as described above. The intelligibility of the resultant phrase is then scored according to a language perplexity measure. The candidate phrase with the lowest perplexity measure is extracted as the relevant co-referent entity or item. The extractor 34 is also configured to extract subjects or objects of sentences or phrases containing action items. These are referred to as co-referent entities in the table below. Further detail of this method is described below with reference to figure 8.
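
By way of illustration only, the following sketch shows one way the perplexity-based scoring of candidate replacements could be implemented, here using the publicly available GPT-2 model from the transformers library as the language model. The function names and the simple string replacement are illustrative assumptions.

```python
# Sketch: replace a co-referent term with each candidate phrase and keep the
# substitution with the lowest language-model perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def resolve_coreference(action_item, coreferent, candidate_phrases):
    # Score the intelligibility of each resultant sentence.
    scored = [
        (phrase, perplexity(action_item.replace(coreferent, phrase, 1)))
        for phrase in candidate_phrases
    ]
    # The candidate with the lowest perplexity is taken as the referent.
    return min(scored, key=lambda pair: pair[1])[0]

print(resolve_coreference("We should finish that by Friday.", "that",
                          ["the quarterly report", "the weather"]))
```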

The following table sets out some examples of co-reference:

The contextual phrase extractor 34 is configured to identify names and pronouns that are the subject of the action verb in the action item and the related phrases; these are identified as assignees of the action item (see above with reference to item 52 in figure 3).

Also, the action items, duplicates and related phrases are parsed to identify times, dates, places, and other proper nouns such as company names. These are tagged as context items in relation to the associated action item.

In figure 4, reference numeral 60 generally indicates a block diagram that further represents operation of the AICE 18 schematically.

The AICE 18 is configured to tune the relevant language model for classification using the fluency score described above and the augmentation, also as described above. These are shown at 62 in figure 4.

The use of the vector similarity measure described above by the contextual phrase extractor 34 is shown at 64. At 64, a vector similarity measure, such as cosine similarity, is used to compare language model (LM) representations of action items to identify related items, that form context items 22, and duplicate action items that can be deleted or displayed. The replacement of the co-referent terms and use of the intelligibility score according to a language perplexity measure to determine related phrases is shown at 66.

The contextual phrase extractor 34 is configured to identify action verbs by filtering out modal verbs and subjunctive verbs, as described above. This is achieved by dependency parsing at 68 with an appropriate software module or package for natural language processing (NLP). For example, Python libraries such as Natural Language Toolkit (NLTK) - https://www.nltk.org/, and spaCy - https://spacy.io/ - would be suitable. As is known, dependency parsing can be used to build a parsing tree with tags that determine the relationship between words in a sentence. This is more effective than syntactical parsing, which uses grammar rules and is overly flexible, leaving room for uncertainty. Such libraries can make use of a lexicon of modal and subjunctive verbs. See below with reference to figure 8 for an example of this method. It will be appreciated that the dependency parsing can also be used to identify and extract the co-referent entities at 50, such as the objects of the phrases or sentences, as described above.
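
By way of illustration only, the following spaCy-based sketch filters out modal verbs and a small, made-up lexicon of subjunctive-style verbs, and returns the highest-occurring remaining verb as the action verb. The lexicon and function name are illustrative assumptions.

```python
# Sketch: identify the action verb by filtering modal and subjunctive verbs.
from collections import Counter
from typing import Optional
import spacy

nlp = spacy.load("en_core_web_sm")
SUBJUNCTIVE_LIKE = {"wish", "suggest", "recommend"}  # illustrative lexicon only

def action_verb(sentences) -> Optional[str]:
    counts = Counter()
    for doc in nlp.pipe(sentences):
        for token in doc:
            if token.tag_ == "MD":                       # modal verbs: could, should, ...
                continue
            if token.lemma_.lower() in SUBJUNCTIVE_LIKE:
                continue
            if token.pos_ == "VERB":
                counts[token.lemma_.lower()] += 1
    return counts.most_common(1)[0][0] if counts else None

print(action_verb(["We should update the roadmap.", "I will update it tomorrow."]))
# -> "update"
```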

In figure 5, reference numeral 80 generally indicates a functional block diagram that represents a process for training the action item extractor 32 of the AICE 18 during a training mode.

In the process 80, it is assumed that the relevant language is English. It will be appreciated that the process 80 can be adapted to other languages, as required.

Meeting transcripts, or other documents, are used to generate a manually labelled dataset 82. The dataset 82 is labelled to indicate or tag sentences or phrases as action items or not. This is a Boolean true/false label.

The dataset 82 is translated into a pivot language at 84 with suitable translation software. At 86, the dataset 82 is translated directly back into English (or other language of the meeting transcript) from the pivot language with the translation software to provide a back-translated dataset 87 of paraphrased sentences with the same meaning as the sentences in the original dataset but with different word sequences. It will be appreciated that multiple pivot languages can be used further to augment a training dataset.
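
By way of illustration only, the following sketch performs back-translation with publicly available MarianMT models from the transformers library, using German as the pivot language. The choice of pivot language and the model names are illustrative assumptions; this disclosure does not prescribe a particular translation system.

```python
# Sketch: translate labelled sentences into a pivot language and back again to
# obtain paraphrases with the same meaning but different word sequences.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

en_de_tok, en_de = load("Helsinki-NLP/opus-mt-en-de")
de_en_tok, de_en = load("Helsinki-NLP/opus-mt-de-en")

def translate(sentences, tokenizer, model):
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(sentences):
    pivot = translate(sentences, en_de_tok, en_de)  # English -> pivot language
    return translate(pivot, de_en_tok, de_en)       # pivot language -> English

print(back_translate(["I will circulate the minutes after the meeting."]))
```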

At 88, relevant words, such as verbs and nouns, are masked. The determination of words to mask is carried out using various resources. These can include a meeting lexicon 90 developed for the meetings from which the transcripts were generated.

The meeting lexicon 90 can obtain a vocabulary from a resource such as VerbNet 92, which is a comprehensive verb lexicon. The meeting lexicon 90 can also make use of a domain-specific lexicon 94, where the domain is relevant to the field or scope of the meeting.

A suitable word representation resource, in the form of a pre-trained masked language model, such as a General Language Model, for example, a language model based on BERT (by Google (registered trade mark)) and a language model based on GPT-2 (by OpenAI, https://openai.com/blog/gpt-2-1-5b-release/), can be used to replace the masked words at 96 with words generated by the language model to output a mask-generated dataset 98.
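
By way of illustration only, the following sketch uses the transformers fill-mask pipeline with a BERT model to generate words for a masked position; each generated sequence can then be added to the mask-generated dataset with the label of the original sentence. The example sentence is illustrative.

```python
# Sketch: let a pre-trained masked language model propose words for a masked
# position to synthesise new labelled examples.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

masked_sentence = "Please send the [MASK] to the client before Friday."
for suggestion in fill_mask(masked_sentence, top_k=3):
    # Each returned sequence becomes an additional training example that keeps
    # the label (action item) of the sentence it was generated from.
    print(round(suggestion["score"], 3), suggestion["sequence"])
```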

The back-translated dataset 86 and the mask-generated dataset 98 can be combined to provide an augmented training dataset at 100. A part-of-speech distribution model 102 takes the augmented training dataset as input. The part-of-speech distribution model 102 is the expected distribution over part-of-speech tags calculated over a corpus of action items and non-action items. That is, two average vectors (as referred to above), one for action items and one for non-action items, are calculated. These vectors include the expected proportions of parts of speech that are typically found in action items as opposed to non-action items. Such vectors can thus be used to differentiate between action and non-action items.
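
By way of illustration only, the following sketch computes such average part-of-speech proportion vectors with spaCy and compares a new sentence against the action-item vector by cosine similarity. The tag set and the two tiny example corpora are illustrative assumptions.

```python
# Sketch: average part-of-speech proportion vectors for action items, used to
# score how "action-item-like" a new sentence is.
import numpy as np
import spacy
from sklearn.metrics.pairwise import cosine_similarity

nlp = spacy.load("en_core_web_sm")
POS_TAGS = ["NOUN", "VERB", "AUX", "PRON", "ADP", "ADJ", "ADV", "DET", "PART"]

def pos_proportions(sentence):
    doc = nlp(sentence)
    counts = np.array([sum(t.pos_ == tag for t in doc) for tag in POS_TAGS], dtype=float)
    return counts / max(counts.sum(), 1.0)

def average_vector(corpus):
    return np.mean([pos_proportions(s) for s in corpus], axis=0)

# Illustrative two-sentence corpus; in practice this would be a large labelled set.
action_vector = average_vector(["Send the draft to Sam by Monday.",
                                "Book the meeting room for Thursday."])

def similarity_score(sentence):
    v = pos_proportions(sentence).reshape(1, -1)
    return float(cosine_similarity(v, action_vector.reshape(1, -1))[0, 0])

print(similarity_score("Please update the roadmap before the next call."))
```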

The process 80 makes use of a general language model based classifier 104 that is configured or trained by taking, as input, the mask-generated dataset 98. The classifier 104 also takes, as input, from a fluency predictor module 106, a fluency dataset including dialogues that are labelled as fluent according to a fluency score of a real number. This fluency dataset can include up to one million dialogues labelled as fluent according to the fluency score. The fluency dataset is a publicly available dataset.

In figure 6, reference numeral 110 generally indicates a method or process for applying or deploying the trained action item extractor 32, in a deployment mode, to a new transcript 12, such as a meeting transcript. The action item extractor 32 is trained as described above with reference to figure 5.

At 112, the process 110 takes the meeting transcript 12 as input. The meeting transcript 12 can be taken as a full off-line document that has been generated from a recorded interaction. In this case, the process 110 can be executed as a batch or off-line process. Alternatively, the new meeting transcript can be taken as a live stream in real-time such that the process 110 is carried out as a real-time/streaming process.

At 114, the process 110 carries out sentence segmentation on the meeting transcript 12 to generate sentence segments. The part-of-speech distribution module 102 takes output from the sentence segmentation step 114 as input and generates a similarity score at 116. Such a score can indicate a level of similarity between each segment and an action item. The classifier 104, configured in the process 80, takes output from the sentence segmentation step 114 as input and generates a prediction score at 117. Such a score would indicate whether, according to the classifier, the relevant segment would be an action item.
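
By way of illustration only, the following sketch combines the similarity score and the prediction score with simple thresholds to select the suggested action items output at 120. The thresholds and the combination rule are illustrative assumptions; the disclosure does not specify how the two scores are combined at 118.

```python
# Sketch: keep a sentence segment as a suggested action item when both its
# POS-distribution similarity score and its classifier prediction score are high.
SIM_THRESHOLD = 0.7    # illustrative value
PRED_THRESHOLD = 0.5   # illustrative value

def suggest_action_items(segments, similarity_score, prediction_score):
    suggested = []
    for segment in segments:
        sim = similarity_score(segment)    # from the part-of-speech distribution model
        pred = prediction_score(segment)   # from the language model based classifier
        if sim >= SIM_THRESHOLD and pred >= PRED_THRESHOLD:
            suggested.append({"text": segment, "similarity": sim, "prediction": pred})
    return suggested
```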

At 118, a step for outputting the suggested action items (see above with reference to figure 2) at 120 is activated depending on the similarity and prediction scores generated at 116 and 117. These can include the status of items as action items or not.

In figure 7, reference numeral 130 generally indicates a process, carried out by the action item duplication extractor 36, for extracting duplicate action items.

At 132, the extractor 36 takes, as input, pairs of the suggested action items generated at 120 above that are to be assessed for similarity. These are shown as suggested action item A at 134 and suggested action item B at 136. Selection of the input pairs can be an exhaustive process comparing action items to each other (using vectorisation, for example, as described above) or could be constrained to selecting action items occurring within a certain temporal proximity (using the timing information referred to above, for example, with reference to figure 1).

The classifier 104 generates and outputs an action vector A at 138 and an action vector B at 140, by numerical vectorisation, from the suggested action items A and B, respectively. The vectors A and B are compared, at 142, using the vector similarity measure, in this example the cosine similarity method, described above. A step is activated at 144 if the vectors A and B are identified as identical, or cannot be readily distinguished from each other, and are grouped together as potential duplicates of a common underlying action item. The grouped vectors are used to generate duplication linkages at 146 that link to action item duplicates. The duplication linkages 146 are made available in a user interface so that only one action item, or a collapsed group, is presented to the user, avoiding the clutter of presenting a number of action items that are, in fact, the same.
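
By way of illustration only, the following sketch groups suggested action items into duplication linkages by single-link grouping of pairs whose cosine similarity exceeds a threshold. The embedding model, threshold, and grouping rule are illustrative assumptions.

```python
# Sketch: group suggested action items whose vectors cannot readily be
# distinguished as potential duplicates of a common underlying action item.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
DUPLICATE_THRESHOLD = 0.85  # illustrative value

def group_duplicates(action_items):
    vectors = model.encode(action_items)
    sims = cosine_similarity(vectors)
    groups = [{i} for i in range(len(action_items))]
    for i in range(len(action_items)):
        for j in range(i + 1, len(action_items)):
            if sims[i, j] >= DUPLICATE_THRESHOLD:
                group_i = next(g for g in groups if i in g)
                group_j = next(g for g in groups if j in g)
                if group_i is not group_j:
                    group_i |= group_j   # merge the two duplication linkages
                    groups.remove(group_j)
    return groups  # each set holds indices of potential duplicates of one item
```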

In figure 8, reference numeral 150 generally indicates a process, carried out by the contextual phrase extractor 34 to identify and extract the related phrases, co-referent entities and action verbs, as set out above.

At 152, the extractor 34 receives, as input, the transcript 12, for example, a meeting transcript. As set out above, the extractor 34 extracts suggested action items at 154, using the classifier 104, and proximate context items at 156, determined from sentences proximate those of the suggested action items.

The extractor 34 includes an action verb extractor module 158 that takes, as input, the suggested action items. The action verb extractor module 158 is configured to filter out modal and subjunctive verbs at 160 to extract an action or top verb at 162 as a context item. As set out above, the action or top verb can be used to replace co-referent entities. The extractor 34 is configured to extract verb and noun phrases, at 164, from both the suggested action items and the preceding contextual items to provide the candidate phrases referred to above. The extractor 34 is also configured to generate masked sentences or phrases, as described above, at 166. The masked sentences or phrases are generated using a co-referent mention lexicon 168, which is a list of words that are used to refer to previous verb or noun phrases such as “that”, “this”, “it”, etc., as described above.

The extractor 34 generates related phrases from all candidate phrases at 170. As set out above, this is achieved by numerical vectorization and comparison using a vector similarity measure, at 172, which returns the top “k” most similar phrases at 174, to be tagged as the related phrases referred to above.

The extractor 34 processes the masked sentences generated at 166 together with the verb and noun phrases extracted at 164. This is done by replacing the masked words with the verb and noun phrases at 173. The resultant candidate sentences with the lowest perplexity measures are then extracted at 175. These are taken as input at 172, to determine the related phrases.

Thus, the process 150 is capable of generating suggested actions at 176 together with related phrases, co-referent entities and action verbs at 178.

Figures 10 and 11 show, schematically, graphic user interfaces 180, 182 that are output and displayed by the AICE 18. Each of the interfaces 180, 182 displays part of a meeting transcript that has been generated in accordance with the method described herein. More particularly, the interfaces 180, 182 show a list of suggested action items generated according to the method described herein.

The interface 180 shows a list of the suggested action items represented by paragraphs 184 that include the co-referent “that”. As set out above, the processes described herein are able to identify and extract, for display (for example, hover display, as illustrated in figure 10), action nouns that are associated with the co-referents.

Each of the action items displayed in the interface 180 is associated with a time stamp 190 that is generated by the speech recognition software described above with reference to figure 1. Thus, a user can identify a time in a meeting when certain action items were discussed. The interface 182 shows a list of suggested items represented by paragraphs 186 that also include the co-referent “that”. In this case, the AICE 18 highlights the relevant action nouns and verbs at 188 that have been identified as suitable replacements for the co-referent entity, as described above. Furthermore, the interface is also able to display, for example by hover display, action nouns from the contextual extraction described above and labelled “Context”, and action verbs labelled “To do”.

In figure 12, reference numeral 192 shows, schematically, a graphic user interface that can be output and displayed by the AICE 18. The interface 192 displays a series of the suggested action items that all include the object “action item”. This object can be identified in the meeting transcript as set out in step 164, described above with reference to figure 8. The interface 192 also displays, adjacent each suggested action item, a link 194 to participant data, as described above with reference to figure 1. The link 194 can be used to view participant data associated with the participant whose speech is linked to the suggested action item. The link 194 can also open an expansion interface that allows a user to select a participant who will be responsible for responding to the suggested action item.

The interface 192 also displays similar suggested action items at 196. The similar suggested action items can be extracted using the similarity assessment method executed by the action item duplication extractor 36 as described above with reference to figure 7.

In figure 13 reference numeral 198 shows, schematically, a graphic user interface that can be output and displayed by the AICE 18.

The interface 198 displays a list 200 of sentences that contain the action items 20 in the form of segments of text 208 (“Update to Ruby 8”) and co-referent terms (“that” 209) forming part of the meeting transcript 12. In this example, the action item 20 is displayed with the context items 22, in this case, participant data (“Farrokh Bulsara” and “Sally Sorenson”), an action verb (“update”) 210, and a date (“next week? (13 June 2022)”) 212. The extraction of these context items 22 is described above with reference to figure 8, for example. The participant data is in the form of an assignor at 202 and an assignee at 204. These are described in further detail above with reference to figure 3. Each of the assignor 202 and the assignee 204 is associated with a drop-down expansion link 206 that opens a menu allowing the user to select a different assignor and assignee associated with the action item. Thus, the interface 198 can be used to assign or re-assign tasks flowing from action items to participants.

Similarly, the date 212 is also associated with a drop-down menu link that allows a date for the task to be set.

The action item 208 is associated with an expansion link 214. The expansion link 214 provides a link to a list 216 of phrases that contain segments of text from the transcript 12 that are similar to the action item 208. Effectively, these are further or other mentions of the action item 208. These can be determined using the action item duplication extractor 36 as described above with reference to figure 7.

In figure 14 reference numeral 220 generally indicates an example of a system for carrying out the method of extracting and contextualising text from a transcript of an interaction between two or more participants, as described above. The transcript is thus the transcript 12, as described above.

The interaction in question is a meeting between participants 222. The meeting is recorded or streamed with an audio input device, such as a microphone 224.

The system 220 includes a server 226, in accordance with the invention. The server 226 includes a processor 228 and a storage device in the form of a memory 230. The memory 230 includes a non-transitory storage medium that is capable of being read by the processor 228. The memory 230 stores computer instructions that can be read and executed by the processor 228 to carry out the various steps of the method described herein, including the execution of the software described herein, such as the text extraction modules (extractors) or software.

The server 226 can be a single computing device, or multiple computing devices or other data processing apparatus that can carry out server-side operations. Thus, the server 226 can carry out distributed operations. The server 226 can be a virtual server.

The processor 228 can take various forms. For example, the processor 228 can be a single device, a number of interconnected devices, or a virtual computing device.

The server 226 can include suitable storage for storing the audio recording 14 for access by the processor 228 and the memory 230 to generate the transcript 12.

The server 226 can be connected directly, or hardwired, to the microphone 224. Alternatively, the microphone 224 can be wirelessly connected to the server 226 via a suitable router 232. The connection to the server 226 can thus be via a WLAN or the Internet 234, or both.

Various computing devices, in the form of handheld devices 236, Notebooks 238 and Desktop Computers 240 can connect to the server 226, such that the server 226 can display user interfaces or GUIs on the devices, such as the GUIs shown in figures 10 to 13. Users can thus communicate with the server 226 via the GUIs to view the action items 20 and the context items 22, and to provide inputs such as allocation of assignors and assignees, dates/times for performing tasks based on the action items 20, etc.

It will be appreciated that the system and method described herein is capable of automatically scanning a transcript of a recorded or streamed interaction between two or more participants to identify action items, resulting in an efficient follow-up after the interaction. A particularly useful feature of the system and method is that it can deal with the relevant information required to interpret an action item when that information is distributed across more than one phrase or sentence in the transcript. Furthermore, the system and method can also deal with a situation where the same action item is restated multiple times within the transcript. It will be appreciated that the system and method described herein is capable of identifying and extracting co-referent items and other words and phrases that represent certain moments and comments within a meeting that are not necessarily action items. This can be useful for users who need to obtain summary material from a meeting.

Generally, a purpose of the invention described herein and set out in the appended claims is to take a transcript of the recorded or streamed interaction and extract the action items together with their context. This information can then be presented to users or can be interfaced into other task management systems. It will be appreciated that the system and method allows a user to rapidly review meeting recordings with the goal of determining and assigning the work to be done. In general, the contextual information provides:

• information about the assignor or assignee for the action,

• details of when or where the action items are to be carried out,

• repeats or rephrasing of the same action in other parts of the transcript,

• key action verbs within the action items,

• co-referent terms and related phrases from neighbouring sentences.

The appended claims are to be considered as incorporated into the above description.

Throughout this specification, the following definitions are applied.

1. “Computer” or “Computing Device” is any apparatus capable of carrying out data processing functions and includes a computer system, or a collection of one or more apparatus, including mobile, hand-held devices.

2. “Server” is a single computing device, or multiple computing devices or other data processing apparatus that can carry out server-side operations and includes a virtual server.

Throughout this specification, reference to any advantages, promises, objects or the like should not be regarded as cumulative, composite, and/or collective and should be regarded as preferable or desirable rather than stated as a warranty.

Throughout this specification, unless otherwise indicated, "comprise," "comprises," and "comprising," (and variants thereof) or related terms such as "includes" (and variants thereof)," are used inclusively rather than exclusively, so that a stated integer or group of integers may include one or more other non-stated integers or groups of integers.

The term “and/or”, e.g., “A and/or B” shall be understood to mean either “A and B” or “A or B” and shall be taken to provide explicit support for both meanings and for either meaning.

Features which are described in the context of separate aspects and embodiments of the invention may be used together and/or be interchangeable. Similarly, features described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.

It is to be understood that the terminology employed above is for the purpose of description and should not be regarded as limiting. The described embodiments are intended to be illustrative of the invention, without limiting the scope thereof. The invention is capable of being practised with various modifications and additions as will readily occur to those skilled in the art.