


Title:
COMPUTER-IMPLEMENTED METHOD FOR PROCESSING A RECORDING OF A REAL-TIME WEB COLLABORATION AND COMMUNICATION SESSION, REAL-TIME WEB COMMUNICATION AND COLLABORATION PLATFORM, AND APPLICATION
Document Type and Number:
WIPO Patent Application WO/2021/013360
Kind Code:
A1
Abstract:
The present invention relates to a computer-implemented method for processing a recording of a real-time web collaboration and communication session, wherein the method comprises the steps of transcribing the recorded audio data of the real-time web collaboration and communication session so as to obtain text data of the recorded audio data; subdividing the text data into time portions (2, 2', 2'') of a predetermined length; creating a word cloud (3, 3', 3'') for every time portion (2, 2', 2''); and combining the word clouds (3, 3', 3'') created for every time portion (2, 2', 2'') in a chronological order. Further, the present invention relates to a real-time web collaboration and communication platform and an application.

Inventors:
SCHIFFER WOLFGANG (DE)
EL OUAAZIZA KAMAL (DE)
Application Number:
PCT/EP2019/070111
Publication Date:
January 28, 2021
Filing Date:
July 25, 2019
Assignee:
UNIFY PATENTE GMBH & CO KG (DE)
International Classes:
H04M3/42; H04L12/18; G10L15/08; G10L15/26; G10L25/54
Foreign References:
US20120179465A1 2012-07-12
US20180027123A1 2018-01-25
US20170201387A1 2017-07-13
Attorney, Agent or Firm:
SCHAAFHAUSEN PATENTANWÄLTE PARTGMBB (DE)
Claims:

1. Computer-implemented method for processing a recording of a real-time web collaboration and communication session, wherein the method comprises the steps of

- transcribing the recorded audio data of the real-time web collaboration and communication session so as to obtain text data of the recorded audio data;

- subdividing the text data into time portions (2, 2’, 2”) of a predetermined length;

- creating a word cloud (3, 3’, 3”) for every time portion (2, 2’, 2”); and

- combining the word clouds (3, 3’, 3”) created for every time portion (2, 2’, 2”) in a chronological order.

2. Computer-implemented method according to claim 1, wherein the time portions (2, 2’, 2”) are configurable.

3. Computer-implemented method according to claim 1 or claim 2, wherein the method further comprises a step of determining the top keyword (4, 4’, 4”) of every time portion (2, 2’, 2”) and listing the thus determined top keywords (4, 4’, 4”) in a chronological order, in particular, below a word cloud sequence (1) which is formed by the word clouds (3, 3’, 3”).

4. Computer-implemented method according to any one of the preceding claims, wherein the method further comprises a step of aggregating the time portions (2, 2’, 2”) on a first level (L1), wherein a predetermined number of chronologically adjacent time portions (2, 2’, 2”) are aggregated so as to form together a first aggregated time portion.

5. Computer-implemented method according to claim 4, wherein the step of aggregating is carried out on further levels (LN) until all time portions (2, 2’, 2”) are aggregated in one top aggregated time portion.

6. Computer-implemented method according to any one of the preceding claims, wherein the method further comprises a step of mapping the time portions (2, 2’, 2”) to the section of the recorded audio data.

7. Computer-implemented method according to any one of the preceding claims, wherein the method further comprises a step of mapping the text data of the time portions (2, 2’, 2”) to the section of the recorded audio data.

8. Computer-implemented method according to any one of the preceding claims, wherein the method further comprises a step of starting playback of the recorded audio data of a time portion (2, 2’, 2”), when a selection of the time portion is detected.

9. Computer-implemented method according to any one of the preceding claims, wherein the method further comprises a step of generating a temporal word cloud video sequence (6) of the word clouds (3, 3’, 3”) according to the chronological order of the respective time portions (2, 2’, 2”).

10. Computer-implemented method according to claim 9, wherein the method further comprises a step of assigning speakers to the respective time portions (2, 2’, 2”) and overlaying or augmenting the temporal word cloud video sequence (6) with additional data, in particular, with pictures or representations of the speakers.

11. Computer-implemented method according to claim 9 or claim 10, wherein the method further comprises a step of assigning documents or presentations to the respective time portions (2, 2’, 2”) and overlaying the temporal word cloud video sequence (6) with the documents.

12. Real-time web communication and collaboration platform, comprising a media server to which a plurality of clients is connected, wherein the media player is adapted to run an application for carrying out the method according to any one of the preceding claims.

13. Application for performing a real-time web communication and collaboration session on a real-time web communication and collaboration platform, which is adapted for carrying out the method according to any one of claims 1 to 11.

Description:
Computer-implemented method for processing a recording of a real-time web collaboration and communication session, real-time web communication and collaboration platform, and application


Web communication and collaboration platforms for performing real-time web conferences or other real-time web communication and/or collaboration sessions, for example, webinars, training sessions, project meetings, sales meetings, or product presentations, are getting more and more popular. The present invention relates to a computer-implemented method for processing a recording of such a real-time web collaboration and communication session. Further, the present invention relates to a real-time web communication and collaboration platform and an application.

Due to the increasing use of online meetings and conferences, it is quite often not possible for a user to join a meeting because of unavailability or other meetings taking place in parallel. Thus, recording an audio or video call is a good way to provide the meeting content to those people who were not able to join, as modern enterprise real-time collaboration platforms allow for recording collaboration sessions and replaying them at a later point in time.

However, listening to a one-hour conference call, or rather to the recording of it, is not always possible, even at a later point in time, since the same amount of time (1 hour) must be spent in order to become fully aware of everything that has been discussed. This is rather time consuming and thus not very convenient for the user.

One way to save some time is to create a transcription of the recording. This helps, since reading a text (up to 1,000 words/minute) is usually much faster than listening to the same text (125-200 words/minute). Mechanisms to create a summary of a meeting, based on a recording and a transcription created from the recording, help even more to get a quick overview of the essence of a meeting. However, quite often the user who was not able to participate in a meeting or conference is not interested in the meeting summary, but rather he or she needs to listen to the important parts of the original discussion in order to get a complete understanding, including who contributed to the discussion. Also, the way the discussion evolved may be an important aspect for some users who missed a meeting or conference.

Therefore, the present invention is based on the object to provide a method to easily, efficiently and quickly identify content relevant to a user who has, for example, missed a real-time web communication and collaboration session, such as a meeting or a conference. The object is solved by a computer-implemented method for processing a recording of a real-time web collaboration and communication session having the features according to claim 1, as well as by a real-time web communication and collaboration platform having the features according to claim 12 and an application having the features according to claim 13. Preferred embodiments of the invention are defined in the respective dependent claims.

Thus, according to the present invention, a computer-implemented method for processing a recording of a real-time web collaboration and communication session is provided, wherein the method comprises the steps of

- transcribing the recorded audio data of the real-time web collaboration and communication session so as to obtain text data of the recorded audio data;

- subdividing the text data into time portions of a predetermined length;

- creating a word cloud for every time portion; and

- combining the word clouds created for every time portion in a chronological order.

Thus, a user who was not able to join an online meeting or conference will be able to comprehend very quickly what has been discussed during the course of a conference call, and when it has been discussed, without having to go through the whole recording. According to the inventive method, the user may easily navigate to those parts of the recording which are of particular interest to him or her.
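By way of illustration only, the steps listed above may be sketched as follows. The sketch assumes that the transcription step has already produced timestamped text segments, and it represents a word cloud simply as a word-to-frequency map. The names `subdivide`, `word_cloud`, `word_cloud_sequence` and `STOP_WORDS` are hypothetical and do not appear in the application.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "we", "it"}

def subdivide(segments, portion_seconds=300):
    """Group (start_seconds, text) transcript segments into time portions
    of a fixed, configurable length (default: 5 minutes)."""
    portions = {}
    for start, text in segments:
        index = int(start // portion_seconds)
        portions.setdefault(index, []).append(text)
    # Chronological order; portions without any speech are simply absent.
    return [(i, " ".join(portions[i])) for i in sorted(portions)]

def word_cloud(text):
    """Word -> frequency; a renderer would map the counts to font sizes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def word_cloud_sequence(segments, portion_seconds=300):
    """One word cloud per time portion, combined in chronological order."""
    return [(i, word_cloud(t)) for i, t in subdivide(segments, portion_seconds)]

# Made-up transcript segments: (start offset in seconds, recognized text).
segments = [
    (12.0, "We migrate the Java service"),
    (95.5, "Java build is slow"),
    (310.0, "Test coverage and test plan"),
    (640.0, "Metrics dashboard"),
]
sequence = word_cloud_sequence(segments)
```

Changing `portion_seconds` corresponds to configuring the length of the time portions.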

In particular, word clouds give a user a very intuitive idea of what a given text is about, so that the user is able to find relevant information very quickly and easily. Usually, word clouds are static in the sense that they cover the whole text. Thus, creating a word cloud from a conference recording provides an easy way to grasp an overview of the conference content and goes, due to the aspect of visually emphasizing some keywords more than others, well beyond auto-tagging features available in social networking platforms. Moreover, according to the inventive method, since a sequence of word clouds is created, the user is also provided with an overview of how the content of a conference call evolves over time.

According to a preferred embodiment of the invention, the time portions are configurable.

According to another preferred embodiment of the invention, the method further comprises a step of determining the top keyword of every time portion and listing the thus determined top keywords in a chronological order.

Preferably, the method further comprises a step of aggregating the time portions on a first level, wherein a predetermined number of chronologically adjacent time portions are aggregated so as to form together a first aggregated time portion.

It is also advantageous if the step of aggregating is carried out on further levels until all time portions are aggregated in one top aggregated time portion.

According to still another preferred embodiment of the invention, the method further comprises a step of mapping the time portions to the section of the recorded audio data. Also, the method may further comprise a step of mapping the text data of the time portions to the section of the recorded audio data.

According to yet another preferred embodiment of the invention, the method further comprises a step of starting playback of the recorded audio data of a time portion, when a selection of the time portion is detected.

Preferably, the method further comprises a step of generating a video sequence of the word clouds according to the chronological order of the respective time portions.

Moreover, the method may further comprise a step of assigning speakers to the respective time portions and overlaying the video sequence of the word clouds with additional data, in particular, with pictures or representations of the speakers.

According to still another preferred embodiment of the invention, the method further comprises a step of assigning documents or presentations to the respective time portions and overlaying the video sequence of the word clouds with the documents.

Further, according to the present invention, a real-time web communication and collaboration platform is provided, comprising a media server to which a plurality of clients is connected, wherein the media player is adapted to run an application for carrying out the method as described above.

Moreover, according to the present invention, an application for performing a real-time web communication and collaboration session on a real-time web communication and collaboration platform is provided, which is adapted for carrying out the method as described above.

The invention and embodiments thereof will be described below in further detail in connection with the drawings.

Fig. 1 shows a schematic illustration of a word cloud sequence according to an embodiment of the invention;

Fig. 2 shows a schematic illustration of a word cloud aggregation procedure according to an embodiment of the invention;

Fig. 3 shows a schematic illustration of the hierarchical order of the word clouds according to an embodiment of the invention; and

Fig. 4 shows a schematic illustration of a temporal word cloud video track according to an embodiment of the invention.

Fig. 1 schematically shows an illustration of a word cloud sequence 1 according to an embodiment of the invention. The word cloud sequence 1 has been created as follows. A real-time web collaboration and communication session, for example, a conference call, which has been recorded, is transcribed. That means the recorded audio data of the real-time web collaboration and communication session is transformed into text data, which is then subdivided into configurable time portions 2, 2’, 2” of a predetermined length, for example, 5 minutes per time portion. However, any other value may be selected for the predetermined length of the time portions 2, 2’, 2”. Then, a word cloud 3, 3’, 3” is created for every time portion 2, 2’, 2”, and the resulting word clouds 3, 3’, 3” created for every time portion 2, 2’, 2” are combined in a chronological order. Then, the top keyword 4, 4’, 4” of every time portion 2, 2’, 2” is determined and is listed, here, right below the respective word cloud 3, 3’, 3”.

A user who was not able to participate in the real-time conference call may now determine very easily and quickly which parts of the conversation are relevant to him or her by looking at and comparing the different word clouds 3, 3’, 3” created for the different timeframes or time portions 2, 2’, 2”. Also, it becomes clearly visible to the user how the discussion in the conference has evolved. Namely, while the top keyword 4 in the first 5 minutes corresponding to the first time portion 2 was “Java”, it was less prominent in the second 5 minutes corresponding to the second time portion 2’, and stayed on that level in the third time portion 2”.

The keyword “Test”, however, became much more prominent in the second 5 minutes corresponding to the second time portion 2’, in which it has been determined to be the top keyword 4’ of that time portion 2’, but it basically disappeared from the discussion in the third time portion 2”.

In the third time portion 2”, however, the discussion was mainly focusing on the term “Metrics” as the top keyword 4”, which was not a topic at all in the previous time portions 2 and 2’.
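The determination of the top keyword per time portion may, for instance, be sketched as follows. The frequency counts are invented for illustration and merely mirror the “Java”, “Test”, “Metrics” progression described above; `top_keyword` is a hypothetical helper name.

```python
from collections import Counter

def top_keyword(cloud):
    """Return the most frequent word of one word cloud, or None if it is empty."""
    ranked = cloud.most_common(1)
    return ranked[0][0] if ranked else None

# Illustrative counts only (assumed, not taken from the application).
clouds = [
    Counter({"java": 9, "test": 3}),     # first time portion
    Counter({"test": 8, "java": 4}),     # second time portion
    Counter({"metrics": 7, "java": 4}),  # third time portion
]

# Top keywords listed in chronological order, one per time portion.
top_keywords = [top_keyword(c) for c in clouds]
```

Listing `top_keywords` below the rendered word clouds would give the kind of keyword row described for Fig. 1.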

Fig. 2 shows a schematic illustration of a word cloud aggregation procedure according to an embodiment of the invention. As can be seen here, the base word clouds 3, 3’, 3” of the word cloud sequence 1, which have been determined and created according to the method described with respect to Fig. 1, may be aggregated on several levels until one final top word cloud 3_n is obtained for the entire recording. Namely, in the embodiment shown here, the first and second base word clouds 3 and 3’ are aggregated on a first level L1 so as to form a first aggregated word cloud 3_1. Then, the first aggregated word cloud 3_1 is further aggregated with the third base word cloud 3” on the next level L2, which here is already the top level LT, since the procedure started from only three base word clouds 3, 3’, 3”. If there are more base word clouds, then there will be further levels of aggregation until all base word clouds are aggregated into one top word cloud 3_n which comprises the content of the entire recording, since the aggregation of the word clouds 3, 3’, 3” basically is the aggregation of the time portions 2, 2’, 2” into which the recording was initially subdivided. Thus, when the first time portion 2 having a length of 5 minutes is aggregated on the first level L1 with the chronologically adjacent, subsequent second time portion 2’, also having a length of 5 minutes, the resulting first aggregated time portion 2_1 will comprise audio or text data corresponding to an original length of 10 minutes. The aggregated time portion on the next level L2 or LT (the top level, which here is already reached due to the initial situation of 3 time portions in total) will then correspond to a length of 15 minutes. Thus, the top level LT word cloud 3_n represents a summary of the whole conference call, showing the topics that have been discussed in the conference call.
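One possible way to realize such a level-wise aggregation is sketched below. It assumes that word clouds are held as frequency maps and that aggregating two clouds means adding their word frequencies; the application does not prescribe either. The function name `aggregate_levels` and the parameter `group_size` are hypothetical.

```python
from collections import Counter

def aggregate_levels(base_clouds, group_size=2):
    """Return all levels: levels[0] is the base level, and the last level
    holds the single top word cloud covering the entire recording."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [list(base_clouds)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        merged = []
        # Merge each run of `group_size` chronologically adjacent clouds.
        for i in range(0, len(current), group_size):
            total = Counter()
            for cloud in current[i:i + group_size]:
                total.update(cloud)  # adds the word frequencies
            merged.append(total)
        levels.append(merged)
    return levels

# Three base clouds with invented counts, as in the three-portion embodiment.
base = [
    Counter({"java": 9, "test": 3}),
    Counter({"test": 8, "java": 4}),
    Counter({"metrics": 7, "java": 4}),
]
levels = aggregate_levels(base)
```

With three base clouds and `group_size=2`, the first level merges the first two clouds and carries the third one over, and the second level merges everything, which matches the 10-minute and 15-minute spans described above.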

It is noted that each time portion 2, 2’, 2” references a specific duration within the recorded conference and also within the conference transcription. This is used to create a mapping between the generated word clouds 3, 3’, 3” and the concrete section in the recorded video, which is further described with respect to Fig. 3 and Fig. 4. This mapping, in turn, is used to provide means to navigate back and forth between a time portion corresponding to a word cloud and the conference recording.

Fig. 3 shows a schematic illustration of the hierarchical order of the word clouds. In particular, the diagram shows the different layers or levels, starting from the base layer or base level LB of the initially created base word clouds 3, 3’, 3”, 3’”, and how they span different parts of the original video and/or audio recording 5. In addition, it is shown here how specific word clouds 3, 3’, 3”, 3’” map to respective specific parts of the original recorded video track 5, as indicated by the respective lines and arrows.

The user is presented with the final, aggregated word cloud 3_n as a top level cloud, and by a certain action, for example, by clicking on the graphic or by showing details, the user can drill down to the next layer or level. On that layer or level, the word clouds which were combined into the top layer are shown, each with a different visualization of the keywords, as already explained with respect to Fig. 1 and Fig. 2.

Again, by means of a certain action, e.g. a click on one of the word clouds, the user can drill down into the next layer or level, and thus, he or she will easily find the position where a certain topic in the conversation started or, respectively, where the specific (top) keyword appears. This step can be repeated for each level until the base level LB has been reached. Of course, it is also possible to go back up through the layers and to explore other regions of the temporal word clouds. Once the user spots a word cloud he or she is particularly interested in, he or she may start the playback of the original recording relating to the time portion corresponding to that word cloud by means of a certain action. Playback will then start exactly from the position for which the word cloud containing the keyword of interest was created. This is possible since the exact position and timespan for each word cloud time portion in the originating video is known.
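The mapping from a selected word cloud to a playback position may be sketched as follows, under the assumption that all base time portions have the same length and that each aggregation level merges a fixed number of adjacent portions. The names `portion_span` and `playback_start` and all parameters are hypothetical.

```python
def portion_span(index, level=0, portion_seconds=300, group_size=2, duration=None):
    """Start and end offset (in seconds) covered by the word cloud at
    position `index` on aggregation level `level` (0 = base level)."""
    width = portion_seconds * group_size ** level
    start, end = index * width, (index + 1) * width
    if duration is not None:
        end = min(end, duration)  # the last cloud may cover a shorter span
    return start, end

def playback_start(index, **kwargs):
    """Offset at which playback begins when the given cloud is selected."""
    return portion_span(index, **kwargs)[0]

# Second base word cloud of a recording split into 5-minute portions.
span = portion_span(1)             # (300, 600)
start = playback_start(1)          # 300
# Top cloud of a 15-minute recording, two aggregation levels above the base.
top_span = portion_span(0, level=2, duration=900)  # (0, 900)
```

A player would seek to `start` when the user selects the corresponding word cloud.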

Fig. 4 shows a schematic illustration of a temporal word cloud video track 6 according to an embodiment of the invention. Here, for even better visualization, a video-like ‘view’ is created from the single word clouds 3, 3’, 3”, 3’”, 3””, etc., which shows the word cloud sequence 1 and how the prominence of certain keywords (not shown here, see Fig. 1) evolves during the course of the conference.

For this use case, it is preferable that the original recorded video track is subdivided into small time periods, from which the respective word clouds 3, 3’, 3”, 3’”, 3””, etc. are created, so as to have rather small word cloud chunks, so that the visualization, basically a concatenation of the separate word cloud chunks or the word cloud sequence 1, is smooth, even when played at different playback speeds.

The size of the resulting temporal word cloud video 6 depends on the number of word cloud chunks and the technique used to blend between the single pictures, but it is definitely much shorter than the original recorded video, as is schematically illustrated in the figure. It can be seen that the original duration of the recorded video track 5 was 60 minutes. The temporal word cloud video track 6 is much shorter, as indicated by the shorter bar which is used here to illustrate the temporal word cloud video track 6.

It is noted that a VCR-like control 7 may be used for controlling the temporal word cloud video track 6, including forward, backward, fast forward, fast backward, etc., which basically are the same controls the user uses when manipulating the original video recording 5. Thus, the VCR-like controls 7 may be applied both to navigate through the temporal word cloud video 6 and to navigate through the original recorded video when blending from the temporal word cloud video to the recorded video.

At any time, when a keyword is spotted that the user is particularly interested in, the original recording may be blended in by means of a certain action. Again, this is possible since the exact position and timespan for each word cloud chunk in the original recorded video track 5 is known. This allows a seamless integration of the temporal word cloud video track 6 in the playback of the recorded video, without switching tools. Thus, the temporal word cloud video track 6 and the recorded original video track 5 may be controlled within the same tool and with the same kind of controls, namely, the VCR-like control 7 mentioned above. Technically, this may be accomplished by having separate video tracks in one Matroska video container (MKV) file and by blending in either one or the other video stream, namely, either the temporal word cloud video track 6 or the original recorded video track 5.
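As one conceivable way of placing both video tracks into a single MKV file, the following sketch builds a command line for the `ffmpeg` tool. The use of `ffmpeg` is an assumption of this sketch and is not mentioned in the application, the file names are placeholders, and the command is only assembled, not executed.

```python
def mux_command(recording, cloud_video, output):
    """Build an ffmpeg invocation that copies all streams of the recording
    plus the video stream of the word cloud video into one MKV file."""
    return [
        "ffmpeg",
        "-i", recording,    # input 0: original recording (video and audio)
        "-i", cloud_video,  # input 1: temporal word cloud video
        "-map", "0",        # keep every stream of input 0
        "-map", "1:v",      # add the video stream(s) of input 1
        "-c", "copy",       # remux without re-encoding
        output,
    ]

# Placeholder file names (assumptions for illustration).
cmd = mux_command("recording.mkv", "wordcloud.mkv", "combined.mkv")
print(" ".join(cmd))
```

A player that supports track selection could then switch between the two video tracks of the resulting file while keeping the same playback position.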

Moreover, it is noted that the above-described embodiment may be further improved by overlaying the original recorded video track 5 with the temporal word cloud track 6 in a semi-transparent manner, so that the speakers in the conference can be seen or presentations which were shared may be observed.

Further, it is noted that this figure illustrates the temporal mapping between the temporal word cloud chunks 3, 3’, 3”, 3’”, 3””, etc. and the original recorded video track 5.

Thus, summarizing the above, a very convenient and user-friendly method is provided for a user to watch only those parts of a conference call recording that are of particular interest to that user. This is enabled by subdividing a transcription of an originally recorded video/audio track into chunks of a fixed but configurable length, and by creating separate word clouds for these chunks, which actually are the subdivided time portions of the originally recorded video or audio sequence. These word clouds 3, 3’, 3” provide a summary of the content of the conference call at any time, which is very easy to grasp for a user.

As mentioned above, according to a very preferable embodiment, a temporal word cloud video track 6 is created from the temporal word cloud sequence 1, which allows the user to use well-known navigation means so as to navigate through the temporal word cloud video track 6 (forward, backward, fast forward, fast backward, etc.) and also to switch seamlessly between the temporal word cloud track 6 and the recorded video track 5.

The embodiments described above allow for the creation of a kind of static temporal word cloud, which is the same for every user. However, in order to make the usage of the temporal word cloud even more effective, a user-specific skill profile may be used to customize the temporal word cloud for the user. For instance, such a skill profile may contain specific keywords. These keywords may be emphasized or marked otherwise in the resulting word cloud chunks. Thus, the user will be able to find the sections of the discussion which he or she is most likely interested in even more easily. Another approach is to analyze the conversational behavior of the user. For example, keywords that are often used in conversations in which the user is actively participating may also be used to create a skill profile of the user.
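A minimal sketch of such a customization is given below. It assumes that emphasis is expressed by multiplying the weight of the profile keywords by a fixed factor, which is only one of many possible ways of marking them. The name `personalize` and the parameter `boost` are hypothetical.

```python
from collections import Counter

def personalize(cloud, profile_keywords, boost=2.0):
    """Return word -> display weight, with skill-profile keywords emphasized."""
    profile = {k.lower() for k in profile_keywords}
    return {
        word: count * boost if word in profile else float(count)
        for word, count in cloud.items()
    }

# Invented word cloud chunk and skill profile for illustration.
chunk = Counter({"java": 4, "metrics": 7})
weights = personalize(chunk, {"Java"})
```

Here “java” ends up with a higher display weight than “metrics” for this particular user, although it was mentioned less often in the chunk.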

With those techniques, the created temporal word clouds may be tailored or customized for the user, making it even more efficient for him or her to navigate through temporal word clouds and thus, through the originally recorded video track.

Reference numerals

1 word cloud sequence

2, 2’, 2” time portion

3, 3’, 3” word cloud

4, 4’, 4” top keyword

5 video track

6 temporal word cloud video track

7 VCR-like control