METHODS, COMPUTER PROGRAM PRODUCTS AND DEVICES FOR AUTOMATICALLY SYNCHRONIZING AN AUDIO TRACK WITH A PLURALITY OF PAGES

Document Type and Number:

WIPO Patent Application WO/2018/078463

Abstract:

The present invention relates generally to methods, computer program products and devices for automatically synchronizing an audio track with a plurality of pages, wherein categorization of content of the pages and categorization of parts of speech content in the audio track are used for the synchronization.

Inventors:

THÖRN OLA (SE)

Application Number:

PCT/IB2017/055817

Publication Date:

May 03, 2018

Filing Date:

September 26, 2017

Assignee:

SONY MOBILE COMMUNICATIONS INC (JP)
SONY MOBILE COMM AB (SE)

International Classes:

G11B27/10; G07F17/30; G09B5/06

Domestic Patent References:

WO2005069171A1	2005-07-28

Foreign References:

US20030188255A1	2003-10-02
US6578040B1	2003-06-10
US20080055468A1	2008-03-06
US6636238B1	2003-10-21
JP2002304420A	2002-10-18

Other References:

YUAN, AUTOMATIC VIDEO GENRE CATEGORIZATION USING HIERARCHICAL SVM

Attorney, Agent or Firm:

AWAPATENT AB (SE)

Claims:

CLAIMS

1. A method (400) for automatically synchronizing an audio track (104) with a plurality of pages (102; 102a-c), the method comprising the steps of:

transcribe (S402) speech content in the audio track into text,

for each page of the plurality of pages, categorize (S404) content of the page into at least one first category (202a-g),

divide (S406) the text into a sequence of parts of text, each part of text corresponding to a time span (t1-t3) in the audio track, and for a part of text, categorize (S408) the part of text into at least one second category (202a-g),

map (S410; 106a-c) the part of the text to one of the plurality of pages by comparing the at least one category of each page with the at least one second category.

2. The method of claim 1, further comprising the step of: tagging the audio track at the time span in the audio track to which the part of the text corresponds with a pointer to said one of the plurality of pages.

3. The method of any of claims 1-2, wherein at least one page of the plurality of pages comprises an image, wherein the step of categorize the content of said at least one page into at least one first category comprises analyzing the image using an image recognition algorithm.

4. The method of any one of claims 1-3, wherein at least one page of the plurality of pages comprises text, wherein the step of categorize the content of said at least one page into at least one first category comprises indexing the text using a key word extraction algorithm.

5. The method of any one of claims 1-4, wherein the step of categorize the part of text into at least one second category comprises indexing the text using a key word extraction algorithm.

6. The method of any one of claims 4-5, wherein the key word extraction algorithm comprises at least one from the list of: support vector machine, K-nearest neighbor algorithm, K-means clustering and boosting algorithm.

7. The method of any one of claims 1-6, wherein the step of categorize content of the page into at least one first category comprises:

determining a plurality of categories for the content of the page, running the plurality of categories through a database, wherein the database comprises a hierarchal structure (200) of categories, and retrieving a single category from the database,

categorize the content of the page into said single category retrieved from the database.

8. The method of any one of claims 1-7, wherein the step of categorize the part of text into at least one second category comprises:

determining a plurality of categories for the part of text,

running the plurality of categories through a database, wherein the database comprises a hierarchal (200) structure of categories, and retrieving a single category from the database,

categorize the part of text into said single category retrieved from the database.

9. The method of any one of claims 1-8, performed iteratively (L1) for each specific part of text of the sequence of parts of texts.

10. The method of claim 9, wherein the plurality of pages is in a prearranged sequence, wherein the step of mapping the part of the text to one of the plurality of pages comprises:

checking if any part of text in the sequence of parts of text has already been mapped to a page of the plurality of pages,

mapping the part of the text such that the mapping does not break the prearranged sequence of the pages with respect to an order of the time spans in the audio track.

11. The method of any one of claims 9-10, wherein each specific part of text of the sequence of parts of texts is mapped to a unique page among the plurality of pages.

12. The method of any one of claims 9-10, wherein more than one part of text among the sequence of parts of text is mapped to a same single page among the plurality of pages.

13. The method of any one of claims 1-12, wherein at least one page of the plurality of pages comprises text, wherein the method further comprising the steps of:

determining a language of the speech content in the audio track,

determining that the text of the at least one page is of a language other than the language of the speech content,

prior to categorize content of the at least one page into at least one first category, translating the text of the at least one page into the language of the speech content using a computerized translation service.

14. A computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of any one of claims 1-13 when executed by a device having processing capability.

15. A device (300) for automatically synchronize an audio track (104) with a plurality of pages (102; 102a-c), the device comprising a processing unit configured to:

transcribe (S402) speech content in the audio track into text,

for each page of the plurality of pages, categorize (S404) content of the page into at least one first category (202a-g),

divide (S406) the text into a sequence of parts of text, each part of text corresponding to a time span (t1-t3) of the audio, and for a part of text, categorize (S408) the part of text into at least one second category (202a-g),

map (S410, 106a-c) the part of the text to one of the plurality of pages by comparing the at least one category of each page with the at least one second category.

Description:

METHODS, COMPUTER PROGRAM PRODUCTS AND DEVICES FOR AUTOMATICALLY SYNCHRONIZING AN AUDIO TRACK WITH A PLURALITY OF PAGES

Technical field

The present invention relates generally to methods, computer program products and devices for automatically synchronizing an audio track with a plurality of pages.

Background of the invention

Online learning is a growing area, and more and more course content is available online. Examples of such online course content include recorded lectures and documents comprising a plurality of pages, such as PowerPoint slides or PDF documents, which often include e.g. key words or content which should emphasise and clarify parts of a lecture. The content of such documents may include both text and images, and also multimedia content such as sound clips, video clips or animations. Mapping the documents to the recorded lecture, e.g. determining which page of the document should be shown at a specific time span of the lecture, is a labour-intensive task which often requires that someone listens through the entire lecture and manually makes sure that the document switches page at the right point in time. Considering the vast number of recorded lectures available online, this is not feasible for all those lectures.

Some algorithms have been developed for automatically synchronizing an audio track with a plurality of pages. One such algorithm is presented in "Automatic synchronization of speech transcript and slides in presentation" (Chen and Heng, Institute for Infocomm Research). However, these algorithms are often limited and targeted to a specific problem, and are not applicable to much of the available online course content.

There is thus room for improvement within this field.

Summary of the invention

In view of the above, an objective of the invention is to solve or at least reduce one or several of the drawbacks discussed above. Generally, the above objective is achieved by the attached independent patent claims.

According to a first aspect, the present invention is realized by a method for automatically synchronizing an audio track with a plurality of pages, the method comprising the steps of:

- transcribe speech content in the audio track into text,

- for each page of the plurality of pages, categorize content of the page into at least one first category,

- divide the text into a sequence of parts of text, each part of text corresponding to a time span in the audio track, and for a part of text, categorize the part of text into at least one second category, and

- map the part of the text to one of the plurality of pages by comparing the at least one category of each page with the at least one second category.

By the term "page" should, in the context of present specification, be understood a virtual placeholder for content such as text, images, sound clips, video clips, animations etc. An example of a page is a slide in a Microsoft Office PowerPoint slideshow, a page in a Microsoft Office Word document, a page in an Apple Keynote document, a page in a Google Docs document or a page in any other suitable multipage document which can hold the above described content. According to embodiments, a page does not necessarily need to be part of a document. For example, an entry in a database pointing to content (via e.g. URLs or other types of links) is an example of a page. A folder in a computer file system may also be considered as a page, where the content in the folder (text, images, sound clips, video clips, animations etc.) is the content of the page.

By the term "category" should, in the context of present specification, be understood a word describing shared characteristics of e.g. a part of text, or content of a page.

By the term "part of text" should, in the context of present specification, be understood a plurality of consecutive words from the text transcribed from the speech content in the audio track.

In a second aspect, the present invention provides a computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of the first aspect when executed by a device having processing capability.

In a third aspect, the present invention provides a device for automatically synchronizing audio with a plurality of pages, the device comprising a processing unit configured to:

- transcribe speech content in the audio into text,

- for each page of the plurality of pages, categorize content of the page into at least one first category,

- divide the text into a sequence of parts of text, each part of text corresponding to a time span of the audio, and for a part of text, categorize the part of text into at least one second category, and

- map the part of the text to one of the plurality of pages by comparing the at least one category of each page with the at least one second category.

The second and third aspects may generally have the same features and advantages as the first aspect.

Other objectives, features and advantages of the present invention will appear from the following detailed disclosure, from the attached dependent claims as well as from the drawings.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [element, device, component, means, step, etc]" are to be interpreted openly as referring to at least one instance of the element, device, component, means, step, etc., unless explicitly stated otherwise.

The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Brief description of the drawings

The above, as well as additional objects, features and advantages of the present invention, will be better understood through the following illustrative and non-limiting detailed description of embodiments of the present invention, with reference to the appended drawings, where the same reference numerals will be used for similar elements, wherein:

figure 1 schematically shows synchronization of an audio track with a plurality of pages according to embodiments,

figure 2 shows a hierarchal structure of categories according to embodiments,

figure 3 shows by way of example a device for automatically synchronizing an audio track with a plurality of pages using a database comprising the hierarchal structure of figure 2,

figure 4 shows a method for automatically synchronizing an audio track with a plurality of pages.

Detailed description of embodiments

Figure 1 shows by way of example an audio track 104 and a plurality of pages 102. The audio track may originate from a video recording, e.g. recorded by a microphone on a video camera/smart phone etc., or it may be an audio recording without any video content. The audio track 104 comprises speech content which may be synchronized with the plurality of pages 102. As described above, a page 102a-c is a placeholder for content such as text, audio, video, images, animations etc.

The synchronization can be automated using the method described herein. The synchronization process is shown in figure 1 and will be described in conjunction with figure 4.

Speech content in the audio track 104 is transcribed S402 into text. Transcribing speech into text is well known in the art, and this step is left for the skilled person to implement. However, it should be noted that many algorithms and software products exist for transcribing speech into text, for example software implementing the Google Cloud Speech API. According to some embodiments, the step of transcribing speech into text also comprises manual editing of the transcribed text by a human.

Each page 102a-c of the plurality of pages 102 is categorized S404, such that the content of each page is categorized into at least one first category. The categorization is further described below.

The transcribed text is divided S406 into a sequence of parts of text, each part of text corresponding to a time span t1-t3 in the audio track 104. This may be done using any natural language processing (NLP) or computational linguistics algorithm. An example may be to use software implementing the Natural Language Toolkit available for the Python programming language. Alternatively, or additionally, analysis of the audio track, e.g. using interruptions in the speech content in the audio track 104 as cues, may be used for dividing S406 the transcribed text into the sequence of parts of text.
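As a rough illustration of the pause-based alternative, the following sketch splits a transcript wherever the speaker is silent for longer than a threshold. It assumes that the transcription step yields word-level timestamps as (word, start, end) tuples; that input format, the function names and the 1.5 second threshold are assumptions made for this example and are not specified in the description.

```python
def _close_part(chunk):
    """Collapse a run of timed words into (text, start_s, end_s)."""
    text = " ".join(word for word, _, _ in chunk)
    return (text, chunk[0][1], chunk[-1][2])


def split_on_pauses(words, min_pause=1.5):
    """Split timed words into parts of text at pauses in the speech.

    words: list of (word, start_s, end_s) tuples in time order (assumed format).
    min_pause: silence in seconds that starts a new part (assumed value).
    Returns a list of (text, start_s, end_s), one entry per part of text.
    """
    parts = []
    current = []
    for word in words:
        # A gap between the previous word's end and this word's start
        # that reaches the threshold closes the current part.
        if current and word[1] - current[-1][2] >= min_pause:
            parts.append(_close_part(current))
            current = []
        current.append(word)
    if current:
        parts.append(_close_part(current))
    return parts


timed_words = [
    ("welcome", 0.0, 0.5),
    ("everyone", 0.6, 1.1),
    ("today", 3.0, 3.4),
    ("engines", 3.5, 4.0),
]
parts = split_on_pauses(timed_words)
```

Each returned tuple carries its own time span, so a part of text can later be mapped to a page without consulting the audio again.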

A first part of the text of the sequence of parts of text, i.e. corresponding to one of the time spans t1-t3 in figure 1, is categorized S408 into at least one second category. The first part of the text that is categorized does not necessarily correspond to the part of the text that comes first in a time order in the audio track 104. The categorization is further described below.

When the first part of the text has been categorized S408, this part is mapped S410; 106a-c to one of the plurality of pages 102a-c. This is done by comparing the at least one category of each page 102a-c with the at least one second category. The time span t1-t3 (corresponding to the first part of the text that is currently processed and mapped) of the audio track 104 will thus be mapped (matched, etc.) with a specific page 102a-c among the plurality of pages 102. Consequently, this specific page 102a-c may be shown while this time span of the audio track is played. This may be implemented in many ways; one way may be to tag the audio track 104 at the time span in the audio track to which the first part of the text corresponds with a pointer to said one of the plurality of pages. In the example of figure 1, the first time span t1 is matched with the first page 102a. Consequently, the audio track 104 may be tagged with a pointer 106a at the beginning of time span t1, which points to the first page 102a. The first page 102a may then be shown until the audio track 104 is tagged with another pointer, in this case pointer 106b, pointing to another page 102b of the plurality of pages 102.
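One possible way to represent such tags is a time-sorted list of (start time, page) pointers, from which the page to show at any playback position can be looked up. The sketch below is illustrative only; the tuple layout, the function name `page_at` and the example times are assumptions, not part of the described method.

```python
from bisect import bisect_right


def page_at(pointers, time_s):
    """Return the page that should be shown at playback time time_s.

    pointers: list of (start_s, page_id) tuples sorted by start_s, one per tag
              placed on the audio track (assumed representation).
    Returns None if time_s lies before the first pointer.
    """
    starts = [start for start, _ in pointers]
    # Index of the last pointer whose start time is <= time_s.
    index = bisect_right(starts, time_s) - 1
    return pointers[index][1] if index >= 0 else None


# Hypothetical tags for the three time spans of figure 1.
pointers = [(0.0, "102a"), (95.0, "102b"), (210.5, "102c")]
```

A page stays on screen until the next pointer is reached, which matches the behaviour described for pointers 106a and 106b.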

According to embodiments, each specific part of text of the sequence of parts of text is mapped to a page 102a-c. This may be implemented by iteratively L1 categorizing and mapping a further part of text until it is determined S412 that all parts of text have been categorized.

The iterative approach may be implemented such that, in case the plurality of pages 102 is in a prearranged sequence (such as pages in a deck of PPT-slides, or pages in a PDF document), the mapping of the parts of text does not break the prearranged sequence of the pages 102a-c with respect to an order of the time spans t1-t3 in the audio track 104. This is shown in figure 1 where the part of text corresponding to the first time span t1 in the audio track 104 is mapped to the first page 102a (in a sequence running 102a, 102b, 102c) among the plurality of pages 102, the second time span t2 is mapped to the second page 102b etc. This may be implemented by including in the automatic synchronizing method the step of checking if any part of text in the sequence of parts of text has already been mapped to a page of the plurality of pages, and making the mapping decision also based on this knowledge. For example, the method may be performed on the parts of the text in a time order, such that the first part of text to be mapped is the one corresponding to the first (earliest) time span t1 in the audio track, the second part to be mapped is the one corresponding to the second time span t2 etc. In this embodiment, a time span (part of text) may only be mapped to the same page or a later page compared to the previously mapped time span. It should be noted that not all pages 102a-c among the plurality of pages need to be mapped to a part of text. According to embodiments, one or more of the pages may be skipped. For example, the first page 102a (or the second page 102b or the last page 102c) among the plurality of pages does not need to be mapped to any part of text. According to some embodiments, the order of the pages 102a-c among the plurality of pages 102 is not a decisive factor when mapping the parts of texts, and an earlier time span may thus be mapped to a later page compared to a later time span or vice versa.

Also, several parts of text may be mapped to the same page 102a-c among the plurality of pages 102. In other words, more than one part of text among the sequence of parts of text may according to embodiments be mapped to a same single page 102a-c among the plurality of pages 102. According to other embodiments, each specific part of text of the sequence of parts of texts is mapped to a unique page 102a-c among the plurality of pages 102. The latter embodiment is shown in figure 1.
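The time-ordered embodiment, in which a part of text may only be mapped to the same page or a later page than the previously mapped part, can be sketched as a simple greedy loop. The scoring by number of shared categories and the choice to stay on the current page when nothing matches are assumptions made for this example; the description leaves these details open.

```python
def map_in_sequence(part_categories, page_categories):
    """Map parts of text to pages without breaking the page order.

    part_categories: list of category sets, one per part of text, in time order.
    page_categories: list of category sets, one per page, in page order.
    Returns one page index per part; the indices never decrease.
    """
    if not page_categories:
        raise ValueError("at least one page is required")
    mapping = []
    current = 0  # earliest page still allowed for the next part
    for categories in part_categories:
        best_index, best_score = current, 0
        # Only the current page and later pages are candidates.
        for index in range(current, len(page_categories)):
            score = len(categories & page_categories[index])
            if score > best_score:
                best_index, best_score = index, score
        # With no shared category, the part stays on the current page (assumption).
        mapping.append(best_index)
        current = best_index
    return mapping


pages = [{"intro"}, {"engine"}, {"brakes"}]
parts = [{"intro"}, {"engine"}, {"intro", "engine"}, {"brakes"}]
mapping = map_in_sequence(parts, pages)
```

In this example the third part also mentions "intro", but it is kept on the second page because going back would break the prearranged sequence. A greedy choice like this can skip pages, which the description explicitly allows, but it may also jump ahead too early; a production implementation might instead optimise over the whole sequence.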

The mapping S410 may involve mapping a part of the text, i.e. corresponding to a specific time span t1-t3, being of a specific category with a page 102a-c being of the same category. The mapping S410 may involve mapping a part of the text having a plurality of categories with a page 102a-c which, among the plurality of pages 102, shares most of the categories with the part of the text.

According to some embodiments, the step of categorizing content of the page into at least one first category comprises determining a plurality of categories for the content of the page. In order to improve and simplify mapping of such a page, the plurality of categories may be investigated for a common concept. For example, in case the page has content which is categorized into two categories, cars and motorbikes, a common concept may be motor vehicles, and the page may thus be categorized into the category of motor vehicles. Such a hierarchal structure of categories may be available to the device running the method of automatically synchronizing an audio track 104 with a plurality of pages 102 via e.g. a database comprising a hierarchal structure of categories. Such a structure 200 is by way of example shown in figure 2. At the lowest level of the structure 200, three categories 202d, 202e, 202g exist. The category named category 4 (referred to as 202d, 202f) is present at two places in the structure. Categories 4 and 5 (referred to as 202d-e) belong to the same category 2 (referred to as 202b), whereas categories 4 and 6 (referred to as 202f-g) belong to the same category 3 (referred to as 202c). Categories 2 and 3 belong to the same category 1 (referred to as 202a). The hierarchal structure in figure 2 is just an example; according to embodiments, three or more categories may have the same parent category etc.

Using a hierarchal structure 200 as described above, in case a plurality of categories is determined for a page, the step of categorizing content of the page into at least one first category may comprise the steps of determining a plurality of categories for the content of the page, running the plurality of categories through a database, wherein the database comprises a hierarchal structure 200 of categories, retrieving a single category from the database, and categorizing the content of the page into said single category retrieved from the database.

In a similar way, the step of categorizing the part of text into at least one second category may comprise: determining a plurality of categories for the part of text, running the plurality of categories through a database, wherein the database comprises a hierarchal structure 200 of categories, retrieving a single category from the database, and categorizing the part of text into said single category retrieved from the database.
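The reduction of several categories to a single common concept can be illustrated with a small in-memory stand-in for the database. The child-to-parents table below, its contents beyond the cars and motorbikes example, and the rule of picking the closest shared ancestor are all assumptions made for this sketch; the description does not prescribe how the database selects the single category.

```python
from collections import deque

# Hypothetical child -> parents table standing in for the category database.
PARENTS = {
    "cars": {"motor vehicles"},
    "motorbikes": {"motor vehicles"},
    "motor vehicles": {"vehicles"},
    "bicycles": {"vehicles"},
}


def ancestor_distances(category, parents):
    """Return {category_or_ancestor: steps upward}, the category itself at 0."""
    distances = {category: 0}
    queue = deque([category])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in distances:
                distances[parent] = distances[node] + 1
                queue.append(parent)
    return distances


def common_concept(categories, parents=PARENTS):
    """Return the closest category covering every input category, or None."""
    tables = [ancestor_distances(c, parents) for c in categories]
    if not tables:
        return None
    shared = set(tables[0]).intersection(*tables[1:])
    if not shared:
        return None
    # Prefer the concept that is the fewest steps above the farthest input;
    # ties are broken alphabetically to keep the result deterministic.
    return min(shared, key=lambda c: (max(t[c] for t in tables), c))
```

The same function can be applied to the categories of a page and to the categories of a part of text, so both end up described by one category drawn from the same structure.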

Moreover, the step of mapping S410 a part of the text to one of the plurality of pages may also be done using a similar hierarchal structure. For example, in case the part of the text is categorized into category 5, while a first page is categorized into category 4 and a second page is categorized into category 6, the hierarchal structure may be used for determining the mapping. In this case, categories 4 and 5 share a common parent (common concept) which is category 2. This is not true for categories 5 and 6. Consequently, the part of the text is mapped to the first page. In another example, the part of the text is categorized into category 5, while a first page is categorized into category 3 and a second page is categorized into category 2. In this case, category 2 is parent to category 5. This is not true for categories 5 and 3. Consequently, the part of the text is mapped to the second page.
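Both examples above can be reproduced with one distance measure over the structure of figure 2: the fewest upward steps needed for two categories to meet at a shared category. This measure is an assumption chosen for illustration, as the description only states that the hierarchal structure may be used; the table mirrors figure 2, with category 4 placed under both category 2 and category 3.

```python
from collections import deque

# Child -> parents table mirroring the example structure 200 of figure 2.
FIG2_PARENTS = {
    "category 2": {"category 1"},
    "category 3": {"category 1"},
    "category 4": {"category 2", "category 3"},
    "category 5": {"category 2"},
    "category 6": {"category 3"},
}


def ancestor_distances(category, parents):
    """Return {category_or_ancestor: steps upward}, the category itself at 0."""
    distances = {category: 0}
    queue = deque([category])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in distances:
                distances[parent] = distances[node] + 1
                queue.append(parent)
    return distances


def hierarchy_distance(a, b, parents):
    """Fewest total upward steps for a and b to meet, or None if they never do."""
    up_a = ancestor_distances(a, parents)
    up_b = ancestor_distances(b, parents)
    shared = up_a.keys() & up_b.keys()
    if not shared:
        return None
    return min(up_a[c] + up_b[c] for c in shared)


def map_by_hierarchy(text_category, page_categories, parents=FIG2_PARENTS):
    """Return the id of the page whose category is closest to text_category.

    page_categories: {page_id: category}. Returns None if no page is related.
    """
    best_page, best_distance = None, None
    for page, category in page_categories.items():
        distance = hierarchy_distance(text_category, category, parents)
        if distance is not None and (best_distance is None or distance < best_distance):
            best_page, best_distance = page, distance
    return best_page
```

With the part of text in category 5, a page in category 4 is two steps away (via category 2) while a page in category 6 is four steps away (via category 1), so the first page is chosen. Likewise a page in category 2 is one step away, whereas a page in category 3 is three steps away.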

The step of mapping S410 may also be done by mapping a part of the text to the page which shares most categories with the part of the text. For example, in case the part of the text is categorized into category A and only one of the pages is categorized into that category, the mapping is done accordingly. In another example, if the part of the text is categorized into categories A, B and D, and only one of the pages is categorized into two or more of those categories, the mapping is done accordingly.
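A minimal sketch of this overlap-based mapping counts the categories each page shares with the part of text and picks the page with the highest count. The function name, the dictionary input and the handling of ties (the first page in iteration order wins) are assumptions; the description does not say how ties are resolved.

```python
def map_by_overlap(text_categories, page_categories):
    """Return the id of the page sharing most categories with the part of text.

    text_categories: iterable of categories of the part of text.
    page_categories: {page_id: iterable of categories}, in page order.
    Returns None if no page shares any category. On a tie the first page in
    iteration order is returned (assumption).
    """
    wanted = set(text_categories)
    best_page, best_score = None, 0
    for page, categories in page_categories.items():
        score = len(wanted & set(categories))
        if score > best_score:
            best_page, best_score = page, score
    return best_page


pages = {"102a": {"A"}, "102b": {"B", "D", "E"}, "102c": {"C"}}
```

For a part of text in categories A, B and D, page 102b shares two categories and is selected, in line with the second example above.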

Figure 3 shows a system for automatically synchronizing an audio track 104 with a plurality of pages 102; 102a-c. The system comprises a device (a computing device having a processor) 300 which is adapted to perform a synchronization of the audio track 104 and the plurality of pages 102. The device 300 may for example receive the audio track 104 and the plurality of pages 102 through a digital network module. According to other embodiments, the audio track 104 and the plurality of pages are stored in a memory of the device 300. The device (e.g. the processor of the device) 300 may be arranged to perform the method of figure 4 and described herein, for example by running a computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of figure 4 and further described herein when executed by a device 300 having processing capability (e.g. the processor of the device).

The device 300 may according to embodiments be connected 302 to a database comprising a hierarchal structure 200 of categories, for example as described in figure 2 and above.

The output from the device 300 may comprise mappings between at least one page 102a-c of the plurality of pages 102 and a time span of the audio track 104, e.g. by mapping a part of the text to one of the plurality of pages by comparing the at least one category of each page with the at least one second category.

In this way the audio track 104 may be automatically divided into "chapters" (time spans) based on the mapping to pages among the plurality of pages 102. This division may facilitate improved navigation in the audio track 104 since the user then can navigate between different parts of the audio track 104 by e.g. swiping between pages 102a-c.
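Deriving such chapters from the mapping output can be sketched by merging consecutive time spans that point to the same page. The (start, end, page) tuple layout and the function name are assumptions for this example.

```python
from itertools import groupby


def chapters(mapped_spans):
    """Merge consecutive time spans mapped to the same page into chapters.

    mapped_spans: list of (start_s, end_s, page_id) tuples in time order
                  (assumed output format of the mapping step).
    Returns a list of (start_s, end_s, page_id), one entry per chapter.
    """
    result = []
    for page, run in groupby(mapped_spans, key=lambda span: span[2]):
        run = list(run)
        # A chapter spans from the first start to the last end of the run.
        result.append((run[0][0], run[-1][1], page))
    return result


spans = [(0, 60, "102a"), (60, 95, "102a"), (95, 210, "102b")]
```

Swiping to a page can then seek the audio to the start time of the corresponding chapter.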

The categorization may be done in many different ways. For example, for text content of a page or for a part of text, the categorization may comprise indexing the text using a key word extraction algorithm and categorizing the text/page based on the extracted key words. The key word extraction algorithm may be any such algorithm known in the art; for example, it may comprise at least one from the list of: support vector machine, K-nearest neighbor algorithm, K-means clustering and boosting algorithm. The extracted key word(s) may then be used directly as categories, or they may be processed through a hierarchal structure as described in figure 2 and above.
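To make the idea of key words as categories concrete, the sketch below uses a plain word-frequency count. This is a deliberately simple stand-in and not one of the algorithms listed above (support vector machine, K-nearest neighbor, K-means clustering, boosting); the stop word list, the minimum word length and the function name are assumptions.

```python
import re
from collections import Counter

# Small illustrative stop word list (assumption, far from complete).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "in", "is",
    "it", "of", "on", "or", "that", "the", "this", "to", "with",
})


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop words, to be used as categories."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


keywords = extract_keywords(
    "The engine drives the car. The car needs an engine and brakes.", top_n=2
)
```

Because the same function can be run on page text and on a part of the transcribed text, both sides yield categories from the same vocabulary, which is what the comparison in the mapping step relies on.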

In case a page comprises one or more images, these may also be used for categorization of the page. The step of categorizing the content of said at least one page into at least one first category may thus comprise analyzing the image using an image recognition algorithm. The image recognition algorithm may be provided by a service such as Imagga provided by Imagga Technologies Ltd., where the image(s) of the page is/are sent through the Imagga API and one or more tags are received from the API. Many similar services exist, and alternatively the image recognition algorithm may be directly implemented at the device running the method for automatically synchronizing the audio track 104 with the plurality of pages 102; 102a-c.

Any video material or animations present on a page among the plurality of pages may be categorized using suitable services or algorithms, such as the methods described in "Automatic video genre categorization using hierarchical SVM" (Yuan et al.). Audio clips present on a page may be categorized using the same method as described herein for the audio track 104. Other types of analysis of audio clips may also be used for categorization purposes, for example analysis determining that the audio clip is recorded at a sports arena, at sea, in the woods etc.

Sometimes the language of the speech content in the audio track is not the same as the language of text present in the pages. In this case, the method of synchronization may comprise determining a language of the speech content in the audio track, and determining that the text present in the pages is of a language other than the language of the speech content. Then, prior to categorizing content of the at least one page into at least one first category, one of the text of the pages and the transcribed text from the speech content is translated to the language of the other. Advantageously the text of the pages is translated, since there is typically less text to translate than if the text originating from the speech content were translated. In other words, according to embodiments, the text of the pages is translated into the language of the speech content using a computerized translation service. Many such translation services exist, for example Google Translate.
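The translation step can be sketched as a small pre-processing function that runs before the pages are categorized. The `translate` parameter is a hypothetical callable standing in for whichever translation service is used; its signature, the language codes and the function name are assumptions, and no real service API is shown here.

```python
def align_page_language(page_texts, page_lang, speech_lang, translate):
    """Return page texts in the language of the speech content.

    page_texts: list of strings, one per page.
    page_lang, speech_lang: language codes determined beforehand (assumed input).
    translate: hypothetical callable (text, source_lang, target_lang) -> str
               wrapping a computerized translation service.
    """
    if page_lang == speech_lang:
        # Languages already match, so no translation call is made.
        return list(page_texts)
    return [translate(text, page_lang, speech_lang) for text in page_texts]


def fake_translate(text, source_lang, target_lang):
    """Stand-in used only to demonstrate the call pattern."""
    return f"[{source_lang}->{target_lang}] {text}"


translated = align_page_language(["Motor", "Bromsar"], "sv", "en", fake_translate)
```

Translating the pages rather than the transcript follows the preference stated above, since the pages usually contain less text.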

The systems, devices and methods disclosed hereinabove may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks between functional units or stages referred to in the above description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation. Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit (ASIC). Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and that can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
