

Title:
"METHOD AND SYSTEM FOR REAL-TIME AUDIO/VIDEO SYNCHRONIZATION"
Document Type and Number:
WIPO Patent Application WO/2023/131864
Kind Code:
A1
Abstract:
A method (700) for real-time audio/video synchronization, comprising steps of: uploading (701), by a central electronic computer, a video content received by means of a user interface provided by the central electronic computer usable online by a local electronic device of a user requesting a service for real-time audio/video synchronization, operatively connected to the central electronic computer by means of a data communication network; providing (702), by the central electronic computer, the uploaded video content to a classifier module operatively connected to the central electronic computer and adapted to analyze, by means of specific artificial intelligence algorithms, frame by frame, the video content received from the central electronic computer and generate a plurality of visual elements found in the video content, associating a corresponding confidence score with each visual element; receiving (704), by the central electronic computer, the plurality of visual elements found in the video content with a corresponding confidence score, generated by the classifier module, associated with each visual element; generating (705), by the central electronic computer, at least one tag representative of the video content starting from the plurality of visual elements generated by and received from the classifier module; selecting (706), by the central electronic computer, a sub-group of audio tracks of said plurality of audio tracks stored in a memory unit of the central electronic computer, based on the at least one tag representative of the video content generated by the central electronic computer, each audio track of said plurality of audio tracks comprises a plurality of tags associated with said audio track, each audio track of the plurality of audio tracks being stored in the memory unit by assigning a numerical value representative of a score to each tag of the respective plurality of tags; searching (707), by the central electronic computer, for the at least one tag representative of the video content generated by the central electronic computer in each plurality of tags assigned to each audio track of said plurality of audio tracks, the step of selecting (706) being performed, by the central electronic computer, to select the audio tracks comprising the searched at least one tag as the sub-plurality of audio tracks of said plurality of audio tracks; sorting (708), by the central electronic computer, the selected audio tracks of the sub-plurality of audio tracks in descending order based on the score of the searched at least one tag assigned to each audio track; providing (709), by the central electronic computer, the user requesting the service for real-time audio/video synchronization with the sub-group of audio tracks selected and sorted based on the relevance to the uploaded video content; receiving (711) from the user, by the central electronic computer, a selection of an audio track of said sub-group of audio tracks to be synchronized with the video content; generating (712), by the central electronic computer, an audio/video content comprising the uploaded video content synchronized with the audio track selected by the user requesting the service for real-time audio/video synchronization; providing (719), by the central electronic computer, the user requesting the service for real-time audio/video synchronization with the generated audio/video content.

Inventors:
LAMBERTI TIZIANO (IT)
FACCINI PAOLO (IT)
ABRATE CARLO (IT)
Application Number:
PCT/IB2022/062917
Publication Date:
July 13, 2023
Filing Date:
December 30, 2022
Assignee:
SOUNZONE S.R.L. (IT)
International Classes:
G06F16/683; G06V20/00; G06V20/40; G11B27/10; G11B27/28; G11B27/34
Foreign References:
US20170092328A12017-03-30
US10140515B12018-11-27
US10276189B12019-04-30
US20210020149A12021-01-21
Attorney, Agent or Firm:
MOZZI, Matteo et al. (IT)
Claims:
CLAIMS

1. A method (700) for real-time audio/video synchronization, comprising steps of: uploading (701), by a central electronic computer (1), a video content (VD) received by means of a user interface (I-U) provided by the central electronic computer (1), usable online by a local electronic device (10) of a user (U) requesting a service for real-time audio/video synchronization, operatively connected to the central electronic computer (1) by means of a data communication network (NTW); providing (702), by the central electronic computer (1), the uploaded video content (VD) to a classifier module (40) operatively connected to the central electronic computer (1), the classifier module (40) analyzing, by means of specific artificial intelligence algorithms, frame by frame, the video content (VD) received from the central electronic computer (1) and generating a plurality (P-E) of visual elements found in the video content (VD), associating a corresponding confidence score with each visual element; receiving (704), by the central electronic computer (1), the plurality (P-E) of visual elements found in the video content (VD) with a corresponding confidence score, generated by the classifier module (40), associated with each visual element; generating (705), by the central electronic computer (1), at least one tag (TG) representative of the video content (VD) starting from the plurality (P-E) of visual elements generated by and received from the classifier module (40); selecting (706), by the central electronic computer (1), a sub-group (S-TK) of audio tracks of said plurality (P-TK) of audio tracks stored in a memory unit (2) of the central electronic computer (1), based on the at least one tag (TG) representative of the video content (VD) generated by the central electronic computer (1), each audio track of said plurality (P-TK) of audio tracks comprises a plurality (P-T) of tags associated with said audio track, each audio track of the plurality (P-TK) of audio tracks being stored in the memory unit (2) by assigning a numerical value representative of a score to each tag of the respective plurality (P-T) of tags; searching (707), by the central electronic computer (1), for the at least one tag (TG) representative of the video content (VD) generated by the central electronic computer (1) in each plurality (P-T) of tags assigned to each audio track of said plurality (P-TK) of audio tracks, the step of selecting (706) being performed, by the central electronic computer (1), to select the audio tracks comprising the searched at least one tag (TG) as the sub-plurality (S-TK) of audio tracks of said plurality (P-TK) of audio tracks; sorting (708), by the central electronic computer (1), the selected audio tracks of the sub-plurality (S-TK) of audio tracks in descending order based on the score of the searched at least one tag (TG) assigned to each audio track; providing (709), by the central electronic computer (1), the user (U) requesting the service for real-time audio/video synchronization with the sub-group (S-TK) of audio tracks selected and sorted based on the relevance to the uploaded video content (VD); receiving (711) from the user (U) requesting the service for real-time audio/video synchronization, by the central electronic computer (1), a selection of an audio track (TK) of said sub-group (S-TK) of audio tracks to be synchronized with the video content (VD); generating (712), by the central electronic computer (1), an audio/video content (AV) comprising the uploaded video content (VD) synchronized with the audio track (TK) selected by the user (U) requesting the service for real-time audio/video synchronization; providing (719), by the central electronic computer (1), the user (U) requesting the service for real-time audio/video synchronization with the generated audio/video content (AV).

2. The method (700) according to claim 1, wherein the step of providing (702) the video content (VD) to the classifier module (40) comprises a step of processing (703), by the central electronic computer (1), the uploaded video content (VD) so as to reduce the size thereof, in order to limit the uploading and analysis times by the classifier module (40).

3. The method (700) according to any one of the preceding claims, wherein the step of providing (709) the sub-group (S-TK) of audio tracks selected and sorted based on the relevance to the uploaded video content (VD) is performed by the central electronic computer (1) by means of a respective area (A2) for providing audio tracks of the user interface (I-U) usable online by means of the local electronic device (10) of the user (U) requesting the service for real-time audio/video synchronization.

4. The method (700) according to any one of the preceding claims, further comprising a step of providing (710), by the central electronic computer (1), the at least one tag (TG) used to search for the sub-group (S-TK) of audio tracks in a respective area (A3) for displaying tags of the user interface (I-U) usable online by means of the local electronic device (10) of the user (U) requesting the service for real-time audio/video synchronization.

5. The method (700) according to any one of the preceding claims, wherein the step of generating (712), by the central electronic computer (1), the audio/video content (AV) comprises steps of: establishing (713) a starting point of the video content (VD); establishing (714) a starting point of the selected audio track (TK); establishing (715) a maximum duration of the video; encoding (716) the video content (VD) and the selected audio track (TK); decoding (717) the video content (VD) and the audio track (TK), generating the audio/video content (AV) as a mix of the video content (VD) with the selected audio track (TK).

6. The method (700) according to claim 5, wherein the step of generating (712), by the central electronic computer (1), the audio/video content (AV) comprises a step of notifying (718) the user (U) requesting the service for real-time audio/video synchronization of the end of the encoding operation, thus of the generation of the audio/video content (AV).

7. The method (700) according to any one of the preceding claims, wherein the generating step (705) is performed by the central electronic computer (1) to generate more than one tag (TG) representative of the video content (VD) starting from the plurality (P-E) of visual elements generated by and received from the classifier module (40).

8. A system for real-time audio/video synchronization (100), comprising a central electronic computer (1) operatively connected to a local electronic device (10) of a user (U) requesting a service for real-time audio/video synchronization, by means of a data communication network (NTW), the central electronic computer (1) being operatively connected to a classifier module (40), the classifier module (40) being configured, by means of specific artificial intelligence algorithms, to analyze, frame by frame, a video content (VD) received from the central electronic computer (1) and generate a plurality (P-E) of visual elements found in the video content (VD), associating a corresponding confidence score with each visual element, the central electronic computer (1) being configured to perform the steps of the method for real-time audio/video synchronization according to any one of the preceding claims.

Description:
DESCRIPTION

“Method and system for real-time audio/video synchronization”

Field of the invention

The present invention relates to audio/video synchronization techniques, in particular to a method for real-time audio/video synchronization and related system.

Technological background of the invention

Nowadays, the audio tracks in a digital music library are mainly used in various types of videos or projects on different media, such as video games, for example.

In the first case, the users, mainly video makers and editors, acquire an audio track from a digital music library and later perform local tests on the audio track, synchronizing it with the video being processed.

This synchronization operation can be performed with different video editing software in which the audio track and the video track are uploaded, later encoded and exported together in a single video content so as to be used together.

This procedure, from selecting the audio track, to encoding the video track with the selected audio track (audio/video synchronization), to finally sharing the video with the synchronized audio track with an assessment team and then with a possible customer, requires hours of work. Even more time is required in the case of creative video products (such as commercials, videos for social networks, films or documentaries), where the procedure needs to be repeated several times in order to end up with the perfect audio track for the reference video. Moreover, when a customer and/or various managers within the same company are involved, such a procedure is further slowed down, and the work of various members of the creative team is required again at each new attempt.

Indeed, a musical supervisor must search for new musical tracks based on the feedback received from the customer, an editor must then synchronize the new audio track with the video again, and the resulting video product must once more be sent to the step of testing the performance of the audio track on the reference video and to final approval.

Thus, the need is strongly felt to provide a method for audio/video synchronization which considerably reduces the procedure time for selecting an audio track and the performance tests on a selected audio track on a reference video.

Summary

It is the object of the present invention to devise and provide a method for real-time audio/video synchronization which allows at least partially obviating the above drawback with reference to the known art, in particular which allows considerably reducing the procedure time for selecting an audio track and the performance tests on a selected audio track on a reference video, while ensuring users adequate support in selecting the best audio track adapted to the video.

Such an object is achieved by a method according to claim 1.

Preferred embodiments of said method are defined in the dependent claims.

A system for real-time audio/video synchronization is also an object of the present invention.

Brief description of the drawings

Further features and advantages of the method and related system according to the invention will become apparent from the following description of preferred embodiments, given by way of non-limiting indication, with reference to the accompanying drawings, in which:

Figure 1 diagrammatically shows, by means of a functional block diagram, a system for real-time audio/video synchronization according to the present invention;

Figures 2-5 diagrammatically show respective screens of a user interface usable in the system for real-time audio/video synchronization according to the present invention;

Figure 6 shows an example of a mathematical matrix usable when performing the method for real-time audio/video synchronization according to the present invention, and

Figure 7 diagrammatically shows, by means of a block diagram, a method for real-time audio/video synchronization according to the present invention.

It should be noted that, in the aforesaid drawings, equal or similar elements are indicated by the same numeric and/or alphanumeric reference.

Detailed description

With reference to the aforesaid drawings, a system 100 for real-time audio/video synchronization according to the present invention, hereinafter simply synchronization system or only system, is now described.

The system 100 comprises a central electronic computer 1. The central electronic computer 1 comprises a respective data processing unit 2, e.g., a microcontroller or microprocessor.

The central electronic computer 1 further comprises a memory unit 3 operatively connected to the data processing unit 2.

The memory unit 3 can be outside (as diagrammatically shown in Figure 1) or inside the data processing unit 2 (embodiment not shown in the drawings).

The data processing unit 2 is configured, by means of uploading and executing one or more program codes stored in the memory unit 3, to perform the method for real-time audio/video synchronization of the present invention, as described below.

It should be noted that in addition to storing one or more program codes, the memory unit 3 is configured to store the data generated and processed following the execution of said one or more program codes.

Moreover, the memory unit 3 is configured to store a plurality of audio tracks P-TK, preferably musical audio tracks, provided by the central electronic computer 1, by means of a respective digital platform, when performing the method for real-time audio/video synchronization of the present invention.

The central electronic computer 1 is configured to upload a video content VD received by means of a user interface I-U provided by the central electronic computer 1, usable online by a local electronic device 10 of a user U requesting a service for real-time audio/video synchronization, operatively connected to the central electronic computer 1 by means of a data communication network NTW (diagrammatically shown in Figure 1), e.g., the Internet.

As shown in Figure 1 , the local electronic device 10 comprises a respective local data processing module 20, for example a microcontroller or microprocessor.

The local electronic device 10 further comprises a local memory module 30 operatively connected to the local data processing module 20.

The local memory module 30 can be outside (as diagrammatically shown in Figure 1) or inside the local data processing module 20 (embodiment not shown in the drawings).

The local data processing module 20 is configured, by uploading and executing one or more program codes stored in the local memory module 30, to allow the operation of the local electronic device 10, in particular also to allow the user U to access a service for real-time audio/video synchronization, according to the present invention, provided by the central electronic computer 1.

It should be noted that, in addition to storing one or more program codes, the local memory module 30 is configured to possibly also store one or more video contents VD generated by the user U requesting a service for real-time audio/video synchronization, with which an audio track may need to be synchronized.

The local electronic device 10 of the user U requesting the service for real-time audio/video synchronization is any electronic device having access to the data communication network NTW to use the user interface I-U via the Web, e.g., a personal computer, a notebook, a tablet, a smartphone, and so on.

The user U requesting the service for real-time audio/video synchronization is any user, whether private or a company, video maker or editor, with the need to assess and propose to third parties, e.g., a possible customer, a previously recorded video synchronized with the best possible audio track in terms of sound performance with respect to the video images and dynamics.

In an embodiment, the video content VD to be uploaded can be uploaded by the user U by performing a “drag and drop” operation from his/her local electronic device 10, i.e., by dragging the video content VD onto a respective first video uploading area A1 of the user interface I-U.

The first video uploading area A1 is shown in Figures 2, 3 and 4.

In an embodiment, as an alternative to the previous one, the video content VD to be uploaded can be uploaded by the user U by selecting it from the local file system among the video contents stored in the local memory module 30 of the respective local electronic device 10, i.e., by specifying, within the user interface I-U, a URL path corresponding to the location of the video content VD to be uploaded in the local memory module 30.

For example, possible formats of the video content to be uploaded are MPEG-4, WEBM, FLV, and so on.

It should be noted that the central electronic computer 1 is configured to upload the video content VD as a video track, ignoring and/or muting any audio track associated with the video content acquired during the recording of the video content.

Returning to the system 100 according to the present invention and to Figure 1, the system 100 further comprises a classifier module (or concentrator) 40 operatively connected to the central electronic computer 1.

The classifier module 40 can be outside the central electronic computer 1 and operatively connected thereto by means of the data communication network NTW (as shown in Figure 1). Alternatively, the classifier module 40 can be integrated in the central electronic computer 1.

Once the video content VD is uploaded, the central electronic computer 1 is configured to provide the video content VD to the classifier module 40.

In this regard, in an embodiment, prior to providing the video content VD to the classifier module 40, the central electronic computer 1 is configured to process the video content VD so as to reduce the size thereof in order to limit the uploading and analysis times by the classifier module 40.

For example, if it is not already in such a format, the central electronic computer 1 is configured to convert the video content VD into an MPEG-4 format with a set maximum duration, e.g., 30 seconds, and a set maximum size, e.g., 50 megabytes.

Once the video content VD has been received from the central electronic computer 1, the classifier module 40 is configured, by means of specific artificial intelligence algorithms (AI), to analyze, frame by frame, the video content VD and to generate a plurality P-E of visual elements (tags) found in the video content VD, associating a corresponding confidence score with each visual element.

The usable artificial intelligence algorithms are, for example, visual ranking algorithms in continuous evolution, the purpose of which is to analyze each individual frame of the video content VD, divide each individual frame into areas until identifying recognizable individual visual elements (for example, cat, cup, automobile, and so on), and return the name of the recognized visual element, while associating the position coordinates within the frame and the respective size with such a name.

“Confidence score” means an estimated percentage of certainty of having recognized a visual element. A confidence score value close to “1” corresponds to a high confidence score, which indicates the certain and predominant presence of the recognized visual element in the video content.

Contrarily, a lower confidence score value, close to “0”, corresponds to a low (poor) confidence score, which indicates that the visual element was not actually recognized, but rather that a false positive is involved.

The visual elements with the highest confidence score values within the aforesaid confidence score scale, from “0” to “1”, will be given increased importance in the system 100 according to the present invention.

The classifier module 40 is configured to provide the central electronic computer 1 with the generated plurality P-E of visual elements.

In an embodiment, prior to providing the central electronic computer 1 with the generated plurality P-E of visual elements, the classifier module 40 is configured to aggregate the results obtained from the analysis of all the frames of the video content VD in order to yield the plurality P-E of visual elements found in the video content which is as consistent as possible with the general contents of the video content VD.
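By way of illustration only, since the document specifies neither the classifier's output format nor the aggregation rule, the following Python sketch assumes frame-level (name, confidence) detections and a simple coverage-weighted average of the confidence scores:

```python
from collections import defaultdict

def aggregate_detections(frames):
    """Aggregate frame-level detections into the plurality P-E of
    visual elements, one confidence score per element.

    `frames` is a list of per-frame detection lists; each detection is
    a (name, confidence) pair with confidence in [0, 1], e.g.
    ("cat", 0.93). The weighting rule below (mean confidence scaled by
    the fraction of frames containing the element) is an assumption:
    it favors elements consistent with the whole video content."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    n_frames = len(frames)
    for detections in frames:
        for name, confidence in detections:
            totals[name] += confidence
            counts[name] += 1
    scores = {
        name: (totals[name] / counts[name]) * (counts[name] / n_frames)
        for name in totals
    }
    # Descending confidence: the certain, predominant elements first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```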

Returning in general to the system 100 according to the present invention, the central electronic computer 1 is configured to receive the generated plurality P-E of visual elements from the classifier module 40.

The central electronic computer 1 is further configured to generate at least one tag TG representative of the video content VD starting from the plurality P-E of visual elements generated by and received from the classifier module 40.

In greater detail, the plurality P-E of visual elements (tags) received from the classifier module 40 is processed by the central electronic computer 1 and translated into synonym tags in a root tag database (not shown in the drawings) operatively connected to the central electronic computer 1.

For example, if the classifier module 40 finds an automobile as a visual element and recognizes it as “BUGATTI” (tag), the central electronic computer 1 recognizes the tag “BUGATTI” as belonging to the group of the “AUTOMOTIVE” root tag.

Therefore, the generated at least one tag TG will in this case be “AUTOMOTIVE”, with which several tens of relevant audio tracks are tagged.
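A minimal sketch of this translation step, assuming a synonym-to-root-tag dictionary; the database structure and the confidence threshold are illustrative, not taken from the document:

```python
# Hypothetical excerpt of the root tag database: each synonym tag
# points to its root tag.
SYNONYM_TO_ROOT = {
    "BUGATTI": "AUTOMOTIVE",
    "car": "AUTOMOTIVE",
    "whimsical": "abstract",
    "psychedelic": "abstract",
}

def to_root_tags(visual_elements, min_confidence=0.5):
    """Translate the classifier's visual elements into root tags TG,
    discarding low-confidence elements (threshold is illustrative)."""
    roots = []
    for name, confidence in visual_elements:
        root = SYNONYM_TO_ROOT.get(name)
        if root is not None and confidence >= min_confidence and root not in roots:
            roots.append(root)
    return roots

# e.g. to_root_tags([("BUGATTI", 0.91), ("cat", 0.12)]) -> ["AUTOMOTIVE"]
```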

It should be noted that the greater the number of generated tags representative of the video content VD, the better the representation of the video content VD by means of the corresponding tags.

According to an embodiment, the central electronic computer 1 is further configured to generate more than one tag TG representative of the video content VD starting from the plurality P-E of visual elements generated by and received from the classifier module 40, preferably at least four tags.

The central electronic computer 1 is further configured to select a sub-group S-TK of audio tracks of said plurality P-TK of audio tracks stored in the memory unit 3, based on the at least one tag TG representative of the video content VD generated by the central electronic computer 1.

Each audio track of said plurality P-TK of audio tracks comprises a plurality P-T of tags associated with said audio track.

Each audio track of the plurality P-TK of audio tracks is stored in the memory unit 3 by assigning a numerical value representative of a score to each tag of the respective plurality P-T of tags. The central electronic computer 1 is configured to search for the at least one tag TG representative of the video content VD generated by the central electronic computer 1 in each plurality P-T of tags assigned to each audio track of said plurality P-TK of audio tracks.

The central electronic computer 1 is configured to select the audio tracks comprising the searched at least one tag TG as sub-plurality S-TK of audio tracks of said plurality P-TK of audio tracks.

The central electronic computer 1 is configured to sort the selected audio tracks of the sub-plurality S-TK of audio tracks in descending order based on the score of the searched at least one tag TG assigned to each audio track.

In this regard, with particular reference also to Figure 6, an example of a method for storing the plurality P-TK of audio tracks in the memory unit 3 of the central electronic computer 1 and a method for selecting a sub-group S-TK of audio tracks of said plurality P-TK of audio tracks based on the search for the at least one tag TG are now described.

Firstly, it should be noted that assigning tags to audio tracks is a practice already present online, typically performed according to stylistic or technical elements attributable to the audio track itself.

This allows searching for relevant audio content but can also be cold and repetitive from a creative viewpoint.

To improve this technique, so-called mood and cultural reference elements can be included in each tag associated with an audio track.

The technique suggested in the present invention consists in assigning a score, i.e., a weight value, to each tag associated with an audio track.

In greater detail, according to the present invention, the Applicant has introduced two types of tags: root tags and synonym tags.

The root tags actually represent the only type of tags associated with an audio track.

Various synonym tags can refer to each root tag, the synonym tags being obtained from linguistic synonyms but especially from tags related to one another by mood elements and cultural references.

Each root tag can have a variable number of synonyms in continuous evolution.

For example, the root tag “abstract” can have, as synonym tags, one or more of: “transcendent”, “whimsical”, “ideational”, “mental”, “weird”, “putative”, “suppositive”, “utopian”, “non concrete”, “theoretical”, “indefinite”, “immaterial”, “extravagant”, “unrealistic”, “hypothetical”, “reverie”, “intellectual”, “conceptual”, “experimental”, “metaphysical”, “abstracted”, “conjectured”, “outline”, “notional”, “abstractive”, “strange”, “vague”, “complex”, “chimerical”, “philosophical”, “odd”, “psychedelic”, “avant-garde”, “suppositional”, “recondite”, “peculiar”, “abstruse”, “discrete”, “unreal”, “transcendental”, “freak”, “conjectural”, “suppositious”.

The compilation of various editorial playlists can be added to this structure: for different genres and moods, an ad hoc playlist of audio tracks, preferably musical audio tracks, can be created which is particularly representative of that category.

It should be noted that these playlists are continuously evolving and are a significant editorial component which can be inserted into the proposal of audio tracks stored in the memory unit 3 of the central electronic computer 1. These audio tracks, specifically assessed by a team of experts, have a higher score than other audio tracks.

By way of example, if a search for the at least one tag TG is made within this plurality P-TK of audio tracks, the audio tracks obtained as a result of the search will be sorted in descending order (from the highest score to the lowest score) following a logic of this type, as sketched below:

1) the audio tracks contain all the searched tags and are in the editorial playlists;

2) the audio tracks contain a synonym tag for each searched root tag and are in the editorial playlists;

3) the audio tracks contain all the searched tags, whether or not they are present in the editorial playlists;

4) the audio tracks contain a synonym tag, whether or not they are present in the editorial playlists;

5) the audio tracks contain the name of the song or artist.
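The following Python sketch encodes the five levels as tiers, under the assumption that each track record carries its tag set, its per-tag scores, and an editorial-playlist flag; all field names are hypothetical:

```python
def ranking_tier(track, searched_roots, synonyms_of):
    """Return the tier (lower is better) of a track for the searched
    root tags, following the five-level logic above."""
    has_all_roots = all(r in track["tags"] for r in searched_roots)
    has_synonym_each = all(
        any(s in track["tags"] for s in synonyms_of[r]) for r in searched_roots
    )
    in_playlist = track["in_editorial_playlist"]
    if has_all_roots and in_playlist:
        return 0
    if has_synonym_each and in_playlist:
        return 1
    if has_all_roots:
        return 2
    if has_synonym_each:
        return 3
    return 4  # tier for song/artist name matches, resolved elsewhere

def sort_tracks(tracks, searched_roots, synonyms_of):
    # Within a tier, tracks with higher scores on the searched tags first.
    return sorted(
        tracks,
        key=lambda t: (ranking_tier(t, searched_roots, synonyms_of),
                       -sum(t["scores"].get(r, 0) for r in searched_roots)),
    )
```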

Greater detail is now provided on the method of storing the plurality P-TK of audio tracks in the memory unit 3 of the central electronic computer 1, implemented to date.

Firstly, the process performed resulted in the definition of two main modules: a first main module, referred to as Tree Tags, representative of the tag data structure implemented in the present invention; and a second main module, referred to as Co-Occurrence Track Ranking, representative of the audio track search and ranking method implemented in the present invention. For the definition of the first main module (Tree Tags), the following operations were performed: selecting the root tags; associating synonyms and keywords semantically related to the root tags (Keyword Expansion); manual control for cleaning and associating the old tags with the root tags (Human Override).

In the first operation (Root Tags), a previous version of the tag data structure, in which the tags associated with the songs were grouped into macro categories such as INST/VOX, GENRE, SUB-GENRE, MOOD, AMBIENTS, INSTRUMENTS, MOVIE TYPE, TV TYPE, FREE, PRODUCT AREA, SOUND A LIKE and TEMPO, was updated starting from the aforesaid macro categories by selecting three new macro categories, i.e., MOOD, ENVIRONMENT, VIDEO TYPE.

With each new macro category, a small number of tags capable of partitioning the macro reference category was associated; these tags are referred to as Root Tags.

In the second operation (Keyword Expansion), other tags which are semantic synonyms of the root tags were associated with each root tag.

Examples of the methods used to select the synonyms comprise Word2Vec synonyms and Human Override.
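For the Word2Vec route, a sketch using the gensim library; the model file, the number of candidates and the similarity threshold are assumptions, not taken from the document:

```python
from gensim.models import KeyedVectors

# Pre-trained word vectors; the path and format are illustrative.
kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def expand_root_tag(root_tag, topn=20, min_similarity=0.5):
    """Propose candidate synonym tags for a root tag; the Human
    Override step then cleans and completes this list manually."""
    return [word for word, sim in kv.most_similar(root_tag, topn=topn)
            if sim >= min_similarity]

# e.g. expand_root_tag("abstract") might propose "conceptual",
# "theoretical", and similar candidates for manual review.
```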

In the third operation (Human Override), the synonym selection and association process was further cleaned and completed, adding further synonyms and filtering out specific synonyms automatically selected during the Keyword Expansion operation. The second main module (Co-Occurrence Track Ranking), an audio track search and ranking system, can be represented by a matrix $M$ (an example of which is shown in Figure 6) with: rows equal to the number of tags in the collection (corpus) at the basis of the system 100 ($T$ in the mathematical representation); columns equal to the number of audio tracks stored in the memory module 3 ($S$ in the mathematical representation) of the central electronic computer 1.

Each element $m_{i,j}$ of the matrix $M$ represents the score of the $i$-th tag of the corpus for the $j$-th audio track.

The scores of the matrix $M$ are determined following the definition of the following metrics.

The matrix of the co-occurrences $M_{co}$ is a square matrix with rows and columns equal to the number of tags in the corpus, where each element $co_{i,j}$ counts the number of audio tracks which have both the $i$-th and the $j$-th tag.

Such a matrix $M_{co}$ allows defining a distance between a tag $t$ and an audio track $s$, where an audio track is defined by the plurality P-T of tags associated therewith, $s = \{t_1, \dots, t_k\}$:

$$d(t, s) = \frac{1}{|s|} \sum_{t' \in s} \frac{co_{t,t'}}{co_{t',t'}}$$

Intuitively, the distance defined between an audio track and a tag measures the level of co-occurrence between the tags in the audio track and the at least one input tag TG to be searched for.

There are two normalizations in this measurement. Firstly, $co_{t,t'}$ is discounted by $co_{t',t'}$ in order to make the “rarest” tags (which occur less often) more significant. Such a normalization is very frequent in information retrieval systems and is often referred to as tf-idf, i.e., term frequency-inverse document frequency. The second normalization, the division by the number of tags $|s|$ of the audio track, is required in order to take into consideration the total number of tags present in each audio track, this number not being constant between all the audio tracks.

Accordingly, such a distance measurement allows defining the matrix $M$, with elements $m_{t,j} = d(t, s_j)$.

Finally, the matrix $M$ can be simplified by setting certain values of the matrix to 0 so as to make the matrix more sparse. Indeed, considering the column of a determined audio track $s_j$, if a tag $t$ is not present among the tags of the audio track $s_j$, the value of the corresponding element $m_{t,j} \in M$ is set to zero.

As for the search for one or more tags in each plurality P-T of tags associated with an audio track, in the case of a search for a single tag $tag_i$, the first search operation occurs at the association level between the input tag to be searched for and the tags in the corpus. To this end, two possible cases are provided: the tag is present in the corpus of tags, or the tag is absent. In this second case, the search can suggest the set of corpus tags most similar to the tag $tag_i$ to be searched for.

As a second operation, the corpus tag $t_i$ associated with the input tag is considered and the row $M_i$ of the matrix corresponding to $t_i$ is selected. At this point, a descending sorting operation is applied to the row $M_i$ and the final ranking of the audio tracks to be provided to the user U is obtained for the tag $t_i$.

Instead, in the case in which several tags are searched for, for example two, the first operation is equal to the case of the search for a single tag, i.e., two tags of the corpus, $t_1$ and $t_2$, are associated with the two input tags to be searched for. As a second operation, the two rows $M_1$ and $M_2$ of the matrix corresponding to the two input tags are selected. In order to create a final score corresponding to the two input tags for each audio track, the following element-wise vector product is calculated:

$$M_{12}[j] = M_1[j] \cdot M_2[j]$$

where, given the zeroing described above, only the audio tracks having both the considered tags obtain a non-zero score. Therefore, the vector $M_{12} \in \mathbb{R}^S$ contains the scores of each audio track with respect to the two input tags to be searched for.

As for the single-tag search, after the descending sorting operation, the final ranking is obtained.
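A compact numpy sketch of the scoring and search just described; it implements the co-occurrence counting, the two normalizations, the zeroing of absent tags and the element-wise combination of rows (the formulas above are themselves a reconstruction of a partially illegible original, so this code should be read under the same assumptions):

```python
import numpy as np

def build_score_matrix(track_tags, n_tags):
    """track_tags: one list of corpus tag indices per audio track.
    Returns M (n_tags x n_tracks) with M[t, j] = d(t, s_j)."""
    n_tracks = len(track_tags)
    co = np.zeros((n_tags, n_tags))
    for tags in track_tags:
        for a in tags:
            for b in tags:
                co[a, b] += 1  # co[t, t] = number of tracks with tag t
    M = np.zeros((n_tags, n_tracks))
    for j, tags in enumerate(track_tags):
        # tf-idf-like discount by the diagonal, then divide by |s_j|
        inv_diag = 1.0 / co[np.ix_(tags, tags)].diagonal()
        M[:, j] = co[:, tags].dot(inv_diag) / len(tags)
    # Sparsification: zero the score of tags absent from the track.
    for j, tags in enumerate(track_tags):
        mask = np.ones(n_tags, dtype=bool)
        mask[tags] = False
        M[mask, j] = 0.0
    return M

def rank(M, tag_indices):
    """Single- or multi-tag search: element-wise product of the rows
    (a single row is returned unchanged), then a descending sort of
    the per-track scores."""
    scores = np.prod(M[tag_indices, :], axis=0)
    return np.argsort(-scores)
```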

Returning to the system 100 according to the present invention, the central electronic computer 1 is configured to provide the user U requesting the service for real-time audio/video synchronization with the sub-group S-TK of audio tracks selected and sorted based on the relevance to the uploaded video content VD. In greater detail, the central electronic computer 1 is configured to provide the sub-group S-TK of audio tracks selected and sorted based on the relevance to the uploaded video content VD in a respective area A2 for providing audio tracks of the user interface I-U usable online by means of the local electronic device 10 of the user U requesting the service for real-time audio/video synchronization.

It should be further noted that the central electronic computer 1 is configured to provide the at least one tag TG used to search for the sub-group S-TK of audio tracks in a third area A3 of the user interface I-U usable online by means of the local electronic device 10 of the user U requesting the service for real-time audio/video synchronization.

The central electronic computer 1 is configured to receive, from the user U, the selection of an audio track TK of said sub-group S-TK of audio tracks to be synchronized with the video content VD.

The central electronic computer 1 is configured to generate an audio/video content AV comprising the uploaded video content VD synchronized with the audio track TK selected by the user.

In this regard, in an embodiment, when generating the audio/video content AV, the central electronic computer 1 is configured to: establish a starting point of the video content VD; establish a starting point of the selected audio track TK; establish a maximum duration of the video (in seconds, or up to the end of either the video content VD or the selected audio track TK); encode the video content VD and the selected audio track TK (an operation which can require several minutes of processing by the central electronic computer 1); decode the video content VD and the audio track TK, generating the audio/video content AV as a mix of the video content VD with the selected audio track TK.

In an embodiment, in combination with the previous one, the central electronic computer 1 is further configured to notify the user U requesting the service for real-time audio/video synchronization (for example, by a notification on the user interface I-U or other channel, e.g., email) of the end of the encoding and/or decoding operation, thus of the generation of the audio/video content AV.

It should be noted that the audio/video content is preferably generated in MPEG-4 format with a set restrained bit-rate (for example, equal to 1-2 Mbit/s) so as to be easily used as a preview version of a final audio/video content.
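The document does not name the encoder used; as one plausible realization of the establishing, encoding and mixing steps described above, the following Python sketch drives the ffmpeg command-line tool with the start points, maximum duration and restrained preview bit-rate just mentioned:

```python
import subprocess

def encode_preview(video_path, audio_path, out_path,
                   video_start=0.0, audio_start=0.0, max_duration=30.0):
    """Mix the uploaded video content VD (its own audio ignored) with
    the selected audio track TK into an MPEG-4 preview at ~1 Mbit/s.
    Paths and defaults are illustrative."""
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(video_start), "-i", video_path,  # starting point of VD
        "-ss", str(audio_start), "-i", audio_path,  # starting point of TK
        "-map", "0:v:0",           # keep only the video stream of VD
        "-map", "1:a:0",           # take the audio from the selected track
        "-t", str(max_duration),   # maximum duration of the preview
        "-shortest",               # or stop at the end of the shorter input
        "-c:v", "libx264", "-b:v", "1M",  # restrained preview bit-rate
        "-c:a", "aac",
        out_path,
    ]
    subprocess.run(cmd, check=True)
```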

The central electronic computer 1 is thus configured to provide the user U requesting the service for real-time audio/video synchronization with the generated audio/video content AV.

Thus, following the selection, by the user U, of the audio track TK of said sub-group S-TK of audio tracks to be synchronized with the video content VD, and once the time required for the generation thereof has passed, the user U can use the audio/video content AV received from the central electronic computer 1 by means of the user interface I-U usable online.

The selected audio track TK is represented in the user interface l-U as a waveform of the corresponding audio signal, as shown in Figures 3 and 4.

In greater detail, the user interface I-U is configured to allow the user U to: play the audio/video content AV (it should be noted that playing the video track involves the synchronized playing of the audio track, and vice versa); skip to a next play point of the video track and the audio track by selecting such a next point on a first slider bar B1 of the video track VD (preferably overlapping the portion of the video) or on a second slider bar B2 of the selected audio track TK (preferably located below the video).

Thereby, the user U can best assess whether or not the selected audio track TK is suitable for the uploaded video content VD according to his/her creative idea.

An example of the first slider bar B1 of the video track VD and the second slider bar B2 of the selected audio track TK is shown in Figures 4 and 5.

If the selected audio track TK synchronized with the video content VD in the audio/video content AV used as a preview by means of the user interface I-U is not suitable, in order to find an audio track which better corresponds to one’s creativity, the user U can select another audio track of the sub-group S-TK provided in the respective area A2 for providing audio tracks of the user interface I-U, or cause the central electronic computer 1 to generate a further sub-group S-TK of audio tracks of the plurality P-TK of audio tracks stored in the memory unit 3 of the central electronic computer 1, and then play an audio track of the further sub-group S-TK of audio tracks.

With reference now to Figure 7, a method 700 for real-time audio/video synchronization according to the present invention, hereinafter simply synchronization method or simply method, is now described.

The method 700 comprises a symbolic step of starting ST.

The method 700 comprises a step of uploading 701, by a central electronic computer 1, a video content VD received by means of a user interface I-U provided by the central electronic computer 1, usable online by a local electronic device 10 of a user U requesting a service for real-time audio/video synchronization, operatively connected to the central electronic computer 1 by means of a data communication network NTW.

The central electronic computer 1 and the local electronic device 10 were already described above.

Methods for uploading the video content VD by the user U were described above with reference to various embodiments.

For example, possible formats of the video content to be uploaded are MPEG-4, WEBM, FLV, and so on.

It should be noted that the central electronic computer 1 uploads the video content VD as a video track, ignoring and/or muting any audio track associated with the video content acquired during the recording of the video content.

The method 700 comprises a step of providing 702, by the central electronic computer 1, the uploaded video content VD to a classifier module 40 operatively connected to the central electronic computer 1.

As already mentioned above, the classifier module 40 can be outside the central electronic computer 1 and operatively connected thereto by means of the data communication network NTW (as shown in Figure 1). Alternatively, the classifier module 40 can be integrated in the central electronic computer 1.

In an embodiment, shown with dashed lines in Figure 7, the step of providing 702 the video content VD to the classifier module 40 comprises a step of processing 703, by the central electronic computer 1, the uploaded video content VD so as to reduce the size thereof, in order to limit the uploading and analysis times by the classifier module 40.

For example, if it is not already in such a format, the central electronic computer 1 converts the video content VD into an MPEG-4 format with a set maximum duration, e.g., 30 seconds, and a set maximum size, e.g., 50 megabytes.

The classifier module 40 analyzes, by means of specific artificial intelligence algorithms (AI), frame by frame, the video content VD received from the central electronic computer 1 and generates a plurality P-E of visual elements (tags) found in the video content VD, associating a corresponding confidence score with each visual element.

The definition of “confidence score” was provided above.

As already mentioned above, the usable artificial intelligence algorithms are, for example, visual ranking algorithms in continuous evolution, the purpose of which is to analyze each individual frame of the video content VD, divide each individual frame into areas until identifying recognizable individual visual elements (for example, cat, cup, automobile, and so on), and return the name of the recognized visual element, while associating the position coordinates within the frame and the respective size with such a name.

Moreover, in an embodiment, in combination with the previous one, the classifier module 40 aggregates the results obtained from the analysis of all the frames of the video content VD in order to yield the plurality P-E of visual elements found in the video content which is as consistent as possible with the general contents of the video content VD.

Returning in general to the method 700 according to the present invention, the method 700 further comprises a step of receiving 704, by the central electronic computer 1, the plurality P-E of visual elements found in the video content VD with a corresponding confidence score, generated by the classifier module 40, associated with each visual element.

The method 700 further comprises a step of generating 705, by the central electronic computer 1, at least one tag TG representative of the video content VD starting from the plurality P-E of visual elements generated by and received from the classifier module 40.

In greater detail, the plurality P-E of visual elements (tags) received from the classifier module 40 is processed by the central electronic computer 1 and translated into synonym tags in a root tag database (not shown in the drawings) operatively connected to the central electronic computer 1.

An example of at least one tag which can be generated by the central electronic computer 1 was provided above.

As mentioned above, it should be noted that the greater the number of tags which can be generated, representative of the video content VD, the better the representation of the video content VD by means of the corresponding tags.

According to an embodiment, in combination with any one of those described, the step of generating 705 is performed by the central electronic computer 1 to generate more than one tag TG representative of the video content VD starting from the plurality P-E of visual elements generated by and received from the classifier module 40, preferably at least four tags.

The method 700 further comprises a step of selecting 706, by the central electronic computer 1, a sub-group S-TK of audio tracks of said plurality P-TK of audio tracks stored in the memory unit 3 of the central electronic computer 1, based on the at least one tag TG representative of the video content VD generated by the central electronic computer 1. Each audio track of said plurality P-TK of audio tracks comprises a plurality P-T of tags associated with said audio track.

Each audio track of the plurality P-TK of audio tracks is stored in the memory unit 3 by assigning a numerical value representative of a score to each tag of the respective plurality P-T of tags.

The method 700 further comprises a step of searching 707, by the central electronic computer 1, for the at least one tag TG representative of the video content VD generated by the central electronic computer 1 in each plurality P-T of tags assigned to each audio track of said plurality P-TK of audio tracks.

The step of selecting 706 is performed, by the central electronic computer 1, to select the audio tracks comprising the searched at least one tag TG as the sub-plurality S-TK of audio tracks of said plurality P-TK of audio tracks.

The method 700 further comprises a step of sorting 708, by the central electronic computer 1, the selected audio tracks of the sub-plurality S-TK of audio tracks in descending order based on the score of the searched at least one tag TG assigned to each audio track.

An example of a method of storing the plurality P-TK of audio tracks in the memory unit 3 of the central electronic computer 1 and a method for selecting a sub-group S-TK of audio tracks of said plurality P-TK of audio tracks based on the search for the at least one tag TG have already been described above and are thus not repeated for brevity of disclosure.

Returning to the method 700 according to the present invention, the method 700 further comprises a step of providing 709, by the central electronic computer 1, the user U requesting the service for real-time audio/video synchronization with the sub-group S-TK of audio tracks selected and sorted based on the relevance to the uploaded video content VD.

In an embodiment, in combination with any one of those described above, the step of providing 709 the sub-group S-TK of audio tracks selected and sorted based on the relevance to the uploaded video content VD is performed by the central electronic computer 1 by means of a respective area A2 for providing audio tracks of the user interface I-U usable online by means of the local electronic device 10 of the user U requesting the service for real-time audio/video synchronization.

In an embodiment, in combination with any one of those described above, shown with dashed lines in Figure 7, the method 700 further comprises a step of providing 710, by the central electronic computer 1, the at least one tag TG used to search for the sub-group S-TK of audio tracks in a respective area A3 for displaying tags of the user interface I-U usable online by means of the local electronic device 10 of the user U requesting the service for real-time audio/video synchronization.

Returning in general to the method 700 according to the present invention, the method 700 further comprises a step of receiving 711, by the central electronic computer 1, from the user U, a selection of an audio track TK of said sub-group S-TK of audio tracks to be synchronized with the video content VD.

The method 700 further comprises a step of generating 712, by the central electronic computer 1, an audio/video content AV comprising the uploaded video content VD synchronized with the audio track TK selected by the user U requesting the service for real-time audio/video synchronization. In this regard, in an embodiment, shown with dashed lines in Figure 7, the step of generating 712, by the central electronic computer 1, the audio/video content AV comprises steps (performed by the central electronic computer 1) of: establishing 713 a starting point of the video content VD; establishing 714 a starting point of the selected audio track TK; establishing 715 a maximum duration of the video (in seconds, or up to the end of either the video content VD or the selected audio track TK); encoding 716 the video content VD and the selected audio track TK (an operation which can require several minutes of processing by the central electronic computer 1); decoding 717 the video content VD and the selected audio track TK, generating the audio/video content AV as a mix of the video content VD with the selected audio track TK.

According to an embodiment, in combination with the previous one and shown with dashed lines in Figure 7, the generating step 712 further comprises a step of notifying 718 the user U requesting the service for real-time audio/video synchronization (for example, by means of a notification on the user interface I-U or other channel, e.g., email) of the end of the encoding and/or decoding operations, thus of the generation of the audio/video content AV.

As already mentioned above, it should be noted that the audio/video content is preferably generated in MPEG-4 format with a set restrained bit-rate (for example, equal to 1-2 Mbit/s) so as to be easily used as a preview version of a final audio/video content.

Returning in general to the method 700 according to the present invention, the method 700 comprises a step of providing 719, by the central electronic computer 1, the user U requesting the service for real-time audio/video synchronization with the generated audio/video content AV.

Possible operations the user U can perform while using the audio/video content AV received from the central electronic computer 1 by means of the user interface I-U usable online were already described above.

With reference again to Figure 7, the method 700 further comprises a symbolic end step ED.

With reference now to the aforesaid drawings, an example of implementing the method for real-time audio/video synchronization according to the present invention is described.

The central electronic computer 1 uploads a video content VD received by means of a user interface I-U (Figure 2) provided by the central electronic computer 1, usable online by a local electronic device 10 of a user U requesting a service for real-time audio/video synchronization, operatively connected to the central electronic computer 1 by means of a data communication network NTW (Figure 3).

The central electronic computer 1 provides the uploaded video content VD to a classifier module 40 operatively connected to the central electronic computer 1 , for example by means of the data communication network NTW.

The classifier module 40 analyzes, by means of specific artificial intelligence algorithms, frame by frame, the video content VD received from the central electronic computer 1 and generates a plurality P-E of visual elements found in the video content VD, associating a corresponding confidence score with each visual element. The central electronic computer 1 receives the plurality P-E of visual elements found in the video content VD with a corresponding confidence score, generated by the classifier module 40, associated with each visual element.

The central electronic computer 1 generates at least one tag TG representative of the video content VD starting from the plurality (P-E) of visual elements generated by and received from the classifier module 40.

The central electronic computer 1 selects a sub-group S-TK of audio tracks of said plurality P-TK of audio tracks stored in the memory unit 3 of the central electronic computer 1, based on the at least one tag TG representative of the video content VD generated by the central electronic computer 1.

Each audio track of said plurality P-TK of audio tracks comprises a plurality P-T of tags associated with said audio track.

Each audio track of the plurality P-TK of audio tracks is stored in the memory unit 3 by assigning a numerical value representative of a score to each tag of the respective plurality P-T of tags.

The central electronic computer 1 searches for the at least one tag TG representative of the video content VD generated by the central electronic computer 1 in each plurality P-T of tags assigned to each audio track of said plurality P-TK of audio tracks.

The central electronic computer 1 selects the audio tracks comprising the searched at least one tag TG as sub-plurality S-TK of audio tracks of said plurality P-TK of audio tracks.

The central electronic computer 1 sorts the selected audio tracks of the sub-plurality S-TK of audio tracks in descending order based on the score of the searched at least one tag TG assigned to each audio track. The central electronic computer 1 provides the user U requesting the service for real-time audio/video synchronization with the sub-group S-TK of audio tracks selected and sorted based on the relevance to the uploaded video content VD (Figure 3).

The central electronic computer 1 receives a selection of an audio track TK of said sub-group S-TK of audio tracks to be synchronized with the video content VD, from the user U requesting the service for real-time audio/video synchronization.

The central electronic computer 1 generates an audio/video content AV comprising the uploaded video content VD synchronized with the audio track TK selected by the user U requesting the service for real-time audio/video synchronization.

The central electronic computer 1 provides the user U requesting the service for real-time audio/video synchronization with the generated audio/video content AV (Figure 4).

The user U plays the audio/video content AV (Figure 5).

The user U can skip to a next play point of the video track and the audio track by selecting such a next point on a first slider bar B1 of the video track VD or on a second slider bar B2 of the selected audio track TK (Figure 5).

Thereby, the user U can best assess whether or not the selected audio track TK is suitable for the uploaded video content VD according to his/her creative idea.

If the selected audio track TK synchronized with the video content VD in the audio/video content AV used as a preview by means of the user interface I-U is not suitable, in order to find an audio track which better corresponds to one's creativity, the user U can select another audio track of the sub-group S-TK provided in the respective area A2 for providing audio tracks of the user interface I-U, or cause the central electronic computer 1 to generate a further sub-group S-TK of audio tracks of the plurality P-TK of audio tracks stored in the memory unit 3 of the central electronic computer 1, and then play an audio track of the further sub-group S-TK of audio tracks.

As can be seen, the object of the invention is fully achieved.

Indeed, the method and related system allow providing audio tracks relevant to the content of the uploaded reference video, specifically selected and sorted according to the assigned scores and the searched at least one tag, after identifying visual elements in the video content which are characteristic and representative of the video content.

Thereby, the process times for selecting and testing the audio track on the reference video can be significantly optimized, eliminating the need to download the song in order to synchronize it under a video with specific software.

Moreover, a musical supervisor can select the audio track(s) directly from a web browser, thus being able to have real-time performance feedback.

Finally, in order to optimize the process, the reference video can be directly tested in the platform provided (user interface usable online) while simply and intuitively synchronizing the reference video with the audio tracks selected, sorted and suggested on the platform.

Those skilled in the art may make changes and adaptations to the embodiments of the method and related system described above or can replace elements with others which are functionally equivalent in order to meet contingent needs without departing from the scope of the following claims. Each of the features described above as belonging to a possible embodiment can be implemented irrespective of the other embodiments described.