

Title:
COMPUTERIZED ANALYSIS OF SOCIAL AND/OR EMOTIONAL PERFORMANCE IN A MULTIMEDIA SESSION
Document Type and Number:
WIPO Patent Application WO/2022/123554
Kind Code:
A1
Abstract:
A computerized system comprises a processor and memory circuitry configured to, for at least one session comprising at least one of an audio content and a video content: obtain data Dbehavioral expr. informative of behavioral expressions of one or more participants in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions in said time segment, obtain a label informative of one or more indicators of at least one of social and emotional performance of at least one participant, and use the set of embedded data structures and the label to train one or more machine learning modules usable, after their training, to determine data informative of a given indicator of said indicators for a participant of a session.

Inventors:
ORON AVIGAD (IL)
GLICKMAN OREN DAVID (IL)
JACOBS YARON ITZHAK (IL)
Application Number:
PCT/IL2021/051395
Publication Date:
June 16, 2022
Filing Date:
November 24, 2021
Assignee:
RESPONCITY LTD (IL)
International Classes:
G06V20/40; G06K9/00; G10L17/04
Foreign References:
US20180082112A12018-03-22
US20120290517A12012-11-15
US20110131144A12011-06-02
Attorney, Agent or Firm:
HAUSMAN, Ehud (IL)
Claims:

CLAIMS

What is claimed is:

1. A computerized system comprising a processor and memory circuitry (PMC) configured to, for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtain data Dbehavioral expr. informative of behavioral expressions of the one or more participants in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, obtain a label for at least one participant in the session, wherein the label is informative of one or more indicators of at least one of social and emotional performance of said participant in the session, and use at least part of the set of embedded data structures and the label to train one or more machine learning modules, wherein a given machine learning module of the one or more machine learning modules is usable, after its training, to determine data informative of a given indicator of said indicators for a given participant of a given session comprising at least one of an audio content and a video content.

2. The system of claim 1, wherein data informative of the given indicator for the given participant comprises a prospect that at least one of a social and emotional behavior of the given participant in the given session complies with the given indicator.

3. The system of claim 1 or of claim 2, wherein the label obtained for said participant is informative of a plurality of different indicators of at least one of social and emotional performance of said participant, wherein the system is configured to use Dbehavioral expr., or data informative thereof, and the label, to train a plurality of machine learning modules, wherein each machine learning module of the plurality of machine learning modules is usable, after its training, to determine data informative of a different indicator of said plurality of different indicators for a given participant of a given session comprising at least one of an audio content and a video content.

4. The system of claim 3, wherein training the plurality of machine learning modules involves using multi-task learning for at least a subset of machine learning modules of the plurality of machine learning modules.

5. The system of any of claims 1 to 4, wherein training the plurality of machine learning modules involves using multi-task learning for at least: a first machine learning module trained to determine data informative of a first indicator of at least one of social and emotional performance, and a second machine learning module trained to determine data informative of a second indicator of at least one of social and emotional performance, wherein the first indicator and the second indicator are interrelated.

6. The system of any of claims 1 to 5, configured to perform at least one of:

(i) for at least one participant, determine, for each period of time of a plurality of periods of time of the session, data Dnon-verbal informative of non-verbal expressions of the participant in the period of time based on data informative of at least one of a body motion and a facial expression of the participant;

(ii) for at least one participant, determine, for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic based on an audio content, or data informative thereof, associated with the participant;

(iii) for at least one participant, determine, for each period of time of a plurality of periods of time of the session, data Dlanguage informative of language expressions used by the participant;

(iv) determine, for each period of time of a plurality of periods of time of the session, a main speaker Dmain/speaker among a plurality of participants of the session;

(v) use at least one of Dnon-verbal, Dparalinguistic, Dlanguage and Dmain/speaker to train the one or more machine learning modules.

7. The system of any of claims 1 to 6, configured to determine, for each time segment, an embedded data structure for a participant based on a data structure informative of behavioral expressions of said participant in the time segment, wherein the data structure is of larger size than the embedded data structure and is derived from at least one of the audio content and the video content of the session in said time segment.

8. The system of any of claims 1 to 7, configured to determine for at least one participant, for each time segment of a plurality of time segments of the session, a data structure including a structured representation informative of: behavioral expressions of the given participant in the time segment, and a reaction of one or more other participants different from the given participant in the time segment.

9. The system of claim 7 or claim 8, configured to use a machine learning module to convert, for each time segment of the plurality of time segments, the data structure into an embedded data structure informative of behavioral expressions of the participant in this time segment.

10. The system of any of claims 1 to 9, wherein Dbehavioral expr. comprises, for at least one participant of the session, a personalized data structure informative of behavioral expressions of said participant across different sessions which included said participant, determined based on at least one of an audio content and a video content of each of these different sessions, wherein the system is configured to use the personalized data structure to train the one or more machine learning modules.

11. The system of any of claims 1 to 10, wherein:

Dbehavioral expr. comprises a second set of embedded data structures including, for each given time interval of a plurality of time intervals of the session, each given time interval including a plurality of given time segments of the session, at least one embedded data structure informative of behavioral expressions of the one or more participants in the given time interval determined using data informative of behavioral expressions of the one or more participants in each given time segment of the given time interval, wherein the system is configured to use the second set of embedded data structures to train the one or more machine learning modules.

12. The system of any of claims 1 to 11, configured to use a machine learning module to generate the set of embedded data structures, wherein the machine learning module implements a network derived from another network which has been trained to reconstruct data informative of behavioral expressions of one or more participants in a given time segment of a session using data informative of behavioral expressions of the one or more participants in other time segments of the session.

13. The system of any of claims 1 to 12, configured to split the session into a plurality of time segments based on data informative of behavioral expressions of all participants in the session and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.

14. A computerized system comprising a processor and memory circuitry (PMC) configured to, for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtain data Dbehavioral expr. informative of behavioral expressions of a given participant in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, feed at least part of the set of embedded data structures to a given machine learning module of one or more machine learning modules, wherein the given machine learning module is trained to determine data informative of a given indicator of at least one of social and emotional performance of a participant in a session, and output data informative of the given indicator of at least one of social and emotional performance of the given participant in the session.

15. The system of claim 14, wherein data informative of the given indicator of at least one of social and emotional performance of the given participant in the session comprises a prospect that at least one of a social and emotional behavior of the given participant in the session complies with the given indicator.
16. The system of claim 14 or of claim 15, configured to: feed at least part of the set of embedded data structures to a plurality of different machine learning modules, wherein each machine learning module of the plurality of machine learning modules is trained to determine data informative of a different indicator of at least one of social and emotional performance of a participant in a session, and output, by each machine learning module, data informative of a different indicator of at least one of social and emotional performance of the given participant in the session.

17. The system of any of claims 14 to 16, wherein Dbehavioral expr. includes at least one of:

(i) for the given participant, for each period of time of a plurality of periods of time of the session, data Dnon-verbal informative of non-verbal expressions of the given participant in the period of time based on data informative of at least one of a body motion and a facial expression of the given participant;

(ii) for the given participant, for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic based on an audio content, or data informative thereof, associated with the given participant;

(iii) for the given participant, for each period of time of a plurality of periods of time of the session, data Dlanguage informative of language expressions used by the given participant.

18. The system of any of claims 14 to 17, configured to determine, for each time segment, an embedded data structure for the given participant based on a data structure informative of behavioral expressions of the given participant in the time segment, wherein the data structure is of larger size than the embedded data structure and is derived from at least one of the audio content and the video content of the session in said time segment.

19. The system of any of claims 14 to 18, configured to determine, for the given participant, for each time segment of a plurality of time segments of the session, a data structure including a structured representation informative of: behavioral expressions of the given participant in the time segment, a reaction of one or more other participants different from the given participant in the time segment.

20. The system of claim 18 or claim 19, configured to use a machine learning module to convert, for each time segment of the plurality of time segments, the data structure into an embedded data structure informative of behavioral expressions of the given participant in this time segment.

21. The system of any of claims 14 to 20, configured to use a machine learning module to generate the set of embedded data structures, wherein the machine learning module implements a network derived from another network which has been trained to reconstruct data informative of behavioral expressions of one or more participants in a given time segment of a session using data informative of behavioral expressions of the one or more participants in other time segments of the session.

22. The system of claim 20 or claim 21, wherein the machine learning module includes a network derived from another trained machine learning module, wherein training of the other machine learning module includes: obtaining a first set of data informative of behavioral expressions of one or more participants in a first time segment of a session, obtaining a second set of data informative of behavioral expressions of one or more participants in a second time segment of the session, using the first set of data and the second set of data to train the other machine learning module to: generate a first data structure corresponding to an embedded representation of the first set of data, generate a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure.

23. The system of any of claims 14 to 22, configured to split the session into a plurality of time segments based on data informative of behavioral expressions of all participants in the session and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.

24. The system of any of claims 14 to 23, wherein Dbehavioral expr. comprises, for the given participant, a personalized data structure informative of behavioral expressions of the given participant across different sessions which included the given participant, determined based on at least one of an audio content and a video content of each of these different sessions, and wherein the system is configured to feed the personalized data structure to the one or more machine learning modules.
25. The system of any of claims 14 to 24, wherein Dbehavioral expr. comprises a second set of embedded data structures including, for each given time interval of a plurality of time intervals of the session, each given time interval including a plurality of given time segments of the session, at least one embedded data structure informative of behavioral expressions of the one or more participants in the given time interval determined using data informative of behavioral expressions of the one or more participants in each given time segment of the given time interval, wherein the system is configured to feed the second set of embedded data structures to the one or more machine learning modules.

26. A computerized system comprising a processor and memory circuitry (PMC) configured to, for a session comprising at least one of an audio content and a video content, the session involving one or more participants: obtain data Dbehavioral expr. informative of behavioral expressions of the one or more participants for each period of time of a plurality of periods of time in the session, derived from at least one of the audio content and the video content, and split the session into a plurality of time segments based on Dbehavioral expr. and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.

27. The system of claim 26, wherein the splitting criterion is informative of a consistency of at least part of the data Dbehavioral expr. in each of the time segments.

28. The system of claim 26 or claim 27, wherein Dbehavioral expr. comprises data for each of a plurality of different behavioral expressions, wherein the splitting criterion is informative of a consistency of at least part of the data in each of the time segments per behavioral expression.

29. The system of any of claims 26 to 28, wherein Dbehavioral expr. comprises, for each period of time of a plurality of periods of time of the session, data informative of behavioral expressions of the one or more participants of the session, wherein the splitting criterion is informative, for a given period of time, of a level of inclusion of the given period of time within a time segment of the plurality of time segments.

30. The system of any of claims 26 to 29, wherein Dbehavioral expr. comprises, for each period of time of a plurality of periods of time of the session, first data informative of a first behavioral expression of the one or more participants of the session, and second data informative of a second behavioral expression of the one or more participants of the session, wherein the splitting criterion is informative of a likelihood that a first value of the first data and a second value of the second data occur within a common time segment.

31. The system of any of claims 26 to 30, configured to determine iteratively the time segments, wherein the system is configured to, for a given iteration, merge at least two time segments determined at a previous iteration, so as to increase a score determined based on the splitting criterion.

32. The system of any of claims 26 to 31, configured to split the session into a plurality of time intervals, and to split the time intervals into a plurality of time segments, wherein the plurality of time intervals is selected based on a determination of a single main speaker among the participants in each of these time intervals.
33. A computerized system comprising a processor and memory circuitry (PMC), configured to: obtain a first set of data informative of behavioral expressions of a participant of a session during a first time segment of the session, obtain a second set of data informative of behavioral expressions of the participant during a second time segment of the session, different from the first time segment, use the first set of data and the second set of data to train a machine learning module comprising a first network comprising a plurality of layers and a second network identical to the first network, to: generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network.

34. The system of claim 33, wherein the machine learning module further comprises an aggregating network, configured to receive the first data structure from the first network and the second data structure from the second network.

35. The system of claim 33 or claim 34, configured to use another machine learning module implementing the first network or the second network after training of the machine learning module, to convert data informative of behavioral expressions of a participant in a given time segment of a session into an embedded data structure.

36. The system of any of claims 33 to 35, configured to: obtain a third set of data informative of behavioral expressions of the participant during a third time segment of the session, wherein the third time segment is different from the first time segment and from the second time segment, use the first set of data, the second set of data and the third set of data to train a machine learning module comprising a first network comprising a plurality of layers, a second network identical to the first network, a third network identical to the first network, to: generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, generate, by the third network, a third data structure corresponding to an embedded representation of the third set of data, and reconstruct the second set of data using the first data structure, the second data structure and the third data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network and to a weight of a corresponding node of a corresponding layer of the third network.

37. A method comprising, by a processor and memory circuitry (PMC), for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of the one or more participants in the session, wherein Dbehavioral expr.
comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, obtaining a label for at least one participant in the session, wherein the label is informative of one or more indicators of at least one of social and emotional performance of said participant in the session, using at least part of the set of embedded data structures and the label to train one or more machine learning modules, wherein a given machine learning module of the one or more machine learning modules is usable, after its training, to determine data informative of a given indicator of said indicators for a given participant of a given session comprising at least one of an audio content and a video content.

38. The method of claim 37, wherein data informative of the given indicator for the given participant comprises a prospect that at least one of a social and emotional behavior of the given participant in the given session complies with the given indicator.

39. The method of claim 37 or of claim 38, wherein the label obtained for said participant is informative of a plurality of different indicators of at least one of social and emotional performance of said participant, wherein the method comprises using Dbehavioral expr., or data informative thereof, and the label, to train a plurality of machine learning modules, wherein each machine learning module of the plurality of machine learning modules is usable, after its training, to determine data informative of a different indicator of said plurality of different indicators for a given participant of a given session comprising at least one of an audio content and a video content.

40. A method comprising, by a processor and memory circuitry (PMC), for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of a given participant in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, feeding at least part of the set of embedded data structures to a given machine learning module of one or more machine learning modules, wherein the given machine learning module is trained to determine data informative of a given indicator of at least one of social and emotional performance of a participant in a session, and outputting data informative of the given indicator of at least one of social and emotional performance of the given participant in the session.

41. The method of claim 40, wherein data informative of the given indicator of at least one of social and emotional performance of the given participant in the session comprises a prospect that at least one of a social and emotional behavior of the given participant in the session complies with the given indicator.
42. The method of claim 40 or of claim 41, comprising: feeding at least part of the set of embedded data structures to a plurality of different machine learning modules, wherein each machine learning module of the plurality of machine learning modules is trained to determine data informative of a different indicator of at least one of social and emotional performance of a participant in a session, outputting, by each machine learning module, data informative of a different indicator of at least one of social and emotional performance of the given participant in the session.

43. The method of any of claims 40 to 42, wherein Dbehavioral expr. includes at least one of:

(i) for the given participant, for each period of time of a plurality of periods of time of the session, data Dnon-verbal informative of non-verbal expressions of the given participant in the period of time based on data informative of at least one of a body motion and a facial expression of the given participant;

(ii) for the given participant, for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic based on an audio content, or data informative thereof, associated with the given participant;

(iii) for the given participant, for each period of time of a plurality of periods of time of the session, data Dlanguage informative of language expressions used by the given participant.

44. The method of any of claims 40 to 43, comprising determining, for each time segment, an embedded data structure for the given participant based on a data structure informative of behavioral expressions of the given participant in the time segment, wherein the data structure is of larger size than the embedded data structure and is derived from at least one of the audio content and the video content of the session in said time segment.

45. A method comprising, by a processor and memory circuitry (PMC), for a session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of the one or more participants for each period of time of a plurality of periods of time in the session, derived from at least one of the audio content and the video content, and splitting the session into a plurality of time segments based on Dbehavioral expr. and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.

46. A method comprising, by a processor and memory circuitry (PMC): obtaining a first set of data informative of behavioral expressions of a participant of a session during a first time segment of the session, obtaining a second set of data informative of behavioral expressions of the participant during a second time segment of the session, different from the first time segment, using the first set of data and the second set of data to train a machine learning module comprising a first network comprising a plurality of layers and a second network identical to the first network, to: generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network.

47. A non-transitory computer readable medium comprising instructions that, when executed by a processor and memory circuitry (PMC), cause the PMC to perform operations comprising, for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of the one or more participants in the session, wherein Dbehavioral expr.
comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, obtaining a label for at least one participant in the session, wherein the label is informative of one or more indicators of at least one of social and emotional performance of said participant in the session, using at least part of the set of embedded data structures and the label to train one or more machine learning modules, wherein a given machine learning module of the one or more machine learning modules is usable, after its training, to determine data informative of a given indicator of said indicators for a given participant of a given session comprising at least one of an audio content and a video content.

48. A non-transitory computer readable medium comprising instructions that, when executed by a processor and memory circuitry (PMC), cause the PMC to perform operations comprising, for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of a given participant in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, feeding at least part of the set of embedded data structures to a given machine learning module of one or more machine learning modules, wherein the given machine learning module is trained to determine data informative of a given indicator of at least one of social and emotional performance of a participant in a session, and outputting data informative of the given indicator of at least one of social and emotional performance of the given participant in the session.

49. A non-transitory computer readable medium comprising instructions that, when executed by a processor and memory circuitry (PMC), cause the PMC to perform operations comprising, for a session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of the one or more participants for each period of time of a plurality of periods of time in the session, derived from at least one of the audio content and the video content, splitting the session into a plurality of time segments based on Dbehavioral expr. and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.
50. A non-transitory computer readable medium comprising instructions that, when executed by a processor and memory circuitry (PMC), cause the PMC to perform operations comprising: obtaining a first set of data informative of behavioral expressions of a participant of a session during a first time segment of the session, obtaining a second set of data informative of behavioral expressions of the participant during a second time segment of the session, different from the first time segment, using the first set of data and the second set of data to train a machine learning module comprising a first network comprising a plurality of layers and a second network identical to the first network, to: generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network.

Description:
COMPUTERIZED ANALYSIS OF SOCIAL AND/OR EMOTIONAL PERFORMANCE IN A MULTIMEDIA SESSION

TECHNICAL FIELD

[001] The presently disclosed subject matter relates, in general, to the field of computerized analysis of social and/or emotional skills of participants in a session, such as a multimedia session.

BACKGROUND

[002] Social-emotional learning (SEL) is the process of developing self-awareness, self-control, and interpersonal skills that are vital for school, work, and life success. People with strong social-emotional skills are generally better able to cope with everyday challenges and benefit academically, professionally, and socially.

[003] In this context, there is a growing need to propose new methods and systems capable of automatically determining data informative of social and/or emotional behavior (or skills) of people, based e.g. on a recorded audio and/or video session.

GENERAL DESCRIPTION

[004] In accordance with certain aspects of the presently disclosed subject matter, there is provided a computerized system comprising a processor and memory circuitry (PMC) configured to, for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtain data Dbehavioral expr. informative of behavioral expressions of the one or more participants in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, obtain a label for at least one participant in the session, wherein the label is informative of one or more indicators of at least one of social and emotional performance of said participant in the session, and use at least part of the set of embedded data structures and the label to train one or more machine learning modules, wherein a given machine learning module of the one or more machine learning modules is usable, after its training, to determine data informative of a given indicator of said indicators for a given participant of a given session comprising at least one of an audio content and a video content.
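By way of a non-limiting illustration only, the training flow of the preceding paragraph could be sketched roughly as follows. The PyTorch framework, the recurrent encoder, and every name and dimension below (e.g. SegmentSequenceClassifier, EMBEDDING_DIM) are assumptions of this sketch rather than features of the presently disclosed subject matter; they merely show one possible way a sequence of per-time-segment embedded data structures and a per-participant label might be fed to a trainable module.

```python
# Illustrative only: a minimal PyTorch sketch of the training flow of paragraph [004].
# All names and dimensions are hypothetical; no particular architecture, framework,
# or hyper-parameters are prescribed by the disclosure.
import torch
import torch.nn as nn

EMBEDDING_DIM = 64   # assumed size of each per-time-segment embedded data structure

class SegmentSequenceClassifier(nn.Module):
    """Maps a sequence of per-segment embeddings to one indicator score."""
    def __init__(self, embedding_dim: int = EMBEDDING_DIM, hidden_dim: int = 32):
        super().__init__()
        self.encoder = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)   # one indicator of social/emotional performance

    def forward(self, segment_embeddings: torch.Tensor) -> torch.Tensor:
        # segment_embeddings: (batch, num_segments, embedding_dim)
        _, last_hidden = self.encoder(segment_embeddings)
        return self.head(last_hidden[-1]).squeeze(-1)   # raw logit per session/participant

# Toy training step: a batch of per-segment embeddings plus a per-participant label.
model = SegmentSequenceClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

embeddings = torch.randn(8, 20, EMBEDDING_DIM)   # 8 sessions, 20 time segments each (dummy data)
labels = torch.randint(0, 2, (8,)).float()        # e.g. "complies with the indicator" yes/no

logits = model(embeddings)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

Any other encoder over the set of embedded data structures (attention-based, pooling-based, and so on) could play the same role; the recurrent encoder above is only one convenient choice for variable-length sequences of segments.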

[005] In addition to the above features, the system according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (i) to (xii) below, in any technically possible combination or permutation: i. data informative of the given indicator for the given participant comprises a prospect that at least one of a social and emotional behavior of the given participant in the given session complies with the given indicator; ii. the label obtained for said participant is informative of a plurality of different indicators of at least one of social and emotional performance of said participant, wherein the system is configured to use Dbehavioral expr., or data informative thereof, and the label, to train a plurality of machine learning modules, wherein each machine learning module of the plurality of machine learning modules is usable, after its training, to determine data informative of a different indicator of said plurality of different indicators for a given participant of a given session comprising at least one of an audio content and a video content; iii. training the plurality of machine learning modules involves using multi-task learning for at least a subset of machine learning modules of the plurality of machine learning modules; iv. training the plurality of machine learning modules involves using multi-task learning for at least a first machine learning module trained to determine data informative of a first indicator of at least one of social and emotional performance, and a second machine learning module trained to determine data informative of a second indicator of at least one of social and emotional performance, wherein the first indicator and the second indicator are interrelated; v. the system is configured to perform at least one of: a. for at least one participant, determine, for each period of time of a plurality of periods of time of the session, data Dnon-verbal informative of non-verbal expressions of the participant in the period of time based on data informative of at least one of a body motion and a facial expression of the participant; b. for at least one participant, determine, for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic based on an audio content, or data informative thereof, associated with the participant; c. for at least one participant, determine, for each period of time of a plurality of periods of time of the session, data Dlanguage informative of language expressions used by the participant; d. determine, for each period of time of a plurality of periods of time of the session, a main speaker Dmain/speaker among a plurality of participants of the session; e. use at least one of Dnon-verbal, Dparalinguistic, Dlanguage and Dmain/speaker to train the one or more machine learning modules. vi. the system is configured to determine, for each time segment, an embedded data structure for a participant based on a data structure informative of behavioral expressions of said participant in the time segment, wherein the data structure is of larger size than the embedded data structure and is derived from at least one of the audio content and the video content of the session in said time segment; vii.
the system is configured to determine for at least one participant, for each time segment of a plurality of time segments of the session, a data structure including a structured representation informative of behavioral expressions of the given participant in the time segment, and a reaction of one or more other participants different from the given participant in the time segment; viii. the system is configured to use a machine learning module to convert, for each time segment of the plurality of time segments, the data structure into an embedded data structure informative of behavioral expressions of the participant in this time segment; ix. Dbehavioral expr. comprises, for at least one participant of the session, a personalized data structure informative of behavioral expressions of said participant across different sessions which included said participant, determined based on at least one of an audio content and a video content of each of these different sessions, wherein the system is configured to use the personalized data structure to train the one or more machine learning modules;

x. Dbehavioral expr. comprises a second set of embedded data structures including, for each given time interval of a plurality of time intervals of the session, each given time interval including a plurality of given time segments of the session, at least one embedded data structure informative of behavioral expressions of the one or more participants in the given time interval determined using data informative of behavioral expressions of the one or more participants in each given time segment of the given time interval, wherein the system is configured to use the second set of embedded data structures to train the one or more machine learning modules; xi. the system is configured to use a machine learning module to generate the set of embedded data structures, wherein the machine learning module implements a network derived from another network which has been trained to reconstruct data informative of behavioral expressions of one or more participants in a given time segment of a session using data informative of behavioral expressions of the one or more participants in other time segments of the session; and xii. the system is configured to split the session into a plurality of time segments based on data informative of behavioral expressions of all participants in the session and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.
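Features (iii) and (iv) of the preceding list refer to multi-task learning over interrelated indicators. A minimal, purely illustrative arrangement, under the assumption of a shared encoder with one head per indicator and a summed joint loss (none of which is prescribed by the present disclosure), might look as follows:

```python
# Illustrative only: one possible multi-task arrangement for features (iii)-(iv) above.
# The shared encoder, the two heads, and all names/dimensions are assumptions of the sketch.
import torch
import torch.nn as nn

class MultiTaskIndicatorModel(nn.Module):
    def __init__(self, embedding_dim: int = 64, hidden_dim: int = 32):
        super().__init__()
        self.shared_encoder = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        # one head per indicator; the two indicators are assumed to be interrelated
        self.head_indicator_1 = nn.Linear(hidden_dim, 1)
        self.head_indicator_2 = nn.Linear(hidden_dim, 1)

    def forward(self, segment_embeddings: torch.Tensor):
        _, h = self.shared_encoder(segment_embeddings)
        shared = h[-1]
        return (self.head_indicator_1(shared).squeeze(-1),
                self.head_indicator_2(shared).squeeze(-1))

model = MultiTaskIndicatorModel()
loss_fn = nn.BCEWithLogitsLoss()
x = torch.randn(4, 20, 64)                        # dummy per-segment embeddings
y1 = torch.randint(0, 2, (4,)).float()            # label for the first indicator
y2 = torch.randint(0, 2, (4,)).float()            # label for the second, interrelated indicator
logit1, logit2 = model(x)
loss = loss_fn(logit1, y1) + loss_fn(logit2, y2)  # joint (multi-task) objective
loss.backward()
```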

[006] In accordance with certain other aspects of the presently disclosed subject matter, there is provided a computerized system comprising a processor and memory circuitry (PMC) configured to, for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtain data Dbehavioral expr. informative of behavioral expressions of a given participant in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, feed at least part of the set of embedded data structures to a given machine learning module of one or more machine learning modules, wherein the given machine learning module is trained to determine data informative of a given indicator of at least one of social and emotional performance of a participant in a session, and output data informative of the given indicator of at least one of social and emotional performance of the given participant in the session.
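By way of a non-limiting illustration of the inference flow described in the preceding paragraph, and continuing the hypothetical SegmentSequenceClassifier sketch given earlier, a trained module could be queried roughly as follows (all names remain assumptions of the sketch; the sigmoid is used here only to express the output as a prospect-like value):

```python
# Illustrative only: a minimal inference pass matching paragraph [006]. It assumes the
# hypothetical SegmentSequenceClassifier from the earlier training sketch has been trained.
import torch

@torch.no_grad()
def score_participant(model: torch.nn.Module, segment_embeddings: torch.Tensor) -> float:
    """Feed one participant's per-segment embeddings to a trained module and return a
    prospect-like value in [0, 1] that the participant complies with the given indicator."""
    model.eval()
    logit = model(segment_embeddings.unsqueeze(0))   # add a batch dimension
    return torch.sigmoid(logit).item()

# Example call (dummy data): 20 time segments, 64-dimensional embeddings.
# prospect = score_participant(trained_model, torch.randn(20, 64))
```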

[007] In addition to the above features, the system according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xiii) to (xxiii) below, in any technically possible combination or permutation: xiii. data informative of the given indicator of at least one of social and emotional performance of the given participant in the session comprises a prospect that at least one of a social and emotional behavior of the given participant in the session complies with the given indicator; xiv. the system is configured to feed at least part of the set of embedded data structures to a plurality of different machine learning modules, wherein each machine learning module of the plurality of machine learning modules is trained to determine data informative of a different indicator of at least one of social and emotional performance of a participant in a session, and output, by each machine learning module, data informative of a different indicator of at least one of social and emotional performance of the given participant in the session; xv. Dbehavioral expr. includes at least one of: a. for the given participant, for each period of time of a plurality of periods of time of the session, data Dnon-verbal informative of non-verbal expressions of the given participant in the period of time based on data informative of at least one of a body motion and a facial expression of the given participant; b. for the given participant, for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic based on an audio content, or data informative thereof, associated with the given participant; c. for the given participant, for each period of time of a plurality of periods of time of the session, data Dlanguage informative of language expressions used by the given participant; xvi. the system is configured to determine, for each time segment, an embedded data structure for the given participant based on a data structure informative of behavioral expressions of the given participant in the time segment, wherein the data structure is of larger size than the embedded data structure and is derived from at least one of the audio content and the video content of the session in said time segment; xvii. the system is configured to determine, for the given participant, for each time segment of a plurality of time segments of the session, a data structure including a structured representation informative of behavioral expressions of the given participant in the time segment, a reaction of one or more other participants different from the given participant in the time segment; xviii. the system is configured to use a machine learning module to convert, for each time segment of the plurality of time segments, the data structure into an embedded data structure informative of behavioral expressions of the given participant in this time segment; xix. the system is configured to use a machine learning module to generate the set of embedded data structures, wherein the machine learning module implements a network derived from another network which has been trained to reconstruct data informative of behavioral expressions of one or more participants in a given time segment of a session using data informative of behavioral expressions of the one or more participants in other time segments of the session; xx.
the machine learning module includes a network derived from another trained machine learning module, wherein training of the other machine learning module includes obtaining a first set of data informative of behavioral expressions of one or more participants in a first time segment of a session, obtaining a second set of data informative of behavioral expressions of one or more participants in a second time segment of the session, using the first set of data and the second set of data to train the other machine learning module to generate a first data structure corresponding to an embedded representation of the first set of data, generate a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure; xxi. the system is configured to split the session into a plurality of time segments based on data informative of behavioral expressions of all participants in the session and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session; xxii. Dbehavioral expr. comprises, for the given participant, a personalized data structure informative of behavioral expressions of the given participant across different sessions which included the given participant, determined based on at least one of an audio content and a video content of each of these different sessions, and wherein the system is configured to feed the personalized data structure to the one or more machine learning modules; and xxiii. Dbehavioral expr. comprises a second set of embedded data structures including, for each given time interval of a plurality of time intervals of the session, each given time interval including a plurality of given time segments of the session, at least one embedded data structure informative of behavioral expressions of the one or more participants in the given time interval determined using data informative of behavioral expressions of the one or more participants in each given time segment of the given time interval, wherein the system is configured to feed the second set of embedded data structures to the one or more machine learning modules.

[008] In accordance with certain other aspects of the presently disclosed subject matter, there is provided a computerized system comprising a processor and memory circuitry (PMC) configured to, for a session comprising at least one of an audio content and a video content, the session involving one or more participants: obtain data Dbehavioral expr. informative of behavioral expressions of the one or more participants for each period of time of a plurality of periods of time in the session, derived from at least one of the audio content and the video content, and split the session into a plurality of time segments based on Dbehavioral expr. and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.

[009] In addition to the above features, the system according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xxiv) to (xxix) below, in any technically possible combination or permutation: xxiv. the splitting criterion is informative of a consistency of at least part of the data Dbehavioral expr. in each of the time segments; xxv. Dbehavioral expr. comprises data for each of a plurality of different behavioral expressions, wherein the splitting criterion is informative of a consistency of at least part of the data in each of the time segments per behavioral expression; xxvi. Dbehavioral expr. comprises, for each period of time of a plurality of periods of time of the session, data informative of behavioral expressions of the one or more participants of the session, wherein the splitting criterion is informative, for a given period of time, of a level of inclusion of the given period of time within a time segment of the plurality of time segments; xxvii. Dbehavioral expr. comprises, for each period of time of a plurality of periods of time of the session, first data informative of a first behavioral expression of the one or more participants of the session, and second data informative of a second behavioral expression of the one or more participants of the session, wherein the splitting criterion is informative of a likelihood that a first value of the first data and a second value of the second data occur within a common time segment; xxviii. the system is configured to determine iteratively the time segments, wherein the system is configured to, for a given iteration, merge at least two time segments determined at a previous iteration, so as to increase a score determined based on the splitting criterion; and xxix. the system is configured to split the session into a plurality of time intervals, and to split the time intervals into a plurality of time segments, wherein the plurality of time intervals is selected based on a determination of a single main speaker among the participants in each of these time intervals.
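Paragraphs [008] and [009] describe splitting a session into time segments based on Dbehavioral expr. and a splitting criterion, including a consistency-based criterion (feature (xxiv)) and iterative merging of segments (feature (xxviii)). The following sketch is a non-limiting illustration only: the negative within-segment variance used as the score, the greedy bottom-up merging, and all names are assumptions of the sketch, not the criterion defined by the present disclosure.

```python
# Illustrative only: one possible consistency-based splitting with iterative merging.
import numpy as np

def segment_score(block: np.ndarray) -> float:
    # Higher when the behavioral-expression data is more consistent within the segment
    # (an assumed score, standing in for the splitting criterion of the disclosure).
    return -float(np.var(block, axis=0).sum())

def split_session(d_behavioral: np.ndarray, num_segments: int) -> list[tuple[int, int]]:
    """d_behavioral: (num_periods, num_features) per-period behavioral-expression data.
    Starts from one segment per period and repeatedly merges the adjacent pair whose
    merge most increases (or least decreases) the total score, until num_segments remain."""
    segments = [(i, i + 1) for i in range(len(d_behavioral))]
    while len(segments) > num_segments:
        best_gain, best_idx = -np.inf, None
        for i in range(len(segments) - 1):
            (a0, a1), (b0, b1) = segments[i], segments[i + 1]
            gain = (segment_score(d_behavioral[a0:b1])
                    - segment_score(d_behavioral[a0:a1])
                    - segment_score(d_behavioral[b0:b1]))
            if gain > best_gain:
                best_gain, best_idx = gain, i
        a, b = segments[best_idx], segments[best_idx + 1]
        segments[best_idx:best_idx + 2] = [(a[0], b[1])]
    return segments

# Example: 100 periods of 6 behavioral-expression features, split into 8 time segments.
print(split_session(np.random.rand(100, 6), 8))
```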

[0010] In accordance with certain other aspects of the presently disclosed subject matter, there is provided a computerized system comprising a processor and memory circuitry (PMC), configured to obtain a first set of data informative of behavioral expressions of a participant of a session during a first time segment of the session, obtain a second set of data informative of behavioral expressions of the participant during a second time segment of the session, different from the first time segment, use the first set of data and the second set of data to train a machine learning module comprising a first network comprising a plurality of layers and a second network identical to the first network, to generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network.

[0011] In addition to the above features, the system according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xxx) to (xxxii) below, in any technically possible combination or permutation. xxx. the machine learning module further comprises an aggregating network, configured to receive the first data structure from the first network and the second data structure from the second network; xxxi. the system is configured to use another machine learning module implementing the first network or the second network after training of the machine learning module, to convert data informative of behavioral expressions of a participant in a given time segment of a session into an embedded data structure; and xxxii. the system is configured to obtain a third set of data informative of behavioral expressions of the participant during a third time segment of the session, wherein the third time segment is different from the first time segment and from the second time segment, use the first set of data, the second set of data and the third set of data to train a machine learning module comprising a first network comprising a plurality of layers, a second network identical to the first network, a third network identical to the first network, to generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, generate, by the third network, a third data structure corresponding to an embedded representation of the third set of data, and reconstruct the second set of data using the first data structure, the second data structure and the third data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network and to a weight of a corresponding node of a corresponding layer of the third network.
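By way of a non-limiting illustration of the weight-sharing arrangement of paragraphs [0010]-[0011], one might reuse a single encoder instance for both time segments, which makes the "first network" and "second network" identical by construction, and reconstruct the second set of data from the two embedded representations through an aggregating network. The framework, layer sizes, and the mean-squared-error reconstruction loss below are assumptions of this sketch:

```python
# Illustrative only: a minimal sketch of the weight-sharing ("twin") arrangement of
# paragraphs [0010]-[0011]. The decoder/aggregator and the MSE loss are assumptions.
import torch
import torch.nn as nn

FEATURE_DIM, EMBED_DIM = 32, 16   # assumed sizes of the per-segment data and embeddings

encoder = nn.Sequential(           # first network; the "second network" shares its weights
    nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, EMBED_DIM)
)
aggregator = nn.Sequential(        # aggregating network: reconstructs the second set of data
    nn.Linear(2 * EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, FEATURE_DIM)
)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(aggregator.parameters()), lr=1e-3)

first_set = torch.randn(8, FEATURE_DIM)    # behavioral expressions in the first time segment
second_set = torch.randn(8, FEATURE_DIM)   # behavioral expressions in the second time segment

emb_1 = encoder(first_set)                 # first data structure (embedded representation)
emb_2 = encoder(second_set)                # second data structure, produced with identical weights
reconstruction = aggregator(torch.cat([emb_1, emb_2], dim=-1))
loss = nn.functional.mse_loss(reconstruction, second_set)   # reconstruct the second set
loss.backward()
optimizer.step()

# After such training, the encoder alone could serve as the module that converts
# per-segment behavioral-expression data into embedded data structures.
```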
[0012] In accordance with certain other aspects of the presently disclosed subject matter, there is provided a method comprising, by a processor and memory circuitry (PMC), for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of the one or more participants in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, obtaining a label for at least one participant in the session, wherein the label is informative of one or more indicators of at least one of social and emotional performance of said participant in the session, using at least part of the set of embedded data structures and the label to train one or more machine learning modules, wherein a given machine learning module of the one or more machine learning modules is usable, after its training, to determine data informative of a given indicator of said indicators for a given participant of a given session comprising at least one of an audio content and a video content.

[0013] In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (i) to (xii) above (implemented for a method), in any technically possible combination or permutation.

[0014] In accordance with certain other of the presently disclosed subject matter, there is provided a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations as described above.

[0015] In accordance with certain other of the presently disclosed subject matter, there is provided a method comprising, by a processor and memory circuitry (PMC), for at least one session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of a given participant in the session, wherein Dbehavioral expr. comprises a set of embedded data structures including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more of said behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment, feeding at least part of the set of embedded data structures to a given machine learning module of one or more machine learning modules, wherein the given machine learning module is trained to determine data informative of a given indicator of at least one of social and emotional performance of a participant in a session, and outputting data informative of the given indicator of at least one of social and emotional performance of the given participant in the session.

[0016] In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xiii) to (xxiii) above (implemented for a method), in any technically possible combination or permutation.

[0017] In accordance with certain other of the presently disclosed subject matter, there is provided a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations as described above.

[0018] In accordance with certain other of the presently disclosed subject matter, there is provided a method comprising, by a processor and memory circuitry (PMC), for a session comprising at least one of an audio content and a video content, the session involving one or more participants: obtaining data Dbehavioral expr. informative of behavioral expressions of the one or more participants for each period of time of a plurality of periods of time in the session, derived from at least one of the audio content and the video content, and splitting the session into a plurality of time segments based on Dbehavioral expr. and a splitting criterion, wherein at least some of the plurality of time segments is respectively informative of a different social interaction which occurred during the session.

[0019] In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xxiv) to (xxix) above (implemented for a method), in any technically possible combination or permutation.

[0020] In accordance with certain other of the presently disclosed subject matter, there is provided a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations as described above.

[0021] In accordance with certain other of the presently disclosed subject matter, there is provided a method comprising, by a processor and memory circuitry (PMC): obtaining a first set of data informative of behavioral expressions of a participant of a session during a first time segment of the session, obtaining a second set of data informative of behavioral expressions of the participant during a second time segment of the session, different from the first time segment, using the first set of data and the second set of data to train a machine learning module comprising a first network comprising a plurality of layers and a second network identical to the first network, to generate, by the first network, a first data structure corresponding to an embedded representation of the first set of data, generate, by the second network, a second data structure corresponding to an embedded representation of the second set of data, and reconstruct the second set of data using the first data structure and the second data structure, wherein a value assigned to a weight of a node of a layer of the first network is identically assigned to a weight of a corresponding node of a corresponding layer of the second network.

[0022] In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xxx) to (xxxii) above (implemented for a method), in any technically possible combination or permutation.

[0023] In accordance with certain other of the presently disclosed subject matter, there is provided a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations as described above.

[0024] Among the advantages of certain embodiments of the presently disclosed subject matter is the provision of a computerized solution which determines, in an accurate and efficient way, data informative of social and/or emotional performance indicators or skills of people taking part in a recorded session.

[0025] According to some embodiments, the proposed solution is able to automatically determine data informative of social and/or emotional performance indicators of one or more participants, even when only a limited number of sessions labelled by human operator(s) is available. In other words, the proposed solution leverages a limited number of labels provided by human operator(s).

[0026] According to some embodiments, the proposed solution is able to convert a large amount of raw data collected from the content of a multimedia session into an embedded representation usable by one or more machine learning modules, thereby facilitating training and/or detection by the one or more machine learning modules. In particular, according to some embodiments, the embedded representation keeps most of the relevant information, thereby facilitating analysis of social and/or emotional performance of participants.

[0027] According to some embodiments, the proposed solution provides, for each period of time of a session involving one or more participants, an embedded representation informative of behavioral expressions of at least one participant using a machine learning module trained to take into account the context of each period of time. As a consequence, the embedded representation is more reliable and reflects more accurately the actual social interactions of the session.

[0028] According to some embodiments, the proposed solution takes into account various social interactions to determine data informative of social and/or emotional performance indicators of participants.

[0029] According to some embodiments, the proposed solution proposes a plurality of dedicated modules, each specifically trained to determine data informative of a different social and/or emotional performance indicator, thereby improving accuracy.

[0030] According to some embodiments, the proposed solution enables leveraging data collected for a given participant across a plurality of different sessions.

[0031] According to some embodiments, the proposed solution enables dividing a session into a plurality of time periods characteristic of different social interactions.

BRIEF DESCRIPTION OF THE DRAWINGS

[0032] In order to understand the disclosure and to see how it may be carried out in practice, embodiments will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:

[0033] Fig. 1 illustrates a generalized block diagram of a system in accordance with certain embodiments of the presently disclosed subject matter.

[0034] Fig. 2 illustrates a generalized flow-chart of a method of determining data informative of behavioral expressions of one or more participants of the session, based on the audio and/or video content of the session.

[0035] Fig. 3 illustrates an example of data informative of non-verbal expressions of a participant derived from data informative of at least one of a body motion and a facial expression of the participant.

[0036] Fig. 4 illustrates an example of paralinguistic expressions derived from an audio content or data informative thereof associated with a participant.

[0037] Fig. 5 illustrates a generalized flow-chart of a method of splitting a session into a plurality of time intervals/time segments according to a splitting criterion.

[0038] Fig. 6 illustrates an example of a rule used in the splitting criterion, in which an overlap between the time segments and the periods of time in which data informative of behavioral expressions have been determined in Fig. 2 is taken into account.

[0039] Fig. 7 illustrates an example of a split of a session into time intervals, each time interval being split into a plurality of time segments.

[0040] Fig. 8 illustrates an example of an iterative process for determining the time segments.

[0041] Fig. 9 illustrates a generalized flow-chart of a method of determining a data structure including a structured representation informative of behavioral expressions of a given participant in a given time interval.

[0042] Fig. 10 illustrates an example of the output of the method of Fig. 9.

[0043] Fig. 11 illustrates a generalized block diagram of a machine learning module operative to convert a data structure including e.g. a structured representation informative of behavioral expressions of a given participant in a given time interval into an embedded data structure.

[0044] Fig. 11A illustrates a variant of the machine learning module of Fig. 11.

[0045] Fig. 12 illustrates a generalized flow-chart of a method of training the machine learning module of Fig. 11.

[0046] Fig. 13 illustrates a generalized flow-chart of a method of training the machine learning module of Fig. 11A.

[0047] Fig. 14 illustrates a generalized flow-chart of a method of generating an embedded data structure by embedding a data structure including data informative of behavioral expressions of a given participant in a given time interval.

[0048] Fig. 14A illustrates a generalized flow-chart of a method of using at least part of the trained machine learning module of Fig. 11 or Fig. 11A to generate an embedded data structure.

[0049] Fig. 15 illustrates a generalized flow-chart of a method of using at least part of the trained machine learning module of Fig. 11 or Fig. 11A to generate a set of embedded data structures for a session split into time segments.

[0050] Fig. 16 illustrates a generalized flow-chart of a method of generating a personalized data structure informative of behavioral expressions of a given participant across different sessions.

[0051] Fig. 16A illustrates a non-limitative example of operations which can be performed in the method of Fig. 16.

[0052] Fig. 17 illustrates a generalized flow-chart of a method of training a machine learning module to generate an embedded data structure informative of behavioral expressions of one or more participants in a time interval including a plurality of time segments.

[0053] Fig. 17A illustrates a generalized flow-chart of a method of using a trained machine learning module to generate an embedded data structure informative of behavioral expressions of one or more participants in a time interval including a plurality of time segments.

[0054] Fig. 17B illustrates a non-limitative example of the method of Fig. 17.

[0055] Fig. 17C illustrates a non-limitative example of the method of Fig. 17A.

[0056] Fig. 18 illustrates a generalized flow-chart of a method of training one or more machine learning modules to determine data informative of social and/or emotional performance indicators of one or more participants in a session using data informative of behavioral expressions of the one or more participants in the session.

[0057] Fig. 19 illustrates a generalized block diagram depicting training of one or more machine learning modules according to the method of Fig. 18.

[0058] Fig. 20 illustrates a generalized flow-chart of a method of using one or more trained machine learning modules to determine data informative of social and/or emotional performance indicators of one or more participants in a session using data informative of behavioral expressions of the one or more participants in the session.

[0059] Fig. 21 illustrates a generalized block diagram depicting usage of one or more machine learning modules according to the method of Fig. 20.

DETAILED DESCRIPTION OF EMBODIMENTS

[0060] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the presently disclosed subject matter.

[0061] Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "obtaining", "using", "determining", "generating", "training", "feeding", "outputting", "splitting", "reconstructing", or the like, refer to the action(s) and/or process(es) of a processing unit that manipulate and/or transform data into other data, said data represented as physical, such as electronic, quantities and/or said data representing the physical objects. The term "computerized system" or "computer" should be expansively construed to cover any kind of hardware-based electronic device with data processing capabilities.

[0062] The terms "non-transitory memory" and "non-transitory storage medium" used herein should be expansively construed to cover any volatile or non-volatile computer memory suitable to the presently disclosed subject matter.

[0063] It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are described in the context of separate embodiments, can also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are described in the context of a single embodiment, can also be provided separately or in any suitable sub-combination. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the methods and apparatus.

[0064] Bearing this in mind, attention is drawn to Fig. 1 illustrating a functional block diagram of a (computerized) system 100 in accordance with certain embodiments of the presently disclosed subject matter.

[0065] System 100 includes a processor and memory circuitry (PMC) 110. PMC 110 includes a processor (not shown separately) and a memory (not shown separately). The processor of PMC 110 can be configured to execute several functional modules in accordance with computer-readable instructions implemented on a non-transitory computer-readable memory comprised in the PMC. Such functional modules are referred to hereinafter as comprised in the PMC.

[0066] System 100 is operable to receive data 120 from a third party (e.g. from another computer or server), through wire and/or wireless communication. According to some embodiments, data 120 includes data informative of one or more recorded sessions. Each session includes at least one of an audio content 1201 (e.g. digitized audio content recorded during the session) and a video content 1202 (recorded during the session). A session generally involves a plurality of participants, who interact socially. In some embodiments, the session can include, at least during a given period of time (or during the whole session), a single participant.

[0067] According to some embodiments, system 100 includes a machine learning module 130. As explained hereinafter, the machine learning module 130 is configured to provide an embedded representation informative of behavioral expressions of a participant during each of a plurality of time segments of the session.

[0068] As further explained hereinafter, an embedded data structure obtained by embedding a given data structure is a smaller size representation of the given data structure (which keeps, at least partially, meaning of the data present in the data structure). In some embodiments, and as explained hereinafter, the embedded data structure takes into account the context (temporal context) of the given data structure. Various methods are described hereinafter to generate embedded representations. [0069] The machine learning module 130 can be implemented e.g. by PMC 110, and embodiments oftraining/usage of the machine learning module 130 will be described hereinafter.

[0070] According to some embodiments, system 100 includes one or more additional machine learning modules 140 (see machine learning modules 1401, 1402, ..., 140N). As explained hereinafter, the machine learning modules 140 are trained to determine data informative of social and/or emotional performance indicators of people (which will be discussed hereinafter), based on data derived from the audio and/or video content 120 of a session.

[0071] According to some embodiments, at least one of the machine learning modules 140 includes a deep neural network (DNN). PMC 110 can implement various additional machine learning modules (including e.g. a DNN), which can be used in the process of determining social skills of people, as explained hereinafter.

[0072] By way of a non-limiting example, the layers of the DNN can be organized in accordance with Convolutional Neural Network (CNN) architecture, Recurrent Neural Network architecture (e.g. Long Short Term Memory network architecture), Recursive Neural Networks architecture, Generative Adversarial Network (GAN) architecture, or otherwise. Optionally, at least some of the layers can be organized in a plurality of DNN sub-networks. Each layer of the DNN can include multiple basic computational elements (CE), typically referred to in the art as dimensions, neurons, or nodes.

[0073] Generally, computational elements of a given layer can be connected with CEs of a preceding layer and/or a subsequent layer. Each connection between a CE of a preceding layer and a CE of a subsequent layer is associated with a weighting value. A given CE can receive inputs from CEs of a previous layer via the respective connections, each given connection being associated with a weighting value which can be applied to the input of the given connection. The weighting values can determine the relative strength of the connections and thus the relative influence of the respective inputs on the output of the given CE. The given CE can be configured to compute an activation value (e.g. the weighted sum of the inputs) and further derive an output by applying an activation function to the computed activation. The activation function can be, for example, an identity function, a deterministic function (e.g., linear, sigmoid, threshold, or the like), a stochastic function, or other suitable function. The output from the given CE can be transmitted to CEs of a subsequent layer via the respective connections. Likewise, as above, each connection at the output of a CE can be associated with a weighting value which can be applied to the output of the CE prior to being received as an input of a CE of a subsequent layer. Further to the weighting values, there can be threshold values (including limiting functions) associated with the connections and CEs.

[0074] The weighting and/or threshold values of DNN can be initially selected prior to training, and can be further iteratively adjusted or modified during training to achieve an optimal set of weighting and/or threshold values in a trained DNN. After each iteration, a difference (also called a loss function) can be determined between the actual output produced by DNN and the target output associated with the respective training set of data. The difference can be referred to as an error value. Training can be determined to be complete when a cost or loss function indicative of the error value is less than a predetermined value, or when a limited change in performance between iterations is achieved. Optionally, at least some of the DNN subnetworks (if any) can be trained separately, prior to training the entire DNN.
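Purely as an illustration of the computation described above (and not as a definition of any particular network used by the presently disclosed subject matter), the following minimal Python sketch shows how a single computational element could combine weighted inputs and apply an activation function. The sigmoid activation and all values are hypothetical choices made for illustration only.

```python
import numpy as np

def ce_output(inputs: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> float:
    """Weighted sum of the inputs (the activation value) followed by a sigmoid activation."""
    activation = float(np.dot(inputs, weights) + bias)  # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-activation))            # sigmoid activation function

# Hypothetical example: three inputs received from a preceding layer.
x = np.array([0.2, -0.5, 0.9])
w = np.array([0.7, 0.1, -0.3])
print(ce_output(x, w))
```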

[0075] According to some embodiments, system 100 is operable to receive labelled data (hereinafter label 150). The label 150 can be generated based on an input of an operator. The label is informative of one or more indicators of social and/or emotional performance of the one or more participants in the session. For example, the session may involve a plurality of students, and a teacher provides a score for one or more social and/or emotional performance indicators of the one or more participants during the session. The label 150 can be generated using the input of the teacher.

[0076] As explained hereinafter, system 100 can output, in particular, data 160 informative of one or more social and/or emotional performance indicators, determined by the system 100 for at least one participant (or more) of a given session including at least one of an audio content and a video content.

[0077] According to some embodiments, each social and/or emotional performance indicator can characterize a type or a class of social and/or emotional behavior of participants in a social activity.

[0078] Social behavior relates to interaction(s) with others. Therefore, a given social performance indicator can be indicative of a given type of social behavior of a participant in a social activity.

[0079] Emotional behavior relates to the expression of feelings/emotions. Therefore, a given emotional performance indicator can be indicative of a given type of emotional behavior of a participant in a social activity.

[0080] Some behavior can correspond to both a social and emotional behavior (e.g. "provides feedback to others with positive attitude/encouraging"). Therefore, a given social and emotional performance indicator can be indicative of a given type of social and emotional behavior of a participant in a social activity.

[0081] The social and/or emotional performance indicators can be predefined (e.g. by an operator) and can be selected to reflect typical and relevant types of social and/or emotional behavior of participants in a social activity.

[0082] Non-limitative examples of social and/or emotional performance indicators include:

- The participant contributes ideas and promotes discussions;

- The participant answers questions effectively;

- The participant explains clearly;

- The participant acknowledges others' perspectives;

- The participant helps others when asked, etc.

[0083] According to some embodiments, the social and/or emotional performance indicators can include different types of social and/or emotional skills of people. Examples of these social and/or emotional skills can include e.g. self-awareness, self-management, social awareness, relationship skills, responsible decision-making, etc. This list of social and/or emotional skills (defined by "CASEL") is not limitative and the social and/or emotional performance indicators can include skills which are defined in a different way. For example, the social and/or emotional performance indicators can include presentation skills (which can be divided into a plurality of sub-skills belonging to this category), communication skills (which can be divided into a plurality of sub-skills belonging to this category), collaboration skills (which can be divided into a plurality of sub-skills belonging to this category), etc. Definition of the social and/or emotional performance indicators can be provided e.g. by an operator.

[0084] According to some embodiments, the social and/or emotional behavior of a given participant in a session is automatically classified by system 100 using the predefined social and/or emotional performance indicators, wherein, for each social and/or emotional performance indicator, the given participant is attributed with a calculated weight/score (likelihood of correctness/certainty).

[0085] Therefore, data 160 output by system 100 can include, for each (predefined) social and/or emotional performance indicator, a prospect (such as a weight or a score). For each given social and/or emotional performance indicator, the corresponding weight or score indicates prospects that the participant complies with the definition of this given social and/or emotional performance indicator.

[0086] For example, a good presenter will be assigned a high score by the system 100 for the social performance indicator defined as: "the participant explains clearly".
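As a hedged illustration of the kind of output described above, data 160 could be represented as a mapping from each predefined indicator to a prospect score. The representation, indicator names and score values below are hypothetical and given for illustration only.

```python
# Hypothetical structure for data 160: one prospect (a score in [0, 1]) per
# predefined social and/or emotional performance indicator, for one participant.
data_160 = {
    "contributes ideas and promotes discussions": 0.72,
    "answers questions effectively": 0.65,
    "explains clearly": 0.91,  # e.g. a good presenter would get a high score here
    "acknowledges others' perspectives": 0.48,
    "helps others when asked": 0.55,
}
```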

[0087] Attention is drawn to Fig. 2.

[0088] As explained above, a session includes at least one of an audio content and a video content. In some embodiments, additional metadata can be associated with the session (e.g. periods of time in which each participant is speaking, identity of participants, etc.). The method of Fig. 2 includes (operation 200) obtaining at least part of the content associated with the recorded session.

[0089] According to some embodiments, the method of Fig. 2 includes generating (operation 210) first data based on the content of the session.

[0090] The first data can include e.g. speech transcript (corresponding to the written transcription of the oral communication in the session) derived from the audio content of the session. Available speech recognition software can be used to generate the speech transcript (e.g. Google transcription services, Nuance library, etc.). In some embodiments, the speech transcript includes data indicative, for each portion of the speech, of the participant who formulated this portion of speech.

[0091] According to some embodiments, the first data can include raw facial expressions. Examples of raw facial expressions include e.g. Brow Furrow, Brow Raise, Cheek Raise, etc. (this list is not limitative). These facial expressions can be derived for each participant, based on the video content of the session. In some embodiments, a separate video content is associated with each participant (e.g. during a remote Internet meeting), which can be fed to an algorithm operative to recognize facial expressions (non-limitative examples of facial expression classification algorithms include algorithms available in libraries such as iMotions; services such as Microsoft cognitive services can also be used). In a video common to a plurality of participants, each participant can be recognized using e.g. face recognition services such as Microsoft face detection.

[0092] According to some embodiments, the first data can include feature characteristics of the audio content of the session. The features can include e.g. Pitch, Loudness, Critical Band spectra, Perceptual Linear Predictive (PLP) Coefficients, etc. In some embodiments, the session can include metadata which indicates, for each period of time, the participant who is mainly speaking. For each period of time, corresponding audio data can be fed to one or more algorithms operative to compute the required features (non-limitative examples of algorithms include YAAPT pitch tracking, Cepstrum Pitch Determination, etc.).

[0093] The method further includes determining (see operation 215) various data Dbehavioral expr. informative of behavioral expressions of one or more participants of the session. These behavioral expressions can correspond to non-verbal expressions and/or verbal expressions. These behavioral expressions can correspond e.g. to social and/or emotional expressions of the participant. Non-verbal expressions can include and/or can be derived from e.g. body expression(s) and/or face expression(s). Verbal expressions can include and/or can be derived from voice indicators, language expressions, etc. Various examples are provided hereinafter.

[0094] According to some embodiments, the method includes, for at least one participant (or for each participant), determining (operation 220), for each period of time of a plurality of periods of time of the session, data Dnon-verbal informative of non-verbal expressions of the participant in the period of time based on data informative of at least one of a body motion and a facial expression of the participant in the period of time.

[0095] In particular, data Dnon-verbal can include various types of expressions, such as e.g. attention level (characterizing to what extent the participant is attentive to the exchanges of the session), specific expressions, such as confirmation, confusion, interest, engagement level, valence level (informative of the strength of the facial expression and its evolution over time), etc. These examples are however not limitative.

[0096] Determination of data Dnon-verbal based on data informative of at least one of a body motion and a facial expression of the participant can rely on various rules or patterns enabling this conversion. These rules can be predefined e.g. by one or more operators and stored in a memory. In some embodiments, these rules can be improved based on an input of communication experts (psychologists, physiologists, etc.).

[0097] A non-limitative example of data Dnon-verbal (see reference 300) is provided in Fig. 3 for a given participant. The representation used in Fig. 3 is not limitative. In this example, during the period of time [0:00:03.574; 0:00:04.763], it has been determined that the given participant had a low attention level (confidence score of this evaluation is 0.8). During the period of time [0:00:04.763; 0:00:08.136], it has been determined that the given participant had a high attention level (confidence score of this evaluation is 0.6). During the period of time [0:00:12.234; 0:00:18.136], it has been determined that the given participant had an expression of confirmation (confidence score of this evaluation is 0.9).

[0098] Attention level can be determined based e.g. on predefined rules. For example, a starting point of a period of time can be defined as an event in which a participant is looking directly at the camera, and an end point of the period of time can be defined as an event in which the participant is not looking directly at the camera. Value of the attention level in the period of time can be determined based on rules depending on facial expressions such as (but not limited to) head movement, brow raiser, eye focus, etc.

[0099] A rule which enables determining whether the participant had an expression of confirmation can include the following conditions (see the sketch after this list):

- Head of the participant was down more than 0.3ms (hereinafter "nod down");

- Head of the participant was up more than 0.3ms (hereinafter "nod up");

- A difference in time between "nod down" and "nod up" is below a threshold (hereinafter "nod head");

- Face of the participant indicates a positive feeling, such as by a cheek raiser (hereinafter "positive face gesture").

It is concluded that the participant performed a "confirmation", if "nod head" and "positive face gesture" occurred within a common time period (difference in time between the two events is below a threshold).
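A minimal Python sketch of such a rule-based detection is given below. The event representation, the thresholds and the helper names are assumptions made only for illustration and do not reflect a specific implementation of the presently disclosed subject matter.

```python
from dataclasses import dataclass

@dataclass
class FaceEvent:
    kind: str    # "nod_down", "nod_up" or "positive_face_gesture" (hypothetical labels)
    time: float  # event time (hypothetical unit)

def detect_confirmation(events, nod_gap: float = 0.5, gesture_gap: float = 1.0) -> bool:
    """Return True if a 'nod head' and a 'positive face gesture' occur within a common time period."""
    downs = [e.time for e in events if e.kind == "nod_down"]
    ups = [e.time for e in events if e.kind == "nod_up"]
    gestures = [e.time for e in events if e.kind == "positive_face_gesture"]
    for t_down in downs:
        for t_up in ups:
            if abs(t_up - t_down) <= nod_gap:              # "nod head" detected
                nod_time = (t_down + t_up) / 2.0
                if any(abs(g - nod_time) <= gesture_gap for g in gestures):
                    return True                             # nod and positive gesture co-occur
    return False
```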

[00100] The above example is not limitative. Similarly, for each of the other types of expressions of D non-verbal, specific rules/patterns can be defined.

[00101] As mentioned above, data Dnon-verbal can include different behavioral expressions. The periods of time in which values of the different behavioral expressions are determined can be different among the behavioral expressions.

[00102] In addition, in a given period of time, it can occur that a plurality of different behavioral expressions is determined for a given participant. For example, in the same period of time, it can be determined that the participant had both a high attention level (first indicator of Dnon-verbal) and had an expression of confirmation (second indicator of Dnon-verbal).

[00103] According to some embodiments, the method includes, for at least one participant (or for each participant), determining (operation 230), for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic based on an audio content or data informative thereof associated with the participant. Audio content, or data informative thereof, can include e.g. raw audio content of the session, raw audio features of the session (pitch, etc.), audio data determined at operation 210 (e.g. speech transcript).

[00104] Dparalinguistic can include e.g. speaker's fluency, monotonous speech levels (this depends e.g. on the evolution of pitch over time, speed of speech, etc.), excitement levels, etc. These examples are not limitative.

[00105] According to some embodiments, a machine learning module (implementing a machine learning algorithm, such as a deep neural network) can be trained to determine Dparalinguistic based on audio content, or data informative thereof, associated with the participant, and a label provided e.g. by an operator.

[00106] A non-limitative example of determining Dparalinguistic is provided hereinafter.

[00107] The method can include identifying periods of time in which there is a single main speaker, together with the identity of that speaker (see e.g. operation 250 described hereinafter).

[00108] In each of these periods of time (in which a single main speaker is present), various data can be extracted, such as maximal audio energy level, variance of pitch level in each sentence, average gap in pitch per word in all sentences, etc. This list of data is not limitative. A label (provided e.g. by an operator) can include values for various expressions of Dparalinguistic, such as fluency of the main speaker, the monotony of the main speaker, excitement level, etc. The extracted data together with the label can be used to train the machine learning module.

[00109] Once the machine learning module is trained, it can be used to determine, based on similar data used for training the machine learning module, paralinguistic expressions similar to those defined in the label.

[00110] A non-limitative example of Dparalinguistic for a given participant is illustrated as reference 400 in Fig. 4. As shown, for a given participant, the session is divided into a plurality of periods of time (the periods of time do not necessarily cover the whole duration of the session). Values for one or more of the paralinguistic expressions are provided for each period of time.

[00111] According to some embodiments, the method of Fig. 2 includes, for at least one participant (or for each participant), determining (operation 240), for each period of time of a plurality of periods of time of the session, data Dlanguage informative of language expressions used by the participant.

[00112] According to some embodiments, Dlanguage includes data informative of various feelings and/or specific social interactions of a participant that can be derived from language expressions used by the participant, such as sentiment, self-attribution, frustration, blame, excitement, profanity, etc. The method can receive as an input the speech transcript of the session (which can be segmented into sentences - each sentence being associated with a given participant, using the algorithms described above for converting voice into text).

[00113] As for the other indicators already mentioned above, various rules/patterns can be defined to enable automatic computerized detection of the various feelings and/or specific social interactions based on language expressions used by the participant. Some examples are provided below.

[00114] Self-attribution (or team attribution) can be derived using pronoun usage by the participant. For instance, "we" is indicative of a team attribution and "I" is indicative of self-attribution.

[00115] A blame can be detected if the sentence comprises (see the sketch after this list):

- A causality adverb (such as "because");

- Attribution to another person (which can be detected by the usage of a second/third person pronoun);

- A verb used in the past tense.

For instance, a sentence such as "you are hungry because you refused to eat your meal" can be detected, indicative of blaming another person.
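A minimal sketch of such a rule is given below. The word lists and the naive past-tense heuristic are assumptions made only for illustration; a real implementation could rely e.g. on part-of-speech tagging instead.

```python
import re

CAUSALITY_ADVERBS = {"because", "since", "as"}          # assumed list, illustration only
SECOND_THIRD_PRONOUNS = {"you", "he", "she", "they", "your", "his", "her", "their"}
PAST_TENSE_SUFFIX = re.compile(r"\b\w+ed\b")            # naive past-tense heuristic

def looks_like_blame(sentence: str) -> bool:
    """Very rough heuristic implementing the three blame conditions listed above."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    has_causality = any(t in CAUSALITY_ADVERBS for t in tokens)
    has_other_person = any(t in SECOND_THIRD_PRONOUNS for t in tokens)
    has_past_tense = bool(PAST_TENSE_SUFFIX.search(sentence.lower()))
    return has_causality and has_other_person and has_past_tense

print(looks_like_blame("You are hungry because you refused to eat your meal"))  # True
```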

[00116] Excitement can be determined by detecting specific semantic expressions (e.g. "great", etc.). Similarly, a library can define, for each feeling to be detected, a list of semantic expression characteristics of this feeling.

[00117] According to some embodiments, Dlanguage includes indicators of the linguistic expression style, such as personal reference style, richness of language, complexity of sentences, etc. As mentioned above, these indicators can be computed for each period of time of a plurality of periods of time of the session.

[00118] According to some embodiments, the indicators of the linguistic expression style can include e.g. Flesch-Kincaid grade level, SMOG index, and multi-lingual Flesch Reading Ease. These indicators can be used to characterize e.g. complexity of language.

[00119] Personal reference style can be computed as a frequency at which a person is referring to himself (e.g. by using the pronoun "I").

[00120] Richness of language can be computed by performing statistical analysis of the number of different words used by the participant.
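The following Python sketch illustrates, under stated assumptions, how such style indicators could be approximated: the Flesch Reading Ease formula with a deliberately naive syllable counter, a type-token ratio as a proxy for richness of language, and a simple frequency of the pronoun "I" for personal reference style. The tokenization and the syllable heuristic are assumptions for illustration only.

```python
import re

def count_syllables(word: str) -> int:
    """Very naive syllable estimate: count runs of vowels (illustration only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula with naive tokenization."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

def richness_of_language(text: str) -> float:
    """Type-token ratio as a simple proxy for the number of different words used."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return len(set(words)) / max(1, len(words))

def personal_reference_style(text: str) -> float:
    """Frequency at which the speaker refers to himself (pronoun 'I')."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(1 for w in words if w == "I") / max(1, len(words))
```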

[00121] According to some embodiments, Dlanguage includes indicators of the structure of the discourse of each participant, such as "elaboration", "feedback", etc. According to some embodiments, these indicators can be derived based on the Penn Discourse Treebank standard (PDTB). Discourse indicators include temporal indicators such as motivation and precedence; contingency indicators such as cause and condition; comparison indicators such as contrast and conclusion; and expansion indicators such as alternative or exception. According to some embodiments, algorithms such as Neural Discourse Modeling can be used.

[00122] According to some embodiments, the method includes determining (operation 250), for each period of time of a plurality of periods of time of the session, a (single) main speaker Dmain speaker among a plurality of participants of the session. The single main speaker of a period of time (the period of time can be defined to have a duration above a threshold) corresponds to the participant who has the largest fraction of speech duration during this period of time. Operation 250 can include using speaker diarization algorithms, such as ALIZE Speaker Diarization, Audioseg, SpkDiarization, etc. In some embodiments, the session is already associated with metadata indicative of the main speaker for each period of time (determined e.g. based on the volume of the sound of the microphone of each participant).
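A minimal sketch of the "largest fraction of speech duration" selection is given below; the representation of speech turns as (speaker, start, end) tuples is a hypothetical input format assumed for illustration (in practice such turns could come from a diarization algorithm or from session metadata).

```python
def main_speaker(turns, period_start: float, period_end: float):
    """Return the participant with the largest fraction of speech duration
    within [period_start, period_end]; `turns` is a list of
    (speaker_id, start, end) tuples (hypothetical representation)."""
    durations = {}
    for speaker, start, end in turns:
        overlap = min(end, period_end) - max(start, period_start)
        if overlap > 0:
            durations[speaker] = durations.get(speaker, 0.0) + overlap
    return max(durations, key=durations.get) if durations else None

# Hypothetical usage:
turns = [("A", 0.0, 4.0), ("B", 4.0, 5.0), ("A", 5.0, 9.0)]
print(main_speaker(turns, 0.0, 10.0))  # "A"
```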

[00123] Attention is now drawn to Fig. 5.

[00124] The method of Fig. 5 includes obtaining (operation 500) data Dbehavioral expr. informative of behavioral expressions of one or more participants in the session. Various embodiments for computing Dbehavioral expr. have been described with reference to Fig. 2. Dbehavioral expr. is informative of behavioral expressions of the one or more participants in a plurality of periods of time in the session.

[00125] The method further includes splitting the session into a plurality of time segments, based on Dbehavioral expr. and a splitting criterion.

[00126] According to some embodiments, the session is split into an ordered sequence which is informative of a plurality of different social interactions which occurred in the session. According to some embodiments, the split according to the splitting criterion enables that at least some of the plurality of time segments is respectively informative of a different social interaction (or respectively informative of a different subset of social interactions) which occurred during the session. According to some embodiments, the split according to the splitting criterion enables that, for at least some of the plurality of time segments S1 to SN (these time segments are not necessarily strictly adjacent in time), each given time segment Si is mainly (or fully) associated with a social interaction (or with a subset of social interactions) which differs from the social interaction (or the subset of social interactions) associated with the time segment Si-1 located before the given time segment Si and/or with the time segment Si+1 located after the given time segment Si.

[00127] Different social interactions can include e.g. different social interactions by the same participant and/or the same social interactions but performed by different participants.

[00128] For example (these examples are not limitative), during a first time segment a participant agreed, and then during a second time segment this participant did not understand another participant, etc. A first social interaction corresponds to the agreement of the participant, and a second social interaction corresponds to the lack of understanding of another participant.

[00129] In another example, during a first time segment, a first participant agreed (first social interaction), and in a second time segment a second participant also agreed (second social interaction).

[00130] In another example, during a first time segment, a first participant agreed, and a second participant had a low attention level (first set of social interactions), and during a second time segment the first participant had a high attention level and the second participant was upset (second set of social interactions).

[00131] According to some embodiments, the split can capture, at least partially, this sequence of different social interactions/behaviors in order to divide accordingly the session over time.

[00132] Hereinafter various rules are provided for the splitting criterion. In some embodiments, for each of these rules, a score can be assigned to each time segment, and an aggregated score can be determined to select the most relevant split of the session into time segments.

[00133] In some embodiments, the splitting criterion is informative of a consistency of at least part of the data Dbehavioral expr. in each of the time segments. This consistency can be determined using statistical analysis, such as variance analysis, etc. Indeed, if Dbehavioral expr. is consistent within a first set of values in a first time segment, and consistent within a second set of values (different from the first set of values) in another second time segment (different from the first time segment), this can provide an indication that the first time segment and the second time segment respectively correspond to different social interactions (or different sets of social interactions).

[00134] In particular, it has been explained above that Dbehavioral expr. can be indicative of various different behavioral expressions (as shown in the various embodiments above, each behavioral expression is associated, for a given participant, for each period of time of a plurality of periods of time, with a corresponding value). In some embodiments, the splitting criterion is informative of a consistency of the data in each time segment, per behavioral expression.

[00135] For example, assume that Dbehavioral expr. includes values for "attention", "speaker's fluency" and "blame" for the participants. For the first behavioral expression "attention", consistency of the values collected for this first behavioral expression (for at least one participant, or for all participants) is determined over the session. For the second behavioral expression, "speaker's fluency", consistency of the values collected for this second behavioral expression (for at least one participant, or for all participants) is determined over the session. For the third behavioral expression "blame", consistency of the values collected for this third behavioral expression (for at least one participant, or for all participants) is determined over the session. These indicators and their number are not limitative and different indicators (and/or a different number of behavioral expressions) can be used.

[00136] For each behavioral expression, this provides an indication of time segments for which Dbehavioral expr. is consistent. This can provide a first indication on a possible split of the session into time segments in which different social interactions occurred.
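One hedged way to quantify such per-expression consistency, mentioned here only as a sketch, is to map the variance of the values of each behavioral expression within a candidate time segment to a score (lower variance meaning higher consistency). The function names and the variance-to-score mapping below are assumptions made for illustration.

```python
import statistics

def consistency_score(values) -> float:
    """Lower variance -> higher consistency, mapped into (0, 1] for scoring.
    `values` are the values of one behavioral expression within a candidate
    time segment (hypothetical representation)."""
    if len(values) < 2:
        return 1.0
    return 1.0 / (1.0 + statistics.variance(values))

def segment_consistency(segment_data) -> float:
    """Average per-expression consistency over one candidate time segment;
    `segment_data` maps an expression name (e.g. "attention") to its values."""
    scores = [consistency_score(v) for v in segment_data.values()]
    return sum(scores) / len(scores) if scores else 0.0
```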

[00137] According to some embodiments, and as explained above (see Fig. 2), Dbehavioral expr. includes, for each period of time of a plurality of periods of time of the session, data informative of behavioral expressions of one or more participants of the session. The splitting criterion can be informative, for a given period of time associated with Dbehavioral expr., of a level of inclusion of the given period of time within a time segment of the plurality of time segments. In particular, the splitting criterion can be such that:

- If a given period of time is cut-out by one or more time segments, a penalty is assigned (to the given period of time or equivalently to the one or more time segments);

- If a given period of time is mostly or fully located within a given time segment, a reward is assigned (to the given period of time or equivalently to the given time segment).

[00138] A non-limitative example is provided in Fig. 6. Three (candidate) time segments 640, 650 and 660 are illustrated. Dsocial includes data 600 for a first period of time (corresponding to the length of the segment 600), and data 610 for a second period of time, both fully located within the first time segment 640. Therefore, the first time segment 640 is, according to this rule of the splitting criterion, a good candidate. To the contrary, data 620 is within a third period of time cut-out by the second time segment 650 and the third time segment 660. Similarly, data 630 is within a fourth time period cut-out by the second time segment 650 and the third time segment 660. Therefore, the second time segment 650 and the third time segment 660 are not good candidates according to this rule of the splitting criterion. This example is not limitative.
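A minimal sketch of this penalty/reward rule is given below. For simplicity it rewards only full inclusion of a period within a single segment; periods and segments are represented as (start, end) tuples, and the reward/penalty values are hypothetical.

```python
def inclusion_score(period, segments, reward: float = 1.0, penalty: float = -1.0) -> float:
    """Reward a period of time fully contained in a single time segment,
    penalize a period cut-out by a segment boundary. Periods and segments
    are (start, end) tuples (hypothetical representation)."""
    p_start, p_end = period
    for s_start, s_end in segments:
        if s_start <= p_start and p_end <= s_end:
            return reward   # fully contained in one time segment
    return penalty          # cut-out by one or more segment boundaries

segments = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
print(inclusion_score((2.0, 6.0), segments))   # reward: fully inside the first segment
print(inclusion_score((8.0, 12.0), segments))  # penalty: cut by a segment boundary
```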

[00139] According to some embodiments, and as explained above, Dsocial comprises, for each period of time of a plurality of periods of time of the session, first data informative of a first behavioral expression of one or more participants of the session, and second data informative of a second behavioral expression of one or more participants of the session (the second behavioral expression being different from the first behavioral expression).

[00140] The splitting criterion can be informative of a likelihood that a first value of the first data and a second value of the second data occur within a common time segment. The likelihood can be measured e.g. based on historical data and/or based on an input of an operator. This can be generalized to more than two different indicators.

[00141] For example, the likelihood that a given participant is, within a given time segment, both confirming ("confirmation" is a possible indicator of Dsocial, as mentioned above) and blaming ("blaming" is a possible indicator of Dsocial, as mentioned above) is low. As a consequence, a time segment which both comprises these two types of social interactions, will get a lower score.

[00142] According to some embodiments, the method includes splitting the session into a plurality of time intervals and splitting each of the time intervals into a plurality of time segments. In particular, the plurality of time intervals can be selected based on a determination of a single main speaker among the participants in each of these time intervals (see e.g. operation 250 in Fig. 2). A main speaker can be defined as a speaker who communicates in the majority (or more) of the time interval, wherein the time interval has a length which is above e.g. a predefined threshold (which can be set e.g. by an operator). This is illustrated in Fig. 7, in which the session is split into a plurality of N time intervals 7001, 7002, ..., 700N. In some embodiments, the time intervals can be of different sizes or of the same size. Each time interval is split into a plurality of time segments, using the splitting criterion described above. For example, as illustrated in Fig. 7, time interval 7001 is split into three time segments 7001,1, 7001,2 and 7001,3.

[00143] In some embodiments, the split of the session into time segments is performed using an iterative process. At each iteration, it is attempted to determine candidate time segments, according to the splitting criterion. For a given iteration, the method can include merging at least two time segments (e.g. adjacent time segments) determined at a previous iteration, based on the splitting criterion. In some embodiments, the splitting criterion can prevent merging of time segments, based e.g. on rules such as contradicting social interactions (both "happy" and "upset"), difference in the main speaker, etc.

[00144] In some embodiments, at each iteration, a score is computed e.g. based on the splitting criterion for the time segments. The method can be repeated until a stopping criterion is met (e.g. the score is no longer improved and/or has reached a threshold).

[00145] An example of such an iterative process is illustrated in Fig. 8. Assume that data Dbehavioral expr. (see reference 800) has been obtained for one or more participants of the session. As shown in Fig. 8, each data of Dbehavioral expr. is associated with a corresponding period of time of the session (corresponding to the length of the illustrated box). For example, data 805 is associated with period of time 810. As explained above, data Dbehavioral expr. can include various different behavioral expressions, each being associated with at least one value for a participant in a period of time.

[00146] The method can include determining, using the different periods of time, all possible time segments 820. Based on this first rough split, the method includes merging a subset of the time segments 820. In particular, according to some embodiments, only adjacent time segments 820 that improve an overall score are merged. The overall score can be calculated based on the splitting criterion (the splitting criterion can be informative of the consistency of the data in the time segments and/or of other parameters mentioned above). The split of the session into time segments can be improved iteratively, by merging the time segments so as to improve the overall score, until the stopping criterion is met (e.g. the overall score is no longer improved and/or has reached a threshold).

[00147] According to some embodiments, the session can be split into a plurality of time segments using other segmentation algorithms which rely on an iterative process of improving a score computed according to the rules defined in the splitting criterion mentioned above. For example, the segmentation algorithm described in "Segmenting time series: A survey and novel approach", Keogh, Eamonn, et al., Data mining in time series databases 57 (2004): 1-22, can be used. This is however not limitative.
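A minimal sketch of such a greedy merge iteration is given below; it is not the specific segmentation algorithm cited above, but a simplified illustration. `segments` as (start, end) tuples and the `score` callable standing for the splitting criterion are hypothetical placeholders.

```python
def greedy_merge(segments, score):
    """Iteratively merge the first pair of adjacent segments that improves the
    overall score, until no merge improves it. `segments` is a list of
    (start, end) tuples and `score(segments)` implements the splitting
    criterion (both are hypothetical placeholders)."""
    current = list(segments)
    best_score = score(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for i in range(len(current) - 1):
            merged = current[:i] + [(current[i][0], current[i + 1][1])] + current[i + 2:]
            new_score = score(merged)
            if new_score > best_score:
                current, best_score, improved = merged, new_score, True
                break
    return current
```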

[00148] Attention is now drawn to Fig. 9.

[00149] According to some embodiments (see operations 900 and 910 in Fig. 9), for at least one participant of the session (or for each participant of the session), a data structure including a structured representation is determined for each period of time of a plurality of periods of time of the session. In some embodiments, the plurality of periods of time can correspond to the time segments determined using the method of Fig. 5.

[00150] For a given participant, and for a given period of time, the structured representation can be informative of behavioral expressions of the given participant in the given period of time. The structured representation can include data Dbehavioral expr. computed using the method of Fig. 2, and/or data derived from Dbehavioral expr. In particular, it can include e.g. at least one of Dnon-verbal, Dparalinguistic, Dlanguage.

[00151] As explained with reference to Fig. 2, Dbehavioral expr. includes, for each period of time of a plurality of periods of time of the session, data informative of behavioral expressions of one or more participants of the session. As explained with reference to Fig. 5, the session can be divided into time segments which are different (see e.g. Figs. 6 and 8) from these periods of time. Therefore, if the structured representation mentioned in Fig. 9 is generated for each of the time segments, it can be necessary to aggregate Dbehavioral expr. in order to map Dbehavioral expr. from these periods of time into these time segments. For example, if pitch level is collected every ms and a time segment has a duration of 5ms, then the pitch level stored in the structured representation corresponds e.g. to an average value over this time segment. This example is not limitative.
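A minimal sketch of this aggregation step (averaging values of one behavioral expression over a time segment) is given below. The (start, end, value) representation of the per-period samples is a hypothetical format assumed for illustration; other aggregation methods (maximum, median, etc.) could equally be used.

```python
def aggregate_into_segment(samples, seg_start: float, seg_end: float) -> float:
    """Average the values of one behavioral expression whose periods of time
    overlap a given time segment; `samples` is a list of (start, end, value)
    tuples (hypothetical representation)."""
    values = [v for start, end, v in samples
              if min(end, seg_end) > max(start, seg_start)]  # overlapping periods only
    return sum(values) / len(values) if values else 0.0

# Hypothetical usage: pitch level sampled over short periods, averaged per segment.
pitch_samples = [(0.0, 1.0, 110.0), (1.0, 2.0, 130.0), (2.0, 3.0, 120.0)]
print(aggregate_into_segment(pitch_samples, 0.0, 2.0))  # 120.0
```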

[00152] In some embodiments, the structured representation is also informative of a reaction of one or more other participants different from the given participant in the given period of time. In other words, the structured representation captures the reaction of the audience to the behavior of the given participant. Reaction of the audience can be computed e.g. by aggregating data Dbehavioral expr. (or data derived from Dbehavioral expr.) obtained for the participants belonging to the audience (aggregating can include e.g. averaging or using other aggregation methods), in particular for specific behavioral expressions such as attention level, etc.

[00153] A non-limitative example of a data structure 1000 including a structured representation as mentioned with reference to Fig. 9 is illustrated in Fig. 10, for a given participant X at a given time segment T of the session. Data structure 1000 can be used as part of data Dbehavioral expr.

[00154] In this non-limitative example, data structure 1000 includes data Dnon-verbal (e.g. joy expression, contempt expression, etc. - see other examples with reference to Fig. 2), data Dparalinguistic (e.g. anger tone, joyful response, etc. - see other examples with reference to Fig. 2) and data Dverbal (e.g. use of profanity, etc. - see other examples with reference to Fig. 2). Data structure 1000 also includes data informative of the reaction of the audience, such as whether peers express agreement, peers raise voice at the given participant (which can be determined based on the voice level/pitch level), peers raise voice at one another, a specific peer did something related to the given participant, etc.

[00155] Attention is now drawn to Figs. 11, 11A and 12.

[00156] Fig. 11 depicts an architecture of a machine learning module 1101. As shown, the machine learning module 1101 includes a first network 1100 comprising a plurality of layers. The architecture of the first network 1100 can correspond e.g. to a deep neural network. Each layer comprises a plurality of computational elements (also called nodes). Computational elements of a given layer can be connected with CEs of a preceding layer and/or a subsequent layer. Each connection between a CE of a preceding layer and a CE of a subsequent layer is associated with a weighting value (also called weight). A given CE can receive inputs from CEs of a previous layer via the respective connections, each given connection being associated with a weighting value which can be applied to the input of the given connection. The weighting values can determine the relative strength of the connections and thus the relative influence of the respective inputs on the output of the given CE.

[00157] The machine learning module 1101 includes a second network 1110 comprising a plurality of layers. The second network 1110 is identical to the first network 1100. According to some embodiments, and as explained hereinafter, a value assigned to a weight of a node of a layer of the first network 1100 is identically assigned to a weight of a corresponding node of a corresponding layer of the second network 1110.

[00158] The machine learning module 1101 further includes an aggregating network 1120 (which also comprises a plurality of layers, each including nodes associated with one or more weights, as explained above). The architecture of the aggregating network 1120 can correspond e.g. to a deep neural network. In practice, according to some embodiments, the first network 1100, the second network 1110 and the aggregating network 1120 are part of the same network.

[00159] The aggregating network 1120 receives, as input 1103, the output 1106 of the first network 1100 and the output 1104 of the second network 1110. According to some embodiments, the size of the input layer of the aggregating network 1120 is equal to the sum of the size of the output layer of the first network 1100 and the size of the output layer of the second network 1110.

[00160] According to some embodiments, the size of the output layer of the aggregating network 1120 is equal to the size of the input layer of the first network 1100 (which is equal to the size of the input layer of the second network 1110). In other words, the size of the input 1108 of the first network 1100 (respectively, of the input 1109 of the second network 1110) and the size of the output 1111 of the machine learning module 1101 can be identical.
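
A minimal, non-limitative PyTorch sketch of this architecture is shown below; the layer sizes, module names and dimensions are assumptions chosen purely for illustration. Weight sharing between the first network and the second network is obtained here by reusing a single encoder object, which is one possible way to realize the "identical weights" constraint described above; the aggregating network takes the concatenated embeddings as input and produces an output of the same size as each encoder input.

```python
import torch
import torch.nn as nn

class ContextualEmbeddingModule(nn.Module):
    """Sketch of a machine learning module akin to 1101: two identical (weight-shared)
    encoder networks feeding an aggregating network whose output has the same size
    as each encoder's input."""
    def __init__(self, input_dim: int = 64, embed_dim: int = 16):
        super().__init__()
        # A single encoder instance reused for both inputs implements the
        # "identical networks with identical weights" constraint.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32), nn.ReLU(),
            nn.Linear(32, embed_dim),
        )
        # Aggregating network: input size = 2 * embed_dim, output size = input_dim.
        self.aggregator = nn.Sequential(
            nn.Linear(2 * embed_dim, 32), nn.ReLU(),
            nn.Linear(32, input_dim),
        )

    def forward(self, x_first: torch.Tensor, x_second: torch.Tensor) -> torch.Tensor:
        e_first = self.encoder(x_first)    # embedded representation of the first input
        e_second = self.encoder(x_second)  # embedded representation of the second input
        return self.aggregator(torch.cat([e_first, e_second], dim=-1))
```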

[00161] Fig. 11A describes a variant of the architecture of Fig. 11.

[00162] In this variant, the machine learning module 1101 comprises a first network 1100 comprising a plurality of layers (e.g. a DNN), a second network 1110 comprising a plurality of layers (e.g. a DNN) and a third network 1115 comprising a plurality of layers (e.g. a DNN). The first network 1100, the second network 1110, and the third network 1115 are identical.

[00163] According to some embodiments, and as explained hereinafter, a value assigned to a weight of a node of a layer of the first network 1100 is identically assigned to a weight of a corresponding node of a corresponding layer of the second network 1110 and to a weight of a corresponding node of a corresponding layer of the third network 1115.

[00164] The machine learning module 1101 further includes an aggregating network 1120₁, which receives, as input 1103₁, the output 1106 of the first network 1100, the output 1104 of the second network 1110 and the output 1107 of the third network 1115. According to some embodiments, the size of the input layer of the aggregating network 1120₁ is equal to the sum of the size of the output layer of the first network 1100, the size of the output layer of the second network 1110 and the size of the output layer of the third network 1115.

[00165] According to some embodiments, the size of the output layer of the aggregating network 1120₁ is equal to the size of the input layer of the first network 1100 (which is equal to the size of the input layer of the second network 1110 and to the size of the input layer of the third network 1115). In other words, the size of the input 1108 of the first network 1100 (respectively the size of the input 1109 of the second network 1110 and the size of the input 1114 of the third network 1115) and the size of the output 1111₁ of the machine learning module 1101₁ can be identical.

[00166] More generally, the machine learning module 1101/1101₁ can include N identical networks (e.g. deep neural networks), all connected to an aggregating network (e.g. a deep neural network) receiving, as an input, the output of the N networks. The N identical networks and the aggregating network can be part of the same network.

[00167] The size of the output layer of the aggregating network is equal to the size of the input layer of each of the N networks. A weight assigned to a given node of a given layer of a network NWi of the N networks is identically assigned to the corresponding node of the corresponding layer of each other network NWj (with 1 ≤ j ≤ N and j different from i).

[00168] Fig. 12 illustrates a method of training the machine learning module 1101.

[00169] The method includes obtaining (operation 1200) a first set of data 1108 informative of a first time segment of a given content. The method includes obtaining (operation 1210) a second set of data 1109 informative of a second time segment of the given content.

[00170] For example, the first set of data can correspond to a data structure 1000 as depicted in Fig. 10, which is informative, for a given participant of a session, of behavioral expressions of the given participant in a first time segment, and, in some embodiments, of a reaction of one or more other participants different from the given participant in this first time segment. The second set of data 1109 can correspond to a data structure 1000 as depicted in Fig. 10, which is informative, for the given participant of the session, of behavioral expressions of the given participant in a second time segment, and, in some embodiments, of a reaction of one or more other participants different from the given participant in this second time segment. In some embodiments, and as explained with reference to Fig. 7, the session is divided into a plurality of time intervals, wherein each time interval is divided into a plurality of time segments. In some embodiments, the first time segment and the second time segment belong to the same time interval.

[00171] The example provided above is not limitative, and the first set of data 1108 and the second set of data 1109 can correspond to other data of a given content which evolves over time (the given content includes e.g. audio content, video content, etc.).

[00172] The method further includes (operation 1220) using the first set of data 1108 and the second set of data 1109 to train the machine learning module 1101. In particular, the machine learning module 1101 is trained to generate a first data structure 1106 corresponding to an embedded representation of the first set of data 1108 and to generate a second data structure 1104 corresponding to an embedded representation of the second set of data 1109. An embedding is a relatively low-dimensional space into which high-dimensional vectors can be translated. Therefore, the embedded representation 1106 of the first set of data 1108 is of lower size than the first set of data (while maintaining, as much as possible, relevant information of the first set of data 1108). Similarly, the embedded representation 1104 of the second set of data 1109 is of lower size than the second set of data 1109.

[00173] As explained above, the first data structure 1106 is generated by the first network 1100 and the second data structure 1104 is generated by the second network 1110.

[00174] The machine learning module 1101 is trained to reconstruct the second set of data 1109 using the first data structure 1106 and the second data structure 1104. In other words, it is attempted to obtain an output 1111 of the machine learning module 1101 which is identical to the second set of data 1109 (or, in some embodiments, to the first set of data 1108). The output 1111 is generated by the aggregating network 1120 using the first data structure 1106 and the second data structure 1104.

[00175] The training includes generating appropriate weights for nodes of the first network, the second network and the aggregating network, to minimize e.g. a loss function reflecting the intended goal (as described above). Training methods such as "Backpropagation" can be used.
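
A hedged sketch of such a training loop, reconstructing the second set of data from the two embeddings by backpropagation on a reconstruction loss, is given below; the data iterable, the choice of MSE as the loss, and the hyper-parameters are illustrative placeholders rather than the method actually claimed. The model is assumed to be an instance of the ContextualEmbeddingModule sketched earlier, so the weight-sharing constraint is satisfied by construction.

```python
import torch
import torch.nn as nn

def train_contextual_module(model, segment_pairs, epochs: int = 10, lr: float = 1e-3):
    """segment_pairs: iterable of (first_segment, second_segment) tensor pairs,
    e.g. vectorized data structures 1000 for two time segments of the same interval."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # loss reflecting the reconstruction goal
    for _ in range(epochs):
        for x_first, x_second in segment_pairs:
            output = model(x_first, x_second)   # module output (cf. output 1111)
            loss = loss_fn(output, x_second)    # reconstruct the second set of data
            optimizer.zero_grad()
            loss.backward()                     # backpropagation
            optimizer.step()
    return model
```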

[00176] As explained above, the first network 1100 and the second network 1110 are trained in such a way that a value assigned to a weight of a node of a layer of the first network 1100 is identically assigned to the weight of a corresponding node of a corresponding layer of the second network 1110.

[00177] The method of Fig. 12 has been described with reference to the machine learning module 1101 of Fig. 11.

[00178] Fig. 13 depicts a method of training a machine learning module 1101₁ as depicted in Fig. 11A.

[00179] The method includes obtaining (operation 1300) a first set of data 1108 informative of a first time segment of a given content. The method includes obtaining (operation 1310) a second set of data 1109 informative of a second time segment of the given content. The method includes (operation 1315) obtaining a third set of data 1114 informative of a third time segment of the given content. In some embodiments, the second time segment is located between the first time segment and the third time segment.

[00180] As mentioned above, the first set of data, the second set of data and the third set of data can each correspond to a data structure informative of behavioral expressions of a participant in different time segments of the session, such as data structure 1000 as described e.g. with reference to Fig. 10. As a consequence, the first (respectively second, third) set of data can correspond to the data structure 1000 obtained for a given participant in the first (respectively second, third) time segment. In some embodiments, the first time segment, the second time segment and the third time segment belong to the same time interval (see a definition of a time interval with reference to Fig. 7).

[00181] The method further includes using (operation 1320) the first set of data 1108, the second set of data 1109 and the third set of data 1114 to train the machine learning module 1101₁ to generate a first data structure 1106 corresponding to an embedded representation of the first set of data 1108, generate a second data structure 1104 corresponding to an embedded representation of the second set of data 1109 and generate a third data structure 1107 corresponding to an embedded representation of the third set of data 1114.

[00182] The machine learning module 1101₁ is trained to reconstruct e.g. the second set of data 1109 using the first data structure 1106, the second data structure 1104 and the third data structure 1107. In other words, it is attempted to obtain an output 1111₁ of the machine learning module 1101₁ which is identical to the second set of data 1109 (or, in some embodiments, to the first set of data 1108 or to the third set of data 1114). The output 1111₁ is generated by the aggregating network 1120₁ using the first data structure 1106, the second data structure 1104 and the third data structure 1107.

[00183] As may be understood from the embodiments described above, training of the machine learning module 1101 or 1101₁ is performed by taking into account the temporal context of the second set of data 1109. In the example of Fig. 11, while it is attempted to reconstruct the second set of data 1109, its temporal context (the first set of data 1108) is taken into account. In the example of Fig. 11A, in which it is attempted to reconstruct the second set of data 1109, its temporal context (which corresponds to the first set of data 1108 and to the third set of data 1114) is taken into account. As a consequence, a context-aware training of the machine learning module is obtained, since each piece of data is taken into account together with its context. This is particularly useful when analyzing data informative of behavioral expressions in a session, since social behavior/social interactions correspond to interdependent pieces of data distributed over time. In particular, the training attempts to teach the machine learning module to capture a link between the different behavioral expressions (for example, a first participant smiled because he received positive feedback from a second participant, etc.).

[00184] As explained above, in a more general configuration, the machine learning module includes N identical networks (configured to each generate an embedded data structure of their input) and an aggregating network receiving the output of the N networks. Training of this machine learning module is performed similarly to what has been described with reference to Figs. 12 and 13. In particular, the machine learning module is trained such that the N identical networks each generate an embedded data structure used by the aggregating network to reconstruct the input of a given network of the N networks. The given network can be later used in a prediction phase to generate an embedded data structure, as explained hereinafter.

[00185] Once the machine learning module 1101 or 1101₁ has been trained, it is possible to use e.g. at least one of the first network, the second network and the third network to generate embedded data (see Figs. 14A and 15).

[00186] Attention is now drawn to Fig. 14.

[00187] According to some embodiments, assume that a given data structure including data Dbehavioral expr. informative of behavioral expressions is obtained (operation 1400) for a participant for at least one time segment of the session. This given data structure can be obtained e.g. based on Dbehavioral expr. computed for this time segment for this participant. The method of Fig. 14 can include generating (operation 1410) an embedded data structure using the given data structure, which is a smaller-size representation of the given data structure. As explained hereinafter (see Fig. 19 and Fig. 21), this embedded data structure can be used as an input of one or more machine learning modules.

[00188] Operation 1410 can rely on various methods, such as Principal Component Analysis, training an Auto-Encoder to reconstruct its input and then using a network which is a subset of the Auto-Encoder and for which the output layer is of smaller size than the input layer, etc. According to some embodiments, and as explained with reference to Figs. 14A and 15, at least part of the trained machine learning module 1101 or 1101₁ can be used to generate the embedded data structure at operation 1410.

[00189] Attention is now drawn to Fig. 14A. As shown in Fig. 14A, assume that a machine learning module 130 implements the second network 1110 of the machine learning module 1101 (or 1101₁) after its training. This is however not limitative, and in some embodiments, the machine learning module 130 can implement the first network 1100 or the third network 1115.
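
Before turning to the operations of Fig. 14A, a non-limitative sketch of the Principal Component Analysis option mentioned for operation 1410 is given below; the feature and component counts, and the use of scikit-learn and random data, are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical example: each row is a vectorized data structure (e.g. structure 1000)
# for one time segment; PCA yields a smaller-size embedded data structure per segment.
segment_vectors = np.random.rand(200, 64)   # 200 time segments, 64 raw features (placeholder data)
pca = PCA(n_components=16)                   # embedded size chosen arbitrarily for illustration
embedded = pca.fit_transform(segment_vectors)
print(embedded.shape)                        # (200, 16): one embedded data structure per segment
```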

[00190] The method of Fig. 14A includes feeding (operation 1420) a given set of data informative of a time segment of a given content to the (trained) machine learning module 130.

[00191] The machine learning module 130 outputs (operation 1430) an embedded data structure (of lower size than the given set of data).

[00192] In some embodiments, the given set of data corresponds to a data structure 1000 as depicted in Fig. 10, which is informative, for a given participant of a session, of behavioral expressions of the given participant in a first time segment (and in some embodiments, of a reaction of one or more other participants different from the given participant in this first time segment). The output of the machine learning module 130 is therefore an embedded data structure derived from this data structure, for this first time segment (in compliance with operation 1410 described above).

[00193] As mentioned above, the machine learning module 130 has undergone a particular training, in which the temporal context of the data used in the training has been taken into account. Therefore, embedding enabled by the machine learning module 130 is particularly efficient for a data structure which represents behavioral expressions of a participant in a session, since social behavior/social interactions comprise interdependent pieces of data distributed over time.

[00194] Another possible usage of the machine learning module 130 is depicted in Fig. 15.

[00195] Assume that a session has been divided into a plurality of time intervals 1500₁ to 1500ₙ, each divided into a plurality of time segments (see e.g. 1500₁,₁, 1500₁,₂, etc.), as explained e.g. with reference to Fig. 7.

[00196] For a given participant, a data structure (see 1510₁,₁, ..., 1510ₙ,₂ - this data structure can be e.g. similar to the non-limitative example of Fig. 10) informative of social expressions of the participant can be generated in each time segment.

[00197] Each data structure (1510₁,₁, ..., 1510ₙ,₂) of each time segment (1500₁,₁, ..., 1500ₙ,₂) can be fed to the machine learning module 130, which outputs a corresponding embedded data structure (see 1520₁,₁, ..., 1520ₙ,₂). The corresponding embedded data structure 1520ᵢ,ⱼ is of lower size than the data structure 1510ᵢ,ⱼ and captures its "essential" features. As a consequence, the embedded data structure 1520ᵢ,ⱼ can be more easily processed by one or more machine learning module(s) to determine social skills of participants, as explained hereinafter.
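
A short, non-limitative usage sketch of this step is shown below, assuming the trained weight-shared encoder from the earlier architecture sketch plays the role of machine learning module 130; tensor shapes are placeholders.

```python
import torch

def embed_session_segments(encoder, segment_vectors: torch.Tensor) -> torch.Tensor:
    """segment_vectors: tensor of shape (num_segments, input_dim), one row per per-segment
    data structure; returns one embedded data structure (lower size) per row."""
    encoder.eval()
    with torch.no_grad():
        return encoder(segment_vectors)

# Hypothetical call: 12 segments of 64 features each -> 12 embeddings of size 16.
# embeddings = embed_session_segments(model.encoder, torch.rand(12, 64))
```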

[00198] Attention is now drawn to Fig. 16.

[00199] Assume that data Dbehavioral expr. informative of behavioral expressions of a plurality of participants is obtained (operation 1600) for a plurality of different sessions (based on at least one of an audio content and a video content of each of the plurality of different sessions). The method of Fig. 16 can include using (operation 1610) this data Dbehavioral expr. to generate a personalized data structure informative of behavioral expressions of a given participant across the different sessions. In some embodiments, Dbehavioral expr. can be represented for each participant using a structured representation similar to Fig. 10, for each time segment (as determined in Fig. 5) of the session. The personalized data structure can be viewed e.g. as a vector representative of a social behavior of the participant across many sessions.

[00200] The personalized data structure attempts to capture e.g. typical social expressions of the participant in various sessions, such as his tendency to smile, etc. This is however not limitative.

[00201] A non-limitative example of the method of Fig. 16 is provided in Fig. 16A. Assume that a machine learning module 1630 implements a deep neural network, such as an LSTM Auto-Encoder (this is not limitative). Assume that M participants all participated in N sessions. Assume that for each session, data Dbehavioral expr. informative of behavioral expressions of each participant is obtained, for each time segment of the session (see e.g. the data structure of Fig. 10). The machine learning module 1630 can be fed with Dbehavioral expr. (see reference 1640) for all sessions. The machine learning module 1630 is trained to reconstruct, for each given time interval 1650 including a plurality of time segments, and for each given participant, corresponding data Dbehavioral expr. received as an input. In the machine learning module 1630, a layer of the network (such as the last layer before the output layer of the network) includes, for each participant, a set of weights (which can be initialized with random values before training), which needs to be determined during the training. Once the training is completed, a personalized set of weights is obtained per participant, which corresponds to the personalized data structure of each participant.
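
A hedged sketch of this idea is given below, assuming an LSTM Auto-Encoder with a learnable per-participant embedding standing in for the personalized set of weights; all class names, dimensions and the exact placement of the embedding are illustrative assumptions, not the claimed architecture.

```python
import torch
import torch.nn as nn

class PersonalizedAutoEncoder(nn.Module):
    """Sketch: reconstruct a participant's per-segment behavioral vectors while learning
    a per-participant embedding that acts as the personalized data structure."""
    def __init__(self, num_participants: int, feat_dim: int = 64, hidden: int = 32, person_dim: int = 8):
        super().__init__()
        self.person_embedding = nn.Embedding(num_participants, person_dim)  # personalized weights
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden + person_dim, feat_dim)

    def forward(self, segments: torch.Tensor, participant_id: torch.Tensor) -> torch.Tensor:
        # segments: (batch, num_segments, feat_dim); participant_id: (batch,) long tensor
        hidden_seq, _ = self.encoder(segments)
        person = self.person_embedding(participant_id)                 # (batch, person_dim)
        person = person.unsqueeze(1).expand(-1, segments.size(1), -1)  # repeat over segments
        return self.decoder(torch.cat([hidden_seq, person], dim=-1))   # reconstruction

# After training to reconstruct Dbehavioral expr. across sessions, the row
# model.person_embedding.weight[p] can be read out as the personalized data
# structure of participant p (illustrative interpretation).
```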

[00202] The method of Fig. 16A is not limitative, and other methods can be used, such as aggregating Dbehavioral expr. of each participant across various sessions (e.g. averaging) and reducing the size of the result using e.g. Principal Component Analysis (PCA).

[00203] Attention is now drawn to Fig. 17.

[00204] Assume that data Dbehavioral expr. informative of behavioral expressions of one or more participants is obtained (operation 1700 - this data can be arranged according to the structured representation of Fig. 10) for each time segment of a session, wherein the session is divided into time intervals each including a plurality of time segments (as explained with reference to Fig. 5). The method of Fig. 17 can include using (operation 1710) this data Dbehavioral expr. to train a machine learning module (implementing a deep neural network, such as an LSTM Auto-Encoder - this is not limitative) to generate an embedded data structure informative of behavioral expressions of the one or more participants in each time interval. In other words, this embedded data structure provides an aggregated view of the behavioral expressions over a time interval, based on a plurality of data structures informative of the behavioral expressions in a plurality of time segments of this time interval.

[00205] Once the machine learning module is trained, it can be used (see Fig. 17B) for processing data of other sessions. The method can include feeding (operation 1720) data Dbehavioral expr. informative of behavioral expressions of one or more participants in each of a plurality of time segments of a given time interval to the trained machine learning module. The method further includes generating (operation 1730), by the trained machine learning module, at least one embedded data structure informative of behavioral expressions of the one or more participants in the given time interval.

[00206] Fig. 17B illustrates an embodiment of training a machine learning module 1740 in compliance with the method of Fig. 17.

[00207] Assume that the machine learning module 1740 is fed with data Dbehavioral expr. (1725) informative of behavioral expressions of one or more participants in each of the time segments 1731, 1732, and 1733 of a given time interval 1734.

[00208] The machine learning module 1740 implements a network including a plurality of layers 1760. The machine learning module 1740 attempts to generate an embedded data structure 1750 informative of behavioral expressions of the one or more participants in the time interval 1734, which is in turn used to reconstruct, at the output 1770 of the machine learning module 1740, data Dbehavioral expr. 1725.

[00209] This training process can be repeated for a plurality of time intervals of one or more sessions. Once the machine learning module 1740 is trained, it is possible to extract only a subset 1780 of the network of the trained machine learning module 1740, which can be used as part of another machine learning module 1781. The subset 1780 can include e.g. the network of the machine learning module 1740 up to the layer which provided the embedded data structure 1750. In prediction, this other machine learning module 1781, which implements the subset 1780, provides (see Fig. 17C), based on data Dbehavioral expr. (reference 1785 - this data can be arranged according to the representation of Fig. 10) informative of behavioral expressions of one or more participants in a plurality of time segments 1781, 1782, 1783 of a time interval 1784, at least one embedded data structure 1790 informative of behavioral expressions of the one or more participants in the time interval 1784.
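
A simplified, non-limitative sketch of this training scheme is given below; for brevity it uses a small feed-forward auto-encoder rather than the LSTM Auto-Encoder mentioned above, and all dimensions are placeholders. The point illustrated is that, after training, the encoder part alone (playing the role of the subset 1780) can be reused to produce one embedded data structure per time interval.

```python
import torch
import torch.nn as nn

class IntervalAutoEncoder(nn.Module):
    """Sketch in the spirit of machine learning module 1740: compress the per-segment
    vectors of a time interval into one embedded data structure, then reconstruct them."""
    def __init__(self, feat_dim: int = 64, interval_embed: int = 24, segments_per_interval: int = 3):
        super().__init__()
        self.segments = segments_per_interval
        self.encoder = nn.Sequential(             # the part reusable after training (cf. subset 1780)
            nn.Flatten(),                          # (batch, segments * feat_dim)
            nn.Linear(segments_per_interval * feat_dim, interval_embed), nn.ReLU(),
        )
        self.decoder = nn.Linear(interval_embed, segments_per_interval * feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, segments_per_interval, feat_dim)
        embedded = self.encoder(x)                 # one embedded data structure per interval
        return self.decoder(embedded).view(-1, self.segments, x.size(-1))  # reconstruction

# After training on many intervals, `model.encoder` alone maps the segment vectors of a
# time interval to one embedded data structure (in the spirit of module 1781 producing 1790).
```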

[00210] Generating an embedded data structure informative of behavioral expressions of the one or more participants in a time interval, based on data Dbehavioral expr. informative of behavioral expressions of the one or more participants for a plurality of time segments of the time interval, can rely on other methods, such as using a Recurrent Neural Network, a Markov Chain algorithm, etc.

[00211] Attention is now drawn to Figs. 18 and 19.

[00212] As explained with reference to Fig. 1, system 100 includes one or more additional machine learning modules 140 (see machine learning modules 140₁, 140₂, ..., 140ₙ), which can be used to automatically determine data informative of social and/or emotional performance indicators of participants of a session. The method of Fig. 18 depicts operations which can be performed to train at least one of the machine learning modules 140.

[00213] As explained hereinafter, the machine learning modules 140 are trained to determine data informative of social and/or emotional performance indicators of people, based on data derived (by a computerized method) from the audio and/or video content 120 of a session.

[00214] According to some embodiments, each machine learning module 140 is trained to determine data informative of a different social and/or emotional performance indicator. For example, assume that N machine learning modules are implemented, then, for a given participant of a session, data informative of N different social and/or emotional performance indicators can be predicted for this given participant. This is however not limitative, and in some embodiments, a given machine learning module can be trained to determine data informative of a plurality of social and/or emotional performance indicators.

[00215] In some embodiments, the machine learning modules 140 are divided into groups, each group being devoted to a different category of social and/or emotional performance indicator. For example, a first group 1980 of machine learning modules is trained to determine data informative of specific social and/or emotional performance indicators which correspond to skills of the category "Presentation Skills". These specific social and/or emotional performance indicators can include e.g. "Explains clearly", "Addresses alternate or contrasting views", "Answers questions effectively", etc. A second group 1981 of machine learning modules is trained to determine data informative of specific social and/or emotional performance indicators which correspond to skills of the category "Communication Skills". These specific social and/or emotional performance indicators can include e.g. "Actively listens while others talk", "Impolite or rude", "Actively participates in discussion", etc. A third group 1982 of machine learning modules is trained to determine data informative of specific social and/or emotional performance indicators which correspond to skills of the category "Collaboration Skills". These specific social and/or emotional performance indicators can include e.g. "Helps others when asked", "Actively cooperates to solve problems with the team", "Asks others for advice", etc. This list is not limitative, and additional and/or different social and/or emotional performance indicators can be defined.

[00216] The method includes obtaining (operation 1800) data Dbehavioral expr. (see reference 1900 in Fig. 19) informative of behavioral expressions of the one or more participants in the session.

[00217] According to some embodiments, Dbehavioral expr. comprises a set of embedded data structures (see reference 1901) including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of one or more behavioral expressions and derived from at least one of the audio content and the video content of the session in said time segment. An example of a set of embedded data structures has been described with reference to Fig. 15 (see 1520₁,₁, ..., 1520ₙ,₂). As explained above, each embedded data structure corresponds to an embedding of a data structure of larger size informative of a social behavior of the participant(s). The embedded data structure can be generated based on a data structure of larger size using a machine learning module. In some embodiments, the embedding is performed using a machine learning module (see reference 130) which has been trained by taking into account the temporal context of the data, as explained above.

[00218] According to some embodiments, Dbehavioral expr. can include at least one of Dnon-verbal (reference 1910), Dparalinguistic (reference 1920) and Dlanguage (reference 1930). Various examples have been provided for this data in Fig. 2.

[00219] According to some embodiments, Dbehavioral expr. can include a second set of embedded data structures (see reference 1940) including, for each time interval of a plurality of time intervals (a time interval includes a plurality of time segments) of the session, at least one embedded data structure informative of behavioral expressions of the one or more participants in the time interval. Embodiments for computing this second set of embedded data structures have been described with reference to Figs. 17, 17A, 17B and 17C.

[00220] According to some embodiments, Dbehavioral expr. can include, for at least one participant (or for each participant), a personalized data structure (1950) informative of behavioral expressions of the participant across different sessions. Embodiments for computing this personalized data structure have been described with reference to Figs. 16 and 16A.

[00221] The method of Fig. 18 further includes obtaining (operation 1810) a label informative of one or more social and/or emotional performance indicators of the one or more participants in the session (this corresponds to supervised learning).

[00222] Assume that a given machine learning module 140ᵢ of the machine learning modules 140 is to be trained. This given machine learning module 140ᵢ is trained to determine data informative of a given social and/or emotional performance indicator "Indicator". Therefore, a label is obtained which provides, for at least one participant of the session (or for each participant of the session), a value informative of the given social and/or emotional performance indicator "Indicator". The label can be provided e.g. by an operator (such as e.g. a teacher, an expert in social skills, a psychologist, etc.) who judges the social and/or emotional performance of the participant during the session, and attributes a corresponding score (e.g. a normalized score between 0 and 1) for the given indicator, used as a label. For example, assume that "Indicator" corresponds to "Recognizes the other's perspective", then the label includes a value between 0 and 1 for this indicator for one or more participants of the session, attributed by an operator.

[00223] In some embodiments, if N machine learning modules 140 are trained, and each machine learning module is trained to determine data informative of a different social and/or emotional performance indicator, then the label includes at least N different values (at least one value per indicator) for a given session.

[00224] The method further includes using (operation 1820) Dbehavioral expr. (including at least part of the set of embedded data structures 1901, and, in some embodiments, at least one of the additional data 1910, 1920, 1930, 1940 and 1950) and the label 1960 to train the one or more machine learning modules 140. The training includes feeding at least part of Dbehavioral expr. and the label to the machine learning module(s) to be trained. As explained above, the machine learning module 140 can implement e.g. neural networks such as DNN or LSTM, and corresponding training methods (using e.g. a cost function to be minimized, informative of a difference between the output of the machine learning module and the label) can be used, such as Backpropagation.
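
A minimal sketch of such a supervised training step for one indicator module (e.g. 140₁) is given below; it assumes the input data has been flattened into one feature vector per labelled participant and that the label is a normalized score between 0 and 1, as described above. The module architecture, loss choice and hyper-parameters are illustrative placeholders only.

```python
import torch
import torch.nn as nn

class IndicatorModule(nn.Module):
    """Sketch of one machine learning module 140: predicts a score in [0, 1]
    for a single social/emotional performance indicator."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

def train_indicator(module, examples, epochs: int = 20, lr: float = 1e-3):
    """examples: iterable of (features, label) pairs, one per labelled participant/session."""
    optimizer = torch.optim.Adam(module.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # difference between the module output and the label
    for _ in range(epochs):
        for features, label in examples:
            loss = loss_fn(module(features).squeeze(-1), label)
            optimizer.zero_grad()
            loss.backward()   # backpropagation on the labelled example
            optimizer.step()
    return module
```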

[00225] In some embodiments, a given machine learning module is first trained based on data Dbehavioral expr. pertaining to a first participant (in particular data 1901, 1910, 1920, 1930, 1950 and 1960 are specific to each participant) and a label informative of the first participant. Then, the given machine learning module is trained based on data Dbehavioral expr. pertaining to a second participant and a label informative of the second participant. This can be repeated for all participants of the session. This is however not limitative.

[00226] In some embodiments, each of the machine learning modules 140₁ to 140ₙ can be trained independently (the training of the different machine learning modules can however occur simultaneously).

[00227] In some embodiments, training the plurality of machine learning modules 140₁ to 140ₙ involves using multi-task learning for at least a subset of machine learning modules 140₁ to 140ₙ of the plurality of machine learning modules. Various algorithms can be used to perform multi-task learning, as explained in "en.wikipedia.org/wiki/Multi-task_learning", such as Microsoft IceCAPS (this is however not limitative).

[00228] In the particular application described with reference to Fig. 18, in which a plurality of machine learning modules is trained to predict data informative of different social and/or emotional performance indicators, use of multi-task learning can help to improve modelling of the indicators, and in particular, their interdependency. Indeed, some social and/or emotional performance indicators are dependent on one another. For example, it is easier to be considered a good presenter in front of an audience which is quiet/silent than in front of an audience which is agitated.

[00229] Therefore, the training can include using multi-task learning for at least a first machine learning module trained to determine data informative of a first social and/or emotional performance indicator, and a second machine learning module trained to predict data informative of a second social and/or emotional performance indicator different from the first indicator, wherein the first indicator and the second indicator are interrelated. An operator can e.g. select indicators considered as interrelated, and, based on this relationship, multi-task learning between the relevant machine learning modules can be performed.
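
One common way to realize such multi-task learning is hard parameter sharing, sketched below: a shared trunk is trained jointly by the losses of several interrelated indicators, each predicted by its own head. This is an illustrative sketch, not necessarily the approach of the toolkits cited above; the indicator names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MultiTaskIndicatorModule(nn.Module):
    """Sketch: a shared trunk models what interrelated indicators have in common;
    each head predicts one social/emotional performance indicator."""
    def __init__(self, feat_dim: int = 128, indicators=("explains_clearly", "answers_effectively")):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.heads = nn.ModuleDict({name: nn.Linear(64, 1) for name in indicators})

    def forward(self, features: torch.Tensor):
        shared = self.trunk(features)
        # One score per indicator, all derived from the shared representation.
        return {name: torch.sigmoid(head(shared)) for name, head in self.heads.items()}

# Training minimizes the sum of the per-indicator losses, so gradients from every
# interrelated indicator shape the shared trunk.
```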

[00230] Training of the machine learning modules 140 can be performed based e.g. on a plurality of labelled sessions, using the various embodiments described above.

[00231] A given machine learning module 140ᵢ of the one or more machine learning modules 140 is usable, after its training, to determine data informative of at least one given social and/or emotional performance indicator of a given participant of a given session comprising at least one of an audio content and a video content.

[00232] Attention is now drawn to Figs. 20 and 21.

[00233] Once the one or more machine learning modules 140 have been trained, it is possible to use them to automatically determine data informative of social and/or emotional performance indicator(s) of participant(s) of a session.

[00234] Assume that at least one of an audio content and a video content of a given session (which included one or more participants) is obtained.

[00235] The method can include obtaining data Dbehavioral expr. (reference 2100 - similar to data 1900) informative of behavioral expressions of the one or more participants for each of a plurality of periods of time in the session, derived from at least one of the audio content and the video content. Various embodiments have been described to generate Dbehavioral expr. based on the audio content and/or video content of the session.

[00236] Dbehavioral expr. can include, as mentioned above, at least one of:

- for at least one participant, a set of embedded data structures (see reference 2101) including, for each time segment of a plurality of time segments of the session, at least one embedded data structure informative of behavioral expressions of the participant and derived from at least one of the audio content and the video content of the session in said time segment. A method of generating data 2101 is depicted e.g. in Figs. 14 and 15;

- for at least one participant, for each period of time of a plurality of periods of time of the session, data Dnon-verbal (2110 - similar to data 1910) informative of non-verbal expressions of the participant in each period of time based on data informative of at least one of a body motion and a facial expression of the participant. Methods of generating data Dnon-verbal have been provided above (see Fig. 2);

- for at least one participant, for each period of time of a plurality of periods of time of the session, one or more paralinguistic expressions Dparalinguistic (2120) based on audio content, or data informative thereof, associated with the participant. Methods of generating Dparalinguistic have been provided above (see Fig. 2);

- for at least one participant, for each period of time of a plurality of periods of time of the session, data Dlanguage (2130) informative of language expressions used by the participant. Methods of generating Dlanguage have been provided above (see Fig. 2);

- a second set of embedded data structures (see reference 2140 - similar to 1940) including, for each time interval of a plurality of time intervals (a time interval includes a plurality of time segments) of the session, at least one embedded data structure informative of behavioral expressions of the one or more participants in the time interval;

- for at least one participant (or for each participant), a personalized data structure (1950) informative of behavioral expressions of the participant across different sessions.

[00237] The method further includes, for a given participant, feeding (operation 2010) at least part of Dbehavioral expr. to the one or more machine learning modules 140. This can include feeding at least part of the set of embedded data structures 2101 pertaining to the given participant. In some embodiments, this can include feeding at least one of the additional data 2110, 2120, 2130 and 1950 of the given participant, and in some embodiments, data 2140 (common to all participants of the session).

[00238] The method further includes (operation 2020) outputting data informative of one or more social and/or emotional performance indicators of the given participant in the session. As mentioned above, in some embodiments, each machine learning module 140 can be trained to determine data informative of a different social and/or emotional performance indicator, and therefore, in some embodiments, if N machine learning modules have been fed with Dbehavioral expr., data informative of N different social and/or emotional performance indicators of the given participant in the session is obtained.
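
A short usage sketch of this prediction phase is given below, assuming trained indicator modules such as those of the earlier sketches and one flattened feature vector per participant; the indicator names are hypothetical examples.

```python
import torch

def score_participant(trained_modules: dict, participant_features: torch.Tensor) -> dict:
    """trained_modules: {"indicator name": trained indicator module};
    returns one score per social/emotional performance indicator for the participant."""
    scores = {}
    with torch.no_grad():
        for name, module in trained_modules.items():
            module.eval()
            scores[name] = float(module(participant_features).squeeze())
    return scores

# Hypothetical call, assuming two trained modules m1 and m2 and a 128-dim feature vector:
# scores = score_participant({"Explains clearly": m1, "Actively listens": m2}, torch.rand(128))
```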

[00239] According to some embodiments, data informative of social and/or emotional performance indicators output by the system can be used for the assessment of social-emotional skills. This can help the given participant to improve his social-emotional skills and/or a third party to better assess social skills of this given participant.

[00240] According to some embodiments, data informative of social and/or emotional performance indicators output by the system can be used to generate a feedback for a person in contact with the public, such as a sales person.

[00241] According to some embodiments, data informative of social and/or emotional performance indicators output by the system can be used for diagnosis of a behavioral disorder.

[00242] According to some embodiments, data informative of social and/or emotional performance indicators output by the system can be used for group therapy.

[00243] These examples are not limitative and various other applications can use the system as described above.

[00244] It is to be understood that the invention is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings.

[00245] It will also be understood that the system according to the invention may be, at least partly, implemented on a suitably programmed computer. Likewise, the invention contemplates a computer program being readable by a computer for executing the method of the invention. The invention further contemplates a non-transitory computer-readable memory tangibly embodying a program of instructions executable by the computer for executing the method of the invention.

[00246] The invention is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.

[00247] Those skilled in the art will readily appreciate that various modifications and changes can be applied to the embodiments of the invention as hereinbefore described without departing from its scope, defined in and by the appended claims.