Title:
METHOD AND SYSTEM FOR TRACKING, ANALYZING AND REACTING TO USER BEHAVIOUR IN DIGITAL AND PHYSICAL SPACES
Document Type and Number:
WIPO Patent Application WO/2020/222157
Kind Code:
A1
Abstract:
The computer-implemented method for tracking, analyzing and reacting to user behavior in digital and physical spaces comprises: a step (49) of acquiring an image using an image acquisition device (19); a step (51) of detecting the face of a user within the acquired image; a step (55) of determining a facial expression from the detected face and, from the facial expression, a relevant and predefined category of emotions; a step (54) of gaze tracking of the user from the detected face; a step (56) of detecting the gender and age of the user from the detected face; a step (60) of activation of a reaction by means of active interaction means (36, 37) with the user, wherein the reactions are determined on the basis of data about the characteristics and behavior of the user comprising the combination of the determined facial expression, the gaze detection and the detected gender and age.

Inventors:
MENGONI MAURA (IT)
GENEROSI ANDREA (IT)
Application Number:
PCT/IB2020/054078
Publication Date:
November 05, 2020
Filing Date:
April 30, 2020
Assignee:
EMOJ S R L (IT)
MENGONI MAURA (IT)
GENEROSI ANDREA (IT)
GIRALDI LUCA (IT)
International Classes:
G06K9/00; G06K9/46; G06K9/62
Foreign References:
US20170330029A1 (2017-11-16)
US20190034706A1 (2019-01-31)
Other References:
FERNÁNDEZ-CABALLERO ANTONIO ET AL: "Smart environment architecture for emotion detection and regulation", JOURNAL OF BIOMEDICAL INFORMATICS, ACADEMIC PRESS, NEW YORK, NY, US, vol. 64, 30 September 2016 (2016-09-30), pages 55 - 73, XP029835860, ISSN: 1532-0464, DOI: 10.1016/J.JBI.2016.09.015
Attorney, Agent or Firm:
GRANA, Daniele (IT)
Claims:
CLAIMS

1) Computer-implemented method for tracking, analyzing and reacting to user behavior in digital and physical spaces, comprising:

- at least one step (49) of acquiring at least one image using at least one image acquisition device (19);

- at least one step (51) of detecting the face of at least one user within said at least one acquired image;

characterized by the fact that it comprises the combination of the following steps:

- at least one step (55) of determining at least one facial expression from said detected face and, from said facial expression, at least one relevant and predefined category of emotions;

- at least one step (54) of gaze tracking of said user from said detected face;

- at least one step (56) of detecting the gender and age of said user from said detected face;

and by the fact that it comprises at least one step (60) of activation of at least one reaction by means of active interaction means (36, 37) with said at least one user, said reactions being determined on the basis of data about the characteristics and behavior of said user comprising the combination of at least one determined facial expression, said gaze detection and said detected gender and age.

2) Computer-implemented method according to claim 1, characterized by the fact that it comprises:

- at least one step (58) of comparing said data on said user’s characteristics and behavior with at least a predefined threshold value;

- at least one step (59) of determination of at least one reaction according to said comparison.

3) Computer-implemented method according to one or more of the preceding claims, characterized by the fact that said step (49) of acquiring at least one image comprises at least the following steps:

- acquisition of at least one video track comprising a plurality of images;

- extrapolation of a plurality of individual images from said video track (step 50).

4) Computer-implemented method according to one or more of the preceding claims, characterized by the fact that said step (51) of detecting the face comprises at least one step (53) of pre-processing of said at least one image for the editing of said image in a predefined format.

5) Computer-implemented method according to one or more of the preceding claims, characterized by the fact that said step (55) of determining at least one facial expression from said detected face comprises the use of a convolutional neural network trained to receive said face at input and to return at least one expression and at least one respective recognized category of emotion at output.

6) Computer-implemented method according to one or more of the preceding claims, characterized by the fact that said step of gaze detection comprises at least one of the following: eye position detection, pupil position analysis, detection of the speed of eye movement.

7) System for tracking, analyzing and reacting to user behavior in digital and physical spaces, comprising means for the performance of the method steps according to one or more of the preceding claims, characterized by the fact that it comprises:

- at least one image acquisition device (19) to acquire at least one image;

- at least one facial recognition module (25) to perform said step of facial recognition of at least one user’s face within said at least one acquired image;

- at least one emotion tracking module (26) to perform said step of determining at least one facial expression from said detected face and, from said facial expression, said at least one category of user’s emotions;

- at least one gaze tracking module (27) to perform said step of gaze tracking of said user from said detected face;

- at least one gender and age tracking module (28) to perform said step of gender and age detection of said user from said detected face;

- active interaction means (36, 37) with said user, and

- at least one management module (30, 63) of the control logic of said active interaction means (36, 37), operationally connected at least to said emotion tracking module (26), said gaze tracking module (27) and said gender and age tracking module (28) and configured to perform said step of activation of at least one reaction through said active interaction means (36, 37).

8) System according to claim 7, characterized by the fact that said at least one image acquisition device (19) is selected out of the following: cameras, video camera, webcam, IP camera, RGBD camera, video surveillance camera.

9) System according to one or more of claims 7 and 8, characterized by the fact that it comprises at least one sensor (23, 24) selected out of the following: an antenna (23), a microphone (24).

10) System according to one or more of claims 7 to 9, characterized by the fact that said facial recognition module (25) comprises a first model configured for the recognition of at least one distant face within said at least one acquired image and a second model configured for the recognition of at least one nearby face within said at least one acquired image, said first and said second model being also configured to return the coordinates of at least one frame within the image containing said distant face or said nearby face.

11) System according to claim 10, characterized by the fact that said facial recognition module (25) comprises image pre-processing means configured to crop all faces from said at least one acquired image based on the coordinates provided by said first and by said second model, resize said images of the cropped faces and convert said images of the resized faces into a predefined format.

12) System according to one or more of claims 7 to 11, characterized by the fact that said emotion tracking module (26) comprises a convolutional neural network trained to receive at input the image of said face and to return at output a percentage of probability that said face can be associated with the following predefined categories of emotions: neutrality, happiness, surprise, sadness, anger, disgust, fear and contempt; said percentage of probability being associated with a respective value of intensity.

13) System according to claim 12, characterized by the fact that said emotion tracking module (26) is configured to calculate, through a weighted average of the values associated with the individual categories of detected emotions, an overall value indicative of the positivity or negativity of the emotional experience of said user portrayed in said facial image.

14) System according to one or more of claims 7 to 13, characterized by the fact that said gaze tracking module (27) comprises a convolutional neural network trained to receive at least one facial image at input and to return coordinates at output with respect to a plane observed by said user and a level of attention with respect to what said user is observing.

15) System according to one or more of claims 7 to 14, characterized by the fact that said gender and age tracking module (28) comprises a convolutional neural network trained to receive at input at least one image of said face and to return the gender and precise age of said user at output.

16) System according to one or more of claims 7 to 15, characterized by the fact that it comprises at least one of the following: a position tracking module (32) of said user within a physical space; a movements tracking module (31) of said user; a heat emission analysis module (29) of said user.

17) System according to one or more of claims 7 to 16, characterized by the fact that it comprises at least one central server (22) comprising at least one of the following: said emotion tracking module (26); said gaze tracking module (27); said gender and age tracking module (28); said position tracking module (32); said movements tracking module (31); said heat emission analysis module (29).

18) System according to claim 17, characterized by the fact that it comprises at least one computer (20) operationally connected at least to said image acquisition device (19) and to said central server (22) and configured to carry out an editing of the collected images and to send said edited images to said central server (22).

19) System according to one or more of claims 7 to 18, characterized by the fact that it comprises at least one database (33, 46) for the storage of said data determined by at least one of the following: said emotion tracking module (26); said gaze tracking module (27); said gender and age tracking module (28); said position tracking module (32); said movements tracking module (31); said heat emission analysis module (29).

20) System according to claim 19, characterized by the fact that it comprises a Web Analytics platform (34, 35, 47, 48) operationally connected to said database (33, 46).

21) System according to one or more of claims 7 to 20, characterized by the fact that said active interaction means (36, 37) comprise at least one electronic interaction device (36) placed within the environment where the user is located.

22) System according to claim 21, characterized by the fact that said electronic interaction device (36) comprises at least one of the following: LED lights (362), video player (363) and relevant monitor/projector (364), audio speaker (365), perfume diffuser (366), computer (367), headlights, interfaces for video messages, audio messages, text messages (alerts, sms, notifications), sets.

Description:
METHOD AND SYSTEM FOR TRACKING, ANALYZING AND REACTING TO USER BEHAVIOUR IN DIGITAL AND PHYSICAL SPACES

Technical Field

The present invention relates to a computer-implemented method and related system for tracking, analyzing and reacting to user behavior in digital and physical spaces, in particular through the use of Deep Learning technologies.

In particular, the present invention finds application in the field of marketing, Customer Experience, User Experience, market analysis, automation, information technology in general and relates to a system based on deep learning technologies for tracking, analyzing and reacting to user behavior in digital and physical spaces.

Background Art

In recent years there has been much talk about artificial intelligence. Deep learning is a research field that derives from machine learning and teaches computers to perform an activity that is natural for human beings, namely “learning by example”.

The technologies that make use of deep learning are, for example, fundamental to self-driving cars: they allow the car to recognize a road sign and detect a pedestrian or other obstacles on the road. They are also the basis of voice controls on devices such as mobile phones, televisions, navigators, etc.

In deep learning a computer model learns to perform classification tasks directly from images, text or sound. Model training is performed by using a large set of labeled data and neural network architectures containing multiple layers. While conventional neural networks contain only 2 or 3 hidden layers, deep networks can contain up to 150 layers and are able to learn features directly from the data without the need to extract them manually.

These technologies have also entered the world of marketing as users are becoming accustomed to increasingly smart digital systems able to anticipate their needs and cancel out the distance and need for complex interactions to get a response. Several brands are moving towards the “hyper customization” of the user experience thanks to smart “sites” and “apps” able to react to users, their choices, their tastes, and the context.

Dynamic campaign optimization and budget management automation is now possible thanks to expert systems which are able to process more complex reports faster than humans can. As a result, companies can have access to in-depth knowledge of customer behavior before, during and after the purchase, enabling them to create a highly efficient form of omni-channel customer engagement. This creates a “perception-reasoning-action” cycle typical of cognitive science which, in marketing, becomes “collection-reasoning-action”: in the first phase, data from customers or potential customers are collected and acquired; in the second phase, data are transformed into intelligence or intuition (use of deep learning systems); and in the third phase, the results are applied through reasoning phases, creating campaigns or customized solutions with a higher probability of persuading the users.

Currently many multinational companies, among which Amazon, Zara, etc., are operating in the sector in order to use techniques adapted to meet the individual requests and the various tastes of users, mainly focusing on the recurring choices that hypothetical customers make, the most viewed and purchased products, etc. These actions are therefore monitored digitally starting from the websites of the various companies.

Known systems, however, do not allow the collection of information on user behavior in a comprehensive and effective way and, furthermore, they do not allow for appropriate interactions with the users.

Description of the Invention

The main aim of the present invention is to devise a computer-implemented method and related system for tracking, analyzing and reacting to user behavior in digital and physical spaces that allows collecting information relating to the intrinsic characteristics and behavior of users in a complete and effective way and, at the same time, that allows this information to be used to implement appropriate interactions with the users, so as to effectively guide them.

Another object of the present invention is to devise a computer- implemented method and related system for tracking, analyzing and reacting to user behavior in digital and physical spaces that allows for more accurate user profiling.

The aforementioned objects are achieved by the present computer-implemented method for tracking, analyzing and reacting to user behavior in digital and physical spaces according to the characteristics described in claim 1.

The aforementioned objects are achieved by the present system for tracking, analyzing and reacting to user behavior in digital and physical spaces according to the characteristics described in claim 7.

Brief Description of the Drawings

Other characteristics and advantages of the present invention will be more evident from the description of a preferred, but not exclusive, embodiment of a computer-implemented method and related system for tracking, analyzing and reacting to user behavior in digital and physical spaces, illustrated by way of an indicative, but non-limiting example, in the attached tables of drawings in which:

Figures 1a and 1b illustrate as an example a possible application to a generic physical space of the method and system according to the invention;

Figures 2a and 2b illustrate as an example a possible application to a generic digital space of the method and system according to the invention;

Figure 3 is a general block diagram illustrating the system according to the invention applied to a generic physical space;

Figure 4 is a general block diagram illustrating the system according to the invention applied to a generic digital space;

Figure 5 is a general block diagram illustrating the method according to the invention.

Embodiments of the Invention

The method and the system according to the invention allow studying the intrinsic characteristics of one or more users and their behavior during the interaction with a physical and/or digital space through the use of cameras and possibly other sensors of various types connected with specific software for data processing. All information, analyzed data and obtained results are used to implement solutions and actions involving the user, or are stored (except for photos of the user, to avoid privacy issues) in order to serve as a database of information for marketing or other actions.

According to a first possible application of the method and the system according to the invention, schematically illustrated in Figures 1a and 1b, the space where the analysis is carried out can be a “physical space” for retail activities such as a shop, a department store, an exhibition area in general, a showroom, a trade fair, a home space or a space for private use (in this context the analysis can take place for the user’s entertainment, with the aim, e.g., of modifying in real time the lighting and musical set of the environment according to one’s “mood”, or for medical purposes, such as e.g. monitoring the emotional state of a person with special disorders), but also a physical space for socio-cultural activities such as a museum, an exhibition, a special tourist area, a hotel or the like.

Within these areas R, attended by natural persons 1, the system according to the invention comprises cameras and/or sensors 2 positioned wherever it is considered most appropriate so that it can monitor and follow the persons 1 who are near the exhibition areas 3.

According to a further possible application of the method and of the system according to the invention, schematically illustrated in Figures 2a and 2b, the “space” within which the analysis is carried out can be digital, i.e. created through the use of web platforms, mobile applications on devices such as tablets 5, mobile phones and laptops, or digital signage applications, currently widely used as a form of communication in the vicinity of stores and open or closed public spaces, including the private or home space D, where the content is shown to users through digital screens 4.

With reference to this possible application, for the detection of emotions the system according to the invention comprises the video cameras 7 already present on portable devices, or webcams or cameras or sensors of various kinds applied directly to electronic supports 6.

With reference to any “analysis space” situation, the primary object of the method and system according to the invention is to study the intrinsic characteristics of the users and their behaviors by extrapolating facial expressions, body positions, heat emission, recurring movements and other physical, character or behavioral aspects.

This information makes it possible to extrapolate and detect, among other things, specific categories of emotions of “Joy”, “Surprise”, “Disgust”, “Anger”, “Sadness”, “Fear”, “Neutrality” and “Contempt” (preferably based on the research of the US psychologist Paul Ekman).

Advantageously, the method and the system according to the invention, unlike other existing methods, provide for the use of at least three different modes of analysis that, working simultaneously and synergistically, allow a thorough analysis of behavior and emotions.

The principles described below are applicable to both digital and physical “spaces” as specified above.

The computer-implemented method for tracking, analyzing and reacting to user behavior in digital and physical spaces comprises the following steps:

- at least one step 49 of acquiring at least one image using at least one image acquisition device 19;

- at least one step 51 of detecting the face of at least one user within said at least one acquired image (Face Detection).

Therefore, the method according to the invention provides a first phase of detection of the face of one or more users within the images acquired through the use of cameras, video cameras, webcams, IP cameras, RGBD cameras or video surveillance cameras or other types of sensors suitable for the acquisition of images.

According to a possible and preferred embodiment, the step 49 of acquiring at least one image comprises at least the following steps:

- acquisition of at least one video track comprising a plurality of images;

- extrapolation of a plurality of individual images from the video track (step 50).

In addition, still according to a possible and preferred embodiment, the method comprises at least one step 53 of pre-processing of the at least one acquired image for editing such an image in a predefined format.
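
Purely by way of non-limiting illustration, the acquisition of a video track and the extrapolation of individual frames (steps 49 and 50) could be sketched in Python as follows. The sketch assumes the OpenCV library, and the sampling interval is a hypothetical parameter that the present description does not specify:

# Illustrative sketch of step 50 (frame extrapolation), assuming OpenCV.
import cv2

def extract_frames(video_path, every_n_frames=10):
    """Yield one image every n frames from a video track."""
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:                      # end of the video track
            break
        if index % every_n_frames == 0:
            yield frame                 # a single image as a BGR array
        index += 1
    capture.release()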

Usefully, the step 51 of detecting the face can use the optimal Face Detection model depending on the context in which the system is applied. In particular, with reference to the application to a physical space, it is required to recognize faces in environmental conditions that are not always optimal (low light, occlusions, etc.), in motion, and often at distances that can reach up to 10 meters. In these contexts it is possible to use heavy models, which make massive use of computational resources (and therefore often require processing on GPUs rather than CPUs). With reference, instead, to the application to a digital space, most of the time it is possible to work with faces positioned frontally in front of the cameras. At the same time, however, for privacy reasons, processing can be delegated to the client side of the platform/app, i.e. to a smartphone that does not always have sufficient resources. In these contexts, the selected model is as light as possible.

In particular, according to this preferred embodiment, this step 53 of pre-processing comprises the straightening of the detected face(s) and the cropping of the face(s), in order to give the cropped images of the recognized faces the same format as those used for the training of the convolutional neural networks used in the following steps (described below).
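
Purely as an illustration, the cropping and format editing just described could look like the following sketch, which assumes OpenCV and NumPy; the 48x48 grayscale target format is an assumption chosen as a common input size for expression-recognition networks, since the description only requires that crops match the format used to train the CNNs:

# Illustrative sketch of step 53 (pre-processing), assuming OpenCV/NumPy.
# Face straightening (rotation alignment) is omitted for brevity.
import cv2
import numpy as np

def preprocess_face(image, box, size=(48, 48)):
    """Crop a detected face and edit it into a fixed training format."""
    x, y, w, h = box                        # frame coordinates from Face Detection
    face = image[y:y + h, x:x + w]          # crop of the recognized face
    face = cv2.resize(face, size)           # resize to the network input size
    face = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    return face.astype(np.float32) / 255.0  # normalized single-channel image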

It should be noted that the aforementioned steps are applicable both to the digital and to the physical scenario and consist in the acquisition by the cameras of the video tracks, which are subsequently sent to the main server frame by frame or, in the case of physical spaces, directly to the computer connected to the camera(s), which deals with the transformation of the video into individual images and the pre-processing of the single frames: straightening of the face and cropping, or “cut and resize of the image”, by enlarging the detail of the face itself. In actual fact the procedure is face recognition (suitable to be then analyzed by software), technically called “Face Detection”.

Advantageously, the method according to the invention also comprises the combination of the following steps:

- at least one step 55 of determining at least one facial expression (Facial Coding) from the detected face and, from that facial expression, at least one relevant and predefined category of emotions;

- at least one step 54 of eye tracking and subsequent gaze detection from the detected face;

- at least one step 56 of detecting the gender and age of the user from the detected face.

In addition, the method according to the invention comprises at least one step 60 of activation of at least one reaction by active interaction means 36, 37 with the user, wherein these reactions are determined on the basis of data on the characteristics and behavior of the user comprising the combination of the determined facial expression, the gaze detection and the detected gender and age.

In particular, the step 55 of determining at least one facial expression from the detected face comprises the use of a Convolutional Neural Network (CNN) trained to receive the face at input and to return at least one expression and, therefore, at least one recognized category of emotion at output.

In particular, this expression/emotion is recognized by the convolutional neural network through a training/prediction process typical in general of Machine Learning and therefore of Deep Learning.
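
By way of non-limiting illustration, the kind of convolutional neural network described for step 55 could be sketched in PyTorch as below. The layer sizes and the 48x48 single-channel input are assumptions; the description specifies only that a face image enters and eight emotion categories come out:

# Minimal sketch of an emotion-classification CNN (step 55), in PyTorch.
# Architecture details are assumptions, not taken from the patent.
import torch
import torch.nn as nn

EMOTIONS = ["neutrality", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear", "contempt"]

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
        )
        self.classifier = nn.Linear(64 * 12 * 12, n_classes)

    def forward(self, x):              # x: (batch, 1, 48, 48)
        return self.classifier(self.features(x).flatten(1))

# Softmax turns the logits into the per-category probabilities described.
model = EmotionCNN().eval()
with torch.no_grad():
    probs = torch.softmax(model(torch.randn(1, 1, 48, 48)), dim=1)
print(dict(zip(EMOTIONS, probs[0].tolist())))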

According to a preferred embodiment, the step 54 of gaze tracking comprises at least one of the following: eye position detection, pupil position analysis (pupillometry), detection of the eye movement speed.

Therefore, the user’s gaze detection allows collecting other data detected by the same cameras or sensors, data that complements the collected data on the user’s facial expressions/emotions.

Gaze detection is a technique that allows studying eye movements and pupil position in order to investigate the user’s attention to the vision of an object. This way it is possible to detect a person’s attention to one or more objects, the way the information received is treated and the behavior adopted by the user, for example, while exploring a web page.
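
One component of step 54, the pupil position analysis, could be approximated as in the following sketch, which assumes OpenCV and locates the pupil as the darkest blob within an eye crop; the threshold value is an empirical guess, and the description does not disclose how the gaze analysis is actually implemented:

# Illustrative pupil-center detection from an eye crop, assuming OpenCV.
import cv2

def pupil_center(eye_region):
    """Return the (x, y) pupil position within a BGR eye crop, or None."""
    gray = cv2.cvtColor(eye_region, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (7, 7), 0)
    # The pupil is assumed to be the darkest blob; 40 is an empirical guess.
    _, mask = cv2.threshold(blurred, 40, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    pupil = max(contours, key=cv2.contourArea)      # largest dark region
    m = cv2.moments(pupil)
    if m["m00"] == 0:
        return None
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])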

Therefore, the method and the system according to the invention allow the detection, always through the same video cameras or the like, of the gender and age of the person since, in the field of marketing and market analysis, it is necessary to consider men and women not as a “single homogeneous group” but as “two distinct groups”. This is because marketing strategies must meet people’s needs, interests and preferences, and to do this it is essential to diversify the various targets in order to be able to make ad hoc and very targeted choices for the person. This strategy is also defined as “Gender Marketing” and is based on the fact that gender differences are fundamental: women have more strategic “buying” behaviors while men are inclined to impulsive purchase; women tend to consider other people’s opinions to make a decision while men consider other people’s decisions as a guide to “form” their own opinions.

Differentiation with respect to age groups is also very important in Strategic Marketing: until now a single target called “Millennials” was considered, which comprised those born between the late 70s and 2000, but it has been realized, especially after the crisis of 2001, that it was necessary to divide this category into subcategories because the interests and life experiences were too different. It is important to note that the generation following that of the “Millennials”, defined as “Generation I” (networks generation), is the generation of young people who are deep connoisseurs and users of computer systems, and who must be addressed with market strategies and means of communication completely different from those of the other categories.

It is important to point out that the analyses carried out on users do not use invasive technologies such as electroencephalographic monitoring, pressure measurement or heartbeat monitoring, but only non-invasive technologies with the use of cameras or similar sensors; the very sophisticated software allows the detailed characterization of the data received.

With reference to the definition of possible reactions, according to a preferred embodiment, the computer-implemented method also comprises the following steps:

- at least one step (58) of comparing data on the user’s characteristics and behavior with at least a predefined threshold value;

- at least one step (59) of determination of at least one reaction according to the comparison made.

In particular, the threshold values are set according to the context in which the reactions are applied.

For example, a possible application is one that involves the activation of light and music scenarios depending on the emotions detected by the system, so as to create an interactive environment, home or not, that adapts to the characteristics and mood of the user. In this scenario, taking as reference Russell’s model based on the concepts of Valence and Arousal, taken as the axes of a hypothetical Cartesian plane, the various musical genres and the different songs that belong to them are categorized by associating them with Ekman’s six emotions. To do this, the Valence and Arousal values associated with the different songs that make up the dataset are taken as reference and, using the centroids of the points grouped in different areas through a Machine Learning algorithm such as K-Means, it is possible to categorize the various genres and the various songs by associating them with Ekman’s emotions.

In a retail context (clothing stores in particular), another methodology adopted is to associate the characteristics of the user (gender and age), the “mood” felt (valence and engagement) and the tracked behavior, based on the movements and interactions with the sales assistants, to define a model, based on a dataset of previously collected behavioral data and on the application of a Bayesian network, able to set in real time the best sales strategy to be used with the customer and, retroactively, use the tracked emotions to validate (or not) the approach taken, always in real time.

In general, the data collected through the system allow, using models commonly used in the Machine Learning world (such as the aforementioned Bayesian networks and K-Means), to predict the user’s taste and liking in real time, acting and using emotions again retroactively to validate the correct activation of the relevant reaction (whether it is a song, a light or a simple warning to a human operator). In this perspective, the activated reaction will be considered correct if the consequent mood of the user, following the activation, is positive, and incorrect if it is negative or neutral.
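
The song categorization in the valence/arousal plane could be sketched as follows with scikit-learn. The data points and the centroid-to-emotion rule are invented placeholders; the description states only that K-Means centroids are associated with Ekman’s emotions:

# Illustrative K-Means categorization of songs, assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dataset: one (valence, arousal) pair per song.
songs = np.array([
    [0.8, 0.7], [0.7, 0.9],      # energetic, positive
    [-0.6, 0.8], [-0.7, 0.6],    # agitated, negative
    [-0.5, -0.6], [-0.8, -0.4],  # subdued, negative
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(songs)

def emotion_for_centroid(valence, arousal):
    # Crude placeholder rule mapping quadrants of the valence/arousal
    # plane to Ekman categories; the actual association is not disclosed.
    if valence >= 0:
        return "happiness" if arousal >= 0 else "neutrality"
    return "anger" if arousal >= 0 else "sadness"

cluster_emotion = {i: emotion_for_centroid(v, a)
                   for i, (v, a) in enumerate(kmeans.cluster_centers_)}

new_song = np.array([[0.75, 0.8]])    # a song to categorize
label = int(kmeans.predict(new_song)[0])
print("song categorized as:", cluster_emotion[label])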

Once the detected values have exceeded the defined threshold values, depending on the application scenario, actuators connected to devices such as perfume diffusers, lights, etc. will be activated via IoT and home automation protocols, or text messages such as alerts will be sent using the HTTP protocol, or media players or speakers will be managed using UDP and TCP communication protocols.

Conveniently, if a negative valence threshold value is detected after the activation of the reaction, e.g. valence < -30, a threshold defined empirically on the basis of tests carried out (step 64), the method comprises a step 65 of modification of the reaction in real time.
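
Steps 58 to 60, together with the real-time correction of steps 64 and 65, could be sketched as below. The reaction table and the data keys are hypothetical; the valence < -30 threshold is the empirical value reported above:

# Illustrative threshold comparison and reaction supervision.
# Placeholder rules: (tracked value, threshold, reaction to activate).
REACTION_RULES = [
    ("surprise", 0.8, "soft_red_light_scene"),
    ("happiness", 0.6, "upbeat_music"),
]

def determine_reaction(data):
    """Step 59: pick a reaction whose threshold is exceeded (step 58)."""
    for key, threshold, reaction in REACTION_RULES:
        if data.get(key, 0.0) > threshold:
            return reaction
    return None

def supervise_reaction(measure, activate, fallback):
    """Steps 58-60, plus the real-time correction of steps 64-65."""
    reaction = determine_reaction(measure())
    if reaction:
        activate(reaction)                     # step 60
        if measure().get("valence", 0) < -30:  # step 64: empirical threshold
            activate(fallback)                 # step 65: modify the reaction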

A possible embodiment of the method according to the invention for the activation of the reactions (for both the digital and the physical context), starting from the acquisition of videos/photos of one or more users, is schematically shown in the flowchart illustrated in Figure 5 and is summarized below.

Following the video/image acquisition (step 49) and the extrapolation of the individual frames in the case of video (step 50), a “Face Detection” model is applied in order to detect all the faces inside the image (step 51).

If no face is recognized in the frame in question, it is discarded (steps 52, 61); otherwise it is pre-processed (step 53) for the cropping of the face(s), in order to give the cropped images of the recognized faces the same format as those used to train the neural networks used in the following steps.

These cropped images are then fed to the neural networks used to recognize the gaze (step 54), the emotions (step 55), and the gender and age (step 56), which return at output the data on the characteristics and behavior of the filmed users. These data are saved in memory (step 57) and compared in real time with predefined thresholds (step 58), in order to obtain the reaction that best suits the recognized characteristics (step 59) and activate it (step 60).

This process can be supported by the user’s earlier selection choices (step 62), previously saved in memory, which allow providing a more complete picture of their profile.
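
As a non-limiting sketch, the flow of Figure 5 could be orchestrated as follows; every function passed in is a stand-in for the modules described elsewhere in this description:

# Illustrative orchestration of the Figure 5 flow for one extracted frame.
def process_frame(frame, detect_faces, preprocess, networks,
                  choose_reaction, store, activate):
    boxes = detect_faces(frame)               # step 51: Face Detection
    if not boxes:                             # steps 52, 61: discard the frame
        return
    for box in boxes:
        face = preprocess(frame, box)         # step 53: crop and resize
        data = {
            "gaze": networks["gaze"](face),               # step 54
            "emotions": networks["emotion"](face),        # step 55
            "gender_age": networks["gender_age"](face),   # step 56
        }
        store(data)                           # step 57: save in memory
        reaction = choose_reaction(data)      # steps 58-59: threshold comparison
        if reaction:
            activate(reaction)                # step 60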

Figure 3 schematically shows a representation of a possible hardware-software architecture of the system according to the invention, relating to a physical space suitable for the management of the envisaged cases.

The system for tracking, analyzing and reacting to user behavior in digital and physical spaces, in particular through the use of Deep Learning technologies, comprises means for the performance of the steps of the method described above.

In particular, the system comprises:

- at least one image acquisition device 19 to acquire at least one image;

- at least one facial recognition module 25 to perform the step 51 of detecting the face of at least one user within the at least one acquired image;

- at least one emotion tracking module 26 to perform the step 55 of determining at least one facial expression (Facial Coding) from the detected face and, from this facial expression, at least one category of the user’s emotions;

- at least one gaze tracking module 27 to perform the step 54 of gaze detection of the user from the detected face;

- at least one gender and age tracking module 28 to perform the step 56 of detecting the user’s gender and age from the detected face.

In addition, the system comprises:

- active interaction means 36, 37 with the user, and

- at least one management module 30, 63 of the control logic of the active interaction means 36, 37, operationally connected at least to the emotion tracking module 26, the gaze tracking module 27 and the gender and age tracking module 28.

The management module 30, 63 is configured to perform the step 60 of activation of at least one reaction using the active interaction means 36, 37.

In particular, the image acquisition device 19 is selected out of the following: cameras, video camera, webcam, IP camera, RGBD camera, video surveillance camera or other types of sensors suitable for image capturing.

In addition, the system can comprise at least one sensor 23, 24 selected out of the following: an antenna (23) (RFID or BEACON type), a microphone (24), a thermal imaging camera.

This allows, in a physical space, to “locate” and talk with the person(s) through RFID technologies, which use airborne electromagnetic waves for the automatic, massive and remote detection of people and objects, and through technologies such as iBeacon or the like. The latter technology allows, in a physical space, using a special Bluetooth emitter device, to interact with the smartphones of customers entering the physical environment; this allows sending news, video messages and customized promotions to users, which is very important in the retail sector.

According to a possible and preferred embodiment, the facial recognition module 25 comprises a first model configured for the recognition of at least one distant face within the acquired image and a second model configured for the recognition of at least one nearby face within the acquired image.

The first and second models are also configured to return the coordinates of at least one frame within the image containing the distant face or the nearby face.

In addition, the facial recognition module 25 comprises image pre-processing means configured to crop all faces from the acquired image based on the coordinates provided by the first and by the second model, resize the images of the cropped faces and convert the images of the resized faces into a predefined format, compatible with the format with which the convolutional neural networks that make up the subsequent processing modules have been trained.

The emotion tracking module 26 comprises a convolutional neural network (CNN) trained to receive at input the image of the face (previously recognized and processed) and to return at output a percentage of probability that this face can be associated with the following predefined categories of emotions: neutrality, happiness, surprise, sadness, anger, disgust, fear and contempt.

Each percentage of determined probability is also associated with a respective value of intensity of the emotion felt by the person in the frame.

The emotion tracking module 26 is configured to calculate, through a weighted average of the values associated with the individual categories of detected emotions, an overall value, preferably ranging from -100 to 100, indicative of the positivity or negativity of the emotional experience of the user portrayed in the facial image.

The gaze tracking module 27 comprises a convolutional neural network trained to receive at least one facial image at input and to return coordinates (preferably in cm) at output with respect to a plane observed by the user and a level of attention with respect to what the user is observing.

The gender and age tracking module 28 comprises a convolutional neural network trained to receive at least one facial image at input and to return the user’s gender and precise age (from 0 to 101) at output.

In addition, the system can comprise at least one of the following: a user’s position tracking module 32 within a physical space; a user’s movements tracking module 31; a user’s heat emission analysis module 29.

Again according to a possible and preferred embodiment, shown in Figure 3, the system comprises a central server 22 within which at least one of the following is implemented: the emotion tracking module 26, the gaze tracking module 27, the gender and age tracking module 28; the position tracking module 32; the movements tracking module 31; the heat emission analysis module 29.

In addition, the system can comprise at least one computer 20 operationally connected at least to the image acquisition device 19 and to the central server 22 and configured to pre-process the collected images and to send the edited images to the central server 22.

In addition, the system comprises at least one database 33, 46 for the storage of data determined by at least one of the following: emotion tracking module 26, gaze tracking module 27, gender and age tracking module 28; position tracking module 32; movements tracking module 31; heat emission analysis module 29.

Conveniently, the system can comprise a Web Analytics platform 34, 35, 47, 48 operationally connected to the database 33, 46.

The active interaction means 36, 37 comprise at least one electronic interaction device 36 placed within the user’s environment. In particular, the electronic interaction devices 36 that can be used comprise at least one of the following: LED lights 362 and respective control driver 361, video player 363 and relevant monitor/projector 364, audio speaker 365, perfume diffuser 366, computer 367, headlights, interfaces for video messages, audio messages, text messages (alerts, sms, notifications), sets.

The operation of the system shown in Figure 3 is described below.

The information coming from cameras 19 or sensors (antenna 23 and/or microphone 24) is sent either directly to the central server 22, for further processing, or it first passes through other computers 20 or similar processors that perform the first pre-processing tasks and then send the data to the central server 22.

This passage, carried out using standard data transmission technologies such as USB or Ethernet and specific editing software, makes it possible to streamline and increase data processing speed in order to make the system more reactive and immediate.

Once the data has been sent to the central server 22, the software implemented within the server itself makes it possible to monitor all the user’s interactions in the physical space by means of data coming from the cameras 19 and the sensors 23, 24.

Using the HTTPS protocol or another communication methodology, the images, sounds and all other actions performed by the user and detected by the system are sent to the central server 22, then analyzed and decoded by the software to obtain the necessary data, including the image formats (jpeg or the like), which are then stored in the physical memory of the server itself.

Each detected image or data item is then divided and sent to the “Tracker modules”, each of which performs a different processing function. Each module is specialized in a different type of analysis and “prediction”.

The facial recognition module 25 is the module that manages the data received from the cameras 19, converted to digital, and that through a first model configured for the recognition of distant faces and a second model configured for the recognition of nearby faces, has the task of recognizing all human faces present in one of the images that make up the video stream received at input. The first and the second model return the coordinates of a frame within the image which, according to their prediction, would contain a human face.

In the next step, the facial recognition module 25 pre-processes the image. In particular, the facial recognition module 25 crops out all the faces from the original image based on the coordinates provided by the first model and the second model, resizes and converts the images of the faces thus obtained into the same format with which the convolutional neural networks that make up the subsequent modules have been trained.

The emotion tracking module 26 consists of a convolutional neural network that receives at input the image of a face previously recognized and processed and that returns at output the percentage of probability that such a face belongs to the categories of neutrality, happiness, surprise, sadness, anger, disgust, fear and contempt together with the intensity of the emotion felt by the person in the frame.

The emotion tracking module 26 also has the task of calculating, through a weighted average of the values associated with the individual detected emotions, the “valence”, i.e. an indicator ranging from -100 to 100 of the positivity or negativity of the emotional experience felt by the person portrayed in the image. In this calculation Happiness and Surprise are considered as positive emotions, while all the others as negative emotions.
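
A minimal sketch of this calculation is given below. The description speaks of a weighted average that also uses the detected intensity values; the sketch assumes equal weights, which is a simplification:

# Illustrative valence computation; equal weights are an assumption.
POSITIVE = {"happiness", "surprise"}   # per the description above

def valence(probabilities):
    """Signed sum of per-emotion probabilities, scaled to [-100, 100]."""
    score = sum(p if emotion in POSITIVE else -p
                for emotion, p in probabilities.items())
    return max(-100.0, min(100.0, 100.0 * score))

print(valence({"happiness": 0.7, "surprise": 0.1, "sadness": 0.2}))  # 60.0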

The gaze tracking module 27 consists of a convolutional neural network that, again starting from the image of a face cropped by the facial recognition module 25, predicts the coordinates with respect to the plane observed by a person (the screen of a PC or of a smartphone in the digital case, or a shelf in the physical one) and the level of attention with respect to what the person is observing.

The gender and age tracking module 28 consists of a convolutional neural network that, again starting from the image of a face cropped by the facial recognition module, predicts the gender and precise age of a person.

The position tracking module 32 of a person within the physical space is configured to communicate, e.g., with technologies such as RFID or iBeacon. The movements tracking module 31 is configured to communicate, e.g., with RGBD cameras (such as Kinect or the like).

The heat emission analysis module 29 is configured to communicate, e.g., with at least one thermal imaging camera.

All modules are used in order to improve the prediction of the emotional state.

After having been extrapolated through the use of the different modules, the data about the “satisfaction” and emotional state of the customer, as previously discussed, are stored in a database 33.

The images and what is detected by the sensors are permanently deleted from computer memories.

The collected data can be used in two ways: in real time through the reaction system, or later by a Web Analytics platform, i.e. a Business Intelligence platform that shows a series of statistics related to the “Customer and User Experience” of a customer/user during their experience. In a hypothetical application scenario in which, for example, a person is in a store and interacts with a series of products on sale and with brands in general, this makes it possible to analyze and redesign the physical spaces and the elements contained therein so as to improve the customer experience and promote the connection with the brand(s).

The Web Analytics platform, at an architectural level, comprises a Backend module 34 implemented on a remote server (e.g. on Cloud), and a client module 35, implemented on a computer that will allow the display of the user interface of the Web Analytics platform.

With reference e.g. to “physical” contexts, the data saved in the database 33 will provide a knowledge base on which the management module 30 and the relevant control logic 21 will be based.

In particular, as schematically shown in Figure 3, the control logic 21, using e.g. home automation protocols such as MQTT, or simple HTTP/HTTPS, is able to manage, in real time or later, the different electronic interaction devices 36 placed in the environment where the user is located.
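
Purely as an illustration, the control logic could publish a reaction over MQTT as in this sketch, which uses the paho-mqtt client; the broker address, topic and payload are hypothetical, MQTT being only one of the protocols the description names:

# Illustrative MQTT publication of a reaction, assuming paho-mqtt.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.local", 1883)   # hypothetical local broker
client.loop_start()                    # background network loop

def activate_led_scene(color, intensity):
    """Ask the LED control driver (361) to switch scene."""
    payload = json.dumps({"color": color, "intensity": intensity})
    client.publish("space/devices/led", payload)   # hypothetical topic

activate_led_scene("soft_red", 80)     # e.g. the tunnel scenario below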

The actions that will be undertaken concern the activation of electronic interaction devices 36 such as LED lights, headlights, room scenters, interfaces for video and audio messages, sets and anything else that can attract the user’s attention and involve them.

This way, customized multimedia messages adapted to the profile of the analyzed user will be offered, also based on any knowledge accumulated from previous interactions resulting from the use of fidelity cards, questionnaires, or choices of particular products or scenarios through the use of internet portals. The activation of lights, sounds, smells, videos, etc. will depend on whether or not default or real-time adaptive thresholds are reached, depending on the context.

A possible application example relates to the use of the emotion recognition technology together with the gender and age recognition within a “tunnel”-shaped structure that, using cameras installed at some key points, analyzes whoever ventures inside the tunnel and, in real time, proposes multimedia contents (video, music, voice messages and lights) based on the emotional percentages and the characteristics predicted by the neural networks. This way, if, for example, a woman between 30 and 40 years old is detected at the entrance and at a given moment she appears surprised with an intensity of 80%, a soft red light will light up, with music and videos dedicated to that particular range of detected values.

The choice of thresholds and multimedia content is based on analyses carried out through the training of particular algorithms that make use of machine learning models such as decision trees, SVM and especially algorithms such as K-nearest neighbors and K-Means. Depending on the situation, one model can be better than another: these techniques are trained on data collected during direct experiences in which one or more human observers hand-recorded the behavior of samples of selected users with the most heterogeneous characteristics possible. This set of data substantially allows the machine to recognize, automatically and in real time, which actions are best taken on the basis of patterns obtained during the training phase, just as happens for the recognition of emotions, gender, age, etc.

With reference to the present invention, the possibility is provided to exploit structures such as pedestrian and road routes, environments, shopping malls, and exhibition spaces in commercial or social areas such as museums, exhibitions, cinemas, theatres, discos or anything else frequented by people, within which the above mentioned devices can be activated in order to interact with the users present.

The same concept concerning the choice of thresholds and scenarios for “reactions”, as previously mentioned with reference to a “physical” space, also applies to applications in the “digital” space.

With reference to Figure 4, all the stored data coming from the user’s use of devices 37 such as mobile phones, tablets, personal computers or whatever else can have access to computer connection networks, which allow the connection and communication of local computer networks and databases, are used to improve “business intelligence” or “social intelligence”, i.e. all those processes of data collection and analysis that are fundamental for strategic business decisions. It is in fact possible to send customized services in real time and specific offers or communications for each user, generally to activate marketing strategies.

The indications coming from the analysis carried out during the use of the internet portals by the users, and stored in a remote database 46, will also be used to provide indications for the design of the “Customer” and “User Experience”, i.e. for the design of user interfaces, applications, screens and the like, which can be easier to use, more understandable, more attractive and of greater interest to users.

Similarly to the “physical” context, a web platform is provided for this purpose, which, at the architectural level, consists of a backend side 47 and a customer side 48, i.e. the computer of the platform user.

The technical operation of this process is in fact the one described for the physical context, organized in an architecture called “by services”, wherein the images/videos of the user interacting with a particular digital content are decoded, blurred and sent via the HTTPS protocol to a remote server that exposes API interfaces 38 used to receive this data. The images are then decoded again to obtain an image file readable by the software and temporarily stored in a memory 39, and then immediately deleted at the end of the process.
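
The client side of this exchange could be sketched as follows, assuming OpenCV and the requests library; the endpoint and field names are hypothetical:

# Illustrative blurred-frame upload to the remote API (38) over HTTPS.
import cv2
import requests

def send_frame(frame, session_id):
    blurred = cv2.GaussianBlur(frame, (31, 31), 0)  # privacy blurring
    ok, jpeg = cv2.imencode(".jpg", blurred)        # encode for transport
    if not ok:
        return None
    return requests.post(
        "https://analytics.example.com/api/frames",  # hypothetical endpoint
        files={"image": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
        data={"session": session_id},
        timeout=5,
    )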

Each of the received images is then associated with an identifier and stored in the respective queues 40, 41, 42, which provide quick-access storage for the images and preserve their order of arrival for the three main modules described for the physical context: the gaze tracking module 43, the emotion tracking module 44 and the gender and age tracking module 45.

In addition, the management module 63 is configured to manage the reactions through the active interaction means, specifically consisting in the devices 37.

It has in practice been found that the described invention achieves the intended objects.