Title:
EDGE COMPUTING BASED FACE AND GESTURE RECOGNITION
Document Type and Number:
WIPO Patent Application WO/2022/139683
Kind Code:
A1
Abstract:
System for image-based recognition applications. The system comprises: a remote server system for storing and training a plurality of recognition models for image-based recognition applications and a labelling algorithm; and a plurality of local computer systems in communication with the remote server system. Each local computer system comprises an associated image capture device and stores a set of trained recognition models from the plurality of recognition models. The remote server labels the captured images using the labelling algorithm, updates recognition models of the plurality of recognition models using the labelled, captured images, and exports the updated recognition models to one or more of the plurality of local computers.

Inventors:
MADATHUPALYAM CHINNAPPAN GOKUL (SG)
LONG YAOGUANG DON BRYAN (SG)
CHOUBEY AMIT KUMAR (IN)
Application Number:
PCT/SG2021/050809
Publication Date:
June 30, 2022
Filing Date:
December 21, 2021
Assignee:
NAT UNIV SINGAPORE (SG)
International Classes:
G06V10/82; G06F9/50
Foreign References:
CN110705684A2020-01-17
CN110687619A2020-01-14
US20190370687A12019-12-05
Other References:
GUO Y. ET AL.: "Distributed and Efficient Object Detection via Interactions Among Devices, Edge, and Cloud", IEEE TRANSACTIONS ON MULTIMEDIA, vol. 21, no. 11, 22 April 2019 (2019-04-22), pages 2903 - 2915, XP011752248, [retrieved on 20220303], DOI: 10.1109/TMM.2019.2912703
Attorney, Agent or Firm:
DAVIES COLLISON CAVE ASIA PTE. LTD. (SG)

Claims

1. A system for image-based recognition applications, comprising: a remote server system for storing and training a plurality of recognition models for image-based recognition applications, the remote server also comprising a labelling algorithm; a plurality of local computer systems in communication with the remote server system, each local computer system comprising an associated image capture device and storing a set of trained recognition models from the plurality of recognition models, wherein images captured by the associated image capture devices (captured images) are sent by the local computer systems to the remote server, the remote server labels the captured images using the labelling algorithm (labelled, captured images), updates recognition models of the plurality of recognition models (the updated recognition models) using the labelled, captured images and exports the updated recognition models to one or more of the plurality of local computers to implement image-based recognition applications.

2. The system of claim 1, wherein one or more said recognition models is a neural network model.

3. The system of claims 1 or 2, wherein for each of one or more of the local computer systems, the trained recognition models are recognition models, from the plurality of recognition models, trained by the remote server using images captured by the image capture device associated with the respective local computer system (the associated images).

4. The system of claim 3, wherein, for each of one or more of the local computer systems, the remote server is configured to train the recognition models using the associated images, by: training the recognition models using images captured from all devices except the associated images, and tuning hyperparameters of the trained recognition models to achieve a best fit with the associated images; training the recognition models using only the associated images; training the recognition models using images captured from all devices, and increasing a learning rate for the associated images; and updating a previously trained recognition model using the associated images.

5. The system of any one of claims 1 to 4, wherein each local computer system comprises computing resources, the respective local device sharing the computing resources between implementing the image-based recognition applications and sending images to the remote server.

6. The system of claim 5, wherein one or more of the local computer systems has: a first mode of operation in which the local computing resources are preferentially allocated to image-based recognition applications; and a second mode of operation in which the local computer resources are preferentially allocated to sending images to the remote server.

7. The system of claim 6, wherein the first mode of operation is a static recognition mode.

8. The system of claim 6 or 7, wherein the second mode of operation is a live streaming mode.

9. The system of any one of claims 1 to 8, wherein one or more local computer systems are configured to locally update the respective trained recognition models during at least one of: an off-peak period in which there is low burden on computer resources for the image-based recognition applications; and prolonged disconnect of communications between the respective local computer system and the remote server.

10. The system of any one of claims 1 to 9, wherein the plurality of recognition models stored by the remote server comprises one of: a respective, non-overlapping set of recognition models for each local computer system; and a set of models for common export to all local computer systems.

11. The system of any one of claims 1 to 10, wherein the remote server stores at least one of: a plurality of versions of one or more recognition models of the plurality of recognition models; and a plurality of different recognition models, each said different recognition model being for performing a common detection application.

12. The system of claim 11, wherein the remote server is configured to compare performance of said multiple versions of each respective recognition model, and select a best performing one of said versions for export to a relevant said local computer system.

13. The system of any one of claims 1 to 12, wherein one or more of the local computer systems stores at least one of: a plurality of versions of one or more said trained recognition models; and a plurality of different, trained recognition models, each said different trained recognition model being for performing a common detection application.

14. The system of claim 13, wherein each local computer system is configured to compare performance of said multiple versions of each respective trained recognition model, and select a best performing one of said versions for implementing the respective image-based recognition application.

15. The system of any one of claims 1 to 14, wherein for each local computer system, the image-based recognition applications comprise facial recognition and gesture recognition.

16. The system of claim 15, wherein each local computer system is configured to perform facial recognition to localise one or more people (localised people) in the images captured by the image capture device associated with the respective local computer system (the associated images), and perform gesture recognition by locating one or more hands of the one or more localised people.

17. The system of claim 16, wherein, in respect of each localised person, gesture recognition is performed by: performing facial recognition to identify a face of the localised person, thereby to localise the localised person; performing skeleton detection based on the face, to detect a skeleton of the localised person; and performing gesture recognition for the localised person based on the skeleton.

18. The system of any one of claims 1 to 17, wherein the image-based recognition applications comprise facial recognition.

19. A method for image-based recognition, comprising: capturing images at a plurality of image capture devices associated with respective ones of a plurality of local computer systems; sending the plurality of images from the local computer systems to a remote server; updating, at the remote server: the images using a labelling algorithm to produce labelled, captured images; and a plurality of recognition models stored and trained at the remote server (the updated recognition models) using the labelled, captured images; and exporting the updated recognition models to one or more of the plurality of local computers to implement image-based recognition applications.

20. The method of claim 19, further comprising implementing the image-based recognition applications by: performing facial recognition to identify at least one face, thereby to localise a person corresponding to each said face (localised person); performing skeleton detection based on the face, to detect a skeleton of the respective localised person; and performing gesture recognition for the respective localised person based on the skeleton.

Description:
Edge Computing based Face and Gesture Recognition

Technical Field

The present invention relates, in general terms, to systems and methods for image-based recognition applications.

Background

With the fall in the cost of computer systems and imaging devices, computer systems and imaging devices have become ubiquitous. They can be deployed publicly as part of a public information broadcast system, an advertising system, or a digital billboard. Computer systems such as edge computing devices often have limited computational power, limited power supply (limited battery capacity), and limited access to network bandwidth. Despite these limitations, consumers and applications often desire or require high-speed processing and inferencing based on captured data, including image and audio data. Further, environments where imaging devices may be deployed often experience significant changes in lighting or imaging conditions. The limitations of edge computing devices and changing imaging conditions present challenges in providing accurate, fast and computationally efficient systems and methods for image-based recognition applications.

It would be desirable to overcome or ameliorate at least one of the above-described problems, or at least to provide a useful alternative.

Summary

Disclosed is a system for image-based recognition applications, comprising: a remote server system for storing and training a plurality of recognition models for image-based recognition applications, the remote server also comprising a labelling algorithm; a plurality of local computer systems in communication with the remote server system, each local computer system comprising an associated image capture device and storing a set of trained recognition models from the plurality of recognition models, wherein images captured by the associated image capture devices (captured images) are sent by the local computer systems to the remote server, the remote server labels the captured images using the labelling algorithm (labelled, captured images), updates recognition models of the plurality of recognition models (the updated recognition models) using the labelled, captured images and exports the updated recognition models to one or more of the plurality of local computers to implement image-based recognition applications.

Recognition models can be trained by the remote server using images captured by the image capture device (e.g. a camera or video camera) associated with the respective local computer system (the associated images).

The remote server may be configured to train the recognition models using the associated images, by: training the recognition models using images captured from all devices except the associated images, and tuning hyperparameters of the trained recognition models to achieve a best fit with the associated images; training the recognition models using only the associated images; training the recognition models using images captured from all devices, and increasing a learning rate for the associated images; and updating a previously trained recognition model using the associated images.

Each local computer system comprises computing resources. Each local device can share the computing resources between implementing the image-based recognition applications and sending images to the remote server.

The one or more of the local computer systems may have: a first mode of operation in which the local computing resources are preferentially allocated to image-based recognition applications; and a second mode of operation in which the local computer resources are preferentially allocated to sending images to the remote server.

Also disclosed is a method for image-based recognition, comprising: capturing images at a plurality of image capture devices associated with respective ones of a plurality of local computer systems; sending the plurality of images from the local computer systems to a remote server; updating, at the remote server: the images using a labelling algorithm to produce labelled, captured images; and a plurality of recognition models stored and trained at the remote server (the updated recognition models) using the labelled, captured images; and exporting the updated recognition models to one or more of the plurality of local computers to implement image-based recognition applications.

Advantageously, recognition models are trained to be suitable for deployment on local computer systems (edge computer systems). In some cases, the recognition models are based on a cumulative knowledge from a plurality of individual models and/or individual local computer systems.

Advantageously, recognition models deployed on a local computer system may be biased to the specific location, data or imaging condition and hence be computationally faster and lighter, or more accurate for their location when compared with a genericized model.

Advantageously, the systems and methods of some embodiments implement throttling operations between video streaming and data processing operations performed by local computer systems (edge computer systems).

Advantageously, the systems and methods of some embodiments recognise hand gestures in captured images, including a plurality of hand gestures for a group of individuals in captured images. These hand gestures can be used to deduce or infer a response of a viewer to something displayed on a local device, or to effect control of the behaviour of the local device - e.g. scrolling through an advertisement or between advertisements.

Brief description of the drawings

Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which:

Figure 1 is a block diagram of a system for image-based recognition;

Figure 2 illustrates a method for image-based recognition suitable for execution by the recognition system of Figure 1;

Figure 3 illustrates a schematic diagram of some components of the recognition system of Figure 1;

Figure 4 illustrates a local computer system architecture according to some embodiments;

Figure 5 illustrates a schematic diagram of a version control pipeline implemented by the recognition system;

Figure 6 illustrates a schematic diagram of the Haar cascade algorithm for face detection according to some embodiments;

Figure 7 illustrates an image on which a face detection operation was performed according to the embodiments;

Figure 8 illustrates an example of the results of a Haar cascade based hand detection model identifying hands in an image;

Figure 9 illustrates an example image on which fingertip detection is performed;

Figure 10 illustrates an example of an image subjected to gesture recognition based on a convex angle extraction method;

Figure 11 illustrates an example of an image subjected to gesture detection using MediaPipe according to some embodiments;

Figure 12 illustrates an example of various steps of gesture recognition; and

Figure 13 illustrates another example of various steps of gesture recognition.

Detailed description

Disclosed are systems and methods for image-based recognition applications. The image-based recognition applications include applications for electronic billboards or electronic advertising systems that interact with audiences using imaging or audio data. The systems enable interaction using gestures or audio or a combination of both gestures and audio. The recognition operations include recognition of faces, determination of an identity associated with a face, and recognition of gestures performed by individuals. Gestures include a specific orientation of the hands, limbs or fingertips of an individual. Gestures encode information and serve as a means for efficient communication with a computer system. The recognition applications can enable interaction of an individual with a computer system, including performing transactions using gestures.

The system includes a remote server system in communication with a plurality of local computer systems. The local computer system corresponds to the computer system deployed on-site with an imaging device as part of a digital advertising system or digital billboard - local computers can be edge computing devices, with resources sufficient only to run image recognition models locally, and to update models in off-peak periods (e.g. when the image recognition requirements are lower, such as when the local computer is in a shopping mall that is closed for the day).

The local computer systems include or are in communication with an imaging device and an audio device such as a microphone. The imaging device includes one or more cameras. The imaging device will generally be a camera or video camera, including conventional image based cameras, 3D cameras, stereoscopic cameras etc. The imaging device can also include more than one camera or other device, and the audio device may similarly include one or more microphones. The remote server system orchestrates recognition operations by the local computer systems based on images captured by the camera. Those recognition operations are performed by recognition models stored in one or both of the remote server system and each of the plurality of local computer systems.

The remote server system has a greater amount of computing resources than the local computers and thus can perform more computationally intensive operations that a local computer system may not be capable of performing efficiently. The remote server system labels images received from the local computer systems using a labelling algorithm. The remote server system updates recognition models deployed on the plurality of local computer systems using the labelled, captured images, rather than requiring local computers to perform that update. After updating, the remote server system exports the updated recognition models to the local computer systems to implement image-based recognition applications, or at least to the local computer system or systems that require updating. The remote server system operating in concert with the plurality of local computer systems enables progressive improvement of the recognition models deployed on the local computer systems by performing the more computationally intensive operations of labelling of images and updating recognition models. The updated models can then be deployed to the various local computing devices.

As each local computer system is deployed at a specific location, the images captured by the imaging device (hereinafter referred to as a camera) associated with each local computer system can be subjected to differing imaging or environmental conditions prevalent in those specific locations. For example, some cameras can be positioned outdoors and be subjected to varying daylight, whereas others can be indoors under steadier, but often lower, light. The remote server system can train or modify a recognition model applicable to a particular local computer system to factor in the distinctive imaging or environmental conditions and improve the accuracy of the recognition operation for that particular local computer system. For example, the remote server system may first train a recognition model for a local computer system using images captured from all cameras except the camera associated with the local computer system. Hyperparameters of the trained recognition model are then tuned to achieve the best fit with the images from the local computer system. Those images are then used to train, rather than test or validate the recognition model. In addition, or in the alternative, the remote server system can increase a learning rate value when training a recognition model with images captured from the camera associated with the local computer system, when compared with the learning rate for images from other local computer systems.
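As an illustration only, the staged training and learning-rate bias described above can be sketched as follows in PyTorch. The two data loaders (all_other_images_loader and associated_images_loader), the stand-in classifier, the epoch counts and the learning rates are assumptions for the sketch, not details taken from the disclosure.

import torch
from torch import nn, optim

def train(model, loader, lr, epochs, device="cpu"):
    # Generic supervised training loop shared by both stages.
    criterion = nn.CrossEntropyLoss()
    optimiser = optim.SGD(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            optimiser.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimiser.step()
    return model

def bias_model_to_device(model, all_other_images_loader, associated_images_loader):
    # Stage 1: train on images from all other devices at a base learning rate
    # (hyperparameter tuning against the associated images is not shown here).
    model = train(model, all_other_images_loader, lr=1e-3, epochs=10)
    # Stage 2: fine-tune on the associated images only, with an increased
    # learning rate to bias the model towards the local imaging conditions.
    return train(model, associated_images_loader, lr=1e-2, epochs=3)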

Accordingly, each recognition model is updated for each local computer system, to account for the lighting conditions prevalent at the local computer system, for the ethnicity and other characteristics of people located where the local computer system is located, and so on. Training using an increased learning rate can bias a recognition model for the distinctive imaging conditions that a camera associated with the local computer system may be subjected to. Accordingly, the disclosed systems strike a balance between generalization using a larger dataset and biasing for specific local imaging conditions.

Figure 1 illustrates an example of a system 100 for image-based recognition applications. System 100 comprises at least one remote server system 110. The remote server system 110 may be implemented as a standalone server or may be part of a cluster of servers. The remote server system 110 could be implemented using a cloud computing service such as Amazon Web Services.

The remote server system 110 is configured to communicate with a plurality of local computer systems 120 over a network 150. The remote server system 110 stores recognition models for use by the local computer systems 120, and can also update those recognition models using greater resources than are available at each local computer system 120.

The remote server system 110 may also be configured to communicate with a content originating computer system 140. The content originating system 140 can store or supply content, such as advertising content and other information, that the remote server system 110 can distribute to the local computer systems 120.

The remote server system 110 communicates over a network 150. The network 150 can be any suitable network or hybrid network, such as one or more public networks - e.g. the Internet -, telecommunication networks and private networks such as a local area network, to enable communication between the remote server system 110 and the local computer systems 120.

Each local computer system 120 includes a display 128. The content originating system 140 is then used - e.g. by a presenter or influencer - to generate or supply audio-visual content for broadcast on one or more local computer systems 120. The audio-visual content generated by the content originating system 140 may be transmitted in a live or nearly live transmission mode to one or more local computer systems 120.

Each local computer system 120 comprises computing resources such as a processor or processors 122 and memory 124. The computing resources can be shared between the implementation of the image-based recognition applications and sending of images to the remote server 110.

System 100 enables multiple local computer systems 120 to be connected to the remote server system 110. Each of the local computer systems 120 stores in its memory a set of prediction machine learning models (local recognition models 126). The machine learning models are used locally - i.e. at a local or edge computer system 120 - to process images captured by a corresponding camera 130 and perform recognition operations on the captured images. The recognition operations include detection and recognition of one or more faces and any corresponding gesture performed by individuals, including hand gestures. The local computer system 120 may have limited computational power and accordingly, the local recognition models 126 will often be lightweight models that deliver high accuracy for the specific data received by the local computer system 120 but are less capable of generalisation - e.g. a local recognition model 126 may be used inside a shopping mall, where lighting conditions are stable, but not outside. By using lightweight models on the local computer system 120, the system 100 advantageously reduces the size of the model required to perform recognition operations and increases the device-specific accuracy of the results of the recognition operations. The local recognition models 126 and the master recognition model 116 are implemented using neural networks. Multiple neural network models may perform each distinct recognition operation. All or most of the recognition operations could be performed in a single pass over the image and/or audio data (single-shot processing) to reduce latency.

Various object detection neural network architectures can be used to implement the local recognition models 126 and the master recognition model 116. These architectures include Faster R-CNN (tested to perform object detection on a Raspberry Pi 4 with 4 GB RAM, running at 0.14 FPS at full CPU usage), Single Shot Detector MobileNet (processing at a rate of 1 FPS in a test environment), and You Only Look Once (YOLO) networks (processing at between 0.3 FPS and 1.5 FPS in a test environment).

While the above neural network models were tested on a Raspberry Pi 4, whichever local computer system 120 is used will, when deployed, have sufficient computing power, memory and network bandwidth to allow the recognition system 100 to perform recognition operations and display content on the display 128 while meeting the requirements for processing the data being received and streaming content.

Though the local recognition models 126 are trained for the specific local computer system 120, the trained models are affected by changing imaging conditions such as changing backgrounds, weather and other uncontrolled parameters. To respond to the changing imaging conditions, the local recognition models 126 are retrained on a timely basis to maintain the accuracy of the inferences generated by the models. Moreover, the local computers may be capable of storing multiple image recognition models for use in different conditions or at different times of day. In each case, retraining can be achieved by either retraining the models on the local computer system 120 or by pushing the local computer system specific data to the remote server system 110 to retrain the models in the remote server system 110. For ease of access by local computer systems 120, the remote server system 110 may be implemented in a cloud environment. When retraining occurs on the local system 120, it can be scheduled to take place when demand for computational resources is low - e.g. after hours in a shopping mall, or at night.
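A minimal sketch of gating local retraining on an off-peak window is shown below; the quiet hours (01:00 to 05:00) and the retrain_local_models() stub are hypothetical.

from datetime import datetime

OFF_PEAK_START, OFF_PEAK_END = 1, 5  # assumed quiet hours (01:00 - 05:00)

def retrain_local_models():
    # Placeholder for the local retraining routine described above.
    print("retraining local recognition models")

def maybe_retrain():
    # Only spend local compute on retraining when recognition demand is low.
    if OFF_PEAK_START <= datetime.now().hour < OFF_PEAK_END:
        retrain_local_models()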

The local computer system 120 implements a system for gesture control by using the camera 130 and by processing the video feed captured by the camera 130 to detect gestures performed by individuals in the captured images. The image processing operations performed by the local computer system 120 include pre-processing the images received from the camera 130, identification of one or more regions of interest in the images, image segmentation, and the recognition of gestures from among an enumerated list of possible gestures.

The similarity of the input images and, in particular, similarity in the gestures makes accurate gesture recognition a computationally complex task. Moreover, different gestures are prevalent in different geographical locations. Therefore, the gesture recognition models used at one location may differ from those used at another location. To provide a robust gesture detection system, machine learning and artificial intelligence tools are used to improve the accuracy and speed of recognition. Recognition systems disclosed herein can recognize multiple gestures of multiple individuals in a crowd in front of the image capture device 130. Advantageously, such systems implement the recognition operations using a resource-constrained local, edge computer system 120. The embodiments advantageously enable the detection of complex gestures performed by multiple individuals while addressing the computational resource constraints of the local computer system 120.

The local computer system 120 uses edge-computing principles to power gesture and face recognition operations using the lightweight local recognition models 126 that may be trained on the remote server system 110 and tested on the local computer system 120. The use of local recognition models 126 reduces the dependency on the cloud infrastructure for continuous data processing and reduces the data that may need to be transferred from the local computer system 120 to the remote server system 110. The local computer system communicates with the remote server system 110 over network 150, which can be a cellular network such as a 3G, 4G or 5G network or any other networks as mentioned above.

By performing recognition operations on the local computer system 120, the recognition system 100 reduces the computational costs on the cloud (remote server system 110) and communication platforms. By using the local recognition models 126, the recognition system also reduces the latency for data processing operations and enables a more interactive gesture based human machine interaction. Conventionally, a network of digital display devices with data analytics may be implemented by connecting all the devices to send image or voice data to the cloud for processing. This type of architecture by design shifts the computational load completely to the cloud. Even though the cloud may be capable of supporting hundreds of devices, there are substantial bandwidth and reliability risks associated with such conventional architectures. The dependency on the cloud is risky as it depends on a reliable network and the data transfer traffic increases significantly with a purely cloud-based computational architecture. Gesture recognition requires high-resolution images, particularly gesture recognition for multiple individuals in an image. This further exacerbates the bandwidth issue.

The disclosed recognition systems enable deployment of advertising signage that has embedded face and gesture recognition so that recognition operations can be performed using local recognition models 126 that are optimised by the remote server system 110. The invention thus strikes a balance between performing computations in the local computer system 120 and the remote server system 110 to provide an optimum recognition architecture/framework.

The recognition system 100, by implementing the remote server system 110 in communication with a plurality of local computer systems 120, enables the creation of a state-of-the-art generalizable recognition model (master recognition model 116) that is robust and complex, and which is generalizable for all the local computer systems 120. The master recognition model 116 comprises a set of models for common export to all local computer systems 120 - i.e. the same set of models may be exported to all local computer systems 120. This master recognition model 116 can be regarded as a culmination or accumulation of all knowledge available in each of the individual models deployed on each local computer system 120. The master recognition model 116 may be obtained by training a recognition model using all or a majority of the imaging data captured by each image capture device 130. Accordingly, the master recognition model 116 may comprise a larger number of parameters to model the myriad imaging conditions and scenarios that each imaging device 130 may be exposed to. The local recognition models 126 are derived or generated based on the master recognition model 116. The local recognition models 126 are lightweight versions of the master recognition model 116 that are specifically trained to take into account the imaging conditions and images captured by the imaging device 130 where the model will be deployed. Thus, by using this two-pronged approach, the recognition system maximizes the utilisation of the computational resources available to both the remote server system 110 and the local computer systems 120 while providing greater accuracy and reduced latency when compared with conventional mass recognition model implementations.

Conventional systems may accept inputs in the form of text from keyboards or clicks as a form of trigger or feedback for the operation of programs. Conventional systems may also rely on touch-based sensors to receive input. The disclosed recognition systems advantageously provide an alternative to such conventional systems by enabling machine interaction using gestures and/or voice. This is particularly advantageous where touching surfaces is ill-advised - e.g. during a viral pandemic where people endeavour not to touch surfaces that are touched by many other people.

In addition, or as an alternative, to gesture recognition by image analysis from an image capture device 130 incorporated into the local computer system 120, gesture recognition could be performed using data generated by handheld devices embedded with motion sensing or inertial measurement units (IMUs). Such recognition systems require an additional input device that records the motion, derives the gesture and sends the derived gesture information as an input to the local computer system 120. Such embodiments involve configuring two subsystems that work together to form a gesture-based user feedback mechanism. The add-on IMU device comprises its own processing power. The reliability of such embodiments depends on the reliability of the add-on device.

Edge-based network

Processing data by the remote server system 110 (which may be implemented in the cloud), in particular image data captured by a plurality of image capture devices 130, relies on transmission of the data to the cloud, processing it in the cloud, and receiving a response at the local computer system 120. Demand for the computational resources of the cloud (remote server system 110) may peak at certain times of the day - e.g. when the local system is in a shopping mall, lunch breaks and the start and end of the business day will be peak periods. To provide a more efficient and low latency recognition system, different computational operations may be performed on the remote server system 110 and the local computer system 120 at different times of day.

Off-Peak Computational Requirements

The disclosed recognition applications require continuous data analytics on the video and audio feed received from the image capture device 130. Data generated during the low-fidelity off-peak periods may be expensive if processed in the cloud. The recognition system 100 redistributes computational operations to the dedicated edge computing resources (local computer systems 120) to avoid using computational resources on the remote server system 110 during off-peak periods. This also reduces data transmission to the remote server system 110, enabling the remote server system 110 to service other local computer systems 120 faster.

Speed

The time taken for recognition operations at edge computing resources is relatively low compared to cloud implementation for network-constrained architectures like that presently described. Hence, an edge implementation is faster.

The architecture of the recognition system

Figure 1 illustrates an exemplary architecture of a recognition system 100. The architecture comprises edge computing enabled digital signage boards (display 128 integrated with local computer systems 120) that are connected to the cloud (remote server system 110) via the network 150, part of which may include a cellular network such as a 3G, 4G or 5G network. The cloud (remote server system 110) controls at least a part of the content that is presented on display 128. The local computer system 120 controls the order, interaction and data collection in response to the content presented on display 128. The content presented on display 128 includes static content or live streaming content. For each of these types of content, the local computer system 120 has a recognition mode - i.e. behaves in a particular way.

Static recognition mode

The display of static content (non-live content) includes the following steps:

1. A user may select display sites (specific local computer systems 120 deployed to a particular location) to present the static content. The remote server system 110 may propose sites based on demographic parameters chosen by the user.

2. The user uploads the content assets to the remote server system 110. The content assets may include a video, images or audio.

3. The assets are downloaded by individual devices (local computer system 120).

4. The interaction engine of the local computer system 120 records the events happening on the local computer system 120 and sends analysed data to the remote server system 110. The analysed data includes data indicating one or more recognised gestures, or one or more recognised faces, for example.

5. The data received by the remote server system 110 is then presented to the user in a dashboard, for example.

In the static recognition mode, the local computing resources are preferentially allocated to image-based recognition applications because the display of static content does not consume significant computational resources.

Live Streaming Mode

A presenter such as an influencer may live stream content to the display 128 of one or more than one local computer systems 120. The display of live streaming content on the display 128 may include the following steps:

1. Live content is generated by the content originating computer system 140. The content originating computer system 140 includes an end-user computer system such as a laptop, desktop, smartphone or tablet.

2. The live content is transmitted to one or more designated local computer systems 120 over the network 150. The transmission or direction of the content is controlled by the remote server system 110.

3. As interactions occur in response to the content presented on the display 128, the local computer system 120 processes the images captured by the image capture device 130 to perform recognition operations, for example, face detection or gesture detection.

4. The output of the recognition operations performed by the local computer system 120 is transmitted to the remote server system 110 or the content originating computer system 140 or both, thereby enabling a live conference between one or more local computer systems 120 and the content originating computer system 140.

In the live streaming mode, the local computer resources are preferentially allocated to sending images to the remote server to enable a more engaging and responsive interaction with the live content.

Architecture for Training

The architecture for the training of the models for recognition comprises at least two components: the local recognition model 126 deployed on the local computer system 120 and the master recognition model 116 stored in the remote server system (cloud).

As illustrated in Figure 1, the local computer system 120 may comprise four models namely:

1. Face Recognition Model 127 to perform recognition of faces, including detection of bounding boxes around faces (localised people), conversion of facial data into feature vectors suitable for comparison with a facial feature identity database enabling determination of an identity of the recognised faces.

2. Skeleton Recognition Model 129 to perform recognition of at least a part of a skeleton of a person in an image. The recognised skeleton comprises information regarding an estimated position of limbs or joints of an individual to assist gesture recognition. The skeleton recognition is performed based on the faces recognised by the face recognition model 127.

3. Gesture Recognition Model 131 to recognise gestures performed by an individual. Gestures may be recognised based on a specific orientation of digits or limbs of an individual over a period of time in images captured by the image capture device 130. Gesture recognition is performed with the assistance of the skeletons recognised by the skeleton recognition model 129, for example, based on a hand part of the identified skeleton.

4. Audio Analysis Model 132 to analyse any audio captured by microphone 134, including transcription of the captured audio into text.

The above-noted models may be lightweight models that are stored as a data file in the local computer system 120. The lightweight models have a smaller memory footprint and are less computationally demanding and hence suitable for execution by the local computer system 120. Each local computer system 120 comprises its own distinct version of the local recognition models 126 that are specifically biased for better performance in response to the data captured by the associated image capture device 130. A copy of the local recognition models 126 is also stored on the cloud (remote server system 110) enabling a comparison of the results of the local recognition models 126 with results produced by the master recognition model 116.

The master recognition model 116 in the cloud is trained and retrained on all the videos (image data) received from all the image capture devices 130. The master recognition model 116 may be updated/retrained at a faster rate (higher frequency) than the local recognition models 126. The more frequent updates/retraining to the master recognition model 116 ensures the master recognition model 116 factors in the most recent imaging data captured by the plurality of image capture devices 130 and is robust. The master recognition model 116 is used to retrain the local recognition models 126 by serving as a labelling method (labelling algorithm) for generating a training dataset. The master recognition model 116 improves over time through the retraining process and may deliver recognition output with an accuracy of up to 99%. More frequent updates to the master recognition model 116 based on input data captured from the plurality of cameras 130 advantageously provide a more robust master recognition model 116. The robustly trained master recognition model 116 can then be used as a labelling algorithm for labelling image data captured by a specific image capture device and using the labelled data to train the corresponding local recognition models 126. The local recognition models 126 could be implemented as quantised versions of the master recognition model 116.
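If the master model were a Keras model, one way to derive a quantised local model is TensorFlow Lite post-training quantisation. This is an assumption for illustration; the disclosure does not prescribe a particular quantisation toolchain.

import tensorflow as tf

def quantise_for_edge(master_model, out_path="local_model.tflite"):
    # Convert the (larger) master recognition model into a quantised,
    # lightweight model suitable for a resource-constrained edge device.
    converter = tf.lite.TFLiteConverter.from_keras_model(master_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weight quantisation
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return out_path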

Figure 2 illustrates a method 200 for image-based recognition suitable for execution by the recognition system 100 of Figure 1. Various steps of method 200 are performed by the several components of system 100.

At step 210, the camera 130 captures images. The camera 130 could be part of an electronic advertising system that captures images of the reactions of individuals observing a display 128. Image capture can be triggered by the presentation of content on the display 128 - for example, image capture may be performed to recognise responses to content. Image capture can also be accompanied by audio capture, e.g. using microphone 134. The captured images and any audio are made available to the local computer system 120 for further analysis. The local computer system 120 processes the captured images and any audio using the local recognition models 126 to recognise gestures and faces, for example. Data indicating the recognised gestures and faces is transmitted to the remote server system 110 or the content originating computer system 140 as feedback gathered in response to the content displayed on the display 128.

At step 220, the local computer system 120 sends at least a subset of the plurality of images captured by the image capture device 130 to the remote server system 110. Step 220 is performed by the plurality of local computer systems 120 allowing the remote server system 110 to receive image data from a wide variety of sources to build a robust training dataset using the gathered data.

At step 230, the remote server system 110 processes the received images (and potentially any audio) using the master recognition model/labelling algorithm 116. This processing step may include applying labelling operations to identify and label gestures, faces, skeletons etc. Through the labelling operation, the labelling algorithm 116 generates a labelled dataset that is suitable for training the local recognition models 126. The remote server system 110 comprises a copy of local recognition models 118. The copy of local recognition models 118 corresponds to the local recognition models 126 deployed on the various local computer systems 120. The local recognition models 118 are non-overlapping sets of recognition models for each local computer system 120.
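Conceptually, step 230 runs the master model over the unlabelled uploads and persists its predictions as labels for local retraining. The predict() interface and the file layout below are assumptions made for this sketch only.

import json
from pathlib import Path

def label_captured_images(master_model, image_dir, label_file):
    # Use the well-trained master model as the labelling algorithm:
    # its predictions become the labels used to retrain the local models.
    labels = {}
    for image_path in Path(image_dir).glob("*.jpg"):
        prediction = master_model.predict(str(image_path))  # assumed interface
        labels[image_path.name] = prediction
    Path(label_file).write_text(json.dumps(labels, indent=2))
    return labels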

At step 240, the remote server system 110 updates a copy of local recognition models 118 stored in the remote server system. The update may include further training the copy of local recognition models 118 based on the labelled dataset generated at step 230. Updates can occur in various ways, using data partitioned based on images captured by a particular local system, or local systems with similar images - e.g. local systems in the same geographical area, or systems within shopping malls etc. For example, a specific copy of a local recognition model 118 can be trained using images captured by the image capture device 130 associated with the respective local computer system 120 (the associated images). In such embodiments, the copy of the local recognition models 118 embodies the knowledge of only images captured by the image capture device 130 associated with the relevant local system. The remote server system 110 then compares the performance of the multiple versions of each respective recognition model 118 and selects the best performing one of those versions for export to the relevant local computer system 120.

A specific local recognition model could be first trained using images captured from all devices except the associated images - i.e. trained on all images except those from the camera associated with the relevant local computer system at which the specific local recognition model is intended to be deployed. The hyperparameters of the specific local recognition model can then be tuned to achieve the best fit with the associated images. Subsequently, the specific local recognition model is trained using only the associated images. Depending on the origin and partitioning of the images used for training, the learning rate can be adjusted - e.g. when training using only the images from the camera associated with a local system being updated, the learning rate is increased to bias a specific recognition model to embody the knowledge of the associated images. This improves the performance and accuracy of the recognition model for the images captured by the camera in question. In this way, a balance can be struck between generalisation of the local recognition models and biasing the local recognition models to their specific imaging environments.

At step 250, the remote server system 110 exports the updated recognition models to local computer systems 120. Each updated recognition model, in the copy of local recognition models 118 stored in the remote server, is associated with a specific local computer system 120 and is exported to the specific local computer system at step 250.

At steps 260 to 280, the local computer system 120 performs recognition operations using the updated recognition models received at step 250. Steps 260 to 280 are preceded by the image capture device 130 capturing images and making the captured images available to the local computer system 120.

At step 260, the local computer system 120 performs face recognition operations on the images received from the image capture device using the updated facial recognition model 127. Facial recognition operations may localise a person in one or more captured images, or identify a bounding box around one or more faces in the captured images. This reduces the field required for analysis when performing gesture recognition. Step 260 can also involve determining a feature vector representing the distinct facial features in recognised faces. The feature vector can then serve as a basis for determining an identity of a recognised face by comparing facial features with a facial feature based identity database.
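Identity determination from such a feature vector can be sketched as a nearest-neighbour search using cosine similarity; the embedding format, database layout and threshold below are illustrative assumptions, not part of the disclosure.

import numpy as np

def identify(face_embedding, identity_db, threshold=0.6):
    # identity_db: mapping of person identifier -> stored feature vector.
    # Returns the closest identity if it is similar enough, else None.
    best_name, best_score = None, -1.0
    for name, stored in identity_db.items():
        score = np.dot(face_embedding, stored) / (
            np.linalg.norm(face_embedding) * np.linalg.norm(stored))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None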

The following deep learning-based object detection techniques could be implemented by the face recognition model 127 to perform step 260:

1. Faster R-CNN based neural networks

2. You Only Look Once (YOLO) neural networks

3. Single Shot Detectors (SSDs)

4. MobileNet

5. Cascade Classifier

All the above methods could also be implemented on the cloud (remote server system 110). The OpenCV platform could be used for implementing the above techniques.

Figure 7 illustrates an image 700 on which a face detection operation was performed according to the embodiments, with the bounding boxes such as bounding box 710 identifying faces detected in image 700.

Haar Cascade

Some embodiments may implement object detection for face or gesture recognition using Haar feature-based cascade classifiers. In the cascade classifiers, a cascade function is trained from several examples of positive and negative images of the object of interest. Figure 6 illustrates a schematic diagram of a Haar cascade algorithm for face detection.

The following is a list of some Haar parameters that may be employed for object detection with the call cv2.CascadeClassifier.detectMultiScale(image[, scaleFactor[, minNeighbors[, flags[, minSize[, maxSize]]]]]); a usage sketch follows the parameter list below.

1. scaleFactor: Parameter specifying how much the image size is reduced at each image scale.

2. minNeighbors: Parameter specifying how many neighbours (e.g. candidates in neighbouring pixels or rectangles) each candidate rectangle should have to retain it. This parameter will affect the quality of the detected faces: a higher value results in fewer detections but with higher quality.

3. flags: Parameter with the same meaning for an old cascade as in the function cvHaarDetectObjects

4. minSize: Minimum possible object size. Objects smaller than that are ignored.

5. maxSize: Maximum possible object size. Objects larger than that are ignored.
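A minimal usage sketch of the call listed above, assuming the opencv-python distribution (which bundles the frontal-face cascade file used here); the parameter values and frame path are illustrative.

import cv2

# Load a pre-trained frontal face Haar cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("frame.jpg")               # a captured frame (illustrative path)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

faces = face_cascade.detectMultiScale(
    gray,
    scaleFactor=1.1,    # shrink the image by 10% at each scale
    minNeighbors=5,     # higher value -> fewer, higher-quality detections
    minSize=(30, 30))   # ignore faces smaller than 30x30 pixels

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)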

For the training of a Haar feature-based recognition model, a set of samples comprising negative and positive examples is used. Negative samples correspond to non-object images (for example non-faces/non-gestures). Positive samples correspond to images with detected objects (for example faces or valid gestures). Increasing the number of positive samples increases the generalization of the model, as more general features are identified and the model is less likely to overfit the training data. Increasing the number of negative images reduces false positive detections. Some embodiments include application-specific background information (e.g. imaging information for a particular local computer system 120, or for particular gestures or facial features) in the training data to improve the accuracy of deployed recognition models. The use of training images of larger dimensions enables training more robust recognition models. The recognition models could be initially trained using 1000 positive images and 1000 negative images.

YOLO Model in OpenCV

A YOLO based object detection/recognition model requires at least four input arguments (a minimal loading sketch follows the list):

1. Input image

2. YOLO configuration file

3. A pre-trained set of YOLO weights

4. A text file containing class names identifying the various classes of objects to be detected
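Those four inputs map onto OpenCV's DNN module roughly as follows; the file names are placeholders for whichever YOLO configuration, weights and class list are actually deployed.

import cv2

# The four inputs listed above: image, configuration file, weights, class names.
net = cv2.dnn.readNetFromDarknet("yolo.cfg", "yolo.weights")
with open("classes.txt") as f:
    class_names = [line.strip() for line in f]

image = cv2.imread("frame.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())
# Each row of each output holds box coordinates followed by per-class scores.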

MobileNet-SSD

MobileNet-SSD is a single shot detector based on MobileNet that could be implemented to detect humans in images captured by the image capture device. MobileNet-SSD may provide processing at a high frame rate similar to the performance of a Haar feature-based model. MobileNet-SSD based object detection provided greater accuracy in comparison to a Haar feature-based model in some experiments.

A well-trained Haar detector is incorporated to provide accurate face detection while being lightweight, because of the simplicity of the algorithm and its implementation. For person detection/recognition operations, some embodiments may incorporate a MobileNet-SSD based model.
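A person-detection sketch using the commonly published Caffe MobileNet-SSD release is shown below; the file names, the class index for "person" and the confidence threshold are assumptions, not values from the disclosure.

import cv2

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")
PERSON_CLASS_ID = 15  # index of "person" in the common 20-class release (assumed)

image = cv2.imread("frame.jpg")
h, w = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 0.007843,
                             (300, 300), 127.5)
net.setInput(blob)
detections = net.forward()

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    class_id = int(detections[0, 0, i, 1])
    if class_id == PERSON_CLASS_ID and confidence > 0.5:
        box = detections[0, 0, i, 3:7] * [w, h, w, h]
        print("person at", box.astype(int))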

At step 270, the local computer system 120 performs skeleton detection using the updated skeleton recognition model 129. Skeleton detection identifies a skeleton or joints of one or more individuals in the captured images. The skeleton detection step enables the association of faces with particular gestures thereby allowing the local computer system 120 to identify which individual performed a particular gesture that is subsequently detected at step 280. Skeleton detection advantageously allows gesture detection in crowded urban environments where a large number of individuals interact with the local computer system 120 and several gestures may be performed simultaneously or nearly simultaneously in response to the content displayed on the display 128.

At step 280, the local computer system performs gesture recognition for localised people - i.e. people identified at a particular location in the images - and the respective skeletons detected for those people. Gesture recognition can be based on a predefined set of gestures that the individuals may be familiar with or gestures that may be prompted by the content on the display 128. Gestures may be identified based on the relative position of digits of a person, orientation of hands or limbs of a person, or a combination of the two. The identified gestures may be encoded in a form suitable for processing by other computer systems such as the remote server system 110 or the content originating computer system 140.

Gesture recognition is performed through image processing to recognise hands or parts of hands, including specific orientations of the digits of the hand or the orientation of the palm or a combination of both. A captured image is processed and one or more hands are detected in the image. Subsequently, a region of the image corresponding to the detected hand(s) is (are) segmented from the background for further recognition operations. Recognition is carried out on the segmented hand image region using various feature extraction techniques. The steps involved in gesture recognition performed by the gesture recognition module 131 include:

• Hand detection

• Segmentation of the image to isolate the detected hands or parts of hands, including fingertips

• Feature extraction to determine a gesture feature vector

• Gesture recognition or Classification based on the gesture feature vector

Haar Cascade for Hand Detection

Some embodiments may perform hand detection based on a Haar cascade based model. Figure 8 illustrates an example of the results of a Haar cascade based hand detection model identifying hands in bounding boxes such as bounding box 810.

YOLO based hand detection

Some embodiments may incorporate a YOLO framework based hand detection model. A YOLO based model may detect gestures in noisy images.

Fingertip detection

Fingertips could be detected by the gesture recognition model 131 to identify a gesture. Haar based fingertip detection could be implemented by training a Haar model using positive and negative examples of images of fingertips. After identification of the fingertips in an image associated with a particular hand, the number of fingertips may be used to detect a gesture. The gesture recognition module 131 could be configured to segment or crop a part of a captured image corresponding to a hand. The fingertip detection operation is then performed on the segmented image enabling segregated fingertip detection for each detected hand. This sequential process of detecting hands followed by fingertip detection for each detected hand enables system 100 to perform gesture recognition for images with more than one individual performing gestures. Fingertip detection could be performed at the rate of 4.7 FPS, for example.
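The sequential hand-then-fingertip process could look like the following sketch; the two cascade files are hypothetical trained models, not files that ship with OpenCV.

import cv2

hand_cascade = cv2.CascadeClassifier("hand_cascade.xml")            # hypothetical model
fingertip_cascade = cv2.CascadeClassifier("fingertip_cascade.xml")  # hypothetical model

def count_fingertips_per_hand(frame):
    # Detect hands first, then run fingertip detection only inside each
    # hand's bounding box so fingertip counts stay segregated per hand.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    counts = []
    for (x, y, w, h) in hand_cascade.detectMultiScale(gray, 1.1, 5):
        hand_region = gray[y:y + h, x:x + w]
        fingertips = fingertip_cascade.detectMultiScale(hand_region, 1.1, 3)
        counts.append(len(fingertips))  # the fingertip count maps to a gesture
    return counts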

Figure 9 illustrates an example image 900 on which fingertip detection is performed to identify fingertips within bounding boxes such as bounding boxes 910 and 920. Based on the number of identified fingertips associated with a hand, a gesture may be identified.

Convex Angle extraction

Gesture recognition could be performed by extraction of a convex angle based on a contour of a hand or a part of a hand. In this method, the cropped image of a detected hand is processed to extract a contour of the hand. The extracted contour is converted to a convexity hull, followed by the determination of a centre of the convexity hull. The maximum convexity points may be obtained based on the cosine rule of trigonometry; these points correspond to fingertip points. The angles of the fingertips from the centre of the hull are then used to detect the gesture. Performance of gesture classification can be enhanced by obtaining the junctions between the fingers based on the determined farthest points from the convex hull. Convex angle extraction provides an extra set of features that do not move relative to the centre of the hand, and hence can be used to identify the angle and position of the hand and thereby identify a gesture.

Figure 10 illustrates an example of an image 1000 subjected to gesture recognition based on the convex angle extraction method. A bounding box 1010 designates a detected hand. A segment 1020 connects the farthest points in the contour of the detected hand representing the convexity hull. The angle of segment 1020 with respect to the contour of the hand is used to identify a thumbs-up gesture.
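By way of a non-limiting sketch, the convex-angle features described above could be computed with OpenCV roughly as follows. The Otsu threshold used to isolate the hand and the interpretation of convexity defects as finger junctions are assumptions made for illustration.

```python
# Non-limiting sketch of convex-angle extraction on a cropped hand image.
import cv2
import numpy as np

def convex_angle_features(hand_roi_bgr):
    gray = cv2.cvtColor(hand_roi_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    contour = max(contours, key=cv2.contourArea)

    hull_idx = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull_idx)
    m = cv2.moments(contour)
    if m["m00"] == 0 or defects is None:
        return []
    centre = np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]])

    features = []
    for s, e, f, _ in defects[:, 0]:
        start, end, far = contour[s][0], contour[e][0], contour[f][0]
        # Cosine rule at the farthest (defect) point; a small angle typically
        # indicates the junction between two extended fingers.
        a = np.linalg.norm(end - start)
        b = np.linalg.norm(far - start)
        c = np.linalg.norm(end - far)
        junction_angle = np.degrees(np.arccos((b ** 2 + c ** 2 - a ** 2) / (2 * b * c)))
        # Angle of the hull (fingertip) point relative to the hull centre.
        tip_angle = np.degrees(np.arctan2(start[1] - centre[1], start[0] - centre[0]))
        features.append((junction_angle, tip_angle))
    return features
```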

Unified Gesture and Fingertip detection

Gesture recognition could also be performed using a unified convolutional neural network (CNN) algorithm for both hand gesture recognition and fingertip detection at the same time. The unified algorithm uses a single network to predict both finger class probabilities (i.e. which particular fingers have been identified - such as ring finger, middle finger etc) for classification and fingertip positional outputs (i.e. the position of the fingertip of each detected finger) for regression in one evaluation.
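As an illustrative sketch only, a unified network of this kind could be expressed with two output heads sharing a single backbone, as below. The layer sizes, losses and input resolution are assumptions and do not represent the actual network of the embodiments.

```python
# Illustrative two-head network sketching the "one evaluation" idea: a shared
# backbone with a classification head (which fingers are present) and a
# regression head (fingertip x, y positions). All sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unified_model(num_fingers=5, input_shape=(128, 128, 3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Head 1: per-finger presence probabilities (classification).
    finger_probs = layers.Dense(num_fingers, activation="sigmoid", name="finger_probs")(x)
    # Head 2: fingertip positions, (x, y) per finger (regression).
    fingertip_xy = layers.Dense(num_fingers * 2, activation="linear", name="fingertip_xy")(x)

    model = Model(inputs, [finger_probs, fingertip_xy])
    model.compile(optimizer="adam",
                  loss={"finger_probs": "binary_crossentropy", "fingertip_xy": "mse"})
    return model
```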

MediaPipe - Skeleton extraction

Gesture recognition could alternatively be performed using MediaPipe for skeleton extraction on edge devices. Figure 11 illustrates an example of an image 1100 subjected to gesture detection using MediaPipe according to some embodiments.
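A minimal sketch of skeleton extraction with the MediaPipe pose solution is shown below; the configuration values are illustrative defaults only.

```python
# Minimal sketch using MediaPipe's pose solution for skeleton extraction on a
# single frame; configuration values are illustrative defaults.
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(static_image_mode=False, min_detection_confidence=0.5)

def extract_skeleton(frame_bgr):
    results = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return []
    # Normalised (x, y) landmarks of the detected skeleton, e.g. wrists and elbows,
    # which can be used to locate hand regions for gesture recognition.
    return [(lm.x, lm.y) for lm in results.pose_landmarks.landmark]
```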

O-Detect Algorithm

The recognition operations for the detection of faces and gestures could be integrated hierarchically to improve the accuracy of gesture recognition in crowded environments or in environments wherein more than one individual may be performing gestures at the same time or nearly the same time. The integrated detection of faces and gestures is also advantageously scalable to perform gesture detection for a larger number of individuals.

Face detection is a relatively computationally inexpensive process and the detected faces provide a starting point to enable skeleton detection. Skeleton detection could extend to the detection of the upper body or hands of one or more persons in the captured images. Skeleton detection could provide the endpoint of hands that can further be used to trigger hand gesture recognition.

The integrated and hierarchical method of hand and gesture recognition advantageously avoids detection of ambiguous gestures arising from objects that are not part of a human body or from movements that are not intended as a gesture. The integrated method also provides more accurate hand detection. The detected hand could be further segmented for gesture recognition. This allows the identification of the person raising the hand or performing a gesture, which can be used as a useful trigger for individual controls or operations.

Figure 12 illustrates an example of this integrated hierarchical method of gesture recognition. Image 1210 is an image wherein face recognition is performed. In image 1220, skeleton detection for each face recognised at an earlier stage is performed. In image 1230, hands corresponding to each detected skeleton at an earlier stage are segmented and the segmented hands may be subjected to gesture recognition.

Figure 13 illustrates another example of the integrated hierarchical method of gesture recognition. In image 1300, a face region 1310 may be detected by the face recognition model 127. This may be followed by the detection of a skeleton 1320 by the skeleton recognition model 129. The detected skeleton 1320 may serve as a guide for the detection of the hand region 1330, which may be further analysed by the gesture recognition model 131 to recognise a gesture.
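For illustration only, the hierarchical flow could be orchestrated as in the sketch below, where detect_faces, detect_skeleton, hand_regions and classify_gesture are hypothetical stand-ins for the face recognition model 127, the skeleton recognition model 129 and the gesture recognition model 131.

```python
# High-level sketch of the hierarchical O-Detect flow: faces first, then a
# skeleton per face, then hand regions per skeleton. All four helpers passed in
# are hypothetical placeholders for the models described above.
def o_detect(frame, detect_faces, detect_skeleton, hand_regions, classify_gesture):
    results = []
    for face_box in detect_faces(frame):             # computationally cheap first stage
        skeleton = detect_skeleton(frame, face_box)   # skeleton detection seeded by the face
        if skeleton is None:
            continue
        for hand_roi in hand_regions(frame, skeleton):  # hand endpoints of the arms
            # Each recognised gesture is attributed to the individual whose face seeded it.
            results.append({"face": face_box, "gesture": classify_gesture(hand_roi)})
    return results
```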

At step 290, the local computer system 120 transmits the recognition information to the remote server system 110 or the content originating computer system 140 or both. The recognition information may comprise information regarding the recognised faces and one or more gestures associated with the recognised faces, if any, or any speech transcribed by the audio analysis model 132. Steps 260 to 290 may be performed at a frequency suitable for the gesture-based interactions prompted by the content being displayed on display 128.

Figure 3 illustrates a schematic diagram 300 of some components of the recognition system 100 of Figure 1 and some aspects of the training performed according to method 200 of Figure 2. The local computer system 120 processes the image data and audio data received from the image capture device 130 and microphone 134 to determine facial recognition output 302, gesture recognition output 304 and audio analysis output 306. The outputs 302, 304 and 306 are transmitted by the local computer system 120 to the remote server system 110. The face recognition output 302 may be transmitted at a rate of one frame per second to 0.1 frames per second, or at any other rate for which bandwidth is available. In addition, the local computer system 120 may also transmit video captured by the image capture device 130. The video may be transmitted at a rate of 2 to 20 frames per second, for example. The rate of transmission is controlled based on the bandwidth of transmission available to the local computer system 120 and the capability of the remote server system 110 to receive the data. The transmission rate of the video stream may also be based on the computational operations being performed by the local computer system 120.

The local computer system 120 operates in a first mode (processing-focused mode) or a second mode (transmission-focused mode) of operation. If, in the first mode, the local computer system 120 is performing computationally intensive operations, then the local computer system 120 transmits the video stream at a lower frame rate to prioritise the use of its computational and memory resources for recognition operations. In the first mode, the local computing resources are preferentially allocated to image-based recognition applications rather than transmission operations. In the second mode of operation, the computing resources of the local computer system 120 are preferentially allocated to transmission, i.e. sending images to the remote server 110.
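Purely as an illustrative sketch, the two-mode policy might be expressed as follows; the mode names, frame rates and load heuristic are assumptions, loosely drawn from the example transmission rates above.

```python
# Illustrative-only sketch of the two-mode throttling policy; all values are
# assumptions based on the example frame-rate range mentioned above.
def select_stream_fps(mode, recognition_load_high):
    if mode == "processing_focused":
        # Prioritise local recognition; stream more slowly when the load is high.
        return 2 if recognition_load_high else 5
    if mode == "transmission_focused":
        # Prioritise sending frames to the remote server system 110.
        return 20
    return 5
```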

The remote server system processes the received video stream to perform recognition operations using the master recognition model 116 to determine facial recognition output 312, gesture recognition output 314 and audio analysis output 316. Since more computational and memory resources are available in the remote server system 110, the master recognition model 116 generates more accurate results when compared with the results generated by the more lightweight local recognition models 126 of the local computer system 120. Thus, the results 312, 314 and 316 generated by the master recognition model 116 are considered a labelled training dataset for the training of the local recognition models 126. Supplemental manual labelling could be performed to further improve the accuracy of the results 312, 314 and 316 and generate a robust training dataset.

The system 100 adaptively prioritises transmission of the video stream or facilitation of recognition applications depending on the mode of operation of the system 100. There are two scenarios of operation of system 100. The first scenario comprises an operation in a grid mode wherein the plurality of local computer systems 120 send video stream data at a rate sufficient to render the video stream on a display of the content originating computer system 140. The grid mode involves the presentation of the plurality of video streams in a grid on the display of the content originating computer system 140 enabling a presenter to view and present to the plurality of local computer systems 120 simultaneously. In the grid mode, each local computer system 120 strikes a similar balance between utilization of computation power for transmission of video and processing of the video for recognition operations.

In a second scenario, the recognition system operates in a focus mode. In the focus mode, interactions between a specific local computer system 120 (focused system) and the content originating computer system 140 are prioritised. This may occur if a presenter attempts to specifically interact with an individual in front of the focused system. In the focus mode, the specific local computer system 120 prioritises transmission of its video stream and de-prioritises its recognition operations to facilitate a smoother interaction with the presenter. Similarly, in the focus mode, the local computer systems 120 apart from the focused system deprioritise the transmission of their video streams or transmit video at a lower frame rate/resolution, freeing up computational resources for recognition operations. Accordingly, throttling in response to the transition to focus mode is used to optimise the use of the computational resources of the recognition system 100.

The requirement of a higher frame rate of transmission is subject to various transmission and network constraints. Since there is always a trade-off between accuracy and response time, the recognition system 100 balances this trade-off by throttling operations and by providing the alternative of performing recognition operations on the local computer system 120.

Implementation Architecture

Figure 4 illustrates a local computer system architecture 400 according to some embodiments. A local computer system 120 is configured to communicate with an image capture device 130 and a microphone 134. There may be provided an antenna 410 and a modem 430 to enable communication with the remote server system 110. A battery 420 is provided to power the various components. There may be provided removable storage devices 440 and 450 to store data or program code including the local recognition models 126.

A version control mechanism may be implemented to track and manage the updates to the master recognition model 116 and the various local recognition models deployed on the local computer systems 120. Figure 5 illustrates a schematic diagram 500 of a version control pipeline implemented by the recognition system 100. A mainline 510 may correspond to a series of versions of the master recognition model 116. As the master recognition model 116 is updated, additional versions such as M1, M2 and M3 may be progressively added to the mainline. Forks may be created from the mainline 510 to form one or more release branches 520. Each release branch 520 may correspond to a local recognition model 126 suitable for deployment on a specific local computer system 120. Figure 5 illustrates two forks F1 and F2 as applied to models M1 to M3. The model in the release branch may be evaluated for deployment and, if found suitable, it may be pushed to a production branch 530. In Figure 5, model version 2.2 is found suitable for deployment, model version 2.2 being the culmination of models M1 to M3 and forks F1 and F2.

The local computer system 120 can store a number of versions of the local recognition models 126. Each different local recognition model can perform the common detection/recognition operations. The local computer system 120 compares the performance of the multiple versions of the local recognition models 126, and selects the best performing one of those versions for implementing a particular image-based recognition application.
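As a sketch only, version selection could proceed as follows, assuming a held-out set of validation frames and a hypothetical evaluate() metric.

```python
# Sketch only: selecting the best-performing stored model version for a task.
# model_versions maps a version id to a loaded local recognition model; the
# evaluate() metric and validation_frames are assumptions.
def select_best_version(model_versions, validation_frames, evaluate):
    scores = {vid: evaluate(model, validation_frames)
              for vid, model in model_versions.items()}
    best = max(scores, key=scores.get)
    return best, model_versions[best]
```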

Candidate models such as models in the release branch could be evaluated using AB testing methods. Models that do not provide greater accuracy and/or improved performance may be discarded.

Training Methodology

The training architecture as illustrated in Figure 3 involves using the master recognition model 116 to detect gestures and faces and using the outcomes as labels to train the local recognition models 126. The rate or frequency of updates to the local recognition models 126 can be optimised based on the similarity of the updated models with respect to the existing local models. The local recognition models 126 may be trained at a faster rate by training the model with only the new dataset instead of the entire dataset.
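A minimal sketch of this faster update path is given below, assuming a Keras-style local model and newly labelled frames; the optimizer, learning rate, loss and epoch count are assumptions.

```python
# Sketch of the faster update path: fine-tune an existing local model on the
# newly gathered, master-labelled frames only, rather than on the full dataset.
import tensorflow as tf

def incremental_update(local_model, new_images, new_labels):
    local_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                        loss="sparse_categorical_crossentropy")
    # Only the incrementally gathered, master-labelled data is used here.
    local_model.fit(new_images, new_labels, epochs=2, batch_size=16)
    return local_model
```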

After deployment of an updated model, the updated model may be tested by comparing results obtained from the updated model with the results of the previous model. If an updated model is unable to match the accuracy and/or performance of the previous model, then the updated model may be discarded. An AB testing methodology may be implemented for evaluating updated models.

Edge-based updates to recognition models

The local computer systems 120 may not have a persistent communication link with the remote server system 110. In such embodiments, the local recognition models 126 may be updated/trained on the local computer system 120. The training/update of the models may take a long time to complete and accordingly such operations are preferentially performed during off-peak periods such as during the night when the computational requirements for image processing of the local computer system 120 are low. Training and update of models on the local computer system 120 independently of the remote server system 110 enables progressive improvement of the local recognition models 126 despite the lack of connectivity with the remote server system 110. The local recognition models 126 may be trained/updated during a prolonged disconnection of communications between the local computer system 120 and the remote server 110. Thus, the remote server system 110 may not include the most updated local recognition models in this instance, and may not yet have received a copy of the updated local recognition models - i.e. may store models that differ from those at a local computer system 120.

Edge Algorithm for Face and Gesture Detection

The processing of gesture and face recognition may occur on the edge computer. Since the computational capability of the edge computer is low, throttling methodologies are implemented to maximise performance without compromising the functionality of the local computer system 120.

The video stream is obtained from the camera (image capture device 130) by the edge computer (local computer system 120) and sent to a processor 122 frame by frame for processing. The total computational time required for processing the frame determines the overall FPS performance of the local computer system 120.
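For illustration, the frame-by-frame loop and its effective FPS might be measured as in the sketch below, where process_frame() is a placeholder for the recognition pipeline executed by the processor 122.

```python
# Sketch of the frame-by-frame loop on the edge device; the achieved FPS is the
# inverse of the per-frame processing time. process_frame() is a placeholder.
import time
import cv2

def run_edge_loop(process_frame, camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        start = time.perf_counter()
        process_frame(frame)
        fps = 1.0 / max(time.perf_counter() - start, 1e-6)
        print(f"effective FPS: {fps:.1f}")
    cap.release()
```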

Crowd based detection

The recognition system 100 enables crowd-based gesture or face recognition to enable gesture-based interaction with content displayed on display 128. The image capture device 130 may capture images of multiple individuals consuming the content displayed by the display 128. Accordingly, some embodiments implement crowd-based gesture and face recognition methodologies.

Complexity

The recognition system 100 advantageously addresses the complexity of detecting gestures and faces of multiple individuals (a crowd). When multiple hands are present in a scene, conventional object detection techniques may detect objects with varying confidence levels. The recognition system 100 detects multiple hands (multiple gestures) in a single scene with a high confidence level to take into account multiple gestures being performed by multiple individuals. Gestures that may be detected with a low confidence level by conventional object detection techniques are nevertheless recognised at a high confidence level by the recognition system 100.

The recognition system 100 also advantageously addresses hand ambiguity in gesture recognition by providing a more robust training methodology described with reference to Figure 3.

Computation

Highly accurate models are often complex and take longer in the training and inference phases. This is because the amount of information or knowledge stored in a higher-order neural network increases the computation required by the network (for example, the convolutions in CNNs) to infer the output for a given frame. Hence, to keep the computational load low and yet achieve high accuracy, the recognition system 100 provides lightweight models (local recognition models 126) that capture the essential information to recognise gestures or faces.

The local recognition models 126 need not be globally generalizable as the sample set on which they operate is limited to the space and time in which the associated image capture device 130 is positioned. The time-based factors are addressed by periodically updating/retraining the local recognition models 126 based on incrementally gathered imaging data from the image capture device 130 that may be labelled using the master recognition model 116. The recognition system 100 strikes a balance between the use of a generalizable model and deployment of lightweight models on the local computer systems 120 to optimise the performance of the recognition processes without the need for extensive computational power deployed on the edge (local computer system 120).

Edge Implementation

Performing recognition operations in the local computer systems 120 (edge) provides multiple advantages, including lower latency and optimised utilization of the computational resources closest to where the raw data (image data or audio data) is generated. The transmission of processed data (recognised face/gesture data) reduces the data rate of transmission between the local computer system 120 and the remote server system 110. The use of edge-based recognition operations also allows continued operation of the local computer system 120 in the event of a breakdown of communication with the remote server system 110. During a live streaming session, the processed frames from the local computer system 120 are sent to the remote server system (cloud server) for media streaming through a WebRTC interface provided on the remote server system 110. A media server 119 provided on the remote server system 110 (cloud) receives the stream and may restream it to a dashboard interface accessible through other computer systems such as the content originating computer system 140.

To achieve effective utilisation of computing resources on the local computer system 120 (edge device), the computation capability of the local computer system 120 is throttled between data processing - i.e. face and gesture detection - and streaming of the video captured by the camera 130. For a scenario where a high FPS stream is required, the data processing for recognition operations is partially or completely performed by the remote server system 110 (cloud). At any particular time only a handful of local computer systems 120 (edge devices) may be operated in the high FPS mode. By dynamically switching recognition operations between the local computer systems 120 and the remote server system 110, the recognition system 100 advantageously maximises the usage of the computational resources at both the local computer systems 120 and the remote server system 110 while providing a rich and interactive experience to the users of the system.

Targeted Marketing

The ability to detect a person's face and link the detected face to a specific identity can be used to recognise individuals. By recognising individuals, marketing can be personally targeted, wherein the identity of the person can be used to deliver more targeted information on the display 128.

Purchase through digital display

The gesture and face recognition combination can enable direct purchase based on a specific purchase gesture in response to the content displayed on the display 128 inviting a purchase. Such embodiments may enable one-gesture purchase transactions. A specific gesture can be used to confirm a transaction or purchase. For example, when a person shows an "ok" gesture for 5 seconds a product may be purchased in response to the gesture. That person will generally need to first have been identified using facial recognition as discussed herein.
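As an illustrative sketch only, the hold-to-confirm behaviour could be implemented with a simple timer as below; the gesture label, hold duration and trigger_purchase() callback are assumptions drawn from the example above.

```python
# Sketch of the hold-to-confirm idea: a purchase is triggered only after the
# recognised individual holds the "ok" gesture continuously for 5 seconds.
# trigger_purchase() is a hypothetical callback.
import time

class GestureHoldConfirm:
    def __init__(self, hold_seconds=5.0):
        self.hold_seconds = hold_seconds
        self._start = None

    def update(self, person_id, gesture, trigger_purchase):
        if gesture == "ok":
            if self._start is None:
                self._start = time.monotonic()
            elif time.monotonic() - self._start >= self.hold_seconds:
                trigger_purchase(person_id)   # person_id comes from face recognition
                self._start = None
        else:
            self._start = None
```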

Gesture-based interaction

The system 100 enables a presentation of content to one or more displays 128. The system 100 enables engagement of audiences with the presentation using gestures. For example, a specific gesture may operate as a buzzer in a quiz game conducted using system 100. The recognition of gestures enables an interactive engagement with content on display 128.

Presenter enhancement

For a live session using the recognition system 100, the remote server system 110 provides a presenter dashboard that can dynamically present relevant information about a person the presenter is interacting with. The relevant information may include information regarding previous interactions of the presenter with the person.

The term "common export" means the same models are exported to all local computer systems. Similarly, the term "common detection application" means multiple models are intended to detect the same thing - e.g. fingertips.

A "relevant local computer system" or similar is a local computer system for which a particular model has been trained by the remote server. The remote server then exports the trained model to the relevant local computer system.

The labelling algorithm stored at the remote server is state-of-the-art. It may undergo supervised training using ground-truth labels, manual labels, or other mechanisms until it is highly accurate. It may be generalised for application to images taken by any image capture device at any location. It may instead be specific to a particular site such that it is generalizable to images captured by all image capture devices from the particular site - e.g. image capture devices in signage boards throughout a mall - but not necessarily generalizable to other sites - e.g. from a site in one country to a site in another country.

By labelling the images captured by an image capture device, the algorithms for the local computer system associated with that image capture device can therefore undergo supervised training using the labelled images.

It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.

Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavor to which this specification relates.