

Title:
CONTEXTUAL TRAINING SYSTEMS AND METHODS
Document Type and Number:
WIPO Patent Application WO/2019/090268
Kind Code:
A1
Abstract:
The systems and methods provide an action recognition and analytics tool for use in manufacturing, health care services, shipping, retailing and other similar contexts. Machine learning action recognition can be utilized to determine cycles, processes, actions, sequences, objects and/or the like in one or more sensor streams. The sensor streams can include, but are not limited to, one or more video sensor frames, thermal sensor frames, infrared sensor frames, and/or three-dimensional depth frames. The analytics tool can provide for contextual training using the one or more sensor streams and machine learning based action recognition.

Inventors:
AKELLA PRASAD NARASIMHA (US)
ASSOUL ZAKARIA IBRAHIM (US)
CHAUDHURY KRISHNENDU (US)
CHHABRA YASH RAJ (IN)
DALMIA ADITYA (IN)
NARUMANCHI SUJAY VENKATA KRISHNA (IN)
RAVINDRA CHIRAG (IN)
UGGIRALA ANANTH (US)
ASHOK ANANYA HONNEDEVASTHANA (IN)
GUPTA SAMEER (US)
Application Number:
PCT/US2018/059278
Publication Date:
May 09, 2019
Filing Date:
November 05, 2018
Assignee:
DRISHTI TECH INC (US)
International Classes:
G05B19/418; G06K17/00; G06N99/00; G06Q10/06; G06Q50/04; G06T7/00; G06V10/25
Domestic Patent References:
WO2017040167A1 2017-03-09
Foreign References:
US20130307693A1 2013-11-21
US20120225413A1 2012-09-06
US20140326084A1 2014-11-06
US20120197898A1 2012-08-02
US20140277593A1 2014-09-18
US20140337000A1 2014-11-13
US20170232613A1 2017-08-17
Attorney, Agent or Firm:
MURABITO, Anthony C. (US)
Claims:
CLAIMS

What is claimed is:

1. A contextual training method comprising:

accessing a representative data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to one or more sensor streams;

accessing the one or more sensor streams indexed by the representative data set;

outputting an indication of a given process of the representative data set and one or more corresponding portions of the one or more sensor streams;

receiving in real time a current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters for a current portion of one or more sensor streams;

comparing a current process in the current data set to the given process in the representative data set; and

outputting a result of the comparison of the current process in the current data set to the given process in the representative data set.

2. The method of Claim 1, further comprising:

receiving an indication of a given one of a plurality of subjects; and

wherein accessing the representative data set further includes accessing the representative data set for the given subject.

3. The method of Claim 1, further comprising:

outputting a next given process of the representative data set and one or more corresponding portions of the one or more sensor streams responsive to the result of the comparison of the current process to the given process indicating a successful completion of the current process;

comparing a next current process in the current data set to the next given process in the representative data set; and

outputting a result of the comparison of the next current process in the current data set to the next given process in the representative data set.

4. The method of Claim 1, further comprising:

outputting a given correction process of the representative data set and one or more corresponding portions of the one or more sensor streams responsive to the result of the comparison of the current process to the given process indicating an unsuccessful completion of the current process;

comparing the current correction process in the current data set to the given undo process in the representative data set; and

outputting a result of the comparison of the current correction process in the current data set to the given undo process in the representative data set.

5. The method of Claim 1, wherein comparing the current process in the current data set to the given process in the representative data set comprises determining in real time one or more differences based on one or more corresponding error bands.

6. The method of Claim 1, wherein comparing the current process in the current data set to the given process in the representative data set comprises validating that the current process conforms to the given process within one or more corresponding error bands.

7. The method of Claim 1, wherein comparing the current process in the current data set to the given process in the representative data set comprises detecting one or more types of differences from a group including object deviations, action deviations, sequence deviations, process deviations and timing deviations.

8. The method of Claim 1, wherein the result of the comparison of the current process in the current data set to the given process in the representative data set is output in real time to a worker.

9. One or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform a method of contextual training comprising:

accessing a representative data set from a data structure and one or more sensor streams associated with a subject, the data structure including a plurality of data sets including one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams;

outputting given indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams of the representative data set;

receiving in real time one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams;

comparing the one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams to the given indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams of the representative data set; and

outputting a result of the comparison of the one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams to the given indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams of the representative data set.

10. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the method of contextual training according to Claim 9, wherein the subject comprises an article of manufacture, a health care service, a warehousing, a shipping, a restaurant transaction or a retailing transaction.

11. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the method of contextual training according to Claim 9, wherein the operation of comparing the one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams to the given indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams of the representative data set includes:

generating a representation including a finite state machine and a state transition map based on the representative data set; and

inputting the one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams to the representation including the finite state machine and the state transition map.

12. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the method of contextual training according to Claim 9, wherein the operation of comparing the one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams to the given indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams of the representative data set comprises determining in real time one or more differences based on one or more corresponding error bands.

13. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the method of contextual training according to Claim 9, wherein the operation of comparing the one or more indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters associated with a current portion of the plurality of sensor streams to the given indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the one or more sensor streams of the representative data set includes determining if the one or more processes, one or more actions, or one or more sequences associated with a current portion of the plurality of sensor streams are performed within a predetermined completion time.

14. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the method of contextual training according to Claim 13, wherein outputting a result of the comparison includes outputting an indication of proficiency when the one or more processes, one or more actions, or one or more sequences associated with a current portion of the plurality of sensor streams are performed within a predetermined completion time.

15. A system comprising:

one or more sensors;

one or more data storage units; and

one or more engines configured to:

receive one or more sensor streams from the one or more sensors;

determine one or more indicators of one or more cycles of one or more processes including one or more actions arranged in one or more sequences and performed on one or more objects, and one or more parameters in the one or more sensor streams;

access a representative data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to previous portions of the one or more sensor streams;

output an indication of a given process of the representative data set and one or more corresponding portions of the one or more sensor streams;

receive in real time a current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters for a current portion of the one or more sensor streams;

compare a current process in the current data set to the given process in the representative data set; and

output a result of the comparison of the current process in the current data set to the given process in the representative data set.

16. The system of Claim 15, wherein the one or more indicators of the at least one of one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters are indexed to corresponding portions of the one or more sensor streams by corresponding time stamps.

17. The system of Claim 15, wherein the indication of the given process of the representative data set and one or more corresponding portions of the one or more sensor streams are output in a graphical user interface to a worker.

18. The system of Claim 15, wherein the result of the comparison is output in a graphical user interface to a worker.

19. The system of Claim 15, wherein the indication of the given process of the representative data set and one or more corresponding portions of the one or more sensor streams are output on an augmented reality display.

20. The system of Claim 15, wherein the result of the comparison is output on an augmented reality display.

21. A machine learning based ergonomics method comprising:

accessing information associated with a first actor including sensed activity information associated with an activity space;

analyzing the activity information, including analyzing activity for the first actor with respect to one or more ergonomic factors; and

forwarding feedback on the results of the analysis.

22. The method of Claim 21, wherein the results include identification of ergonomically problematic activities.

23. The method of Claim 21, wherein the information is from sensors monitoring a work space in real time, wherein the information is accessed and analyzed in real time, and the feedback is forwarded in real time.

24. The method of Claim 22, wherein the analyzing comprises:

comparing information associated with activity of the first actor within the activity space with identified representative actions; and

identifying a deviation between the activity of the first actor and the representative standard.

25. The method of Claim 21, further comprising:

accessing information associated with a second actor including sensed activity information associated with an activity space;

analyzing the activity information, including analyzing activity for the first actor and the second actor with respect to the one or more ergonomic factors; and

forwarding feedback on the results of the analysis, wherein the results include an identification of a selection between the first actor and the second actor.

26. The method of Claim 25, wherein the analyzing comprises:

determining if a deviation from a representative standard associated with a respective one of the plurality of actors is within an acceptable threshold; and

identifying the respective one of the plurality of other actors as a potential acceptable candidate to be the replacement actor when the deviation associated with a respective one of the plurality of actors is within an acceptable threshold.

27. One or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform a machine learning based ergonomics method comprising:

accessing one or more data sets including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters in one or more sensor streams of a subject;

accessing one or more ergonomic factors including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters;

statistically analyzing the one or more data sets based on the one or more ergonomic factors to determine an ergonomic data set; and

adjusting at least one of the one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters of the subject based on the ergonomic data set.

28. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 27, further comprising:

storing the ergonomic data set indexed to corresponding portions of the one or more sensor streams; and

storing the corresponding portions of the one or more sensor streams.

29. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 28, wherein the ergonomic data set and the corresponding portions of the one or more sensor streams are blockchained.

30. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 27, further comprising:

selecting one of a plurality of actors based on the ergonomic data set.

31. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 27, wherein:

the statistical analysis is performed in real time.

32. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 27, wherein the one or more indicators of at least one of the one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters in the one or more data sets include one or more locations of one or more portions of an actor in a workspace.

33. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 32, wherein:

the one or more ergonomic factors include hazard scores for a plurality of zones of the workspace, wherein at least two zones of the workspace have different hazard scores.

34. The one or more non-transitory computing device-readable storage mediums storing instructions executable by one or more computing devices to perform the machine learning based ergonomics method according to Claim 27, wherein the subject comprises an article of manufacture, a health care service, a warehousing, a shipping, a restaurant transaction or a retailing transaction.

35. A system comprising:

one or more data storage units; and

one or more engines configured to:

access one or more data sets including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters in one or more sensor streams of a subject;

access one or more ergonomic factors including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters;

statistically analyze the one or more data sets based on the one or more ergonomic factors to determine an ergonomic data set; and store the ergonomic data set indexed to corresponding portions of the one or more sensor streams in one or more data structures on the one or more data storage units.

36. The system of Claim 35, wherein the one or more engines are further configured to: store the corresponding portions of the one or more sensor streams in the one or more data structures on the one or more data storage units.

37. The system of Claim 36, wherein the one or more engines are further configured to: blockchain the ergonomic data set and the corresponding portions of the one or more sensor streams.

38. The system of Claim 35, wherein:

the one or more data sets include one or more data sets for a plurality of actors; and

the one or more engines are further configured to:

select one of the plurality of actors based on the ergonomic data set.

39. The system of Claim 35, wherein the ergonomic data set includes one or more of a reach study, a motion study, a repetitive motion study, and a dynamics study.

40. A computer implemented method of automatically determining a work task assignment for an actor based on captured actions of said actor, the method comprising:

receiving a sensor stream at a computing device, the sensor stream comprising sensor information obtained from a sensor operable to sense progress of a work task;

using the computing device executing an engine, identifying a plurality of actions recorded within the sensor stream that are performed by the actor;

using the computing device to store, in a memory resident data structure of the computing device, the received sensor stream and identities of the plurality of actions recorded therein, wherein a respective identity of each of the plurality of actions is mapped to the sensor stream;

using the computing device and the engine, characterizing each of the identified plurality of actions performed by the actor to produce determined characterizations thereof; and based on the determined characterizations of the actor performing said plurality of actions, automatically determining the work task assignment for the actor.

41. The method of Claim 40, wherein the determined characterizations comprise ergonomics of the actor used to perform each of the identified plurality of actions.

42. The method of Claim 40, wherein the determined characterizations comprise a skill level of the actor used to perform each of the identified plurality of actions.

43. The method of Claim 40, wherein the determined characterizations comprise a time required for the actor to perform each of the identified plurality of actions.

44. The method of Claim 40, further comprising:

based on the determined characterizations of the actor performing said plurality of actions, automatically determining a certification expertise indicating that the actor is certified to a standard.

45. The method of Claim 40, wherein the sensor stream comprises video frames.

46. The method of Claim 40, wherein the sensor stream comprises thermal sensor data.

47. The method of Claim 40, wherein the sensor stream comprises force sensor data.

48. The method of Claim 40, wherein the sensor stream comprises audio sensor data.

49. The method of Claim 40, wherein the sensor stream comprises light sensor data.

50. A computer implemented method of determining a work task assignment for an actor within an automated production line, the method comprising:

receiving a sensor stream at a computing device, the sensor stream comprising sensor information obtained from a sensor operable to sense progress of a work task performed by a plurality of actors;

receiving with the computing device an identity of each of the plurality of actors identified within the sensor stream;

using the computing device and an engine to identify an action within the sensor stream that is performed by each of the plurality of actors performing the work task;

using the computing device to store, in a data structure, the received sensor stream, an identity of each action, and an identity of each of the plurality of actors;

using the computing device to map respective actions performed by each of the plurality of actors to the sensor stream;

using the computing device and the engine to characterize the respective actions performed by each of the plurality of actors to produce determined characterizations thereof; and based on the determined characterizations of the plurality of actors performing the action, automatically determining the work task assignment which assigns an actor of said plurality of actors to perform said action.

51. The method of Claim 50, wherein the determined characterizations comprise ergonomics of each of the plurality of actors used to perform the action.

52. The method of Claim 50, wherein the determined characterizations comprise a skill level of each of the plurality of actors used to perform the action.

53. The method of Claim 50, wherein the determined characterizations comprise a time required for each of the plurality of actors to perform the action.

54. The method of Claim 50, further comprising:

using the determined characterizations to determine when each of the plurality of actors is certified to a standard.

55. The method of Claim 50, wherein the sensor stream comprises one of: video frames, thermal sensor data, force sensor data, audio sensor data, and light sensor data.

56. A system comprising:

a processor coupled to a bus;

a sensor, in communication with said bus, and operable to sense progress of a work task; and

a memory coupled to said bus and comprising instructions that when executed cause the system to implement a method of automatically determining a work task assignment for an actor, the method comprising:

receiving a sensor stream comprising sensor information obtained from the sensor;

the processor executing an engine to identify a plurality of actions within the sensor stream that are performed by the actor;

storing, in a memory resident data structure of the memory, the received sensor stream and identities of the plurality of actions, wherein respective identities of each of the plurality of actions are mapped to the sensor stream;

using the engine to characterize each of the identified plurality of actions performed by the actor and to produce determined characterizations thereof; and

based on the determined characterizations of the actor performing said plurality of actions, automatically determining the work task assignment for the actor.

57. The system of Claim 56, wherein the determined characterizations comprise ergonomics of the actor used to perform each of the identified plurality of actions.

58. The system of Claim 56, wherein the determined characterizations comprise a skill level of the actor used to perform each of the identified plurality of actions.

59. The system of Claim 56, wherein the determined characterizations comprise a time required for the actor to perform each of the identified plurality of actions.

60. The system of Claim 56, wherein the method further comprises:

based on the determined characterizations of the actor performing said plurality of actions, automatically determining a certification expertise indicating that the actor is certified to a standard.

61. The system of Claim 56, wherein the sensor stream comprises one of: video frames, thermal sensor data, force sensor data, audio sensor data, and light sensor data.

Description:
Contextual Training Systems and Methods

BACKGROUND OF THE INVENTION

[0001] As the world's population continues to grow, the demand for goods and services continues to increase. Industries grow in lockstep with the increased demand and often require an ever-expanding network of enterprises employing various processes to accommodate the growing demand for goods and services. For example, an increased demand for automobiles can increase the need for robust assembly lines, capable of completing a larger number of processes in each station on the assembly line while minimizing anomalies and reducing completion times associated with each process. Typically, process anomalies are the result of an operator deviating from or incorrectly performing one or more actions. In addition, variances in the completion times of a process can be attributed to inadequate designs that result in an operator being challenged to execute the required actions in the required time. Quite often, if the number of actions per station increases either due to an increase in the complexity of the actions or a decrease in the time available in each station, the cognitive load on the operator increases, resulting in higher deviation rates.

[0002] Common quality improvement and process optimization methodologies, for use by manufacturing organizations, include Toyota's Toyota Production System and Motorola's Six-Sigma. Optimization methodologies such as Lean Manufacturing and Six-Sigma rely on manual techniques to gather data on human activity. The data gathered using such manual techniques typically represent a small and incomplete data set. Worse, manual techniques can generate fundamentally biased data sets, since the persons being measured may be "performing" for the observer and not providing truly representative samples of their work, which is commonly referred to as the Hawthorne and Heisenberg effects. Such manual techniques can also be subject to substantial delays between the collection and analysis of the data.

[0003] There is currently a growth in the use of Industrial Internet of Things (IIoT) devices in manufacturing and other contexts. However, machines currently only perform a small portion of tasks in manufacturing. Therefore, instrumenting machines used in manufacturing with electronics, software, sensors, actuators and connectivity to collect, exchange and utilize data is centered on a small portion of manufacturing tasks, which the Boston Consulting Group estimated in 2016 to be about 10% of the tasks or actions that manufacturers use to build products. Accordingly, IIoT devices also provide an incomplete data set.

[0004] Accordingly, there is a continuing need for systems and methods for collecting information about manufacturing, health care services, shipping, retailing and other similar contexts and providing analytic tools for improving the performance in such contexts. Among other uses, the information could, for example, be utilized to improve the quality of products or services being delivered, to train employees, to communicate with customers, and to handle warranty claims and recalls.

SUMMARY OF THE INVENTION

[0005] The present technology may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the present technology directed toward real-time anomaly detection.

[0006] In aspects, an action recognition and analytics system can be utilized to determine cycles, processes, actions, sequences, objects and/or the like in one or more sensor streams. The sensor streams can include, but are not limited to, one or more frames of video sensor data, thermal sensor data, infrared sensor data, and/or three-dimensional depth sensor data. The action recognition and analytics system can be applied to any number of contexts, including but not limited to manufacturing, health care services, shipping, warehousing and retailing. The sensor streams, and the determined cycles, processes, actions, sequences, objects, parameters and/or the like, can be stored in a data structure. The determined cycles, processes, actions, sequences, objects and/or the like can be indexed to corresponding portions of the sensor streams. The action recognition and analytics system can provide for process validation, anomaly detection and in-process quality assurance in real-time.

[0007] In one embodiment, a contextual training method can include accessing a representative data set including one or more deep learning determined indicators of at least one of one or more processes, one or more actions, one or more sequences, one or more objects and/or one or more parameters. The indicators of the one or more processes, actions, sequences, objects and parameters of the representative data set can be indexed to corresponding portions of one or more sensor streams. The one or more indexed sensor streams can also be accessed. An indication of a given process from the representative data set and one or more corresponding portions of the one or more sensor streams can be output as contextual training content. A current data set including one or more deep learning determined identifiers of at least one of one or more processes, actions, sequences, objects, parameters and/or the like for a current portion of one or more sensor streams can be received in real time. A current process in the current data set can be compared to the given process in the representative data set. The result of the comparison of the current process in the current data set to the given process in the representative data set can be output as an additional portion of the contextual training content.
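For illustration only, the following Python sketch, using hypothetical names that do not appear in the specification, shows one simplified way the comparison step described above could be realized: a representative ("golden") process is compared against the deep learning determined action indicators for the current cycle, and missing or out-of-order actions are reported as training feedback.

from dataclasses import dataclass

@dataclass
class ProcessRecord:
    name: str
    actions: list[str]   # deep learning determined action indicators, in order

def compare_process(representative: ProcessRecord, current: ProcessRecord) -> dict:
    """Compare the current process against the representative (golden) process."""
    missing = [a for a in representative.actions if a not in current.actions]
    performed = [a for a in current.actions if a in representative.actions]
    expected = [a for a in representative.actions if a in current.actions]
    return {
        "process": representative.name,
        "missing_actions": missing,             # action deviations
        "out_of_order": performed != expected,  # sequence deviations
        "completed": not missing and performed == expected,
    }

# Example: the trainee skipped the hypothetical "fasten" action.
golden = ProcessRecord("install bracket", ["pick", "place", "fasten"])
observed = ProcessRecord("install bracket", ["pick", "place"])
print(compare_process(golden, observed))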

[0008] In another embodiment, an action recognition and analytics system can include a plurality of sensors disposed at one or more stations, one or more data storage units and one or more engines. The one or more engines can be configured to receive sensor streams from the plurality of sensors and determine one or more indicators of one or more cycles of one or more processes including one or more actions arranged in one or more sequences and performed on one or more objects, and one or more parameters thereof in the sensor streams. The one or more engines can be configured to access a representative data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters indexed to previous portions of one or more sensor streams. The one or more engines can be configured to then output an indication of a given process of the representative data set and one or more corresponding portions of the one or more sensor streams. The one or more engines can be configured to receive in real time a current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters for a current portion of one or more sensor streams. The one or more engines can be configured to then compare a current process in the current data set to the given process in the representative data set and output a result.
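For illustration only, the following Python sketch, with illustrative names, shows one way such an engine's comparison could be realized as a finite state machine whose state transition map is generated from the representative data set, in the general manner recited in claim 11; it is a simplification under those assumptions, not the claimed implementation.

def build_transition_map(representative_actions):
    """One state per completed action; each state accepts only the next representative action."""
    return {i: {action: i + 1} for i, action in enumerate(representative_actions)}

def run_fsm(transition_map, observed_actions):
    """Feed recognized action indicators to the FSM; return (conforms, final_state)."""
    state = 0
    for action in observed_actions:
        allowed = transition_map.get(state, {})
        if action not in allowed:
            return False, state   # deviation from the representative process
        state = allowed[action]
    return state == len(transition_map), state

# Example: the observed cycle conforms to the representative sequence of actions.
conforms, _ = run_fsm(build_transition_map(["pick", "place", "fasten"]),
                      ["pick", "place", "fasten"])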

[0009] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] Embodiments of the present technology are illustrated by way of example and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:

FIG. 1 shows an action recognition and analytics system, in accordance with aspects of the present technology.

FIG. 2 shows an exemplary deep learning type machine learning back-end unit, in accordance with aspects of the present technology.

FIG. 3 shows an exemplary Convolution Neural Network (CNN) and Long Short Term Memory (LSTM) Recurrent Neural Network (RNN), in accordance with aspects of the present technology.

FIG. 4 shows an exemplary method of detecting actions in a sensor stream, in accordance with aspects of the present technology.

FIG. 5 shows an action recognition and analytics system, in accordance with aspects of the present technology.

FIG. 6 shows an exemplary method of detecting actions, in accordance with aspects of the present technology.

FIG. 7 shows an action recognition and analytics system, in accordance with aspects of the present technology.

FIG. 8 shows an exemplary station, in accordance with aspects of the present technology.

FIG. 9 shows an exemplary station, in accordance with aspects of the present technology.

FIG. 10 shows an exemplary station activity analysis method, in accordance with one embodiment.

FIGS. 11A, 11B and 11C show a contextual training method, in accordance with aspects of the present technology.

FIG. 12 shows an exemplary presentation of contextual training content, in accordance with aspects of the present technology.

FIGS. 13A and 13B show exemplary presentations of contextual training content, in accordance with aspects of the present technology.

FIG. 14 shows an exemplary worker profile, in accordance with aspects of the present technology.

FIG. 15 shows a machine learning based ergonomics method, in accordance with aspects of the present technology.

FIG. 16 shows a machine learning based ergonomics method, in accordance with aspects of the present technology.

FIG. 17 shows an exemplary ergonomics reach diagram, in accordance with aspects of the present technology.

FIG. 18 shows an exemplary ergonomics reach diagram, in accordance with aspects of the present technology.

FIG. 19 shows an exemplary ergonomics reach diagram, in accordance with aspects of the present technology.

FIG. 20 shows an exemplary ergonomics data table, in accordance with aspects of the present technology.

FIG. 21 shows a machine learning based ergonomics method, in accordance with aspects of the present technology.

FIG. 22 shows an exemplary work space and hazard zone, in accordance with aspects of the present technology.

FIG. 23 shows an exemplary computer system for automatically observing and analyzing actions of an actor based on data previously captured by one or more sensors in accordance with various embodiments of the present disclosure.

FIG. 24 shows a flow chart depicting an exemplary sequence of computer implemented steps for automatically observing and analyzing actor activity in real-time in accordance with various embodiments of the present disclosure.

FIG. 25 shows a block diagram and data flow diagram of an exemplary computer system that automatically assigns processes or actions to actors in real-time based on observed data in accordance with various embodiments of the present disclosure.

FIG. 26 shows a flow chart depicting an exemplary sequence of computer implemented steps for automatically observing actor activity and assigning processes or actions to actors in real-time based on observed data in accordance with various embodiments of the present disclosure.

FIG. 27 shows an exemplary job assignment input user interface according to embodiments of the present invention.

FIG. 28 shows an exemplary job assignment output according to embodiments of the present invention.

FIG. 29 shows an exemplary worker profile and certificates according to embodiments of the present invention.

FIG. 30 shows an exemplary computing device, in accordance with aspects of the present technology.

DETAILED DESCRIPTION OF THE INVENTION

[0011] Reference will now be made in detail to the embodiments of the present technology, examples of which are illustrated in the accompanying drawings. While the present technology will be described in conjunction with these embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present technology, numerous specific details are set forth in order to provide a thorough understanding of the present technology. However, it is understood that the present technology may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the present technology.

[0012] Some embodiments of the present technology which follow are presented in terms of routines, modules, logic blocks, and other symbolic representations of operations on data within one or more electronic devices. The descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. A routine, module, logic block and/or the like, is herein, and generally, conceived to be a self-consistent sequence of processes or instructions leading to a desired result. The processes are those including physical manipulations of physical quantities. Usually, though not necessarily, these physical manipulations take the form of electric or magnetic signals capable of being stored, transferred, compared and otherwise manipulated in an electronic device. For reasons of convenience, and with reference to common usage, these signals are referred to as data, bits, values, elements, symbols, characters, terms, numbers, strings, and/or the like with reference to embodiments of the present technology.

[0013] It should be borne in mind, however, that all of these terms are to be interpreted as referencing physical manipulations and quantities and are merely convenient labels and are to be interpreted further in view of terms commonly used in the art. Unless specifically stated otherwise as apparent from the following discussion, it is understood that through discussions of the present technology, discussions utilizing the terms such as "receiving," and/or the like, refer to the actions and processes of an electronic device such as an electronic computing device that manipulates and transforms data. The data is represented as physical (e.g., electronic) quantities within the electronic device's logic circuits, registers, memories and/or the like, and is transformed into other data similarly represented as physical quantities within the electronic device.

[0014] As used herein, the use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality. In particular, a reference to "the" object or "a" object is intended to denote also one of a possible plurality of such objects. It is also to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.

[0015] As used herein the term process can include processes, procedures, transactions, routines, practices, and the like. As used herein the term sequence can include sequences, orders, arrangements, and the like. As used herein the term action can include actions, steps, tasks, activity, motion, movement, and the like. As used herein the term object can include objects, parts, components, items, elements, pieces, assemblies, sub-assemblies, and the like. As used herein a process can include a set of actions or one or more subsets of actions, arranged in one or more sequences, and performed on one or more objects by one or more actors. As used herein a cycle can include a set of processes or one or more subsets of processes performed in one or more sequences. As used herein a sensor stream can include a video sensor stream, thermal sensor stream, infrared sensor stream, hyperspectral sensor stream, audio sensor stream, depth data stream, and the like. As used herein a frame based sensor stream can include any sensor stream that can be represented by a two or more dimensional array of data values. As used herein the term parameter can include parameters, attributes, or the like. As used herein the term indicator can include indicators, identifiers, labels, tags, states, attributes, values or the like. As used herein the term feedback can include feedback, commands, directions, alerts, alarms, instructions, orders, and the like. As used herein the term actor can include actors, workers, employees, operators, assemblers, contractors, associates, managers, users, entities, humans, cobots, robots, and the like as well as combinations of them. As used herein the term robot can include a machine, device, apparatus or the like, especially one programmable by a computer, capable of carrying out a series of actions automatically. The actions can be autonomous, semi-autonomous, assisted, or the like. As used herein the term cobot can include a robot intended to interact with humans in a shared workspace. As used herein the term package can include packages, packets, bundles, boxes, containers, cases, cartons, kits, and the like. As used herein, real time can include responses within a given latency, which can vary from sub-second to seconds.

[0016] Referring to FIG. 1, an action recognition and analytics system, in accordance with aspects of the present technology, is shown. The action recognition and analytics system 100 can be deployed in a manufacturing, health care, warehousing, shipping, retail, restaurant or similar context. A manufacturing context, for example, can include one or more stations 105-115 and one or more actors 120-130 disposed at the one or more stations. The actors can include humans, machines or any combination thereof. For example, individual or multiple workers can be deployed at one or more stations along a manufacturing assembly line. One or more robots can be deployed at other stations. A combination of one or more workers and/or one or more robots can be deployed at additional stations. It is to be noted that the one or more stations 105-115 and the one or more actors are not generally considered to be included in the system 100.

[0017] In a health care implementation, an operating room can comprise a single station implementation. A plurality of sensors, such as video cameras, thermal imaging sensors, depth sensors, or the like, can be disposed non-intrusively at various positions around the operating room. One or more additional sensors, such as audio, temperature, acceleration, torque, compression, tension, or the like sensors, can also be disposed non-intrusively at various positions around the operating room.

[0018] In a shipping implementation, the plurality of stations may represent different loading docks, conveyor belts, forklifts, sorting stations, holding areas, and the like. A plurality of sensors, such as video cameras, thermal imaging sensors, depth sensors, or the like, can be disposed non-intrusively at various positions around the loading docks, conveyor belts, forklifts, sorting stations, holding areas, and the like. One or more additional sensors, such as audio, temperature, acceleration, torque, compression, tension, or the like sensors, can also be disposed non-intrusively at various positions.

[0019] In a retailing implementation, the plurality of stations may represent one or more loading docks, one or more stock rooms, the store shelves, the point of sale (e.g. cashier stands, self-checkout stands and auto-payment geofence), and the like. A plurality of sensors such as video cameras, thermal imaging sensors, depth sensors, or the like, can be disposed non-intrusively at various positions around the loading docks, stock rooms, store shelves, point of sale stands and the like. One or more additional sensors, such as audio, acceleration, torque, compression, tension, or the like sensors, can also be disposed non-intrusively at various positions around the loading docks, stock rooms, store shelves, point of sale stands and the like.

[0020] In a warehousing or online retailing implementation, the plurality of stations may represent receiving areas, inventory storage, picking totes, conveyors, packing areas, shipping areas, and the like. A plurality of sensors, such as video cameras, thermal imaging sensors, depth sensors, or the like, can be disposed non-intrusively at various positions around the receiving areas, inventory storage, picking totes, conveyors, packing areas, and shipping areas. One or more additional sensors, such as audio, temperature, acceleration, torque, compression, tension, or the like sensors, can also be disposed non-intrusively at various positions.

[0021] Aspects of the present technology will be herein further described with reference to a manufacturing context so as to best explain the principles of the present technology without obscuring aspects of the present technology. However, the present technology as further described below can also be readily applied in health care, warehousing, shipping, retail, restaurants, and numerous other similar contexts.

[0022] The action recognition and analytics system 100 can include one or more interfaces 135-165. The one or more interfaces 135-145 can include one or more sensors 135-145 disposed at the one or more stations 105-115 and configured to capture streams of data concerning cycles, processes, actions, sequences, objects, parameters and/or the like by the one or more actors 120-130 and/or at the stations 105-115. The one or more sensors 135-145 can be disposed non-intrusively, so that minimal to no changes to the layout of the assembly line or the plant are required, at various positions around one or more of the stations 105-115. The same set of one or more sensors 135-145 can be disposed at each station 105-115, or different sets of one or more sensors 135-145 can be disposed at different stations 105-115. The sensors 135-145 can include one or more sensors such as video cameras, thermal imaging sensors, depth sensors, or the like. The one or more sensors 135-145 can also include one or more other sensors, such as audio, temperature, acceleration, torque, compression, tension, or the like sensors.

[0023] The one or more interfaces 135-165 can also include, but are not limited to, one or more displays, touch screens, touch pads, keyboards, pointing devices, buttons, switches, control panels, actuators, indicator lights, speakers, Augmented Reality (AR) interfaces, Virtual Reality (VR) interfaces, desktop Personal Computers (PCs), laptop PCs, tablet PCs, smart phones, robot interfaces and cobot interfaces. The one or more interfaces 135-165 can be configured to receive inputs from one or more actors 120-130, one or more engines 170 or other entities. Similarly, the one or more interfaces 135-165 can be configured to output to one or more actors 120-130, one or more engines 170 or other entities. For example, the one or more front-end units 190 can output one or more graphical user interfaces to present training content, work charts, real time alerts, feedback and/or the like on one or more interfaces 165, such as displays at one or more stations 105-115, management portals on tablet PCs, administrator portals on desktop PCs or the like. In another example, the one or more front-end units 190 can control an actuator to push a defective unit off the assembly line when a defect is detected. The one or more front-end units can also receive responses on a touch screen display device, keyboard, one or more buttons, microphone or the like from one or more actors. Accordingly, the interfaces 135-165 can implement an analysis interface, mentoring interface and/or the like of the one or more front-end units 190.

[0024] The action recognition and analytics system 100 can also include one or more engines 170 and one or more data storage units 175. The one or more interfaces 135-165, the one or more data storage units 175, the one or more machine learning back-end units 180, the one or more analytics units 185, and the one or more front-end units 190 can be coupled together by one or more networks 192. It is also to be noted that although the above described elements are described as separate elements, one or more elements of the action recognition and analytics system 100 can be combined together or further broken into different elements.

[0025] The one or more engines 170 can include one or more machine learning back-end units 180, one or more analytics units 185, and one or more front-end units 190. The one or more data storage units 175, the one or more machine learning back-end units 180, the one or more analytics units 185, and the one or more analytics front-end units 190 can be implemented on a single computing device, a common set of computing devices, separate computing devices, or different sets of computing devices that can be distributed across the globe inside and outside an enterprise. Aspects of the one or more machine learning back-end units 180, the one or more analytics units 185 and the one or more front-end units 190, and/or other computing units of the action recognition and analytics system 100 can be implemented by one or more central processing units (CPU), one or more graphics processing units (GPU), one or more tensor processing units (TPU), one or more digital signal processors (DSP), one or more microcontrollers, one or more field programmable gate arrays and/or the like, and any combination thereof. In addition, the one or more data storage units 175, the one or more machine learning back-end units 180, the one or more analytics units 185, and the one or more front-end units 190 can be implemented locally to the one or more stations 105-115, remotely from the one or more stations 105-115, or any combination of locally and remotely. In one example, the one or more data storage units 175, the one or more machine learning back-end units 180, the one or more analytics units 185, and the one or more front-end units 190 can be implemented on a server local (e.g., on site at the manufacturer) to the one or more stations 105-115. In another example, the one or more machine learning back-end units 135, the one or more storage units 140 and analytics front-end units 145 can be implemented on a cloud computing service remote from the one or more stations 105-115. In yet another example, the one or more data storage units 175 and the one or more machine learning back-end units 180 can be implemented remotely on a server of a vendor, and one or more data storage units 175 and the one or more front-end units 190 are implemented locally on a server or computer of the manufacturer. In other examples, the one or more sensors 135-145, the one or more machine learning back-end units 180, the one or more front-end units 190, and other computing units of the action recognition and analytics system 100 can perform processing at the edge of the network 192 in an edge computing implementation. The above examples of the deployment of one or more computing devices to implement the one or more interfaces 135-165, the one or more engines 170, the one or more data storage units 140 and one or more analytics front-end units 145 are just some of the many different configurations for implementing the one or more machine learning back-end units 135, one or more data storage units 140. Any number of computing devices, deployed locally, remotely, at the edge or the like can be utilized for implementing the one or more machine learning back-end units 135, the one or more data storage units 140, the one or more analytics front-end units 145 or other computing units.

[0026] The action recognition and analytics system 100 can also optionally include one or more data compression units associated with one or more of the interfaces 135-165. The data compression units can be configured to compress or decompress data transmitted between the one or more interfaces 135-165 and the one or more engines 170. Data compression, for example, can advantageously allow the sensor data from the one or more interfaces 135-165 to be transmitted across one or more existing networks 192 of a manufacturer. The data compression units can also be integral to one or more interfaces 135-165 or implemented separately. For example, video capture sensors may include an integral Moving Picture Experts Group (MPEG) compression unit (e.g., an H.264 encoder/decoder). In an exemplary implementation, the one or more data compression units can use differential coding and arithmetic encoding to obtain a 20X reduction in the size of depth data from depth sensors. The data from a video capture sensor can comprise roughly 30 GB of H.264 compressed data per camera, per day for a factory operation with three eight-hour shifts. The depth data can comprise roughly another 400 GB of uncompressed data per sensor, per day. The depth data can be compressed by an algorithm to approximately 20 GB per sensor, per day. Together, a set of a video sensor and a depth sensor can generate approximately 50 GB of compressed data per day. The compression can allow the action recognition and analytics system 100 to use a factory's network 192 to move and store data locally or remotely (e.g., cloud storage).
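The storage figures above can be restated as a short calculation; the values below are taken directly from the text, and the 20X depth-compression ratio is as stated there, not derived here.

H264_VIDEO_GB_PER_CAMERA_DAY = 30    # three eight-hour shifts, H.264 compressed
RAW_DEPTH_GB_PER_SENSOR_DAY = 400    # uncompressed depth data
DEPTH_COMPRESSION_RATIO = 20         # differential coding + arithmetic encoding

compressed_depth_gb = RAW_DEPTH_GB_PER_SENSOR_DAY / DEPTH_COMPRESSION_RATIO     # 20 GB
total_gb_per_pair_per_day = H264_VIDEO_GB_PER_CAMERA_DAY + compressed_depth_gb  # 50 GB
print(f"~{total_gb_per_pair_per_day:.0f} GB per video/depth sensor pair per day")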

[0027] The action recognition and analytics system 100 can also be communicatively coupled to additional data sources 194, such as, but not limited to, a Manufacturing Execution System (MES), a warehouse management system, or a patient management system. The action recognition and analytics system 100 can receive additional data, including one or more additional sensor streams, from the additional data sources 194. The action recognition and analytics system 100 can also output data, sensor streams, analytics results and or the like to the additional data sources 194. For example, the action recognition can identify a barcode on an object and provide the barcode input to an MES for tracking.

[0028] The action recognition and analytics system 100 can continually measure aspects of the real world, making it possible to describe a context utilizing vastly more detailed data sets, and to solve important business problems such as line balancing, ergonomics, and or the like. The data can also reflect variations over time. The one or more machine learning back-end units 180 can be configured to recognize, in real time, one or more cycles, processes, actions, sequences, objects, parameters and the like in the sensor streams received from the plurality of sensors 135-145. The one or more machine learning back-end units 180 can recognize cycles, processes, actions, sequences, objects, parameters and the like in sensor streams utilizing deep learning, decision tree learning, inductive logic programming, clustering, reinforcement learning, Bayesian networks, and or the like.

[0029] Referring now to FIG. 2, an exemplary deep learning type machine learning back-end unit, in accordance with aspects of the present technology, is shown. The deep learning unit 200 can be configured to recognize, in real time, one or more cycles, processes, actions, sequences, objects, parameters and the like in the sensor streams received from the plurality of sensors 120-130. The deep learning unit 200 can include a dense optical flow computation unit 210, Convolution Neural Networks (CNNs) 220, a Long Short Term Memory (LSTM) Recurrent Neural Network (RNN) 230, and a Finite State Automata (FSA) 240. The CNNs 220 can be based on two-dimensional (2D) or three-dimensional (3D) convolutions. The dense optical flow computation unit 210 can be configured to receive a stream of frame-based sensor data 250 from the sensors 120-130. The dense optical flow computation unit 210 can be configured to estimate an optical flow, which is a two-dimensional (2D) vector field where each vector is a displacement vector showing the movement of points from a first frame to a second frame. The CNNs 220 can receive the stream of frame-based sensor data 250 and the optical flow estimated by the dense optical flow computation unit 210. The CNNs 220 can be applied to video frames to create a digest of the frames. The digest of the frames can also be referred to as the embedding vector. The digest retains those aspects of the frame that help in identifying actions, such as the core visual cues that are common to instances of the action in question.

[0030] In a three-dimensional Convolution Neural Network (3D CNN) based approach, spatio-temporal convolutions can be performed to digest multiple video frames together to recognize actions. For a 3D CNN, the first two dimensions can be along space, and in particular the width and height of each video frame. The third dimension can be along time. The neural network can learn to recognize actions not just from the spatial pattern in individual frames, but also jointly in space and time. The neural network is not just using color patterns in one frame to recognize actions. Instead, the neural network is using how the pattern shifts with time (i.e., motion cues) to come up with its classification. Accordingly, the 3D CNN is attention driven, in that it proceeds by identifying 3D spatio-temporal bounding boxes as Regions of Interest (Rol) and focuses on them to classify actions.
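
The specification does not give the 3D CNN architecture itself; the following is a minimal PyTorch sketch, assuming a short clip of RGB frames, showing how a spatio-temporal (3D) convolution digests width, height and time jointly. Layer sizes and the number of action classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinySpatioTemporalCNN(nn.Module):
    """Illustrative 3D CNN: convolves jointly over (time, height, width)."""
    def __init__(self, num_actions=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1),  # space-time convolution
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),                 # pool only spatially
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                             # global space-time digest
        )
        self.classifier = nn.Linear(32, num_actions)

    def forward(self, clip):
        # clip: (batch, channels=3, frames, height, width)
        digest = self.features(clip).flatten(1)
        return self.classifier(digest)

# A 16-frame RGB clip at 112x112 resolution.
clip = torch.randn(1, 3, 16, 112, 112)
logits = TinySpatioTemporalCNN()(clip)
print(logits.shape)  # torch.Size([1, 10])
```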

[0031] In one implementation, the input to the deep learning unit 200 can include multiple data streams. In one instance, a video sensor signal, which includes red, green and blue data streams, can comprise three channels. Depth image data can comprise another channel. Additional channels can accrue from temperature, sound, vibration, data from sensors (e.g., torque from a screwdriver) and the like. From the RGB and depth streams, dense optical flow fields can be computed by the dense optical flow computation unit 210 and fed to the Convolution Neural Networks (CNNs) 220. The RGB and depth streams can also be fed to the CNNs 220 as additional streams of derived data.

[0032] The Long Short Term Memory (LSTM) Recurrent Neural Network (RNN) 230 can be fed the digests from the output of the Convolution Neural Networks (CNNs) 220. The LSTM can essentially be a sequence identifier that is trained to recognize temporal sequences of sub-events that constitute an action. The combination of the CNNs and LSTM can be jointly trained, with full back-propagation, to recognize low-level actions. The low-level actions can be referred to as atomic actions, like picking a screw, picking a screwdriver, attaching the screw to the screwdriver and the like. The Finite State Automata (FSA) 240 can be mathematical models of computation that include a set of states and a set of rules that govern the transitions between the states based on the provided input. The FSA 240 can be configured to recognize higher-level actions 260 from the atomic actions. The higher-level actions 260 can be referred to as molecular actions, for example turning a screw to affix a hard drive to a computer chassis. The CNNs and LSTM can be configured to perform supervised training on the data from the multiple sensor streams. In one implementation, approximately 12 hours of data, collected over the course of several days, can be utilized to train the CNNs and LSTM combination.
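
The FSA 240 is described only at a high level; the sketch below is a simplified assumption, not the claimed implementation, showing how a small finite state automaton could aggregate a stream of recognized atomic actions (e.g., from the LSTM output) into a molecular action such as affixing a hard drive with a screw. The state names and transition table are hypothetical.

```python
# Hypothetical FSA: recognizes the molecular action "affix hard drive"
# from a sequence of atomic action labels emitted by the LSTM.
TRANSITIONS = {
    ("start", "pick_screw"): "have_screw",
    ("have_screw", "pick_screwdriver"): "have_tool",
    ("have_tool", "attach_screw_to_screwdriver"): "ready",
    ("ready", "turn_screw"): "affixed",          # accepting state
}

def recognize_molecular_action(atomic_actions):
    """Run the FSA over atomic actions; return True if the molecular action completes."""
    state = "start"
    for action in atomic_actions:
        state = TRANSITIONS.get((state, action), state)  # ignore irrelevant atomics
    return state == "affixed"

stream = ["pick_screw", "pick_screwdriver", "attach_screw_to_screwdriver", "turn_screw"]
print(recognize_molecular_action(stream))  # True
```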

[0033] Referring now to FIG. 3, an exemplary Convolution Neural Networks (CNNs) and Long Short Term Memory (LSTM) Recurrent Neural Network (RNN), in accordance with aspects of the present technology, is shown. The CNNs can include a frame feature extractor 310, a first Fully Connected (FC) layer 320, a Region of Interest (Rol) detector unit 330, a Rol pooling unit 340, and a second Fully Connected (FC) layer 350. The operation of the CNNs and LSTM will be further explained with reference to FIG. 4, which shows an exemplary method of detecting actions in a sensor stream.

[0034] The frame feature extractor 310 of the Convolution Neural Networks (CNNs) 220 can receive a stream of frame-based sensor data, at 410. At 420, the frame feature extractor 310 can perform a two-dimensional convolution operation on the received video frame and generate a two-dimensional array of feature vectors. The frame feature extractor 310 can work on the full resolution image, wherein a deep network is effectively sliding across the image generating a feature vector at each stride position. Thus, each element of the 2D feature vector array is a descriptor for the corresponding receptive field (e.g., fixed portion of the underlying image). The first Fully Connected (FC) layer can flatten the high-level features extracted by the frame feature extractor 310, and provide additional non-linearity and expressive power, enabling the machine to learn complex non-linear combinations of these features.

[0035] At 430, the Rol detector unit 330 can combine neighboring feature vectors to make a decision on whether the underlying receptive field belongs to a Region of Interest (Rol) or not. If the underlying receptive field belongs to a Rol, a Rol rectangle can be predicted from the same set of neighboring feature vectors, at 440. At 450, a Rol rectangle with the highest score can be chosen by the Rol detector unit 330. For the chosen Rol rectangle, the feature vectors lying within it can be aggregated by the Rol pooling unit 340, at 460. The aggregated feature vector is a digest/descriptor for the foreground for that video frame.

[0036] In one implementation, the Rol detector unit 330 can determine a static Rol. The static Rol identifies a Region of Interest (Rol) within an aggregate set of feature vectors describing a video frame, and generates a Rol area for the identified Rol. A Rol area within a video frame can be indicated with a Rol rectangle that encompasses an area of the video frame designated for action recognition, such as an area in which actions are performed in a process. Alternatively, the Rol area can be designated with a box, circle, highlighted screen, or any other geometric shape or indicator having various scales and aspect ratios used to encompass a Rol. The area within the Rol rectangle is the area within the video frame to be processed by the Long Short Term Memory (LSTM) for action recognition.

[0037] The Long Short Term Memory (LSTM) can be trained using a Rol rectangle that provides, both, adequate spatial context within the video frame to recognize actions and independence from irrelevant portions of the video frame in the background. The trade-off between spatial context and background independence ensures that the static Rol detector can provide clues for the action recognition while avoiding spurious unreliable signals within a given video frame.

[0038] In another implementation, the Rol detector unit 330 can determine a dynamic Rol. A Rol rectangle can encompass areas within a video frame in which an action is occurring. By focusing on areas in which action occurs, the dynamic Rol detector enables recognition of actions outside of a static Rol rectangle while relying on a smaller spatial context, or local context, than that used to recognize actions in a static Rol rectangle.

[0039] In one implementation, the Rol pooling unit 340 extracts a fixed-sized feature vector from the area within an identified Rol rectangle, and discards the remaining feature vectors of the input video frame. The fixed-sized feature vector, or foreground feature, includes the feature vectors generated by the video frame feature extractor that are located within the coordinates indicating a Rol rectangle as determined by the Rol detector unit 330. Because the Rol pooling unit 340 discards feature vectors not included within the Rol rectangle, the Convolution Neural Networks (CNNs) 220 analyze actions within the Rol only, thus ensuring that unexpected changes in the background of a video frame are not erroneously analyzed for action recognition.

[0040] In one implementation, the Convolution Neural Networks (CNNs) 220 can be an Inception ResNet. The Inception ResNet can utilize a sliding window style operation. Successive convolution layers output a feature vector at each point of a two-dimensional grid. The feature vector at location (x, y) at level l can be derived by weighted averaging of features from a small local neighborhood (also known as the receptive field) N around (x, y) at level l-1, followed by a pointwise non-linear operator. The non-linear operator can be the ReLU (max(0, x)) operator.

[0041] In the sliding window, there can be many more than 7x7 points at the output of the last convolution layer. A Fully Connected (FC) convolution can be taken over the feature vectors from the 7x7 neighborhoods, which is nothing but applying one more convolution. The corresponding output represents the Convolution Neural Networks (CNNs) output at the matching 224x224 receptive field on the input image. This is fundamentally equivalent to applying the CNNs to each sliding window stop. However, no computation is repeated, thus keeping the inferencing computation cost real time on Graphics Processing Unit (GPU) based machines.

[0042] The convolution layers can be shared between Rol detector 330 and the video frame feature extractor 310. The Rol detector unit 330 can identify the class independent rectangular region of interest from the video frame. The video frame feature extractor can digest the video frame into feature vectors. The sharing of the convolution layers improves efficiency, wherein these expensive layers can be run once per frame and the results saved and reused.

[0043] One of the outputs of the Convolution Neural Networks (CNNs) is the static rectangular Region of Interest (Rol). The term "static" as used herein denotes that the Rol does not vary greatly from frame to frame, except when a scene change occurs, and it is also independent of the output class.

[0044] A set of concentric anchor boxes can be employed at each sliding window stop. In one implementation, there can be nine anchor boxes per sliding window stop, for combinations of 3 scales and 3 aspect ratios. Therefore, at each sliding window stop there are two sets of outputs. The first set of outputs can be a Region of Interest (Rol) present/absent indication that includes 18 outputs of the form 0 or 1. An output of 0 indicates the absence of a Rol within the anchor box, and an output of 1 indicates the presence of a Rol within the anchor box. The second set of outputs can include Bounding Box (BBox) coordinates including 36 floating point outputs indicating the actual BBox for each of the 9 anchor boxes. The BBox coordinates are to be ignored if the Rol present/absent output indicates the absence of a Rol.
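
As a concrete illustration of the anchor scheme described above, the following sketch enumerates nine concentric anchor boxes (3 scales x 3 aspect ratios) at one sliding window stop; the specific scales and aspect ratios used are assumptions, not values taken from the specification.

```python
def anchor_boxes(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return nine (x, y, w, h) anchor boxes centered at (cx, cy).

    Each anchor keeps its area approximately scale**2 while the aspect ratio w/h varies.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)
            h = s / (r ** 0.5)
            boxes.append((cx - w / 2, cy - h / 2, w, h))
    return boxes

anchors = anchor_boxes(112, 112)
print(len(anchors))  # 9 anchors -> 9 present/absent outputs and 9 * 4 = 36 box coordinates
```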

[0045] For training, sets of video frames with a per-frame Region of Interest (Rol) rectangle are presented to the network. In frames without a Rol rectangle, a dummy 0x0 rectangle can be presented. The Ground Truth for individual anchor boxes can be created via the Intersection over Union (IoU) of rectangles. For the i-th anchor box $b_i = \{x_i, y_i, w_i, h_i\}$, the derived Ground Truth for the Rol presence probability $p_i^*$ can be determined by Equation 1:

$$p_i^* = \begin{cases} 1 & \text{if } IoU(b_i, g) \ge 0.7 \\ 0 & \text{if } IoU(b_i, g) \le 0.1 \\ \text{box unused for training} & \text{otherwise} \end{cases} \tag{1}$$

where $g = \{x_g, y_g, w_g, h_g\}$ is the Ground Truth Rol box for the entire frame.

[0046] The loss function can be determined by Equation 2:

$$L = \sum_i \left[ -\,p_i^* \log p_i + p_i^* \left( S(x_i - x_g) + S(y_i - y_g) + S(w_i - w_g) + S(h_i - h_g) \right) \right] \tag{2}$$

where $p_i$ is the predicted probability for the presence of a Region of Interest (Rol) in the i-th anchor box and the smooth loss function $S$ can be defined by Equation 3:

The left term in the loss function is the error in predicting the probability of the presence of a Rol, while the second term is the mismatch in the predicted Bounding Box (BBox). It should be noted that the second term vanishes when the ground truth indicates that there is no Rol in the anchor box.
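
Equation 3 is not reproduced in the text above; the sketch below assumes the commonly used smooth-L1 form for S and shows how the per-anchor loss of Equation 2 could be computed. Both the smooth-L1 definition and the numbers in the example are assumptions for illustration only.

```python
import math

def smooth_l1(x):
    """Assumed form of the smooth loss S: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def anchor_loss(p_star, p, box, gt_box, eps=1e-7):
    """Per-anchor loss per Equation 2; p_star is the ground-truth presence (0 or 1)."""
    x, y, w, h = box
    xg, yg, wg, hg = gt_box
    classification = -p_star * math.log(p + eps)        # left term: presence error
    regression = p_star * (smooth_l1(x - xg) + smooth_l1(y - yg)
                           + smooth_l1(w - wg) + smooth_l1(h - hg))
    return classification + regression                  # regression vanishes when p_star == 0

# Anchor with a Rol present (p_star = 1), predicted presence 0.9, small box mismatch.
print(anchor_loss(1, 0.9, (10, 10, 50, 50), (11, 10, 52, 50)))
```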

[0047] The static Region of Interest (Rol) is independent of the action class. In another implementation, a dynamic Region of Interest (Rol), that is class dependent, is proposed by the CNNs. This takes the form of a rectangle enclosing the part of the image where the specific action is occurring. This increases the focus of the network and takes it a step closer to local context-based action recognition.

[0048] Once a Region of Interest (Rol) has been identified, the frame feature can be extracted from within the Rol. This will yield a background independent frame digest. However, this feature vector also needs to be a fixed size so that it can be fed into the Long Short Term Memory (LSTM) 230. The fixed size can be achieved via Rol pooling. For Rol pooling, the Rol can be tiled up into 7x7 boxes. The mean of all feature vectors within a tile can then be determined. Thus, 49 feature vectors are produced, which are concatenated to form the frame digest. The second Fully Connected (FC) layer 350 can provide additional non-linearity and expressive power to the machine, creating a fixed size frame digest that can be consumed by the LSTM 230.
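
A minimal sketch of the 7x7 Rol pooling just described, assuming a NumPy array of per-position feature vectors; tile boundaries are computed with simple integer splits, which is an implementation assumption.

```python
import numpy as np

def roi_pool(feature_map, roi, grid=7):
    """Average-pool the feature vectors inside a Rol into a fixed grid x grid digest.

    feature_map: array of shape (H, W, C) of per-position feature vectors.
    roi: (row0, col0, row1, col1) in feature-map coordinates.
    Returns a flat vector of length grid * grid * C.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    pooled = []
    for row_block in np.array_split(region, grid, axis=0):
        for tile in np.array_split(row_block, grid, axis=1):
            pooled.append(tile.mean(axis=(0, 1)))   # mean feature vector of this tile
    return np.concatenate(pooled)                   # 49 tile means, concatenated

features = np.random.rand(60, 80, 32)               # e.g., CNN feature grid
digest = roi_pool(features, (10, 20, 45, 62))
print(digest.shape)                                 # (1568,) = 7 * 7 * 32
```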

[0049] At 470, successive foreground features can be fed into the Long Short Term Memory (LSTM) 230 to learn the temporal pattern. The LSTM 230 can be configured to recognize patterns in an input sequence. In video action recognition, there could be patterns within sequences of frames belonging to a single action, referred to as intra action patterns. There could also be patterns within sequences of actions, referred to as inter action patterns. The LSTM can be configured to learn both of these patterns, jointly referred to as temporal patterns. The Long Short Term Memory (LSTM) analyzes a series of foreground features to recognize actions belonging to an overall sequence. In one implementation, the LSTM outputs an action class describing a recognized action associated with an overall process for each input it receives. In another implementation, each action class is comprised of sets of actions describing actions associated with completing an overall process. Each action within the set of actions can be assigned a score indicating a likelihood that the action matches the action captured in the input video frame. Each action may be assigned a score such that the action with the highest score is designated the recognized action class.

[0050] Foreground features from successive frames can be fed into the Long Short Term Memory (LSTM). The foreground feature refers to the aggregated feature vectors from within the Region of Interest (Rol) rectangles. The output of the LSTM at each time step is the recognized action class. The loss for each individual frame is the cross entropy softmax loss over the set of possible action classes. A batch is defined as a set of three randomly selected twelve-frame sequences from the video stream. The loss for a batch is defined as the frame loss averaged over the frames in the batch. The numbers twelve and three are chosen empirically. The overall LSTM loss function is given by Equation 4:

where $B$ denotes a batch of $\|B\|$ frame sequences $\{S_1, S_2, \ldots, S_{\|B\|}\}$, each $S_k$ comprising a sequence of $\|S_k\|$ frames, wherein in the present implementation $\|B\| = 3$ and $\|S_k\| = 12$. $A$ denotes the set of all action classes, $a_t^i$ denotes the $i$-th action class score for the $t$-th frame from the LSTM, and $a_t^{*}$ denotes the corresponding Ground Truth.
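
Equation 4 is not reproduced above; the following PyTorch sketch computes the batch loss as described in the text, i.e. the per-frame cross entropy softmax loss over action classes averaged over all frames in a batch of three twelve-frame sequences. Tensor shapes and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

num_classes = 8
batch_sequences, frames_per_sequence = 3, 12            # ||B|| = 3, ||S_k|| = 12

# LSTM outputs: one score vector per frame of every sequence in the batch.
scores = torch.randn(batch_sequences, frames_per_sequence, num_classes)
# Ground-truth action class for every frame.
targets = torch.randint(0, num_classes, (batch_sequences, frames_per_sequence))

# Cross entropy softmax loss per frame, averaged over all frames in the batch.
loss = F.cross_entropy(scores.reshape(-1, num_classes), targets.reshape(-1))
print(loss.item())
```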

[0051] Referring again to FIG. 1, the machine learning back-end unit 135 can utilize custom labelling tools with interfaces optimized for labeling Rol, cycles and actions. The labelling tools can include both standalone applications built on top of Open Source Computer Vision (OpenCV) and web browser applications that allow for the labeling of video segments.

[0052] Referring now to FIG. 5, an action recognition and analytics system, in accordance with aspects of the present technology, is shown. Again, the action recognition and analytics system 500 can be deployed in a manufacturing, health care, warehousing, shipping, retail, restaurant, or similar context. The system 500 similarly includes one or more sensors 505-515 disposed at one or more stations, one or more machine learning back-end units 520, one or more analytics units 525, and one or more front-end units 530. The one or more sensors 505-515 can be coupled to one or more local computing devices 535 configured to aggregate the sensor data streams from the one or more sensors 505-515 for transmission across one or more communication links to a streaming media server 540. The streaming media server 540 can be configured to receive one or more streams of sensor data from the one or more sensors 505-515. A format converter 545 can be coupled to the streaming media server 540 to receive the one or more sensor data streams and convert the sensor data from one format to another. For example, the one or more sensors may generate Moving Picture Experts Group (MPEG) formatted (e.g., H.264) video sensor data, and the format converter 545 can be configured to extract frames of JPEG sensor data. An initial stream processor 550 can be coupled to the format converter 545. The initial stream processor 550 can be configured to segment the sensor data into predetermined chunks, subdivide the chunks into key frame aligned segments, and create per segment sensor data in one or more formats. For example, the initial stream processor 550 can divide the sensor data into five minute chunks, subdivide the chunks into key frame aligned segments, and convert the key frame aligned segments into MPEG, MPEG Dynamic Adaptive Streaming over Hypertext Transfer Protocol (DASH) format, and or the like. The initial stream processor 550 can be configured to store the sensor stream segments in one or more data structures for storing sensor streams 555. In one implementation, as sensor stream segments are received, each new segment can be appended to the previous sensor stream segments stored in the one or more data structures for storing sensor streams 555.

[0053] A stream queue 560 can also be coupled to the format converter 545. The stream queue 560 can be configured to buffer the sensor data from the format converter 545 for processing by the one or more machine learning back-end units 520. The one or more machine learning back-end units 520 can be configured to recognize, in real time, one or more cycles, processes, actions, sequences, objects, parameters and the like in the sensor streams received from the plurality of sensors 505-515. Referring now to FIG. 6, an exemplary method of detecting actions, in accordance with aspects of the present technology, is shown. The action recognition method can include receiving one or more sensor streams from one or more sensors, at 610. In one implementation, one or more machine learning back-end units 520 can be configured to receive sensor streams from sensors 505-515 disposed at one or more stations.

[0054] At 620, a plurality of processes, including one or more actions arranged in one or more sequences and performed on one or more objects, and one or more parameters can be detected in the one or more sensor streams. At 630, one or more cycles of the plurality of processes in the sensor stream can also be determined. In one implementation, the one or more machine learning back-end units 520 can recognize cycles, processes, actions, sequences, objects, parameters and the like in sensor streams utilizing deep learning, decision tree learning, inductive logic programming, clustering, reinforcement learning, Bayesian networks, and or the like.

[0055] At 640, indicators of the one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters can be generated. In one implementation, the one or more machine learning back-end units 520 can be configured to generate indicators of the one or more cycles, processes, actions, sequences, objects, parameters and or the like. The indicators can include descriptions, identifiers, values and or the like associated with the cycles, processes, actions, sequences, objects, and or parameters. The parameters can include, but are not limited to, time, duration, location (e.g., x, y, z, t), reach point, motion path, grid point, quantity, sensor identifier, station identifier, and bar codes.
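
A minimal sketch of what such an indicator record might look like in code, assuming a simple Python dataclass; the field names mirror the parameters listed above but are otherwise hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Indicator:
    """One indicator generated by the machine learning back-end for a sensor stream portion."""
    cycle_id: int
    process: str
    action: str
    sequence_position: int
    objects: Tuple[str, ...]
    # Parameters such as time, duration, location, station identifier, bar codes, etc.
    parameters: Dict[str, object] = field(default_factory=dict)
    # Index into the stored sensor streams (stream identifier, start frame, end frame).
    stream_index: Tuple[str, int, int] = ("", 0, 0)

record = Indicator(
    cycle_id=17,
    process="attach hard drive",
    action="turn screw",
    sequence_position=3,
    objects=("screw", "screwdriver", "chassis"),
    parameters={"duration_s": 4.2, "station_id": "110", "bar_code": "0045-AA"},
    stream_index=("camera-1", 120450, 120575),
)
print(record.process, record.parameters["duration_s"])
```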

[0056] At 650, the indicators of the one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of the sensor streams can be stored in one or more data structures for storing data sets 565. In one implementation, the one or more machine learning back-end units 520 can be configured to store a data set including the indicators of the one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters for each cycle. The data sets can be stored in one or more data structures for storing the data sets 565. The indicators of the one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters in the data sets can be indexed to corresponding portion of the sensor streams in one or more data structures for storing sensor streams 555.

[0057] In one implementation, the one or more streams of sensor data and the indicators of the one or more of the plurality of cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters indexed to corresponding portions of the one or more streams of sensor data can be encrypted when stored to protect the integrity of the streams of sensor data and or the data sets. In one implementation, the one or more streams of sensor data and the indicators of the one or more of the plurality of cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters indexed to corresponding portions of the one or more streams of sensor data can be stored utilizing blockchaining. The blockchaining can be applied across the cycles, sensor streams, stations, supply chain and or the like. The blockchaining can include calculating a cryptographic hash based on blocks of the data sets and or blocks of the streams of sensor data. The data sets, streams of sensor data and the cryptographic hash can be stored in one or more data structures in a distributed network.
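
A minimal sketch of the hash chaining idea, assuming SHA-256 over serialized blocks of data set entries; the block contents and linkage format are illustrative assumptions, not the claimed storage format.

```python
import hashlib
import json

def chain_blocks(blocks):
    """Link blocks of data set / sensor stream records by including each block's
    predecessor hash in its own hash, so any later tampering is detectable."""
    chained = []
    previous_hash = "0" * 64
    for payload in blocks:
        body = json.dumps({"prev": previous_hash, "data": payload}, sort_keys=True)
        block_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
        chained.append({"prev": previous_hash, "data": payload, "hash": block_hash})
        previous_hash = block_hash
    return chained

def verify_chain(chained):
    previous_hash = "0" * 64
    for block in chained:
        body = json.dumps({"prev": previous_hash, "data": block["data"]}, sort_keys=True)
        if hashlib.sha256(body.encode("utf-8")).hexdigest() != block["hash"]:
            return False
        previous_hash = block["hash"]
    return True

blocks = [{"cycle": 1, "process": "attach hard drive"}, {"cycle": 2, "process": "attach hard drive"}]
ledger = chain_blocks(blocks)
print(verify_chain(ledger))            # True
ledger[0]["data"]["process"] = "skip"  # tamper with an earlier cycle
print(verify_chain(ledger))            # False
```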

[0058] Referring again to FIG. 5, the one or more analytics units 525 can be coupled to the one or more data structures for storing the sensor streams 555, the one or more data structures for storing the data sets 565, one or more additional sources of data 570, and one or more data structures for storing analytics 575. The one or more analytics units 525 can be configured to perform statistical analysis on the cycle, process, action, sequence, object and parameter data in one or more data sets. The one or more analytics units 525 can also utilize additional data received from one or more additional data sources 570. The additional data sources 570 can include, but are not limited to, Manufacturing Execution Systems (MES), warehouse management systems, patient management systems, accounting systems, robot datasheets, human resource records, bills of materials, and sales systems. Some examples of data that can be received from the additional data sources 570 include, but are not limited to, time, date, shift, day of week, plant, factory, assembly line, sub-assembly line, building, room, supplier, work space, action capability, energy consumption, and ownership cost. The one or more analytics units 525 can be configured to utilize the additional data from the one or more additional sources of data 570 to update, correct, extend, augment or the like the data about the cycles, processes, actions, sequences, objects and parameters in the data sets. Similarly, the additional data can also be utilized to update, correct, extend, augment or the like the analytics generated by the one or more analytics units 525. The one or more analytics units 525 can also store trends and other comparative analytics utilizing the data sets and or the additional data, can use sensor fusion to merge data from multiple sensors, and can perform other similar processing and store the results in the one or more data structures for storing analytics 575. In one implementation, one or more engines 170, such as the one or more machine learning back-end units 520 and or the one or more analytics units 525, can create a data structure including a plurality of data sets, the data sets including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters. The one or more engines 170 can build the data structure based on the one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters detected in the one or more sensor streams. The data structure definition, configuration and population can be performed in real time based upon the content of the one or more sensor streams. For example, Table 1 shows a table defined, configured and populated as the sensor streams are processed by the one or more machine learning back-end units 520.

ENTITY ID DATA STRUCTURE (TABLE 1)

Table 1

The data structure creation process can continue to expand upon the initial structure and or create additional data structures based upon additional processing of the one or more sensor streams.

[0059] In one embodiment, the status associated with entities is added to a data structure configuration (e.g., engaged in an action, subject to a force, etc.) based upon processing of the accessed information. In one embodiment, activity associated with the entities is added to a data structure configuration (e.g., engaged in an action, subject to a force, etc.) based upon processing of the accessed information. One example of an entity status data set created from processing of the above entity ID data set (e.g., motion vector analysis of an image object, etc.) is illustrated in Table 2.

ENTITY STATUS DATA STRUCTURE (TABLE 2)

Table 2

In one embodiment, a third-party data structure as illustrated in Table 3 can be accessed.

OSHA DATA STRUCTURE (TABLE 3)

Table 3

In one embodiment, activity associated with entities is added to a data structure configuration (e.g., engaged in an action, subject to a force, etc.) based upon processing of the accessed information, as illustrated in Table 4.

ACTIVITY DATA STRUCTURE (TABLE 4)

Table 4

Table 4 is created by the one or more engines 170 based on further analytics/processing of the information in Table 1, Table 2 and Table 3. In one example, Table 4 is automatically configured to have a column for screwing to motherboard. In frames 1 and 3, since the hand is moving (see Table 2) and a screw is present (see Table 1), the activity is determined to be screwing to the motherboard (see Table 3). In frame 2, since the hand is not moving (see Table 2) and a screw is not present (see Table 1), there is no screwing to the motherboard (see Table 3).

[0060] Table 4 is also automatically configured to have a column for human action safe. In frame 1, since the leg is not moving in the frame (see Table 2), the worker is safely (see Table 3) standing at the workstation while engaged in the activity of screwing to the motherboard. In frame 3, since the leg is moving (see Table 2), the worker is not safely (see Table 3) standing at the workstation while engaged in the activity of screwing to the motherboard.
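
The derivation of Table 4 from Tables 1-3 can be read as a small set of rules over per-frame fields; the sketch below encodes that logic with hypothetical field names, since the actual table schemas are not reproduced in the text.

```python
# Hypothetical per-frame records combining Table 1 (entity IDs), Table 2 (entity
# status) and Table 3 (safety reference) style information.
frames = [
    {"frame": 1, "screw_present": True,  "hand_moving": True,  "leg_moving": False},
    {"frame": 2, "screw_present": False, "hand_moving": False, "leg_moving": False},
    {"frame": 3, "screw_present": True,  "hand_moving": True,  "leg_moving": True},
]

def derive_activity_row(record):
    """Derive Table 4 style columns for one frame."""
    screwing = record["hand_moving"] and record["screw_present"]
    # Safety rule: standing still at the workstation while working is considered safe.
    action_safe = screwing and not record["leg_moving"]
    return {"frame": record["frame"],
            "screwing_to_motherboard": screwing,
            "human_action_safe": action_safe}

for row in map(derive_activity_row, frames):
    print(row)
# frame 1: screwing and safe; frame 2: no screwing; frame 3: screwing but not safe.
```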

[0061] The one or more analytics units 525 can also be coupled to the one or more front-end units 530. The one or more front-end units 530 can include a mentor portal 580, a management portal 585, and other similar portals. The mentor portal 580 can be configured for presenting feedback generated by the one or more analytics units 525 and or the one or more front-end units 530 to one or more actors. For example, the mentor portal 580 can include a touch screen display for indicating discrepancies in the processes, actions, sequences, objects and parameters at a corresponding station. The mentor portal 580 can also present training content generated by the one or more analytics units 525 and or the one or more front-end units 530 to an actor at a corresponding station. The management portal 585 can be configured to enable searching of the one or more data structures storing analytics, data sets and sensor streams. The management portal 585 can also be utilized to control operation of the one or more analytics units 525 for such functions as generating training content, creating work charts, performing line balancing analysis, assessing ergonomics, creating job assignments, performing causal analysis, performing automation analysis, presenting aggregated statistics, and the like.

[0062] The action recognition and analytics system 500 can non-intrusively digitize processes, actions, sequences, objects, parameters and the like performed by numerous entities, including both humans and machines, using machine learning. The action recognition and analytics system 500 enables human activity to be measured automatically, continuously and at scale. By digitizing the performed processes, actions, sequences, objects, parameters, and the like, the action recognition and analytics system 500 can optimize manual and/or automatic processes. In one instance, the action recognition and analytics system 500 enables the creation of a fundamentally new data set of human activity. In another instance, the action recognition and analytics system 500 enables the creation of a second fundamentally new data set of man and machine collaborating in activities. The data set from the action recognition and analytics system 500 includes quantitative data, such as which actions were performed by which person, at which station, on which specific part, at what time. The data set can also include judgements based on performance data, such as whether a given person performs better or worse than average. The data set can also include inferences based on an understanding of the process, such as whether a given product exited the assembly line with one or more incomplete tasks.

[0063] Referring now to FIG. 7, an action recognition and analytics system, in accordance with aspects of the present technology, is shown. The action recognition and analytics system can include a plurality of sensor layers 702, a first Application Programming Interface (API) 704, a physics layer 706, a second API 708, a plurality of data 710, a third API 712, a plurality of insights 714, a fourth API 716 and a plurality of engine layers 718. The sensor layer 702 can include, for example, cameras at one or more stations 720, MES stations 722, sensors 724, IIoT integrations 726, process ingestion 728, labeling 730, neural network training 732 and or the like. The physics layer 706 captures data from the sensor layer 702 and passes it to the data layer 710. The data layer 710 can include, but is not limited to, video and other streams 734, +N annotations 736, +MES 738, +OSHA database 740, and third-party data 742. The insights layer 714 can provide for video search 744, time series data 746, standardized work 748, and spatio-temporal 842. The engine layer 718 can be utilized for inspection 752, lean/line balancing 754, training 756, job assignment 758, other applications 760, quality 763, traceability 764, ergonomics 766, and third party applications 768.

[0064] Referring now to FIG. 8, an exemplary station, in accordance with aspects of the present technology, is shown. The station 800 is an area associated with one or more cycles, processes, actions, sequences, objects, parameters and or the like, herein also referred to as activity. Information regarding a station can be gathered and analyzed automatically. The information can also be gathered and analyzed in real time. In one exemplary implementation, an engine participates in the information gathering and analysis. The engine can use Artificial Intelligence to facilitate the information gathering and analysis. It is appreciated there can be many different types of stations with various associated entities and activities. Additional descriptions of stations, entities, activities, information gathering, and analytics are discussed in other sections of this detailed description.

[0065] A station or area associated with an activity can include various entities, some of which participate in the activity within the area. An entity can be considered an actor, an object, and so on. An actor can perform various actions on an object associated with an activity in the station. It is appreciated a station can be compatible with various types of actors (e.g., human, robot, machine, etc.). An object can be a target object that is the target of the action (e.g., a thing being acted on, a product, a tool, etc.). It is appreciated that an object can be a target object that is the target of the action and there can be various types of target objects (e.g., a component of a product or article of manufacture, an agricultural item, part of a thing or person being operated on, etc.). An object can be a supporting object that supports (e.g., assists, facilitates, aids, etc.) the activity. There can be various types of supporting objects, including load bearing components (e.g., a work bench, conveyor belt, assembly line, table top, etc.), a tool (e.g., drill, screwdriver, lathe, press, etc.), a device that regulates environmental conditions (e.g., heating ventilating and air conditioning component, lighting component, fire control system, etc.), and so on. It is appreciated there can be many different types of stations with various entities involved in a variety of activities. Additional descriptions of the station, entities, and activities are discussed in other sections of this detailed description.

[0066] The station 800 can include a human actor 810, supporting object 820, and target objects 830 and 840. In one embodiment, the human actor 810 is assembling a product that includes target objects 830, 840 while supporting object 820 is facilitating the activity. In one embodiment, target objects 830, 840 are portions of a manufactured product (e.g., a motherboard and a housing of an electronic component, a frame and a motor of a device, a first and a second structural member of an apparatus, legs and a seat portion of a chair, etc.). In one embodiment, target objects 830, 840 are items being loaded in a transportation vehicle. In one embodiment, target objects 830, 840 are products being stocked in a retail establishment. Supporting object 820 is a load bearing component (e.g., a work bench, a table, etc.) that holds target object 840 (e.g., during the activity, after the activity, etc.). Sensor 850 senses information about the station (e.g., actors, objects, activities, actions, etc.) and forwards the information to one or more engines 860. Sensor 850 can be similar to sensor 135. Engine 860 can include a machine learning back-end component, analytics, and a front end similar to the machine learning back-end unit 180, analytics unit 185, and front-end unit 190. Engine 860 performs analytics on the information and can forward feedback to feedback component 870 (e.g., a display, speaker, etc.) that conveys the feedback to human actor 810.

[0067] Referring now to FIG. 9, an exemplary station, in accordance with aspects of the present technology, is shown. The station 900 includes a robot actor 910, target objects 920, 930, and supporting objects 940, 950. In one embodiment, the robot actor 910 is assembling target objects 920, 930 and supporting objects 940, 950 are facilitating the activity. In one embodiment, target objects 920, 930 are portions of a manufactured product. Supporting object 940 (e.g., an assembly line, a conveyor belt, etc.) holds target objects 920, 930 during the activity and moves the combined target objects 920, 930 to a subsequent station (not shown) after the activity. Supporting object 950 provides area support (e.g., lighting, fan temperature control, etc.). Sensor 960 senses information about the station (e.g., actors, objects, activities, actions, etc.) and forwards the information to engine 970. Engine 970 performs analytics on the information and forwards feedback to a controller 980 that controls robot 910. Engine 970 can be similar to engine 170 and sensor 960 can be similar to sensor 135.

[0068] A station can be associated with various environments. The station can be related to an economic sector. A first economic sector can include the retrieval and production of raw materials (e.g., raw food, fuel, minerals, etc.). A second economic sector can include the transformation of raw or intermediate materials into goods (e.g., manufacturing products, manufacturing steel into cars, manufacturing textiles into clothing, etc.). A third sector can include the supply and delivery of services and products (e.g., an intangible aspect in its own right, an intangible aspect as a significant element of a tangible product, etc.) to various parties (e.g., consumers, businesses, governments, etc.). In one embodiment, the third sector can include sub sectors. One sub sector can include information and knowledge-based services. Another sub sector can include hospitality and human services. A station can be associated with a segment of an economy (e.g., manufacturing, retail, warehousing, agriculture, industrial, transportation, utility, financial, energy, healthcare, technology, etc.). It is appreciated there can be many different types of stations and corresponding entities and activities. Additional descriptions of the station, entities, and activities are discussed in other sections of this detailed description.

[0069] In one embodiment, station information is gathered and analyzed. In one exemplary implementation, an engine (e.g., an information processing engine, a system control engine, an Artificial Intelligence engine, etc.) can access information regarding the station (e.g., information on the entities, the activity, the action, etc.) and utilize the information to perform various analytics associated with the station. In one embodiment, the engine can include a machine learning back-end unit, an analytics unit, a front-end unit, and a data storage unit similar to the machine learning back-end unit 180, analytics unit 185, front-end unit 190 and data storage unit 175. In one embodiment, a station activity analysis process is performed. Referring now to FIG. 10, an exemplary station activity analysis method, in accordance with one embodiment, is shown.

[0070] At 1010, information regarding the station is accessed. In one embodiment, the information is accessed by an engine. The information can be accessed in real time. The information can be accessed from monitors/sensors associated with a station. The information can be accessed from an information storage repository. The information can include various types of information (e.g., video, thermal, optical, etc.). Additional descriptions of accessing the information are discussed in other sections of this detailed description.

[0071] At 1020, information is correlated with entities in the station and optionally with additional data sources. In one embodiment, the correlation is established at least in part by an engine. The engine can associate the accessed information with an entity in a station. An entity can include an actor, an object, and so on. Additional descriptions of correlating information with entities are discussed in other sections of this detailed description.

[0072] At 1030, various analytics are performed utilizing the information accessed at 1010 and the correlations established at 1020. In one embodiment, an engine utilizes the information to perform various analytics associated with the station. The analytics can be directed at various aspects of an activity (e.g., validation of actions, abnormality detection, training, assignment of an actor to an action, tracking activity on an object, determining a replacement actor, examining actions of actors with respect to an integrated activity, automatic creation of work charts, creating ergonomic data, identifying product kitting components, etc.). Additional descriptions of the analytics are discussed in other sections of this detailed description.

[0073] At 1040, optionally, results of the analysis can be forwarded as feedback. The feedback can include directions to entities in the station. In one embodiment, the information accessing, analysis, and feedback are performed in real time. Additional descriptions of the station, engine, entities, activities, analytics and feedback are discussed in other sections of this detailed description.

[0074] It is also appreciated that accessed information can include general information regarding the station (e.g., environmental information, generic identification of the station, activities expected in station, a golden rule for the station, etc.). Environmental information can include ambient aspects and characteristics of the station (e.g., temperature, lighting conditions, visibility, moisture, humidity, ambient aroma, wind, etc.).

[0075] It is also appreciated that some types of characteristics or features can apply to a particular portion of a station and also to the general environment of a station. In one exemplary implementation, a portion of a station (e.g., work bench, floor area, etc.) can have a first particular visibility level and the ambient environment of the station can have a second particular visibility level. It is appreciated that some types of characteristics or features can apply to a particular entity in a station and also to the station environment. In one embodiment, an entity (e.g., a human, robot, target object, etc.) can have a first particular temperature range and the station environment can have a second particular temperature range.

[0076] The action recognition and analytics system 100, 500 can be utilized for process validation, anomaly detection and/or process quality assurance in real time. The action recognition and analytics system 100, 500 can also be utilized for real time contextual training. The action recognition and analytics system 100, 500 can be configured for assembling training libraries from video clips of processes to speed new product introductions or onboard new employees. The action recognition and analytics system 100, 500 can also be utilized for line balancing by identifying processes, sequences and/or actions to move among stations and implementing lean processes automatically. The action recognition and analytics system 100, 500 can also automatically create standardized work charts by statistical analysis of processes, sequences and actions. The action recognition and analytics system 100, 500 can also automatically create birth certificate videos for a specific unit. The action recognition and analytics system 100, 500 can also be utilized for automatically creating statistically accurate ergonomics data. The action recognition and analytics system 100, 500 can also be utilized to create programmatic job assignments based on skills, tasks, ergonomics and time. The action recognition and analytics system 100, 500 can also be utilized for automatically establishing traceability, including for causal analysis. The action recognition and analytics system 100, 500 can also be utilized for kitting products, including real time verification of packing or unpacking by action and image recognition. The action recognition and analytics system 100, 500 can also be utilized to determine the best robot to replace a worker when ergonomic problems are identified. The action recognition and analytics system 100, 500 can also be utilized to design an integrated line of humans and cobots and/or robots. The action recognition and analytics system 100, 500 can also be utilized for automatically programming robots based on observing non-modeled objects in the work space.

[0077] Referring now to FIGS. 11A, 11B and 11C, an action recognition and analytics method of contextual training, in accordance with aspects of the present technology, is shown. The method can include optionally receiving an indication of a given one of a plurality of subjects, at 1105. At 1110, a representative data set can be accessed. The representative data set can include one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters indexed to corresponding portions of one or more sensor streams. In one implementation, one or more engines 170 can be configured to access a representative data set including one or more processes, actions, sequences, objects, parameters and or the like stored in one or more data structures on one or more data storage units 175. In one implementation, the representative data set can be automatically generated, by the one or more engines 170, based upon a statistical analysis of one or more previous cycles.

[0078] At 1115, portions of the one or more sensor streams corresponding to the representative data set can be accessed. In one implementation, the one or more engines 170 can be configured to access portions of one or more sensor streams stored in one or more data structures on the one or more data storage units 175 and indexed by the representative data set. In another implementation, the representative data set and the previous portion of one or more sensor streams can be blockchained. The representative data set and the previous portion of one or more sensor streams can be blockchained to protect the integrity of the representative data set and the corresponding portion of the one or more sensor streams. The blockchaining can be applied across the cycles, sensor streams, stations, supply chain and or the like.

[0079] At 1120, an indication of a given process of the representative data set and one or more corresponding portions of the one or more sensor streams can be output. In one implementation, the one or more engines 170 can be configured to output an indication of a given process and one or more corresponding portions of video sensor streams to an actor at a given station at which training is being performed. For example, the indication of a given process and one or more corresponding portions of video sensor streams can be presented to the worker on a display, a haptic display, an Augmented Reality (AR) interface, a Virtual Reality (VR) interface, or the like.

[0080] If an indication of a given one of the plurality of subjects is received, a representative data set for the given subject can be accessed. For example, an indication of a first laptop PC model can be received for a first cycle. A given representative data set of the first laptop PC can be accessed. For a second cycle, an indication of a second laptop PC model can be received. In response to the indication of the second laptop PC model, a corresponding representative data set for the second laptop PC can be accessed. The indication of a given one of a plurality of subjects therefore allows training content, including a representative data set and one or more corresponding portions of the one or more sensor streams for a specific subject, to be presented.

[0081] Referring now to FIG. 12, an exemplary presentation of contextual training content, in accordance with aspects of the present technology, is shown. As illustrated, a graphical user interface 1200 can be produced by the one or more front-end units 190 on an interface 155, such as a monitor, at a given station 110. The graphical user interface 1200 can include a list view 1210 and or a device view 1220. For training a worker at a particular station, one or more processes, actions, sequences, objects and or parameters of the representative data set can be presented in the list view 1210. For example, a given process currently to be performed by the worker 1230 may be displayed in a first color (e.g., black), one or more processes to be performed next 1240 can be displayed in a second color (e.g., gray), one or more processes that have been successfully completed 1250 can be displayed in a third color (e.g., green), and one or more processes that were unsuccessfully completed 1260 can be displayed in a fourth color (e.g., red). In the device view 1220, a snippet of a video stream corresponding to the given process to be performed can be displayed. In addition, a textual description of the given process 1270 can also be displayed along with the corresponding video stream snippet and or images.

[0082] Referring again to FIGS. 11A, 11B and 11C, a current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters for a current portion of one or more sensor streams can be received in real time, at 1125. In one implementation, one or more engines 170 can be configured to receive one or more sensor streams from one or more sensors at the given station at which training is being performed. The one or more engines 170 can be configured to detect the processes being performed at the given station. The current process can include one or more actions arranged in one or more sequences and performed on one or more objects, and one or more parameter values. The one or more engines 170 can be configured to generate a current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters currently being performed at the given station.

[0083] At 1130, a current process in the current data set can be compared to the given process in the representative data set. In one implementation, the one or more engines 170 can compare the current process in the current data set to the given process in the representative data set. In one implementation, a representation including a finite state machine and a state transition map can be generated based on the representative data set. The one or more indicators of the cycles, processes, actions, sequences, objects or parameters associated with a current portion of the plurality of sensor streams can be input to the representation including the finite state machine and the state transition map. The state transition map at each station can include a sequence of steps, some of which are dependent on others. Due to this partial dependence, the process can be represented as a partially-ordered set (poset) or a directed acyclic graph (DAG), wherein the members of the set (nodes of the graph) can be used to store the representative data set values corresponding to those steps. The finite state machine can render the current state of the station into the map.
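
A minimal sketch of the representation described above, assuming the process is stored as a directed acyclic graph of steps carrying their representative values, together with a simple state tracker that marks steps complete as matching indicators arrive; the step names and fields are illustrative assumptions.

```python
# Directed acyclic graph of process steps: each step lists the steps it depends on
# and carries the representative (expected) values for that step.
PROCESS_DAG = {
    "place_drive":   {"depends_on": [],                "expected_duration_s": 6.0},
    "insert_screws": {"depends_on": ["place_drive"],   "expected_duration_s": 9.0},
    "turn_screws":   {"depends_on": ["insert_screws"], "expected_duration_s": 12.0},
    "close_chassis": {"depends_on": ["turn_screws"],   "expected_duration_s": 5.0},
}

class StationState:
    """Tracks which steps of the partially ordered process are complete."""
    def __init__(self, dag):
        self.dag = dag
        self.completed = set()

    def ready_steps(self):
        return [s for s, spec in self.dag.items()
                if s not in self.completed
                and all(d in self.completed for d in spec["depends_on"])]

    def observe(self, step):
        # Only accept a step whose dependencies are already satisfied.
        if step in self.ready_steps():
            self.completed.add(step)
            return True
        return False          # out-of-order step -> sequence deviation

state = StationState(PROCESS_DAG)
print(state.observe("turn_screws"))   # False: dependencies not yet complete
print(state.observe("place_drive"))   # True
print(state.ready_steps())            # ['insert_screws']
```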

[0084] At 1135, the result of the comparison of the current process in the current data set to the given process in the representative data set can be output. In one implementation, the one or more engines 170 can be configured to output the results of the comparison on one or more interfaces 155 at the given station 110 at which training is being performed. For example, a graphical user interface, output on a display at the station at which training is being performed, can provide an indication of whether the given process was completed successfully or unsuccessfully. The result of the comparison can be presented to an actor on a display, a haptic display, an Augmented Reality (AR) interface, a Virtual Reality (VR) interface, or the like. The comparison may include determining in real time one or more differences based on one or more corresponding error bands. The comparison may, or may additionally, include validating that the current data set conforms to the representative data set within one or more corresponding error bands. The comparison may, or may additionally, include detecting one or more types of differences from a group including object deviations, action deviations, sequence deviations, process deviations, and timing deviations. For example, a timing deviation comparison may determine if the current process was performed within two standard deviations of the time in which the given process in the representative data set was performed. A sequence deviation may determine if a certain step was performed in the wrong order.
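
As an illustration of the error-band comparison mentioned above (e.g., a timing deviation outside two standard deviations), the following sketch compares a current step duration against statistics drawn from the representative data set; the statistics and threshold are assumptions for illustration.

```python
from statistics import mean, stdev

def timing_deviation(current_duration, representative_durations, n_sigma=2.0):
    """Return (is_deviation, lower_bound, upper_bound) for the current step duration."""
    mu = mean(representative_durations)
    sigma = stdev(representative_durations)
    lower, upper = mu - n_sigma * sigma, mu + n_sigma * sigma
    return not (lower <= current_duration <= upper), lower, upper

# Durations (seconds) of the given process observed over previous "golden" cycles.
golden = [11.8, 12.4, 12.1, 11.9, 12.6, 12.0]
print(timing_deviation(12.3, golden))   # within the 2-sigma band -> no deviation
print(timing_deviation(15.0, golden))   # outside the band -> timing deviation
```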

[0085] At 1140, a next given process of the representative data set can be output responsive to the result of the comparison indicating a successful completion of the current process. In one implementation, the one or more engines 170 can be configured to output an indication of a next given process of the representative data set and one or more corresponding portions of video sensor streams to an actor at a given station at which training is being performed.

[0086] At 1145, a current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters for a current portion of one or more sensor streams can continue to be received in real time. In one implementation, the one or more engines 170 can be configured to continue to generate the current data set including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects, and one or more parameters currently being performed at the given station.

[0087] At 1150, the next current process in the current data set can be compared to the next given process in the representative data set. In one implementation, the one or more engines 170 can compare the next current process in the current data set to the next given process in the representative data set.

[0088] At 1155, the result of the comparison of the next current process in the current data set to the next given process in the representative data set can be output. In one implementation, the one or more engines 170 can be configured to output the results of the comparison, indicating whether the next given process was completed successfully or unsuccessfully, on a display at the station at which training is being performed. The processes at 1140 through 1155 can be repeated responsive to each result indicating a successful completion of the corresponding process.

[0089] At 1160, optionally, a current correction process of the representative data set can be output responsive to the result of the comparison indicating an unsuccessful completion of the current process. In one implementation, the one or more engines 170 can be configured to output an indication of a current correction process and one or more corresponding portions of video sensor streams to a worker at a given station at which training is being performed.

[0090] At 1165, optionally, a current correction process in the current data set can be compared to the given correction process in the representative data set. In one implementation, the one or more engines 170 can be configured to compare the current correction process in the current data set to the given correction process in the representative data set.

[0091] At 1170, optionally, the result of the comparison of the current correction process in the current data set to the given correction process in the representative data set can be output. In one implementation, the one or more engines 170 can be configured to output the results of the comparison on one or more interfaces 155 at the given station at which training is being performed. For example, a graphical user interface can provide an indication of whether the correction process was completed successfully or unsuccessfully.

[0092] Aspects of the present technology make it possible to readily train workers by generating contextual training content. The contextual training content can present processes, including one or more actions arranged in one or more sequences and performed on one or more objects, and one or more parameter values, along with cues in real time to coach workers. The contextual training content can also present constructive feedback to an actor during training. For example, outputting the results of the comparison of the current process in the current data set to the given process in the representative data set can include indicating in a graphical user interface a decrease in the cycle time over a predetermined number of cycles, as illustrated in FIG. 13A. In another example, the graphical user interface can indicate when no steps were missed over a predetermined number of cycles, as illustrated in FIG. 13B.

[0093] Aspects of the present technology can extract processes, actions, sequences, objects, parameters and or the like from one or more sensor streams to create a representative data set. The representative data set along with the corresponding portions of the sensor streams can be utilized as contextual training content, and can represent a "golden process." The sensor streams may include one or more video sensor streams that include audio and video. The worker can therefore watch and or listen to learn important assembly and quality cues. Aspects of the present technology can also identify when a worker has successfully or unsuccessfully completed each process. The successful and unsuccessful completion of the processes can be monitored to identify when an operator has achieved a desired level of proficiency. The proficiency can also be based on successful completion of the process within a predetermined completion time.
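A minimal sketch of such a proficiency check is shown below. It assumes each training cycle yields a success flag and a completion time; the required streak length and the time limit are hypothetical values, not part of the present disclosure.

```python
# Minimal sketch of a proficiency check, assuming each training cycle yields a
# (success, completion_time) pair. The thresholds below are illustrative.

def has_achieved_proficiency(cycles, required_streak=5, time_limit=45.0):
    """True once the trainee completes `required_streak` consecutive cycles
    successfully, each within `time_limit` seconds."""
    streak = 0
    for success, completion_time in cycles:
        if success and completion_time <= time_limit:
            streak += 1
            if streak >= required_streak:
                return True
        else:
            streak = 0
    return False

history = [(True, 52.0), (True, 44.1), (True, 43.0), (True, 41.7),
           (True, 40.9), (True, 42.3)]
print(has_achieved_proficiency(history))  # True - the last five cycles qualify
```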

[0094] Referring now to FIG. 14, an exemplary worker profile, in accordance with aspects of the present technology, is shown. The proficiency of a worker can be measured during the contextual training and reported to one or more additional data sources. In one implementation, the one or more engines 170 can report one or more parameters measured during the contextual training to an employee management system for use in a worker profile. In another implementation, the action recognition and analytics system 100, 500 can also utilize the one or more parameters measured during the contextual training for line balancing, programmatic job assignments, and other similar functions.

[0095] The contextual training content, in accordance with aspects of the present technology, can be captured from sensor streams along the assembly line and applied to training workers while performing the processes at the respective stations. In contrast, conventional training materials have traditionally been created by process experts who often watch the task being performed and document it. The conventional training material is then presented to the worker in static formats, such as text or graphics, during "in-class" training or on a shop floor.

[0096] Embodiments of the present invention implement a method and system for automatically creating statistically accurate ergonomics data. Embodiments of the present invention implement a method for gathering data using modern and innovative techniques to produce a much more robust and complete dataset. Embodiments of the present invention produce an accurate dataset that is not dependent upon whether people are conscious of being observed while performing their work.

[0097] Embodiments of the present invention implement a technology that non-intrusively digitizes actions performed by both humans and machines using machine learning. By digitizing the actions performed, embodiments of the present invention convert them into data that can be applied for process optimization. Embodiments of the present invention use technology that enables human activities to be measured automatically, continuously and at scale.

[0098] While one initial focus is in manufacturing, the technology is applicable to human activity in any setting. Embodiments of the present invention enable the creation of a fundamentally new dataset of human actions. This dataset includes quantitative data (which actions were performed by which person, at which station, on which specific part, at what time); judgments based on performance data (person X performs better/worse than average); and inferences based on an understanding of the process (product Y exited the line with incomplete tasks). This dataset enables embodiments of the present invention to fundamentally re-imagine many of the techniques used to optimize manual and automated processes from manufacturing to the service industry.

[0099] Beyond the measurement of actions, embodiments of the present invention implement a system that is capable of performing the aforementioned process validation in seconds; that is, identifying which activity is being performed, as well as the movements of the actor as they perform the activity, determining whether or not the action conforms to the correct process, and communicating information about process adherence/deviation back to the actor performing the action or to other interested parties. This further expands the range of business applications, either new or re-imagined, to which embodiments of the present invention can be applied.

[00100] Embodiments of the present invention enable a number of applications by instrumenting a workstation, recording data (e.g., video, audio, thermal, vibration) from this workstation, labeling the actions in this data, building machine learning models, training these models to find the best predictor, and then inferring the actions observed in the data feed(s), possibly in real time, and driving business processes with this data.

[00101] Embodiments of the present invention can identify spatio-temporal information of the worker performing his or her required tasks. In one embodiment, since the camera is not always placed orthogonal to the worker, the data it captures can be distorted, based on the camera's viewpoint. Using camera coordinates and image projection techniques, the worker's hand coordinates (x, y, z) can be established accurately. Embodiments of the present invention can then continuously collect spatial data of the worker's body parts (e.g., hands) as the worker executes tasks on the assembly line. In more sophisticated analyses, 3D (e.g., x, y, z, roll, pitch, yaw) data is collected along with a synchronized time stamp. With this spatio-temporal data, embodiments of the present invention can determine and output reach studies using statistical analysis over time (distance), motion studies using statistical analysis over time (distance and frequency), repetitive motion analysis studies (distance, frequency and count), torque and load studies (force, frequency and count), and reports identifying unsafe conditions that are created by integrating the present technology's spatio-temporal data with OSHA ergonomic rules (reach, repetition frequency, etc.). In one embodiment, these reports can be delivered on a scheduled or an ad hoc basis when certain pre-specified conditions established by regulatory requirements or scientific work are identified.
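As a non-limiting illustration of establishing hand coordinates from camera data, the following sketch assumes a calibrated pinhole camera with known intrinsics, a depth estimate for the detection, and a known camera-to-workspace pose. The calibration values, the pinhole model, and the helper functions are assumptions for illustration; the disclosure does not prescribe a specific projection technique.

```python
# Minimal sketch of recovering workspace coordinates for a detected hand,
# assuming a calibrated pinhole camera (intrinsics fx, fy, cx, cy), a depth
# value for the detection, and a known camera pose. Values are illustrative.
import numpy as np

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with depth (metres) to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def camera_to_workspace(p_cam, rotation, translation):
    """Apply the camera-to-workspace rigid transform (extrinsics)."""
    return rotation @ p_cam + translation

# Hypothetical calibration for a camera that is not orthogonal to the worker.
fx = fy = 600.0
cx, cy = 320.0, 240.0
R = np.eye(3)                       # placeholder rotation
t = np.array([0.0, 0.0, 1.5])       # camera mounted 1.5 m from the origin

hand_px = (410, 255)                # detected hand centre in the image
hand_cam = pixel_to_camera(*hand_px, depth=0.9, fx=fx, fy=fy, cx=cx, cy=cy)
hand_xyz = camera_to_workspace(hand_cam, R, t)
print(hand_xyz)                     # time-stamped (x, y, z) sample for studies
```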

[00102] As stated earlier, current techniques require manual generation of time and motion data, a process that is fundamentally limiting and flawed. In contrast, by generating large volumes of data and non-dimensionalizing the data in new and creative ways, embodiments of the present invention enable analysis of repetitive motion injury. This enables embodiments of the present invention, based on a large data set, to uniquely provide insights into the inherent safety of the work environment for workers and to address a top priority for most manufacturers.

[00103] Embodiments of the present invention collect time and motion data as described above. In reach studies, embodiments of the present invention plot all the points that have been observed, creating a spatio-temporal dataset of places in the workspace that the operator reaches out to. Additionally, similar representations of two-dimensional projections of the three-dimensional data can be constructed to analyze worker movements in the horizontal and vertical planes. Further, embodiments of the present invention can produce continuous motion data from a worker's hand's starting position (e.g., x1, y1, z1) to ending position (e.g., x2, y2, z2). This data can be analyzed over a time period for repetitive motions to identify ergonomic issues developing across long periods.
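A minimal sketch of a reach analysis over such spatio-temporal samples is shown below. The (timestamp, x, y, z) sample layout and the reach threshold are illustrative assumptions.

```python
# Minimal sketch of a reach and repetition analysis over time-stamped hand
# positions, assuming samples of (timestamp, x, y, z) in metres relative to a
# fixed reference position of the worker. Thresholds are illustrative.
import numpy as np

def reach_statistics(samples, reach_threshold=0.8):
    """Return the maximum reach, mean reach, and count of far reaches."""
    xyz = np.asarray([s[1:] for s in samples])
    distances = np.linalg.norm(xyz, axis=1)
    return {
        "max_reach_m": float(distances.max()),
        "mean_reach_m": float(distances.mean()),
        "far_reach_count": int((distances > reach_threshold).sum()),
    }

samples = [
    (0.0, 0.30, 0.10, 0.40),
    (0.5, 0.85, 0.05, 0.35),   # a far reach
    (1.0, 0.25, 0.12, 0.42),
    (1.5, 0.90, 0.02, 0.33),   # another far reach
]
print(reach_statistics(samples))
```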

[00104] In this manner, embodiments of the present invention are directed to providing an ergonomically correct environment. In one embodiment, activity of a human actor is monitored and analyzed. The analysis can have a predictive element/nature (e.g., predicting a movement/result to determine if a dangerous condition/collision is going to happen before the bad event happens). Feedback on a corrective action can be provided to the actors (e.g., an alarm of a collision/dangerous condition, an order for a robot to stop/take evasive measures, etc.). Compliance of the actor with safety guidelines can be checked and an alarm forwarded if not.

[00105] In one embodiment, the present invention is implemented as a method that includes accessing information associated with a first actor, including sensed activity information associated with an activity space, analyzing the activity information, including analyzing activity of the first actor with respect to ergonomic factors, and forwarding feedback on the results of the analysis, wherein the results include identification of ergonomically problematic activities.

[00106] Referring now to FIG. 15, a machine learning based ergonomics method, in accordance with aspects of the present technology, is shown. The method can include accessing information associated with an actor including sensed activity information within an activity space with respect to performing an activity, at 1510. At 1520, the activity information of the actor can be analyzed with respect to one or more ergonomic factors. At 1530, the results of the analysis can be output.

[00107] Referring now to FIG. 16, a machine learning based ergonomics method, in accordance with aspects of the present technology, is shown. The method can include accessing information associated with a plurality of actors including sensed activity information within an activity space with respect to performing one or more activities, at 1610. At 1620, the activity information of the plurality of actors can be analyzed with respect to one or more ergonomic factors. At 1630, a result of the analysis including a selection of one of the plurality of actors to perform the activity can be output.

[00108] Referring now to FIG. 17, a side view of an ergonomics reach data diagram is shown on a graphical user interface (GUI), in accordance with aspects of the present technology. The reach data diagram shows six locations, numbered one through six, of reaching tasks that are performed by the depicted worker 1740. The depicted worker can be male or female, tall or short, and represent the diversity of humans. The GUI ergonomics reach data diagram graphs distances in both the vertical dimension and the horizontal dimension using a grid as shown. Based upon gathered data, the deep learning system first identifies the reach points, then performs reach analysis and generates the recommendations 1760. As shown, exemplary recommendations can include identifying steps which are hard-to-reach for 85 percent of women and 50 percent of men. As shown, exemplary recommendations include moving the action (e.g., step 4) closer to the worker, or alternatively, ensuring the task is performed by workers who are above 5'5" in height.

[00109] Referring now to FIG. 18, a top down view of an ergonomics reach data diagram is shown on a graphical user interface (GUI), in accordance with aspects of the present technology. The reach data diagram shows six locations, numbered one through six, of reaching tasks that are performed by the depicted worker 1840. The depicted worker can be male or female. Based upon gathered data, the deep learning system performs reach analysis and generates the recommendations 1860. As shown, exemplary recommendations can include identifying steps which are highly repetitive actions and which require a reach of over 800 mm. As shown, exemplary recommendations include moving the action (e.g., step 3) closer to the worker, or alternatively, ensuring the task is not performed for more than four hours by a worker.

[00110] Referring now to FIG. 19, a top down view of an ergonomics reach data diagram is shown on a graphical user interface (GUI), in accordance with aspects of the present technology. The reach data diagram shows a heat map of reaching tasks that are performed by the depicted worker. Based upon gathered data, the deep learning system performs reach analysis and generates ergonomics data that can include metrics such as furthest point reached, action with most frequency, highest force applied, duration of the highest force applied, and or the like, along with the corresponding portions of a sensor stream overlaid with the heat map of the reaching tasks performed by the worker.

[00111] Referring now to FIG. 20, an exemplary simple raw data table produced in accordance with aspects of the present technology is shown. The deep learning system of the present technology can produce continuous motion data from a worker's hand's starting position (e.g., x, y, z) to ending position (e.g., x1, y1, z1). This data can be analyzed over a time period for repetitive motions to identify ergonomic issues developing across long periods. This is shown by the dates and times, identified actions, starting coordinates, ending coordinates, the weight of the item, and the movement distance.

[00112] In this manner, aspects of the present technology implement a method comprising accessing information associated with a first actor, including sensed activity information associated with an activity space, analyzing the activity information, including analyzing activity of the first actor with respect to ergonomic factors, and forwarding feedback on the results of the analysis, wherein the results include identification of ergonomically problematic activities.

[00113] In one embodiment, the information is from sensors monitoring a work space in real time, wherein the information is accessed and analyzed in real time, and the feedback is forwarded in real time. In one embodiment, the analyzing comprises: comparing information associated with activity of the first actor within the activity space with identified representative actions, and identifying a deviation between the activity of the first actor and the representative standard.

[00114] In one embodiment, the analyzing comprises: determining if the deviation associated with a respective one of the plurality of actors is within an acceptable threshold (limit/parameter, measurement), and identifying the respective one of the plurality of other actors as a potential acceptable candidate to be the replacement actor when the deviation associated with a respective one of the plurality of actors is within the acceptable threshold. In one embodiment, the analysis includes creating a spatio-temporal dataset of locations in the workspace the first actor activity involves.

[00115] In one embodiment, the analyzing includes automated artificial intelligence analysis. In one embodiment, the sensed activity information is associated with a grid within the activity space. In one embodiment, the first actor is a human. In one embodiment, the sensed activity information further includes a second actor, wherein the first actor is a device and the second actor is a human.

[00116] Aspects of the technology implement a method comprising: accessing information associated with a first actor, including sensed activity information within an activity space with respect to performing an activity, analyzing the activity information, including analyzing activity of the first actor and analyzing activity of the second actor, and forwarding feedback on the results of the analysis, wherein the results include an identification of a selection between the first actor and the second actor.

[00117] In one embodiment, the accessing information includes continual spatial data associated with the first actor's activities. In one embodiment, the analysis includes the distance a part of the first actor moves (reach/bend/lean/(leg or arm) extension study using statistical analysis over time). In one embodiment, the analysis includes the distance and frequency a part of the first actor moves (motion/action study using statistical analysis over time). In one embodiment, the analysis includes the distance, frequency, and count a part of the first actor moves (repetitive motion/action study using statistical analysis over time).

[00118] In one embodiment, the analysis includes comparison with safety standards (OSHA- is safety equipment being used/worn correctly). In one embodiment, the analysis includes consideration of forces on the first actor (weight/torque/load). In one embodiment, the analysis includes consideration of characteristics of the first actor (height/weight/strength). In one embodiment, the feedback is provided in real time and on an ad hoc basis. In one embodiment, the ad hoc basis corresponds to safety regulations. In one embodiment, the analysis and feedback includes identification of omitted safety activity (sweep floor, wash food prep, etc.).

[00119] Referring now to FIG. 21, a machine learning based ergonomics method, in accordance with aspects of the present technology, is shown. The method can include accessing one or more data sets including one or more identifiers of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters in one or more sensor streams of a subject, at 2110. The subject can be an article of manufacture, a health care service, a warehousing, a shipping, a restaurant transaction, a retailing transaction or the like. The one or more sensor streams can be received from one or more stations disposed at various positions in a work space of a manufacturing, health care, warehouse, shipping, restaurant, retail or the like facility.

[00120] In one implementation, the indicators of the cycles, processes, actions, sequences, objects, and parameters of the one or more data sets can include spatio-temporal information such as one or more locations of one or more portions of one or more actors, one or more moments of work (e.g., weight, torque, distance), one or more work zones and or the like. For example, the indicators of the cycles, processes, actions, sequences, objects and or parameters can include one or more locations of one or more portions of a worker captured in the one or more sensor streams. The sensor streams can, for example, include video from one or more cameras disposed at one or more stations. Since the camera may not always be placed orthogonal to a worker, the data from the camera can be distorted based on the viewpoint of the camera. Using the coordinates of the camera and image projection techniques, the position of the worker's hands, legs, head, back and or the like can be established. The data sets can therefore include spatial data, collected at a second or sub-second rate, over one or more manufacturing cycles, for example. The position information can be captured in time and two dimensions (such as x and y coordinates), or three dimensions (such as combinations of x, y, z, roll, pitch, and yaw). In one implementation, a point cloud map, a heat map or the like of the places in the work space that the worker reaches out to can be constructed from the spatio-temporal data captured in the data set. In another implementation, the data set can include motion data, for example from a worker's hand's starting position (e.g., x, y, z) to ending position (e.g., x1, y1, z1).
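As a non-limiting illustration of constructing a heat map of reach points from the spatio-temporal data, the following sketch assumes hand positions already projected into two-dimensional workspace coordinates; the grid extent and bin count are illustrative choices rather than anything prescribed by the disclosure.

```python
# Minimal sketch of building a 2-D heat map of the places a worker reaches to,
# assuming hand positions have already been projected into workspace (x, y)
# coordinates in metres. Bin sizes and extents are illustrative.
import numpy as np

def reach_heat_map(xy_points, extent=((-1.0, 1.0), (-1.0, 1.0)), bins=20):
    """Histogram hand positions into a grid of reach counts."""
    xy = np.asarray(xy_points)
    heat, x_edges, y_edges = np.histogram2d(
        xy[:, 0], xy[:, 1], bins=bins, range=extent)
    return heat, x_edges, y_edges

# Hypothetical samples collected at a sub-second rate over one cycle.
points = [(0.31, 0.12), (0.30, 0.11), (0.78, -0.05), (0.29, 0.13)]
heat, _, _ = reach_heat_map(points)
print(heat.sum())   # 4.0 - every sample falls inside the grid
```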

[00121] The data sets can be based on one or more cycles, one or more actors, one or more stations, and or the like and any combinations thereof. For example, the one or more data sets can include one or more data sets for a first actor and one or more data sets for a second actor. In another example, the one or more data sets can include data sets for one or more cycles across each of a plurality of stations of an assembly line. In another example, the one or more data sets can include data sets for a specific actor over a specified period of time.

[00122] In one implementation, the one or more engines 170 can be configured to retrieve the one or more identifiers of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters indexed to corresponding portions of the one or more sensor streams from one or more data structures stored in one or more data storage units 175. In another implementation, the one or more engines 170 can generate in real time the one or more identifiers of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters indexed to corresponding portions of the one or more sensor streams in accordance with the techniques described above with regard to FIG. 6. The position information of one or more portions of one or more actors can be determined from the one or more sensor streams by the one or more engines 170. The one or more engines 170 can be configured to perform a two-part deep learning tracker. A track start detector can perform an initial identification and localization of one or more portions of an actor, such as the hands of a worker. For each frame of a video, for example, the track start detector can utilize a three-dimensional (3D) convolution based neural network to detect the set of hands of a worker, and determine the location of each detected hand. The location can be estimated in the form of a bounding box (e.g., image axis aligned rectangle) of the detected hand object. A track continuation and end detector can determine if a detected track is continuing in the next frame of the video or if the track has ended. The track continuation and end detector can utilize a lightweight, relatively shallow, fast, low cost two-dimensional (2D) convolution based neural network. The two-dimensional (2D) convolution based neural network can take a sequence of two images at a time to determine if the detected track is continuing. If a track continuation is detected, a new bounding box of the tracked object (e.g., hand) can be generated. The image in the pair of bounding boxes (e.g., previous and new frame) can be used as the initial image and tracked object location for the next pair of images. The track start detection is a harder problem, and therefore the heavy duty three-dimensional (3D) convolution based neural network is needed. On the other hand, once a track start is detected, continuing to track it is a much easier problem. Detecting the continuation of a track is aided by the last known position of the object and also by the motion or lack thereof. Once a track start is detected, for a hand for example, the track start detector does not need to run again until after a track end condition is detected. In one implementation, the track start detector can employ a spatio-temporal 3D convolution with a kernel size of 3x3xd and a stride of 1x1x1. After the 3D convolution layers, a spatial 2D convolution with a kernel size of 3x3 and a stride of 1x1 can be performed on each frame. The spatial 2D convolution can be followed by 1x1 convolutions. After the 1x1 convolutions, a softmax layer predicting a track start or no track start is provided, in parallel with a regression of the bounding box of the track if a track start condition is detected.
For the track continuation and end detector it is expected that the initial image contains at least one instance of the tracked object along with its bounding box. The initial image can be cropped to a padded version of this bounding box and resized to a fixed size to yield the first input image for the shallow neural network. The same bounding box can also be projected to the second image and a more generous padded version of it can be cropped and resized to the same fixed size. The resulting image can go into the shallow neural network as the second input image. The images, cropped and resized, can go through different copies, with different weights but the same architecture, of the same shallow neural network. The outputs of these different copies of the neural network can be concatenated and the concatenated vector can be passed through two fully connected layers. The decision of track continued or track ended, as well as the track bounding box, if track continued, can be learned via regression loss functions.
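A rough PyTorch sketch of such a two-part tracker is shown below. The layer widths, clip length, and input resolution are assumptions; only the overall structure (a 3D convolution based track start detector with a classification head and bounding box head, and a pair of shallow 2D convolution towers whose concatenated features feed two fully connected layers for track continuation) follows the description above.

```python
# Rough sketch of the two-part tracker; all layer sizes are illustrative.
import torch
import torch.nn as nn

class TrackStartDetector(nn.Module):
    def __init__(self, clip_len=8):
        super().__init__()
        # Spatio-temporal 3D convolution over a short clip of frames.
        self.spatio_temporal = nn.Conv3d(3, 16, kernel_size=(clip_len, 3, 3),
                                         stride=1, padding=(0, 1, 1))
        self.spatial = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.pointwise = nn.Conv2d(32, 32, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.start_head = nn.Linear(32, 2)   # track start vs. no track start
        self.bbox_head = nn.Linear(32, 4)    # bounding box (x, y, w, h)

    def forward(self, clip):                  # clip: (N, 3, T, H, W)
        x = torch.relu(self.spatio_temporal(clip)).squeeze(2)  # (N, 16, H, W)
        x = torch.relu(self.spatial(x))
        x = torch.relu(self.pointwise(x))
        x = self.pool(x).flatten(1)
        return self.start_head(x), self.bbox_head(x)

class TrackContinuationDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Two copies with the same architecture but independent weights.
        self.tower_prev = self._tower()
        self.tower_next = self._tower()
        self.fc1 = nn.Linear(2 * 32, 64)
        self.fc2 = nn.Linear(64, 2 + 4)       # continued/ended + new box

    @staticmethod
    def _tower():
        return nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, prev_crop, next_crop):  # each crop: (N, 3, H, W)
        feats = torch.cat([self.tower_prev(prev_crop),
                           self.tower_next(next_crop)], dim=1)
        out = self.fc2(torch.relu(self.fc1(feats)))
        return out[:, :2], out[:, 2:]         # decision logits, bounding box

# Shape check with random tensors standing in for cropped video frames.
starts, boxes = TrackStartDetector()(torch.randn(1, 3, 8, 64, 64))
logits, box = TrackContinuationDetector()(torch.randn(1, 3, 64, 64),
                                           torch.randn(1, 3, 64, 64))
```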

[00123] At 2120, one or more ergonomic factors including one or more indicators of at least one of one or more cycles, one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters can be accessed. The indicators of the cycles, processes, actions, sequences, objects, and parameters of the one or more ergonomic factors can include one or more moment of work limits, hazard scores for corresponding work zones, and or the like. In one implementation, the one or more engines 170 can be configured to retrieve the ergonomic factors from one or more data structures stored on one or more data storage units 175, or from one or more additional data sources such as an Occupational Safety and Health Administration (OSHA) database.

[00124] In one implementation, a workspace can be divided into a plurality of zones based on distances and angles from a fixed reference position of an actor. Hazard scores can be assigned to the plurality of zones, where for example zones near the actor can receive a low hazard score and zones far away or at an unnatural angle of reach can receive a high hazard score, as illustrated in FIG. 18.
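A minimal sketch of such zone-based hazard scoring is shown below. The zone boundaries, angle limits, and scores are illustrative placeholders rather than regulatory values.

```python
# Minimal sketch of assigning hazard scores to workspace zones by distance and
# angle from a fixed reference position of the actor. Thresholds are assumed.
import math

def hazard_score(x, y, ref=(0.0, 0.0)):
    """Score a reach point: low near the actor, higher when far away or at an
    unnatural angle of reach (far to the side or behind)."""
    dx, dy = x - ref[0], y - ref[1]
    distance = math.hypot(dx, dy)
    angle = abs(math.degrees(math.atan2(dy, dx)))   # 0 = straight ahead
    score = 1                                        # comfortable zone
    if distance > 0.5 or angle > 60:
        score = 2                                    # moderate zone
    if distance > 0.8 or angle > 120:
        score = 3                                    # high-hazard zone
    return score

print(hazard_score(0.3, 0.1))    # 1 - close and in front of the actor
print(hazard_score(0.7, 0.2))    # 2 - a longer but still frontal reach
print(hazard_score(0.2, -0.9))   # 3 - far and at an awkward angle
```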

[00125] At 2130, the one or more data sets can be statistically analyzed based on the one or more ergonomic factors to determine an ergonomic data set. In one implementation, the one or more engines 170 can be configured to analyze the one or more data sets based on the one or more ergonomic factors to generate reach over time studies (e.g., distance), motion over time studies (e.g., distance and frequency), repetitive motion studies (e.g., distance, frequency and count), torque and load studies (e.g., force, frequency and count), and or the like.

[00126] At 2140, the ergonomic data set can optionally be stored in one or more data structures. The stored ergonomic data can be indexed to one or more corresponding portions of the one or more sensor streams. In one implementation, the ergonomic data set can be indexed to one or more corresponding portions of the one or more sensor streams by corresponding time stamps. In one implementation, the one or more engines 170 can store the ergonomics data in one or more data structures on one or more data storage units 175. In one implementation, the ergonomics data and the corresponding portions of the one or more sensor streams indexed by the ergonomics data can be blockchained to protect the integrity of the data therein. The blockchaining can be applied across the cycles, sensor streams, stations, supply chain and or the like.
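As a non-limiting illustration of chaining the ergonomic records to the sensor stream segments they index, the following sketch uses a plain SHA-256 hash chain; the record fields, timestamps, and segment digests are hypothetical, and a production blockchain implementation could differ substantially.

```python
# Minimal sketch of chaining ergonomic records to the sensor-stream segments
# they index, so later tampering is detectable. This is a plain hash chain,
# shown only to illustrate the idea of blockchaining the indexed data.
import hashlib
import json

def chain_records(records, previous_hash="0" * 64):
    """Each block carries the hash of the previous block and a record that
    indexes (by timestamp and segment digest) into the sensor stream."""
    chain = []
    for record in records:
        payload = json.dumps(record, sort_keys=True)
        block_hash = hashlib.sha256(
            (previous_hash + payload).encode()).hexdigest()
        chain.append({"record": record, "hash": block_hash,
                      "previous_hash": previous_hash})
        previous_hash = block_hash
    return chain

records = [
    {"timestamp": "2018-11-05T08:00:01", "max_reach_m": 0.92,
     "stream_segment_sha256": hashlib.sha256(b"segment-1").hexdigest()},
    {"timestamp": "2018-11-05T08:00:31", "max_reach_m": 0.88,
     "stream_segment_sha256": hashlib.sha256(b"segment-2").hexdigest()},
]
print(len(chain_records(records)))   # 2 linked blocks
```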

[00127] In one implementation, the ergonomic data set and corresponding portions of the one or more sensor streams indexed by the ergonomic data can be stored for a record of the work performed, compliance with ergonomic regulations, and or the like, along with the sensor stream data to back it up.

[00128] At 2150, at least one of one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters of the subject can be adjusted based on the ergonomic data set. In one implementation, one or more processes, actions, sequences, objects, parameters and or the like of the subject can be adjusted in addition to storing the ergonomics data set, or as an alternative to storing the ergonomics data set. In one implementation, the at least one of one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters of the subject can be adjusted based on the ergonomic data set to achieve improved placement of tools in the work space, promote working style best practices, and or the like.

[00129] At 2160, one of a plurality of actors can be selected based on the ergonomic data set. In one implementation, the actor can be selected based on the ergonomic data set in addition to storing the ergonomics data set or adjusting the at least one of one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters of the subject. Similarly, the actor can be selected based on the ergonomic data set as an alternative to storing the ergonomics data set or adjusting at least one of one or more processes, one or more actions, one or more sequences, one or more objects and one or more parameters of the subject.

[00130] A longstanding problem in the field of employee management is the need to assign workers to tasks in a way that optimizes efficiency. The assignment problem, as it is known, has traditionally been described as assigning J (a set of jobs) to W (a set of workers) so that each worker performs only one job and each job is assigned to only one worker, all while minimizing the cost, as defined by the nature of the business problem. Costs might include worker wages, time to deliver the product to the customer, product quality, and combinations of these cost functions.

[00131] Table 5 illustrates a simple example of the Assignment Problem, where 4 jobs need to be assigned to 4 workers and each worker's ability to perform each job is known.

Table 5

[00132] The assignment problem has traditionally been solved as a linear programming problem using the Simplex method (and derivatives of it, such as the Hungarian method) to do so efficiently in programming environments like Matlab. However, existing formulations for determining employee assignments dramatically simplify the realities of life by assuming that the workers have robot-like properties, and that those properties do not change on a temporal basis. The real world is fundamentally variable, with changes being introduced every minute and on every shift, and measuring these changes is imprecise. This fundamental variance is rarely explicitly modeled in the assignment process.

[00133] Accordingly, embodiments of the present invention continually measure aspects of the real-world, making it possible to describe the performance of an actor (e.g., a human worker or robot) as a distribution function that reflects variations in performance over time. By using a vastly more detailed data set, embodiments of the present invention apply more relevant versions of mathematical programming techniques to more efficiently assign actors to actions. For example, parallel representations or multi-stage optimization techniques may be used to solve the multi-objective problem while finding optimal solutions to the expected cost functions.

[00134] In one example, for a set of m workers w_1, w_2, ..., w_m and a set of n tasks t_1, t_2, ..., t_n, assigning a task t_j to worker w_i has a cost C_ij. A computer implemented sequence of steps automatically determines optimal assignments so that the total cost is minimized. This problem maps to the maximum weighted bipartite matching problem in graphs. There are two sets of vertices corresponding to workers and tasks respectively. Weighted edges run between each worker and each task. Thus there are m x n edges. The weight of the edge between worker i and task j is 1/C_ij. The maximum edge weight (equivalently minimum cost) solution can be obtained via the Hungarian algorithm.

[00135] A neural network may be used to analyze factory floor sensor streams (e.g., videos) to estimate aggregate action completion times for each worker and each action. Using these estimates, estimates for the cost (C_ij) of assigning task j to worker i can be determined. This cost has two aspects. First, the competence mismatch takes the average (over all observations) completion time for task j taken by worker i relative to the same over all workers. In general, the larger the time taken by a worker to complete a task, the more mismatched the worker is to the task. The ergonomics cost is a second aspect of the cost C_ij, where each task has an associated effort estimate. This can be estimated as the average completion time for the specific task over all workers, relative to the average completion time of all tasks over all workers. Each worker has a fatigue score which is the sum of the efforts required for all recent tasks.

[00136] For example, one exemplary process of assigning task j to worker i involves the following equations in Table 6:

Table 6

t_ijk = time taken by worker i to complete task j on observation k
T_ij = average of t_ijk over all observations k = average completion time of task j by worker i
c(i, j) = T_ij relative to the average completion time of task j over all workers = competence mismatch cost
E_j = average completion time of task j over all workers, relative to the average completion time of all tasks over all workers = effort required for task j
f_i = sum of the efforts E_j of the recent tasks performed by worker i = fatigue score for worker i
e(i, j) = E_j x f_i = ergonomic cost
C(i, j) = a * c(i, j) + (1 - a) * e(i, j), where 0 < a < 1; a sets the relative importance between competence mismatch and ergonomics

[00137] According to some embodiments, the problem of assigning resources (e.g., actors) to actions or processes is represented by a linear cost function with linear constraints, and embodiments of the present invention automatically optimize the cost function based on observed data in real time. According to some embodiments, there are more actors than work stations, and the actors are assigned to work stations in shifts where actors rotate through the stations in an optimized manner. Moreover, embodiments of the present invention can consider seniority, actor skill level, actor certification, physical characteristics of actors, quality of work associated with the actor, actor ergonomics, actor endurance or physical fitness, the speed at which an actor completes tasks, and worker compensation (e.g., overtime, differences in wages, etc.). Furthermore, embodiments of the present invention can certify actors based on the observed skill level, the ergonomics, and the speed at which an actor completes tasks, and certified actors may be prioritized over non-certified actors.
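A minimal sketch of this assignment step is shown below. It assumes per-worker, per-task average completion times have already been estimated from the sensor streams, builds a blended cost in the spirit of Table 6, and solves the assignment with the Hungarian method via scipy's linear_sum_assignment. The completion times, fatigue scores, and weighting value are illustrative.

```python
# Minimal sketch: blended competence/ergonomic cost and Hungarian assignment.
# All numeric values below are illustrative placeholders.
import numpy as np
from scipy.optimize import linear_sum_assignment

# T[i, j]: average completion time of task j by worker i (assumed estimates).
T = np.array([[42.0, 55.0, 61.0],
              [38.0, 70.0, 52.0],
              [47.0, 50.0, 58.0]])

competence = T / T.mean(axis=0)            # c(i, j): worker time vs. average
effort = T.mean(axis=0) / T.mean()         # E_j: effort required for each task
fatigue = np.array([1.2, 0.9, 1.0])        # f_i: per-worker fatigue (assumed)
ergonomic = np.outer(fatigue, effort)      # e(i, j) = E_j x f_i

alpha = 0.7                                # weight on competence mismatch
cost = alpha * competence + (1 - alpha) * ergonomic

workers, tasks = linear_sum_assignment(cost)   # Hungarian method
for i, j in zip(workers, tasks):
    print(f"worker {i} -> task {j} (cost {cost[i, j]:.2f})")
```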

[00138] According to some embodiments, actors include both human workers and robots working side-by-side. It is appreciated that robots do not tire as humans do, that the actions of robots are more repeatable than those of humans, and that robots are unable to perform some tasks that humans can perform.

[00139] With regard to FIG. 23, an exemplary computer system 2300 for automatically observing and analyzing actions (e.g., a task or activity) of an actor (e.g., a human worker or robot) based on data previously captured by one or more sensors is depicted, in accordance with that described herein with reference to the engine, but is not limited to such. According to an exemplary manufacturing implementation, a plurality of stations 2330-2340 may represent different work stations along an assembly line. One or more sensors 2315-2325 can be disposed non-intrusively at various positions around one or more of the stations 2330-2340. The same set of one or more sensors 2315-2325 can be disposed at each station 2330-2340, or different sets of one or more sensors 2315-2325 can be disposed at different stations 2330-2340.

[00140] The sensors 2315-2325 can include one or more sensors such as video cameras, thermal imaging sensors, depth sensors, or the like. The sensors 2315-2325 can also include one or more other sensors, such as audio, temperature, acceleration, torque, compression, tension, or the like sensors. Sensor data is processed by CPU 2305, and one or more databases store data structures including, for example, one or more sensor data streams received from the one or more sensors 2315-2325. Database 2310A depicted in FIG. 23 can include one or more data structures for storing detected cycles, processes, actions, sequences, objects, and parameters thereof indexed to corresponding portions of the one or more sensor streams in the sensor stream data structure. The engine back-end unit 180 and/or the analytics unit 185 depicted in FIG. 1 can store the sensor data streams from the one or more sensors 2315-2325 in the database 2310A for storing the one or more sensor data streams by appending the currently received portion of the sensor data streams to the previous portions of the sensor data streams stored in the database 2310A. The engine back-end unit 180 and/or the analytics unit 185 can also store identifiers of the detected cycles, processes, actions, sequences, objects, and parameters thereof indexed to corresponding portions of the current one or more sensor data streams, for example, in database 2310B.

[00141] The sensors 2315-2325 may be configured to continuously monitor the activities of actors 2345-2355, and the data captured by the sensors 2315-2325 can be described according to a distribution function to reflect variations in performance or steps of a process. For example, the sensor data may be provided in a sensor stream including video frames, thermal sensor data, force sensor data, audio sensor data, and/or light sensor data. In this way, embodiments of the present invention are able to apply relevant mathematical programming techniques (e.g., parallel representations or multi-stage optimization techniques) to efficiently assign actors to specific actions. For example, an actor's performance (e.g., actors 2345-2355) may be tracked over time using sensors 2315-2325 to determine/characterize the actor's skill level, the time spent at various stations, the availability of the actor, and/or the actor's physical/ergonomic ability, and mathematical programming techniques may be applied to the sensor data to efficiently assign the actor to an action. The sensor data capturing the actor's performance may be analyzed to determine if the actor is performing better or worse than average, or to determine the actor's competence level in performing actions, such as determining that a product exited the line with incomplete tasks due to a failure of the worker. The task may include performing an atomic or molecular task on an object 2330-2340, for example. According to some embodiments, the actors 2345-2355 are certified according to a worker profile and certificate as depicted in FIG. 29.

[00142] With regard to FIG. 24, an exemplary sequence of computer implemented steps 2400 for automatically observing and analyzing actor (e.g., worker) activity based on observed data (e.g., video frames, thermal sensor data, force sensor data, audio sensor data, and/or light sensor data) is depicted according to embodiments of the present invention. In the embodiment of FIG. 24, it is assumed that actors are assigned to a fixed station, and each station performs a fixed task. At step 2405, sensor data is received at a computing device. The sensor stream includes sensor information obtained from a sensor operable to sense progress of a work task. Step 2410 may optionally be performed according to some embodiments to receive an identity of actors identified within the sensor stream at the computing device. At step 2415, actions performed by an actor that have been recorded within the sensor stream are identified using one or more engines executed by the computing device. At step 2420, the received sensor stream and identities of the recorded actions are stored in the computing device, and the identities of the actions are mapped to the sensor stream. If step 2410 was performed to receive an identity of actors identified within the sensor stream, step 2420 may also include storing the identity of the actors in the computing device. At step 2425, the identified actions performed by the actor are characterized by the one or more engines to produce characterizations for the identified actions. The characterizations may include ergonomics of the actor, a skill level of the actor, and/or a time required for the actor to perform the identified actions.

[00143] With regard to FIG. 25, a block diagram and data flow diagram 2500 of an exemplary computer system that automatically assigns processes or actions (e.g., tasks) to actors (e.g., human workers or robots) in real-time based on observed data (e.g., video frames, thermal sensor data, force sensor data, audio sensor data, and/or light sensor data) is depicted according to embodiments of the present invention. In the embodiment of FIG. 25, it is assumed that actors are assigned to a fixed station, and each station performs a fixed task. The computer system stores and/or receives information including process information 2505 and actor information 2510 which may be stored in one or more data structures.

[00144] Process information 2505 includes a list of processes to be performed and characteristics thereof. Actor information 2510 may include a list of actors available to perform actions and optionally characteristics of the actors. Based on the process information 2505 and the actor information 2510, an optimization step 2515 automatically determines which actor to assign to which task and for how long the task should be performed by the actor to generate list 2520. For example, the optimization step 2515 may include solving one or more cost functions with associated constraints to determine the list of actors assigned to stations. The optimization step 2515 may include determining a job assignment to an actor based on the quality of the work of the operator at a specific station, the speed of the operator at a specific station, and the cumulative ergonomic load on the operator for a given period of time (e.g., a day) across one or more stations. According to some embodiments, the optimization step 2515 uses one or more equations depicted in Table 6 to determine the list of actors assigned to stations 2520. The list of actors assigned to stations 2520 is updated in real-time as new shifts of actors arrive or as actors tire over time. According to some embodiments, the actor information includes a worker profile and certificates as depicted in FIG. 29.

[00145] With regard to FIG. 26, an exemplary sequence of computer implemented steps 2600 for automatically observing actor activity and assigning processes or actions (e.g., tasks) to actors (e.g., human workers or robots) in real-time based on observed data (e.g., video frames, thermal sensor data, force sensor data, audio sensor data, and/or light sensor data) is depicted according to embodiments of the present invention. The steps 2600 may be performed using one or more equations of Table 6 automatically by a processor of a computer system. In the embodiment of FIG. 26, it is assumed that tasks/actions can be moved from one station to another. At step 2605, sensor data is received at a computing device. The sensor stream includes sensor information obtained from a sensor operable to sense progress of a work task or processes. Step 2610 may optionally be performed according to some embodiments to receive an identity of actors identified within the sensor stream at the computing device. At step 2615, actions performed by an actor that have been recorded within the sensor stream are identified using one or more engines executed by the computing device. At step 2620, the received sensor stream and identity of the recorded actions are stored in the computing device, and the identity of the actions are mapped to the sensor stream. If step 2610 was performed to receive an identity of actors identified within the sensor stream, step 2620 may also include storing the identity of the actors in the computing device.

[00146] At step 2625, the identified actions performed by the actor are characterized by the one or more engines to produce characterizations for the identified actions. The characterizations may include ergonomics of the actor, a skill level of the actor, and/or a time required for the actor to perform the identified actions. At step 2630, based on the determined characterizations of the actor performing the actions, an action (e.g., work task) or processes assignment is dynamically determined for the actor in real-time. Step 2630 may include assignment to an actor based on one or more data structures including processes information, a list of actors to assign, and a list of tasks or processes to assign to stations or actors, for example. According to some embodiments, the determined characterizations are used to determine if an actor is certified to a standard. Step 2630 may include moving an actor from one station/task to another station/task, and step 2630 may be repeated over time to automatically optimize the assignment of actors to tasks based on real-time observations of actor performance.

[00147] Referring now to FIG. 27, an exemplary job assignment input user interface 2700 is depicted according to embodiments of the present invention. The job assignment user interface 2700 receives actor line input 2705 and actor shift input 2710 for a list of available associates 2715. For example, the actor line input 2705 may be used to select a specific group or line of actors, and the actor shift input 2710 may be used to select a specific time for assigning jobs. The assign button 2720 is selected to execute a computer-implemented job assignment method as described herein according to embodiments of the present invention to generate a job assignments output as depicted in FIG. 28.

[00148] Referring now to FIG. 28, an exemplary job assignment output 2800 is depicted according to embodiments of the present invention. The job assignment output 2800 is generated using a computer-implemented job assignment method as described herein according to embodiments of the present invention. The output 2800 includes a list of associates 2805 assigned to station assignment 2810. The list of associates 2805 further includes actor skill levels indicating a good fit, an average fit, a bad fit, or not enough data to determine a skill level. The actor skill level (e.g., associate skill level 2805, station assignment 2810, skill fit 2815, station fit 2820, and ergonomic fit 2825) may be determined according to one or more equations depicted in Table 6.

[00149] Referring now to FIG. 29, an exemplary worker profile, in accordance with aspects of the present technology, is shown. The proficiency of a worker can be measured during the contextual training and reported to one or more additional data sources. In one implementation, the one or more engines 170 can report one or more parameters measured during the contextual training to an employee management system for use in a worker profile. In another implementation, the action recognition and analytics system 100, 500 can also utilize the one or more parameters measured during the contextual training for line balancing, programmatic job assignments, and other similar functions.

[00150] Referring now to FIG. 30, a block diagram of an exemplary computing device upon which various aspects of the present technology can be implemented is shown. In various embodiments, the computer system 3000 may include a cloud-based computer system, a local computer system, or a hybrid computer system that includes both local and remote devices. In a basic configuration, the system 3000 includes at least one processing unit 3002 and memory 3004. This basic configuration is illustrated in Figure 30 by dashed line 3006. The system 3000 may also have additional features and/or functionality. For example, the system 3000 may include one or more Graphics Processing Units (GPUs) 3010. Additionally, the system 3000 may also include additional storage (e.g., removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in Figure 30 by removable storage 3008 and non-removable storage 3020.

[00151] The system 3000 may also contain communications connection(s) 3022 that allow the device to communicate with other devices, e.g., in a networked environment using logical connections to one or more remote computers. Furthermore, the system 3000 may also include input device(s) 3024 such as, but not limited to, a voice input device, touch input device, keyboard, mouse, pen, touch input display device, etc. In addition, the system 3000 may also include output device(s) 3026 such as, but not limited to, a display device, speakers, printer, etc.

[00152] In the example of Figure 30, the memory 3004 includes computer-readable instructions, data structures, program modules, and the like associated with one or more various embodiments 3050 in accordance with the present disclosure. However, the embodiment(s) 3050 may instead reside in any one of the computer storage media used by the system 3000, or may be distributed over some combination of the computer storage media, or may be distributed over some combination of networked computers, but is not limited to such.

[00153] It is noted that the computing system 3000 may not include all of the elements illustrated by Figure 30. Moreover, the computing system 3000 can be implemented to include one or more elements not illustrated by Figure 30. It is pointed out that the computing system 3000 can be utilized or implemented in any manner similar to that described and/or shown by the present disclosure, but is not limited to such.

[00154] The foregoing descriptions of specific embodiments of the present technology have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the present technology and its practical application, to thereby enable others skilled in the art to best utilize the present technology and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.