Title:
SEMANTICALLY SUPPORTED OBJECT RECOGNITION TO PROVIDE KNOWLEDGE TRANSFER
Document Type and Number:
WIPO Patent Application WO/2021/041755
Kind Code:
A1
Abstract:
Systems and methods for knowledge transfer using augmented reality. A stream of images from a camera, recorded in first-person perspective, is sent to a processor to analyze object types, their location on the frame and associated detection accuracy. The response from the processor is combined with information from a semantic data model stored in memory, that is then used to detect actions using an activity recognition engine. A structured set of work instructions is generated and stored in the memory for the task. The structured set of work instructions may then be used in learning mode to instruct new users and to give them feedback.

Inventors:
KRITZLER MAREIKE (US)
GARCIA GARCIA ELVIA KIMBERLY (US)
Application Number:
PCT/US2020/048312
Publication Date:
March 04, 2021
Filing Date:
August 28, 2020
Assignee:
SIEMENS AG (DE)
SIEMENS CORP (US)
International Classes:
G06K9/00
Foreign References:
US20140310595A12014-10-16
Other References:
MATSUFUJI AKIHIRO ET AL: "A Method of Action Recognition in Ego-Centric Videos by Using Object-Hand Relations", 2018 CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE (TAAI), IEEE, 30 November 2018 (2018-11-30), pages 54 - 59, XP033478453, DOI: 10.1109/TAAI.2018.00021
GEORGIOS KAPIDIS ET AL: "Egocentric Hand Track and Object-Based Human Action Recognition", 2019 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, CLOUD & BIG DATA COMPUTING, INTERNET OF PEOPLE AND SMART CITY INNOVATION (SMARTWORLD/SCALCOM/UIC/ATC/CBDCOM/IOP/SCI), 1 August 2019 (2019-08-01), pages 922 - 929, XP055751310, ISBN: 978-1-7281-4034-6, DOI: 10.1109/SmartWorld-UIC-ATC-SCALCOM-IOP-SCI.2019.00185
Attorney, Agent or Firm:
BRINK JR., John D. (US)
Claims:
WHAT IS CLAIMED IS:

1. A method for providing knowledge transfer for a task, the method comprising: capturing, with a head mounted device, image data during performance of the task by a first user; detecting, by an object recognition engine in the image data during the capturing, one or more objects during performance of the task; identifying, using a semantic data model, actions performed by the first user on the one or more objects; mapping the actions and the objects to a structured set of work instructions; and storing the structured set of work instructions in a knowledge instance for the task.

2. The method of claim 1, further comprising: providing, using the head mounted device, the structured set of work instructions in a training session to a second user.

3. The method of claim 1, wherein each work instruction of the structured set of work instructions comprises at least two objects that are linked to an action.

4. The method of claim 1, wherein the semantic data model comprises functionality data and relationship data for each of the objects.

5. The method of claim 4, wherein the relationship data comprises a plurality of rules defining valid and invalid actions.

6. The method of claim 1, wherein the actions comprise at least hand tool actions.

7. The method of claim 1, wherein identifying actions comprises: receiving object predictions from the object recognition engine; combining the object predictions with rules for the predicted objects from the semantic data model; and identifying actions from the combination.

8. The method of claim 1, wherein capturing image data comprises capturing a continuous stream of images.

9. The method of claim 1, wherein capturing further comprises capturing voice input data, text input data, or voice and text input data.

10. The method of claim 1, wherein the object recognition engine comprises a neural network trained to classify objects in an image.

11. A method for providing knowledge transfer for a task, the method comprising: accessing a structured set of work instructions in a knowledge instance for the task; detecting, by a head mounted device, that a plurality of required parts is available for the structured set of work instructions; providing, by the head mounted device using augmented reality, the structured set of work instructions to a user; monitoring, by the head mounted device, objects and actions in a field of view of the user during performance of the structured set of work instructions; and providing, using the head mounted device, feedback on actions performed by the user.

12. The method of claim 11, wherein the structured set of work instructions comprise a set of recorded actions performed by an experienced user and mapped using a semantic data model.

13. The method of claim 11, wherein feedback further comprises providing voice feedback, text feedback, or voice and text feedback.

14. The method of claim 11, wherein each work instruction of the structured set of work instructions comprises at least two objects that are linked to an action.

15. The method of claim 11, wherein the actions comprise at least hand tool actions.

16. A system for providing knowledge transfer for industrial tasks, the system comprising: a semantic data model stored in a memory, the semantic data model comprising one or more instances for industrial tasks, each instance of the one or more instances comprising at least a list of objects and instructions for performing an industrial task; a head mounted device comprising at least a camera and a display, the head mounted device configured to provide augmented reality visualization of the instructions for performing the industrial task; and an object recognition engine coupled with the head mounted device and the semantic data model and configured to monitor the performance of the industrial task and provide feedback to a user.

17. The system of claim 16, wherein each instance is created by observing, using the head mounted device, a respective industrial task performed by an experienced user.

18. The system of claim 16, wherein the instructions comprise at least hand tool actions.

19. The system of claim 16, wherein the semantic data model further comprises functionality data and relationship data for each object of the list of objects.

20. The system of claim 16, wherein each instruction comprises at least two objects that are linked to an action.

Description:
SEMANTICALLY SUPPORTED OBJECT RECOGNITION TO PROVIDE

KNOWLEDGE TRANSFER

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application

No. 62/893,239 filed August 29, 2019, which is hereby incorporated by reference in its entirety and relied upon.

FIELD

[0002] The present embodiments relate to semantically supported object recognition to provide knowledge transfer for industrial tasks.

BACKGROUND

[0003] Certain procedures require a worker to manually perform a task and thus require training of the worker. Many of the tasks require specific sequences and correct execution that needs to be learned and memorized. Certain knowledge may be collected by workers over the years while the workers were performing their duties. In order to provide a transfer of this knowledge, the training of workers is typically supported with written documentation, paper training material, computer programs, and the advice and guidance of experienced peers and supervisors. The knowledge of sequences and correct execution is often implicit knowledge stored in the head of seasoned and experienced workers. Portions of the knowledge might get lost when parts of the workforce retire or are no longer available. Transferring this knowledge from an experienced worker to a novice requires a lot of effort and planning.

[0004] The transfer of knowledge may be very time-consuming and may, for example, limit the number of trainees, as the number of and expenses for teachers may be limited. Documentation and written support provide simple repetitions, but are often incomplete since the creator has to remember and reproduce every single step of a procedure. In addition, documentation can be time-consuming to create, which is often used as an excuse for creating no documentation at all. Another problem is that much documentation is not created by the people who perform the actual tasks and is not updated over time, which can lead to deviations and inconsistencies in the process. In addition, the user has to switch between the actual assembly and the reading of the documentation, which can be confusing or incomplete. Finally, exit interviews, which would typically cover a worker’s responsibilities, may also contain incomplete information. This is the case because memories are tied to a context and a location. Without these ties, a person typically only remembers up to 20% of their total knowledge.

[0005] In certain areas, Augmented Reality (AR) has been used to support guidance of certain tasks. Studies have shown that traditional methods such as documentation instructions and videos are outperformed in both time and accuracy (i.e., reduction of mistakes) by AR-assisted assembly. Even though known AR applications may be used to train employees, these systems are hard-coded to support the learning of predefined work steps, making development or adaptation to new tasks complex. The development and generation of AR content is mostly done by coding or through an external application with a GUI, and is therefore time-consuming and carried out apart from the actual work.

SUMMARY

[0006] Systems and methods are provided for users to author routine and non-routine industrial tasks directly on physical objects. The objects involved during a specific task are captured via visual object detection (using either a head mounted device or a camera). The captured task sequences are translated into a semantic data model. Trainees use the created content in order to learn new tasks with direct visual feedback and direct error prevention with the support of augmented reality.

[0007] In a first aspect, a method is provided for knowledge transfer for a task. The method includes capturing, with a head mounted device or a stationary camera, image data during performance of the task by a first user; detecting, by an object recognition engine in the image data during the capturing, actions related to one or more objects during performance of the task; mapping the actions and the objects to a structured set of work instructions using a semantic data model; and storing the structured set of work instructions in a knowledge instance for the task.

[0008] In an embodiment, the method further includes providing, using the head mounted device or a projector, the structured set of work instructions in a training session to a second user.

[0009] In an embodiment, each work instruction of the structured set of work instructions comprises at least two objects that are linked to an action.

[0010] In an embodiment, the semantic data model comprises functionality data and relationship data for each of the objects. The relationship data may comprise a plurality of rules defining valid and invalid actions.

[0011] In an embodiment, the actions comprise at least hand tool actions.

[0012] In an embodiment, mapping actions comprises using predictions from the object recognition engine and combining the predictions with information from the semantic data model to generate the set of instructions.

[0013] In an embodiment, capturing image data comprises capturing a continuous stream of images. In an embodiment, capturing further comprises capturing voice input data, text input data, or voice and text input data.

[0014] In an embodiment, the object recognition engine comprises a neural network trained to classify objects in an image.

[0015] A second aspect provides a method for providing knowledge transfer for a task. The method includes accessing a structured set of work instructions in a knowledge instance for the task; detecting, by a head mounted device, that a plurality of required parts is available for the set of work instructions; providing, by the head mounted device using augmented reality, the set of work instructions to a user; monitoring, by the head mounted device, objects and actions in a field of view of the user during performance of the set of work instructions; and providing, using the head mounted device, feedback on actions performed by the user.

[0016] In an embodiment, the structured set of work instructions comprise a set of recorded actions performed by an experienced user and mapped using a semantic data model.

[0017] In an embodiment, feedback further comprises providing voice feedback, text feedback, or voice and text feedback.

[0018] In an embodiment, each work instruction of the structured set of work instructions comprises at least two objects that are linked to an action.

[0019] In an embodiment, the actions comprise at least hand tool actions.

[0020] A third aspect provides a system for providing knowledge transfer for industrial tasks. The system includes a semantic data model, a head mounted device, and an object recognition engine. The semantic data model is stored in a memory and comprises one or more instances for industrial tasks, each instance of the one or more instances comprising at least a list of objects and instructions for performing an industrial task. The head mounted device comprises at least a camera and a display and is configured to provide augmented reality visualization of the instructions for performing the industrial task. The object recognition engine is coupled with the head mounted display and the semantic data model and is configured to monitor the performance of the industrial task and provide feedback to a user.

[0021] In an embodiment, each instance is created by observing, using the head mounted device, a respective industrial task performed by an experienced user.

[0022] In an embodiment, the instructions comprise at least hand tool actions.

[0023] In an embodiment, the semantic data model further comprises functionality data and relationship data for each of the objects.

[0024] In an embodiment, each work instruction of the structured set of work instructions comprises at least two objects that are linked to an action.

[0025] Any one or more of the aspects described above may be used alone or in combination. These and other aspects, features and advantages will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings. The present invention is defined by the following claims, and nothing in this section should be taken as a limitation on those claims. Further aspects and advantages of the invention are discussed below in conjunction with the preferred embodiments and may be later claimed independently or in combination.

[0026] BRIEF DESCRIPTION OF THE DRAWINGS

[0027] The components and the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views. The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

[0028] Figure 1 depicts an example system for semantically supported object recognition to provide knowledge transfer according to an embodiment.

[0029] Figure 2 depicts an example method for semantically supported object recognition to provide knowledge transfer according to an embodiment.

[0030] Figure 3A depicts an example of bounding boxes according to an embodiment.

[0031] Figure 3B depicts an example of valid connections according to an embodiment.

[0032] Figure 4 depicts an example ontology according to an embodiment.

[0033] Figure 5 depicts an example ontology according to an embodiment.

[0034] Figure 6 depicts an example ontology according to an embodiment.

[0035] Figure 7 depicts an example detection of statuses of a component according to an embodiment.

[0036] Figures 8A, 8B, 8C, and 8D depict an example workflow for authoring according to an embodiment.

[0037] Figures 9A, 9B, and 9C depict an example workflow for authoring according to an embodiment.

[0038] Figure 10 depicts an example view of the authoring mode according to an embodiment.

[0039] Figure 11 depicts an example view of multiple actions detected according to an embodiment.

[0040] Figure 12 depicts an example method for semantically supported object recognition to provide knowledge transfer according to an embodiment.

[0041] Figure 13 depicts an example view of the learning mode according to an embodiment.

[0042] Figure 14 depicts an example view of the learning mode according to an embodiment.

[0043] Figures 15A and 15B depict example stages of the learning mode according to an embodiment.

[0044] Figure 16 depicts an example system for semantically supported object recognition to provide knowledge transfer according to an embodiment.

DETAILED DESCRIPTION

[0045] Embodiments provide AR on a head mounted device and semantic technologies to record and transfer implicit human knowledge. Embodiments provide both an authoring mode (recording of knowledge) and a learning mode (transferring of knowledge). The authoring mode provides for knowledge about a task to be acquired and stored. For the authoring mode, an experienced worker wears a head mounted device while completing a task. The head mounted device captures the actions of the worker and the interactions of objects used in the task. An object detection engine detects each of the objects used and the actions performed thereon. The actions and objects are mapped to a structured set of work instructions and stored using a semantic data model. The authoring mode is provided without any additional programming skills or explicit expression of the actions and parts.

[0046] The learning mode provides for transferring the knowledge stored during the authoring mode to a new user. Once the instructions are mapped using the semantic data model, a novice user is provided with a head mounted device in order to learn the task. The head mounted device provides an augmented reality visualization of the instructions generated during the authoring mode. The user is guided through the completion of the task by the set of instructions that were generated by a worker that is experienced in performing the task. During the learning mode, the object recognition engine of the head mounted device may be configured to monitor each action of the novice worker. Feedback is provided regarding the performed actions, comparing the monitored actions with the stored procedures. If an action was correctly performed, the novice worker is guided to the next step. If an action was incorrect, the novice worker receives visual instructions indicating the issue until the action is executed correctly.

[0047] Embodiments provide a step-by-step sequence for both authoring and learning a task without any additional programming skills or explicit expression of the actions and parts. In an example application, when an employee retires or leaves a company, it is valuable for the company to store the employee’s knowledge in order to share it with the younger generation. The employee may use the described system or method to quickly and easily record procedures they are tasked with, without having to do any programming or write down instructions that may be difficult to interpret. As a result, a procedure performed by the departing employee is painlessly defined and stored in a database. Subsequent to the knowledge acquisition, and potentially after the employee leaves, the stored information may be used by the learning system to automatically generate a learning environment in which a new workforce can learn the recorded procedure at their own pace. Additionally, for knowledge transfer, AR may also improve the learning process through monitoring and the use of virtual examples. For example, instead of a static and non-responsive document, the learning mode is configured to monitor and provide helpful feedback during both the authoring and learning modes.

[0048] Figure 1 depicts an example system 100 for authoring and learning implicit knowledge. Figure 1 includes a worker / user 115, an augmented reality device (AR device / HMD) 125, and a server 145. The worker / user 115 may be a human or may be a machine, robot, or other entity capable of performing one or more visually distinctive tasks. The AR device may be an HMD 125 or, for example, a projector / camera system. In an embodiment, the worker wears the HMD 125 for authoring and learning tasks. The HMD 125 acquires and records image data using a camera 110. The AR device 125 / HMD 125 provides AR visualizations using the acquired image data. The HMD 125 further provides activity recognition for both the authoring and learning components. The server 145 assists with the activity recognition by providing a semantic data model 135. The server 145 includes a database, a reasoning engine, and an object detection engine. The server 145 and the speech to text component 165 may be located remotely or in the cloud.

[0049] The proposed system operates in two modes: an authoring mode and a learning mode. The authoring mode includes workflow generation and feedback generation. The learning mode includes learning environment generation, feedback generation, and animation generation. In an embodiment, both modes use the HMD 125, which provides workplace-independent usage that allows experts and trainees to perform actions with both hands. The HMD 125 is a display / interface that may either be directly worn on the user’s head, as part of a helmet, or projected into the environment. Projection mapping or projected augmented reality may be used if, for example, a camera / projector is used instead of an HMD. For projection AR, the projector projects a beam of light onto the work surface or, for example, directly on the parts with which the user is interacting. In an embodiment, the HMD 125 includes one or more locatable cameras 110 that are mounted on the device. The HMD 125 is configured to acquire and record video / image data that allows applications to retrieve images of the user’s view in real time. In addition, to allow applications to reason about the position of the device in the scene, each image frame or video may be annotated with additional information about the perspective projection of the camera 110 and its location in the world.

[0050] In an embodiment, an HMD 125 uses spatial mapping to create a 3D mesh of polygons to model the near environment around the device. The spatial mesh can be used to render holograms at spatial positions in a room or to occlude holograms behind real objects, as well as for physics interactions between holograms and the real world. In addition, ray casting may be used by the HMD 125 to determine the distance to a specific point on the real-world surface. Using ray casting, a ray is cast from a given position in a specified direction and gives information about the traveled path and collisions with other objects such as the spatial mesh or holograms.

[0051] Because previous interaction methods such as keyboards and controllers hindered an immersive AR experience, other options may be used for the HMD 125 such as a gaze method, a gesture method, or voice input. The first input method, gaze, displays a cursor, similar to a mouse cursor on a computer, in the user’s line of sight. In an embodiment, the HMD 125 uses eye tracking to determine an object or direction the user 115 is looking at. If eye tracking is not available, the gaze may be projected onto the center point which the user 115 sees when looking through the lens of the HMD 125. The gaze point is thus affected only by head movement.

[0052] To enable interaction with mixed reality, hand gestures may also be used with an HMD. For example, an air tap gesture enables tapping similar to a click of a mouse. This gesture is used to interact with objects after gazing at them. The gesture is performed by pressing thumb and forefinger together. A bloom gesture is reserved by the system for going ‘home’ to the main menu. The gesture is triggered by holding out the hand with the palm up and fingertips together, followed by opening the hand. Other gestures or motions may be used by a user 115 and may be detected by the HMD 125. For example, custom gestures may be used for specific tools or projects.

[0053] Instead of or in addition to gestures, voice commands may be used to confirm actions. A speech to text component 165 may be provided to interpret and translate voice commands. Voice commands may be useful, for example, when the user 115 has no hands free for gestures, for example, when assembling an object. As an example of a voice command, the user 115 gazes at a hologram generated by the HMD 125 and triggers the intended command by speaking. Similarly, global voice commands may be created independently of gazing. In addition, a dictation function may be used to convert speech into text. This may be used as an alternative to the cumbersome text input via a virtual keyboard. Other inputs such as a virtual keyboard may be used. Devices such as augmented reality gloves or haptic feedback smart gloves may be used depending on the task. In an embodiment, an HMD 125 is not used, but rather one or more cameras 110 and displays are configured to acquire image data of the workspace and provide a display for AR to be provided to the user 115. A smartphone or tablet device, for example, may be used in certain embodiments.

[0054] Each of the inputs for the HMD 125 allows a user 115 to interact with augmentations that are generated by the HMD 125. The HMD 125 uses augmented reality (AR) technologies to superimpose digital information on top of real-world objects and in the real-world environment. AR applications may run on different devices that are equipped with at least a camera and a display, such as tablets or head-mounted devices. For example, an AR application may be projected using a camera as input and a projector as output. AR applications use both object detection in images provided by the devices and hologram or projection technology to superimpose the digital information.

[0055] Object detection is a part of computer vision and image processing that can detect and label instances of real-world objects in images and video streams. Different object detection algorithms may be used, such as you only look once (YOLO), single shot multibox detector, and MobileNets, among others.

[0056] In an embodiment, machine learning is used for object detection. In an example, a convolutional neural network (CNN) is used to learn and detect features in an image. The output of the CNN for image classification may be an n-dimensional vector containing the probabilities for each of the n classes to be the most prominent object in the scene. A bounding box may be used with a CNN to detect and localize multiple objects. In addition, or alternatively, a region-based object detector may be used that follows a two-step process: region proposals are generated, followed by classifying each proposal into the different object categories. In an example, a window in various sizes and aspect ratios is slid over an image to create region proposals so that objects of different sizes may be detected. Subsequently, the regions are classified with a CNN.
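
For illustration only, the following Python sketch shows how a region-based detector could serve as the object detection backend described above; the choice of a pretrained torchvision Faster R-CNN and the score threshold are assumptions rather than part of the disclosure.

```python
# Minimal sketch of a detection backend: a pretrained region-based detector
# (Faster R-CNN from torchvision) returns classes, scores, and bounding boxes
# for one camera frame.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# torchvision >= 0.13; older versions use pretrained=True instead of weights=
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_objects(image_path, score_threshold=0.5):
    """Return (class_id, score, [x1, y1, x2, y2]) tuples for one image."""
    frame = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([frame])[0]
    results = []
    for label, score, box in zip(prediction["labels"],
                                 prediction["scores"],
                                 prediction["boxes"]):
        if float(score) >= score_threshold:
            results.append((int(label), float(score), box.tolist()))
    return results
```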

[0057] In an embodiment, a single shot object detector is used. As opposed to the region-based object detector described above, a single shot object detector is fed an image only once through a single network that outputs bounding box coordinates and class predictions. Single shot detectors (SSDs) are much quicker than region-based object detectors, but lose some accuracy. YOLO is a single shot detector that improves the performance and accuracy of the detection. YOLO treats the detection as a regression problem by dividing the input image into an SxS grid to determine which cell is responsible for a prediction by looking at the center of the object. For each grid cell, a fixed number of bounding boxes with their associated box confidence scores are predicted. The confidence score indicates an objectness score and the accuracy of the bounding box location. In an embodiment, YOLO is used with one or more additional features, for example a model for hand recognition.
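
The grid-cell responsibility and box confidence score described above can be illustrated with the short sketch below; the grid size S and the coordinate convention are assumptions, and this is not the YOLO implementation itself.

```python
def responsible_cell(box, image_w, image_h, S=7):
    """Return the (row, col) of the SxS grid cell containing the box center,
    i.e. the cell responsible for predicting this object in a YOLO-style detector."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2.0               # center of the object
    cy = (y1 + y2) / 2.0
    col = min(int(cx / image_w * S), S - 1)
    row = min(int(cy / image_h * S), S - 1)
    return row, col

def box_confidence(objectness, iou_with_ground_truth):
    """Box confidence score: objectness times localization accuracy (IoU)."""
    return objectness * iou_with_ground_truth
```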

[0058] The output of the object detection is a classification of one or more objects that are in the field of view of the user 115. The classifications may be input into the semantic data model 135, which adds meaning and relationships to the one or more objects. Semantic technologies are used to create machine-understandable representations of the objects or actions for the procedure. Using the semantic data model 135, a newly recognized object is related to other pieces of information or other objects, creating a description of the data, i.e., giving meaning to it. The semantic knowledge model is used to represent object data by adding meaning to the object classification of the objects detected by the object detection engine. While in ordinary database systems context is only defined by the structure, such as properties and relationships with other objects, here the semantic modeling is based on an inherent structure where properties of data may be linked to other objects. The data is represented in a meaningful way within the context of the model itself. For this purpose, the semantic knowledge model uses a graph-based approach to represent the data. When input with enough data regarding a task from the object recognition engine, the semantic knowledge model becomes a usable ontology that provides an extensible vocabulary, leading to interoperability by enforcing the usage of a specific terminology. The ontology is used to support the activity recognition engine, providing function, purpose, and relationships to the recognized objects in the acquired images.

[0059] The semantic data model 135 includes three main components: an interface, an inference engine, and a knowledge store / database. The first component is the knowledge base. The knowledge base stores the knowledge of the system (i.e., ontologies) and may be deployed publicly on the web or on a local server 145. The knowledge base may be shared between multiple different systems and across different platforms. One example is a companywide knowledge base that includes data and rules for each and every possible component used by the company or division. The knowledge base is configured to solve domain-specific problems using one or more stored rules. In an example, a rule includes an IF and a THEN part with two logical expressions. An example rule would be: if x is a janitor, then x is a human. Using rules, new knowledge may be deduced and added to the knowledge base from existing facts stored in the system. The inference engine (also referred to as a reasoning engine) provides a methodology to inference and formulate conclusions about the existing knowledge in the knowledge base. The inference engine operates in two modes. The forward chaining mode tries to match rules and infer new knowledge until a given goal is reached or no further knowledge can be deduced. Backward chaining starts from a final goal and tries to find the initial path to that goal. The user interface provides the interaction of a user 115 with the knowledge base. In an embodiment with a pure machine-to-machine interaction, this component might not be used.
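
As a toy illustration of the IF/THEN rules and forward chaining described above (the fact and rule encoding below is an assumption, not the knowledge base format of the disclosure):

```python
# Toy forward-chaining reasoner over IF/THEN rules, mirroring the janitor example.
def forward_chain(facts, rules):
    """facts: set of (subject, predicate) pairs; rules: list of (if_predicate, then_predicate)."""
    facts = set(facts)
    changed = True
    while changed:                                   # repeat until no new knowledge can be deduced
        changed = False
        for if_pred, then_pred in rules:
            for subject, predicate in list(facts):
                if predicate == if_pred and (subject, then_pred) not in facts:
                    facts.add((subject, then_pred))  # deduce and store new knowledge
                    changed = True
    return facts

facts = {("x", "is_a_janitor")}
rules = [("is_a_janitor", "is_a_human")]
print(forward_chain(facts, rules))   # adds ("x", "is_a_human") to the known facts
```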

[0060] The output of the semantic data model 135 provides support for the activity recognition engine. For example, two objects are detected by the object recognition engine. The classification of the objects is passed to the semantic data model 135, which provides functionality and relationship data for the classified objects. The functionality and relationship data are input into the activity recognition engine, which determines an action that is being performed or has been performed. The action is stored in the database for future use. The system thus learns from watching actions performed by a user 115. A user 115 may thus be able to author a set of instructions for a procedure without having to program or code individual relationships between objects or specify instructions.

[0061] Referring back to Figure 1, the authoring mode enables experts to record a procedure without programming or coding. A user 115 performs a workflow once and the system creates instructions automatically. Activity recognition is used to detect different kinds of actions, such as attachment actions, hand actions, and tool actions, with the support of the semantic data model 135. During the recording of a new procedure, the expert may receive feedback about what the system recognizes and is able to correct the system as needed. The feedback varies depending on which action was performed. Attachment steps may be rendered to show virtual examples of the desired result, while tool actions may be highlighted to show the return point of the tool using virtual arrows. Procedure steps may be annotated with additional information and warnings using voice commands. The procedure is stored in a knowledge repository, containing the procedure steps in the logical order in which they were executed.

[0062] When creating or authoring new procedures, the HMD 125 stores the workflow in the semantic data model 135. A user 115 may add information and warnings to a procedure step after it has been recorded. To do this, a speech to text API may be used. Actions are detected through the activity recognition component. The activity recognition component includes an object detection engine and semantic information from a semantic data model 135. To provide the component with real-time information about objects in the scene, a continuous stream of images from the HMD 125 camera, recorded in first-person perspective, is sent to the object recognition backend to analyze object types, their location on the frame, and the associated detection accuracy. Subsequently, the response from the object detection backend is combined with information from the semantic data model 135, which is then used to identify actions using a developed activity recognition algorithm. The activity recognition process may be initiated with every response from the object detection backend. Provided with information from the activity recognition engine, as well as spatial mapping, the feedback generation engine suggests the detected action. If several actions are recognized, the HMD 125 displays a conflict resolution menu that the expert must resolve.
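
The per-frame authoring pipeline described above may be sketched as follows; every function passed in is a hypothetical placeholder for the object detection backend, the semantic data model 135 query, the activity recognition engine, and the feedback generation component.

```python
def authoring_loop(frame_stream, detect_objects, query_rules,
                   recognize_actions, suggest_step, resolve_conflict):
    """Per-frame authoring pipeline: detection -> semantic rules -> activity recognition."""
    for frame in frame_stream:                  # first-person images from the HMD camera
        detections = detect_objects(frame)      # object classes, boxes, confidences
        rules = query_rules(detections)         # restrictions from the semantic data model
        actions = recognize_actions(detections, rules)
        if len(actions) == 1:
            suggest_step(actions[0])            # countdown-based step suggestion
        elif len(actions) > 1:
            resolve_conflict(actions)           # expert resolves the ambiguity
```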

[0063] Figure 2 depicts an example method for authoring. The acts are performed by the system of Figures 1, 16, other systems, a workstation, a computer, and/or a server 145. The method may be performed by an HMD 125 that is configured to record images and provide an augmented reality for a user 115. The method also may be performed using a backend that includes an object detection engine and a semantic model. Additional, different, or fewer acts may be provided. The acts are performed in the order shown (e.g., top to bottom) or other orders. Certain acts may be omitted or changed depending on the results of the previous acts.

[0064] In an embodiment, anybody could teach something to the system as long as there are visually distinguishable tasks. As an example application, using the hand recognition described herein, the application may be used to teach sign language. In an embodiment, a human could author an instance which could be used by a robot or machine that could reproduce the steps.

[0065] At act A110, image data is captured during performance of a task by a trained user 115. A task may involve some sort of interaction between a user 115 and one or more objects. Tasks may include, for example, industrial assembly tasks such as assembling a product. Any task may be captured as long as the task includes visually distinguishable steps. Tasks may, for example, include operating or fixing machinery, installing machinery or electronics, commissioning an object, taking apart an object, etc. Any task may be captured if it includes one or more actions and two or more objects.

[0066] When a user 115 starts the recording subsystem, the user 115 may first choose an existing task or create a new one. For text input, like the name of the task, a virtual keyboard is used. Alternatively, voice input may be used. Afterwards, the expert can create a new procedure for the chosen assembly task, which automatically starts the recording process. Image data is captured using one or more cameras 110, for example mounted on or included with an HMD 125. Since stationary sensors may limit the application area, the detection is done by observing the workspace in first-person perspective using the head mounted device. In an embodiment, a projector and a camera are used instead of an HMD 125.

[0067] At act A120, a detection engine detects objects in the image data in real time during the capturing. The object detection engine may be part of the HMD 125, the backend, or stored elsewhere. The object detection engine is configured to recognize objects on an input image and to output all detected objects with their associated class. The object detection engine may be trained using machine learning and ground truth data for the objects. In an embodiment, a single neural network is applied to an image. The network divides the image into regions and predicts bounding boxes and probabilities for each region. The bounding boxes are weighted by the predicted probabilities. Different neural network configurations and workflows may be used, such as CNN, deep belief nets (DBN), or other deep networks. A CNN learns feed-forward mapping functions while a DBN learns a generative model of data. In addition, a CNN uses shared weights for all local regions while a DBN is a fully connected network (e.g., including different weights for all regions of an image). The training of a CNN is entirely discriminative through backpropagation. A DBN, on the other hand, employs layer-wise unsupervised training (e.g., pre-training) followed by discriminative refinement with backpropagation if necessary. In an embodiment, the arrangement of the trained network is a fully convolutional network (FCN). Alternative network arrangements may be used, for example, a 3D Very Deep Convolutional Network (3D-VGGNet). VGGNet stacks many layer blocks containing narrow convolutional layers followed by max pooling layers. A 3D Deep Residual Network (3D-ResNet) architecture may be used. A ResNet uses residual blocks and skip connections to learn residual mapping.

[0068] The neural network may be defined as a plurality of sequential feature units or layers. Sequential is used to indicate the general flow of output feature values from one layer to input to a next layer. The information from the next layer is fed to a next layer, and so on until the final output. The layers may only feed forward or may be bi-directional, including some feedback to a previous layer. The nodes of each layer or unit may connect with all or only a sub-set of nodes of a previous and/or subsequent layer or unit. Skip connections may be used, such as a layer outputting to the sequentially next layer as well as other layers. Rather than pre-programming the features and trying to relate the features to attributes, the deep architecture is defined to learn the features at different levels of abstraction based on the input data. The features are learned to reconstruct lower level features (i.e., features at a more abstract or compressed level). For example, features for reconstructing an image are learned. For a next unit, features for reconstructing the features of the previous unit are learned, providing more abstraction. Each node of the unit represents a feature. Different units are provided for learning different features.

[0069] Various units or layers may be used, such as convolutional, pooling (e.g., max pooling), deconvolutional, fully connected, or other types of layers. Within a unit or layer, any number of nodes is provided. For example, 100 nodes are provided. Later or subsequent units may have more, fewer, or the same number of nodes. In general, for convolution, subsequent units have more abstraction. For example, the first unit provides features from the image, such as one node or feature being a line found in the image. The next unit combines lines, so that one of the nodes is a corner. The next unit may combine features (e.g., the corner and length of lines) from a previous unit so that the node provides a shape indication.

[0070] To train the object detection engine, training images are first collected for each object that is to be detected. This may include, for example, a parts list for one assembly task or for every possible task. A parts catalog or CAD datastore may be used to train the object detection engine. To train the object detector, the learning is supervised with bounding box annotations. A box is drawn around each object to be detected and each box is labeled with an object class. The training images are then input into the object detector which, depending on its configuration and structure, learns to classify objects in unseen images.
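
A minimal sketch of how such bounding-box-annotated training data could be organized for a detection model; the annotation layout and the torchvision-style target dictionary are assumptions.

```python
# Each sample pairs a training image with its bounding-box annotations in the
# format expected by torchvision detection models.
import torch
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor
from PIL import Image

class PartsDataset(Dataset):
    def __init__(self, samples):
        # samples: list of (image_path, [{"box": [x1, y1, x2, y2], "label": int}, ...])
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, annotations = self.samples[idx]
        image = to_tensor(Image.open(image_path).convert("RGB"))
        target = {
            "boxes": torch.tensor([a["box"] for a in annotations], dtype=torch.float32),
            "labels": torch.tensor([a["label"] for a in annotations], dtype=torch.int64),
        }
        return image, target
```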

[0071] At act A130, actions are identified from the detected objects using a semantic data model 135. An activity detection engine detects actions related to the objects during performance of the industrial task, for example, how the objects interact with one another. Each task may include different actions, for example attachment actions, tool actions, and hand actions. The activity detection engine is configured to recognize various action types to provide an accurate recording. When information provided to the action detection engine results in multiple action detections, the user 115 is presented with the ambiguous action detections and is able to resolve the conflict in a simple way.

[0072] A goal of the action detection engine is to infer actions using information from the object detection engine and the semantic data model 135. The recognition process is triggered after every response by the object detection backend, and for each detected object it is checked whether the detected object is the base component of an action. In an example, there are four basic spatial attachment actions: connected on top, connected on bottom, connected on right, and connected on left. In order to detect actions, the engine uses information about the bounding boxes provided by the object detection model combined with information from the semantic data model 135.

[0073] The semantic data model 135 stores rules for each of the actions. As an example, the rules for the connected on top action may be validated as follows. The object detection engine identifies that element A is attached on top of element B. A first rule states that the top bounding box part of element B must lie between the top and bottom bounding box parts of element A, while the bottom bounding box part of element B remains below the bottom bounding box part of element A. This ensures that element B is not completely inside of element A. The activity recognition engine then checks whether element A contains at least 50% of the width of element B and vice versa, where at least one of the statements must be true. The activity recognition engine distinguishes between the recognition of attachment actions in individual frames and the triggering of actions inside the application. Movements and rearrangements of objects can result in overlapping bounding boxes in individual frames without the user’s intention. In order to trigger attachment actions in the application, a threshold-based approach is used.

[0074] During each step, the activity recognition engine maintains a set of action candidates, detected in one or several frames, that may be triggered in the application. For each valid candidate action j, composed of a base component, target component, and action, a belief value b(j) is provided that denotes the belief that the action should be triggered. To trigger the action, b(j) has to reach an empirically selected threshold value. After each detected frame, b(j) gets either decremented by one when the action was not recognized in the current frame, or incremented if the action was detected. Due to the different and changing viewing angles, it is sometimes unlikely for the originally recognized bounding box to continuously overlap with the target bounding box over several frames. To solve this problem, in an embodiment, the system uses not one, but four different bounding boxes per object: the original box plus three boxes that each include an additional pixel overhead of 10, 20, and 30 on each side, respectively.
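
A minimal sketch of the connected on top check described in paragraph [0073], assuming pixel bounding boxes (x1, y1, x2, y2) with the y-axis pointing downward:

```python
def connected_on_top(box_a, box_b):
    """Geometric test for 'element A connected on top of element B'."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Rule 1: B's top edge lies between A's top and bottom edges, while B's
    # bottom edge stays below A's bottom edge (B is not completely inside A).
    vertical_ok = (ay1 <= by1 <= ay2) and (by2 > ay2)
    # Rule 2: the horizontal overlap covers at least 50% of A's width or of B's width.
    overlap = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    horizontal_ok = overlap >= 0.5 * (ax2 - ax1) or overlap >= 0.5 * (bx2 - bx1)
    return vertical_ok and horizontal_ok
```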

[0075] Figure 3A depicts examples of object recognition and bounding boxes 175. In Figure 3A, bounding boxes 175 are depicted around each detected object. As depicted, there are four separate bounding boxes 175 for each detected object. Instead of performing the recognition on only the original boxes, the activity recognition engine attempts to detect the actions between all four bounding boxes 175 of the base element and the original bounding box 175 of the target element. The activity recognition engine first attempts to detect the action with the original bounding box 175 of the base element and then moves to larger bounding boxes. If the action is detected, e.g. with the second (green) bounding box, the activity recognition engine cancels the search. Depending on the size of the bounding box which detected the action, it has a different impact on the increment of the belief value b(j). The original bounding box has the largest impact on the increment with a value of 1, decreasing by 0.2 per larger bounding box down to an increment of 0.4 for the red bounding box. If multiple actions hit the threshold at the same frame, a random action is declared the winner. After an action has been triggered, all belief values are set to 0 and the process starts again.
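
The threshold-based triggering with enlarged bounding boxes and weighted belief increments could be sketched as follows; the threshold value and the candidate data layout are assumptions, and a geometric test such as connected_on_top above is passed in as the check function.

```python
import random

PADDINGS = (0, 10, 20, 30)        # original box plus three enlarged boxes (pixels per side)
WEIGHTS = (1.0, 0.8, 0.6, 0.4)    # belief increment per box size; the original box counts most
THRESHOLD = 5.0                   # empirically selected trigger threshold (assumed value)

def expand(box, pad):
    x1, y1, x2, y2 = box
    return (x1 - pad, y1 - pad, x2 + pad, y2 + pad)

def update_beliefs(beliefs, candidates):
    """candidates: list of (action_id, base_box, target_box, check_fn).
    Returns the triggered action_id, or None if no belief reached the threshold."""
    for action_id, base_box, target_box, check in candidates:
        increment = -1.0                              # decrement when not recognized this frame
        for pad, weight in zip(PADDINGS, WEIGHTS):
            if check(expand(base_box, pad), target_box):
                increment = weight                    # stop at the smallest box that matches
                break
        beliefs[action_id] = beliefs.get(action_id, 0.0) + increment
    hits = [a for a, *_ in candidates if beliefs.get(a, 0.0) >= THRESHOLD]
    if hits:
        winner = random.choice(hits)                  # random winner if several hit the threshold
        for key in beliefs:
            beliefs[key] = 0.0                        # reset all beliefs after triggering
        return winner
    return None
```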

[0076] When an action is triggered in the application, the next step is to ask the semantic data model 135 whether it is a valid action. Figure 3B depicts examples of valid and invalid connections. The semantic data model 135 contains restrictions that specify which actions can be performed between different components. As an example, a data model may define restrictions that a light can be connected at the bottom with a bracket and that a bracket can be connected at the top with a light, thus making the action valid. If no knowledge is defined for a component, all actions may be considered valid. Restrictions allow the activity recognition engine to ignore certain triggered actions. For example, although certain configurations of the red (box A) and blue (box B) lights in Figure 3B would trigger the connected on left and connected on right actions since their bounding boxes are overlapping, a query on the semantic data model 135 would indicate that these are invalid actions between the components, and they would therefore be discarded.
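
For illustration, a validity check of this kind could be expressed against an RDF graph using rdflib; the namespace URI, the property names, and the simplified "no knowledge defined" test are assumptions.

```python
from rdflib import Graph, Namespace

SSF = Namespace("http://example.org/ssf#")    # hypothetical ontology namespace

g = Graph()
# Restriction: a light may have a bracket connected at its bottom.
g.add((SSF.Light, SSF.validConnectedOnBottom, SSF.Bracket))

def is_valid_action(graph, base_cls, action_prop, target_cls):
    """True if the ontology allows action_prop from base_cls to target_cls,
    or if no restriction at all is defined for the base class."""
    if (base_cls, None, None) not in graph:
        return True                               # no knowledge defined: treat as valid
    return (base_cls, action_prop, target_cls) in graph

print(is_valid_action(g, SSF.Light, SSF.validConnectedOnBottom, SSF.Bracket))   # True
print(is_valid_action(g, SSF.Light, SSF.validConnectedOnLeft, SSF.Light))       # False
```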

[0077] Figure 4 depicts how assembly tasks are represented in an ontology. Figure 4 describes assembly tasks with their associated procedures and procedure steps. Besides a name via the rdfs:label property, an assembly task can have multiple procedures leading to the same goal. The association with procedures is represented with the ssf:hasProcedure property. Like assembly tasks, procedures also have names. Each procedure may include one or multiple procedure steps, which are related through the ssf:procedureStep property. A procedure step includes at least two physical objects, the base and target object, combined with an action. An action always starts from the base object (e.g. hand (base object) - pick up (action) - screwdriver (target object)). The optional atinfo and atwarning properties can be used to add one or more information or warning messages to a procedure step. Finally, in order to determine the order of the procedure steps, the ssf:stepOrder property describes the position of the procedure step in the procedure.
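
A sketch of how one recorded procedure step might be stored as triples (using rdflib; the namespace URI and the baseObject / action / targetObject property names are assumed for illustration, following the structure described above):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

SSF = Namespace("http://example.org/ssf#")    # hypothetical namespace
g = Graph()

g.add((SSF.AssembleSignalingColumn, RDFS.label, Literal("Assemble signaling column")))
g.add((SSF.AssembleSignalingColumn, SSF.hasProcedure, SSF.Procedure1))
g.add((SSF.Procedure1, SSF.procedureStep, SSF.Step1))

# Step 1: hand (base object) - pick up (action) - screwdriver (target object).
g.add((SSF.Step1, SSF.baseObject, SSF.Hand))
g.add((SSF.Step1, SSF.action, SSF.PickUp))
g.add((SSF.Step1, SSF.targetObject, SSF.Screwdriver))
g.add((SSF.Step1, SSF.stepOrder, Literal(1)))
```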

[0078] Figure 5 depicts how actions are represented. In Figure 5, the differentiation between different types of actions can be seen. Hand actions describe actions which are performed by hands without the help of a tool, such as pickup, or open and close actions on sub-parts of a component. Hand tool actions contain actions which are performed with the help of a tool, like screwing with a screwdriver. Attachment actions describe attachments between two components.

[0079] Figure 6 depicts how objects and parts are represented in the ontology. The Phys-Obj class of fonm, which describes object types and their physical properties, is extended with three subclasses. First, assembly objects describe the individual components of the assembly tasks, like load current supply or blue signaling column. The assembly part class expresses sub-parts of assembly objects, like a lid. Finally, the tool object class represents hands and hand tools, like a screwdriver, that are later required for the performance of certain actions. In addition to the classes, some restrictions may be seen, which may be queried by the system during run time to gain information about valid actions. For example, a signaling column is defined such that it can only include other signaling columns attached to the bottom attachment point, and hand tools can only be the target of hand actions and be used for hand tool actions.

[0080] A second type of action supported and required for certain tasks is hand tool actions, like screwing. In an embodiment, hand tool actions are treated on a component level, like screwing a bracket, and not, for example, in terms of single screws and screwing the single screws into the different screw holes. However, in an embodiment, individual screw holes may be defined in the semantic data model 135 if the image acquisition and detection is configured to reliably detect actions in small areas and objects of this size. In addition, without the use of additional sensors it would not be possible, for example, to detect whether a screw is tightened tightly. Hand tool actions are detected by identifying intersections of bounding boxes between tools and other objects. A threshold-based approach may be used by triggering an intersecting action after the corresponding belief value b(j) has been reached. After an intersection action has been triggered, the process proceeds as follows: when bounding boxes A and B intersect on a frame, the semantic data model 135 is queried whether object A is a tool object. If it is a tool object, a check is performed whether object B can be the target of the tool action that can be performed by (tool) object A (e.g. screwing when object A is a screwdriver). In case object A is not a tool, the same questions are checked inversely for object B. Because tools are typically partially covered by a hand during the execution of the actions, it is difficult for the object detection model to continuously recognize the tools in order to enable a reliable hand tool action detection by the application. To improve the detection process, a logic is implemented that allows a hand to inherit the functionality of a tool after it was picked up. Thus, a hand can perform the tool action of the tool it is currently holding. This logic provides a more reliable action detection by detecting the hand instead of the tool during the execution of the hand tool action.
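
The hand-inherits-tool logic described above could be sketched as follows; the class and the semantic-model helper methods are hypothetical placeholders rather than the disclosed implementation.

```python
class Hand:
    """A detected hand that inherits the action of the tool it is currently holding,
    so the (often occluded) tool itself no longer needs to be detected continuously."""

    def __init__(self):
        self.holding = None                     # mirrors the isHolding property in the model

    def pick_up(self, tool):
        self.holding = tool

    def put_back(self):
        self.holding = None

    def possible_tool_action(self, semantic_model, target):
        """Return the tool action the hand can perform on target, if any."""
        if self.holding is None:
            return None
        action = semantic_model.tool_action_of(self.holding)     # e.g. "screwing"
        if semantic_model.can_be_target(target, action):
            return action
        return None
```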

[0081] An example hand action is the pickup action. For the detection, it uses the same intersection approach and belief value threshold as the previous actions. Since a hand should only be able to perform tool actions when a tool has been picked up and is in the hand, the action detection engine must recognize when the tool is being put back in order to remove the capability. For this, the action detection engine stores the position at which the screwdriver was picked up and recognizes when the object is returned to its original position. The tool is often covered by the hand and therefore continuous detection is not possible. In addition, the hand and tool are not always within the camera’s field of view. To see when the tool is returned to its original position, the action detection engine continuously calculates its distance from the pickup position whenever it is visible. If the distance falls below a threshold, the action detection engine determines that the user 115 has returned the tool near or at its original position. The action detection engine then updates the isHolding property of the hand instance in the semantic data model 135 so that it no longer holds the tool and thus loses the ability to perform the tool action. For visual feedback, an arrow may be created at the point where the object was picked up and where the object must be returned, using the calculated world position where the detected tool object was picked up.
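
A sketch of the return-position check, assuming 3D world positions in meters and an illustrative distance threshold:

```python
import math

RETURN_DISTANCE_THRESHOLD = 0.15   # meters; the concrete value is an assumption

def tool_returned(pickup_position, current_position,
                  threshold=RETURN_DISTANCE_THRESHOLD):
    """True once the visible tool is back near the world position where it was picked up."""
    dx, dy, dz = (c - p for c, p in zip(current_position, pickup_position))
    return math.sqrt(dx * dx + dy * dy + dz * dz) < threshold
```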

[0082] Besides the pickup action, the action detection engine implements other types of hand actions, such as manipulation actions. Manipulation actions describe a status change of a part of a component as a result of a hand action, such as the opening or closing of a lid. Since the actions change the appearance of a component, each status of a component is trained separately into the object detection model. In order to map detected objects and their parts to the representation in the semantic data model 135, the object detection labels are defined in a certain structure. After the name of the component, underscores separate the individual parts with their associated status. In order to recognize status changes of components, the action detection engine first keeps lists of all detected objects in a frame for a certain number of previous frames. In an embodiment, the action detection engine stores the detections of the previous 200 frames, which leads to a time span of 40 seconds at a processing rate of 5 frames per second. Since not all objects are always in the field of view when using an HMD 125, a shorter time span would result in status changes often not being recognized. For each newly processed frame, the action detection engine iterates backwards through all lists of previously recognized objects for each recognized object on the frame and checks whether there has been a status change of a component. An example is illustrated in Figure 7. Here, in the current frame n, a CPU with an open lid was detected. When iterating backwards, frame n - 3 contained a CPU with a closed lid. The system then checks whether both objects are at the same physical position to determine that they are the same physical object (i.e. that they are not two different CPUs). After the status change has been detected, a Hand - open - CPU action is triggered in the application.
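
The backward search through the frame history for a status change could be sketched as follows; the label convention follows the underscore structure described above, while the detection dictionary layout and the same_position test are assumptions.

```python
from collections import deque

HISTORY_FRAMES = 200                      # about 40 s of history at 5 frames per second
history = deque(maxlen=HISTORY_FRAMES)    # one list of detections per processed frame

def parse_label(label):
    """Split a label such as 'cpu_lid_open' into the component and its part status."""
    component, *part_status = label.split("_")
    return component, tuple(part_status)

def find_status_change(current_detections, same_position):
    """Return (component, old_status, new_status) if a part's status changed.
    same_position(a, b) decides whether two detections are the same physical object."""
    for det in current_detections:
        component, status = parse_label(det["label"])
        for past_frame in reversed(history):              # iterate backwards in time
            for past in past_frame:
                past_component, past_status = parse_label(past["label"])
                if (past_component == component
                        and past_status != status
                        and same_position(det, past)):
                    return component, past_status, status
    return None

# After processing each frame: history.append(current_detections)
```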

[0083] Figures 8A-D and 9A-C depict an example of the authoring mode. A user 115 is prompted to choose an existing assembly task or create a new assembly task for which a new procedure should be recorded. In Figure 8A, an expert user 115 collects objects for the task. Each object is recognized by the authoring mode. In Figure 8B, the expert starts the process. In Figure 8C, the expert starts with the base, to which, in Figure 8D, the green light is added. The base is identified by the authoring mode as the bottom while the green light is detected as being placed on top of the base. This process continues in Figures 9A-C until the expert is done.

[0084] In order to record an assembly procedure, an application on the HMD 125 is started. The HMD 125 enables workers to use both hands during the performance of the assembly task. The HMD 125 includes a camera to capture the involved objects and assembly steps performed. A backend / server 145 includes a semantic knowledge model in which information is stored. The semantic knowledge model includes previously recorded procedures with their individual work steps as well as semantic information of objects. The semantic information of objects provides, among other things, information about the structure of components. It may describe, for example, that a load current supply has a lid. It also contains information about which actions are allowed between two components. For example, a light may be combined with another light at the top and bottom, a lid may be opened and closed by a hand, and a load current supply may be tightened with a screwdriver.

[0085] During the recording process, the system continuously checks for new procedure steps. A procedure step typically includes two objects that are linked to an action. Images are streamed from the camera to the object detection backend. Predictions from the detection model are combined with information from the semantic data model 135 to infer actions. For example, the system checks certain overlaps of the bounding boxes to identify attachment actions and then checks their validity by querying the semantic data model 135. Besides attachment actions (e.g. connected on left), the system may differentiate between hand tool actions (e.g. screwing) and hand actions (e.g. pick up).
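
As a non-limiting illustration, the following Python sketch combines bounding-box overlap with a semantic-model validity query, as described above. The detection dictionary keys, thresholds, the left/right heuristic, and the is_valid_action call are assumptions for this sketch.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area divided by the smaller box's area.

    Boxes are (x_min, y_min, x_max, y_max) in frame coordinates.
    """
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    if min(area_a, area_b) <= 0:
        return 0.0
    return (ix * iy) / min(area_a, area_b)


def infer_attachments(detections, semantic_model, overlap_threshold=0.5,
                      belief_threshold=0.6):
    """Propose attachment steps from detector output plus semantic rules.

    detections: list of dicts with 'label', 'box', 'score' from the detector.
    """
    steps = []
    confident = [d for d in detections if d["score"] >= belief_threshold]
    for i, a in enumerate(confident):
        for b in confident[i + 1:]:
            if overlap_ratio(a["box"], b["box"]) < overlap_threshold:
                continue
            # Crude geometric heuristic: relative horizontal position of box centers.
            ax = (a["box"][0] + a["box"][2]) / 2
            bx = (b["box"][0] + b["box"][2]) / 2
            action = "has connected on left" if ax < bx else "has connected on right"
            # Query the semantic data model to check whether this attachment is allowed.
            if semantic_model.is_valid_action(a["label"], action, b["label"]):
                steps.append((a["label"], action, b["label"]))
    return steps
```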

[0086] When an action is detected, the system starts a procedure suggestion process that shows the detected step as a triplet in the upper right corner. Figure 10 depicts an example of when an action is detected. When a countdown expires, the step is stored in the semantic data model 135. During the countdown, the user 115 can cancel the suggestion process (e.g. to correct the system). In case an attachment action was detected, a digital twin (created from 3D models) is displayed next to the detected objects that are part of the procedure step. The 3D models may be imported into the Unity project before building the application. Using voice commands, information or warnings may be added to the previously recorded procedure step by the user 115.
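
As a non-limiting illustration, the suggestion-and-countdown behavior described above could be implemented as a small state holder; the class name, countdown duration, and store_procedure_step call are assumptions for this sketch.

```python
import time


class ProcedureSuggestion:
    """Shows a detected step as a triplet and commits it after a countdown
    unless the user cancels (e.g. to correct the system)."""

    def __init__(self, semantic_model, countdown_seconds=5.0):
        self.semantic_model = semantic_model
        self.countdown_seconds = countdown_seconds  # illustrative duration
        self.pending = None
        self.deadline = None

    def suggest(self, triplet):
        # e.g. ("base", "has connected on top", "green light")
        self.pending = triplet
        self.deadline = time.monotonic() + self.countdown_seconds

    def cancel(self):
        self.pending, self.deadline = None, None

    def update(self):
        """Called every frame; stores the step once the countdown expires."""
        if self.pending and time.monotonic() >= self.deadline:
            self.semantic_model.store_procedure_step(*self.pending)
            self.pending, self.deadline = None, None
```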

[0087] If multiple actions are detected, a conflict handling menu appears that, for example, displays the detected conflict of the attachment action so that it can be resolved. An example can be seen in Figure 11, where the application detected an action conflicting with the “CPU - has connected on right - Digital Input/Output Module” action.

[0088] Another feature of the authoring application is the custom triggering mode, which can be used to manually trigger actions that are not detected. The custom triggering mode is entered using voice commands and is configured to display possible actions of components. First, the initiating component has to be selected, followed by choosing the action to be performed and subsequently selecting the target component of the action.

[0089] At act A140, the identified actions are mapped to a structured set of work instructions and stored in a knowledge instance for the task. The structured set of work instructions may be used in learning mode or may be improved or altered by another author. In an example, multiple sessions with different workers may be recorded for the same procedure. Each process may be evaluated for safety and efficiency. The “best” process may then be used as the training process for other workers. The authoring mode may also be used to evaluate users. For example, each user 115 may use the authoring mode to track their workflow. The stored data may be compared to a “gold” standard or correct way to perform tasks. During each session, feedback may be provided to correct mistakes or provide guidance on better ways to accomplish tasks.
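
As a non-limiting illustration, the structured set of work instructions and the knowledge instance could be represented with simple data classes, as sketched below. The class and field names, and the example task, are assumptions for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkInstruction:
    """One procedure step: two objects linked by an action, plus an optional note."""
    subject: str
    action: str
    target: str
    note: Optional[str] = None  # information or warning added by voice command


@dataclass
class KnowledgeInstance:
    """Structured set of work instructions recorded for one task."""
    task_name: str
    author: str
    instructions: List[WorkInstruction] = field(default_factory=list)

    def add_step(self, subject, action, target, note=None):
        self.instructions.append(WorkInstruction(subject, action, target, note))


# Hypothetical usage: recording two steps of a light-tower assembly session.
session = KnowledgeInstance(task_name="light tower assembly", author="expert_1")
session.add_step("base", "has connected on top", "green light")
session.add_step("screwdriver", "tighten", "load current supply",
                 note="Warning: do not over-tighten")
```

Multiple such instances recorded by different workers for the same procedure can then be compared, for example against a "gold" standard, as described above.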

[0090] The learning mode enables trainees and workers to learn the recorded procedures. The system presents guidance in an easy-to-understand format to lower the cognitive effort required while performing the instructions. During the learning task, a custom feedback area guides the user 115 through the steps of the procedure and provides real-time feedback in the event of a mistake. Visual aids, such as holograms, may be used to assist the user 115 in completing the task.

[0091] Figure 12 depicts an example workflow for learning a task. The acts are performed by the system of Figures 1, 16, other systems, a workstation, a computer, and/or a server 145. The method may be performed by an HMD 125 that is configured to record images and provide an augmented reality for a user 115. The method also may be performed using a backend that includes an object detection engine and a semantic model. Additional, different, or fewer acts may be provided. The acts are performed in the order shown (e.g., top to bottom) or other orders. Certain acts may be omitted or changed depending on the results of the previous acts.

[0092] After a procedure has been recorded in authoring mode as described above, the procedure may be opened with the learning mode. The learning mode provides an automatically generated learning environment that guides the user 115 through the different steps of the procedure. When the learning mode application is started, it queries the semantic data model 135 for all available assembly tasks and procedures, from which the user 115 can then choose. The learning mode application enables trainees to learn recorded procedures in a convenient way using AR. The steps are presented in a way that is easy to understand and are further illustrated by automatically generated animations. Feedback may be used to alert users about wrong attachment steps. In addition, the objects of the current procedure step may be marked in the AR environment in order to facilitate the search for those objects, thus preventing wrong attachments. As such, assembly parts and tool positions are tracked in 3D space in real time.

[0093] At act A210, a structured set of work instructions in a knowledge instance is accessed for the task. When the learning mode application is started, the application queries the semantic data model 135 for all available assembly tasks and procedures, from which the user 115 can then choose. Once a procedure has been selected, the user 115 must determine their feedback location on a display, for example, an HMD 125. The feedback location provides the user 115 with information about the individual steps and guides the user 115 through the procedure. To determine the position, the user 115 drags a virtual sphere to a desired location and submits it with the air tap gesture. Voice commands may be used to adjust the position at a later time.

[0094] At act A220, the application detects that a plurality of required parts is available for the set of work instructions using image data captured with a camera, for example, of an HMD 125. After a procedure has been selected, a parts list appears at the feedback location, listing all required objects of the chosen procedure. Figure 13 depicts an example view of the parts check. The user 115 looks around to scan the environment and see whether all objects are present. When an object is detected, it is automatically marked in the parts list as present. Virtual arrows at the positions of the physical objects indicate the objects’ names to the user 115. After all required objects have been detected, the application proceeds to the first procedure step.
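
As a non-limiting illustration, the parts-check bookkeeping described above could look like the following sketch; the class and method names are assumptions.

```python
class PartsCheck:
    """Tracks which required objects of the chosen procedure have been seen."""

    def __init__(self, required_parts):
        self.status = {part: False for part in required_parts}

    def on_detection(self, label):
        # Mark a required object as present when the detector reports it.
        if label in self.status:
            self.status[label] = True

    def missing(self):
        return [part for part, present in self.status.items() if not present]

    def complete(self):
        # All required objects detected: proceed to the first procedure step.
        return not self.missing()


# Hypothetical usage for a light-tower procedure.
check = PartsCheck(["base", "green light", "blue light", "white light"])
check.on_detection("base")
print(check.missing())  # -> ['green light', 'blue light', 'white light']
```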

[0095] At act A230, the application provides, using augmented reality enabled by the HMD 125, the set of work instructions to a user 115. The application proceeds to the first procedure step and starts continuously monitoring and providing feedback (acts A240 and A250, respectively, described below) for correct and wrong actions. During the rest of the procedure, the user 115 is able to view the instructions at the feedback area.

[0096] At act A240, the application monitors objects and actions in a field of view of the user 115 while the user 115 performs the set of work instructions. As a result of the monitoring, at act A250, feedback is provided using AR provided by the head mounted display. In an embodiment, the feedback area is split into four different regions, which can be seen in Figure 14 and are marked with the numbers 1 to 4, respectively. The first region displays the current procedure step that must be performed. When CAD models of the objects are available, 3D models may be rendered in addition to the static object texts. If the step is an attachment action, basic animations are automatically generated and played back instead of an action text, to facilitate the learning process. If an object required for the current procedure step is in the FOV of the HMD, an arrow appears at its physical location. In addition, the color of the corresponding object text at the feedback location is changed from white to yellow. Since in Figure 14 both objects of the current procedure step are in the field of view, there are two arrows and the object texts are shown in yellow. When the correct action is detected, the object texts turn green and the application continues with the next process step after a few seconds. In addition, the head-up display informs the user 115 about the successful detection. If a correct procedure step from later in the procedure is detected, the user 115 also receives a feedback message at the first region of the feedback area.

[0097] In an embodiment, basic animations are automatically generated for the objects. For this, the 3D models must be available or capable of being generated in real time. CAD models, for example, may be overlaid on the display and rotated or combined to describe the steps of the procedure.
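
As a non-limiting illustration, the white/yellow/green color behavior of the object texts described above could be tracked as a small state holder; the class and method names are assumptions for this sketch.

```python
class StepFeedback:
    """Color state for the object texts of the current procedure step.

    white  -> object not yet detected in the field of view
    yellow -> object detected in the field of view (arrow shown at its location)
    green  -> correct action detected; advance after a short delay
    """

    def __init__(self, step_objects):
        self.colors = {name: "white" for name in step_objects}

    def on_object_in_fov(self, name):
        if self.colors.get(name) == "white":
            self.colors[name] = "yellow"

    def on_correct_action(self):
        for name in self.colors:
            self.colors[name] = "green"
```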

[0098] The second region of the feedback area, which is located on the right side and has the heading “After assembly”, shows the desired result of the objects of the current procedure step together with the components attached in previously completed procedure steps. To generate the compound model, the system iterates backwards through all previous steps to append associated components to the model. The third region is only visible when an invalid attachment action was detected. It renders a 3D model of the incorrect attachment to show the user 115 exactly what was done wrong. In Figure 14, the green light was not placed under the blue light but on top of the white light. The fourth region displays information or warning texts that can be added to individual steps during recording.
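
As a non-limiting illustration, the backward iteration used to build the compound model could be sketched as follows; the triplet step format and the substring test on the action name are assumptions for this sketch.

```python
def build_compound_model(procedure_steps, current_index):
    """Collect the components attached to the current step's objects so far.

    Iterates backwards through all previously completed steps and appends the
    associated components, so that the "After assembly" view shows the desired
    result including earlier attachments. Steps are (subject, action, target).
    """
    subject, _, target = procedure_steps[current_index]
    compound = {subject, target}
    for past_subject, past_action, past_target in reversed(
            procedure_steps[:current_index]):
        # If a previous step attached something to an object already in the
        # compound, include that component as well.
        if "connected" in past_action and (
                past_subject in compound or past_target in compound):
            compound.add(past_subject)
            compound.add(past_target)
    return compound
```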

[0099] Figures 15A and 15B depict an example of the learning process. In Figure 15A, a worker has started the learning application. In Figure 15B, the user 115 grabs the first element and attempts to place it on top of the base. The application provides a warning that the user 115 has made an error.

[00100] Figure 16 depicts an example system 100 for authoring and learning a task using AR. The system 100 includes an AR device 125 and a server 145. The AR device includes a display 102 and at least one camera 110. The server 145 includes a memory 106 and a processor 104. The processor 104 and memory 106 may be part of the AR device 125, a computer, the server 145, a workstation, or other system. A workstation or computer may be used with the system 100. Additional, different, or fewer components may be provided. For example, a computer network may be included for remote processing or storage. As another example, a user input device (e.g., keyboard, buttons, sliders, dials, trackball, mouse, microphone, or other device) may be provided.

[00101] The system 100 is configured to visually capture different work steps just by ‘watching’ a worker using the camera 110. The only sensor needed is the camera of the head mounted device; no further sensors are needed, and the user 115 does not have to change his or her routine. The observed instructions are then translated into a persistent knowledge representation and can be repeated by a novice user 115. The user 115 receives direct feedback after a work step is completed. The user 115 cannot proceed if the task was not executed correctly. The underlying semantic model also allows for the recording and storage of multiple approaches to performing the same task.

[00102] The display 102 is a CRT, LCD, projector, plasma display, printer, tablet, smart phone, or other now known or later developed display device for displaying the output. In an embodiment, the display 102 is part of a head mounted display (HMD 125). The HMD 125 allows for hands-free interaction with objects. The camera 110 is any type of camera. The camera 110 is configured to capture a stream of images from a first-person view. The camera 110 may be included in an HMD 125 as described above. In an embodiment, more than one camera may be used to capture the images. The camera 110 and the display 102 are configured to provide AR.

[00103] The processor 104 is a control processor, image processor, general processor, digital signal processor, three-dimensional data processor, graphics processing unit, application specific integrated circuit, field programmable gate array, artificial intelligence processor, digital circuit, analog circuit, combinations thereof, or other now known or later developed device for processing image data. The processor 104 is a single device, a plurality of devices, or a network. For more than one device, parallel or sequential division of processing may be used. In one embodiment, the processor 104 is a control processor or other processor of the system 100. The processor 104 operates pursuant to and is configured by stored instructions, hardware, and/or firmware to perform various acts described herein. The processor 104 may be configured as a component in the HMD 125 or may, for example, be part of a separate computer system such as a server 145. The processor 104 may implement an object recognition engine that is configured to detect one or more objects in an image. The object recognition engine may be configured as a neural network.

[00104] The acquired image data, labeled image data, networks, network structures, and/or other information are stored in a non-transitory computer readable memory, such as the memory 106. The memory 106 is an external storage device, RAM, ROM, database, and/or a local memory (e.g., solid state drive or hard drive). The same or different non-transitory computer readable media may be used for the instructions and other data. The memory 106 may be implemented using a database management system (DBMS) and residing on a memory, such as a hard disk, RAM, or removable media. Alternatively, the memory 106 is internal to the processor 104 (e.g. cache).

[00105] The memory stores configuration data for the object detection engine, for example, the structure and weights of a neural network. The memory is configured to store multiple sequences for a specific procedure in the semantic data model 135.

[00106] The processor 104 is configured to receive a continuous stream of images from a camera, recorded in first-person perspective. The processor is configured to analyze object types, their locations on the frame, and the associated detection accuracy. The response from the processor is combined with information from the semantic data model 135 stored in memory, which is then used to detect actions using an activity recognition algorithm stored in memory. A structured set of work instructions is generated and stored in the memory for the task. The structured set of work instructions may then be used in learning mode.

[00107] The memory may be configured to store a semantic model. The semantic model includes tasks, procedures, and actions with corresponding subclasses. The semantic model provides detailed information about the object being assembled, such as its functionality or its relationship with other equipment. New classes representing an unknown object may be automatically created in the semantic model and can later be reused for other assembly tasks. Once a worker has performed his or her task, a different worker can use the same setup to learn the task and receive direct feedback based on what the camera and the object detection algorithm detect. The learning worker can either open a certain task verbally, e.g. a light house assembly, or gather objects and have the system load the task based on the recognized objects. The system may detect if objects are missing or do not match the selected task, as the object detection algorithm is ‘watching’ the objects and work steps. Each object that is captured in an image by the camera is then mapped and matched to the previously semantically stored data. In case the existing semantic models provide deeper knowledge about the object being assembled, or its parts, the worker in training can learn not only the sequence of steps to complete the assembly task, but also its use and functionality.

[00108] The instructions for implementing the processes, methods, and/or techniques discussed herein are provided on non-transitory computer-readable storage media or memories, such as a cache, buffer, RAM, removable media, hard drive, or other computer readable storage media (e.g., the memory 106). The instructions are executable by the processor 104 or another processor. Computer readable storage media include various types of volatile and nonvolatile storage media. The functions, acts, or tasks illustrated in the figures or described herein are executed in response to one or more sets of instructions stored in or on computer readable storage media. The functions, acts, or tasks are independent of the particular instruction set, storage media, processor, or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro code, and the like, operating alone or in combination.
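
As a non-limiting illustration, the overall authoring flow described in the preceding paragraphs could be sketched as a single processing loop; the argument objects and their method names stand in for the camera stream, the object recognition engine, the semantic data model 135, and the knowledge instance, and are assumptions for this sketch.

```python
def authoring_pipeline(frame_stream, detector, semantic_model, knowledge_store,
                       belief_threshold=0.6):
    """End-to-end sketch of the authoring flow: images in, work instructions out."""
    for frame in frame_stream:                    # first-person images from the HMD camera
        detections = detector.predict(frame)      # object types, boxes, detection accuracy
        confident = [d for d in detections if d["score"] >= belief_threshold]
        # Combine predictions with the semantic data model to infer actions.
        for subject, action, target in semantic_model.infer_actions(confident):
            # Each inferred step becomes one work instruction for the task.
            knowledge_store.add_step(subject, action, target)
```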

[00109] In one embodiment, the instructions are stored on a removable media device for reading by local or remote systems. In other embodiments, the instructions are stored in a remote location for transfer through a computer network. In yet other embodiments, the instructions are stored within a given computer, CPU, GPU, or system. Because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present embodiments are programmed.

[00110] Various improvements described herein may be used together or separately. Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.